Doing dumb things with Postgres

A couple of months ago I gave a talk about PostgreSQL, and specifically using the Array datatype, at London Ruby User Group and I wanted to take some time to dive into this in more detail.

I’ve been fascinated with the Array datatype from Postgres and have been using it in Rails since it became usable in 2014. At the time I worked for a company that used Postgres for all of the new projects and several came up where it fitted in nicely with the work we were doing. Based on this I decided to give it a go and see what benefits we could gain and also try to work out any major pitfalls.

The array datatype works exactly as expected, instead of storing a string or integer it instead allows you to save arrays of these types. We used these for all sorts of things, from storing article tags and CSS classes to storing multiple associations without the use of a joining table. The latter usage is the area I became most involved and interested in. Given the product we worked on would be mainly used in a read heavy manner, the additional overhead of joining tables could occassionally cause bottlenecks, especially on more complicated projects. By leveraging arrays, Rails caching and some code that I’ve come to regret, we were able to mitigate these.

Example:

You have a series of news articles that can have associated tags that we use to find other articles, normally you would do this by implementing this:

Articles Table -> Taggings joining table -> Tags Table

Using the array datatype you could instead have:

Articles Table

Doing this provides us with a couple of advantages: with caching this can be noticeably faster than joining the 3 separate tables and provides us a clean data-structure that we don’t have to manipulate or pluck from to get the thing we care about, in this case the tags. It also allows us to easily execute some very basic SQL to find articles that have the same tags as the current one being shown to the site user.

If you’ve been using Rails/Postgres over the past couple of years this might be very run of the mill so lets dive into some of the madness you can cause.

Example:

A customer already has a series of one to one relations like Products to Product Categories, but they now want this to be a many to many relation. They’ve already input lots of data and if possible they want to keep this. The obvious solution would be to create a joining table, iterate over the existing products and create entries for each of the associations and then drop the columns. I instead decided to do something a bit more out there. What if we could convert this integer column over to an array column while keeping all the existing data and not have to iterate over everything that’s already there?

To do this you’ll need a really weird migration like this:

There’s a lot going on in this migration so I’ll break it down a bit.

This line takes our existing foreign key column and converts it to an Array type, the SQL being executed recasts the existing data in the correct format so 4 would become {4}. The curly braces are how Postgres denotes an array, when returned by ActiveRecord the data would be represented as a standard array.

This is clearly an index, the important part however is that it’s using “GIN” (full-text indexing) as its index type. GIN is lossless versus GiST, the other index type available, which is lossy, introducing the chance for incorrect results slightly. In practice it probably won’t make a huge amount of difference to you which one you use but if you want to read more on these types I’d recommend the Postgres documentation. One thing to watch out for is that GIN indexes can have long build times but there are ways to mitigate that.

It’s interesting to note that by choosing not to sort this array and restore it would allow the potential of using SELECT unnest(arr[1:array_length(arr, 1)][1]) as id from data or similar to get the first item to create a reversible migration. This is only advisable as a thought exercise and I’d recommend you never do that.

The first snag we’ll hit here is that we no longer have support for any of the Rails niceties around associations like has_many. You can solve this by using code similar to the example above:

This will return all the Product Categories that have been saved in the Products product_category_ids column in the same way as has_many would. One of the biggest issues I have with using arrays like this is the amount of boiler plate code you end up writing due to losing the ActiveRecord association behaviours, but like I said sometimes the speed benefits on a read heavy site outweigh this. If you use the collection based form helpers you’ll also have to flatten or discard empty arrays on a save since they send an empty array item each time.

There’s also the concern that when this is shown to the end user it can introduce a lot of delay to loading if there are lots of associations. The main way I’ve found around this was to update a cache in Redis or similar every time a Product is updated. You can also make this even less blocking if you pass it off to a queuing system rather than blocking the update/create actions.

I’d also like to throw out an honourable mention to the PostgresExt gem for providing native support for querying the array datatype (and others) natively via Arel.

Overall I think there are some amazing things you can do with PostgreSQL’s array datatype in association with Rails but it’s a very sharp and dangerous knife at times and truly requires you to be aware of what you’re trying to do at all times.

Tweet LinkedIn Facebook