An introduction
Storing lots of different data in a single database column does not get much love in traditional relational database approaches, mostly because it breaks normalization – and someone usually learns the hard way that there was a reason for all that normalization in the first place. However, the idea became hot again with the rise in popularity of NoSQL data stores; this, in turn, got toned down a bit once people realized that most of them were not ACID-compliant and fell short in many areas where traditional databases excel – like replication.
After a few years of hype, followed by bad experiences, it turns out that most developers prefer a fully ACID-compliant database, which also allows them to store arbitrary data structures when needed. This is where Postgres shines with its recent work on complex column types.
SQL is not dead, and never was
Most projects I’ve worked on in the last couple of years were e-commerce shops. If you need your data to be consistent, you will require database transactions. NoSQL advocates would say: keep transactional data in an ACID database and the rest in NoSQL. This might sound like a solution, but let’s be pragmatic – it only adds complexity to your app, as you now have to take care of two persistence mechanisms instead of one. And try explaining to your customers that you basically need to double the infrastructure costs (across all environments) – just because of some hype.
SQL is far from dead. It is a solution with a long history, but it is certainly not the same thing it was when conceived back in the 1970s. Postgres is a perfect example of how a mature solution, battle-tested in thousands of production projects, can evolve incrementally and keep up with the latest trends and findings in the database field.
The herald of change in Postgres was a new column type called hstore. It was introduced in 2006 with version 8.2, so it has been with us for 15 years now, and it is definitely stable and mature. It adds extra flexibility to your database schema – you can think of it as a key-value store in a single column. There are two downsides: all keys and values are strings, and there is literally no way to add any structure to it – you must stick to a flat layout. On the upside, lookups in hstore columns can be indexed for performance.
However, it has since been surpassed in every way by the pair of JSON and JSONB datatypes.
Which one do I use?
The JSON type is stored internally as a TEXT BLOB. It preserves insignificant whitespace and even duplicated keys. Only basic format validation is performed on saving, making it fast to store – but it can be slower on retrieval, due to parsing. It’s a great solution when you want to use it as a bucket to store miscellaneous data, which is not strictly validated. But its whitespace- and order-preserving behavior is also useful when you need to calculate checksums, compare JSON data literally, or when the app depends on key ordering for some reason.
Out of the box you get a list of basic operators and functions to create new JSON objects from existing rows, and to traverse and process existing records.
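A couple of throwaway examples (on literals, not our app's schema) give a flavour of what's available:

```sql
-- Build JSON from a row, and traverse an existing value with the arrow operators:
SELECT row_to_json(t) FROM (SELECT 1 AS id, 'hello' AS word) AS t;  -- {"id":1,"word":"hello"}
SELECT '{"tags": ["a", "b"]}'::json -> 'tags' ->> 1;                -- b
```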
Using JSONB is usually a better idea, in nearly all cases. Internally it converts the input into a valid binary form whenever an INSERT or UPDATE is performed. This may cost you some performance on writes, but the transformation also ensures key uniqueness (by dropping all repeating keys except the last one) and removes insignificant whitespace and indentation. You also gain new superpowers for indexing and better support for complex traversal and filtering of records. One thing it can’t do is store strings that contain the NUL byte (\u0000) or invalid Unicode sequences. If you need that, there are other data types for you (such as bytea).
Unlike hstore, JSON/JSONB columns support multiple data types (and arbitrary nesting) – the full set of JSON data types. Like hstore, JSONB is indexable (where plain JSON isn’t).
Querying JSON in action
Let’s make a spaced-repetition flashcard application like Anki. While simple applications of this type only allow text content, we want to be more flexible. We’ll be using JSON types for our flashcard content – and we won’t discuss any application code, focusing only on the database.
Migration and models
db/migrations/1.rb, app/models/{flashcard,collection}.rb
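The article's exact code isn't reproduced here, but a minimal sketch of these files could look roughly like this (names and options are assumptions; the important part is the jsonb content column):

```ruby
# db/migrations/1.rb – a sketch of the schema
class CreateFlashcards < ActiveRecord::Migration[6.1]
  def change
    create_table :collections do |t|
      t.string :name, null: false
      t.timestamps
    end

    create_table :flashcards do |t|
      t.references :collection, foreign_key: true
      t.jsonb :content, null: false, default: {}
      t.timestamps
    end
  end
end

# app/models/collection.rb
class Collection < ApplicationRecord
  has_many :flashcards
end

# app/models/flashcard.rb
class Flashcard < ApplicationRecord
  belongs_to :collection
end
```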
To seed it, let’s use this dataset of over 200k questions from the long-running Jeopardy! US TV show. It has had plenty of international adaptations, so it may have run in your country too. In Poland, it ran as Va banque, after the French gambling phrase. Most questions in the dataset have a category key, which we’ll use to create collections. Also, we’ll limit ourselves to just the first 100 questions.
db/seeds.rb
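A sketch of the seeding code, assuming the dataset has been downloaded to db/jeopardy.json (a hypothetical path) as an array of question objects:

```ruby
# db/seeds.rb – load the first 100 questions and group them into collections by category.
require "json"

questions = JSON.parse(File.read("db/jeopardy.json")).first(100)

questions.each do |question|
  collection = Collection.find_or_create_by!(name: question["category"])
  collection.flashcards.create!(content: question)
end
```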
After running rails db:setup we can peek into the database:
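One way to do that from psql (output will vary; jsonb_pretty just makes the column readable):

```sql
SELECT id, collection_id, jsonb_pretty(content)
FROM flashcards
LIMIT 1;
```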
Now that’s done, let’s see what we can extract from it.
How are the prize values distributed?
histogram.sql
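A sketch of what such a query could look like – grouping by the prize value extracted with the -> operator and drawing a bar of asterisks per bucket (the article's original query may differ):

```sql
-- histogram.sql (a sketch)
SELECT content -> 'value'          AS value,
       count(*)                    AS questions,
       repeat('*', count(*)::int)  AS bar
FROM flashcards
GROUP BY content -> 'value'
ORDER BY count(*) DESC;
```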
This is obviously a silly histogram. For it to stay readable even with the full dataset, we’d need to normalize it by dividing each count by the total number of questions, then multiplying by a scaling factor to get readable bars.
ActiveRecord allows using this operator in select(), where() and other places, by passing a string query fragment:
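For example (column and key names follow the schema sketched above; the category value is just an illustration):

```ruby
Flashcard.select("content -> 'value' AS value")
Flashcard.where("content ->> 'category' = ?", "HISTORY")
```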
Other operators and functions mentioned from now on are used similarly.
How many questions from 2010 are there?
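One way to write this query, assuming air_date is stored as a "YYYY-MM-DD" string in the source data:

```sql
SELECT count(*)
FROM flashcards
WHERE content ->> 'air_date' LIKE '2010%';
```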
This one uses the double arrow ->> operator, which extracts a value as text. We need it because LIKE doesn’t support JSON columns (and doesn’t throw an error either, it just returns an empty set). With the single arrow, an extracted field is still a JSON-typed value, and may need explicit casts to other types.
There are two similar operators – #> and #>> – which also extract values (as JSON and text, respectively), but their right operand is different: they take a path, a text array naming successive keys or indexes in the hierarchy. They are similar to Ruby’s dig, except that they also work on arrays.
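For example, on a throwaway literal:

```sql
SELECT '{"a": {"b": ["x", "y"]}}'::jsonb #>> '{a,b,1}';  -- y
```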
Give me all the question text (without answers) from show number 5957
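A sketch of such a query, filtering with containment and returning only the question text:

```sql
SELECT content -> 'question' AS question
FROM flashcards
WHERE content @> '{"show_number": "5957"}'::jsonb;
```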
The @> operator tests for containment. It’s useful when we want to filter on a value but are not interested in receiving the actual content. It checks whether the JSON object or array on its left contains the right operand. The rules of containment are similar to the concept of pattern matching – be sure to check the docs. This query isn’t the only way to perform that operation, of course.
Other tests are <@ (a mirror version of @>, which checks whether its right operand contains the left), ? which checks if a string is a top-level key or an array element, and ?| and ?& which do the same for an array of strings. The latter requires all the strings to be present, while the former needs one or more to match. Think of them like Ruby’s any? and all? predicates.
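A few examples of these operators on our content column (the 'prize' key is made up for illustration):

```sql
SELECT count(*) FROM flashcards WHERE content ? 'value';                       -- has a "value" key
SELECT count(*) FROM flashcards WHERE content ?& ARRAY['question', 'answer'];  -- has both keys
SELECT count(*) FROM flashcards WHERE content ?| ARRAY['value', 'prize'];      -- has at least one of them
```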
These containment and existence tests work only on JSONB, not JSON, and throw an “operator does not exist” error when executed against a JSON column.
Did that use an index?
Obviously not, since we haven’t added any. Let’s do so now:
db/migrations/2_add_indexes.rb
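A sketch of the migration; Rails lets us request a GIN index with the :using option:

```ruby
class AddIndexes < ActiveRecord::Migration[6.1]
  def change
    add_index :flashcards, :content, using: :gin
  end
end
```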
We must specify a GIN index, because the default B-tree index does not support JSONB fields. After running rails db:migrate we can see the index working via EXPLAIN. However, the table is very small, and Postgres is clever enough to know that a sequential scan suffices – so let’s ban that first:
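One way to do that is a per-session planner setting, useful for demonstrations like this one:

```sql
SET enable_seqscan = off;

EXPLAIN SELECT content -> 'question'
FROM flashcards
WHERE content @> '{"show_number": "5957"}'::jsonb;
```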
Remember, JSONB is indexable, whereas JSON is not. Postgres was able to use an index here because we used @> '{"show_number": "5957"}'::jsonb instead of content -> 'show_number' = '5957'. The presence tests (?, ?| and ?&) can also use this index – read more in Postgres’ documentation.
Construct a full object including collection info
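A sketch of the query described below – appending the collection name to each flashcard’s content:

```sql
SELECT f.content || jsonb_build_object('collection', c.name) AS full_object
FROM flashcards f
JOIN collections c ON c.id = f.collection_id;
```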
Here we used || to concatenate JSONB objects. It works like Ruby’s Hash#merge and overwrites keys of the same name with the right operand’s content. To create a JSONB object in a query, we have a number of choices. Here we used jsonb_build_object, which takes alternating keys and values. We could use an input array instead: jsonb_object(ARRAY['collection', name]), or separate key and value arrays: jsonb_object('{collection}', ARRAY[name]). Complementary functions exist for building JSON arrays instead of objects.
To mangle returned objects we can also use jsonb_insert(target, path, new_value), which inserts the value at the specified key, or before the specified index in the case of an array. With a fourth argument of true it inserts after that index instead. Replacing is done with the very similar jsonb_set(). Items can be deleted with the plain minus - operator, which takes a key or an index; to delete at a path (like the one #> takes), use #-.
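A few throwaway examples of these, with results shown in the comments:

```sql
SELECT jsonb_insert('{"a": 1}', '{b}', '2');       -- {"a": 1, "b": 2}
SELECT jsonb_insert('["x", "z"]', '{1}', '"y"');   -- inserts before index 1: ["x", "y", "z"]
SELECT jsonb_set('{"a": 1}', '{a}', '42');         -- {"a": 42}
SELECT '{"a": 1, "b": 2}'::jsonb - 'b';            -- {"a": 1}
SELECT '["x", "y", "z"]'::jsonb #- '{1}';          -- ["x", "z"]
```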
Create a view of questions and answers
With jsonb_to_record we can turn a JSONB object into a database row. This can be very useful, because the resulting columns get real Postgres types, not just JSON types. The air_date column becomes an actual date column, for example:
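A sketch of such a view (the column list in the AS clause is an assumption based on the dataset’s keys):

```sql
CREATE VIEW questions_and_answers AS
SELECT f.id, r.*
FROM flashcards f,
     jsonb_to_record(f.content) AS r(question text, answer text, air_date date);

-- air_date now behaves like any other date column:
SELECT question FROM questions_and_answers WHERE air_date >= '2010-01-01';
```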
This requires specifying the record type in the AS clause, which, as we can see, is very similar to a table column definition list. If we have an array of similar objects stored, jsonb_to_recordset does the conversion wholesale. The functions jsonb_populate_record and jsonb_populate_recordset are similar, but take a rowtype as their first argument.
Beware
Storing lots of complex objects in a JSON array inside a JSONB field is usually a bad idea. While they’re easy to add and query, once you need to modify or reorder them sadness ensues. This improves with each Postgres release, but you may be forced into costly operations involving converting objects to recordsets in temporary tables, modifying or sorting these, then repacking into objects.
Pivot that: a key-value table
Some scenarios can use a key-value set instead:
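A sketch of the pivot, using jsonb_each to emit one row per key/value pair of every flashcard:

```sql
SELECT f.id AS flashcard, kv.key, kv.value
FROM flashcards f,
     jsonb_each(f.content) AS kv(key, value)
ORDER BY flashcard, kv.key;
```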
This can be filtered by adding a WHERE clause that tests either content or flashcard. When used as a CTE or temporary table, it’s easily modifiable by any SQL query. To turn such a set back into a JSON object we have jsonb_object_agg (and its array-building counterpart jsonb_agg):
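For example, a sketch that reassembles the key/value rows back into one object per flashcard:

```sql
SELECT f.id, jsonb_object_agg(kv.key, kv.value) AS content
FROM flashcards f,
     jsonb_each(f.content) AS kv(key, value)
GROUP BY f.id;
```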
JSON aggregation is like any other: it can additionally sort, filter or enforce uniqueness with DISTINCT. This seems to be a lesser-known corner of SQL.
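For instance, a sorted, de-duplicated JSON array of all categories (a sketch):

```sql
SELECT jsonb_agg(DISTINCT content ->> 'category' ORDER BY content ->> 'category')
FROM flashcards;
```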
Limits
JSON has the same storage limits as a text column. This lets you store up to about 1 GB (bytes, not characters – mind your encodings) in a single column. JSONB is different: a column is limited to 256 MB, minus one byte, in length. This is also the size in bytes – and it’s always UTF-8.
Constraints
So far our content column will accept anything, even if it doesn’t make sense in our database context. Luckily, we have the full power of SQL available, and can introduce constraints on JSON content. There is no built-in support for any of the JSON schema formats, although there exists a massive PL/pgSQL function for exactly that.
Adding check constraints with ActiveRecord migrations is supported since Rails 6.1. Let’s add them manually this time:
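A sketch of such a constraint; the article’s exact conditions may differ, but it combines a key-presence test with a regex match on the dollar value:

```sql
ALTER TABLE flashcards
  ADD CONSTRAINT value_in_dollars
  CHECK (content ? 'value' AND content ->> 'value' ~ '^\$[[:digit:],]+$');
```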
This immediately fails with ERROR: check constraint "value_in_dollars" of relation "flashcards" is violated by some row, so we delete the culprits and try again.
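One way to delete the culprits – rows with a missing, null or non-dollar value:

```sql
DELETE FROM flashcards
WHERE content ->> 'value' IS NULL
   OR content ->> 'value' !~ '^\$[[:digit:],]+$';
```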
This condition uses Postgres’ regular expression match operator ~ – alternatively we could use SIMILAR TO instead, in which case the condition would be SIMILAR TO '$[[:digit:],]+'.
Now, inserting random values will fail:
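For example (a sketch; the timestamps are included because Rails marks them NOT NULL):

```sql
INSERT INTO flashcards (content, created_at, updated_at)
VALUES ('{"value": "not a dollar amount"}', now(), now());
-- fails: new row violates check constraint "value_in_dollars"
```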
We had to add two conditions here: a presence check and the pattern match. This is because when value is absent, the regex match condition evaluates to NULL, not false – and a CHECK constraint only rejects rows for which the condition is false.
Closing remarks
The availability of complex column types in Postgres should not be an excuse to use it as a single-column document database. Most everyday RDBMS features (SQL, indexes) work best with the “traditional”, relational column approach. But, as shown above, there are scenarios where the flexibility of complex columns outweighs the costs – and given Postgres’ first-class support for them, we need not be afraid of using them. Identifying whether yours is one of these scenarios – well, that is an engineer’s (or a database architect’s) job.