An introduction

Storing lots of different data in a single database column does not get much love in traditional relational database approaches, mostly because it can lead to breaking normalization – and someone learning the hard way that there was a reason for all that normalization in the first place. However, the idea became hot again with the rise in popularity of NoSQL data stores; this, in turn, got toned down a bit after realizing that most of them were not ACID-compliant and fell short in many areas that traditional databases excel at – like replication.

After a few years of hype, followed by bad experiences, it turns out that most developers prefer a fully ACID-compliant database, which also allows them to store arbitrary data structures when needed. This is where Postgres shines with its recent work on complex column types.

SQL is not dead, and never was

Most projects I’ve worked on in the last couple of years were e-commerce shops. If you need your data to be consistent, you will require database transactions. NoSQL advocates would say: keep the transactional data in an ACID database and the rest in NoSQL. This might sound like a solution, but let’s be pragmatic – it will only add complexity to your app, as you now have to take care of two persistence mechanisms instead of one. And please explain to your customers that you basically need to double the infrastructure costs (in all environments) – just because of some hype.

SQL is far from dead. True, it’s a solution with a long history, but it’s certainly not the same thing it was when first standardized in the mid-80s. Postgres is a perfect example of how a mature solution, battle-tested on thousands of running and profitable projects, can incrementally evolve and keep pace with the latest trends and findings in the database field.

The herald of change in Postgres was a new column type called hstore. It was introduced in 2006 with version 8.2¹, so it has been with us for 15 years now. It is definitely stable and mature nowadays. It adds extra flexibility to your database schema – you can think of it as a key-value store in a single column. There are two downsides: all keys and values are strings, plus there is literally no way to add any structure to it. You must stick to a flat structure, but lookups in hstore columns can be indexed for performance.
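A minimal taste of hstore – assuming the extension has been enabled with CREATE EXTENSION hstore:

-- hstore is key-value only: both keys and values are strings
SELECT 'color => red, size => XL'::hstore -> 'color';
-- => red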

However, it has since been surpassed in every way by the pair of JSON and JSONB datatypes.

Which one do I use?

The JSON type is stored internally as plain text (much like a TEXT column). It preserves insignificant whitespace and even duplicated keys. Only basic format validation is performed on saving, making it fast to store – but it can be slower to retrieve, due to parsing. It’s a great solution when you want a bucket to store miscellaneous data which is not strictly validated. But its whitespace- and order-preserving behavior is also useful when you need to calculate checksums, compare JSON data literally, or when the app depends on key ordering for some reason.

Out of the box you get a list of basic operators and functions to create new JSON objects from existing rows, and to traverse and process existing records.
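For instance, row_to_json turns any row into a JSON object – here a throwaway subquery stands in for a real table:

select row_to_json(t) from (select 42 as id, 'hello'::text as greeting) t;
-- => {"id":42,"greeting":"hello"}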

Using JSONB is usually a better idea – in nearly all cases, in fact. Internally, it converts the input into a binary form whenever an INSERT or UPDATE is performed. This may cost you some performance, but the transformation also ensures key uniqueness (by dropping all repeated keys except the last one) and removes insignificant whitespace and indentation. You also gain indexing superpowers and better support for complicated traversal and filtering of records. One thing it can’t do is store strings that contain the NUL byte (\u0000) or invalid Unicode sequences. If you need that, there are other data types for you (such as bytea).
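Both behaviors are easy to observe in psql:

-- Duplicate keys are dropped (last wins), whitespace normalized, keys sorted
select '{"b": 2, "a": 1, "a": 3}'::jsonb;
-- => {"a": 3, "b": 2}

-- NUL bytes are rejected outright
select '"\u0000"'::jsonb;
-- ERROR:  unsupported Unicode escape sequence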

Unlike hstore, JSON/JSONB columns actually do support multiple data types (and arbitrary nesting) – every JSON data type: strings, numbers, booleans, nulls, arrays and objects. Like hstore, JSONB is indexable (where JSON isn’t).

Querying JSON in action

Let’s make a spaced-repetition flashcard application like Anki. While simple applications of this type only allow text content, we want to be more flexible. We’ll be using JSON types for our flashcard content – and we won’t discuss any application code, focusing only on the database.

Migration and models

class CreateFlashcards < ActiveRecord::Migration[5.2]
  def change
    create_table :collections, id: :uuid do |t|
      t.uuid :parent
      t.string :name
      t.timestamps
    end

    create_table :flashcards, id: :uuid do |t|
      # A plain uuid column named "collection" – the raw SQL later relies on this name
      t.uuid :collection
      # Starting from Rails 5, we can use t.json or t.jsonb
      t.column :content, :jsonb
      t.column :answer, :jsonb
    end
  end
end

class Flashcard < ApplicationRecord
  belongs_to :collection, foreign_key: :collection
end

class Collection < ApplicationRecord
  has_many :flashcards
  belongs_to :parent, class_name: "Collection", foreign_key: :parent, optional: true
end

db/migrate/1_create_flashcards.rb, app/models/{flashcard,collection}.rb

To seed it, let’s use this dataset of over 200k questions from the long-running Jeopardy! US TV show. It has had plenty of international adaptations, so it may have run in your country too. In Poland, it ran as Va banque, after the French gambling phrase. Most questions in the dataset have a category key, which we’ll use to create collections. Also, we’ll limit ourselves to only the first 100 questions.

source = JSON.load(Pathname(__dir__) / 'JEOPARDY_QUESTIONS1.json')

source.first(100).map(&:with_indifferent_access).each do |record|
  category = record[:category] || 'UNKNOWN' # NB: `or` binds too loosely for assignment
  coll = Collection.find_or_create_by(name: category)
  Flashcard.create(
    collection: coll,
    content: record.slice(:question, :air_date, :value, :show_number),
    # Make some JSON from plaintext answers
    answer: {text: record[:answer]}
  )
end

db/seeds.rb

After running rails db:setup we can peek into the database:

select * from flashcards limit 1 \gx
-[ RECORD 1 ]---------------------------------------------
id         | 30045b32-10da-4a27-8841-77ed61fd3d0d
collection | 75b6ace2-4e15-44b7-a4a5-fc7935232124
content    | {"value": "$200", "air_date": "2004-12-31",
           | "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 
           | "show_number": "4680"}
answer     | {"text": "Copernicus"}

Now that’s done, let’s see what we can extract from it.

How are the prize values distributed?

select content -> 'value' prize,
       repeat('#', count(*)::integer) freq
from flashcards group by prize;
  prize   |           freq           
----------+--------------------------
 null     | #
 "$2000"  | ####
 "$600"   | ############
 "$1000"  | ##########
 "$400"   | ########################
 "$200"   | ############
 "$1200"  | ######
 "$2,000" | ###
 "$1600"  | #####
 "$800"   | ######################
 "$3,200" | #
(11 rows)

histogram.sql

This is obviously a silly histogram. For it to be readable even with the full dataset, we’d need to normalize by dividing count by the total number of questions, then multiply by a scaling factor to get readable bars.
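One possible shape of such a normalized query – the scaling factor of 40 is an arbitrary choice:

-- Bar length proportional to each group's share of all questions
select content -> 'value' prize,
       repeat('#', (count(*) * 40 / sum(count(*)) over ())::integer) freq
from flashcards
group by prize;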

ActiveRecord allows using this operator in select(), where() and other places, by using a string query fragment:

# Not the best query. AR doesn't make it easy
irb(main):034:0> Flashcard.select("content -> 'value' AS prize")
                          .group_by(&:prize)
                          .transform_values(&:count)
  Flashcard Load (0.9ms)  SELECT content -> 'value' AS prize FROM "flashcards"
=> {"$200"=>12, "$400"=>24, "$600"=>12, "$800"=>22, "$2,000"=>3, "$1000"=>10, "$1200"=>6, "$1600"=>5, "$2000"=>4, "$3,200"=>1, nil=>1}

Other operators and functions mentioned from now on are used similarly.

How many questions from 2010 are there?

select count(*) from flashcards
where (content ->> 'air_date') like '2010%';
 count 
-------
    44
(1 row)

This one uses the double-arrow ->> operator, which extracts a value as text. That matters because LIKE operates on text – there is no LIKE operator for JSON values. With the single arrow, an extracted field is still a JSON-typed value, and may need explicit casts to other types.
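For example, casting the extracted text to a real date gives an equivalent query – a sketch:

select count(*) from flashcards
where (content ->> 'air_date')::date
      between '2010-01-01' and '2010-12-31';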

There are two similar operators – #> and #>> – which also extract values (as JSON and text, respectively), but their right operand is different. They take a path: a text array naming successive keys or indexes in the hierarchy, much like Ruby’s dig.
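A quick illustration on a made-up nested document:

-- #>> digs through objects and arrays along the given path
select '{"tags": ["science", "history"]}'::jsonb #>> '{tags,0}';
-- => science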

Give me all the question text (without answers) from show number 5957

select content -> 'question' from flashcards 
where (content @> '{"show_number": "5957"}'::jsonb);

The @> operator tests for containment. It’s useful when we want to filter on a value but are not interested in receiving the actual content. It checks whether the JSON object or array on its left contains the right one. The rules of containment resemble pattern matching – be sure to check the docs. This query isn’t the only way to perform that operation, of course.

Other tests are <@ (a mirror of @>, which checks whether the right operand contains the left), ?, which checks if a string is a top-level key or an array element, and ?| and ?&, which do the same for an array of strings: the former needs one or more to match, while the latter requires all strings to be present. Think of them like Ruby’s any? and all? predicates.
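In psql these read quite naturally:

select '{"a": 1, "b": 2}'::jsonb ?  'a';              -- true
select '{"a": 1, "b": 2}'::jsonb ?| array['x', 'b'];  -- true: at least one present
select '{"a": 1, "b": 2}'::jsonb ?& array['a', 'b'];  -- true: all present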

Containment tests work only on JSONB, not JSON, and throw an “operator does not exist” error when executed against JSON.

Did that use an index?

Obviously not – we didn’t add any. Let’s do so now:

class AddIndexes < ActiveRecord::Migration[5.2]
  def change
    add_index :flashcards, :content, using: 'GIN'
  end
end

db/migrate/2_add_indexes.rb

We must specify a GIN index, because the default B-tree index type does not support JSON fields. After running rails db:migrate we can see the index working via EXPLAIN. However, the table is very small, and Postgres is clever enough to know that a sequential scan suffices. Let’s disable that first:

-- Don't do this on production
SET enable_seqscan = off;

EXPLAIN select content -> 'question' from flashcards where (content @> '{"show_number": "5957"}'::jsonb);
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flashcards  (cost=12.34..17.00 rows=44 width=32)
   Recheck Cond: (content @> '{"show_number": "5957"}'::jsonb)
   ->  Bitmap Index Scan on index_flashcards_on_content  (cost=0.00..12.33 rows=44 width=0)
         Index Cond: (content @> '{"show_number": "5957"}'::jsonb)
(4 rows)

Remember, JSONB is indexable, whereas JSON is not. Postgres was able to use an index here because we used @> '{"show_number": "5957"}'::jsonb instead of content -> 'show_number' = '5957'. Presence tests (?, ?| and ?&) can also use this index – read more in Postgres’ documentation.

Construct a full object including collection info

select flashcards.content || jsonb_build_object('collection', name)
from flashcards join collections 
on collections.id = flashcards.collection limit 1 \gx
-[ RECORD 1 ]---------------------
?column? | {"value": "$200", "air_date": "2004-12-31",
         | "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 
         | "collection": "HISTORY", "show_number": "4680"}

Here we used || to concatenate JSONB objects. It works like Ruby’s Hash#merge, overwriting keys of the same name with the right operand’s content. To create a JSONB object in a query, we have a number of choices. Here we used jsonb_build_object, which takes alternating keys and values. We could use an input array instead: jsonb_object(ARRAY['collection', name]), or separate key and value arrays: jsonb_object('{collection}', ARRAY[name]). Complementary functions exist for building JSON arrays instead of objects.

To mangle returned objects we can also use jsonb_insert(target, path, new_value), which inserts the value at the specified key, or before the specified index in the case of an array. With a fourth argument of true it will insert after the index instead. Replacing is done with the very similar jsonb_set(). Delete items with the plain minus operator -, which takes a key or an index; its path-based counterpart is #-, analogous to #>.
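A few toy examples, all on literal values so they’re easy to try out:

select jsonb_set('{"a": 1}'::jsonb, '{a}', '2');     -- {"a": 2}
select jsonb_insert('{"a": 1}'::jsonb, '{b}', '3');  -- {"a": 1, "b": 3}
select '{"a": 1, "b": 2}'::jsonb - 'a';              -- {"b": 2}
select '{"a": {"b": 2}}'::jsonb #- '{a,b}';          -- {"a": {}}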

Create a view of questions and answers

CREATE VIEW all_questions AS
SELECT x.*, answer ->> 'text' answer, collections.name collection
FROM flashcards JOIN collections on flashcards.collection = collections.id,
jsonb_to_record(flashcards.content) as x(question text, air_date date, show_number int);

SELECT * FROM all_questions LIMIT 1 \gx
-[ RECORD 1 ]---------------------------------------------------------------------------------------------------
question    | 'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'
air_date    | 2004-12-31
show_number | 4680
answer      | Copernicus
collection  | HISTORY

With jsonb_to_record we can turn a JSONB object into a database row. This is very useful because the resulting columns are genuine Postgres types, not just JSON values. The air_date column is now an actual date column, for example:

SELECT age(air_date) FROM all_questions LIMIT 1 \gx
-[ RECORD 1 ]---------
age | 16 years 16 days

This requires specifying the record type in the AS clause, and as we can see it is very similar to a table column definition list. If we have an array of similar objects stored, jsonb_to_recordset does the conversion wholesale. Functions jsonb_populate_record and jsonb_populate_recordset are similar, but take a rowtype as their first argument.
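A minimal sketch of the rowtype variant – question_row here is a hypothetical type created just for this example:

create type question_row as (question text, air_date date, show_number int);

select * from jsonb_populate_record(
  null::question_row,
  '{"question": "Who am I?", "air_date": "2004-12-31", "show_number": "4680"}'
);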

Beware

Storing lots of complex objects in a JSON array inside a JSONB field is usually a bad idea. While they’re easy to add and query, once you need to modify or reorder them, sadness ensues. This improves with each Postgres release, but you may still be forced into costly operations: converting objects to recordsets in temporary tables, modifying or sorting those, then repacking them into objects.

Pivot that: a key-value table

Some scenarios can use a key-value set instead:

SELECT x.* from flashcards, jsonb_each(content) x LIMIT 10;
     key     |          value
-------------+-----------------------------------
 value       | "$200"
 air_date    | "2004-12-31"
 question    | "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'"
 show_number | "4680"
 value       | "$200"
 air_date    | "2004-12-31"
 question    | "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'"
 show_number | "4680"
 value       | "$200"
 air_date    | "2004-12-31"
(10 rows)

This can be filtered by adding a WHERE clause that tests either the content or the flashcard. When used as a CTE or temporary table, it’s easily modifiable by any SQL query. To turn such a set back into a JSON object we have jsonb_object_agg (and its array-building counterpart, jsonb_agg):

SELECT jsonb_object_agg(a.key, a.value)
FROM 
(SELECT x.* FROM flashcards, jsonb_each(content) x LIMIT 5) a \gx
-[ RECORD 1 ]----------------------------------------------
jsonb_object_agg | {"value": "$200", "air_date": "2004-12-31", 
                 | "question": "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
                 | "show_number": "4680"}

JSON aggregation is like any other: it can additionally sort, filter or enforce uniqueness with DISTINCT. This seems to be a lesser-known corner of SQL.
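For example, collecting each collection’s distinct prize values into a single JSON array:

select collections.name, jsonb_agg(distinct content -> 'value') prizes
from flashcards join collections on collections.id = flashcards.collection
group by collections.name;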

Limits

JSON has the same storage limits as a text column. This lets you store up to about 1 GB (bytes, not characters – mind your encodings) in a single column. JSONB is different: a value is limited to 256 MB minus one byte. That limit, too, is in bytes.

Constraints

So far our content column will accept anything, even if it doesn’t make sense in our database context. Luckily, we have the full power of SQL available and can introduce constraints on JSON content. There is no built-in support for any of the JSON schema formats, although there exists a massive PL/pgSQL function for exactly that.

Adding constraints with ActiveRecord migrations is supported since Rails 6.1. Let’s add them manually this time:

ALTER TABLE flashcards
ADD CONSTRAINT value_in_dollars
CHECK (content ? 'value' AND content ->> 'value' ~ '^\$[0-9,]+$')

This immediately fails with ERROR: check constraint "value_in_dollars" of relation "flashcards" is violated by some row, so we delete the culprits and try again. The condition uses Postgres’ regular expression match operator ~ – alternatively we could use SIMILAR TO, in which case the condition would be SIMILAR TO '$[[:digit:],]+'.
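As for deleting the culprits, one way is the following – IS NOT TRUE also catches rows where the match evaluates to NULL rather than false:

DELETE FROM flashcards
WHERE (content ? 'value' AND content ->> 'value' ~ '^\$[0-9,]+$') IS NOT TRUE;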

Now, inserting random values will fail:

INSERT INTO flashcards (content)
VALUES ('{"hello": "world!"}')
ERROR:  new row for relation "flashcards" violates check constraint "value_in_dollars"
DETAIL:  Failing row contains (786f2c20-00d8-4196-a6ca-9f66af61efbf, null, {"hello": "world!"}, null).

We had to add two conditions here: key presence and pattern matching. This is because when value is absent, the regex match condition evaluates to NULL, not false².

Closing remarks

The availability of complex column types in Postgres should not be an excuse to use it as a single-column document database. Most everyday RDBMS features (SQL, indices) work best with a “traditional”, relational column layout. But, as shown above, there are scenarios where the flexibility of complex columns outweighs the costs of such an approach, and given Postgres’ first-class support for them, we need not be afraid of using them. Identifying whether yours is one of those scenarios – well, that is an engineer’s (or a database architect’s) job.

  1. Hstore did land in Postgres contrib in 2006. However, it actually dates back to 2003, which makes it older than JSON

  2. Another amusing gotcha of tri-valued SQL booleans.