From what I have experienced in various Ruby on Rails projects backed by relational databases, developers often treat the database as a necessary evil. In many cases, especially in small apps, that attitude is harmless, but it tends to enforce suboptimal solutions. If that sounds like your project, or you are just not sure, chances are this post is for you! I will be focusing on PostgreSQL, but most of the content applies to nearly any relational database engine.

SQL is inevitable (sometimes)

The tagline of some SQL-related tools is “you don’t need to know SQL thanks to our technology”. It bugs me every time I see it, because what it really means is this: “You don’t have to know SQL unless you want to understand what your code actually does”. And you do want to understand your code, right?

While for a Rails developer it might be easier or more familiar to interact with the database through a DSL, it still comes down to raw SQL when they want to understand what is actually happening. Tricking yourself into thinking that you write Ruby code doesn’t change the fact that it only builds SQL underneath. For anything beyond the basics, it pays to know what happens behind the Ruby code.

Trivial example (kind of)

Take a simple example:

class Account < ApplicationRecord
  def withdraw(amount)
    update_attribute(:balance, self.reload.balance - amount)
  end
end

This code looks innocent, and it will even work most of the time, but it has a significant vulnerability: nondeterministic results when sessions run concurrently. Imagine that this method runs at the same time on two different servers, with the balance equal to 100:

# server 1
account = Account.find(1)
account.withdraw(20)

# server 2
account = Account.find(1)
account.withdraw(30)

While the Ruby code in the withdraw method is a one-liner, translated to SQL it executes two queries. In an edge case in a concurrent environment, the order of the SQL statements might be the following:

SELECT * FROM accounts WHERE accounts.id = 1; -- for reload
SELECT * FROM accounts WHERE accounts.id = 1; -- for reload
-- balance is calculated in ruby code as:
-- 100 - 20 = 80
-- 100 - 30 = 70
UPDATE accounts SET balance = 80 WHERE id = 1; -- for update_attribute
UPDATE accounts SET balance = 70 WHERE id = 1; -- for update_attribute
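The edge case above does not even need a database to reproduce. Stripped down to plain Ruby, the same read-modify-write interleaving looks like this (a sketch, with a local variable standing in for the accounts row):

```ruby
# A database-free model of the race: two "servers" each read the balance,
# compute the new value in application code, then write it back. The
# interleaving below (read, read, write, write) is the edge case from the
# SQL log above.
balance = 100            # shared state, standing in for the accounts row

read1 = balance          # server 1: SELECT (reload) -> 100
read2 = balance          # server 2: SELECT (reload) -> 100
balance = read1 - 20     # server 1: UPDATE with 100 - 20
balance = read2 - 30     # server 2: UPDATE with 100 - 30

puts balance             # 70 -- server 1's withdrawal is lost; 50 is correct
```

Server 1's write is silently overwritten, exactly like the second UPDATE in the SQL log.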

Of course, this is wrong: the balance should equal 50, while in the above example it will be 80 or 70, and you can’t even be sure which. You may have thought about adding a transaction block to fix the problem:

transaction do
  update_attribute(:balance, self.reload.balance - amount)
end

Bad news: it won’t fix the initial problem (at least not in this exact form). Don’t feel too upset if you guessed wrong; you are not alone. PostgreSQL has four transaction isolation levels, and only the SERIALIZABLE level guarantees data integrity in the case presented above:

transaction(isolation: :serializable) do
  update_attribute(:balance, self.reload.balance - amount)
end
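One practical consequence worth knowing: under SERIALIZABLE, PostgreSQL aborts one of the conflicting transactions instead of corrupting data, so the application is expected to retry. Here is a minimal retry helper sketch in plain Ruby; `SerializationFailure` is a stand-in class for the error your adapter raises (with the PostgreSQL adapter it is `ActiveRecord::SerializationFailure`):

```ruby
# SERIALIZABLE transactions abort with a serialization failure when
# PostgreSQL detects a conflict; the application is expected to retry.
# SerializationFailure below is a stand-in for the adapter's error class.
class SerializationFailure < StandardError; end

def with_serializable_retry(attempts: 3)
  tries = 0
  begin
    yield
  rescue SerializationFailure
    tries += 1
    retry if tries < attempts
    raise
  end
end
```

A real call would wrap the transaction itself, e.g. `with_serializable_retry { Account.transaction(isolation: :serializable) { account.withdraw(20) } }` (the Rails calls here are illustrative).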

There is a simpler way to write this code correctly: use a single SQL query that reads the data and updates the record at once.

Account.where(id: 1).update_all(["balance = balance - ?", amount])

As a single query is an atomic operation, it works well regardless of the number of concurrent calls:

UPDATE accounts SET balance = balance - 20 WHERE id = 1;
UPDATE accounts SET balance = balance - 30 WHERE id = 1;
-- final balance: 50, regardless of the order

This problem is well known and described in detail, among other places, in a 2ndQuadrant blog post; you can find more solutions to it there.

What I am trying to say is that you should make friends with your database server (unless for some weird reason you managed to connect your Rails app to MS Access; nobody wants to be friends with MS Access, ask http://howfuckedismydatabase.com/). No matter how great your ORM is (and Active Record is outstanding), it often comes down to what is beneath. Sometimes you need to think in SQL and treat Ruby as a tool to build and execute it.

More complex example

I’ve seen and written Ruby code that should have been written in pure SQL too many times by now. The most vivid example I can think of may not even be considered complex by some fellow devs; (spoiler alert) it takes just a few lines in Ruby on Rails (and one migration) to implement. But stay with me, it is a bit deeper than it appears.

Imagine that you are writing a news platform where people can upvote and downvote articles. You need to track additional data, like the time or the user, for each vote, so you create a Vote model with a type column and two child classes: Upvote and Downvote. It would be very inefficient to aggregate vote counts while retrieving articles from the database, so it is a rather natural thing to use a counter cache callback:

class Downvote < Vote
  belongs_to :article, counter_cache: true
end

class Upvote < Vote
  belongs_to :article, counter_cache: true
end

The only things you need to provide to make this work are two columns on the articles table, upvotes_count and downvotes_count. Simple, but there is one catch: data integrity is not guaranteed. Judging by the number of “thank you” comments on a blog post about counter cache inconsistencies, it can be a notable problem. As described in the linked article, the counter cache is a callback, and for some operations it may not fire. People have come up with code that resets the counter columns based on the real database state once in a while.
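That periodic reset is conceptually just a recount of the vote rows for each article (in Rails you would typically reach for `reset_counters`). A plain-Ruby sketch of the recount, with in-memory rows standing in for the votes table:

```ruby
# What a counter reset boils down to: recompute each cached column from the
# actual vote rows. The array below stands in for the votes table.
votes = [
  { article_id: 1, type: "Upvote"   },
  { article_id: 1, type: "Upvote"   },
  { article_id: 1, type: "Downvote" },
  { article_id: 2, type: "Downvote" },
]

def recount(votes)
  votes.group_by { |v| v[:article_id] }.transform_values do |rows|
    {
      upvotes_count:   rows.count { |v| v[:type] == "Upvote" },
      downvotes_count: rows.count { |v| v[:type] == "Downvote" },
    }
  end
end

recount(votes)
# => {1=>{:upvotes_count=>2, :downvotes_count=>1},
#     2=>{:upvotes_count=>0, :downvotes_count=>1}}
```

In the real app this recount happens in SQL, of course; the sketch only shows why the reset is safe to run at any time: it derives the counters from the source of truth.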

While for some projects small inconsistencies are perfectly fine, this is an ideal case for using triggers. Triggers are functions executed on certain database operations; the mechanism is very similar to callbacks in Ruby on Rails, just at the database level. Frankly, I don’t care whether I use delete, destroy, delete_all, update, update_attribute or even a custom query - I just want consistency between the database state and the cache. For some of the methods listed above callbacks are called; for others (e.g. delete, delete_all, update_column) they are not; a trigger always fires.

The trigger code replacing the counter cache declaration:

CREATE OR REPLACE FUNCTION update_article_votes_count_trigger()
  RETURNS TRIGGER AS $$
DECLARE
  change     INT := 1;
  is_upvote  BOOLEAN := NULL;
  article_id INT := NULL;
BEGIN

  /* The variable that points to the record being modified depends
     on the type of the trigger, so a CASE statement is needed to
     prepare the data. */
  CASE TG_OP
    WHEN 'INSERT'
    THEN
      is_upvote = (NEW.type = 'Upvote');
      article_id = NEW.article_id;
    WHEN 'DELETE'
    THEN
      is_upvote = (OLD.type = 'Upvote');
      article_id = OLD.article_id;
      change = -1;
  END CASE;

  IF is_upvote
  THEN
    UPDATE articles
    SET upvotes_count = upvotes_count + change
    WHERE id = article_id;
  ELSE
    UPDATE articles
    SET downvotes_count = downvotes_count + change
    WHERE id = article_id;
  END IF;

  RETURN NULL; -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER update_article_votes_count
AFTER INSERT OR DELETE
  ON votes
FOR EACH ROW
EXECUTE PROCEDURE update_article_votes_count_trigger();

Some parts of this code might look like gibberish to you, but when you focus on the body of the function, it is rather straightforward, with small exceptions:

  • NEW represents a record after the operation;
  • OLD represents a record before the operation;
  • TG_OP represents the type of the operation.

Because OLD is not accessible in an INSERT operation, and NEW is not accessible in a DELETE operation, gathering all of the puzzle pieces is a bit convoluted, but it is just setup for the core.

The function and variable declarations may not be the prettiest code you will ever see, but I think they are rather easy to read: in PL/pgSQL you have to declare the variables used by the function as well as its return type. In this case the function returns TRIGGER, because it will be used to set a trigger on certain actions.

The most important part is:

  IF is_upvote THEN
    UPDATE articles SET upvotes_count = upvotes_count + change WHERE id = article_id;
  ELSE
    UPDATE articles SET downvotes_count = downvotes_count + change WHERE id = article_id;
  END IF;

All in all, it is just an IF statement that executes an SQL UPDATE. Yes, it is clunkier than Ruby code, and there is a bit of a learning curve involved, but it is not that bad.

The last part of the code is the CREATE TRIGGER clause, which makes the function update_article_votes_count_trigger execute on INSERT and DELETE operations, for each inserted or deleted row.

Summing up, this solution may not be as straightforward as the Ruby code, but it provides data integrity. No matter how a record is added to or removed from the database, the counters stay in sync. Unlike callbacks, which might run or be omitted depending on how Active Record is used, you can be confident that it works correctly.

DRY up your SQL code

If I haven’t convinced you to dive into SQL by now, I will try one more use case. Let’s continue with the previous example. It would be a pretty weak app if you couldn’t sort articles by rating. If you haven’t read the excellent blog post How Not To Sort By Average Rating, you are likely to write something like this:

Item.order("upvotes_count - downvotes_count DESC")

Such code will work (and there are popular sites which use this kind of measure), but it is not perfect. Imagine an article with 600 positive and 400 negative votes (score: 200, 60% positive) and another with 5500 positive and 4500 negative votes (score: 1000, 55% positive). The article with the lower share of positive votes ranking higher is probably not what you want. To be more precise, you should use the “lower bound of Wilson score confidence interval for a Bernoulli parameter” (you will find more information in the linked article). I will allow myself to borrow the SQL code:

SELECT id,
       (
         (upvotes_count + 1.9208) / (upvotes_count + downvotes_count) - 1.96 * Sqrt(
           (upvotes_count::numeric * downvotes_count) / (upvotes_count + downvotes_count) + 0.9604
         ) / (
           upvotes_count + downvotes_count
         )
       ) / (
         1 + 3.8416 / ( upvotes_count + downvotes_count )
       ) AS ci_lower_bound
FROM   items
WHERE  upvotes_count + downvotes_count > 0
ORDER  BY ci_lower_bound DESC;

Yikes, this looks complex… I tried to format the code as best I could, but it is what it is. Now, let’s say that I put it into a scope in Rails:

scope :by_rating, -> { order(<<-SQL) }
  (
   (upvotes_count + 1.9208) / (upvotes_count + downvotes_count) - 1.96 * Sqrt(
     (upvotes_count::numeric * downvotes_count) / (upvotes_count + downvotes_count) + 0.9604
   ) / (
     upvotes_count + downvotes_count
   )
  ) / (
   1 + 3.8416 / (upvotes_count + downvotes_count)
  ) DESC
SQL

There are a few problems with that approach, though:

  • it cannot be used for purposes other than ordering;
  • it makes the logs much harder to read;
  • it cannot easily be tested in isolation.

Options for reusing this code in Rails are very limited. The only thing I can think of is to create a method that returns the string, but that is far from ideal, as any syntax problems would be very obscure. What if this code could be written as:

scope :by_rating, -> { order("ci_lower_bound(upvotes_count, downvotes_count) DESC") }

Much better! Now you can reuse this function in any other query. Using stored procedures is a great way to DRY up your SQL code, to simplify it, and in some cases to make it more efficient. Imagine debugging a query containing a simple function call compared to one with the complex equation spelled out. It makes a difference.

The code that defines the ci_lower_bound stored procedure:

CREATE OR REPLACE FUNCTION ci_lower_bound(upvotes_count int, downvotes_count int)
  RETURNS FLOAT AS $$
BEGIN
  RETURN (
   (upvotes_count + 1.9208) / (upvotes_count + downvotes_count) - 1.96 * Sqrt(
     (upvotes_count::numeric * downvotes_count) / (upvotes_count + downvotes_count) + 0.9604
   ) / (
     upvotes_count + downvotes_count
   )
  ) / (
   1 + 3.8416 / ( upvotes_count + downvotes_count )
  );
END;
$$ LANGUAGE plpgsql;

That was actually quite easy, wasn’t it? It has at least a few advantages over the Ruby code:

  • it can be used in any part of the query,
  • it makes your logs much easier to read,
  • it can be tested without using actual records,
  • it can be easily used in an index.
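The testability point is easy to demonstrate: the formula ports to a few lines of plain Ruby, so you can sanity-check it against the example articles from earlier without touching any records (a sketch mirroring the stored procedure, not Rails code):

```ruby
# Lower bound of the Wilson score interval at 95% confidence (z = 1.96),
# mirroring the ci_lower_bound stored procedure term by term.
def ci_lower_bound(upvotes, downvotes)
  n = (upvotes + downvotes).to_f
  return 0.0 if n.zero?
  ((upvotes + 1.9208) / n -
    1.96 * Math.sqrt(upvotes * downvotes / n + 0.9604) / n) /
    (1 + 3.8416 / n)
end

ci_lower_bound(600, 400)    # ~0.569 -- 60% positive, ranks first
ci_lower_bound(5500, 4500)  # ~0.540 -- 55% positive, ranks below it
```

The article with 60% positive votes does indeed come out ahead, which is exactly the behavior the naive score-difference ordering got wrong.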

In more complex examples, the advantages become even more visible. Even a simple median would be very hard to write as part of a plain SQL query, but it is rather straightforward in PL/pgSQL.
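To show how procedural the median really is, here it is in plain Ruby; a PL/pgSQL version would follow essentially the same steps over a sorted array of values (this is a sketch of the algorithm, not the PL/pgSQL code itself):

```ruby
# Median as a procedural routine: sort, then take the middle element
# (or the mean of the two middle elements for an even-length list).
def median(values)
  return nil if values.empty?
  sorted = values.sort
  mid = sorted.length / 2
  if sorted.length.odd?
    sorted[mid].to_f
  else
    (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

median([3, 1, 2])     # => 2.0
median([4, 1, 3, 2])  # => 2.5
```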

If you are not a fan of the PL/pgSQL syntax, there is good news for you: quite a few procedural languages are available in PostgreSQL. One that might interest Ruby developers is PL/Python, which is included in the base distribution. There is also PL/Ruby on the list, but it is not that well supported.

There is more!

If I wanted to describe all of PostgreSQL’s capabilities, I would probably have to write a book, so I will stop here, but I hope I have encouraged you to dive deeper into SQL. If you are interested in procedural languages in Postgres, check out the PostgreSQL Server Programming book. You might also want to take a look at the Awesome Postgres GitHub repo. There are plenty of resources to continue exploring relational databases, and PostgreSQL in particular.

The caveat

Right now, you are probably thinking: “OK, it looks neat and all, but how do I maintain it in a Rails app?”. What can I say, that is an excellent question. Head over to part 2 of this blog post to find out.