At Rebased we have an internal time tracking project called Harmonogram. Its backend is built in Ruby on Rails. It started as a playground for new hires to get used to company culture and conventions, but over time grew into a fully functional tool. When I joined, the application was already being used internally and we were running into interesting performance problems. One of them was hitting the rate limit of the Toggl API. And since this was my first week at the company, I was tasked with solving this issue.
The problem
The Toggl documentation recommended sending requests at most once every second in order to avoid hitting the limit – which we were already doing in a naive way, so the solution wouldn’t be as easy as adding sleep 1 to every request.
report_data = []
page = 1

loop do
  response = client.get(payload.merge(page: page))
  report_data.concat(response['data'])
  page += 1
  break if report_data.size >= response['total_count']
  sleep 1 # not enough :(
end
After looking around, I realised the code above was being called from many places and processes: the web server, manual Rake tasks, background jobs and cron. This made the sleep 1 solution ineffective.
I needed something that would synchronise all of these callers in a way that:
- doesn’t introduce too many changes to the existing code,
- doesn’t add any new technologies to the current stack,
- guarantees that there’s at least a 1 second delay between calls.
In order to pull this off I’d need to set up a semaphore shared by multiple Ruby processes – and a mechanism that would block API calls until at least a second has passed since the last one. I needed a place to store and share information about those calls. Redis seemed like a perfect solution, mainly because we can set keys that expire after a given number of milliseconds – which would allow us to store a key that expires after one second and have other callers wait until it expires. Luckily we were already using Redis for background jobs, so I wouldn’t be adding new dependencies.
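Here’s that primitive in isolation – a minimal sketch, assuming a local Redis server and the redis gem (the key name is made up):

require 'redis'

redis = Redis.new # assumes Redis is running locally

# Create a key that Redis deletes automatically after 1000 milliseconds.
redis.set('example_lock', 'locked', px: 1000)

redis.pttl('example_lock') # => remaining lifetime in ms, e.g. 997
sleep 1.1
redis.get('example_lock') # => nil – the key has expired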
The solution
To keep the changes to a minimum I decided to create a new class. It should allow us to execute a block of code with a globally limited rate, but without having to know how that happens. Ideally we’d initialize that “limiter” and inject it where needed, keeping the rate limiting separate from other logic.
report_data = []
page = 1

loop do
  # now using the rate limiter instead of sleeping
  response = rate_limiter.with_limited_rate { client.get(payload.merge(page: page)) }
  report_data.concat(response['data'])
  page += 1
  break if report_data.size >= response['total_count']
end
If we assume that the rate limiter works as needed, this should solve our problem. But something doesn’t look right. Rate limiting should be the responsibility of the client. The client exists to insulate us from the low-level details of communicating with the API – and that includes rate limiting.
class Client
  def initialize(token, rate_limiter: TogglApi::RateLimiter.new)
    # ...
    @rate_limiter = rate_limiter
  end

  def get(payload)
    # every request goes through the rate limiter
    response = rate_limiter.with_limited_rate { @client.get(payload) }
    Response.parse(response)
  end

  # ...

  private

  attr_reader :rate_limiter
end
This looks better. Now the API client deals with rate limiting and we can simplify our example from earlier:
def some_method
  report_data = []
  page = 1

  loop do
    # no need to worry about rate limiting now
    response = client.get(payload.merge(page: page))
    report_data.concat(response['data'])
    page += 1
    break if report_data.size >= response['total_count']
  end
end

def client
  @client ||= Client.new(toggl_api_token, rate_limiter: RateLimiter.new)
end
The rate limiter
Now that we have an idea of what the limiter interface should look like, we can talk about the implementation details.
class RateLimiter
  TimedOut = ::Class.new(::StandardError)

  REDIS_KEY = "harmonogram_#{Rails.env}_rate_limiter_lock".freeze

  def initialize(redis = Redis.current)
    @redis = redis
    @interval = 1 # seconds between subsequent calls
    @timeout = 15 # amount of time to wait for a time slot
  end

  def with_limited_rate
    started_at = Time.now
    retries = 0

    until claim_time_slot!
      if Time.now - started_at > timeout
        raise TimedOut, "Started at: #{started_at}, timeout: #{timeout}, retries: #{retries}"
      end

      sleep seconds_until_next_slot(retries += 1)
    end

    yield
  end

  private

  attr_reader :redis, :interval, :timeout
The main element is the with_limited_rate method, which ends with a yield call. It calls the private claim_time_slot! in a loop until it either succeeds or runs out of time. We give it a limited amount of time because we don’t want it to hang forever, causing timeouts in other places. In case of a timeout we raise a custom error with data for debugging. Inside the loop, there’s a sleep call with a calculated delay in seconds.
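Once the class is complete (the private methods follow below), exercising it from a console gives a feel for the spacing – a rough sketch, assuming a local Redis; in production the competing callers are separate processes, but threads are enough to observe the effect:

limiter = RateLimiter.new(Redis.new)

threads = 5.times.map do |i|
  Thread.new do
    limiter.with_limited_rate do
      puts "call #{i} executed at #{Time.now.strftime('%H:%M:%S.%L')}"
    end
  end
end

threads.each(&:join)
# The printed timestamps should be at least ~1 second apart.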
  def claim_time_slot!
    # SET the key only if it does not exist yet (nx) and have Redis
    # expire it automatically after the interval (px takes milliseconds).
    redis.set(REDIS_KEY, 'locked', px: (interval * 1000).round, nx: true)
  end
The claim_time_slot! method is straightforward. It calls the Redis instance to set the value 'locked' on the REDIS_KEY key. This value will expire after px milliseconds, and nx: true means it will only set the value if it doesn’t exist yet. The return value of redis.set is truthy when the key was successfully created and falsy otherwise. In other words, if no other instance of the rate limiter called redis.set in the last second, then claim_time_slot! will return true and the block passed to with_limited_rate will be called.
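The behaviour is easy to verify in a console – a quick sketch, assuming a local Redis (the key name is made up):

redis = Redis.new

redis.set('demo_lock', 'locked', px: 1000, nx: true) # => true, slot claimed
redis.set('demo_lock', 'locked', px: 1000, nx: true) # => false, still locked

sleep 1.1 # wait for the key to expire

redis.set('demo_lock', 'locked', px: 1000, nx: true) # => true, free again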
  def seconds_until_next_slot(retries)
    ttl = redis.pttl(REDIS_KEY)
    ttl = ttl.negative? ? interval * 1000 : ttl
    ttl += calculate_next_slot_offset(retries)
    ttl / 1000.0
  end

  # Calculates an offset between 10ms and 50ms to avoid hitting the key right before it expires.
  # As the number of retries grows, the offset gets smaller to prioritize earlier requests.
  def calculate_next_slot_offset(retries)
    [10, 50 - [retries, 50].min].max
  end
end
The seconds_until_next_slot method is more interesting. It uses the pttl method, which returns the current TTL (time to live) of a given key, in milliseconds. At this point we know that another instance of the limiter has claimed a time slot, and we want to figure out how long we have to wait. It is possible, though, that the key no longer exists because it has just expired. In that situation the returned value is negative, and we replace it with a full interval to avoid unexpected race conditions or going through retries without waiting. Then we add a small offset to the TTL, convert it to seconds and return the calculated value to be used in sleep.
Why not ask for seconds in the first place if we’re converting them anyway? Asking Redis for the TTL in seconds (with the ttl command) returns a rounded value. By asking for milliseconds we can convert them to fractions of a second.
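For example – a quick sketch, assuming a local Redis (the key name is made up):

redis = Redis.new
redis.set('demo_lock', 'locked', px: 1000)

redis.ttl('demo_lock')    # => 1, whole seconds only
redis.pttl('demo_lock')   # => 997, millisecond precision
redis.pttl('missing_key') # => -2, the key does not exist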
What’s the point of calculate_next_slot_offset(retries)? This is a trick that allows us to prioritize calls that have been waiting for a slot longer. A kind of scheduler, if you will.
calculate_next_slot_offset(0) # => 50
calculate_next_slot_offset(5) # => 45
calculate_next_slot_offset(10) # => 40
calculate_next_slot_offset(15) # => 35
calculate_next_slot_offset(50) # => 10
calculate_next_slot_offset(100) # => 10
Calls with a higher retries count will get a smaller offset and therefore have a higher chance of “claiming” a time slot before others.
Given a 1s interval and offset values ranging from 10ms to 50ms, we will be able to prioritize callers with timeouts of up to ~40s: the offset shrinks by 1ms per retry and bottoms out after 40 retries, and each retry waits roughly a full interval.
The 10ms minimal value is there to make sure we don’t hit the key just before it expires and miss an empty slot.
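To make the prioritization concrete, suppose two callers both find the key with 600ms left to live (the numbers are made up for illustration):

ttl = 600 # ms remaining on the key, as reported by pttl

# Caller A has retried 30 times, caller B only twice.
sleep_a = (ttl + calculate_next_slot_offset(30)) / 1000.0 # => 0.62
sleep_b = (ttl + calculate_next_slot_offset(2)) / 1000.0  # => 0.648

# A wakes up 28ms before B, so it has a better chance of claiming the slot.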
Closing thoughts
With the rate limiter implemented and the client configured to use it, we can stop worrying about rate limit errors from the Toggl API. But it’s not ready for release just yet. We still have to write unit tests and make sure they don’t take too long to run. Stay tuned for part 2, where we’ll do just that.
Meanwhile – a working demo of the rate limiter!
This solution is a variation on a distributed lock; you can read more about that on the Redis website.