July 11, 2024
In this article, we’ll go over the basics of what an API rate limiter is. We’ll answer fundamental questions such as why we need a rate limiter and how we can design and implement it. This will be a hands-on guide where we put our knowledge into practice and build something useful.
Let’s start by answering the very first question: what exactly is a rate limiter?
A rate limiter is usually a service that protects downstream services from being overwhelmed by a single client. For example, when you are entering your OTP during an online payment and you enter three wrong OTPs consecutively, you are no longer allowed to request an OTP or make any transactions for a certain time frame. But why is this done? What good does blocking a user from entering the OTP do?
From a security perspective, it is possible that someone is trying random OTPs to get through the transaction. What if it’s not even a person but a script? Rate limiting safeguards both the user and the services that would otherwise keep sending OTPs to such scripts.
In short, a rate limiter limits how many requests a sender can issue in a specific timeframe. If the number of requests exceeds that limit, it blocks or throttles them.
As explained above, protection from bad clients is one use case. Hackers can run scripts that brute-force passwords until they gain access to your account. They can also perform a Denial-of-Service (DoS) attack on a server; without a rate limiter, services wouldn’t be available to legitimate users. A rate limiter prevents DoS attacks, intentional or unintentional, by blocking the excess calls. For example, Twitter limits the number of tweets a user can send in a 24-hour period. Recently, Twitter also announced limits on how many tweets a user can see during a 24-hour window.
Twitter Rate Limiter
In some cases, you may want to sell your services via an API and charge for it. For example, OpenAI allows free users only a handful of GPT-4 requests per day. That is a form of rate limiting. If you pay for their subscription, that limit is lifted.
Some businesses use rate limiting to prevent bots from accessing their resources. Some also use machine learning on top of rate limiting to determine if a request is from a genuine user or a bot.
Another use case is reducing cost. Rate limiting is extremely important for small start-ups that are billed on a per-call basis by providers such as Amazon for their web services. Limiting the number of such API calls is essential to keep costs down.
Let’s now understand the requirements and goals of the system:
Rate limiting can be divided into three types:
Let's now talk about the algorithms by which we can implement the rate limiter. Rate limiting can be implemented using different algorithms, and each of them has distinct pros and cons. In the following sections, we will try to understand each one of them using examples.
In the Fixed Window Algorithm, time is divided into fixed intervals or windows, such as seconds, minutes, or hours. Each window has a maximum number of requests that can be processed. Once the limit is reached within a window, any further requests are blocked or throttled until the next window starts.
Example: If the rate limit is set to 3 requests per minute, and the time window starts at 12:00:00, then between 12:00:00 and 12:00:59, only 3 requests will be accepted. At 12:01:00, the counter resets, and another 3 requests can be accepted.
Fixed Window Algorithm
The Sliding Window Algorithm, also known as the Rolling Window Algorithm, addresses the main drawback of the Fixed Window Algorithm: because the counter resets at window boundaries, a burst of requests just before and just after a boundary can let through up to twice the allowed limit. Instead of resetting at fixed intervals, the sliding window checks the rate limit based on a rolling count of the requests within the window duration ending at the current moment.
Example: If the rate limit is set to 3 requests per minute, the algorithm maintains a rolling count of requests within the last 60 seconds at any given time. For each new request, it checks if the number of requests in the past 60 seconds exceeds the limit. If yes, the request is blocked; if no, the request is processed and the count is updated.
Sliding Window Algorithm
Another popular algorithm is the Token Bucket Algorithm, which allows for a certain burstiness in request rates while enforcing a rate limit over a longer period. Tokens are added to the bucket at a constant rate, and each request consumes a token. If there are no tokens left, the request is blocked or throttled.
The token bucket algorithm takes two parameters: the bucket size (the maximum number of tokens the bucket can hold) and the refill rate (the number of tokens put into the bucket per unit of time).
Example: If the rate limit is set to 3 requests per minute, tokens are added to the bucket at a rate of 1 token every 20 seconds. The bucket can hold a maximum of 3 tokens. When a request comes in, it consumes a token. If the bucket is empty, the request is blocked.
Token Bucket Algorithm
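As a concrete illustration, here is a minimal, single-threaded Java sketch of the token bucket for the 3-requests-per-minute example. The class and constant names are our own, and a production limiter would also need thread safety and eviction of idle buckets.

```java
import java.util.HashMap;
import java.util.Map;

public class TokenBucketLimiter {
    // maximum tokens the bucket can hold (the burst capacity)
    private static final int BUCKET_SIZE = 3;
    // 3 tokens per minute = 1 token every 20 seconds, expressed per millisecond
    private static final double REFILL_RATE_PER_MS = 3.0 / 60_000.0;

    private static class Bucket {
        double tokens = BUCKET_SIZE;              // start full so a new user can burst
        long lastRefill = System.currentTimeMillis();
    }

    private final Map<String, Bucket> buckets = new HashMap<>();

    public boolean request(String userId) {
        Bucket b = buckets.computeIfAbsent(userId, k -> new Bucket());
        long now = System.currentTimeMillis();
        // lazily refill based on the time elapsed since the last request, capped at the bucket size
        b.tokens = Math.min(BUCKET_SIZE, b.tokens + (now - b.lastRefill) * REFILL_RATE_PER_MS);
        b.lastRefill = now;
        if (b.tokens >= 1) {
            b.tokens -= 1;   // consume one token for this request
            return true;
        }
        return false;        // bucket empty: the request is blocked
    }

    public static void main(String[] args) {
        TokenBucketLimiter limiter = new TokenBucketLimiter();
        // four back-to-back requests: the fourth finds the bucket empty
        for (int i = 0; i < 4; i++) {
            System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        }
    }
}
```

Refilling lazily on each request, rather than with a background timer, keeps the implementation simple and is a common way to realize this algorithm.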
The Leaky Bucket Algorithm is similar to the Token Bucket Algorithm but focuses on smoothing out the traffic. Requests are added to a queue (bucket) and processed at a fixed rate. If the queue is full, incoming requests are dropped.
It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows: when a request arrives, the system checks whether the queue is full. If it is not, the request is added to the queue; otherwise, the request is dropped. Requests are then pulled from the queue and processed at regular intervals.
The leaking bucket algorithm takes the following two parameters: the bucket size (how many requests the queue can hold) and the outflow rate (how many requests are processed per unit of time).
Example: If the rate limit is set to 3 requests per minute, the bucket processes 1 request every 20 seconds. Incoming requests are added to the bucket, and if it is full, new requests are dropped.
Leaky Bucket Algorithm
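The admission side of this algorithm can be sketched in Java as follows. This is a simplified, single-user, single-threaded sketch with names of our own choosing; the actual processing of queued requests is left abstract, and we only model whether an incoming request is queued or dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LeakyBucketLimiter {
    // maximum number of requests the queue (bucket) can hold
    private static final int BUCKET_SIZE = 3;
    // outflow rate: one request leaks (is processed) every 20 seconds = 3 per minute
    private static final long LEAK_INTERVAL_MS = 20_000;

    private final Deque<Long> queue = new ArrayDeque<>(); // timestamps of queued requests
    private long lastLeak = System.currentTimeMillis();

    public boolean request() {
        long now = System.currentTimeMillis();
        // leak: remove one queued request for every full interval that has elapsed
        long leaks = (now - lastLeak) / LEAK_INTERVAL_MS;
        for (long i = 0; i < leaks && !queue.isEmpty(); i++) {
            queue.removeFirst();
        }
        if (leaks > 0) {
            lastLeak += leaks * LEAK_INTERVAL_MS;
        }
        if (queue.size() >= BUCKET_SIZE) {
            return false;        // bucket full: drop the request
        }
        queue.addLast(now);      // otherwise enqueue it for processing
        return true;
    }

    public static void main(String[] args) {
        LeakyBucketLimiter limiter = new LeakyBucketLimiter();
        // four back-to-back requests: the fourth finds the bucket full
        for (int i = 0; i < 4; i++) {
            System.out.println("Queued for processing? " + (limiter.request() ? "Yes" : "No"));
        }
    }
}
```

Unlike the token bucket, a full bucket here never admits a burst: requests always flow out at the fixed rate, which is what smooths the traffic.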
Let us ask a more general question: how do we want to limit the rate? Are we going to limit based on some user_id, or just the IP address? And what will the HLD (high-level diagram) look like? Here’s an overly simplified HLD of an API rate limiter. It sits between the client and edge services such as the API gateway. Nowadays, rate limiters often ship within the API gateway itself; an API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, serving static content, etc. But for this particular case, we’ll assume that the API gateway doesn’t provide such functionality.
The basic idea of a rate limiting algorithm is quite simple. At a high level, all we need is a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter is larger than the limit, the request is dropped.
HLD of API Rate limiter
Let's go over the above diagram. When a user wants to access a service, the request first goes to the rate limiter. Based on the chosen algorithm, the rate limiter decides whether to allow the request or drop it. To store the previous counter state, we need some sort of storage, and for a rate limiter, key-value datastores are the best design choice: lookups and updates are done in near-constant time, so they add almost no additional latency.
Counters? Well, depending on which algorithm we use, the data structure will change. For example, in the case of the fixed window, we'd keep the following information:
```json
{
  "user059224": {
    "count": 2,
    "timestamp": "12:00:00"
  },
  "user059123": {
    "count": 1,
    "timestamp": "12:01:00"
  }
}
```
Notice that we are storing the timestamp as well as the count for the user. The rate limiter then checks whether we are exceeding the threshold as follows:
- If there is no entry for the user, or the stored timestamp is older than the window, start a new window: reset the count to 1, set the timestamp to the current time, and allow the request.
- Otherwise, if the count is below the threshold, increment it and allow the request.
- Otherwise, drop the request.
The above implementation can easily be done using an in-memory hashmap, but if we are expecting millions of users, storing this information in the memory of a single process will not hold up. For this, we could go with Redis, a popular option for implementing rate limiting. It is an in-memory store that offers fast lookups and updates.
What could go wrong in the above implementation?
The HLD doesn't say anything about the rate-limiting rules. There can be cases where we want to apply different rate limits depending on the type of request.
For example, we may only allow 5 OTP requests per minute, but we might want to allow 20 post creations in a day. Notice that both the numbers and the periods are different.
Lyft open-sourced their rate-limiting components. Let's have a look at them.
```yaml
---
domain: post_creation
descriptors:
  - key: posts
    value: account
    rate_limit:
      unit: day
      requests_per_unit: 20
```
This rule says that a user cannot create more than 20 posts in a day. These rules are often stored on disk since they rarely change, but we can cache them in memory for faster lookups.
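To connect such a rule to the limiter, here is a hypothetical Java sketch (the class and field names are ours, not Lyft's) of how a parsed rule might be represented and converted into a window length for the counting logic:

```java
public class RateLimitRule {
    final String domain;       // e.g. "post_creation"
    final String unit;         // "second", "minute", "hour", or "day"
    final int requestsPerUnit; // the threshold for that window

    RateLimitRule(String domain, String unit, int requestsPerUnit) {
        this.domain = domain;
        this.unit = unit;
        this.requestsPerUnit = requestsPerUnit;
    }

    // translate the human-readable unit into a window length in milliseconds
    long windowMillis() {
        switch (unit) {
            case "second": return 1_000L;
            case "minute": return 60_000L;
            case "hour":   return 3_600_000L;
            case "day":    return 86_400_000L;
            default: throw new IllegalArgumentException("unknown unit: " + unit);
        }
    }

    public static void main(String[] args) {
        RateLimitRule rule = new RateLimitRule("post_creation", "day", 20);
        System.out.println(rule.domain + ": " + rule.requestsPerUnit
                + " requests per " + rule.windowMillis() + " ms");
    }
}
```

With rules in this shape, the limiter can keep one counter per (domain, user) pair and look up the threshold and window from the cached rule.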
Let’s answer some back-of-the-envelope questions. Say we were to implement this on a single machine; how much memory would we need to store all of the users' data?
Let’s say the userId takes 8 bytes (in Java, 1 byte can hold 1 ASCII character). Along with that we have a counter, an int that requires 4 bytes, and a timestamp, which is usually a long and in Java takes 8 bytes.
So for 1 user, we’d need at least 8 + 4 + 8 = 20 bytes.
We’ll also be maintaining some sort of metadata (that’ll be the internals of hashtable), for that let’s provision 20 bytes more for each record.
Thus, for a single user we'll need 40 bytes, and scaling this to one million users, the total memory we would need is 40 * 10^6 bytes = 40 MB.
40 MB can easily fit on a single server; however, we would definitely not want to route all our traffic through a single rate limiter. Why? Because if we allow each user to hit 10 requests per second, this would amount to 10 * 1 million = 10 million RPS (requests per second), which is far too much for a single server.
Thus, for this sort of requirement it is inevitable to use a distributed cache, and Redis or Memcached will be useful for that.
Here’s an in-memory implementation of the Fixed Window algorithm:
```java
import java.util.HashMap;
import java.util.Map;

public class Limiter {

    private static final Map<String, FixedWindow> hashMap = new HashMap<>();
    // how many requests we are going to allow per window
    private static final int THRESHOLD = 3;
    // the window length; if it is 1, we allow 3 requests per 1 second
    private static final int PERIOD_IN_SECONDS = 1;

    static class FixedWindow {
        public int count;
        public long timestamp;

        public FixedWindow(int count, long timestamp) {
            this.count = count;
            this.timestamp = timestamp;
        }
    }

    // Returns true if the request is allowed; false means the limit
    // was exceeded and the request was dropped.
    public boolean request(String userId) {
        FixedWindow fixedWindow = hashMap.get(userId);
        long currentTime = System.currentTimeMillis();
        if (fixedWindow == null || currentTime - fixedWindow.timestamp > PERIOD_IN_SECONDS * 1000L) {
            // first request from this user, or the previous window expired: start a new window
            hashMap.put(userId, new FixedWindow(1, currentTime));
            return true;
        } else {
            // check if we have already hit the threshold in the current window
            if (fixedWindow.count < THRESHOLD) {
                fixedWindow.count += 1;
                return true;
            } else {
                return false;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Limiter limiter = new Limiter();
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        Thread.sleep(1000);
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
    }
}
```
In this algorithm, we can maintain a deque of timestamps associated with each userId:
```json
{
  "user059224": [
    "12:00:00",
    "12:01:10",
    "12:01:28"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}
```
When a user hits the API (say at 12:01:30, with a one-minute window), timestamps that have fallen out of the window, like 12:00:00, are evicted from the front of the deque:
```json
{
  "user059224": [
    "12:01:10",
    "12:01:28"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}
```
Since the window now holds only 2 timestamps, which is below the threshold of 3, the new timestamp (12:01:30) is appended to the back of the deque:
```json
{
  "user059224": [
    "12:01:10",
    "12:01:28",
    "12:01:30"
  ],
  "user013571": [
    "14:01:02",
    "14:05:11",
    "14:20:00"
  ]
}
```
Let’s compute how much memory we'll need to store all the user data for the sliding window. Here again, the userId will take 8 bytes. But this time we are storing a list of timestamps, and each timestamp is a long, so one element of that list takes around 8 bytes. Assuming we support 10 requests per minute, in the worst case the list will contain 10 elements. Let's also budget 12 bytes of overhead per element for the deque implementation and 20 bytes for hashtable overhead.
Thus, one user will take: 8 (userId) + (8 (timestamp) + 20 (hashtable) + 12 (deque)) * 10 = 408 bytes.
For 1 million users we will need about 408 MB, which is still manageable.
Notice what happens if we increase the window to hours. If we allow 500 requests per hour, we'll have to store up to 500 timestamps, which gives: 8 + (8 + 20 + 12) * 500 ≈ 20,000 bytes = 20 KB per user, and for 1 million users about 20 GB.
The sliding window takes a lot more memory than the fixed window.
Here’s an in-memory implementation of the Sliding Window:
```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class Limiter {
    private static final Map<String, Deque<Long>> hashMap = new HashMap<>();
    // how many requests we are going to allow per window
    private static final int THRESHOLD = 3;
    // the window length; if it is 1, we allow 3 requests per 1 second
    private static final int PERIOD_IN_SECONDS = 1;

    public boolean request(String userId) {
        Deque<Long> deque = hashMap.get(userId);
        long currentTime = System.currentTimeMillis();
        // first request from this user: the deque won't be initialized yet
        if (deque == null) {
            deque = new ArrayDeque<>();
            deque.addLast(currentTime);
            hashMap.put(userId, deque);
            return true;
        }
        // evict timestamps that are no longer inside the (t - 1s, t] window
        while (!deque.isEmpty() && currentTime - deque.getFirst() > PERIOD_IN_SECONDS * 1000L) {
            deque.removeFirst();
        }
        // if the window already has 3 requests, drop this one
        if (deque.size() >= THRESHOLD) {
            return false;
        }
        // otherwise record it
        deque.addLast(currentTime);
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Limiter limiter = new Limiter();
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        Thread.sleep(1000);
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
        System.out.println("Sent to API Gateway Server? " + (limiter.request("1234") ? "Yes" : "No"));
    }
}
```
Before we end this article, let’s discuss what we should rate limit the requests on. Shall we do it by IP or by user ID?
Rate limiting can be implemented either by IP address or by user ID, depending on the specific needs of your application. IP-based rate limiting is simpler and effective for public APIs, providing immediate protection against DoS attacks by limiting requests from a single IP. However, it may unfairly restrict users sharing the same IP or allow attackers to bypass limits using proxies.
User ID-based rate limiting ensures fair usage for each authenticated user, offering a consistent experience even if users change IP addresses. This approach, though, requires authentication and is more complex to implement.
For optimal protection, you can combine both methods: primarily limiting by user ID to ensure fairness and secondarily by IP to guard against abusive IP behavior. This hybrid strategy balances user fairness with robust security.
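As a rough sketch of this hybrid idea (the class name and thresholds are illustrative, not from any particular system), we can run a fixed-window check per user and a looser one per IP, allowing a request only when both pass:

```java
import java.util.HashMap;
import java.util.Map;

public class HybridLimiter {
    private static final long PERIOD_MS = 1000;

    // one fixed-window counter per key (a user id or an IP address)
    private static class Window {
        int count;
        long start;
        Window(long start) { this.count = 1; this.start = start; }
    }

    private final Map<String, Window> windows = new HashMap<>();

    private boolean allow(String key, int threshold) {
        long now = System.currentTimeMillis();
        Window w = windows.get(key);
        if (w == null || now - w.start > PERIOD_MS) {
            windows.put(key, new Window(now));  // start a fresh window for this key
            return true;
        }
        if (w.count < threshold) {
            w.count++;
            return true;
        }
        return false;
    }

    // primary limit per user (3/s), secondary looser limit per IP (10/s)
    public boolean request(String userId, String ip) {
        return allow("user:" + userId, 3) && allow("ip:" + ip, 10);
    }

    public static void main(String[] args) {
        HybridLimiter limiter = new HybridLimiter();
        for (int i = 0; i < 4; i++) {
            System.out.println("Allowed? " + limiter.request("1234", "10.0.0.1"));
        }
    }
}
```

Note the short-circuit: a request already blocked by the user limit does not count against the IP window. Whether blocked attempts should still be charged to the IP is a design choice worth making explicit in a real system.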
Please note: this implementation is overly simplified and high-level, but it gives an overall understanding of how things work under the hood. I hope this article helped you understand the basics of API rate limiters.