Comprehensive Guide to Rate Limiting
Rate limiting is a fundamental mechanism for controlling flow. It appears almost everywhere, from queues outside retail stores to metering lights for cars entering the freeway.
In the digital world, rate limiting controls the amount of traffic flowing to a resource.
Excessive requests can cause a denial of service: the resource goes down because it cannot keep up with the incoming rate. Rate limiting prevents a resource from getting overwhelmed, whether that resource is a piece of network equipment, a server, an endpoint on a server, or an API.
Uncontrolled traffic can also make access to a resource unfair. Imagine students registering for classes in a semester: a flood of requests from a few users can prevent other students from registering. Or imagine being unable to buy tickets to your favorite game or concert simply because too much traffic is being directed at the ticketing server.
The risk of cyberattacks has grown over time, so much so that Gartner predicts security investments are on pace to exceed $215B.
Rate limiting can be a key mechanism for protecting against cyberattacks. By limiting malicious traffic, it can blunt or entirely neutralize many attacks.
Rate limiting is a simple but effective way to protect resources, be they networks, servers, or APIs. It is widely used and offers several benefits.
Appropriate rate limiting is effective in thwarting denial-of-service attacks. Often, to reduce overhead on the server, rate limiting is implemented in hardware, e.g., by programming a Field Programmable Gate Array (FPGA). Such mechanisms enforce limits without software having to process requests, making it difficult for an attacker to overwhelm a server: unimportant traffic gets dropped in hardware at line rate. A more intelligent approach that forwards prioritized traffic to software for finer control can also be implemented.
Ensuring fairness and meeting SLAs improves the user experience and prevents unnecessary delays in serving a client. Too much traffic on a system can degrade the service experience, e.g., a delayed checkout from a cart.
Meeting SLAs is another reason to rate limit. SLAs may be associated with the general availability of a service, which must meet or exceed them. SLAs may also vary by client: depending on, say, the tier of service, each client may be eligible for a different SLA. Rate limiting helps you achieve these SLAs.
Resource consumption can increase radically if not kept in check. Capacity planning for a service requires keeping incoming traffic within bounds, and rate limits do exactly that. Without them, resource consumption can grow unchecked; for example, with auto-scaling enabled, a surge of traffic can drive up the scale factor and, with it, costs.
Fairness in serving clients can also be achieved with rate limiting: everyone gets a fair chance at accessing the resource while the clients sending disproportionately many requests are throttled.
Efficient utilization of resources is a key benefit of rate limiting: it prevents resource starvation and keeps resources available to all users.
Rate limiting is an important tool against cyberattacks and distributed denial of service (DDoS). For instance, a limit can be applied to a flood of traffic from a single IP address. For a DDoS attack originating from many IP addresses, the traffic can be split and inspected to characterize the attack (as coming from a rogue actor or a particular geography) and circumvent it.
Stolen credentials often end up being used by bots, which stuff them into login forms to gain access to credential-protected resources. Rate limiting can identify such credential-stuffing attacks and thwart them.
In another kind of attack, the credentials are not known but simply guessed, with many attempts made to gain access. Such brute-force attacks can be detected and mitigated with rate limiting. Their typical identifying characteristic is a very high number of failed attempts, which consumes significant resources if the attack isn't stopped.
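As an illustration, this brute-force pattern can be caught by counting recent authentication failures per source and blocking sources that exceed a threshold. This is a minimal sketch: the window length, failure threshold, and keying by IP address are all assumptions chosen for the example.

```python
from collections import defaultdict, deque

WINDOW = 60.0      # seconds to remember failures (illustrative)
MAX_FAILURES = 5   # failures allowed per window before blocking (illustrative)

# source (e.g., IP address) -> timestamps of recent failed attempts
failures = defaultdict(deque)

def record_failure(source, now):
    """Note a failed authentication attempt from `source` at time `now`."""
    failures[source].append(now)

def is_blocked(source, now):
    """Block `source` if it accumulated too many failures in the window."""
    recent = failures[source]
    # Discard failures that have aged out of the window.
    while recent and now - recent[0] >= WINDOW:
        recent.popleft()
    return len(recent) >= MAX_FAILURES
```

A real deployment would also expire idle sources and likely keep this state in a shared store, but the core signal, failures per source per window, is the same.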
For retailers running e-commerce websites, scraping is a problem: bots harvest deals and pricing information from the site. Scraping involves a large number of requests from a single client, so rate limiting can be used to prevent it.
There are several algorithms used for rate limiting; we discuss a few common ones here.
Sliding window rate limiting is similar to the sliding window protocol used for flow control in TCP. Each client, IP, or user is given a window of allowed requests. As requests complete, the window slides forward and more requests are allowed. This caps the number of requests outstanding at any given time at the size of the window.
The rate of requests can be tuned by adjusting the size of the window and the duration it covers. While the window tracks the rate, its size can also change as time passes, giving finer control over the limit.
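One common way to realize a sliding window is to keep a log of recent request timestamps and count only those inside the trailing window. The sketch below is illustrative (the window size and duration are arbitrary parameters, not values from any particular system):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `max_requests` within the trailing `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()  # arrival times of accepted requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Slide the window: drop requests older than the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

Because it stores one timestamp per accepted request, this variant is accurate but memory-hungry at high rates; coarser counter-based approximations trade accuracy for less state.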
The leaky bucket and token bucket algorithms both express limits in terms of data or tokens. A leaky bucket starts with a fixed amount of data, which is drained as requests are made. Similarly, a token bucket starts with a fixed number of tokens, which are removed as requests are made.
Both data and tokens are replenished periodically, ensuring a steady rate of data or tokens flowing through the system.
Because the unit of measure in a leaky bucket is data, it may not map to a fixed number of requests over time. The rate at which requests are made may differ from the rate at which data arrives, which can give some requests a longer wait.
A token bucket, by contrast, tracks the number of requests, mapping each request to one or more tokens.
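A token bucket can be sketched in a few lines: tokens accumulate at a fixed rate up to the bucket's capacity, and each request consumes a token. This is a minimal illustrative implementation, not tied to any particular library; the rate and capacity are parameters you would tune for your service.

```python
class TokenBucket:
    """Tokens are replenished at `rate` per second, up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # time of the last replenishment

    def allow(self, now, cost=1):
        # Replenish tokens for the time elapsed since the last check,
        # never exceeding the bucket's capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity controls how large a burst is tolerated, while the rate controls the sustained long-term limit.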
Fixed window rate limiting is time based: time is divided into a sequence of windows, each allowing only a fixed count of requests. Any requests over that count must wait for the next window.
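A fixed window limiter needs only the index of the current window and a counter, which makes it very cheap. A minimal sketch, with illustrative parameters:

```python
class FixedWindowLimiter:
    """Allow at most `max_requests` per fixed window of `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_seconds)  # index of this time window
        if window != self.current_window:
            # A new window has started; reset the counter.
            self.current_window, self.count = window, 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False
```

The known weakness of fixed windows is the boundary: a client can send a full quota at the end of one window and another full quota at the start of the next, briefly doubling the effective rate, which is one reason sliding windows exist.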
With a weighted token bucket, every request carries a weight; for instance, resource-intensive requests may weigh more than lightweight ones. Requests are then rate limited according to their weight.
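A weighted token bucket differs from a plain one only in that each request type consumes a different number of tokens. The request kinds and weights below are hypothetical, chosen purely to illustrate the idea:

```python
class WeightedTokenBucket:
    """Token bucket where each request consumes tokens equal to its weight."""

    # Hypothetical weights: heavier operations drain the budget faster.
    WEIGHTS = {"read": 1, "write": 5, "report": 20}

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, kind, now):
        # Replenish for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        cost = self.WEIGHTS.get(kind, 1)  # unknown kinds cost 1 token
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With this scheme a single expensive "report" request counts as much as twenty cheap "read" requests against the same budget.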
Simple rate limiting counts and limits bytes, packets, or requests. Below we discuss how rate limiting can meet the requirements of different scenarios and how it can limit traffic more intelligently.
To rate limit, we first need to measure and track the incoming rate. But tracking the rate is only one aspect: discovering more details, context, and characteristics of the arriving traffic helps us better understand the flow, and therefore better control it.
At a high level, we can examine the incoming traffic and try to characterize it by its source, the type of request, and the resource it targets.
Understanding what is being rate limited also matters. At a high level we might simply be limiting access to a server. But in a publish/subscribe scenario, we may want to limit the number of messages published overall, the number published by a specific client, or the number published by a specific client to a specific topic.
In the case of an API, we may want to limit the number of requests made to an endpoint, the kind of requests made to that endpoint (e.g., limit the number of POSTs versus GETs), or the number of requests from a specific user (tracked by IP address or JWT).
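In practice, this characterization becomes the rate-limit key: each distinct combination of user, method, and endpoint gets its own counter. The keying scheme and per-method limits below are assumptions for illustration, and the plain counter could be replaced by any of the algorithms discussed earlier:

```python
from collections import defaultdict

def limit_key(user_id, method, endpoint):
    """Hypothetical keying scheme: one counter per (user, method, endpoint)."""
    return f"{user_id}:{method}:{endpoint}"

# Illustrative per-method limits: POSTs are limited more tightly than GETs.
LIMITS = {"POST": 10, "GET": 100}
DEFAULT_LIMIT = 50

counters = defaultdict(int)  # independent count per key

def allow(user_id, method, endpoint):
    key = limit_key(user_id, method, endpoint)
    if counters[key] < LIMITS.get(method, DEFAULT_LIMIT):
        counters[key] += 1
        return True
    return False
```

The important design point is that the key, not the algorithm, decides what gets limited: the same token bucket behaves very differently keyed per IP versus keyed per (user, endpoint).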
Once we can clearly characterize the incoming traffic, we can rate limit it appropriately for the use case.
To implement an efficient rate-limiting mechanism, it is important to consider the characteristics of the incoming traffic and the resource being protected.
Understanding traffic characteristics lets us better judge how much traffic to expect, how bursty it is, and which clients generate it. Designing around these characteristics and their tradeoffs ensures that key properties of the system, such as accurate classification, bounded resource usage, and fairness, are always met.
There are several aspects to consider when getting rate limits right. Here we enumerate some of the key ones.
Traffic identified as excess is rate limited. But what traffic qualifies as excess? How do you ensure that traffic that should be allowed to pass through isn't rate limited?
One way is to build appropriate context for the incoming traffic, letting us better distinguish good traffic from bad. This brings us to the next aspect of rate limiting: how much state must be stored to classify traffic appropriately?
Inspecting traffic at high rates and classifying it for better limiting adds to the state and resources needed before any rate-limiting action is taken. It is vital that the rate-limiting system itself does not exhaust those resources.
Even with the right tradeoffs between classification accuracy and stored state, it is vital that good traffic isn't dropped. This brings us to another aspect of rate limiting: avoiding false positives.
If the rate-limiting algorithms misidentify traffic as bad, legitimate users suffer a denial of service. This should be monitored, and appropriate changes made to improve the accuracy of identifying malicious traffic.
Sometimes there is no malicious traffic at all; the good traffic is simply bursty. Bursty traffic can give the impression of a very high rate even when the average rate is modest, so rate limiting must handle bursts, for example through appropriate buffering or queuing to smooth them out.
One of the hardest parts of rate limiting is getting the limits right. Several external factors determine the traffic rate: the time of day, available bandwidth, the type of service, the number of users, and so on. Set the limits too high and the system may be unable to handle the traffic that gets through, bringing the service down; set them too low and legitimate users are effectively denied service.
Dynamic rate limiting, possibly guided by machine learning, can help anticipate the appropriate limit settings and apply them automatically.
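As a simple illustration of dynamic limiting, a controller might tighten the limit when an observed health signal (here, latency) exceeds a target and relax it otherwise, in the spirit of AIMD (additive increase, multiplicative decrease). All thresholds and step sizes below are assumptions for the example, not a recommended policy:

```python
def adjust_limit(current_limit, observed_latency_ms,
                 target_latency_ms=100, min_limit=10, max_limit=1000):
    """Return a new request limit based on observed latency.

    Multiplicative decrease when the service is struggling,
    additive increase when it is healthy; the result is clamped
    to [min_limit, max_limit].
    """
    if observed_latency_ms > target_latency_ms:
        new_limit = int(current_limit * 0.5)  # back off quickly
    else:
        new_limit = current_limit + 10        # recover gradually
    return max(min_limit, min(max_limit, new_limit))
```

Running this periodically against live latency measurements lets the limit track actual capacity instead of a guess made at deployment time.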
Some key considerations are fundamental to designing a rate-limiting system, the most important being an understanding of the service, the system, and the expectations of its users.
Looking at how popular services implement rate limits is a good way to learn from best practices. EnRoute has a flexible, highly configurable rate-limit engine baked into the system. EnRoute can program Envoy's rate-limit API and is also integrated with Lyft's rate-limit service over gRPC. No additional configuration is required to enable the service; rate limiting just works alongside the other functionality EnRoute provides, including authentication and authorization.
More information about EnRoute’s rate limit engine and real world rate limit examples can be found in another article - “Why Every API Needs a Clock”