Avoiding Cache Stampede with XFetch

1. Introduction

Imagine we’re making requests to a 3rd party API. It’s very common for the providers to introduce rate limiting on the public API endpoints. Hence, if we’re building an application that is very popular but relies on the external API, then we must introduce cache in front of the external API.

When adding cache, the first question that usually comes to mind is the cache key validity period. However, what is usually not considered, is how the cache behaves if it’s accessed concurrently. Not paying enough attention to this issue leads to cache stampede (aka dog-piling).

In this article, we’re going to look at what exactly cache stampede is and how to avoid it efficiently using XFetch.

2. What is Cache Stampede?

Imagine we have an endpoint that allows retrieving weather information of a specific city from weatherapi.com.

Here’s the REST API endpoint:

app.get('/weather', async (req, res) => {
  const city = req.query.city;
  const weather = await getCachedWeather(city);

  res.send({data: weather});
})

Here’s the getCachedWeather implementation:

const getWeather = async city => { 
  const response = await axios.get(
      `http://api.weatherapi.com/v1/current.json?key=${API_KEY}&q=${city}&aqi=no`
  );

  return response.data;
};

const getCachedWeather = async city => {
  const cacheKey = `weatherapi:${city}`;

  const cachedResult = await getFromCache(cacheKey);

  if (cachedResult) {
    return cachedResult
  }

  const weather = await getWeather(city);

  await setCacheValue({
    key: cacheKey,
    value: weather,
    expirationInSeconds: 20
  });

  return weather;
}

Function getCachedWeather uses Redis as a cache with the expire time argument. If the cache key expires in Redis, then it disappears and therefore cachedResult is null, triggering cache refresh.

Here are getFromCache and setCacheValue functions:

const getFromCache = async key => {
  const cacheResult = await redis.get(key);

  if (!cacheResult) {
    return null;
  }

  return JSON.parse(cacheResult);
}

const setCacheValue = ({key, value, expirationInSeconds}) => redis.set(
    key, JSON.stringify(value), 'ex', expirationInSeconds
);

What happens if 1,000 requests invoke getCachedWeather for the same city concurrently? When the cache is valid, then everything works perfectly.

However, imagine a scenario where the getWeather returns a result in 5 seconds, and a server receives 1,000 concurrent requests less than 5 seconds after the cache expired. In such a case, all the requests act as if there’s no cache and invoke getWeather 1,000 times even though refreshing the cached value just once is sufficient.

The issue we just saw was cache stampede (aka dog-piling).

The outcomes of cache stampede are nasty. In the case of a 3rd party API, we may exceed the rate limit. When the issue occurs inside our own system, then we may kill a service or the database.

3. Avoiding Cache Stampede with Traditional Means

How to avoid cache stampede? As it’s a very common problem to have, then there are lots of different solutions.

3.1. Locking

Probably the most straightforward solution would be to add locking to the cache module. A single process, which updates the cache, takes a lock. The other processes that try to fetch the data from the same cache key have an option to either wait until the lock is released or to return stale data if the data is present.

Using locks eradicates cache stampede completely as it’s possible to ensure strictly that just a single process updates the cache key.

Even though the solution seems perfect, it creates issues that make the scalability of the application very complex.

To begin with, when we acquire a lock for updating a specific cache key, we always need to determine a reasonable time-to-live (TTL) value for the lock. After the TTL value is exceeded, then the lock has to be released automatically. Picking a reasonable TTL ensures that if the cache update failed for some reason, then we don’t block the other processes from updating the lock indefinitely.

In addition, when we need to scale our application server horizontally, we need to be able to share locking information with the other servers. Hence, it’s often not enough to introduce locking on a single machine. Instead, we need to have a distributed locking solution like Redis which takes care of acquiring and releasing distributed locks.

3.2. External Recomputation

An alternative mean of removing cache stampede would be to move the cache recalculation logic to an external process. We just need to ensure that there’s a single updating process running at all times.

However, as the recalculation process is not part of the core application logic, we need to have a mechanism in place to ensure that the process does not die. Therefore, external recomputation adds management overhead.

In addition, we may potentially split the logic of retrieving a value from the cache to multiple different modules or processes. This fact makes the codebase more complex to manage.

Last but not least, when we recalculate the cached values in an external process, we don’t know exactly which cache keys we need to update. Therefore, the solution is very inefficient, as we may end up recalculating values that we don’t actually need at all.

4. Avoiding Cache Stampede with Probabilistic Early Recomputation

As we saw from the previous part, both locking and external recomputation come with serious downsides.

Probabilistic early recomputation is an effective mechanism for avoiding cache stampede. The core idea is that each individual process that accesses the cache, with a certain probability, travels to the future when the cache is expired and updates the cached value. The closer we get to the cache expiration, the higher the chance that the cache update will happen.

Let’s take a look at it more closely. In a normal case, the cache refresh would be triggered when the current timestamp is larger than the cache expiration timestamp:

However, as stated above, the key idea of probabilistic early recomputation is to travel with a certain probability to the future and refresh the cache. So the logic of probabilistic early recomputation could be expressed via the following equation:

Once someRandomVariable is large enough or currentTimestamp is close enough, then we trigger cache refresh. The next question would be how to calculate the value of someRandomVariable.

4.1. Early Recomputation with Uniform Distribution

One possible option would be to choose a random variable from the uniform distribution between 0 and ϵ. The parameter ϵ allows us to control how early do we want to recalculate the cache. The larger ϵ is, the earlier we allow the recomputation to happen. When reducing ϵ, we allow the recompute to happen later, but we increase the chances of a cache stampede happening.

Let’s look at an example. Imagine that an item in the cache expires in 60 seconds, and it takes 1 second to recompute the value. To avoid cache stampede, we need to recalculate the cache at the latest in 59 seconds. Otherwise, we end up causing a cache stampede

If we choose ϵ to be 15, the cache recompute can happen the earliest when there’s just 15 seconds left to the cache item expiration. The closer we get to the cache expiration, the higher the probability of an early recomputation gets. However, if we choose ϵ to be 30, then we immediately allow the earliest recomputation to happen at least 30 seconds before the expiration.

4.2. Early Recomputation with XFetch

Andrea Vattani, Flavio Chicheretti and Keegan Lowenstein have proven in their paper “Optimal Probabilistic Cache Stampede Prevention” that the uniform distribution is suboptimal. In the search for the perfect value for someRandomVariable they prove that exponential distribution is much more effective than the uniform distribution.

They propose the following equation:

delta denotes the time to recompute the value, beta is a parameter which we can tune ourselves, and random() generates a random number between 0 and 1.

We can choose the value of beta . Ìf we set it to be > 1.0, then we favour earlier recomputations, if we set it to be < 1.0, then we favour later recomputations.

The result of the calculation is a negative number which means that the final equation will change a tiny bit:

Here’s the function implementing the algorithm:

const shouldRecompute = ({lastRecomputeDuration, keyTtlInSeconds}) => {
  const delta = lastRecomputeDuration;
  // > 1 favours earlier recomputation, < 1 favours later
  const beta = 2;
  const random = Math.random();
  const xfetch = delta * beta * Math.log(random);

  const currentTimestampInSeconds = Math.round(Date.now() / 1000);

  const cacheExpiresAt = currentTimestampInSeconds + keyTtlInSeconds;

  const isEarlyRecomputeRequired = (currentTimestampInSeconds - xfetch)
      >= cacheExpiresAt;

  return isEarlyRecomputeRequired;
}

Besides just storing the cached value, we also need to store the duration of the last recomputation right next to the cached value.

Compared to avoiding stampede with locking, the probabilistic algorithm does not avoid the chance of two concurrent requests updating the same cache key at the same time. So if the requirement is to avoid any kind of concurrent issues at all costs, then probabilistic cache stampede prevention with XFetch is not a good fit. However, in most cases, XFetch is a suitable algorithm to prevent issues with dog-piling.

5. Conclusion

In this article, we looked at what cache stampede (a.k.a dog-piling) is and the mechanisms for avoiding it.

While locking and external recomputation are both seemingly easy to implement, both implementations have serious downsides. That’s the place where probabilistic early recomputation comes into play.

Here’s the implementation of the algorithm.