Elasticsearch Refresh Interval Defaults and the Weekend Problem

March 6, 2022

While monitoring Elasticsearch request durations for our app, we noticed a "weekend problem" with some larger indexes:

  • On weekdays, when activity was high, searches would perform well.
  • On the weekend, when activity was low, there would be a spike in average search times.

This seemed backwards: Elasticsearch was performing worse precisely when the system was least burdened.

Further investigation showed that, during periods of low activity, the first few queries were slow and the ones after them were reliably fast. Running 10 requests in succession produced response times like:

1088.9ms
1197.9ms
455.2ms
3.3ms
3.1ms
3.0ms
2.7ms
4.0ms
2.7ms
2.5ms

Waiting a few minutes would start the process all over again: a few slow queries would be followed by many fast queries.
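To reproduce this pattern, one can time a burst of back-to-back queries and count how many slow requests appear before the fast ones. The sketch below is hypothetical: `search` stands in for whatever zero-argument callable issues one query with your client, and the 100 ms threshold separating "slow" from "fast" is an arbitrary assumption, not an Elasticsearch value.

```python
import time
from typing import Callable, List


def time_requests(search: Callable[[], object], n: int = 10) -> List[float]:
    """Call `search` n times in succession; return each duration in milliseconds."""
    durations: List[float] = []
    for _ in range(n):
        start = time.perf_counter()
        search()  # hypothetical: one query against the index under test
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def count_slow_leading(durations: List[float], threshold_ms: float = 100.0) -> int:
    """Count the requests at the start of a run that exceed threshold_ms.

    Counting stops at the first request at or below the threshold, so a run
    that starts slow and then settles into fast responses yields the number
    of "warm-up" requests.
    """
    count = 0
    for d in durations:
        if d <= threshold_ms:
            break
        count += 1
    return count


if __name__ == "__main__":
    # The ten timings quoted above, in milliseconds.
    observed = [1088.9, 1197.9, 455.2, 3.3, 3.1, 3.0, 2.7, 4.0, 2.7, 2.5]
    print(count_slow_leading(observed))  # prints 3
```

Running `time_requests` against an index after a few idle minutes, and again immediately afterwards, should make the difference between the first and second run visible.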

The problem was related to the default behavior of index.refresh_interval, as described in the Elasticsearch documentation:

If this setting is not explicitly set, shards that haven’t seen search traffic for at least index.search.idle.after seconds will not receive background refreshes until they receive a search request. Searches that hit an idle shard where a refresh is pending will wait for the next background refresh (within 1s).

The purpose of this default is to optimize bulk indexing. However, bringing shards back out of their idle state can drastically slow down searches, particularly on large indexes.
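To find which indexes are still relying on the default, one can inspect their settings. This is a minimal sketch, assuming the nested (non-flat) JSON body returned by `GET /<index>/_settings`, which omits index.refresh_interval unless it was set explicitly (or `include_defaults=true` is passed). The function name and sample index names are hypothetical. For reference, index.search.idle.after defaults to 30s.

```python
import json
from typing import Any, Dict, List


def indexes_without_explicit_refresh(settings_response: Dict[str, Any]) -> List[str]:
    """Return the names of indexes whose settings have no explicit refresh_interval.

    `settings_response` is assumed to be the parsed body of GET /<index>/_settings
    (without flat_settings or include_defaults), shaped like:
        {"<index>": {"settings": {"index": {...}}}}
    """
    missing = []
    for name, body in settings_response.items():
        index_settings = body.get("settings", {}).get("index", {})
        if "refresh_interval" not in index_settings:
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical response for two indexes; only "orders" sets the value explicitly.
    sample = json.loads("""
    {
      "orders":   {"settings": {"index": {"number_of_shards": "1", "refresh_interval": "1s"}}},
      "products": {"settings": {"index": {"number_of_shards": "1"}}}
    }
    """)
    print(indexes_without_explicit_refresh(sample))  # prints ['products']
```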

To opt out of this behavior, we set an explicit value for index.refresh_interval. Weekend queries now perform slightly better than weekday ones, and weekday performance also improved for some less frequently used indexes.
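The change itself is a single index settings update (`PUT /<index>/_settings`). The post does not say which interval was chosen, so the sketch below assumes "1s" (the same cadence as the default background refresh) and an unauthenticated cluster at http://localhost:9200; both are hypothetical, as is the index name. It uses only the Python standard library and builds the request separately from sending it, so the payload can be inspected before anything touches the cluster.

```python
import json
import urllib.request


def build_refresh_interval_request(
    index: str,
    interval: str = "1s",  # assumption: the post does not state the value used
    base_url: str = "http://localhost:9200",  # assumption: local, unauthenticated cluster
) -> urllib.request.Request:
    """Build a PUT /<index>/_settings request that sets index.refresh_interval explicitly."""
    body = {"index": {"refresh_interval": interval}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_refresh_interval(index: str, interval: str = "1s") -> dict:
    """Send the settings update to the cluster and return the parsed JSON response."""
    req = build_refresh_interval_request(index, interval)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_refresh_interval_request("products")  # hypothetical index name
    print(req.get_method(), req.full_url)
    # prints: PUT http://localhost:9200/products/_settings
```

Calling `apply_refresh_interval("products")` would send the update to a live cluster. Per the documentation quoted above, the search-idle behavior applies only when the setting is not explicitly set, so any explicit value opts the index out.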


References