Sidekiq Batches and the Temple of OOM

March 11, 2021

Sidekiq batches (a Sidekiq Pro feature) are great for managing groups of related jobs. But there's one booby trap to watch out for.

Say you need to process a file. So you set up a Sidekiq job that opens a batch, reads the file, and enqueues each row for processing:

require 'csv'

class CreateBatch
  include Sidekiq::Worker

  def perform
    batch = Sidekiq::Batch.new
    batch.on(:success, SuccessCallback)
    batch.jobs do
      # my_file and my_row_args stand in for your own logic
      CSV.foreach(my_file) do |row|
        args = my_row_args(row)
        RowWorker.perform_async(args)
      end
    end
  end
end
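
SuccessCallback is a plain Ruby class implementing Sidekiq Pro's batch callback interface. A minimal sketch, assuming all you want on success is a log line:

class SuccessCallback
  # Sidekiq Pro calls this once every job in the batch has succeeded.
  # status is a Sidekiq::Batch::Status describing the finished batch.
  def on_success(status, options)
    puts "Batch #{status.bid} finished #{status.total} jobs"
  end
end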

The batch completes successfully.

But later you notice that some unrelated jobs have a problem. The sidekiq.log shows that these other jobs started but never finished. They aren't running and they never went to the dead queue. Uh oh.

After some more digging, you check the kernel log and realize that the box ran critically low on memory and the Linux OOM Killer terminated one of your Sidekiq processes:

sudo dmesg -T | egrep -i 'killed process'

What happened?

The issue: Enqueuing a large number of jobs within the batch.jobs block can bloat your Ruby process's memory, setting it up to be killed later when the Linux OOM Killer decides it needs to intervene.

batch.jobs holds onto all of the args in a Ruby array and waits until the end of the block to submit that info to Redis. The purpose of this is to make pushing jobs onto the batch an atomic action. If a network failure occurs, you don't end up with only some jobs in the batch. This also prevents a race condition where jobs process so quickly that the success callback is fired before everything has been enqueued.
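
Conceptually, the buffering works something like this (a simplified model for illustration, not the actual Sidekiq Pro source; ConceptualBatch and push_all_atomically are made-up names):

class ConceptualBatch
  def jobs
    @buffered = []    # every job enqueued inside the block lands here
    yield @buffered   # the block appends args; nothing touches Redis yet
    push_all_atomically(@buffered)  # single atomic submission at the end
  end

  def push_all_atomically(jobs)
    # Stand-in for the real submission: one round trip to Redis
    puts "pushing #{jobs.size} jobs in one shot"
  end
end

# A million rows means a million argument arrays held in memory at once
ConceptualBatch.new.jobs do |buffer|
  1_000_000.times { |i| buffer << [i] }
end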

That array of args can get really big.

One way to avoid eating up too much memory is to push a job onto the batch and then let that job finish enqueuing all the other jobs. This is possible because:

  • Batches can be reopened
  • batch can be called from within a Sidekiq worker to access the batch that worker is running in

The new code to create your batch might look something like:

class CreateBatch
  include Sidekiq::Worker

  def perform
    batch = Sidekiq::Batch.new
    batch.on(:success, SuccessCallback)
    batch.jobs { AddJobsToBatch.perform_async }
  end
end

class AddJobsToBatch
  include Sidekiq::Worker

  def perform
    raise 'Must be run as part of a Sidekiq batch' if batch.nil?

    CSV.foreach(my_file) do |row|
      # This example reopens the batch for each row; rows could
      # also be enqueued in small groups (see the sketch below)
      batch.jobs do
        args = my_row_args(row)
        RowWorker.perform_async(args)
      end
    end
  end
end
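
If reopening the batch once per row means too many Redis round trips, buffer the rows into small groups instead. A sketch of that variation of AddJobsToBatch#perform, reusing the same hypothetical my_file and my_row_args helpers and an arbitrary group size of 100:

def perform
  raise 'Must be run as part of a Sidekiq batch' if batch.nil?

  # Reopen the batch once per 100 rows instead of once per row
  CSV.foreach(my_file).each_slice(100) do |rows|
    batch.jobs do
      rows.each { |row| RowWorker.perform_async(my_row_args(row)) }
    end
  end
end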

This keeps the enqueuing process's memory footprint flat, even on very large datasets. (Note that you'll still need enough space in Redis to hold all the jobs that are enqueued.)

This strategy is similar to the recommendation for huge batches, but it's good practice to do this for all your batches so you're ready to handle growing data.

