Problems and Troubleshooting
Read below for tips. If you still need help, you can:
- Ask your question in The Sidekiq Google Group
- Open a GitHub issue. (Don't be afraid to open an issue, even if it's not a Sidekiq bug. An issue is just a conversation, not an accusation!)
You should not email any Sidekiq committer privately. Please respect our time and efforts by sticking to one of the two above. Remember also that Sidekiq is free, open source software: support is not guaranteed, it's best effort according to the availability of the Sidekiq committers. Sidekiq Pro and Enterprise customers get guaranteed support.
Sidekiq is multithreaded so your jobs must be thread-safe.
Most popular Rubygems are thread-safe in my experience, but some gems can be troublesome:
- therubyracer (see #270); try mini_racer instead
- scout_apm: `bundle up scout_apm` to get a fixed version
- RMagick (see #338); try mini_magick instead
- mail: use `Mail.eager_autoload!` per https://github.com/mikel/mail/issues/912
- oj and yajl-ruby both have compatibility issues with json that can break Sidekiq
- In Rails 6.0, ActionText can lock up Sidekiq. It is fixed in Rails 6.1 but has not been backported as of Rails 6.0.3.5.
Well-factored code is typically thread-safe without any changes. Use instance variables and methods, never use class variables. Require all necessary classes on startup so you aren't requiring code while executing jobs.
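For example, here is a minimal sketch of the difference; ReportJob, load_rows and handle are hypothetical names used only for illustration:

```ruby
require "sidekiq"

# Hypothetical job illustrating thread-safe state handling.
class ReportJob
  include Sidekiq::Job

  # BAD: a class variable is shared by every worker thread in the process
  # and will be mutated concurrently.
  # @@rows = []

  def perform(report_id)
    # GOOD: locals and instance variables are private to this execution.
    rows = load_rows(report_id)
    rows.each { |row| handle(row) }
  end

  private

  # Hypothetical helpers standing in for your real work.
  def load_rows(report_id)
    []
  end

  def handle(row)
  end
end
```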
Shelling out to an external program is a common need for things like PDF or image processing but can be very tricky to get right. I suggest using a gem like childprocess or posix_spawn to help with this. Make sure you add timeouts to the commands -- no one likes debugging a hung job.
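As a rough sketch of the childprocess approach (the command and the 30-second limit are illustrative assumptions):

```ruby
require "childprocess"

# Run an external command with a hard timeout so a wedged program
# cannot hang the worker thread forever.
process = ChildProcess.build("convert", "input.png", "output.pdf")
process.start

begin
  process.poll_for_exit(30) # seconds
rescue ChildProcess::TimeoutError
  process.stop # sends TERM, then KILL if needed
  raise
end
```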
If Sidekiq or a job is hanging, it's often due to a lack of timeouts. Step 1: use the TTIN signal to get backtraces from Sidekiq; deciphering the output will give you a good idea of what's wrong. Step 2: use the Ultimate Guide to Ruby Timeouts to add timeouts where necessary.
Don't use Ruby's Timeout module: it can randomly leave you with mysterious stuck or hung processes.
Rails autoloading and eager loading are a frequent cause of problems, but remember that all Sidekiq does is boot Rails. If you have a problem, it's between your code and Rails, and Rails is about convention over configuration: if you lay out your classes following the Rails code loading conventions, your code should load correctly. Read and learn those conventions and follow them.
Some tips:
- `Foo::Bar` goes in app/models/foo/bar.rb and should be defined in expanded form like:
    module Foo
      class Bar
- Don't configure extra paths in autoload_paths or eager_load_paths. That's a hack; follow the conventions! Any directory underneath app/ may contain Ruby code; you don't need to explicitly configure anything.
- A lib/ directory will only cause pain. Move the code to app/lib/ and make sure the code inside follows the class/filename conventions.
- See common gotchas.
- See more details about the issues with autoloading and thread-safety in Rails.
- Enable Zeitwerk and turn on logging to debug loading issues (see the sketch after this list).
Please don't open a Sidekiq issue for a code loading problem unless you can point to a bug in Sidekiq.
If you poll your model periodically (say, from an AJAX request) to determine when your background job has completed, and the job completes in less than a second, you may run into an issue where your polling logic works in development but only sporadically in production.
This may be caused by Rails's use of Rails.cache. By default, Model.cache_key is only precise to the second, so updates that start and finish during the same second may cause your status polling to return a stale record. In databases that support sub-second time values (such as PostgreSQL), set config.active_record.cache_timestamp_format = :nsec in config/application.rb to increase the cache precision and avoid stale records.
Sidekiq is so fast that it is quite easy to get transactional race conditions where a job will try to access a database record that has not committed yet. One solution is to use an after_commit callback:
class User < ActiveRecord::Base
  after_commit :greet, on: :create

  def greet
    UserMailer.delay.send_welcome_email(self.id)
  end
end
You can also enable transactional push so jobs are only enqueued upon commit.
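A minimal sketch, assuming a recent Sidekiq (6.5+) with the after_commit_everywhere gem available:

```ruby
# config/initializers/sidekiq.rb
# Defer Sidekiq pushes made inside a transaction until that transaction commits.
Sidekiq.transactional_push!
```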
It's a bit of a hack, but you can also schedule the job to run in a few seconds so the transaction has time to commit:
MyWorker.perform_in(5.seconds, 1, 2, 3)
Either way, Sidekiq's retry mechanism's got your back. The first time might fail with RecordNotFound but the retry will succeed.
You've hit the max number of Redis connections allowed by your plan.
Limit the number of redis connections per process in config/sidekiq.yml. For example, if you're on Redis To Go's free Nano plan and want to use the Sidekiq web client, you'll have to set the concurrency down to 3.
:concurrency: 3
See #117 for a discussion on the topic, and see this calculator, which can help you determine the right sizing for your setup.
Sidekiq::Web is built over Rack::Builder. It uses Rack::URLMap to map endpoints. URLMap performs a check between SERVER_NAME and HTTP_HOST headers.
Make sure your webserver sends the same value in SERVER_NAME and HTTP_HOST. Nginx, for example, may be using a catch-all config instead of setting SERVER_NAME to $http_host. Heroku already sends the correct headers.
Rails 6.1 and earlier did not put network timeouts on SMTP connections. A honeypot or malicious SMTP server can lead to lingering network connections and idle jobs. Add this to an initializer:
# gem "mail", ">= 2.7.0"
Mail::SMTP::DEFAULTS[:read_timeout] = 5
Mail::SMTP::DEFAULTS[:open_timeout] = 5
Always put timeouts on every network connection.
If you enable autoloading in production, Sidekiq will lock up. Code reloading is for development only. In config/environments/production.rb:
config.cache_classes = true
config.eager_load = true
Another common problem is that you might have defined a namespace in Sidekiq.configure_server but not in Sidekiq.configure_client, or named it something else. Make sure you configure both!
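A minimal sketch, assuming an older redis-namespace based setup (Sidekiq versions before 7) where the namespace is passed in the redis hash; the URL and namespace values are illustrative:

```ruby
# config/initializers/sidekiq.rb
# The key point: server and client must use identical Redis settings.
redis_config = {
  url: ENV.fetch("REDIS_URL", "redis://localhost:6379/0"),
  namespace: "myapp"
}

Sidekiq.configure_server do |config|
  config.redis = redis_config
end

Sidekiq.configure_client do |config|
  config.redis = redis_config
end
```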
Another issue that some have experienced is caused by rspec-sidekiq. Make sure that rspec-sidekiq is in the test group ONLY:
group :test do
  gem 'rspec-sidekiq'
end
A related Stack Overflow question can be found at http://stackoverflow.com/a/17065723/1965817.
If your ActiveRecord connection pool size is smaller than Sidekiq's concurrency, you can easily get connection checkout timeouts. Make sure the pool attribute in config/database.yml is equal to Sidekiq's concurrency. Remember that database.yml may contain ERB, so you can do this:
pool: <%= ENV['RAILS_MAX_THREADS'] || Sidekiq.options[:concurrency] %>
Here's a great YouTube video (and blog post) from @adamlogic on how to resolve these errors.
If you see strange postgres connection errors, try using ActiveRecord's reaper to clean up connections. Add this to your database.yml:
reaping_frequency: 10
Linux's OOM killer might kill Sidekiq if your machine is running low on memory and can't swap. Use dmesg | egrep -i 'killed process' to search for OOM activity:
[102335.319388] Killed process 6567 (ruby) total-vm:1333004kB, anon-rss:355088kB, file-rss:688kB
The solution is to get more memory or optimize your workers. See Memory Bloat below for tips.
Only two things can cause a Ruby VM to crash: a VM bug or a native gem bug. Sidekiq is pure Ruby and cannot crash the Ruby VM on its own. A couple of notes:
- Ruby can have a bug: make sure you are running the latest Ruby version.
- Native gem bugs can cause crashes: make sure you are running the latest version of all native gems so you have the latest fixes.
- Every time the Sidekiq process crashes, any messages being processed are lost. You can avoid this with Sidekiq Pro's Reliability feature.
You can get a list of all native gems in your app with this command:
bundle exec ruby -e 'puts Gem.loaded_specs.values.select{ |i| !i.extensions.empty? }.map{ |i| i.name }'
Often this is due to an old, left-over Sidekiq process that is still running. Make sure old processes are killed. You can also hit this issue on a multi-app server if you don't set a distinct Redis namespace for each Sidekiq instance.
Each Sidekiq process running on MRI will only use one core, regardless of the number of threads. To get the benefit of multiple cores, you should run several Sidekiq processes. Sidekiq Enterprise will do this automatically with the multi-process feature.
If you have memory bloat and your Sidekiq process grows from X MB to BIG MB over time, there are many possible causes. Here are a few I've seen.
It's very easy to write an inefficient query in ActiveRecord which loads 1000s of items unnecessarily. Example:
# See if product search returns no results
# Terrible, do not do this!
return "No results" if Product.search(...).blank?
If the product search returns 10,000 results, this query will create 10,000 objects and then immediately throw them away. This will expand the heap and cause VM bloat. For more information about how the Ruby heap works, check out these slides.
The right way:
# See if product search returns no results
# Much faster!
return "No results" if Product.search(...).count == 0
Unfortunately it's up to you to determine which worker and query is causing the bloat. Another example:
Wrong, might load millions of user objects in memory:
User.all.each { |u| u.something }
Right, will iterate through 1000 users at a time:
User.find_each { |u| u.something }
In short, it is really easy to use ActiveRecord inefficiently. Read through your queries and make sure you understand exactly what each will do.
Even when performing batched reads correctly, as above, the ActiveRecord query cache can cause memory bloat by storing query resultsets unnecessarily. Since Rails 5.0, the query cache is enabled by default for background jobs, including Sidekiq workers. If your job performs a large number of batched reads and is still using lots of memory, try disabling the query cache or clearing it manually:
ActiveRecord::Base.uncached do
  User.find_each { |u| u.something }
end

# Clear query cache
User.find_in_batches.each do |users|
  users.each { |u| u.something }
  ActiveRecord::Base.connection.clear_query_cache
end
Since Rails 6.0.2.1, the query cache is skipped for find_each, find_in_batches and in_batches queries (see https://github.com/rails/rails/pull/28867). However, queries executed inside the block passed to these methods will still use the query cache, so depending on your implementation you may still find it appropriate to follow the guidelines above.
On Linux, Ruby uses the default glibc implementation for allocating all memory. This implementation is very prone to memory fragmentation and can lead to huge bloat. The simplest thing to do is to add MALLOC_ARENA_MAX=2 to the environment for your Ruby processes. A more complete solution requires you to switch to jemalloc. See my blog post and Nate Berkopec's blog post on the subject.
APM services like Datadog and New Relic can collect detailed trace data for requests and jobs and upload that trace data after completion. However, if you have a long-running job, tracing can collect minutes or even hours of data before uploading the entire massive trace, bloating your process. With Datadog, you can enable partial flushing in an initializer:
Datadog.configure do |c|
  c.tracing.partial_flush.enabled = true
end
If none of the above seems to be your issue, you can investigate further by using the worker below or by integrating part of it into one of your own workers.
require 'objspace'

class HeapDumpJob
  include Sidekiq::Job

  def perform(filename)
    File.open(filename, 'w') do |f|
      ObjectSpace.dump_all(output: f, full: true)
    end
    puts "done!"
  end
end
Once the output has been written to a file, you can use reap, a tool for parsing Ruby heap dumps, to analyze the heap.
If your Sidekiq process is not performing any work, send it the TTIN signal to dump backtraces to the log. That will show you where the threads are stuck. Most commonly a remote network call is hanging:
- DNS lookup: resolving a hostname might stall.
- Net::HTTP: unresponsive remote servers can cause a Net::HTTP call to hang and worker threads to pause for long periods. Set open_timeout to ensure your code raises an exception rather than hanging forever (see the sketch below).
Rule of thumb: use TTIN to find where the threads are blocked and ensure those calls have proper timeouts set.
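A minimal sketch of bounding a Net::HTTP call with both open and read timeouts; the URL and timeout values are illustrative:

```ruby
require "net/http"
require "uri"

# Bound how long a remote call may block a Sidekiq worker thread.
uri = URI("https://example.com/api/status")
Net::HTTP.start(uri.host, uri.port,
                use_ssl: true,
                open_timeout: 5,   # seconds to establish the connection
                read_timeout: 10) do |http|
  http.get(uri.request_uri)
end
```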
If the Sidekiq process is not responding to signals at all (nothing appears in the logs when you send TTIN), you can use GDB to dump backtraces for all threads:
sudo gdb `rbenv which ruby` [PID]
<snip>
(gdb) info threads
Id Target Id Frame
37 Thread 0x7f8b289d8700 (LWP 7994) "ruby-timer-thr" 0x00007f8b27a20d13 in *__GI___poll (fds=<optimized out>, fds@entry=0x7f8b289d7ec0, nfds=<optimized out>,
nfds@entry=1, timeout=timeout@entry=100) at ../sysdeps/unix/sysv/linux/poll.c:87
36 Thread 0x7f8b23eb0700 (LWP 7995) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
35 Thread 0x7f8b23c2e700 (LWP 7996) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
34 Thread 0x7f8b239ac700 (LWP 7997) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
33 Thread 0x7f8b237aa700 (LWP 7998) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
32 Thread 0x7f8b28844700 (LWP 8002) "SignalSender" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
31 Thread 0x7f8b1e1bf700 (LWP 8003) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
30 Thread 0x7f8b1e0be700 (LWP 8006) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
29 Thread 0x7f8b1dd81700 (LWP 8009) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
28 Thread 0x7f8b1dc80700 (LWP 8010) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
27 Thread 0x7f8b1db7f700 (LWP 8011) "ruby" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
...
(gdb) set logging file gdb_output.txt
(gdb) set logging on
(gdb) set height 10000
(gdb) t a a bt
(gdb) quit
Now put the contents of gdb_output.txt into a gist and open a Sidekiq issue.
You can get the Ruby backtrace of the current (hung) thread by running this in GDB. Note: it will print to the process's stdout, which might be a logfile, and will print upside down from the normal Ruby backtrace.
(gdb) call (void)rb_backtrace()
This is an excellent blog post about using GDB with Ruby.
Sidekiq fixed a race condition in heartbeat which could rarely lead to lingering processes on the Busy tab. [#2982]
To clean up lingering processes, run the script below, modifying it as necessary to connect to your Redis. After 60 seconds, lingering processes should disappear from the Busy page.
require 'redis'
r = Redis.new(url: "redis://localhost:6379/0")
# uncomment if you need a namespace
#require 'redis-namespace'
#r = Redis::Namespace.new("foo", redis: r)
r.smembers("processes").each do |pro|
r.expire(pro, 60)
r.expire("#{pro}:workers", 60)
end
If you are using the commercial versions of Sidekiq, you might get Bundler::HTTPError Could not fetch specs from https://<hostname>. This is usually due to your commercial subscription expiring. The easy solution is to purchase a new subscription at the billing site.
If you believe your subscription is still in good standing, ensure you don't have firewall rules blocking access and enable Bundler debugging to get more data with DEBUG=1 bundle install. You can also verify that the gem servers are available via the Sidekiq status page.
I don't accept generic memory leak issues for Sidekiq. Memory leaks can be caused by any part of the Ruby VM or gem in your application. Unless you can show evidence that Sidekiq is actually the root problem, please don't open an issue. Things like ActiveRecord's query cache have been shown to cause bloat (see above).
Read Sam Saffron's blog post about memory leaks for how to instrument and track down any leaks in your Sidekiq processes.
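As one sketch of that kind of instrumentation, the memory_profiler gem can report allocated and retained memory for a block of code; the workload inside the block is a stand-in for your own suspect job code:

```ruby
require "memory_profiler"

# Minimal sketch (not Sidekiq-specific): wrap a suspect piece of job code and
# print which gems and call sites allocate and retain the most objects/bytes.
report = MemoryProfiler.report do
  # stand-in for the suspect code path in your job
  100_000.times { "x" * 100 }
end

report.pretty_print(to_file: "memory_report.txt")
```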
Previous: Sharding Next: Testimonials