
Timed out after 0s while waiting for primed token #601

Closed
joshhubers opened this issue May 5, 2021 · 41 comments · Fixed by #603

Comments

@joshhubers

joshhubers commented May 5, 2021

Describe the bug
When executing some workers we are getting a Timed out after 0s while waiting for primed token warning.

Expected behavior
Workers should be executed appropriately.

Current behavior
It appears that jobs are started, the warning is thrown, no worker code is executed, and then the job finishes without executing.

Worker class

class PublishWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing
  # This lock allows one instance of this job with a particular argument
  # to be in the queue and one to be executing.

  def perform(live_object)
    return if cancelled?

    FeedService::Publish.call(live_object)
  end

  def cancelled?
    REDIS.exists?("sidekiq-cancelled-#{jid}")
  end

  def self.cancel!(jid)
    REDIS.setex("sidekiq-cancelled-#{jid}", 86_400, 1)
  end
end

Additional context
This warning and the lack of execution only happen sometimes, not always. Inside FeedService::Publish.call(live_object) we create nested workers from an iteration. The number of nested workers ranges from 1~20. They also use until_and_while_executing locks. Usually the first few nested workers will execute and the rest will not perform. Sometimes even the "top level" worker above does not perform.
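
For illustration, the nested pattern looks roughly like this (a sketch only; NestedPublishWorker and the items collection are hypothetical stand-ins for our real code):

module FeedService
  class Publish
    # Called from PublishWorker#perform above.
    def self.call(live_object)
      # Each iteration enqueues another until_and_while_executing worker,
      # so 1~20 nested jobs can be pushed in quick succession.
      live_object.items.each do |item|
        NestedPublishWorker.perform_async(item.id)
      end
    end
  end
end

class NestedPublishWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing

  def perform(item_id)
    # publish a single item
  end
end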

Configuration:

::Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
    chain.add Sidekiq::Status::ClientMiddleware, expiration: 30.minutes
  end
end

::Sidekiq.configure_server do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
    chain.add Sidekiq::Status::ClientMiddleware, expiration: 30.minutes
  end

  config.server_middleware do |chain|
    chain.add Sidekiq::Status::ServerMiddleware, expiration: 30.minutes
    chain.add SidekiqUniqueJobs::Middleware::Server
    chain.add Sidekiq::Middleware::Server::Statsd
  end

  SidekiqUniqueJobs::Server.configure(config)
end

version 7.0.9

Any pointers or thoughts would be immensely appreciated! Thank you for your hard work on this spectacular gem!

@mhenrixon
Owner

Hi @joshhubers, this is actually a feature, not a bug. Please see this comment #571 (comment).

This happens for locks that require the job to be unique while it is executing. If the server process can't acquire the lock within the allowed time, it logs a warning. I am looking into providing a better mechanism for when this occurs.

In general, I am looking at letting the users subscribe to an event and deal with it any way they want. Like I wrote in the linked comment, perhaps the log level is wrong, which people find confusing.

Anyway, I will improve this next. I have just been swamped with paid work for a while.

@joshhubers
Author

joshhubers commented May 6, 2021

Thank you for the quick response.

I had read that comment earlier; I guess I'm just struggling to understand why it's happening and what I can do to resolve the issue.

Are there other logs that describe the root of the issue I'm missing? Are the client or server queues full for some reason? Am I submitting multiple jobs with non-unique arguments which cause multiple jobs with the same lock digest?

Just trying to figure out the root of the issue. Would setting the lock_timeout value to, say, even 1 second help?

Thank you for all your hard work.

@mhenrixon
Owner

It might help in certain scenarios. It depends on how many duplicates are coming in at the same time etc.

In your example above you'd just slow things down a little.

There is conflict resolution you can configure and such things.
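
For reference, a minimal sketch of tuning those options per worker (assuming the lock_timeout and on_conflict options described in the sidekiq-unique-jobs v7 README; the values are examples, not recommendations):

class PublishWorker
  include Sidekiq::Worker

  # Wait up to 1 second for the primed token instead of giving up immediately,
  # and pick explicit conflict strategies instead of relying on the defaults.
  sidekiq_options lock: :until_and_while_executing,
                  lock_timeout: 1, # seconds to wait for the lock
                  on_conflict: { client: :log, server: :reschedule }

  def perform(live_object)
    FeedService::Publish.call(live_object)
  end
end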

@rasto28

rasto28 commented May 7, 2021

Hello,

we have a similar issue with these warning logs, but changing lock_timeout in the settings does not help. What helped us was setting lock_ttl (we were using the default value).

When logging the config value in the warn_about_timeout method:

def warn_about_timeout
  log_warn(config.inspect)
  log_warn("Timed out after #{config.timeout}s while waiting for primed token (digest: #{key}, job_id: #{job_id})")
end

I got:

#<SidekiqUniqueJobs::LockConfig:0x00007fd5ea9c3818 @type=:until_executed, 
@worker=PeriodicWorkflowsJob, @limit=1, @timeout=30, @ttl=0, @pttl=0, 
@lock_info=true, @on_conflict=nil, @errors={}, @on_client_conflict=nil, @on_server_conflict=nil>

Then I noticed the value @ttl=0. Should this be nil? With the value set to 0, will the lock expire immediately?

Changing the config to lock_ttl = 2.weeks solved the problem for us.
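
For anyone hitting the same thing, a sketch of the two places lock_ttl can be set (assuming the v7 configuration options; 2.weeks needs ActiveSupport):

# Globally, as a default for all locks:
SidekiqUniqueJobs.configure do |config|
  config.lock_ttl = 2.weeks.to_i # locks now expire instead of lingering forever
end

# Or per worker:
class PeriodicWorkflowsJob
  include Sidekiq::Worker

  sidekiq_options lock: :until_executed, lock_ttl: 2.weeks.to_i
end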

@joshhubers
Author

joshhubers commented May 7, 2021

I've been looking at this issue all week and I can't seem to find the root of the issue. I've tried logging out the worker arguments that are hashed into the unique lock digest and all the arguments are unique, yet somewhere something is locking the worker out from performing.

From logging I can see that I start about 10 workers with unique arguments; it's flaky, but roughly the first two-thirds will execute, then I get the Timed out after ... warning and the rest of the workers don't execute.

I've tried looking in the web interface for anything that would stick out, but I haven't found anything. No exceptions from the workers, no retries occurring, no duplicated arguments or repeated performs. I'm stumped.

I have yet to verify it, but it seems like the issue started when we upgraded from version 6 to 7.

@mhenrixon
Owner

#571 (comment)

@mhenrixon
Owner

mhenrixon commented May 10, 2021

It should hopefully be solved in v7.0.10. Sorry about the trouble @joshhubers + @rasto28, and thank you for the detailed report.

@joshhubers
Author

Fantastic! That was fast! No, thank you, @mhenrixon!

@brycemelvin

Hi @mhenrixon,
We are still seeing these logs in our production environment after upgrading to v7.0.10 last night. It's the same case where our team was reporting intermittent processing, and it seems related to the fact that we have some nested unique Sidekiq jobs (e.g. an order-processing job that triggers another metrics-recalculation job, both of which are set to until_and_while_executing).

2021-05-20T17:15:25.496Z 28169 TID-osbm440ht Sidekiq::Extensions::DelayedClass JID-b2bc8c59f029ab12c403b4f8 uniquejobs-client DIG-uniquejobs:c9dc4ca3139aa56673ab4de8c692c933 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:c9dc4ca3139aa56673ab4de8c692c933, job_id: 7bb14d6d36716578e48fae43)
2021-05-20T17:15:44.253Z 28392 TID-oscpr4qng PrerenderFbWorker JID-3ffa0aeb224251f9e85fa45c uniquejobs-server DIG-uniquejobs:833987ddbb7f2227771b8a4df3751c18 uniquejobs-server DIG-uniquejobs:833987ddbb7f2227771b8a4df3751c18:RUN PrerenderFbWorker uniquejobs-client DIG-uniquejobs:ea737b56400ad169bf00d0de6762be26 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:ea737b56400ad169bf00d0de6762be26, job_id: c8a0f9a38dbb74cd64c15092)
2021-05-20T17:16:18.095Z 28060 TID-ov6zl7pww PrerenderFbWorker JID-9c37880686f490dd3717e88b uniquejobs-server DIG-uniquejobs:8e1dd960e6fab9604c6e49e7d42cbc01 uniquejobs-server DIG-uniquejobs:8e1dd960e6fab9604c6e49e7d42cbc01:RUN PrerenderFbWorker uniquejobs-client DIG-uniquejobs:4181fe1e2e332735de1773d7b48c86f1 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:4181fe1e2e332735de1773d7b48c86f1, job_id: df542d18cf0d64a4291eb1ca)
2021-05-20T17:18:11.679Z 28392 TID-oscpr4qng PrerenderFbWorker JID-f16fd8f1f9d47859960c67a2 uniquejobs-server DIG-uniquejobs:95da17edacfc5aa8d4ba7268606fbdbb uniquejobs-server DIG-uniquejobs:95da17edacfc5aa8d4ba7268606fbdbb:RUN PrerenderFbWorker uniquejobs-client DIG-uniquejobs:856fbf1387506e4a457943e26866f1b7 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:856fbf1387506e4a457943e26866f1b7, job_id: 7f1c1a8f373806242a39bad3)
2021-05-20T17:18:54.505Z 28392 TID-oscpr4kkw PrerenderFbWorker JID-240eafc8300d72c207dcac52 uniquejobs-server DIG-uniquejobs:75eb03da615cad607bd511d407880320 uniquejobs-server DIG-uniquejobs:75eb03da615cad607bd511d407880320:RUN PrerenderFbWorker uniquejobs-client DIG-uniquejobs:6cef6477a5e55f6ccc8424ff2c4c9f8b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:6cef6477a5e55f6ccc8424ff2c4c9f8b, job_id: 65ea787d950f4d0b3081781a)

Config

SidekiqUniqueJobs.configure do |config|
  config.enabled         = !Rails.env.test?
  config.debug_lua       = Rails.env.prod? ? false : true # true for debugging
  config.lock_info       = Rails.env.prod? ? false : true # true for debugging
  config.max_history     = 500  # keeps n number of changelog entries
  config.reaper          = :ruby # also :lua but that will lock while cleaning
  config.reaper_count    = 1000  # Reap maximum this many orphaned locks
  config.reaper_interval = 600   # Reap every 10 minutes
  config.reaper_timeout  = 150   # Give the reaper 2.5 minutes to finish
end

Worker

  log_payload = Rails.env.prod? ? false : true

  sidekiq_options lock: :until_and_while_executing,
                  log_duplicate: log_payload,
                  queue: Sidekiq::Queues::DEFAULT_QUEUE,
                  lock_args_method: ->(args) { [ args.first ] }

We are still running sidekiq 5.2.9. Any thoughts or suggestions?

@brycemelvin

After further investigation, I am seeing this same Timed out after 0s while waiting for primed token error when I try to push many jobs which should be unique, along with the duplicate-logging message as well. So would this timeout message occur during the rejection of a duplicate job? If that is the case, it seems like it can be safely ignored?

@mhenrixon
Owner

@brycemelvin you are correct, it is just a warning ⚠️, not really something to be worried about. It should only occur when too many jobs from the same queue and worker, with the same arguments, are being pushed or tried at the same time.
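
In other words, the benign case looks like this (a sketch; the rejected pushes return nil instead of a jid, as noted later in this thread):

# All ten pushes share the same digest (same worker, queue and arguments),
# so only one of them can hold the lock; the duplicates trigger the warning.
jids = 10.times.map { PublishWorker.perform_async(42) }

jids.compact # only the pushes that were actually enqueued keep a jid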

@jerryjohnjacob

I'm also seeing this warning. However, it doesn't seem to be restricted to the situation when an identical job is getting enqueued. I tried simulating this through a test which enqueues exactly 1 job and ran that test in isolation (flushing the redis cache before and after running the test). Got the warning every time.

Using sidekiq 6.2.1 and sidekiq-unique-jobs 7.0.11.

P.S. thank you for the amazing work!

@jerryjohnjacob

Just ran another test where I enqueued 100 jobs that each updated their own database record. In every case where the warning was received, the database record was not updated, meaning the job itself didn't run. This does not seem to be an ignorable warning.

@mhenrixon
Owner

@jerryjohnjacob thanks for the report. I will do some more debugging on the issue given your test example.

@joshhubers
Author

Sorry I never circled back on this issue. Upgrading to v7.0.10 worked for us, at least. We still receive the warnings, but execution occurred as expected. Not the most helpful for others, I know, but I thought I'd say something for reference.

@mhenrixon
Owner

@joshhubers That is super helpful feedback! Thank you

@krutovgerman

Just ran another test where I enqueued 100 jobs that each updated their own database record. In every case where the warning was received, the database record was not updated, meaning the job itself didn't run. This does not seem to be an ignorable warning.

Hey, did you resolve the problem?

@jerryjohnjacob

@krutovgerman Not exactly. I had to downgrade to the version I was using before, which is quite old, so not a "solution" I would recommend.

@jerryjohnjacob

@mhenrixon Did some additional testing. Looks like the issue was introduced in v7.0.8. I ran similar tests to what I'd previously mentioned and noticed that the issue of the jobs not getting executed started with this version. With v7.0.7, the warnings would appear in cases where there were duplicate jobs being enqueued, but at least one of those jobs would get properly executed. Hope this helps.

@mhenrixon
Owner

mhenrixon commented May 31, 2021

@jerryjohnjacob did you also try 7.0.11? Never mind, I saw you did.

@jerryjohnjacob

@mhenrixon: @joshhubers says v7.0.10 solves the problem for him, but it does not for me. I only get the behaviour he described with v7.0.7. So I'm wondering whether the issue is beyond the scope of just sidekiq-unique-jobs. e.g., there might be a dependency on the Sidekiq and/or Ruby versions. For reference, I'm using sidekiq 6.2.1 and ruby-2.7.1. If there is any other information that I can provide, please let me know. This is an amazing library and we appreciate you and all the work you do!

@monsha

monsha commented Jun 2, 2021

I'm getting this warning too. I understand this is the expected behaviour when there is already a job in the queue (the same job with the same args), but what I see is that the lock is not removed after the job finishes. The lock remains in the Locks tab when busy/enqueued/retries/etc. are all 0. It's the same with until_and_while_executing or until_executed.

Using sidekiq 6.2.0, sidekiq-unique-jobs 7.0.11, Ruby 3.0.1.

P.S. thank you for this amazing gem @mhenrixon

@mhenrixon
Owner

@monsha @jerryjohnjacob Thanks for the details. I'll do some experiments this week.

@heaven

heaven commented Jun 4, 2021

This looks like debug information to me:

pid=4 tid=oxaqvpq4w class=PageCrawlJob jid=73203f0c06e7f736cbe75331 bid=qXGzBcNaFQP7Ww uniquejobs=client until_executed=uniquejobs:d762c6d561dcdb6387c8a9d6acf41643 until_executed=uniquejobs:d3cd8c187f4b454a583464625dceddd7 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d3cd8c187f4b454a583464625dceddd7, job_id: 338287642716773462df1b28)

There is no way I can use this, so its value to me is zero. But our server logs are literally flooded with these. We need a way to silence this completely.

For the INFO log level it could show something like "a job with args ... already exists, failed to get lock within ... seconds".
For DEBUG it could include what it shows currently, plus the unique arguments.

But since this is expected behavior it shouldn't be a warning. I want all duplicate jobs to be silently discarded, and I don't need to know about that. It's enough that when a job fails to be queued I get nil instead of a jid.
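
E.g. (a sketch; url just stands in for whatever arguments the job takes):

jid = PageCrawlJob.perform_async(url)
duplicate_was_discarded = jid.nil? # nil means the push was rejected as a duplicate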

@mhenrixon
Owner

@heaven would it work if I set the log level to debug perhaps? There is also a PR to enable silencing of these particular log entries but I'd prefer to find a more suitable solution.

@heaven

heaven commented Jun 4, 2021

@mhenrixon there are 2 issues I'm facing:

  • flood in the logs
  • lack of useful information in the output

Regarding the first problem, I think it would make sense to silence the warning when the timeout is 0, or change the level from WARN to DEBUG, and keep using warnings when the timeout is > 0. Or add a config option, such as debug: boolean. I feel that if you silence it completely there will be people complaining that their jobs are discarded silently :) So I'd simply add some control to this, like you did here:

  config.debug_lua       = !Rails.env.prod? # true for debugging
  config.lock_info       = !Rails.env.prod? # true for debugging

For the second problem, I've found that more useful information could be added, e.g. I was missing item["lock_args"]. Currently, we have all sorts of job IDs and unique job keys, which aren't very helpful in production.
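
As a stop-gap for the first problem, one possible workaround (a sketch only, assuming the gem lets you assign config.logger as described in its README) is to hand it a logger that drops these specific warnings:

# A thin wrapper that swallows the "primed token" warnings and forwards
# everything else to the regular Sidekiq logger.
class QuietUniqueJobsLogger
  def initialize(inner)
    @inner = inner
  end

  def warn(message = nil, &block)
    text = message || (block && block.call)
    return if text.to_s.include?("while waiting for primed token")

    @inner.warn(text)
  end

  # Forward every other logger call (info, debug, error, level=, and so on).
  def method_missing(name, *args, &block)
    @inner.public_send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @inner.respond_to?(name, include_private) || super
  end
end

SidekiqUniqueJobs.configure do |config|
  config.logger = QuietUniqueJobsLogger.new(Sidekiq.logger)
end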

@mhenrixon
Owner

@heaven I am working on a kind of event system for the gem that, instead of logging, will give you the event that generated the entry and the item/job itself. That way you can do whatever you want with it, and I will remove the logging.

Would something like that work better perhaps?

@heaven

heaven commented Jun 4, 2021

@mhenrixon I'm sure that would be awesome.

Well, you can still keep the logging, but in that case make it use the debug level and allow turning it on/off, like with the config settings above. It might be useful in some scenarios, helping to identify that there's a problem, while the event system will help to deal with it.

But in the meantime, if you can allow us to silence this warning that'd be a huge relief because our logs are bloated.

@mhenrixon
Owner

I will reduce the log level to debug immediately @heaven. Sorry about the noise

@mhenrixon
Owner

@heaven #610

@mhenrixon
Owner

Released as v7.0.12

@heaven

heaven commented Jun 4, 2021

@mhenrixon amazing, thank you, man!

@mhenrixon
Owner

My pleasure @heaven! Always happy to help

@jerryjohnjacob

Hi @mhenrixon! Were you able to find a solution for the issue of jobs getting dropped? I retried the same tests with the latest version of the gem and with more records (30000+ instead of 100). The warnings no longer appear, but the issue still exists. Some jobs were dropped and the database records were not updated.

@mhenrixon
Owner

@jerryjohnjacob it is on my radar. I've unfortunately had a bad stroke of luck in the last couple of months.

Been sick 3 weeks with the flu, lost my dog and was out another week for tweaking my back.

I'm back now and can't wait to dig into some of these problems.

@jerryjohnjacob

jerryjohnjacob commented Sep 18, 2021 via email

@mhenrixon
Owner

Version 7.1.7 partly fixes similar issues (#640) where jobs are silently dropped when using the replace strategy.

In the original question no strategy is in use, so it is likely also something else.

@jerryjohnjacob

@mhenrixon Hope you're doing better now!

Just tried v7.1.7. Still seeing the issue. Here are the stats for a job count of 34493:
Number of successful jobs: 32339
Number of dropped jobs: 2154

I still wonder if it has something to do with the configuration I'm using. Here are the latest details:
Ruby v2.7.4
Rails v6.1.4
Sidekiq v6.2.2
Sidekiq pro v5.2.4

Sidekiq config:

Sidekiq::Client.reliable_push!

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end
end

Sidekiq.configure_server do |config|
  config.reliable_scheduler!
  config.super_fetch!

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  SidekiqUniqueJobs::Server.configure(config)
end

Worker:

class UpdateRecordWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, unique_across_queues: true, queue: 'high_priority'

  def perform(record_id)
    Record.find(record_id).update_column(:my_db_column, true)
  end
end

Final script:

Record.update_all(my_db_column: false) # reset
Record.ids.each { |record_id| UpdateRecordWorker.perform_async(record_id) }
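
For completeness, a sketch of how the success/drop counts above are derived once the queue has drained (it assumes every job has either run or been discarded by then):

# After the 'high_priority' queue is empty and nothing is busy or retrying:
ran     = Record.where(my_db_column: true).count
dropped = Record.where(my_db_column: false).count
puts "ran: #{ran}, dropped: #{dropped}"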

This configuration works with v7.0.7 of the gem, the only issue being that the warning is printed constantly in the logs, e.g.:

2021-09-21T10:36:20.203Z pid=2323 tid=air uniquejobs=client until_and_while_executing=uniquejobs:48cbec793851a14c97c9cfeba9925afb WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:48cbec793851a14c97c9cfeba9925afb, job_id: 3f8280b500c90a4cbb9ddfdb)

@jerryjohnjacob

@mhenrixon Were you able to look into this afterwards? I ran into the issue reported here and have to upgrade, but I'm still facing the issue of workers randomly failing to run. On the latest test, I found that 1978 out of 34493 jobs were not run. This only happens when I add the uniqueness options, so I think we can rule out the root cause coming directly from somewhere else.

@mhenrixon
Owner

@jerryjohnjacob not yet, but I am slowly getting to it. Starting a company and doing full-time freelancing isn't easy, but I finally have a project that uses the gem again, so I'll get to it at some point.

@jerryjohnjacob

Wow, mega stuff @mhenrixon! 🎉
