[5.1] [5.2] Multiple queue workers using Redis #8577

Closed
m4tthumphrey opened this issue Apr 27, 2015 · 53 comments

@m4tthumphrey

Possible cause/result of #8576.

It seems that using several workers (php artisan queue:work --daemon ...) via supervisor causes jobs to be picked up by multiple workers at the same time, causing all kinds of havoc.

In my case I have 10 workers that each download a file from an FTP server, save the file locally, insert a row in a database and then delete the original file from the FTP site.

I've experienced several times now that multiple rows appear in the database, and sometimes I get an error telling me that the file on the FTP site doesn't exist even though the row appears in the database, meaning it has already been processed. I am also pummelled with the Redis errors found in #8576 (Predis\Transaction\AbortedMultiExecException).

I should say that this doesn't happen all the time, only for some jobs...

FYI, I've temporarily reduced the number of workers to 1 and I haven't had a user error since, although I do still get the Predis errors.

I've tried to reproduce this locally but cannot; it seems to work as expected. I don't get the Predis error either. The only difference is that I'm using supervisor in production and 10 terminals on my Mac...

@raja

raja commented May 29, 2015

+1

@raja

raja commented May 29, 2015

@m4tthumphrey Any luck resolving this issue? I've seen similar behavior running a large worker farm (50+ workers).

@GrahamCampbell
Member

Looks like a limitation of that driver. Perhaps use beanstalkd - it's more suited to this use case.

@m4tthumphrey
Author

@raja Yeah, I definitely would not use Redis to power mission-critical queues, at least not within Laravel. I switched to Beanstalkd and have had a 100% success rate. Also, the author of Redis is working on a dedicated message queue system, Disque.

@trompx

trompx commented Aug 12, 2015

With multiple hosts, I would not advise using beanstalkd: if one machine goes down, you lose all the jobs on that machine.

Maybe a lock algorithm could be implemented in the RedisQueue.php driver's pop function (perhaps using blpop like here: http://www.justincarmony.com/blog/2012/01/10/php-workers-with-redis-solo/), or with something like https://github.com/ronnylt/redlock-php.

If I come up with something I'll let you know.
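
In the meantime, here's a rough sketch of the kind of per-job lock I mean, assuming Predis (the key name, TTL and acquire/release flow are made up for illustration, not framework code):

    <?php

    use Predis\Client;

    $redis = new Client();

    // Hypothetical guard around job handling: only the worker that wins the lock
    // processes the job; the lock expires by itself after 120 seconds so a crashed
    // worker cannot hold it forever.
    $jobId    = 'example-job-id'; // in practice, taken from the popped payload
    $lockKey  = 'queues:default:lock:' . $jobId;
    $acquired = $redis->set($lockKey, gethostname(), 'EX', 120, 'NX');

    if ($acquired) {
        // ... process the job ...
        $redis->del([$lockKey]);
    } else {
        // another worker already reserved this job; skip it
    }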

@reshadman

I have experienced the problem with Redis, too. Still don't know why the issue has been closed.

If it's related to Redis, then it should be mentioned in the docs that multiple workers cannot be run when the queue driver is set to redis.

@henriquebim

I had this problem; it happens for any job that runs for more than 60 seconds.

I have a thread on Laracasts that explains why this happens, how to avoid it, and a lot of information about how Laravel handles queues, since I had to dig into the source code a lot to understand how it works.

Check it for more information: https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers

@m4tthumphrey
Author

@henriquebim Amazing work, man! I knew it was going to be to do with Laravel's implementation and not Redis itself. I think I did actually set the expire time to 0 or null and still encountered the issue.

I'm quite astounded that this has been closed, though, as it's definitely a bug within Laravel and at the very least should be mentioned in the documentation.

I'm still running beanstalkd, however, with no problems whatsoever.

@henriquebim

@m4tthumphrey if you set it to 0 you will make it worse; the code checks with is_null, and is_null(0) is false. So with it set to 0, virtually every job can be duplicated (any job that takes more than 0 seconds), though this depends on the number of jobs and workers running.

Setting it to null, or to a really high number (so that no job runs for longer than that many seconds), fixes the issue, BUT any job that fails will either get stuck (it won't go to failed_jobs nor run again) or will take that many seconds to be retried/put on failed_jobs.
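
For reference, that value lives under the redis connection in config/queue.php; a minimal sketch of the workaround above (the same caveat about stuck failed jobs applies):

        'redis' => [
            'driver'     => 'redis',
            'connection' => 'default',
            'queue'      => 'default',
            'expire'     => null, // or a number comfortably larger than your longest-running job
        ],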

For those that can use beanstalkd, I suggest doing so.

@phoenixgao

Will this be fixed or .. will this issue be reopened?

@phoenixgao

I agree with @GrahamCampbell, this is not a bug.

But it is defective.

I was also wondering why Laravel was not using rpoplpush when I was porting the Laravel Redis queue to Symfony, since in the Redis documentation it is the recommended pattern for a reliable queue: http://redis.io/commands/rpoplpush#pattern-reliable-queue

This is not a bug, but I vote to reopen. +1

@GrahamCampbell GrahamCampbell reopened this Jan 6, 2016
@taylorotwell
Member

Does this occur if you use another queue driver or is it particular to Redis?

@taylorotwell
Member

I would also be curious to know the values you are using for sleep and timeout on your queue worker configuration.

@GrahamCampbell GrahamCampbell changed the title Multiple queue workers using Redis [5.1] [5.2] Multiple queue workers using Redis Jan 6, 2016
@m4tthumphrey
Author

@taylorotwell Thanks for reopening this.

It is particular to the Redis driver. I have been using beanstalkd ever since with no issue. The command that is being run is

php artisan queue:work --daemon --sleep=10 --tries=10 --queue=process_calls

The timeout option is only applicable when using the queue:listen command.

@taylorotwell
Member

Hmm, does anyone have an idea of what we can do to fix the issue with the Redis driver?

@m4tthumphrey
Author

@taylorotwell Looks like @phoenixgao might be on to something with rpoplpush

@djtarazona
Contributor

Yup, rpoplpush looks like a viable solution

@henriquebim

To summarize so people don't have to read the whole Laracasts thread posted above (the most complete post there is https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers/replies/60762, and also https://laracasts.com/discuss/channels/laravel/running-multiple-queue-workers/replies/92055):
The problem is that the RedisQueue driver stores every started job in a <queue_name>:reserved queue, and before executing any job it migrates all jobs older than 60 seconds (by default) from the :reserved queue back to the main queue. That means that if your job runs for more than 60 seconds and you have more than one worker, the job will eventually be put back on the main queue and picked up to run again.

My fix was to set the expire value to NULL (overriding the 60-second default), which means the jobs are never migrated. The problem with this is that failed jobs will never rerun; they will be stuck in the :reserved queue forever. So it's not a fix the framework could ship in its source code (maybe a hint in the documentation).

As for a permanent fix, I don't have much of a clue; I think it would need a way to check whether the job crashed (to move it to the "retry" queue) or whether the job is still running (to avoid rerunning it), so it wouldn't have to rely on time. I don't think rpoplpush would help in this situation, because the jobs would still use the main, :reserved and :delayed queues in the same way, I believe.

I will check how the other drivers handle those things to see if I come with a solution.

@GrahamCampbell
Member

That means that if your job runs for more than 60 seconds and you have more than one worker, the job will eventually be put back on the main queue and picked up to run again.

Just set the timeout to be much larger then?

@GrahamCampbell
Member

I've just checked one of my apps, and I have the timeout set to 86400 rather than 60.

@henriquebim

@GrahamCampbell The point of that value is to retry the job (and maybe something else I didn't understand), so that means if any of your jobs fail, they would wait 24h before a retry. Whether it makes sense to retry a job after a day depends on the use case (if it was to generate a report, for example, the user would just hit the "generate" button again and again until they give up, and later receive a bunch of those reports).

In my case it was fine that jobs would never be retried; it was something that should either be sent at the time or never again. So I set it to null and kept an eye out for errors in the log; if there was an error I would stop the queues, fix it and start them again.

For other use cases, it becomes a guessing game to set a time that ensures no job ever runs longer than it, while still retrying failed jobs fast enough for the retry to make sense.

@GrahamCampbell
Member

The point of that value is to retry the job (and maybe something else I didn't understand), so that means if any of your jobs fail, they would wait 24h before a retry

Only if they fail with a fatal error. If they throw an exception, they get put back on the queue again, I think... Ping @taylorotwell.

@phoenixgao

@GrahamCampbell Thanks for reopening this. Keeping it open does not mean you confirm it is a bug in Laravel; it just lets people keep an eye on it and do more investigation into why some people get this "error", and close it only once someone gives a solution, explanation or conclusion, so new people who hit the same problem can learn what they did wrong. It helps the community.

@henriquebim I think you misunderstand the point of timeout. It's not the time to wait before retrying a failed job (that would be --delay); timeout is the value you think a job should never run longer than. A job that has been processing for longer than the timeout without finishing is considered a dead zombie and is put back onto the queue to re-run (not "retry" in the code; retry means retrying jobs that threw an exception). If you have a mis-set timeout value (too small for your jobs), then those jobs being picked up again and again is exactly what's expected.

If your jobs should normally run in less than 5 seconds, but sometimes for some reason last more than 2 minutes, you should add a timeout limitation inside your job, e.g. a timeout on your curl call.

In this case, even if you had only one worker, those timed-out jobs would be picked up again and again, since they are pushed onto the tail of the queue. I think you need to investigate more to see whether the duplicated jobs are timed-out jobs; if they are, then it is actually working (for you).

I was suspecting that multiple workers compete to pop the same job at the exact same time (which is what @m4tthumphrey described in the first report), but I also doubt it, because in theory it shouldn't happen: lpop gets the job and removes it from the list atomically.

The reason I mentioned rpoplpush is that the Laravel worker (client) pops a job and then zadds it to :reserved; it is possible for a client to die between the pop and the zadd, so the job is lost completely (not reliable enough for some people).
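
For anyone unfamiliar with it, a rough Predis sketch of that reliable-queue pattern from the Redis docs (not what Laravel currently does, and the key names are made up):

    <?php

    use Predis\Client;

    $redis = new Client();

    // Atomically move the job from the main queue to a processing list, so a crash
    // between "pop" and "reserve" can never lose the payload.
    $job = $redis->rpoplpush('queues:default', 'queues:default:processing');

    if ($job !== null) {
        // ... handle the job ...

        // Acknowledge only after the work is done; if the worker dies first, the
        // payload is still sitting in queues:default:processing and can be recovered.
        $redis->lrem('queues:default:processing', 1, $job);
    }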

So, back to this thread: did any of you actually see jobs being picked up by different clients at the same time?

@phoenixgao

Then let's think more about the timeout case. If client B finds that a job has expired and wants to put it back onto the working queue, it should make sure client A is not processing that job, by killing (restarting) A, or by forcing A to stop that job and move on to the next one, maybe?

This is not possible at the moment in the Laravel queue, because the queue manager has no knowledge of the workers (host, pid, etc.), but there is room for improvement.

I've done something in Symfony to keep a list of all the clients' info; I'll see if I can come up with something. @GrahamCampbell, or maybe you have better ideas :p

@henriquebim

@phoenixgao actually I'm talking about another value, called "expire" (you can change it in config/queue.php under the redis driver; I didn't find a way to pass it with the command, maybe because it's Redis-specific).

When Laravel pops a job from the main queue, it zadds it to the :reserved queue with a score of timestamp + expire. Before processing the queue, Laravel runs "migrateAllExpiredJobs", which migrates jobs that have a score lower than the current timestamp (https://github.com/laravel/framework/blob/5.2/src/Illuminate/Queue/RedisQueue.php#L133).

That's exactly what causes jobs to be duplicated: while the first worker is busy processing the job and the second worker is idle, the job sits in :reserved; if the job takes more than 60 seconds (assuming the default), the second worker will move it from :reserved back to the main queue and then also start processing it.
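
In Redis terms the sequence is roughly this (a simplified Predis sketch, not the framework's exact code):

    <?php

    use Predis\Client;

    $redis  = new Client();
    $expire = 60; // the default from config/queue.php

    // A worker takes a job off the main queue and parks it with an expiry score:
    $payload = $redis->lpop('queues:default');
    if ($payload !== null) {
        $redis->zadd('queues:default:reserved', [$payload => time() + $expire]);
    }

    // Before every pop, each worker also migrates "expired" reservations: any member
    // of queues:default:reserved whose score is below time() is pushed back onto
    // queues:default -- even if another worker is still busy running that job.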

@phoenixgao

@henriquebim I was also talking about this. expire in your config is the max time a job is supposed to run; if your job needs more than 1 minute to finish, increase that value. It does not mean you have to wait longer to retry a FAILED job. You must understand the difference between a failed job and a timed-out (expired) job.
A failed job is one that threw an exception; it will be retried immediately (after --delay) after failing.

The expire value is used here:
https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FRedisQueue.php#L139

and is read from the config here:

https://github.com/laravel/framework/blob/5.2/src%2FIlluminate%2FQueue%2FConnectors%2FRedisConnector.php#L50

For your case, I'm sure they designed it to work like this.

When worker B comes to the queue and finds a job that has been running for more than 60 seconds (or whatever expire time is configured), worker B doesn't know about the existence or status of other workers, so it thinks the job has timed out, and to make sure the job is finished at least once it migrates it back to the queue. This is the expected behavior: if you have long-running jobs you should increase expire.

@laravel-internals

Sorry, we don't process requests/proposals here. Please see https://github.com/laravel/internals/issues.

@scottconnerly

Opened new thread in the new/proper place. laravel/ideas#30

@taylorotwell taylorotwell reopened this Mar 30, 2016
@taylorotwell
Member

This should never have been closed. It's not a proposal. It's a bug.

@GrahamCampbell
Member

Sorry about that. I even labeled it with the bug tag too, ages ago.

@halaei
Contributor

halaei commented Apr 1, 2016

If you want to be sure that a job is not handled more than once, and you don't care whether it fails or not, you can set the --tries option to 1. Except that the Redis driver does not count an attempt when a fatal error occurs, because the RedisJob class increments the attempts counter in its release function. For the MySQL driver, however, this counter is incremented via the releaseJobsThatHaveBeenReservedTooLong() function of the DatabaseQueue class, which does get applied to jobs terminated by fatal errors.

If I post a pull request for this, should it be against 5.1?

@GrahamCampbell
Member

Yes, unless the fix is controversial.

@halaei
Contributor

halaei commented Apr 29, 2016

Will these changes be OK for everyone?

[The Redis queue driver in Laravel 5.3 will] increment the attempts counter when reserving a job, instead of when releasing it, so that fatal errors are counted as tries.

This means that if a worker is killed because the timeout is exceeded, the job won't be retried forever.

@stayallive
Contributor

Has some kind of resolution / cause been found already?

I just ran into this building an SSH executor that runs for a long time. I use the redis driver and started 16 runners in daemon mode through Forge, using a timeout of 900 and tries set to 1.

I also have this in my failed method on the Job classes:

    /**
     * Handle a job failure.
     *
     * @return void
     */
    public function failed()
    {
        $this->output .= PHP_EOL . '// Failed to execute job!';

        $this->updateStatus(Job::$STATUS_FINISHED_WITH_ERRORS, $this->output);

        $this->job->delete();
    }

So this should rule out that failed jobs are run multiple times.

So I made this bash script, executed from the Job, which should take around 120 seconds to run:

echo "Sleeping job $JOB for 60 seconds..."
sleep 60
echo "Hello there $JOB!"

echo "Hello from $JOB" >> /tmp/test.log

echo "Sleeping job $JOB for 60 seconds..."
sleep 60
echo "Hello there $JOB!"

echo "Done!"

Executing this once results in /tmp/test.log containing:

Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0
Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0
Hello from 47dd5355-1d52-4fdf-94d7-1c979e21cae0

So the job is executed 3 times... I am puzzled! Any clues?

@stayallive
Contributor

stayallive commented Jun 8, 2016

Looking into this... is it possible it has something to do with the expire/ttr setting for the drivers?

        'redis' => [
            'driver'     => 'redis',
            'connection' => 'default',
            'queue'      => 'default',
            'expire'     => 60,
        ],

The default is set to 60, so if it runs longer (because your timeout allows it) redis/beanstalk/db will release it back to the pool of open jobs. Correct?

After changing the expire value to match my timeout, I can no longer reproduce the error in production, it seems.
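
In config terms that just means keeping the two values in sync, e.g.:

        'redis' => [
            'driver'     => 'redis',
            'connection' => 'default',
            'queue'      => 'default',
            'expire'     => 900, // matched to the worker's --timeout=900 so jobs aren't re-queued while still running
        ],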

@taylorotwell
Member

Yes. The expire time should be adjusted based on how long you expect your jobs to reasonably run.

@GrahamCampbell
Member

Note that in Laravel 5.3 the daemon runner supports timeouts too, so you might want to sync those two times up. ;)

@stayallive
Contributor

Maybe I missed it, but it would be nice if this were documented in the docs. I didn't realize what expire does until this issue :)

@ChitaGideon

In my view, a job having an expire time means that if a job has been executing for an unexpectedly long time, it should throw a timeout exception and be force-stopped.

That would be easy to understand and handle.
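
For what it's worth, a bare-bones sketch of how a worker could enforce that in plain PHP, assuming the pcntl extension is available (not necessarily how Laravel does it internally, just the general idea):

    <?php

    declare(ticks=1); // let the signal handler run between statements on older PHP versions

    // Arm an alarm before starting the job; if the job overruns, the handler throws
    // and the job is force-stopped with a timeout exception.
    pcntl_signal(SIGALRM, function () {
        throw new RuntimeException('Job exceeded its expire time and was force stopped.');
    });

    pcntl_alarm(60);  // the job's expire/timeout budget in seconds
    // ... run the job here ...
    pcntl_alarm(0);   // disarm the alarm once the job finishes in time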

@carcinocron

carcinocron commented Oct 3, 2016

If you want to be sure that a job in not handled more than once, and you don't care if it fails or not, you can set --tries option to 1. Except that Redis driver does not count an attempt when fatal error occurs. It is because RedisJob class increments attempts counter during release function. For MySQL driver, however, this counter is incrementing via releaseJobsThatHaveBeenReservedTooLong() function of DatabaseQueue class, which will be applied for jobs terminated via fatal errors.

If I am about to post a pull request for this, should it be for 5.1?

That isn't the bug being described here; the bug is that 2 Redis queue workers will work on 1 job at the exact same time (both are first attempts).

Looking into this... is it possible it has something to do with the expire/ttr setting for the drivers?

I doubt this, because my 2nd queue worker is starting the job at the same time as the first one. That could still be a separate bug though.

Has this been fixed in any version of Laravel? I'm not on the latest 5.1, but I'm about to update to the latest 5.2. It's been hard to decipher because a lot of people have shown up to report slightly different bugs (or haven't even read the original bug description). If this has not been fixed, then I'll summarize that the only working solution right now is "use beanstalkd instead of redis for queues". Let me know if I'm wrong.

@JesseObrien
Contributor

@InstanceOfMichael I don't believe, from reading the thread here, that it's been fixed. I think the solution you have is correct atm: "use beanstalkd or RabbitMQ or something other than Redis when running multiple workers on the same queue".

@ghost

ghost commented Oct 16, 2016

I found out that Laravel 5.3 uses LuaScripts.php to run the Redis pop command.

According to the Redis docs, I assume this problem (multiple queue workers working on the same job) is resolved?
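
For context, the point of the Lua script is that the pop and the reserve happen in one atomic step, so a second worker can no longer grab the same payload in between. A simplified sketch of that idea with Predis (not the framework's actual script):

    <?php

    use Predis\Client;

    $redis = new Client();

    // LPOP from the main queue and ZADD to :reserved inside a single EVAL; Redis runs
    // the script atomically, so no other worker can see the job between the two calls.
    $script = <<<'LUA'
    local job = redis.call('lpop', KEYS[1])
    if job then
        redis.call('zadd', KEYS[2], ARGV[1], job)
    end
    return job
    LUA;

    $job = $redis->eval($script, 2, 'queues:default', 'queues:default:reserved', time() + 60);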

@carcinocron

According to the Redis docs, I assume this problem (multiple queue workers working on the same job) is resolved?

Maybe it is resolved in Laravel 5.3 then.

Also Redis guarantees that a script is executed in an atomic way: no other script or Redis command will be executed while a script is being executed

@mkarnicki

I take it there have been no more reports of the issue since Oct 17 (or some of the people moved to using beanstalkd?). I'm interested in using Redis for a queue and just want to make sure I'm not shooting myself in the foot. Thanks.

@vesper8

vesper8 commented May 31, 2017

I just started using a Redis worker and I swear there is absolutely nothing for this worker to do. It is running on a fairly new Forge instance on PHP 7.1. There have been no queued jobs and the worker is running in daemon mode, and yet it is using 100% CPU non-stop. This is on a $5 DO instance, but still!! It totally crashed my server. Gonna stop it now and try beanstalkd instead.

@mkarnicki

Thanks for the comment, @vesper8. I guess I would have shot myself in the foot. I'm using Forge, so I went down the easier beanstalkd path.

@ganganimaulik

ganganimaulik commented Sep 23, 2017

To fix this I just switched to beanstalkd.

I was using the database driver and got this problem even when NOT using supervisord... it should be added to the docs that the database driver has this problem when running multiple workers.

@Numline1

Numline1 commented Apr 4, 2018

Okay, so we were kinda facing the same thing here, I guess I'll share my experience:

1 Redis database, 2 queue workers on separate servers (each running only one process). Some of our jobs have a $timeout set directly on the job class, but other jobs, like notifications or mailable classes, did not. Our retry_after was set to the default of 90 seconds, but there was no timeout set on the CLI command run by supervisord. What I assume was happening is that some job had an issue and kept processing for more than 90 seconds. The retry_after setting just puts the job back onto the queue after that time, even if it is still running; it does not kill it (that's what timeout is for), which resulted in duplicates.

I'll monitor the situation and see what happens in the next few days, but I presume that's what was happening for all the people in this thread who had to migrate to Beanstalkd.
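
To make that concrete, a sketch (the job class and numbers are illustrative, not our exact code); the point is simply that every job's $timeout stays below retry_after so a stuck job is killed before another worker is allowed to pick it up:

    <?php

    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Queue\InteractsWithQueue;

    class ExampleLongRunningJob implements ShouldQueue
    {
        use InteractsWithQueue, Queueable;

        public $timeout = 60; // the worker kills this job after 60 seconds
        public $tries   = 1;

        public function handle()
        {
            // ... long-running work ...
        }
    }

    // config/queue.php: 'retry_after' => 90
    // retry_after (90) stays comfortably above $timeout (60), so the job is killed
    // before any other worker is allowed to reserve it again.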

@pihlersimon

pihlersimon commented Apr 25, 2018

@Numline1 I've migrated from Redis to Beanstalkd and still have the problem. The job does not fail (at least failed() is not triggered) and it is duplicated until it reaches the --tries limit and creates a new record in the failed_jobs table:

Illuminate\Queue\MaxAttemptsExceededException: App\Job\RegistrationProcess has been attempted too many times or run too long. The job may have previously timed out. 

@Numline1

@pihlersimon did you set up the retry_after and timeout as I've mentioned in my comment? We haven't had this issue since.
