
Connection pooling? #8

Open
skyshard opened this issue Sep 24, 2013 · 6 comments

@skyshard

Is there a good way to share redis connections across different instances of pyreBloom?

I'm currently using a rotating pool of filters to implement a sort of TTL: newly seen URLs get added to the current filter, but all filters are checked for membership. At some set interval (hour/day/etc.) the oldest filter gets cleared out and reused as the current one.

This works out pretty well, except that it uses up a lot of connections. Is there a good way to reuse connections between filters, or to specify the key name to check? Or should I take an entirely different approach to expiring old URLs?
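
Roughly, the rotation scheme looks like the sketch below. The key names, capacity, and rotation period are made up for illustration, and it assumes pyreBloom's README API of pyreBloom(key, capacity, error) plus add/contains/delete:

```python
import time
from pyreBloom import pyreBloom

NUM_FILTERS = 7                 # e.g. one filter per day of the week
CAPACITY = 10 * 1000 * 1000     # illustrative sizing
ERROR_RATE = 0.01

# One filter (and therefore one redis connection) per time slot.
filters = [pyreBloom('urls:%d' % i, CAPACITY, ERROR_RATE)
           for i in range(NUM_FILTERS)]

def current_index(period=86400):
    # Pick the slot for "now"; as time advances the index wraps around
    # and old slots get reused.
    return int(time.time() // period) % NUM_FILTERS

def seen(url):
    # A URL counts as seen if *any* of the filters contains it.
    return any(url in f for f in filters)

def mark_seen(url):
    filters[current_index()].add(url)

def expire_oldest():
    # Clear the slot that is about to become "current", so entries older
    # than NUM_FILTERS periods age out.
    filters[(current_index() + 1) % NUM_FILTERS].delete()
```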

@dlecocq
Contributor

dlecocq commented Sep 24, 2013

There's not currently a way to pool connections between filters :-/ That said, what you're doing to implement expiring bloom filters is exactly how we do it and is commonly how it's done elsewhere.

How many filters do you have at any one time?

@skyshard
Author

Ah, it's good to hear that you do it the same way. I only have 7 filters at once, but that's multiplied by each Celery worker making its own connections to the filters, which ends up being a few thousand connections in practice. I'll probably try sharing connections across the different processes.
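
Something like the following is what I have in mind for the sharing: build the filters once per worker process (rather than once per task) so each process only ever holds one connection per filter. This is only a sketch; the broker URL, sizes, and task body are made up, and it assumes Celery's worker_process_init signal and the pyreBloom README API:

```python
from celery import Celery
from celery.signals import worker_process_init
from pyreBloom import pyreBloom

app = Celery('urls', broker='redis://localhost:6379/0')  # illustrative broker URL

filters = []  # (re)built once per worker process, then reused by every task

@worker_process_init.connect
def init_filters(**kwargs):
    # Connections shouldn't be shared across fork(), so create them after
    # the worker process has started rather than at import time.
    global filters
    filters = [pyreBloom('urls:%d' % i, 10 * 1000 * 1000, 0.01)
               for i in range(7)]

@app.task
def record(url):
    # Check every filter, add only to the "current" one (index 0 here is
    # just a stand-in for the real rotation logic).
    if not any(url in f for f in filters):
        filters[0].add(url)
```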

@dlecocq
Contributor

dlecocq commented Sep 25, 2013

Is the number of connections actually problematic at the redis server level? Redis uses epoll/kqueue where available, so the raw number of connections shouldn't be an issue on that front. If you're hitting limits, there are both a redis-level cap (maxclients) and the ulimit on open file descriptors, and both can be bumped substantially.
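
If you want to check where you stand, something like this (using redis-py; the host/port and the 20000 figure are placeholders) shows the redis-side cap and the current connection count. The OS-level file descriptor limit still has to be raised separately with ulimit -n or limits.conf:

```python
import redis

r = redis.Redis(host='localhost', port=6379)
print(r.config_get('maxclients'))               # redis-side connection cap
print(r.info('clients')['connected_clients'])   # connections open right now

# On redis versions that support it, the cap can be raised without a restart;
# otherwise set maxclients in redis.conf and restart.
r.config_set('maxclients', 20000)
```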

Assuming it's not actual networking overhead causing the heartache, at the end of the day all your Celery workers are interacting with this single shared resource, so it seems likely that redis's performance may eventually become an issue.

For some context, there are a few projects for which we use pyreBloom (in fact, for URL deduping, too). One of them uses 2 modest machines with 4 redis-server instances each, and processes tens of millions of URLs per day using about 1% CPU average. The other uses 4 m2.xlarges with 4 redis-server instances each to sift through hundreds of millions of URLs per day using about 10% CPU average.

@skyshard
Author

Those are good numbers for reference, thanks! Are you partitioning across the multiple redis-server instances on the same EC2 instances for performance reasons? Did you find that to work better than running a single redis-server instance on each box?

I'm currently at around 45k reads per second without pipelining (on a hosted solution actually, on what appears to be m2.2xlarges) and was somewhat concerned about the number of open connections, but judging by your experiences it shouldn't be a big issue (except for hosted plans with connection limits). Thanks for all the advice!

@dlecocq
Contributor

dlecocq commented Sep 27, 2013

We treat the servers as just host:port pairs from the client side, but we run multiple redis-server processes on each box. The reason is simply that redis-server is single-threaded (apart from its background saves), so one process can only make use of one core.

It may also help to give some context about the bloom filter capacities. IIRC, we generally use a capacity of about 1e9 for each month partition, and I think that works out to 7 or so hashes per filter.
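
For concreteness, one way to do the client-side partitioning is to treat each redis-server process as an independent host:port shard and hash each partition's key onto one of them. Everything below (hosts, ports, key names, sizes) is illustrative, and it assumes pyreBloom accepts host/port keyword arguments as in the README:

```python
import zlib
from pyreBloom import pyreBloom

# Several redis-server processes, each on its own port, possibly on the
# same machine; addresses here are made up.
SHARDS = [('10.0.0.1', 6379), ('10.0.0.1', 6380),
          ('10.0.0.2', 6379), ('10.0.0.2', 6380)]

def filter_for(month_key, capacity=int(1e9), error=0.01):
    # A stable hash of the partition name picks the shard, so every client
    # agrees on where a given month's filter lives.
    host, port = SHARDS[zlib.crc32(month_key.encode()) % len(SHARDS)]
    return pyreBloom('urls:' + month_key, capacity, error,
                     host=host, port=port)

current = filter_for('2013-09')
current.add('http://example.com/some/page')
```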

@skyshard
Author

skyshard commented Oct 9, 2013

For what it's worth, apparently there is performance degradation with high connection counts. From http://redis.io/topics/benchmarks:

"As a rule of thumb, an instance with 30000 connections can only process half the throughput achievable with 100 connections."

[Benchmark graph from the redis docs: requests per second vs. number of open connections]
