PySpark performance bottlenecks: counting values #130
Swapping out
Note that countByValue() is a method on the PySpark RDD class that builds a collections.defaultdict out of the entries. It also requires PYTHONHASHSEED to be set to a fixed value so that hashing is consistent across the cores. If there is a generator function that could perform this better, that would be great. It's possible that using DataFrame counts instead would help; Spark suggests using reduceByKey. Also, for reasons I do not quite understand, the product of the map is a list of pyspark.sql.Row objects, which again suggests that we (royal we, so I) should learn to use DataFrames instead.
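For comparison, here is a minimal sketch of the reduceByKey pattern the Spark docs point to for this kind of counting. The input RDD and its contents are placeholders, since the real pipeline is not reproduced in this issue.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder RDD of values; the real data comes from the pipeline discussed above.
values = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# countByValue() returns a Python defaultdict on the driver,
# merging per-partition counts on a single machine.
driver_counts = values.countByValue()

# reduceByKey keeps the aggregation distributed across executors;
# collect (or write out) only the final, already-reduced result.
distributed_counts = values.map(lambda v: (v, 1)).reduceByKey(lambda a, b: a + b)
print(distributed_counts.collect())
```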
Today is one of those "working hard and not achieving anything" days. But as you'll see here, I've tried to implement
The SQL approach might be a good one to investigate?
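For reference, a SQL version of the count could look something like the sketch below. The DataFrame, its column name, and the temp-view name are all assumptions, since the actual script is not shown here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-column DataFrame standing in for the real data.
df = spark.createDataFrame([("a",), ("b",), ("a",), ("c",)], ["value"])

# Register a temp view and let Spark SQL do the grouping and counting,
# so the aggregation stays on the JVM/Catalyst side.
df.createOrReplaceTempView("tokens")
counts = spark.sql("SELECT value, COUNT(*) AS n FROM tokens GROUP BY value")
counts.show()
```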
Spark DataFrames have an explode() function which may be able to duplicate the flatMap call above. If we can stay in the DataFrame, I think things may improve.
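A rough sketch of that idea, assuming the input is a single text column (the column and data below are invented); note that in recent Spark versions explode lives in pyspark.sql.functions rather than being a DataFrame method.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame of raw lines; "text" is an assumed column name.
lines = spark.createDataFrame([("the cat sat",), ("the dog ran",)], ["text"])

# split + explode plays the role of the RDD flatMap, and groupBy/count
# replaces countByValue, so everything stays inside the DataFrame API.
word_counts = (
    lines
    .select(F.explode(F.split(F.col("text"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
)
word_counts.show()
```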
I think the ultimate solution is to have the Python code play nicely with our Scala UDFs, but there could be other options. It would be better to have someone with stronger programming skills work on this.
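A hedged sketch of what calling a Scala UDF from PySpark can look like: it assumes the Scala side has already registered a UDF under the name tokenize (e.g. via spark.udf.register, shipped in a JAR on the classpath). The UDF name and signature here are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "tokenize" is a hypothetical Scala UDF assumed to be registered on the JVM
# side already; F.expr calls it without a Python round trip per row.
df = spark.createDataFrame([("the cat sat",)], ["text"])
words = df.select(F.explode(F.expr("tokenize(text)")).alias("word"))
words.groupBy("word").count().show()
```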
@dhop is taking this issue on!
(mea culpa - pressed wrong button)
Yup!
One of the core scripts that we use does the following:
Here's the PySpark script:
After doing timing, the .countByValue() command accounts for roughly 98% of the time of the calculation, which makes it pretty much useless for any production at scale (i.e. even a few GBs and you're waiting hours). How can we speed this up?