
PySpark performance bottlenecks: counting values #130

Closed
ianmilligan1 opened this issue Nov 30, 2017 · 10 comments

@ianmilligan1
Member

One of the core scripts that we use does the following:

  • takes a WARC or collection of WARCs;
  • extracts the hyperlinks, noting where they're from, where they're pointing to, and when the snapshot was taken;
  • reduces to domains (i.e. liberal.ca/smith becomes liberal.ca) and cleans up cruft (so that www.liberal.ca and liberal.ca get folded into the same domain);
  • assembles it all so you get (SOURCE, TARGET, NUMBER OF LINKS) output for a given year, year-month, or year-month-day (e.g. how often the Liberal Party of Canada linked out to the Green Party of Canada);
  • and provides nice output.

Here's the PySpark script:

import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from pyspark.sql import SparkSession
import re

path = "src/test/resources/arc/example.arc.gz"
spark = SparkSession.builder.appName("siteLinkStructureByDate").getOrCreate()
sc = spark.sparkContext


df = RecordLoader.loadArchivesAsDF(path, sc, spark)
fdf = df.select(df['crawlDate'], df['url'], df['contentString'])
rdd = fdf.rdd
rddx = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
 .flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
 .filter(lambda r: r[-1] is not None)\
 .map(lambda r: (r[0], re.sub(r'^.*www\.', '', r[1]), re.sub(r'^.*www\.', '', r[2])))\
 .countByValue()

print([((x[0], x[1], x[2]), y) for x, y in rddx.items()]) 

Timing shows that the .countByValue() call accounts for roughly 98% of the total run time, which makes the script pretty much useless for production at scale (even a few GB of WARCs and you're waiting hours).

How can we speed this up?

@ianmilligan1
Member Author

Swapping in print(countItems(rddx).filter(lambda r: r[1] > 5).take(10)) in place of countByValue() produces roughly the same timings.

@greebie
Contributor

greebie commented Nov 30, 2017

Note that countByValue() is a method of the PySpark RDD class that builds a defaultdict out of the entries. It also requires PYTHONHASHSEED=x to be set so that string hashing is consistent across the worker cores.

If there is a generator function that could perform this better, that would be great. It's possible that using DataFrame counts instead would help; Spark suggests using reduceByKey.

Also, for reasons I do not quite understand, the product of the map is a list of pyspark.sql Row objects. That again suggests that we (royal we, so I) should learn to use DataFrames instead.
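
For what it's worth, here's a rough sketch of the DataFrame-count idea, reusing the extraction pipeline from the script above but cutting the chain before any action so the aggregation happens inside Spark (the column names are invented for illustration):

import re

import RecordLoader
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from pyspark.sql import SparkSession

path = "src/test/resources/arc/example.arc.gz"
spark = SparkSession.builder.appName("siteLinkStructureByDate").getOrCreate()
sc = spark.sparkContext

df = RecordLoader.loadArchivesAsDF(path, sc, spark)
rdd = df.select(df['crawlDate'], df['url'], df['contentString']).rdd

# Same extraction steps as the original script, but no action is triggered yet.
links = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
 .flatMap(lambda r: [(r[0], ExtractDomain(f[0]), ExtractDomain(f[1])) for f in r[1]])\
 .filter(lambda r: r[-1] is not None)\
 .map(lambda r: (r[0], re.sub(r'^.*www\.', '', r[1]), re.sub(r'^.*www\.', '', r[2])))

# Count identical (date, source, target) triples inside Spark instead of
# pulling a Python dict back to the driver with countByValue().
counts = links.toDF(["crawl_date", "src_domain", "dst_domain"])\
 .groupBy("crawl_date", "src_domain", "dst_domain")\
 .count()

counts.orderBy("count", ascending=False).show(10)

No idea yet whether that actually beats countByValue() on real collections, but at least the counting stays distributed.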

@ianmilligan1
Member Author

Today is one of those "working hard and not achieving anything" days.

But as you'll see below, I've tried to implement reduceByKey to no avail: timings simply aren't improving. The results are identical to the countByValue() version, both in output and in run time (still ~180 seconds).

df = RecordLoader.loadArchivesAsDF(path, sc, spark)
fdf = df.select(df['crawlDate'], df['url'], df['contentString'])
rdd = fdf.rdd
rddx = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
 .flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
 .filter(lambda r: r[-1] is not None)\
 .map(lambda r: (r[0], re.sub(r'^.*www\.', '', r[1]), re.sub(r'^.*www\.', '', r[2])))\
 .map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect()

The SQL approach might be a good one to investigate?
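
If we do go the SQL route, a minimal sketch would be to put the cleaned links into a DataFrame, register a temp view, and let Spark SQL do the grouping (the view and column names here are invented):

# `links` is the cleaned RDD of (crawlDate, source domain, target domain)
# triples from the sketch further up.
links_df = links.toDF(["crawl_date", "src_domain", "dst_domain"])
links_df.createOrReplaceTempView("site_links")

spark.sql("""
    SELECT crawl_date, src_domain, dst_domain, COUNT(*) AS n_links
    FROM site_links
    GROUP BY crawl_date, src_domain, dst_domain
    ORDER BY n_links DESC
""").show(10)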

@greebie
Contributor

greebie commented Dec 3, 2017

Spark DataFrames have an explode() function (pyspark.sql.functions.explode) which may be able to duplicate the flatMap call above. If we can stay in the DataFrame, I think things may improve.
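
A sketch of how that could look, assuming explode here means pyspark.sql.functions.explode and that ExtractLinks can be wrapped as a UDF returning an array of (src, dst) pairs (the wrapper, schema, and column names below are hypothetical):

from pyspark.sql.functions import explode, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Hypothetical wrapper: expose ExtractLinks as a UDF that returns an array of structs.
link_schema = ArrayType(StructType([
    StructField("src", StringType()),
    StructField("dst", StringType()),
]))
extract_links = udf(lambda url, content: ExtractLinks(url, content), link_schema)

# `df` is the DataFrame loaded by RecordLoader in the original script.
exploded = df.withColumn("link", explode(extract_links(df["url"], df["contentString"])))\
 .select("crawlDate", "link.src", "link.dst")

exploded.groupBy("crawlDate", "src", "dst").count().show(10)

The domain reduction and www-stripping would still need the same treatment, either as another UDF or with pyspark.sql.functions.regexp_replace.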

@greebie
Contributor

greebie commented Jan 11, 2018

I think the ultimate solution is to have the Python code play nicely with our Scala UDFs, but there could be other options. It would be better to have someone with stronger programming skills work on this.
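
Just to capture the idea: if the Scala extractors were exposed as Java UDF classes (org.apache.spark.sql.api.java.UDF1 and friends), PySpark could register and call them without the data round-tripping through Python. The class name below is made up, and registerJavaFunction lives on spark.udf in Spark 2.3+ (sqlContext.registerJavaFunction in earlier releases):

from pyspark.sql.types import StringType

# Hypothetical class name: whatever the Scala/Java wrapper ends up being called.
spark.udf.registerJavaFunction("extract_domain",
                               "io.archivesunleashed.udfs.ExtractDomain",
                               StringType())

# `df` is the DataFrame loaded by RecordLoader in the original script.
df.createOrReplaceTempView("records")
spark.sql("""
    SELECT crawlDate, extract_domain(url) AS domain, COUNT(*) AS pages
    FROM records
    GROUP BY crawlDate, extract_domain(url)
""").show(10)

That would keep both the extraction and the aggregation in the JVM, with Python only orchestrating.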

@ianmilligan1
Member Author

@dhop is taking this issue on!

@greebie
Contributor

greebie commented Feb 7, 2018

@dhop This issue has the scripts for testing performance and the record of past tests we conducted. Might make things a little easier for testing... #121

greebie closed this as completed Feb 7, 2018
greebie reopened this Feb 7, 2018
@greebie
Contributor

greebie commented Feb 7, 2018

(mea culpa - pressed wrong button)

@ruebot
Member

ruebot commented May 2, 2018

Is this issue still relevant with #214 being merged? And, should we also close this in favour of #215?

@lintool
Member

lintool commented May 2, 2018

Yup!

lintool closed this as completed May 2, 2018