Redesign of PySpark DataFrame interface for filtering #120

lintool · 2017-11-23T20:34:00Z

Currently, the PySpark DataFrame interface is something like:

	path = "/Users/Prince/Projects/pyaut/aut/example.arc.gz"
	
	spark = SparkSession.builder.appName("filterByDate").getOrCreate()
	sc = spark.sparkContext

	df = RecordLoader.loadArchivesAsDF(path, sc, spark)
	filtered_df = keepDate(df, "2008", DateComponent.YYYY).filter(df['url'].like("%archive%"))
	rdd = filtered_df.rdd
	rdd.map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString))) \
	   .saveAsTextFile("out/")

The snippet above is taken from https://github.com/MapleOx/aut/blob/bfb2678e8f88c994f94cc0919f352303f2f1d412/src/main/python/scripts/filterByDateScript.py

We should redesign this to use the standard NOUN.VERB pattern, something like:

	.filter(df['url'].like("%archive%"))
	.filter(df['date'].in(DateComponent.YYYY(2008)))
	.filter(df['date'].in(DateRange.YYYY(2007, 2008)))
	.filter(df['date'].in(DateRange.YYYYMM(2007, 1, 2008, 2)))
	.filter(df['date'].in(DateRange.YYYYMMDD(2007, 1, 2, 2008, 2, 2)))
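One caveat on the proposed spelling: `in` is a reserved word in Python, so `df['date'].in(...)` would not parse as written. A rough sketch of one way the date ranges could be expressed on top of the existing PySpark `Column.between` method follows. Everything named here (`DateBounds`, `DateRange`, `date_in`) is hypothetical and not part of aut or PySpark, and the sketch assumes the crawl date column holds zero-padded 8-digit `YYYYMMDD` strings, so that string comparison matches chronological order.

```python
from typing import NamedTuple


class DateBounds(NamedTuple):
    """Inclusive bounds as zero-padded YYYYMMDD strings (assumed column format)."""
    lower: str
    upper: str


class DateRange:
    """Hypothetical helper mirroring the names proposed in this issue."""

    @staticmethod
    def YYYY(start_year: int, end_year: int) -> DateBounds:
        # Whole years: first day of the start year to last day of the end year.
        return DateBounds(f"{start_year:04d}0101", f"{end_year:04d}1231")

    @staticmethod
    def YYYYMM(start_year: int, start_month: int,
               end_year: int, end_month: int) -> DateBounds:
        # Day "31" is only a string upper bound; it covers every real day
        # of the end month regardless of that month's actual length.
        return DateBounds(f"{start_year:04d}{start_month:02d}01",
                          f"{end_year:04d}{end_month:02d}31")

    @staticmethod
    def YYYYMMDD(start_year: int, start_month: int, start_day: int,
                 end_year: int, end_month: int, end_day: int) -> DateBounds:
        return DateBounds(f"{start_year:04d}{start_month:02d}{start_day:02d}",
                          f"{end_year:04d}{end_month:02d}{end_day:02d}")


def date_in(column, bounds: DateBounds):
    """Build an inclusive range predicate from any object with a
    between(lower, upper) method, such as a pyspark.sql.Column."""
    return column.between(bounds.lower, bounds.upper)
```

Under those assumptions, a call would look like `df.filter(date_in(df['crawlDate'], DateRange.YYYY(2007, 2008)))`, which chains with `.filter(df['url'].like("%archive%"))` in the usual way. If the column stored longer timestamps instead of 8-digit dates, the upper bounds would need padding to the same width.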


ianmilligan1 · 2017-11-23T20:37:38Z

FYI we are writing the scripts over here on a branch of the website (documentation): https://github.com/archivesunleashed/archivesunleashed.org/blob/pyspark/content/aut/pyspark.md.

ianmilligan1 · 2018-02-06T17:15:57Z

@dhop is looking into PySpark!

ruebot · 2018-05-02T13:23:16Z

Resolved with 505c47a

ianmilligan1 added PySpark RA-Task labels Jan 6, 2018

ianmilligan1 assigned dhop Feb 6, 2018

ruebot closed this as completed May 2, 2018
