Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtractDomain or ExtractBaseDomain? #367

Closed
lintool opened this issue Oct 25, 2019 · 6 comments
Closed

ExtractDomain or ExtractBaseDomain? #367

lintool opened this issue Oct 25, 2019 · 6 comments
Assignees

Comments

@lintool
Copy link
Member

lintool commented Oct 25, 2019

In Scala RDD, we have:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

In Scala DF, we have:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(ExtractBaseDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, False)

We should be consistent: ExtractDomain or ExtractBaseDomain?

@ruebot
Copy link
Member

ruebot commented Oct 25, 2019

Let's go with ExtractDomain.

@ruebot
Copy link
Member

ruebot commented Oct 25, 2019

Shall we just rename it to ExtractDomainDF (in the interim period between moving migrating all the RDDs to DFs)?

val ExtractBaseDomain = udf(io.archivesunleashed.matchbox.ExtractDomain.apply(_: String, ""))

@lintool
Copy link
Member Author

lintool commented Oct 25, 2019

Any reason why you want the DF suffix?
I suppose there's the potential for confusion, but I like the beauty of uniformity...

@ruebot
Copy link
Member

ruebot commented Oct 25, 2019

Oh, just keeping it consistent with how we've done the same with the other DF functions like here

def extractHyperlinksDF(): DataFrame = {

...I'm happy to keep it uniform too.

@lintool
Copy link
Member Author

lintool commented Oct 25, 2019

Although per #366 we would rename that to just hyperlinks? To make consistent with Python?

@lintool
Copy link
Member Author

lintool commented Oct 25, 2019

Okay, so stick to ExtractDomain?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants