Discussion: Should we align our Named Entity Recognition output with WANE format? #297

Closed
ruebot opened this issue Jan 10, 2019 · 10 comments
@ruebot
Member

ruebot commented Jan 10, 2019

Adapting our example NER script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/ruestn/english.all.3class.distsim.crf.ser.gz", "/tuna1/scratch/nruest/geocites/warcs/1/GEOCITIES-20090723023506-00000-crawling08.us.archive.org.warc.gz", "/tuna1/scratch/nruest/geocites/ner/", sc)

Produces the following example output:

(20090723,http://uk.geocities.com/pendock@btinternet.com/index.htm,{"PERSON":["Frampton","Hardwicke","Hardwicke","Martin","Hardwicke","Hardwicke","Hardwicke","Hutchings","Hopkins","Saunders","Butler","Jones","Frampton","Frampton","Hardwicke","Mark Chapple","Mark Medland","Glos"],"ORGANIZATION":["Hardwicke Cricket Club","Hardwicke Cricket Club","Stroud District Cricket Association","EJ Taylor & Sons Eric Vick Transport Club"],"LOCATION":["China","Gloucester","Ireland","Gloucester"]})

This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?

JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
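For illustration, a minimal sketch of what one line in that shape could look like when built from a record like the one above. `WaneRecord` and `WaneSketch` are hypothetical names, not part of the toolkit or of any WANE tooling, and the digest value used in any call is a placeholder since the example output above carries no digest. The JSON escaping only handles backslashes and double quotes, so treat it as a sketch rather than a serializer.

```scala
// Hypothetical record type; field names follow the WANE description quoted above.
final case class WaneRecord(
  url: String,
  timestamp: String,
  digest: String,
  persons: Seq[String],
  organizations: Seq[String],
  locations: Seq[String])

object WaneSketch {
  // Minimal JSON string escaping (backslash and double quote only).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Renders a sequence of strings as a JSON array.
  private def array(values: Seq[String]): String =
    values.map(quote).mkString("[", ",", "]")

  // Renders one "name": value pair; `json` must already be valid JSON.
  private def field(name: String, json: String): String =
    quote(name) + ":" + json

  // Renders one record as a single JSON line.
  def toJsonLine(r: WaneRecord): String = {
    val entities = Seq(
      field("persons", array(r.persons)),
      field("organizations", array(r.organizations)),
      field("locations", array(r.locations))).mkString("{", ",", "}")
    Seq(
      field("url", quote(r.url)),
      field("timestamp", quote(r.timestamp)),
      field("digest", quote(r.digest)),
      field("named_entities", entities)).mkString("{", ",", "}")
  }
}
```

Calling `WaneSketch.toJsonLine` on each record and writing the results one per line would give the "JSON object per line" layout described above.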

@ianmilligan1
Member

I think this is a good idea. To my mind, there aren't any standardized named-entity formats out there, so if a documented format already exists, we might as well adopt it and encourage some standardization across tools down the road.

@ianmilligan1
Member

Seeing no objections, let's do it. I like the idea of saying we produce "WANE" files, which we can then point to on the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉

ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 4, 2019
ruebot added a commit that referenced this issue Sep 18, 2019
@ruebot
Member Author

ruebot commented Sep 18, 2019

@ianmilligan1 @lintool

/** Extracts named entities from tuple-formatted derivatives scraped from a website.
  *
  * @param iNerClassifierFile path of classifier file
  * @param inputFile path of file containing tuples (date: String, url: String, content: String)
  *                  from which to extract entities
  * @param outputFile path of output directory
  * @return an rdd with classification entities.
  */
def extractFromScrapeText(iNerClassifierFile: String, inputFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = {
  val rdd = sc.textFile(inputFile)
    .map(line => {
      val ind1 = line.indexOf(",")
      val ind2 = line.indexOf(",", ind1 + 1)
      (line.substring(1, ind1),
        line.substring(ind1 + 1, ind2),
        line.substring(ind2 + 1, line.length - 1))
    })
  extractAndOutput(iNerClassifierFile, rdd, outputFile)
}

Can I remove that function? The function above it covers the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.

ruebot added a commit that referenced this issue Sep 18, 2019
@ruebot
Member Author

ruebot commented Sep 18, 2019

After that commit, I just need to sort out PERSON -> persons, etc., and it should finally be done.

@ianmilligan1
Member

> Can I remove that function? The function above it solves the use case of parsing ARCs/WARCs, not sure about parsing a relatively unknown, and under-documented file format. Seems a bit out-of-scope for the toolkit, especially since we removed the Twitter analysis/parsing.

I'd completely forgotten about this! My vote would be to remove this in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and put the onus on users to either use notebooks or their own solutions to work with the derivative files.

@ruebot
Member Author

ruebot commented Nov 5, 2019

Mostly resolved with 379cc68.

Still need to do: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢).

Helper note from @helgeho for when I (or somebody else) loop back around to this in the future:

> we're not exactly overriding it, because the corenlp output is not json, we simply take the PERSON class and put it under a key that we call persons
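A rough sketch of that idea, assuming the entities have already been grouped by classifier label into a `Map`: the labels are simply re-keyed before any JSON is written. `EntityKeys`, `classToKey`, and `rekey` are hypothetical names for illustration, not toolkit or CoreNLP API.

```scala
object EntityKeys {
  // 3-class labels from the example output, mapped to the WANE key names listed above.
  val classToKey: Map[String, String] = Map(
    "PERSON" -> "persons",
    "ORGANIZATION" -> "organizations",
    "LOCATION" -> "locations")

  // Re-keys entities grouped by class; unknown classes keep their original name.
  def rekey(byClass: Map[String, Seq[String]]): Map[String, Seq[String]] =
    byClass.map { case (cls, names) => classToKey.getOrElse(cls, cls) -> names }
}
```

Because the renaming happens on the grouped data rather than on serialized text, entity values can never be touched by accident.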

@ruebot
Member Author

ruebot commented Nov 8, 2019

@SinghGursimran this is the one last item to get to before it's done. If you're interested, or see an easy path, let me know.

ruebot added a commit that referenced this issue Nov 9, 2019
- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)
@SinghGursimran
Collaborator

@ruebot May I simply replace the keys PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations in the NER output string? That would conform to the WANE output.
(Overriding the class would be a cleaner solution, but I guess we won't need that class anywhere else.)
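One way the string-replacement approach could look, as a sketch only and not the change that was actually committed. It assumes the NER output is compact JSON with no whitespace between a key and its colon, as in the example output at the top of this issue; `KeyRename` and `toWaneKeys` are hypothetical names.

```scala
object KeyRename {
  // Match the quoted key together with its trailing colon, so that an entity
  // value such as "PERSON" inside an array (followed by , or ]) is left alone.
  private val renames: Seq[(String, String)] = Seq(
    "\"PERSON\":" -> "\"persons\":",
    "\"ORGANIZATION\":" -> "\"organizations\":",
    "\"LOCATION\":" -> "\"locations\":")

  // Applies each literal replacement in turn to the serialized NER output.
  def toWaneKeys(nerJson: String): String =
    renames.foldLeft(nerJson) { case (acc, (from, to)) => acc.replace(from, to) }
}
```

If the serializer ever emits pretty-printed JSON (for example `"PERSON" : [...]`), these literal patterns would no longer match, which is the main trade-off against re-keying the data before serialization.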

@ruebot
Member Author

ruebot commented Nov 14, 2019

@SinghGursimran sure, I'm good with that. Curious what you come up with.

@ianmilligan1
Member

Fantastic! Great work @SinghGursimran and @ruebot!
