Discussion: Should we align our Named Entity Recognition output with WANE format? #297
I think this is a good idea. To my mind, there aren't any standardized named-entity formats out there, so if there's a format, we might as well try to encourage some standardization around tools down the road.
Seeing no objections, let's do it. I like the idea of saying we produce "WANE" files, which we can then point to on the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉
aut/src/main/scala/io/archivesunleashed/app/ExtractEntities.scala Lines 45 to 63 in 9b3e025
Can I remove that function? The function above it solves the use case of parsing ARCs/WARCs; I'm not sure about parsing a relatively unknown and under-documented file format. It seems a bit out of scope for the toolkit, especially since we removed the Twitter analysis/parsing.
After that commit, I just need to sort out
I'd completely forgotten about this! My vote would be to remove this in the PR. It's an outlier in that it takes derivative files as input and processes them. I think it makes sense to stick to ARC and WARC files as the only inputs, and leave it to users to work with the derivative files using notebooks or their own solutions.
Mostly resolved with 379cc68. Still need to do: Helper note from @helgeho for when I (or somebody else) loop back around to this in the future:
@SinghGursimran this is one last item to get to before it's done. If you're interested, or see any easy path, let me know.
- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)
@ruebot May I simply replace the keywords PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations in the NER output String? That will conform to WANE output.
@SinghGursimran sure, I'm good with that. Curious what you come up with.
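For anyone post-processing the derivative files outside the toolkit, the key renaming discussed above could be sketched as follows. This is only an illustration in Python, not the change that went into the toolkit: it assumes the per-record NER output is a JSON object keyed by PERSON, ORGANIZATION and LOCATION, and it parses the JSON and renames keys rather than doing a plain string replacement, so that entity values that happen to contain those words are left alone. The names `KEY_MAP` and `rename_entity_keys` are hypothetical.

```python
import json

# Hypothetical mapping from Stanford NER labels to WANE-style keys.
KEY_MAP = {
    "PERSON": "persons",
    "ORGANIZATION": "organizations",
    "LOCATION": "locations",
}


def rename_entity_keys(ner_json):
    """Rename top-level NER label keys to their WANE-style equivalents.

    Assumes `ner_json` is a JSON object whose keys are NER labels and
    whose values are lists of entity strings. Unknown keys pass through
    unchanged.
    """
    entities = json.loads(ner_json)
    renamed = {KEY_MAP.get(key, key): value for key, value in entities.items()}
    return json.dumps(renamed, ensure_ascii=False)


# Made-up input: note that one entity value itself contains "PERSON".
raw = '{"PERSON": ["PERSON OF INTEREST"], "ORGANIZATION": [], "LOCATION": ["Toronto"]}'
out = rename_entity_keys(raw)
```

A plain `raw.replace("PERSON", "persons")` would also rewrite the entity value in this made-up input, which is why the sketch renames keys after parsing.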
Fantastic! Great work @SinghGursimran and @ruebot!
Adapting our example NER script:
Produces the following example output:
This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?

JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
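To make the record layout quoted above concrete, here is a minimal Python sketch that builds one such line. The field names come from the WANE description; the function name `wane_record` and all of the sample values (URL, timestamp format, digest string, entities) are made up for illustration and are not taken from the toolkit or from Archive-It.

```python
import json


def wane_record(url, timestamp, digest, persons, organizations, locations):
    """Serialize one WANE-style record as a single JSON line.

    Field names follow the description above; the value formats
    (timestamp layout, digest prefix) are assumptions.
    """
    record = {
        "url": url,
        "timestamp": timestamp,
        "digest": digest,
        "named_entities": {
            "persons": list(persons),
            "organizations": list(organizations),
            "locations": list(locations),
        },
    }
    # json.dumps emits no newlines by default, so each record stays on one line.
    return json.dumps(record, ensure_ascii=False)


# Entirely hypothetical sample values.
line = wane_record(
    "http://example.com/",
    "20200101000000",
    "sha1:EXAMPLEDIGEST",
    ["Ada Lovelace"],
    ["Example Org"],
    ["Toronto"],
)
```

Writing one such line per record gives a file that can be read back with `json.loads` on each line.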