Discussion: Should we align our Named Entity Recognition output with WANE format? #297

ruebot · 2019-01-10T14:07:15Z

Adapting our example NER script:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

ExtractEntities.extractFromRecords("/home/ruestn/english.all.3class.distsim.crf.ser.gz", "/tuna1/scratch/nruest/geocites/warcs/1/GEOCITIES-20090723023506-00000-crawling08.us.archive.org.warc.gz", "/tuna1/scratch/nruest/geocites/ner/", sc)

Produces the following example output:

(20090723,http://uk.geocities.com/pendock@btinternet.com/index.htm,{"PERSON":["Frampton","Hardwicke","Hardwicke","Martin","Hardwicke","Hardwicke","Hardwicke","Hutchings","Hopkins","Saunders","Butler","Jones","Frampton","Frampton","Hardwicke","Mark Chapple","Mark Medland","Glos"],"ORGANIZATION":["Hardwicke Cricket Club","Hardwicke Cricket Club","Stroud District Cricket Association","EJ Taylor & Sons Eric Vick Transport Club"],"LOCATION":["China","Gloucester","Ireland","Gloucester"]})

This is very similar to WANE output. Is it worth normalizing the output ExtractEntities produces to the documented WANE output?

JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
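A minimal sketch of what one such JSON line could look like when built by hand, based only on the field names quoted above. `WaneRecord`, `WaneSketch`, `jsonString`, `jsonArray` and `toWaneLine` are hypothetical names for illustration and are not part of the toolkit. Treating the timestamp as a string is an assumption, and the escaping covers only the most common cases; a real implementation would more likely rely on a JSON library.

```scala
// Hypothetical names throughout; none of this is the toolkit's API.
final case class WaneRecord(
  url: String,
  timestamp: String, // assumption: kept as a string, e.g. "20090723"
  digest: String,
  persons: Seq[String],
  organizations: Seq[String],
  locations: Seq[String]
)

object WaneSketch {
  // Quotes a string for JSON, escaping only quotes, backslashes,
  // newlines, carriage returns and tabs (a deliberately minimal subset).
  def jsonString(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c    => sb.append(c)
    }
    sb.append('"').toString
  }

  // Renders a sequence of strings as a JSON array.
  def jsonArray(xs: Seq[String]): String =
    xs.map(jsonString).mkString("[", ",", "]")

  // Renders one record as a single-line JSON object.
  def toWaneLine(r: WaneRecord): String =
    "{" +
      "\"url\":" + jsonString(r.url) + "," +
      "\"timestamp\":" + jsonString(r.timestamp) + "," +
      "\"digest\":" + jsonString(r.digest) + "," +
      "\"named_entities\":{" +
      "\"persons\":" + jsonArray(r.persons) + "," +
      "\"organizations\":" + jsonArray(r.organizations) + "," +
      "\"locations\":" + jsonArray(r.locations) +
      "}}"
}
```

Calling `WaneSketch.toWaneLine` once per record and writing each result on its own line would give the "JSON object per line" layout described above.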

ianmilligan1 · 2019-01-10T14:11:16Z

I think this is a good idea – to my mind, there aren't any standardized named-entity formats out there, so if there's an existing format, we might as well adopt it and try to encourage some standardization across tools down the road.

ianmilligan1 · 2019-01-11T16:40:49Z

Seeing no objections, let's do it – I like the idea of saying we produce "WANE" files, which we can then point to on the Archive-It page. Maybe we can start a trend towards a standardized NER format. 😉

ruebot · 2019-09-18T20:08:11Z

@ianmilligan1 @lintool

aut/src/main/scala/io/archivesunleashed/app/ExtractEntities.scala

Lines 45 to 63 in 9b3e025

  /** Extracts named entities from tuple-formatted derivatives scraped from a website.
    *
    * @param iNerClassifierFile path of classifier file
    * @param inputFile path of file containing tuples (date: String, url: String, content: String)
    *                  from which to extract entities
    * @param outputFile path of output directory
    * @return an rdd with classification entities.
    */
  def extractFromScrapeText(iNerClassifierFile: String, inputFile: String, outputFile: String, sc: SparkContext): RDD[(String, String, String)] = {
    val rdd = sc.textFile(inputFile)
      .map(line => {
        val ind1 = line.indexOf(",")
        val ind2 = line.indexOf(",", ind1 + 1)
        (line.substring(1, ind1),
          line.substring(ind1 + 1, ind2),
          line.substring(ind2 + 1, line.length - 1))
      })
    extractAndOutput(iNerClassifierFile, rdd, outputFile)
  }

Can I remove that function? The function above it solves the use case of parsing ARCs/WARCs, not sure about parsing a relatively unknown, and under-documented file format. Seems a bit out-of-scope for the toolkit, especially since we removed the Twitter analysis/parsing.
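For context, the parsing step in that function can be sketched without Spark. This is a standalone illustration, not toolkit code: `TupleLineSketch` and `parseTupleLine` are hypothetical names, and the `Option` return plus the guard for malformed lines are additions that the quoted function does not have. It also shows one reason the format is fragile: the split relies on the first two commas, so a URL containing a comma would be cut in the wrong place.

```scala
// Hypothetical helper mirroring the indexOf/substring logic quoted above.
object TupleLineSketch {
  // Splits "(date,url,content)" at the first two commas only, so commas
  // inside the content (for example in a JSON payload) are left untouched.
  // Returns None for lines that lack the parentheses or the two commas.
  def parseTupleLine(line: String): Option[(String, String, String)] = {
    val ind1 = line.indexOf(",")
    val ind2 = line.indexOf(",", ind1 + 1)
    if (ind1 < 0 || ind2 < 0 || !line.startsWith("(") || !line.endsWith(")")) {
      None
    } else {
      Some((
        line.substring(1, ind1),               // date, without the leading "("
        line.substring(ind1 + 1, ind2),        // url
        line.substring(ind2 + 1, line.length - 1) // content, without the trailing ")"
      ))
    }
  }
}
```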

ruebot · 2019-09-18T20:37:19Z

After that commit, I just need to sort out PERSON -> persons, etc., and it should finally be done.

ianmilligan1 · 2019-09-23T21:12:28Z

> Can I remove that function? The function above it solves the use case of parsing ARCs/WARCs, not sure about parsing a relatively unknown, and under-documented file format. Seems a bit out-of-scope for the toolkit, especially since we removed the Twitter analysis/parsing.

I'd completely forgotten about this! My vote would be to remove this in the PR. It's an outlier in that it takes derivative files as an input and processes them. I think it makes sense to stick to ARC and WARC files as inputs only, and put the emphasis on users to either use notebooks or their own solutions to work with the derivative files.

ruebot · 2019-11-05T18:17:17Z

Mostly resolved with 379cc68.

Still need to do: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢).

Helper note from @helgeho for when I (or somebody else) loop back around to this in the future:

we're not exactly overriding it, because the corenlp output is not json, we simply take the PERSON class and put it under a key that we call persons
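One way to read that note, as a rough sketch: if the classified entities are already grouped by NER class in a map, the grouping can be re-keyed before it is serialized. `WaneKeysSketch`, `waneKeys` and `toWaneKeys` are hypothetical names, and the `Map[String, Seq[String]]` shape is an assumption about the intermediate data, not something the thread confirms.

```scala
// Hypothetical sketch: re-key entities grouped by NER class under WANE key names.
object WaneKeysSketch {
  // NER class label -> WANE key, as listed in the thread.
  val waneKeys: Map[String, String] = Map(
    "PERSON" -> "persons",
    "ORGANIZATION" -> "organizations",
    "LOCATION" -> "locations"
  )

  // Classes missing from the input become empty lists, so all three
  // WANE keys are always present in the result.
  def toWaneKeys(entities: Map[String, Seq[String]]): Map[String, Seq[String]] =
    waneKeys.map { case (nerClass, waneKey) =>
      waneKey -> entities.getOrElse(nerClass, Seq.empty)
    }
}
```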

ruebot · 2019-11-08T22:40:27Z

@SinghGursimran this is the one last item to get to before it's done. If you're interested, or see an easy path, let me know.

- Update Stanford core NLP
- Format NER output in json
- Add getPayloadDigest to ArchiveRecord
- Add test for getPayloadDigest
- Add payload digest to NER output
- Remove extractFromScrapeText
- Remove extractFromScrapeText test
- TODO: PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations (involves writing a new class or overriding NER output 🤢)

SinghGursimran · 2019-11-14T17:36:55Z

@ruebot May I simply replace the keywords PERSON -> persons, LOCATION -> locations, ORGANIZATION -> organizations in the NER output String? That would conform to WANE output.
(Overriding the class would be a cleaner solution, but I guess we won't need that class anywhere else.)
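A minimal sketch of the string-replacement idea, assuming the NER output is a compact JSON string with keys written exactly as `"PERSON":`, `"ORGANIZATION":` and `"LOCATION":` (as in the example output at the top of this issue). `KeyReplaceSketch` and `renameKeys` are hypothetical names; this is not necessarily how the eventual change was implemented.

```scala
// Hypothetical sketch of renaming NER class keys by plain string replacement.
object KeyReplaceSketch {
  // Matching the quoted key together with its trailing colon avoids
  // rewriting an entity value that merely contains the word, e.g. "LOCATION".
  // Assumes compact JSON with no whitespace between the key and the colon.
  def renameKeys(nerJson: String): String =
    nerJson
      .replace("\"PERSON\":", "\"persons\":")
      .replace("\"ORGANIZATION\":", "\"organizations\":")
      .replace("\"LOCATION\":", "\"locations\":")
}
```

The trade-off is the one noted above: this is short, but it depends on the exact textual layout of the output, whereas restructuring the data before serialization would not.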

ruebot · 2019-11-14T17:43:04Z

@SinghGursimran sure, I'm good with that. Curious what you come up with.

ianmilligan1 · 2019-11-14T19:17:16Z

Fantastic! Great work @SinghGursimran and @ruebot!

ruebot added the discussion label Jan 10, 2019

ruebot assigned greebie, lintool, ianmilligan1 and SamFritz Jan 10, 2019

ruebot assigned ruebot and unassigned greebie, lintool, ianmilligan1 and SamFritz Jul 17, 2019

ruebot added a commit that referenced this issue Sep 4, 2019

More on #297

8d0eb68

ruebot added a commit that referenced this issue Sep 4, 2019

more on #297

2689670

ruebot added a commit that referenced this issue Sep 18, 2019

#297 payload digest

9896424

ruebot added a commit that referenced this issue Sep 18, 2019

#297 - write a json object per line.

9d074a7

ruebot mentioned this issue Sep 18, 2019

Align NER output to WANE format #361

Merged

SinghGursimran mentioned this issue Nov 14, 2019

Converting output of NER Classifier to WANE Format #378

Merged

ruebot closed this as completed in f9ce826 Nov 14, 2019
