Make and use story-level country geocoding? #94

Open
ahalterman opened this issue Jun 27, 2016 · 2 comments

Comments

@ahalterman
Member

Think about adding a pre-pipeline coding step that geocodes complete articles (rather than individual sentences) to a country. This would be useful for two things:

  1. Associating actors that don't have country codes with the correct country. E.g., "Egyptian police fired on protesters" -> EGYCOP, ~CVL. If the document is geocoded to a single country, we could say with reasonable confidence that ~CVL = EGYCVL. This substitution would happen in the pipeline rather than in Petrarch.
  2. Mordecai's country coding works much better at the document level than at the sentence level, and its place function can take a country limit as an argument. Doing a first-pass, whole-article, country-level coding could help the sentence-level geocoding.
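To make the first point concrete, here is a minimal sketch of the substitution. It assumes that a leading `~` marks an actor code with no country, as in the `~CVL` example above; the function names are hypothetical and are not part of the pipeline or Petrarch.

```python
from typing import Optional, Tuple


def fill_actor_country(actor_code: str, doc_country: Optional[str]) -> str:
    """Replace a leading '~' (assumed marker for "no country") with the
    document-level country code. Codes that already carry a country, or
    documents with no country, are returned unchanged."""
    if doc_country and actor_code.startswith("~"):
        return doc_country + actor_code[1:]
    return actor_code


def fill_event_actors(source: str, target: str,
                      doc_country: Optional[str]) -> Tuple[str, str]:
    """Apply the substitution to both actors of one coded event."""
    return (fill_actor_country(source, doc_country),
            fill_actor_country(target, doc_country))


# "Egyptian police fired on protesters", document geocoded to EGY
print(fill_event_actors("EGYCOP", "~CVL", "EGY"))  # ('EGYCOP', 'EGYCVL')
```

Leaving the code untouched when the document has no single country keeps the step conservative: it only fills in a country when the document-level geocoder has committed to one.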

Because the pipeline operates at the sentence level, the article-level geocoding itself would have to happen outside the pipeline. The only changes to the pipeline would be the ones needed to use the new information.
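As a rough sketch of what "using the new info" could look like: the pipeline receives a precomputed story-to-country lookup and passes the country along as a limit when it geocodes each sentence. Everything here is an assumption for illustration. The record fields (`story_id`, `text`), the lookup dict, and the `geocode_sentence` callable are hypothetical, and the toy geocoder is only a stand-in and does not reflect Mordecai's actual API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: (sentence text, country limit or None) -> place or None
SentenceGeocoder = Callable[[str, Optional[str]], Optional[str]]


def geocode_sentences(sentences: List[dict],
                      story_countries: Dict[str, str],
                      geocode_sentence: SentenceGeocoder) -> List[dict]:
    """Geocode each sentence, restricted to its story's country when known."""
    results = []
    for sent in sentences:
        country = story_countries.get(sent["story_id"])  # None if not geocoded
        place = geocode_sentence(sent["text"], country)
        results.append({**sent, "country": country, "place": place})
    return results


def toy_geocoder(text: str, country: Optional[str]) -> Optional[str]:
    """Stand-in geocoder with a two-entry gazetteer (illustration only)."""
    gazetteer = {
        ("Alexandria", "EGY"): "Alexandria, Egypt",
        ("Alexandria", "USA"): "Alexandria, Virginia",
    }
    for (name, ctry), place in gazetteer.items():
        if name in text and (country is None or ctry == country):
            return place
    return None


sentences = [{"story_id": "s1", "text": "Police clashed with protesters in Alexandria."}]
coded = geocode_sentences(sentences, {"s1": "EGY"}, toy_geocoder)
```

The country limit is what disambiguates the place name here: the same sentence resolves differently depending on the country attached to its story.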

@philip-schrodt

The downside is the overhead when you have a large corpus that contains mostly junk (e.g., I recently worked with one generated by the data-provider-who-shall-not-be-named with 2.5M stories, only about 2% of which generated events). We could do this just as easily by doing the substitutions in a post-processing phase, right?

@ahalterman
Member Author

I had forgotten that the geocoding step in postprocess.py hits the db again. Since that's also where it does the actor splitting, it would be easy to put a step in there where it geocodes the full article and updates the db, but only if an event has already been coded from that article.
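A minimal sketch of that idea, assuming in-memory dicts in place of the real database and a hypothetical `geocode_story` callable in place of the document-level geocoder. It is not the actual code in postprocess.py. The point is only that the expensive article-level call runs once per story that produced an event and never for the rest of the corpus.

```python
from typing import Callable, Dict, List, Optional


def geocode_coded_stories(events: List[dict],
                          stories: Dict[str, str],
                          geocode_story: Callable[[str], Optional[str]]
                          ) -> Dict[str, Optional[str]]:
    """Geocode full articles, but only those with at least one coded event.

    `events` and `stories` stand in for database tables (assumption);
    each event is updated in place with a 'story_country' field.
    """
    coded_ids = {ev["story_id"] for ev in events}  # stories that yielded events
    countries: Dict[str, Optional[str]] = {}
    for story_id in coded_ids:
        text = stories.get(story_id)
        if text is None:
            continue  # story text missing; leave its events without a country
        countries[story_id] = geocode_story(text)  # one call per coded story
    for ev in events:
        ev["story_country"] = countries.get(ev["story_id"])
    return countries
```

With a corpus like the one described above, where only about 2% of stories generate events, this would skip the article-level geocoding for the remaining stories entirely.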
