Skip to content

Latest commit

 

History

History
113 lines (88 loc) · 5.84 KB

README.md

File metadata and controls

113 lines (88 loc) · 5.84 KB

About this Project

Forked from Subhasis Dutta's original github pages on this so we have it under the OEDA / openeventdata pages for future reference. The http://bablenet.org files referenced in here are proprietary and can be accessed under agreements via pbrandt@utdallas.edu for OEDA participants.

This project is built to code events about political conflict and cooperation among different political actors in multiple languages (primarily English, Arabic and Spanish).

It is part of "Modernizing Political Event Data for Big Data Social Science Research".

Implemented under the guidance of:

  1. Prof. Patrick Brandt
  2. Prof. Latifur Khan
  3. Prof. Vincent Ng
  4. Prof. Jennifer Holmes

Built at University of Texas at Dallas, funded by U.S National Science Foundation under NSF-SBE-SMA-1539302.

Details here: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1539302&HistoricalAwards=false

Political Name Parser for JRC and CAMEO dataset

This repo contains script to parse the entity file of JRC names, CAMEO actoe data and puts organizes it in a structured format in database.

CAMEO Data Set

Conflict and Mediation Event Observations dataset obtained from https://github.com/openeventdata/Dictionaries

Details about CAMEO: http://phoenixdata.org/description http://eventdata.parusanalytics.com/papers.dir/Gerner.APSA.02.pdf

JRC Dataset

JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across languages (e.g., Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.) for details see:

https://ec.europa.eu/jrc/en/language-technologies/jrc-names.

BableNet

BabelNet is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network which connects concepts and named entities in a very large network of semantic relations, made up of about 14 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms which express that meaning in a range of different languages. http://babelnet.org/about

Project Modules

ir-scripts

Contains all the scripts to :

  1. Extract information from CAMEO dataset and load it into MongoDB.
  2. Extract information from JRC Entity dataset and load it into MongoDB.
  3. Find all the relation between the CAMEO and JRC Actors and store it in a relation table.

Details to Run the Scripts are present in ir-scripts/ReadME

mr-script

This package will contain the Map reduce implementation of the join operation between CAMEO and JRC data.
Details present in mr-script/ReadME

spark-script

This package will contain the Spark join implementation for the join operation between CAMEO and JRC data.
Details present in spark-script/ReadME

jrc-classification

This package will contain the Machine Learning based classifier to identify the unidentified language for the different political Actors in JRC entity data set.
Details present in jrc-classification/ReadME

rest-server

Contains a web-server to allow different client systems to access the data from different client systems.
Demo Server : http://35.160.238.107:2121/

Available API:

  1. Check Status :
    <Server IP:port>/ - Should give a response
    Response:
    {"Status": "Server Running", "multilangAPI": "Welcome"}
    Ex - http://35.160.238.107:2121/
  2. Full Raw Output :
    <Server IP:port>/search?query=<Person Name> - Returns all combined search resuls from CAMEO, JRC, BableNET and dbPedia.
    Ex: http://35.160.238.107:2121/search?query=barrak+obama
  3. Filtered Output:
    a. <Server IP:port>/filter?query=<Person Name> - Returns all the translation found for the Person in default
    (Arabic, Spanish). The default is setup in config/config.cnf.
    Eg: http://35.160.238.107:2121/filter?query=donald+trump
    b. <Server IP:port>/filter?query=<Person Name>&source=bablenet - Returns all person name in default language for only one data source. (By default it returns from all data source. Currently support: jrc, bablenet)
    Eg: http://35.160.238.107:2121/filter?query=hillary+clinton&source=jrc
    http://35.160.238.107:2121/filter?query=hillary+clinton&source=bablenet
    c. <Server IP:port>/filter?query=<Person Name>&lang=<Language Code> - Returns all person name in a particular language.

    Eg:
    http://35.160.238.107:2121/filter?query=donald+trump&lang=de (Names in German)
    http://35.160.238.107:2121/filter?query=donald+trump&lang=ja (Names in Japanese)
    http://35.160.238.107:2121/filter?query=donald+trump&lang=hi (Names in Hindi)
    http://35.160.238.107:2121/filter?query=narendra+modi&lang=bn (Names in Bengali)

Details to setup and run is present in rest-server/ReadME

webapp

This package contains a User Interface Client to visualize the data that can be accessed by the API.
Details present in webappn/ReadME

Dependency

The scripts have dependency on PyMongo(For database connection), Python Tornado(for API server), editdistance

To install all dependency run

pip install -r requirements.txt