This repository performs big data analysis using Hadoop on the Global Database of Events, Language and Tone (GDELT). The GDELT project is supported by Google Jigsaw and collates its information from broadcast, print and web news media from around the world. This data is open source and consists of CSV format datasets published daily.
For this project, I will be using the GDELT 1.0 Global Knowledge Graph for the following reasons:
- The GDELT 1.0 GKG database allows for more varied and involved data processing and will also allow me to infer a vast amount of knowledge from its various columns of data.
- In contrast, the GDELT 1.0 Event Database only allows me to count the number of events that occurred in a period of time. For example, multiple events tagged under the keyword KILL might or might not be related. This gives me only a broad overview of world events.
- The GDELT 2.0 database is updated every 15 minutes. Hence, it is suitable for use with real-time decision making requirements. The Hadoop ecosystem is not natively efficient at performing real-time processing of data and works better with historical data. Hence, GDELT 1.0 is a better fit.
The code here is an attempt to understand and showcase how massive global news databases like GDELT could be used to understand the changing field of politics, public perception, disease outbreaks, civil unrest, etc.
The GDELT GKG covers the period from April 1, 2013 to the present (22nd November 2019 at the time of writing). The table below describes the available columns; see the GDELT GKG 1.0 Data Format Codebook for more details.
Tab Delimited Columns of GDELT GKG Database | |
---|---|
DATE | In YYYYMMDD format (eg: 20160601) |
NUMARTS | Number of articles related to events |
COUNTS | List of all counts associated with the event, for multiple possible count types (eg: AFFECT, ARREST, KIDNAP, KILL, etc.) |
THEMES | List of all themes found in the article |
LOCATIONS | List of locations associated with event |
PERSONS | List of persons associated with event |
ORGANIZATIONS | List of organizations associated with event |
TONE | List of numbers used to describe the tonality of the article |
CAMEOEVENTIDS | List of integers to describe the event using the Conflict and Mediation Event Observations (CAMEO) IDs |
SOURCES | List of sources |
SOURCEURLS | For web articles, complete URL |
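As a minimal sketch of how one row of this table might be read, the plain-Java helper below splits a tab-delimited line into the eleven columns listed above. The class `GkgRecordSketch` and its methods are hypothetical and not part of this repository, and the semicolon separator for list-valued columns is an assumption that should be checked against the codebook.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of this repository.
final class GkgRecordSketch {
    // Column positions, in the order given by the table above.
    static final int DATE = 0, NUMARTS = 1, COUNTS = 2, THEMES = 3, LOCATIONS = 4,
            PERSONS = 5, ORGANIZATIONS = 6, TONE = 7, CAMEOEVENTIDS = 8,
            SOURCES = 9, SOURCEURLS = 10;

    // Split one line on tabs; the -1 limit keeps trailing empty columns.
    static String[] columns(String line) {
        return line.split("\t", -1);
    }

    // Assumption: list-valued columns separate their entries with ';'.
    static List<String> listField(String[] cols, int index) {
        if (index >= cols.length || cols[index].isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cols[index].split(";"));
    }

    public static void main(String[] args) {
        // Made-up row for illustration only; this is not real GDELT data.
        String row = "20160601\t3\t\tTHEME_A;THEME_B\t\tperson a;person b\t\t"
                + "1.5,3.0,1.5\t\texample.com\thttp://example.com/a";
        String[] cols = columns(row);
        System.out.println(cols[DATE] + " persons=" + listField(cols, PERSONS));
    }
}
```

Keeping the column positions as named constants means the rest of a job can refer to `PERSONS` or `TONE` instead of bare indices.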
The `\Code` folder contains multiple Java files which can be run using a Hadoop installation (either on a local machine or a cluster). The code performs the following analytics:
- `ActivePeople.java` / `[attr]Count.java`: Counts the number of times each individual has been mentioned in the database (summed over either events or articles). Each `[attr]Count.java` counts some attribute of the database, such as people, countries, cities, sources or themes.
- `CountryTone.java`: Filters and tabulates the tone of different articles related to each country.
- `CountryNetwork.java` / `PeopleNetwork.java`: Crawls through the event database to create a network of countries/people which are mentioned together. Can be used to understand how close two countries are, or how often they are mentioned together in popular news media. In the case of people, can be used to associate groups of people together.
- `PersonTone.java`: Filters the events to look for articles that mention a selected person, and tabulates the tone of those articles. Can be used to understand the public perception of a person, and how it changes. For example, during the 2016 US elections, can be used to see how many articles had a positive/negative tone related to Donald Trump vs Hillary Clinton.
- `DeathToll.java`: Filters the data for events that include the event type `KILL`, effectively counting the number of reported deaths on any given day.
- `Experts.java`: Combs through the database to provide the names of people most commonly associated with a particular theme/subject.
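The repository's Hadoop jobs are not reproduced here. As a rough illustration of the counting idea behind a job like `ActivePeople.java`, the plain-Java sketch below emits each person in a row (the map step) and sums the weights per name (the reduce step). The class `PersonCountSketch`, the `weightByArticles` flag, and the semicolon separator inside the PERSONS column are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the repository's implementation.
final class PersonCountSketch {
    private static final int NUMARTS = 1; // second column of a GKG row
    private static final int PERSONS = 5; // sixth column of a GKG row

    // weightByArticles = false counts one mention per row (event);
    // weightByArticles = true adds the row's NUMARTS value instead.
    static Map<String, Long> countPersons(List<String> rows, boolean weightByArticles) {
        Map<String, Long> totals = new HashMap<>();
        for (String row : rows) {
            String[] cols = row.split("\t", -1);
            if (cols.length <= PERSONS || cols[PERSONS].isEmpty()) {
                continue; // no persons recorded for this row
            }
            long weight = 1L;
            if (weightByArticles) {
                try {
                    weight = Long.parseLong(cols[NUMARTS].trim());
                } catch (NumberFormatException e) {
                    continue; // e.g. a header line; skip it
                }
            }
            // Assumption: names in PERSONS are separated by ';'.
            for (String person : cols[PERSONS].split(";")) {
                String name = person.trim();
                if (!name.isEmpty()) {
                    totals.merge(name, weight, Long::sum); // the "reduce" step
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only.
        List<String> rows = List.of(
                "20160601\t2\t\t\t\tperson a;person b",
                "20160601\t5\t\t\t\tperson b");
        System.out.println(countPersons(rows, false));
        System.out.println(countPersons(rows, true));
    }
}
```

In a Hadoop job the inner loop would typically live in a mapper that emits `(name, weight)` pairs, and the summation would live in a reducer. Swapping the `PERSONS` index for another list-valued column gives the other `[attr]Count` variants.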
The following outputs were computed by running the given code on the GDELT GKG file for 1st June 2016. This date falls in the run-up to the 2016 US Presidential Elections, so that theme is very common in the results below. The visualizations were created using the free public version of Tableau. The network graph of people was created using the open-source graph visualization software Gephi. Sadly, GitHub does not support iframes in Markdown files, so to try out the interactive Tableau visualizations, just click on the images!
The visualization below shows the number of violent incidents reported around the world on 2016-06-01.
The visualization below shows treemaps of the most frequently mentioned people, the most frequently mentioned organizations, and the most prolific sources of the articles collected by GDELT on 2016-06-01.
The graphs below show the distribution of tone across the articles published on 2016-06-01. This includes the overall distribution for all events, and the distributions for Donald Trump, for Hillary Clinton, and for the countries US, UK and India.
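A tone distribution of this kind can be built by taking one tone score per record and counting how many scores fall into each bin. The sketch below is a hypothetical plain-Java illustration, not the repository's code. It assumes the TONE column is a comma-separated list whose first number is the overall tone score, and it uses bins one unit wide.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the repository's implementation.
final class ToneHistogramSketch {
    // Assumption: the first comma-separated number in TONE is the overall tone.
    static double overallTone(String toneField) {
        return Double.parseDouble(toneField.split(",")[0].trim());
    }

    // Maps each bin's lower edge (floor of the tone) to the number of records in it.
    static Map<Integer, Integer> histogram(List<String> toneFields) {
        Map<Integer, Integer> bins = new TreeMap<>();
        for (String field : toneFields) {
            if (field == null || field.trim().isEmpty()) {
                continue; // record has no tone information
            }
            int bin = (int) Math.floor(overallTone(field));
            bins.merge(bin, 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        // Made-up tone values for illustration only.
        System.out.println(histogram(List.of("1.5,3.0,1.5", "-0.2,1.0,1.2", "1.9")));
    }
}
```

Filtering the records by a person or country before calling `histogram` would give the per-person and per-country distributions.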
The visualization below shows the countries that are most often mentioned together. Since this graph can include a lot of information, you can select a country of your choice to display its connections with other countries. The width and color of each connecting line depend on how often the two countries are mentioned together: the thicker and darker the line, the more often they have been mentioned together.
The visualization below shows the most mentioned cities and countries.
The visualization below lets you select a person from a list and creates a word cloud of the themes / topics associated with that person. By default, the person selected is 'Hillary Clinton'. Keep in mind that these topics reflect only the articles published on 2016-06-01.
The visualization below shows the network of people, with the edges being weighted based on how many times they are mentioned together in the same article.
We see some groups forming in the network above. Each group can be broadly associated with some aspect of public life, as shown below.
It is also interesting to see how these different groups bleed into each other. For example, the network of people related to politics in the US is closely connected to the network of people related to Hollywood / pop culture (such as Justin Bieber, Kim Kardashian, Johnny Depp etc.).
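The edge weights in a people network like the one described above can be sketched as a co-mention count: for every record, each unordered pair of distinct people adds one to that pair's weight. The class `CoMentionSketch` and the `"a|b"` key format are hypothetical choices for this example and are not taken from `PeopleNetwork.java`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not the repository's implementation.
final class CoMentionSketch {
    // Returns a map from "nameA|nameB" (names in sorted order) to the number
    // of records in which both names appear.
    static Map<String, Integer> edgeWeights(List<List<String>> personsPerRecord) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> persons : personsPerRecord) {
            // Sort and de-duplicate so each pair is counted once per record
            // and always gets the same key regardless of input order.
            List<String> unique = new ArrayList<>(new TreeSet<>(persons));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    weights.merge(unique.get(i) + "|" + unique.get(j), 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up names for illustration only.
        List<List<String>> records = List.of(
                List.of("person b", "person a", "person c"),
                List.of("person a", "person b"));
        System.out.println(edgeWeights(records));
    }
}
```

The resulting pairs and weights could be written out as an edge list for a tool such as Gephi. The same pairing works for countries if the input lists hold country names instead of people.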