This guide takes you through the steps to build a simple application starting with a collection of event records. Before continuing, install myDIG by following the instructions in the README.md file. Once the installation completes, visit the page http://localhost:12497/mydig/ui/ to bring up the home page.
The steps include:
- creating a new project
- loading data
- defining fields
- using the Inferlink tool to extract data from all pages in a web site
- using glossaries to extract terms from any page
- using rules to extract data based on patterns
- organizing and customizing the appearance of the search page
This guide shows you how to build a knowledge graph (KG) and a search application for an event dataset.
Your objective is to build a KG of event records. The KG will include information about the events (title, date, city, country, description).
From the All Projects
page, click on the +
button to add a new project.
Screenshot 1
Project names should be lowercase and contain only letters, numbers, and underscores (_).
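If you generate project names in a script, you can check them against this rule before typing them into the form. The sketch below is an assumption based only on the rule stated above; the function name is hypothetical and myDIG's own validation may differ.

```python
import re

# Assumed pattern derived from the naming rule in this guide:
# lowercase letters, digits, and underscores only.
# myDIG's own validation may be stricter or looser.
PROJECT_NAME_PATTERN = re.compile(r"[a-z0-9_]+")


def is_valid_project_name(name: str) -> bool:
    """Return True if the whole name matches the assumed pattern."""
    return PROJECT_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["events_demo", "Events", "my-project", "kg2018"]:
        print(candidate, is_valid_project_name(candidate))
```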
Click on the SAVE
button to create the project.
Screenshot 2
Click on the project or the >
button to go to the project configuration page.
Screenshot 3
For this guide, we are going to use existing project settings, which include predefined fields and glossaries.
Click on the Import Project
button in the menu at the top right corner.
Screenshot 4
Select elicit_20181113185140.tar.gz
from the datasets
folder, click SUBMIT
and then click the OVERWRITE
button in the confirmation dialog.
Screenshot 5
The import should create 5 new fields and 2 glossaries.
Screenshot 6
The new project screen contains options to configure your project.
Screenshot 7
Use Import JSON lines File
to import data into your project, and select the elicit_20.jl
file, which contains a sample of 20 pages from the event dataset.
This file is part of the myDIG installation, at datasets/elicit_20.jl.
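A JSON lines file holds one JSON document per line. If you want to check such a file before importing it, a minimal sketch like the following can help. The function name and the keys in the sample records are made up for illustration and are not the fields of elicit_20.jl.

```python
import io
import json
from typing import Any, Iterable, Iterator


def read_json_lines(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


if __name__ == "__main__":
    # In-memory sample with hypothetical keys, standing in for a real file.
    sample = io.StringIO(
        '{"doc_id": "a1", "text": "first page"}\n'
        "\n"
        '{"doc_id": "b2", "text": "second page"}\n'
    )
    records = list(read_json_lines(sample))
    print(len(records), "records")
```

To check the sample file itself, pass an open file handle instead of the in-memory sample, for example the result of open("datasets/elicit_20.jl", encoding="utf-8") run from the myDIG installation directory.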
After a few seconds, myDIG will load the data, and show that it has loaded 20 pages:
Screenshot 8
myDIG scans the files you load, identifies their TLDs, and shows the number of documents you have from each TLD (our file has documents from a single TLD).
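To get a feel for this kind of per-source tally, here is a rough sketch. It rests on two assumptions: that each record carries a url key (hypothetical, so check your own data), and that the hostname is a good enough stand-in for what myDIG reports as a TLD. It is not how myDIG computes its table.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_site(records, url_key="url"):
    """Count records per hostname; url_key is an assumed field name."""
    counts = Counter()
    for record in records:
        # urlparse(...).hostname is None when the URL is missing or has no host.
        host = urlparse(record.get(url_key, "")).hostname or "(unknown)"
        counts[host] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"url": "http://example.com/events/1"},
        {"url": "http://example.com/events/2"},
        {"url": "http://example.org/news/7"},
        {},  # record without a URL
    ]
    for site, total in count_by_site(sample).items():
        print(site, total)
```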
Creating a KG is an iterative process. You start with a simple KG and then create information extractors to populate the KG with the data you want. You can build the KG with the extractors you have, test it in the GUI, and then define more extractors or fine-tune the ones you have.
myDIG comes with a set of predefined extractors, so you can build a simple KG before you define any extractors. Before you do that, you need to understand a little bit about the extraction pipeline. Think of the extraction pipeline as a queue. If you put documents in the queue, myDIG will process them a few at a time, incrementally loading them into the KG until the queue is empty.
When you import documents, myDIG doesn't put them in the queue automatically, but rather puts them in a holding area so that you can add them to the queue whenever you want.
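The holding area and the queue can be pictured with a small toy model. This is only a conceptual sketch of the behavior described above; the class, its method names, and the batch size are invented for illustration and do not reflect myDIG's implementation.

```python
from collections import deque


class PipelineModel:
    """Toy model: imported documents wait in a holding area until enqueued."""

    def __init__(self, batch_size=3):
        self.holding_area = []   # imported, not yet queued
        self.queue = deque()     # waiting to be processed
        self.kg = []             # processed and loaded
        self.batch_size = batch_size

    def import_documents(self, docs):
        # Importing never touches the queue.
        self.holding_area.extend(docs)

    def enqueue(self, count):
        # Move up to `count` documents from the holding area into the queue.
        moved = self.holding_area[:count]
        del self.holding_area[:count]
        self.queue.extend(moved)
        return len(moved)

    def process_batch(self):
        # Process a few documents at a time, loading each into the KG.
        processed = 0
        while self.queue and processed < self.batch_size:
            self.kg.append(self.queue.popleft())
            processed += 1
        return processed

    def run_until_empty(self):
        while self.queue:
            self.process_batch()


if __name__ == "__main__":
    model = PipelineModel()
    model.import_documents([f"doc{i}" for i in range(20)])
    model.enqueue(10)
    model.run_until_empty()
    print(len(model.kg), "in KG,", len(model.holding_area), "still held")
```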
Use the Desired
field to tell myDIG how many documents from each TLD you want to have in your KG. You can either update desired documents for all the TLDs or update each TLD individually.
Change the number in the textbox in the Desired
column to 10 and click outside the box. If there are multiple TLDs, you could update all of them at once by clicking the Desired Docs per TLD
button (see Screenshot 7) and then clicking Update.
The table of TLDs now shows 10 in the Desired
column.
To create the KG, click the red Recreate Knowledge Graph
button.
myDIG will first delete your current KG, then turn on the extraction pipeline, and finally add the desired number of documents to the pipeline.
You need to wait for about 1 minute before the number of documents in the KG updates.
The reason for the wait is that myDIG needs to load several very large files such as a list of all cities in the world with population over 15,000, and an English language model that includes all words in the English dictionary.
After a couple of minutes, the documents will have moved through the extraction pipeline and been loaded into the KG:
Screenshot 9
As soon as the number of documents in the KG is more than zero, you can click on the DIG UI
button (see Screenshot 7) to open the DIG page and search the KG you just created.
Go ahead and click it.
In the DIG UI, click the wrench icon (see Screenshot 10) to open the search form:
Screenshot 10
Enter nigeria
in the Country
field to tell DIG that you want to search for events from Nigeria:
Screenshot 11
Click Search
and take a look at the data.
Your KG so far is simple, but it is already functional.
Not bad for a few minutes of work.
You can easily add more documents to an existing KG.
Enter 20
in the Desired
field and click outside the textbox to tell myDIG that you want more documents. If Pipeline
is in the ON
state, myDIG will add 10 more documents to the processing queue.
After a few seconds the number of documents in the KG starts increasing.
This time it is much faster because you didn't have to recreate the KG and restart the processing pipeline.
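The arithmetic behind "10 more documents" can be sketched as follows. The function is hypothetical, and the cap at the number of imported documents is an assumption rather than documented myDIG behavior.

```python
def documents_to_add(desired: int, already_added: int, imported: int) -> int:
    """Return how many more documents to queue for one TLD.

    Assumption: the target cannot exceed the number of imported documents.
    """
    target = min(desired, imported)
    return max(target - already_added, 0)


if __name__ == "__main__":
    # Walkthrough from this guide: 20 imported, 10 already in the KG,
    # Desired raised from 10 to 20.
    print(documents_to_add(desired=20, already_added=10, imported=20))
```

With the numbers from this guide (20 imported, 10 already processed, Desired raised to 20), the function returns 10.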
By default, myDIG does not create any fields in a new project. The fields in your project exist because we imported a project with fields already defined.
Select the Fields
option (see Screenshot 6) to view the fields in your project:
Screenshot 12
Let's add a new dummy field, for demo purposes. Click on the +
button at the top right corner.
Screenshot 13
Fill in the form as shown in Screenshot 13 and click SAVE.
The add field form has more options, which we'll discuss later.
To delete a field, click the delete button on the right.
We don't need the dummy
field in our KG, so go ahead and delete it.
Deleting a field cannot be undone.
- Recreate Knowledge Graph recreates the index in Elasticsearch and regenerates the ETK config. The desired number of documents is then added and processed automatically. This function also turns the pipeline on. Use it only after making incompatible changes: uploading new glossaries, updating fields, updating tags, or updating Landmark rules.
- Turn on Pipeline starts the ETK processes with the existing ETK modules. Use this function if you only want to add new data. The ETK processes exit after being idle for an hour, at which point this button becomes enabled again.
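The choice between the two buttons can be summarized as a small decision rule. The sketch below only encodes the list of incompatible changes given above; the change labels and the function name are hypothetical, and nothing here calls myDIG itself.

```python
# Change types that this guide lists as incompatible (labels are made up).
INCOMPATIBLE_CHANGES = {"glossaries", "fields", "tags", "landmark_rules"}


def choose_action(changes):
    """Return the button to use for the given kinds of change."""
    if INCOMPATIBLE_CHANGES & set(changes):
        return "Recreate Knowledge Graph"
    return "Turn on Pipeline"


if __name__ == "__main__":
    print(choose_action(["fields"]))    # incompatible change
    print(choose_action(["new_data"]))  # only adding data
```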