Non-Parametric Class Completeness Estimators for Collaborative Knowledge Graphs

Requirements

Software

System

For fast access to the Wikidata graph, we create a binary representation for in-memory access during the observation extraction phase. Creating and using this graph can easily take up to 200 GB of memory.

Data Pipeline

Data Sources

The following tasks depend on each other and can be run without any parameters; the default parameters expect the datasets to be present in the subfolder /data. The necessary original datasets are no longer available at the source:

Additionally, we provide the data for every intermediate step as a download at https://zenodo.org/record/3268818.

1. Export Edits from Edit History

  • 0_export_edits.sql

    1. Load the XML dump of Wikidata into an SQL database (e.g. with MWDumper).
    2. The provided query exports all edits. (To recreate the output presented in the paper, restrict the query to edits before the timestamp "2018-10-01".)
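The timestamp restriction can also be applied after the export. The following is a minimal Python sketch, assuming the edits are available as (entity ID, ISO-8601 timestamp) pairs; this record layout and the name edits_before are hypothetical and do not reflect the actual output of 0_export_edits.sql.

```python
from datetime import datetime

# Cutoff named in this README for recreating the paper's output.
CUTOFF = datetime.fromisoformat("2018-10-01")


def edits_before(edits, cutoff=CUTOFF):
    """Keep only edits whose timestamp is strictly before the cutoff.

    `edits` is an iterable of (entity_id, iso_timestamp) pairs; this
    layout is an assumption made for illustration only.
    """
    return [
        (entity_id, ts)
        for entity_id, ts in edits
        if datetime.fromisoformat(ts) < cutoff
    ]


# Hypothetical sample records.
sample = [
    ("Q42", "2018-09-30T23:59:59"),
    ("Q64", "2018-10-01T00:00:00"),
    ("Q1", "2017-01-15T12:00:00"),
]
kept = edits_before(sample)
```

The comparison is strict, so an edit made exactly at midnight on 2018-10-01 is excluded.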

2. Data Preparation

3. Calculate Estimates and Convergence

  • 3_calculate_estimates.py: Calculate the Estimates of all Classes.
  • 4_draw_graphs.py: Draw the graphs and calculate the Convergence for all Classes. With -g "" no graph is loaded, which uses much less memory.

Estimator and Metrics

The estimators and metrics are available in estimators.py and metrics.py, respectively.
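To illustrate what a non-parametric class size estimator looks like, here is a minimal sketch of the Chao92 estimator (Chao and Lee, 1992) computed from a list of observed entity IDs. It is an illustration under assumptions: the function name chao92_estimate and the input format are hypothetical, and the code in estimators.py may differ in interface and in the estimators it provides.

```python
from collections import Counter


def chao92_estimate(observations):
    """Estimate the total number of distinct entities in a class.

    `observations` is an iterable of entity IDs, one per observation
    (hypothetical input format chosen for this sketch).
    """
    counts = Counter(observations)      # observations per entity
    n = sum(counts.values())            # total number of observations
    d = len(counts)                     # distinct entities seen so far
    if n < 2:
        return float(d)                 # too few observations to extrapolate

    freq_of_freq = Counter(counts.values())  # f_i: entities seen exactly i times
    f1 = freq_of_freq.get(1, 0)

    coverage = 1.0 - f1 / n             # Good-Turing sample coverage
    if coverage == 0.0:
        return float("inf")             # every entity seen once: no upper bound

    base = d / coverage
    pair_sum = sum(i * (i - 1) * f for i, f in freq_of_freq.items())
    # Squared coefficient of variation, truncated at zero.
    gamma_sq = max(base * pair_sum / (n * (n - 1)) - 1.0, 0.0)
    return base + n * (1.0 - coverage) / coverage * gamma_sq
```

A class in which most entities have been observed only once yields low coverage and therefore an estimate well above the number of distinct entities seen; as repeated observations accumulate, the estimate approaches the observed count.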

Results

cardinal.exascale.info

For all classes with at least 5000 observations, we calculated the convergence metric and drew the graph. All classes are listed on cardinal.exascale.info.

We also provide the results as a CSV file, result.csv (tab-separated, UTF-8).
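Because the file is tab-separated, it needs an explicit delimiter when loaded. A minimal sketch using only the Python standard library follows; it assumes that the first row of result.csv is a header row, which is not confirmed here, and the helper name load_results is hypothetical.

```python
import csv


def load_results(path="result.csv"):
    """Read the tab-separated, UTF-8 encoded results into a list of dicts.

    Assumes the first row holds the column names (an assumption of this
    sketch, not something stated in this README).
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each returned row maps column names to string values, so numeric columns need to be converted before use.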