DI2KG 2020 Workshop by VLDB 2020 Tokyo conference : Entity Resolution Task submission

anshudaur/DI2KG-Entity-Resolution

A proposed solution to the Entity Resolution - Domain Knowledge Category of the 2nd International Workshop on Challenges and Experiences from Data Integration to Knowledge Graphs (Monitor Data, Domain knowledge approach)

Team: SimSkipReloaded

Affiliation: University of Magdeburg, Databases and Software Engineering Workgroup.

For an overview of the problem and the dataset, see http://di2kg.inf.uniroma3.it/2020/#challenge

Comments:

An ML pipeline that constructs a knowledge graph from an unstructured e-commerce dataset, using ML algorithms such as KNN and transformers. The solution is highly tailored to the domain. Its core strategies are information propagation for model detection and carefully tuned brand and model extraction (with domain-specific choices).


Program assumptions (other than requirements):

  1. We assume 2013_monitor_specs to be in the same path.

The domain-specific choices we made are limited to:

  a) Cleaning of site-specific texts for better TF-IDF results.
  b) Non-exhaustive brand (attribute) keywords, extracted by looking at some examples of the data.
  c) Brand names that were extracted with a bit of a human-in-the-loop process (where we saw the brand names emerging and collected alternative names).
  d) A large number of rules for brand cleaning, resulting from data understanding. This is the least general aspect of our solution; the number of hand-crafted configurations reflects the amount of time the team spent exploring and understanding the data.
  e) Rules for extracting the models.
  f) Cleaning of false-positive model names.
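As a rough illustration of choices a) and c), the sketch below cleans a product title and maps alternative brand spellings to one canonical name. It is not taken from EntityResolution.py: the alias table, the noise phrases, and the function names `clean_title` and `extract_brand` are all hypothetical and only show the general shape of such rules.

```python
import re

# Hypothetical alias table: canonical brand -> alternative spellings.
BRAND_ALIASES = {
    "hp": ["hp", "hewlett packard", "hewlett-packard"],
    "lg": ["lg", "lg electronics"],
    "dell": ["dell"],
}

# Hypothetical site-specific phrases that carry no product information.
SITE_NOISE = ["free shipping", "best price", "buy online"]


def clean_title(title):
    """Lowercase the title, drop site boilerplate, and normalise whitespace."""
    text = title.lower()
    for phrase in SITE_NOISE:
        text = text.replace(phrase, " ")
    # Keep only letters, digits, hyphens and spaces.
    text = re.sub(r"[^a-z0-9\- ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_brand(title):
    """Return the canonical brand whose alias occurs as a whole phrase, else None."""
    text = clean_title(title)
    # Try longer aliases first so multi-word names win over short ones.
    candidates = sorted(
        ((alias, brand) for brand, aliases in BRAND_ALIASES.items() for alias in aliases),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for alias, brand in candidates:
        # The lookarounds prevent matches inside longer alphanumeric tokens.
        pattern = r"(?<![a-z0-9])" + re.escape(alias) + r"(?![a-z0-9])"
        if re.search(pattern, text):
            return brand
    return None
```

For example, `extract_brand("Hewlett-Packard EliteDisplay E231 - Free Shipping!")` resolves to the canonical key `"hp"`, while a title with no known alias yields `None`. Matching on whole phrases rather than substrings is one way to limit false positives, at the cost of missing brands that are glued to other tokens.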

We consider the hard-coded rules to detract from the generality of our solution. However, we include them since they are crucial for finding a straightforward solution with the limited resources chosen.

One of the steps: pairwise BERT embeddings were generated, and a KNN classifier was trained on their cosine similarity to find similar entities (see the similiarityUsingBert.ipynb file).
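The following is a minimal sketch of one possible reading of this step, not the notebook's actual code. It assumes that the cosine similarity of two embeddings is used as a single feature and that a k-nearest-neighbour vote over labelled pairs decides whether a new pair is a match. The toy 3-dimensional vectors stand in for BERT embeddings, and the labels and the helper names `cosine_similarity` and `knn_predict` are made up for illustration. In practice a library implementation such as scikit-learn's `KNeighborsClassifier` would likely be used instead of the hand-written vote.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training pairs whose similarity is closest to query."""
    nearest = sorted(
        zip(train_features, train_labels),
        key=lambda item: abs(item[0] - query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy vectors standing in for BERT embeddings (made up for illustration).
embeddings = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.8, 0.2, 0.1],
    "c": [0.0, 0.1, 0.9],
    "d": [0.1, 0.0, 0.8],
    "e": [0.7, 0.3, 0.0],
}

# Hypothetical labelled pairs: 1 = same product, 0 = different products.
labelled_pairs = [
    ("a", "b", 1),
    ("c", "d", 1),
    ("a", "c", 0),
    ("b", "d", 0),
    ("a", "d", 0),
]

train_features = [
    cosine_similarity(embeddings[x], embeddings[y]) for x, y, _ in labelled_pairs
]
train_labels = [label for _, _, label in labelled_pairs]

# Classify two unseen pairs by the similarity of their embeddings.
match_ae = knn_predict(
    train_features, train_labels, cosine_similarity(embeddings["a"], embeddings["e"])
)
match_ce = knn_predict(
    train_features, train_labels, cosine_similarity(embeddings["c"], embeddings["e"])
)
```

With these toy values, the pair ("a", "e") lands among the high-similarity matches and the pair ("c", "e") among the low-similarity non-matches.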

To the best of our knowledge, any recent version of scikit-learn and numpy would be compatible with our submitted solution.

For reference, our environment was installed with the following packages:

ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
certifi-2020.6.20          |   py38h32f6830_0         151 KB  conda-forge
joblib-0.16.0              |             py_0         203 KB  conda-forge
ld_impl_linux-64-2.34      |       h53a641e_5         616 KB  conda-forge
libblas-3.8.0              |      17_openblas          11 KB  conda-forge
libcblas-3.8.0             |      17_openblas          11 KB  conda-forge
liblapack-3.8.0            |      17_openblas          11 KB  conda-forge
libopenblas-0.3.10         |       h5ec1e0e_0         7.8 MB  conda-forge
numpy-1.18.5               |   py38h8854b6b_0         5.2 MB  conda-forge
python-3.8.3               |cpython_he5300dc_0        71.0 MB  conda-forge
scikit-learn-0.23.1        |   py38h3a94b23_0         7.0 MB  conda-forge
scipy-1.5.0                |   py38h18bccfc_0        18.7 MB  conda-forge
setuptools-49.1.0          |   py38h32f6830_0         911 KB  conda-forge
sqlite-3.32.3              |       hcee41ef_0         2.0 MB  conda-forge
threadpoolctl-2.1.0        |     pyh5ca1d4c_0          15 KB  conda-forge

To run: python EntityResolution.py
