Commit

Merge remote-tracking branch 'origin/master'
kPsarakis committed Feb 14, 2024
2 parents 418f3f1 + dd15f95 commit 705339c
Showing 13 changed files with 659 additions and 341 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -5,4 +5,6 @@ __pycache__/
dist
valentine.egg-info
build
.vscode/
.vscode/
valentine.sublime-workspace
valentine.sublime-project
64 changes: 45 additions & 19 deletions README.md
@@ -76,7 +76,7 @@ After selecting one of the 5 matching methods, the user can initiate the pairwis
matches = valentine_match(df1, df2, matcher, df1_name, df2_name)
```

where df1 and df2 are the two pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a dictionary storing as keys column pairs from the two DataFrames and as values the corresponding similarity scores.
where df1 and df2 are the two pandas DataFrames for which we want to find matches, and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input a name for each DataFrame (defaults are "table\_1" and "table\_2"). Function ```valentine_match``` returns a MatcherResults object, which is a dictionary with additional convenience methods, such as `one_to_one`, `take_top_percent`, `get_metrics` and more. Its keys are column pairs from the two DataFrames and its values are the corresponding similarity scores.

### Matching DataFrame Batch

@@ -86,23 +86,48 @@ After selecting one of the 5 matching methods, the user can initiate the batch m
matches = valentine_match_batch(df_iter_1, df_iter_2, matcher, df_iter_1_names, df_iter_2_names)
```

where df_iter_1 and df_iter_2 are the two iterable structures containing pandas DataFrames for which we want to find matches and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function ```valentine_match_batch``` returns a dictionary storing as keys column pairs from the DataFrames and as values the corresponding similarity scores.
where df_iter_1 and df_iter_2 are two iterable structures containing pandas DataFrames for which we want to find matches, and matcher is one of Coma, Cupid, DistributionBased, JaccardLevenMatcher or SimilarityFlooding. The user can also input an iterable with names for each DataFrame. Function ```valentine_match_batch``` returns a MatcherResults object, which is a dictionary with additional convenience methods, such as `one_to_one`, `take_top_percent`, `get_metrics` and more. Its keys are column pairs from the DataFrames and its values are the corresponding similarity scores.
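
For illustration, here is a minimal sketch of a batch call. The sample DataFrames, the use of plain lists as the iterables, and the top-level import of `valentine_match_batch` are assumptions made for the sketch; the matcher can be any of the five algorithms.

```python
import pandas as pd
from valentine import valentine_match_batch  # assumed import path, mirroring valentine_match
from valentine.algorithms import JaccardDistanceMatcher

# Two small batches of DataFrames; plain lists are one possible iterable
authors_a = pd.DataFrame({'Authors': ['Smith J.', 'Lee K.'], 'Cited by': [23, 41]})
authors_b = pd.DataFrame({'Author names': ['Smith J.'], 'Citations': [23]})
papers_a = pd.DataFrame({'Title': ['A study'], 'EID': ['2-s2.0-1']})
papers_b = pd.DataFrame({'Paper title': ['A study'], 'EID': ['2-s2.0-1']})

matches = valentine_match_batch(
    [authors_a, papers_a],        # df_iter_1
    [authors_b, papers_b],        # df_iter_2
    JaccardDistanceMatcher(),     # any of the five matchers works here
    ['authors_a', 'papers_a'],    # df_iter_1_names (optional)
    ['authors_b', 'papers_b'],    # df_iter_2_names (optional)
)

# Keys are ((table_name, column), (table_name, column)) pairs, values are similarity scores
for pair, score in matches.items():
    print(pair, score)
```

The returned object supports the same `MatcherResults` helpers described below.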

### Measuring effectiveness

Based on the matches retrieved by calling `valentine_match` the user can use
### MatcherResults instance
The `MatcherResults` instance has some convenience methods that the user can use to either obtain a subset of the data or to transform the data. This instance is a dictionary that is sorted from highest to lowest similarity upon instantiation.
```python
top_n_matches = matches.take_top_n(5)

top_n_percent_matches = matches.take_top_percent(25)

one_to_one_matches = matches.one_to_one()
```


### Measuring effectiveness
The MatcherResults instance that is returned by `valentine_match` or `valentine_match_batch` also has a `get_metrics` method that the user can use

```python
metrics = valentine_metrics.all_metrics(matches, ground_truth)
metrics = matches.get_metrics(ground_truth)
```

in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others as described in the original Valentine paper. In order to do so, the user needs to also input the ground truth of matches based on which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold.
in order to get all effectiveness metrics, such as Precision, Recall, F1-score and others described in the original Valentine paper. To do so, the user also needs to provide the ground truth of matches against which the metrics will be calculated. The ground truth can be given as a list of tuples representing column matches that should hold (see example below).

By default, all the core metrics are computed with default parameters, but the user can also customize which metrics to run and with what parameters, or implement their own custom metrics by extending the `Metric` base class. Some predefined sets of metrics are available as well.

```python
from valentine.metrics import F1Score, PrecisionTopNPercent, METRICS_PRECISION_INCREASING_N
metrics_custom = matches.get_metrics(ground_truth, metrics={F1Score(one_to_one=False), PrecisionTopNPercent(n=70)})
metrics_predefined_set = matches.get_metrics(ground_truth, metrics=METRICS_PRECISION_INCREASING_N)
```
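
To illustrate extending the `Metric` base class, here is a minimal sketch of a hypothetical precision-at-k metric. The import path for `Metric`, the `apply`/`name` hooks, and the dataclass-style constructor are assumptions made for the sketch and may not match the actual base-class API.

```python
from dataclasses import dataclass

from valentine.metrics import Metric  # assumed import path for the base class


@dataclass(eq=True, frozen=True)
class PrecisionAtK(Metric):
    """Hypothetical metric: precision over the k highest-scoring matches."""
    k: int = 3

    def name(self) -> str:
        # Key under which the score would appear in the get_metrics() result
        return f'PrecisionAt{self.k}'

    def apply(self, matches, ground_truth) -> dict:
        # MatcherResults is sorted from high to low similarity, so the first
        # k keys are the k best column pairs
        top_k = list(matches.keys())[:self.k]
        hits = sum(1 for (_, col_1), (_, col_2) in top_k
                   if (col_1, col_2) in ground_truth)
        return {self.name(): hits / self.k if self.k else 0.0}


# Assumed usage, mirroring the built-in metrics:
# metrics_custom = matches.get_metrics(ground_truth, metrics={PrecisionAtK(k=3)})
```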


### Example
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (as found in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):
The following block of code shows: 1) how to run a matcher from Valentine on two DataFrames storing information about authors and their publications, and then 2) how to assess its effectiveness based on a given ground truth (a more extensive example is shown in [`valentine_example.py`](https://github.com/delftdata/valentine/blob/master/examples/valentine_example.py)):

```python
import os
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

# Load data using pandas
d1_path = os.path.join('data', 'authors1.csv')
d2_path = os.path.join('data', 'authors2.csv')
@@ -120,25 +145,26 @@ ground_truth = [('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')]

metrics = valentine_metrics.all_metrics(matches, ground_truth)
metrics = matches.get_metrics(ground_truth)

print(metrics)
```

The output of the above code block is:

```
{(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057}
{'precision': 1.0, 'recall': 1.0, 'f1_score': 1.0,
'precision_at_10_percent': 1.0,
'precision_at_30_percent': 1.0,
'precision_at_50_percent': 1.0,
'precision_at_70_percent': 1.0,
'precision_at_90_percent': 1.0,
'recall_at_sizeof_ground_truth': 1.0}
{
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.86994505,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.8679843,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8571245
}
{
'Recall': 1.0,
'F1Score': 1.0,
'RecallAtSizeofGroundTruth': 1.0,
'Precision': 1.0,
'PrecisionTop10Percent': 1.0
}
```

## Cite Valentine
36 changes: 25 additions & 11 deletions examples/valentine_example.py
@@ -1,8 +1,10 @@
import os
import pandas as pd
from valentine import valentine_match, valentine_metrics
from valentine.algorithms import Coma
from valentine.metrics import F1Score, PrecisionTopNPercent
from valentine import valentine_match
from valentine.algorithms import JaccardDistanceMatcher
import pprint
pp = pprint.PrettyPrinter(indent=4, sort_dicts=False)


def main():
@@ -13,28 +15,40 @@ def main():
df2 = pd.read_csv(d2_path)

# Instantiate matcher and run
# Coma requires java to be installed on your machine
# If java is not an option, all the other algorithms are in Python (e.g., Cupid)
matcher = Coma(use_instances=False)
matcher = JaccardDistanceMatcher()
matches = valentine_match(df1, df2, matcher)

# MatcherResults is a wrapper object that has several useful
# utility/transformation functions
print("Found the following matches:")
pp.pprint(matches)

print("\nGetting the one-to-one matches:")
pp.pprint(matches.one_to_one())

# If ground truth is available, Valentine can calculate the metrics
ground_truth = [('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')]

metrics = valentine_metrics.all_metrics(matches, ground_truth)

pp = pprint.PrettyPrinter(indent=4)
print("Found the following matches:")
pp.pprint(matches)
metrics = matches.get_metrics(ground_truth)

print("\nAccording to the ground truth:")
pp.pprint(ground_truth)

print("\nThese are the scores of the matcher:")
print("\nThese are the scores of the default metrics for the matcher:")
pp.pprint(metrics)

print("\nYou can also get specific metric scores:")
pp.pprint(matches.get_metrics(ground_truth, metrics={
PrecisionTopNPercent(n=80),
F1Score()
}))

print("\nThe MatcherResults object is a dict and can be treated such:")
for match in matches:
print(f"{str(match): <60} {matches[match]}")


if __name__ == '__main__':
main()
86 changes: 86 additions & 0 deletions tests/test_matcher_results.py
@@ -0,0 +1,86 @@
import unittest
import math

from tests import df1, df2
from valentine.algorithms.matcher_results import MatcherResults
from valentine.algorithms import JaccardDistanceMatcher
from valentine.metrics import Precision
from valentine import valentine_match


class TestMatcherResults(unittest.TestCase):
def setUp(self):
self.matches = valentine_match(df1, df2, JaccardDistanceMatcher())
self.ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')
]

def test_dict(self):
assert isinstance(self.matches, dict)

def test_get_metrics(self):
metrics = self.matches.get_metrics(self.ground_truth)
assert all([x in metrics for x in {"Precision", "Recall", "F1Score"}])

metrics_specific = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
assert "Precision" in metrics_specific

def test_one_to_one(self):
m = self.matches

# Add multiple matches per column
pairs = list(m.keys())
for (ta, ca), (tb, cb) in pairs:
m[((ta, ca), (tb, cb + 'foo'))] = m[((ta, ca), (tb, cb))] / 2

# Verify that len gets corrected from 6 to 3
m_one_to_one = m.one_to_one()
assert len(m_one_to_one) == 3 and len(m) == 6

# Verify that none of the lower similarity "foo" entries made it
for (ta, ca), (tb, cb) in pairs:
assert ((ta, ca), (tb, cb + 'foo')) not in m_one_to_one

# Verify that the cache resets on a new MatcherResults instance
m_entry = MatcherResults(m)
assert m_entry._cached_one_to_one is None

# Add one new entry with lower similarity
m_entry[(('table_1', 'BLA'), ('table_2', 'BLA'))] = 0.7214057

# Verify that the new one_to_one is different from the old one
m_entry_one_to_one = m_entry.one_to_one()
assert m_one_to_one != m_entry_one_to_one

# Verify that all remaining values are above the median
median = sorted(list(m_entry.values()), reverse=True)[math.ceil(len(m_entry)/2)]
for k in m_entry_one_to_one:
assert m_entry_one_to_one[k] >= median

def test_take_top_percent(self):
take_0_percent = self.matches.take_top_percent(0)
assert len(take_0_percent) == 0

take_40_percent = self.matches.take_top_percent(40)
assert len(take_40_percent) == 2

take_100_percent = self.matches.take_top_percent(100)
assert len(take_100_percent) == len(self.matches)

def test_take_top_n(self):
take_none = self.matches.take_top_n(0)
assert len(take_none) == 0

take_some = self.matches.take_top_n(2)
assert len(take_some) == 2

take_all = self.matches.take_top_n(len(self.matches))
assert len(take_all) == len(self.matches)

take_more_than_all = self.matches.take_top_n(len(self.matches)+1)
assert len(take_more_than_all) == len(self.matches)

def test_copy(self):
assert self.matches.get_copy() is not self.matches
97 changes: 63 additions & 34 deletions tests/test_metrics.py
@@ -1,47 +1,76 @@
import unittest
from valentine.metrics import *
from valentine.algorithms.matcher_results import MatcherResults
from valentine.metrics.metric_helpers import get_fp, get_tp_fn

import math
from valentine.metrics.metrics import one_to_one_matches
from copy import deepcopy
class TestMetrics(unittest.TestCase):
def setUp(self):
self.matches = MatcherResults({
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057,
(('table_1', 'Title'), ('table_2', 'DUMMY1')): 0.8214057,
(('table_1', 'Title'), ('table_2', 'DUMMY2')): 0.8114057,
})
self.ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID'),
('Title', 'Title'),
('DUMMY3', 'DUMMY3')

matches = {
(('table_1', 'Cited by'), ('table_2', 'Cited by')): 0.8374313,
(('table_1', 'Authors'), ('table_2', 'Authors')): 0.83498037,
(('table_1', 'EID'), ('table_2', 'EID')): 0.8214057,
}
]

ground_truth = [
('Cited by', 'Cited by'),
('Authors', 'Authors'),
('EID', 'EID')
]
def test_precision(self):
precision = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
assert 'Precision' in precision and precision['Precision'] == 0.75

precision_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={Precision(one_to_one=False)})
assert 'Precision' in precision_not_one_to_one and precision_not_one_to_one['Precision'] == 0.6

class TestMetrics(unittest.TestCase):
def test_recall(self):
recall = self.matches.get_metrics(self.ground_truth, metrics={Recall()})
assert 'Recall' in recall and recall['Recall'] == 0.6

recall_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={Recall(one_to_one=False)})
assert 'Recall' in recall_not_one_to_one and recall_not_one_to_one['Recall'] == 0.6

def test_f1(self):
f1 = self.matches.get_metrics(self.ground_truth, metrics={F1Score()})
assert 'F1Score' in f1 and round(100*f1['F1Score']) == 67

f1_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={F1Score(one_to_one=False)})
assert 'F1Score' in f1_not_one_to_one and f1_not_one_to_one['F1Score'] == 0.6

def test_precision_top_n_percent(self):
precision_0 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=0)})
assert 'PrecisionTop0Percent' in precision_0 and precision_0['PrecisionTop0Percent'] == 0

def test_one_to_one(self):
m = deepcopy(matches)
precision_50 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=50)})
assert 'PrecisionTop50Percent' in precision_50 and precision_50['PrecisionTop50Percent'] == 1.0

# Add multiple matches per column
pairs = list(m.keys())
for (ta, ca), (tb, cb) in pairs:
m[((ta, ca), (tb, cb + 'foo'))] = m[((ta, ca), (tb, cb))] / 2
precision = self.matches.get_metrics(self.ground_truth, metrics={Precision()})
precision_100 = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=100)})
assert 'PrecisionTop100Percent' in precision_100 and precision_100['PrecisionTop100Percent'] == precision['Precision']

# Verify that len gets corrected to 3
m_one_to_one = one_to_one_matches(m)
assert len(m_one_to_one) == 3 and len(m) == 6
precision_70_not_one_to_one = self.matches.get_metrics(self.ground_truth, metrics={PrecisionTopNPercent(n=70, one_to_one=False)})
assert 'PrecisionTop70Percent' in precision_70_not_one_to_one and precision_70_not_one_to_one['PrecisionTop70Percent'] == 0.75

# Verify that none of the lower similarity "foo" entries made it
for (ta, ca), (tb, cb) in pairs:
assert ((ta, ca), (tb, cb + 'foo')) not in m_one_to_one
def test_recall_at_size_of_ground_truth(self):
recall = self.matches.get_metrics(self.ground_truth, metrics={RecallAtSizeofGroundTruth()})
assert 'RecallAtSizeofGroundTruth' in recall and recall['RecallAtSizeofGroundTruth'] == 0.6

# Add one new entry with lower similarity
m_entry = deepcopy(matches)
m_entry[(('table_1', 'BLA'), ('table_2', 'BLA'))] = 0.7214057
def test_metric_helpers(self):
limit = 2
tp, fn = get_tp_fn(self.matches, self.ground_truth, n=limit)
assert tp <= len(self.ground_truth) and fn <= len(self.ground_truth)

m_entry_one_to_one = one_to_one_matches(m_entry)
fp = get_fp(self.matches, self.ground_truth, n=limit)
assert fp <= limit
assert tp == 2 and fn == 3 # Since we limit to 2 of the matches
assert fp == 0

# Verify that all remaining values are above the median
median = sorted(set(m_entry.values()), reverse=True)[math.ceil(len(m_entry)/2)]
for k in m_entry_one_to_one:
assert m_entry_one_to_one[k] >= median
def test_metric_equals(self):
assert PrecisionTopNPercent(n=10, one_to_one=False) == PrecisionTopNPercent(n=10, one_to_one=False)
assert PrecisionTopNPercent(n=10, one_to_one=False) != PrecisionTopNPercent(n=10, one_to_one=True)
assert PrecisionTopNPercent(n=10, one_to_one=False) != Precision()