GitHub - chdavalas/distributed_noisy_seeds: A map-reduce implementation of the NoisySeeds algorithm as described in the paper of kazemi et al. "Growing a graph from a handful of seeds". An addition to this implementation is an experimental seed generation algorithm for the purpose of solving the problem of finding side information

Distributed NoisySeeds (DiNoiSe)

This is the PySpark implementation of DiNoiSe algorithm as described in the paper:

Charalampos Davalas, Dimitrios Michail, Iraklis Varlamis. Graph matching on social networks without any side information. In Proceedings of IEEE International Conference on Big Data, 2019

DiNoiSe is a map-reduce implementation of the NoisySeeds algorithm as described in the paper of Kazemi et al. "Growing a graph from a handful of seeds". The Distributed NoisySeeds (DiNoiSe) algorithm works upon a cluster of commodity hardware, whilst following the same logic as the original NoisySeeds algorithm. This implementation is written in PySpark (Python API for Apache Spark).

In addition, this implementation also includes an experimental seed generation algorithm, named SeGen, for the purpose of finding an initial subset of connections for the algorithm to kickstart. SeGen is an adaptation of the 1-dimensional Weisfeiler-Lehman graph isomorphism test (also known as Naive Vertex Refinement).

Background

The NoisySeeds is a Percolation Graph Matching (PGM) algorithm. PGM algorithms can provide an approximate solution to the graph matching problem. A common use of these algorithms is Network de-anonymization. More specifically, the algorithm can find shared users between two networks, which have been anonymized.

Instructions

You can use the DiNoiSe algorithm with the help of an isolated instance of python3, the algorithm has been written by using the Python API of Apache Spark (Pyspark)

Note that this algorithm works only for undirected graphs (and so does the majority of PGM algorithms). Therefore DiNoiSe can be used only for social networks which define a mutual relationship between nodes/users (Facebook friends) rather than a one-way relationship (Twitter followers).

Installation

clone project git clone https://github.com/chdavalas/distributed_noisy_seeds.git
change directory to project folder cd my/projects/directory/distributed_noisy_seeds
ensure python-pip has been installed sudo apt-get install python3-pip
ensure virtualenv has been installed pip3 install virtualenv
create new python3 environment which python3; virtualenv -p my/python3/directory environment_name
activate environment source environment_name/bin/activate
install suggested requirements and check if properly installed pip3 install -r requirements.txt; pip3 freeze

Test the algorithm

extract test_data.zip
run testing script and check data

The test_data offers a small set of test graphs with a predifined amount of edge overlap

spark-submit --master local[*] testing_script.py \
             --input test_data/graph_name/G1/{part-*.gz} test_data/graph_name/G2/{part-*.gz} \
             --input_seeds test_data/graph_name/seeds/{*.gz} \
             --bucketing (optional:use this flag if you want to use DiNoiSe with bucketing) \
             --seeds (how many seeds to generate, you should NOT use the "--input seeds" argument) \
             --parts (Apache Spark partitions)

cat results_log.csv; cat seeds_log.csv

The results are in .csv form { graph_name, time elapsed, coverage, accuracy, recall, F1-score }

Use DiNoiSe for finding matchings

Run DiNoiSe

spark-submit --master local[*] dinoise.py (OR dinoise_w_bucketing.py) \
             --input my/graph1/dir my/graph2/dir \
             --input_seeds my/seeds/dir \
             --output_dir my/output/dir\
             --seeds (how many seeds to generate, you should NOT use the "--input seeds" argument) \
             --parts (Apache Spark partitions)

Check for output

ls my/output/dir;

Read part of seeds and a part of matchings

head my/output/dir/segen_seeds/part-*

head my/output/dir/seeded_matching/part-*

OR

head my/output/dir/seedless_matching/part-*

Citation

Thanks in advance for citing us :)

@inproceedings{davalas2019,
   author={Davalas C. and Michail D. and Varlamis I.},
   year={2019},
   booktitle={Proceedings of the 2019 IEEE International Conference on Big Data},
   title={Graph matching on social networks without any side information},
   organization={IEEE}
}

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
LICENSE		LICENSE
README.md		README.md
dinoise.py		dinoise.py
dinoise_w_bucketing.py		dinoise_w_bucketing.py
ns_step.png		ns_step.png
requirements.txt		requirements.txt
test_data.zip		test_data.zip
testing_script.py		testing_script.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distributed NoisySeeds (DiNoiSe)

Background

Instructions

Citation

About

Releases

Packages

Languages

License

chdavalas/distributed_noisy_seeds

Folders and files

Latest commit

History

Repository files navigation

Distributed NoisySeeds (DiNoiSe)

Background

Instructions

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages