Similarius is a Python library that compares web pages and evaluates their level of similarity.
The tool can be used as a stand-alone tool or to feed other systems.
Requirements:
- Python 3.8+
- Requests
- Scikit-learn
- Beautifulsoup4
- nltk
Similarius can be installed with Poetry. If you don't have Poetry installed, you can install it with the following command:
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
Then install Similarius and run it from the Poetry shell:
$ poetry install
$ poetry shell
$ similarius -h
Alternatively, Similarius can be installed with pip:
$ pip3 install similarius
$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
Similarius can also be used as a library. The following script compares one original website against a list of other websites:

import argparse
from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
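To give an intuition for the two measurements printed by the script above, here is a minimal, standard-library-only sketch of a text similarity score and a resource difference score. It is not Similarius's implementation: the functions `bag_of_words_cosine` and `jaccard_distance` are hypothetical stand-ins, and the choice of cosine similarity over word counts and of Jaccard distance over resource sets is an assumption made purely for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors (illustrative stand-in)."""
    words_a = Counter(re.findall(r"\w+", text_a.lower()))
    words_b = Counter(re.findall(r"\w+", text_b.lower()))
    # Dot product over the words that appear in both texts.
    dot = sum(words_a[w] * words_b[w] for w in words_a.keys() & words_b.keys())
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    if norm_a == 0 or norm_b == 0:
        # At least one text has no words, so no similarity can be measured.
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_distance(resources_a: set, resources_b: set) -> float:
    """Share of resources not common to both pages (0.0 means identical sets)."""
    union = resources_a | resources_b
    if not union:
        return 0.0
    return 1.0 - len(resources_a & resources_b) / len(union)


if __name__ == "__main__":
    # Made-up page text and resource paths, for demonstration only.
    original_text = "Welcome to the example portal"
    candidate_text = "Welcome to the example login portal"
    print(f"Text similarity: {bag_of_words_cosine(original_text, candidate_text):.2f}")

    original_res = {"/static/app.js", "/static/style.css", "/img/logo.png"}
    candidate_res = {"/static/app.js", "/static/style.css", "/img/other.png"}
    print(f"Resource distance: {jaccard_distance(original_res, candidate_res):.2f}")
```

A page that copies the text of the original but loads different scripts or images would score high on the first measure and also high on the second, which is why looking at both values together can be more informative than either one alone.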
The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.