Scrape wiki pages and find the minimum number of links between two wiki pages

tranlv/wiki-link

wikilink


wikilink is a multiprocessing web-scraping application that scrapes wiki pages, extracts their URLs, and finds the minimum number of links between two given wiki pages.

The project implements the "six degrees of separation" problem for Wikipedia described in Web Scraping with Python. You can find more details about the project in my blog.
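
To make the "extract URLs" step concrete, here is a minimal sketch using only the Python standard library. It is not wikilink's own code: the `WikiLinkParser` class, the `extract_wiki_links` helper, and the rule of keeping only `/wiki/` links without a namespace colon are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WikiLinkParser(HTMLParser):
    """Collect article links found in <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article links start with "/wiki/" and contain no
        # namespace colon (which filters out "File:", "Help:", and so on).
        if href and href.startswith("/wiki/") and ":" not in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicates, keep first-seen order
                self.links.append(url)


def extract_wiki_links(html, base_url="https://en.wikipedia.org"):
    """Return the absolute article URLs linked from an HTML string."""
    parser = WikiLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The sketch works on an HTML string, so fetching the page is left to whichever HTTP client you prefer.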

The project is currently at version v0.3.0.post1. See the change log for the release history.

Table of contents

  1. Usage
  2. Contribution
  3. License

Usage

Install with pip

$ pip install wikilink

Database support

wikilink needs access to a database to store the extracted URLs. It currently supports MySQL and PostgreSQL.

API

setup_db(db, username, password, ip="127.0.0.1", port=3306): set up the database

Args:
	db(str): database engine; currently supports "mysql" and "postgresql"
	username(str): database username
	password(str): database password
	ip(str): IP address of the database (default="127.0.0.1")
	port(str): port that the database is running on (default=3306)

Returns:
	None

min_link(source, destination, limit=6, multiprocessing=False): find the minimum number of links from the source URL to the destination URL, within the limit

Args:
	source(str): source wiki URL, e.g. "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
	destination(str): destination wiki URL, e.g. "https://en.wikipedia.org/wiki/Lionel_Messi"
	limit(int): max number of links from the source that will be considered (default=6)
	multiprocessing(boolean): enable/disable multiprocessing mode (default=False)

Returns:
	(int) minimum number of degrees of separation between the source and destination URLs
	None, with a printed message, if the limit is exceeded or no path is found

Raises:
	DisconnectionError: error connecting to DB

Examples

>>> from wikilink import WikiLink
>>> app = WikiLink()
>>> app.setup_db("mysql", "root", "12345", "127.0.0.1", "3306")
>>> source = "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
>>> destination = "https://en.wikipedia.org/wiki/Lionel_Messi"
>>> app.min_link(source, destination, 6)
1
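
To illustrate what "minimum number of links within a limit" means, here is a minimal breadth-first search sketch over a small in-memory link graph. It is only an illustration, not wikilink's implementation: `min_link_sketch` and `toy_graph` are hypothetical names, and the real package scrapes pages and stores the extracted URLs in a database rather than reading a dictionary.

```python
from collections import deque


def min_link_sketch(graph, source, destination, limit=6):
    """Return the fewest links from source to destination, or None.

    graph maps each page to the pages it links to (a hypothetical
    stand-in for the scraped link data).
    """
    if source == destination:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == limit:
            continue  # do not follow links beyond the limit
        for linked_page in graph.get(page, ()):
            if linked_page == destination:
                return depth + 1
            if linked_page not in visited:
                visited.add(linked_page)
                queue.append((linked_page, depth + 1))
    return None  # no path found within the limit


# Toy link graph: A links to B and C, both link to D, and D links to E.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}
```

Because breadth-first search visits pages in order of distance from the source, the first time it reaches the destination is along a shortest path. With the toy graph, `min_link_sketch(toy_graph, "A", "E")` returns 3, and the same call with `limit=2` returns None.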

Contribution

How to contribute

Please follow the conventions described in the contribution instructions and the code of conduct.

To set up the development environment, simply run:

$ pip install -r requirements.txt

Please check the issues page for a list of issues that need help. Also, feel free to add your name to the list of contributors.

Appreciation

If you like this project, you can buy me a pizza to motivate me to keep improving it.

You can also vote for the project to make it more visible to others.


License

See the LICENSE file for license rights and limitations (Apache License 2.0).