py-web-miner

Extensible Web Miner to extract information from web pages.

It is built on the Requests HTTP library, the Beautiful Soup parser, and Selenium WebDriver.

Usage Example

Use the Selenium WebDriver scraper. The chromedriver/geckodriver executable must be in a folder listed in the PATH environment variable (e.g. /usr/local/bin/):

from py_web_miner.scraping import SeleniumScraper

scraper_obj = SeleniumScraper(
    random_user_agent_flag=True,
    random_screen_resolution_flag=True,
    browser="chrome",  # "chrome" / "firefox"
    bs4_parser="html.parser",
    proxy=None
)
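
Because the Selenium scraper depends on a driver executable being reachable through PATH, it can help to check for it before constructing the scraper. The following is a minimal sketch using only the standard library; `find_driver` and `DRIVER_BY_BROWSER` are hypothetical names introduced for illustration and are not part of py_web_miner.

```python
import shutil

# Driver executable expected for each browser value ("chrome" / "firefox").
DRIVER_BY_BROWSER = {"chrome": "chromedriver", "firefox": "geckodriver"}


def find_driver(browser):
    """Return the full path of the driver for `browser`, or None if it is not on PATH.

    Hypothetical helper, not part of py_web_miner.
    """
    executable = DRIVER_BY_BROWSER[browser]
    # shutil.which searches the folders listed in the PATH environment variable.
    return shutil.which(executable)


if __name__ == "__main__":
    for name in DRIVER_BY_BROWSER:
        path = find_driver(name)
        print(f"{name}: {path or 'driver not found on PATH'}")
```

If the function returns None for the browser you intend to use, install the driver or add its folder to PATH before starting the scraper.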

Alternatively, use the Requests scraper (HTML parsing only, no JavaScript execution):

from py_web_miner.scraping import RequestsScraper

scraper_obj = RequestsScraper(
    random_user_agent_flag=True,
    random_screen_resolution_flag=True,
    bs4_parser="html.parser",
    proxy=None
)

Start the scraper, retrieve the HTML source and extract raw text and all external links:

# start the scraper object
scraper_obj.start()

# retrieve the HTML source
html_body = scraper_obj.retrieve_html(
    url="https://github.com/andrealenzi11",
    wait_seconds=1.0,
    delete_cookies_flag=True
)

# extract the raw text
extracted_text = scraper_obj.extract_text(
    html_body=html_body
)

# extract the links
extracted_links = scraper_obj.extract_links(
    html_body=html_body
)

# quit
scraper_obj.quit()
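
The steps above can also be wrapped so that quit() runs even when retrieval or extraction raises. This is a sketch, not part of the library: `scrape_page` is a hypothetical helper that calls only the methods and keyword arguments shown in the usage example, and it assumes quit() is safe to call after a failed request, which the source does not state.

```python
def scrape_page(scraper, url, wait_seconds=1.0):
    """Fetch `url` with an already-constructed scraper and return (text, links).

    Hypothetical helper: it works with any object exposing the start,
    retrieve_html, extract_text, extract_links and quit methods used above.
    """
    scraper.start()
    try:
        html_body = scraper.retrieve_html(
            url=url,
            wait_seconds=wait_seconds,
            delete_cookies_flag=True,
        )
        text = scraper.extract_text(html_body=html_body)
        links = scraper.extract_links(html_body=html_body)
        return text, links
    finally:
        # Runs on success and on error, so the scraper is always shut down.
        scraper.quit()
```

Since the helper relies only on method names, the same call works with either a SeleniumScraper or a RequestsScraper instance, for example scrape_page(scraper_obj, "https://github.com/andrealenzi11").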
