WDFAP: Web Data Fetcher And Preparer

WDFAP is a Python tool for fetching, cleaning, preparing, labeling, and uploading articles from web sources, with output in formats such as csv, json, xlsx, and parquet. It gives you a versatile way to access, analyze, and manage diverse datasets.

Note

Currently, only the fetching feature is available.

Installation

To get started with WDFAP, follow these steps:

Clone this repository to your local machine:

git clone git@github.com:IsmaelMousa/WDFAP.git

Navigate to the WDFAP directory:

cd WDFAP

Set up a virtual environment:

python3 -m venv .venv

Activate the virtual environment:

source .venv/bin/activate

Set up WDFAP:

make setup

Usage

The WDFAP provides a user-friendly interface for fetching articles. For now, you can choose to fetch articles from Wikipedia, Google News, or both simultaneously. The fetched data is stored in the data/ directory in different formats for easy access and analysis.
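To make the idea of storing the same fetched articles in several formats concrete, here is a minimal sketch that uses only the Python standard library. It is not WDFAP's actual implementation: the `save_articles` function, the record fields (`source`, `title`, `content`), and the `articles` file stem are all hypothetical. It covers csv and json only; xlsx and parquet would additionally need the `openpyxl` and `pyarrow` or `fastparquet` packages listed under Dependencies.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_articles(articles, out_dir, stem="articles"):
    """Write the same records as CSV and JSON; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"

    # Hypothetical schema: every record is a dict with the same keys.
    fieldnames = list(articles[0].keys()) if articles else []

    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(articles)

    with json_path.open("w", encoding="utf-8") as handle:
        json.dump(articles, handle, ensure_ascii=False, indent=2)

    return csv_path, json_path


# Illustrative records only; not real fetched data.
sample = [
    {"source": "wikipedia", "title": "Example A", "content": "Text A"},
    {"source": "google_news", "title": "Example B", "content": "Text B"},
]

# Write into a temporary "data" directory, then read both files back.
with tempfile.TemporaryDirectory() as tmp:
    csv_file, json_file = save_articles(sample, Path(tmp) / "data")
    loaded = json.loads(json_file.read_text(encoding="utf-8"))
    with csv_file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
```

Reading the files back shows that both formats carry the same records, which is the point of offering several formats: the choice depends on the tool that consumes the data, not on the content.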

Run the WDFAP:

make start

After that, the terminal asks you a few questions.

Modules

Here is a summary of the purpose of each major module or component in WDFAP:

Module	Purpose
`tools`	Provides utility functions and scripts for orchestrating the fetching, cleaning, labeling, and uploading of data from various sources. Currently includes an interactive script for fetching articles from web sources asynchronously.
`sources`	Provides modules for fetching articles asynchronously from different sources like Google News & Wikipedia.
`data`	Storage location where fetched articles are saved in formats such as `csv`, `json`, `xlsx`, and `parquet`.
`errors`	Prepares and customizes exceptions for handling specific issues.
`utils`	Houses common utilities/logic utilized throughout the project.
`configs`	Contains main configurations for both development and production stages.
`setup.py`	Configures the project metadata and dependencies for streamlined installation.
`main.py`	Serves as the entry point, initiating the project.
`Makefile`	Provides commands for installing dependencies and running the application.
`requirements.txt`	Lists all the required dependencies for running the application.
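The `sources` and `errors` rows describe asynchronous fetching and custom exceptions. The sketch below shows one way these two ideas can fit together with `asyncio.gather`. It is an illustration under stated assumptions, not the project's code: `FetchError`, `fetch_wikipedia`, `fetch_google_news`, and `fetch_all` are hypothetical names, and the fetchers simulate network latency with `asyncio.sleep` instead of making real requests.

```python
import asyncio


class FetchError(Exception):
    """Hypothetical custom exception raised when a source cannot be fetched."""


async def fetch_wikipedia(query):
    # Stand-in for a real network call; the sleep simulates I/O latency.
    await asyncio.sleep(0.01)
    return [{"source": "wikipedia", "title": f"{query} (article)"}]


async def fetch_google_news(query):
    # Stand-in for a real network call; fails on an empty query.
    await asyncio.sleep(0.01)
    if not query:
        raise FetchError("empty query")
    return [{"source": "google_news", "title": f"{query} (news)"}]


async def fetch_all(query):
    """Run both fetchers concurrently and merge whatever succeeded."""
    results = await asyncio.gather(
        fetch_wikipedia(query),
        fetch_google_news(query),
        return_exceptions=True,  # collect errors instead of cancelling siblings
    )
    articles, failures = [], []
    for result in results:
        if isinstance(result, FetchError):
            failures.append(result)
        elif isinstance(result, BaseException):
            raise result  # unexpected errors should not be swallowed
        else:
            articles.extend(result)
    return articles, failures


articles, failures = asyncio.run(fetch_all("python"))
```

With `return_exceptions=True`, a failure in one source does not discard the results from the other, so the caller can report the failed source and still keep the articles that were fetched.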

Dependencies

Here is an overview of the dependencies/packages used in the WDFAP along with their respective usage:

Dependency	Usage
`beautifulsoup4`	Offers powerful tools for parsing and navigating HTML documents, simplifying the extraction of structured data from web pages.
`newspaper`	Simplifies the extraction and curation of articles from online sources, streamlining the process of gathering news content.
`feedparser`	Parses RSS and Atom feeds, enabling extraction of syndicated content from websites and blogs.
`asyncio`	Facilitates asynchronous I/O operations, allowing for concurrent execution of tasks without blocking the event loop.
`aiohttp`	Provides asynchronous HTTP client/server functionality for asyncio, enabling efficient handling of web requests and responses.
`pandas`	Provides high-performance data manipulation and analysis tools, ideal for working with structured datasets.
`tqdm`	Enhances loops with progress bars, providing visual feedback on the progress of iterative tasks, improving user experience and productivity.
`openpyxl`	Facilitates reading from and writing to Excel files, enabling manipulation of spreadsheet data with Python.
`pyarrow`	Provides tools for working with Apache Arrow data, an in-memory columnar data format, offering efficient data interchange between different systems.
`fastparquet`	Offers efficient reading and writing of Parquet files, a columnar storage format optimized for analytics workloads, enabling high-performance data processing.
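To illustrate what parsing a syndicated feed yields, here is a small sketch that extracts titles and links from an RSS document. Note the substitution: it uses the standard library's `xml.etree.ElementTree` rather than `feedparser`, so that it runs without third-party packages, and it does not show how WDFAP itself calls `feedparser`. The inline sample feed and the `parse_rss_items` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document, inlined so the example needs no network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element with its title and link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default="").strip(),
                "link": item.findtext("link", default="").strip(),
            }
        )
    return items


entries = parse_rss_items(SAMPLE_RSS)
```

A dedicated feed parser handles far more than this sketch does, such as Atom feeds, malformed XML, and date normalization, which is a reasonable motivation for depending on one.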

Sources

For now, the available web sources are Wikipedia and Google News.

Contributing

We appreciate your interest in contributing to our project! Your contributions help us improve and grow.

Please check the Contributing guide for the contribution guidelines, and make sure to read the CODE_OF_CONDUCT document.
