WDFAP is a Python tool that enables you to fetch, clean, prepare, label, and upload articles from web sources in various formats such as csv, json, xlsx, and parquet. It provides you with a versatile way to access, analyze, and manage diverse sets of data.
Note: currently, only the fetching feature is available.
To get started with WDFAP, follow these steps:
- Clone this repository to your local machine:
git clone git@github.com:IsmaelMousa/WDFAP.git
- Navigate to the WDFAP directory:
cd WDFAP
- Set up a virtual environment:
python3 -m venv .venv
- Activate the virtual environment:
source .venv/bin/activate
- Set up WDFAP:
make setup
WDFAP provides a user-friendly interface for fetching articles. For now, you can choose to fetch articles from Wikipedia, Google News, or both simultaneously. The fetched data is stored in the data/ directory in different formats for easy access and analysis.
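Fetching from both sources "simultaneously" can be pictured with a small asyncio sketch. This is not WDFAP's actual code: `fetch_wikipedia`, `fetch_google_news`, `FETCHERS`, and `fetch_articles` are hypothetical names, and the fetchers only simulate network latency instead of contacting any site.

```python
import asyncio

# Hypothetical stand-ins for the real source modules: they sleep briefly
# to simulate network latency and return made-up records.
async def fetch_wikipedia(topic: str) -> list[dict]:
    await asyncio.sleep(0.01)
    return [{"source": "wikipedia", "title": f"{topic} (encyclopedia entry)"}]


async def fetch_google_news(topic: str) -> list[dict]:
    await asyncio.sleep(0.01)
    return [{"source": "google_news", "title": f"{topic} in the news"}]


# Assumed mapping from a source name to its fetcher coroutine.
FETCHERS = {"wikipedia": fetch_wikipedia, "google_news": fetch_google_news}


async def fetch_articles(topic: str, sources: list[str]) -> list[dict]:
    """Run the selected fetchers concurrently and flatten their results."""
    # gather() schedules all coroutines at once and returns the results
    # in the same order as the requested sources.
    batches = await asyncio.gather(*(FETCHERS[name](topic) for name in sources))
    return [article for batch in batches for article in batch]


if __name__ == "__main__":
    articles = asyncio.run(fetch_articles("Python", ["wikipedia", "google_news"]))
    for article in articles:
        print(article["source"], "-", article["title"])
```

Passing a single source name fetches from that source only; passing both lets the two requests overlap rather than run one after the other.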
- Run WDFAP:
make start
- After that, the terminal asks you a few questions before fetching starts.
Here is a summary of the purpose of each major module or component in WDFAP:
Module | Purpose |
---|---|
tools | Provides utility functions and scripts for orchestrating the fetching, cleaning, labeling, and uploading of data from various sources. Initially includes a script for user interaction to fetch articles from web sources asynchronously. |
sources | Provides modules for fetching articles asynchronously from different sources such as Google News and Wikipedia. |
data | Storage location where fetched articles are kept in various formats such as csv, json, xlsx, and parquet. |
errors | Defines custom exceptions for handling specific issues. |
utils | Houses common utilities and logic used throughout the project. |
configs | Contains the main configurations for both the development and production stages. |
setup.py | Configures the project metadata and dependencies for streamlined installation. |
main.py | Serves as the entry point, initiating the project. |
Makefile | Provides commands for installing dependencies and running the application. |
requierments.txt | Lists all the required dependencies for running the application. |
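To illustrate the role of the data directory described in the table, here is a minimal sketch that writes the same records as csv and json using only the standard library. The function name `save_articles`, the file stem `articles`, and the record fields are assumptions made for illustration; the project's own storage code may look different.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_articles(articles: list[dict], directory: Path, stem: str = "articles") -> dict[str, Path]:
    """Write the same records as CSV and JSON and return the paths created."""
    if not articles:
        raise ValueError("at least one article is required")

    directory.mkdir(parents=True, exist_ok=True)
    csv_path = directory / f"{stem}.csv"
    json_path = directory / f"{stem}.json"

    # CSV: one row per article, columns taken from the first record's keys.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(articles[0].keys()))
        writer.writeheader()
        writer.writerows(articles)

    # JSON: the full list of records in a single file.
    json_path.write_text(json.dumps(articles, ensure_ascii=False, indent=2), encoding="utf-8")
    return {"csv": csv_path, "json": json_path}


if __name__ == "__main__":
    sample = [{"title": "Example article", "source": "wikipedia"}]  # made-up record
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_articles(sample, Path(tmp) / "data")
        print(paths["csv"].read_text(encoding="utf-8"))
```

The xlsx and parquet formats are not covered here because they need third-party packages; with pandas installed, `DataFrame.to_excel` (backed by openpyxl) and `DataFrame.to_parquet` (backed by pyarrow or fastparquet) would be the usual route.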
Here is an overview of the dependencies used in WDFAP, along with what each one is used for:
Dependency | Usage |
---|---|
beautifulsoup4 | Offers powerful tools for parsing and navigating HTML documents, simplifying the extraction of structured data from web pages. |
newspaper | Simplifies the extraction and curation of articles from online sources, streamlining the process of gathering news content. |
feedparser | Parses RSS and Atom feeds, enabling extraction of syndicated content from websites and blogs. |
asyncio | Facilitates asynchronous I/O operations, allowing for concurrent execution of tasks without blocking the event loop. |
aiohttp | Provides asynchronous HTTP client/server functionality for asyncio, enabling efficient handling of web requests and responses. |
pandas | Provides high-performance data manipulation and analysis tools, ideal for working with structured datasets. |
tqdm | Enhances loops with progress bars, providing visual feedback on the progress of iterative tasks, improving user experience and productivity. |
openpyxl | Facilitates reading from and writing to Excel files, enabling manipulation of spreadsheet data with Python. |
pyarrow | Provides tools for working with Apache Arrow data, an in-memory columnar data format, offering efficient data interchange between different systems. |
fastparquet | Offers efficient reading and writing of Parquet files, a columnar storage format optimized for analytics workloads, enabling high-performance data processing. |
For now, the available web sources are Wikipedia and Google News.
We appreciate your interest in contributing to our project! Your contributions help us improve and grow.
Please check Contributing for the contribution guidelines, and make sure to read the CODE_OF_CONDUCT document.