rss-scraper

RSS Scraper is a test application which saves RSS feeds to a database and lets a user view and manage the feeds they've added to the system through an API.

Description

Feeds (and feed items) are updated by a background task: asynchronously, periodically, and unattended (a Celery beat sketch of this wiring is shown below).

There is also a notification service designed to support any notification type (it can be extended; only email is supported for now).
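
The extension point might look roughly like this (a minimal sketch; the class and method names are hypothetical, not necessarily the repo's actual ones):

# Sketch of an extensible notifier (hypothetical names).
from abc import ABC, abstractmethod
import smtplib
from email.message import EmailMessage

class Notifier(ABC):
    @abstractmethod
    def send(self, recipient: str, subject: str, body: str) -> None:
        """Deliver a notification; implemented once per channel."""

class EmailNotifier(Notifier):
    def __init__(self, server: str, port: int, login: str, password: str):
        self.server, self.port = server, port
        self.login, self.password = login, password

    def send(self, recipient: str, subject: str, body: str) -> None:
        msg = EmailMessage()
        msg["From"], msg["To"], msg["Subject"] = self.login, recipient, subject
        msg.set_content(body)
        # SMTP settings correspond to the SMTP_* variables listed below.
        with smtplib.SMTP(self.server, self.port) as smtp:
            smtp.starttls()
            smtp.login(self.login, self.password)
            smtp.send_message(msg)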

Technically, it's a Flask application with a PostgreSQL database and Celery with a Redis broker for asynchronous tasks.
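
The periodic feed update could be wired roughly like this (a minimal sketch; the task and module names are hypothetical, and the schedule values mirror the defaults listed in the Quick Start section):

# Sketch of the periodic-update wiring with Celery beat.
from celery import Celery

celery = Celery("worker", broker="redis://localhost:6379/0")

@celery.task
def update_all_feeds():
    # Fetch every followed feed and upsert its items; a failing feed
    # would be retried up to MAX_RETRIES times before giving up.
    ...

# Run the task every SCHEDULE_INTERVAL_SEC seconds (10 by default).
celery.conf.beat_schedule = {
    "update-feeds": {"task": "worker.update_all_feeds", "schedule": 10.0},
}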

Please keep in mind a few shortcuts I took to avoid wasting time since, as mentioned before, I currently have a full-time job:

  1. I didn't model every field of feeds and items; extending the models is easy but takes time.
  2. I didn't cover all endpoints with tests, but I covered some of them to give you an idea of my approach.
  3. I didn't have time to finish the tests for the background update service, but the service itself works well. If it turns out to be a decisive point, I can add them later.

Quick Start (production mode, Docker environment)

This repo provides a quick-start docker-compose file for hassle-free usage.

Requirements:

  • Docker
  • Docker Compose

Instructions:

  1. Clone the repo:
    git clone https://github.com/ilyashatalov/rss-scraper.git
  2. Start the application:
    docker-compose up -d

Don't worry if the app and celery services show errors right after the command starts; they will come up as soon as the container is built.

It will:

  1. Pull the Redis and PostgreSQL images from Docker Hub.
  2. Create a persistent Docker volume for PostgreSQL.
  3. Build containers for the main app, the Celery worker, and Celery beat.
  4. Wire the services together.
  5. Forward ports (port 80 for the main app). The PostgreSQL and Redis ports are forwarded too, but you can disable them by commenting out those lines in the docker-compose file; the application uses the Docker network for these connections by default.

You can provide your own configuration through the docker-compose file. Pass settings as environment variables to the docker services (this method has the highest priority). Examples (a compose-file sketch follows the list):

  • app:
    SQLALCHEMY_DATABASE_URI=<sql alchemy url>
  • celery worker and beat (default values are here):
    # SCHEDULER
    MAX_RETRIES=3
    SCHEDULE_INTERVAL_SEC=10
    
    # Celery
    CELERY_BROKER_URL=redis://localhost:6379/0
    
    # Notifier
    NOTIFICATION=False
    NOTIFICATION_TYPE=email
    SMTP_SERVER=
    SMTP_PORT=
    SMTP_LOGIN=
    SMTP_PASSWORD=
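
In the docker-compose file this could look roughly like the following sketch (the app and worker service names and the db/redis hostnames are assumptions; match them to the actual compose file, and substitute your own credentials):

services:
  app:
    environment:
      # Highest-priority override of the database URL.
      - SQLALCHEMY_DATABASE_URI=postgresql://rssapp:rssapp@db:5432/rssscraper
  worker:
    environment:
      - SCHEDULE_INTERVAL_SEC=10
      - CELERY_BROKER_URL=redis://redis:6379/0
      - NOTIFICATION=False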

API Reference

API documentation is built into the application and is available at http://127.0.0.1/apidocs/ once the main app is running.

Let's try some URLs (a Python version of these steps is sketched after the list):

  1. Follow a feed:
    curl -X POST localhost/feeds/follow -H 'Content-Type: application/json' -d '{"name": "first", "url": "https://feeds.feedburner.com/tweakers/mixed"}'
    Result:
    {
      "message": {
        "id": 1,
        "name": "first",
        "url": "https://feeds.feedburner.com/tweakers/mixed"
      },
      "success": true
    }
    If you send this request again, you will get an error (url and name must be unique):
    {
      "message": "name or url already exists",
      "success": false
    }
  2. Update items for a feed from its remote source. There are two ways: wait for the automatic update or force one. The follow method doesn't pull items from the feed during the /feeds/follow request, because that could be slow or fail; the client should manage it.
    curl localhost/feeds/1/update -X POST
    Result:
    {
      "success": true
    }
  3. Get items for this feed
    curl localhost/feeds/1/items
    Result:
    {
      "message": [
        {
          "id": 40,
          "last_updated": "2023-03-10 14:20:33",
          "title": "Videokaart Best Buy Guide - Update maart 2023",
          "unread": true,
          "url": "https://tweakers.net/reviews/10946/videokaart-best-buy-guide-update-maart-2023.html"
        },
        {
          "id": 39,
          "last_updated": "2023-03-10 14:20:33",
          "title": "Tweakers Podcast #259 - Roltelefoons, early access-scams en socialemediaverboden",
          "unread": true,
          "url": "https://tweakers.net/geek/207410/tweakers-podcast-259-roltelefoons-early-access-scams-en-socialemediaverboden.html"
        }, ...
      ],
    "success": true
    }
  4. All items are unread. Let's mark one as read:
    curl -X PATCH localhost/items/1 -d '{"unread": "false"}' -H 'Content-Type: application/json'
    Result:
    {
      "success": true
    }
  5. Now we can filter items by the unread flag:
    curl localhost/feeds/1/items?unread=false
    Result:
    {
      "message": [
        {
          "feed_id": 1,
          "id": 1,
          "last_updated": "2023-03-10 15:17:35",
          "title": "Nintendo toont nieuwe trailer The Super Mario Bros. Movie",
          "unread": false,
          "url": "https://tweakers.net/geek/207496/nintendo-toont-nieuwe-trailer-the-super-mario-bros-movie.html"
        }
      ],
      "success": true
    }
  6. Follow another feed and force an update:
    curl -X POST localhost/feeds/follow -H 'Content-Type: application/json' -d '{"name": "nu.nl", "url": "http://www.nu.nl/rss/Algemeen"}'
    curl localhost/feeds/2/update -X POST
    Mark one item from the new feed as read:
    curl -X PATCH localhost/items/41 -d '{"unread": "false"}' -H 'Content-Type: application/json'
    Get all read items (across all feeds):
    curl "localhost/items?unread=false"
    Result:
     {
       "message": [
         {
           "feed_id": 2,
           "id": 41,
           "last_updated": "2023-03-10 15:32:32",
           "title": "Kabinet schrapt per direct de laatste coronaregels",
           "unread": false,
           "url": "https://www.nu.nl/coronavirus/6252775/kabinet-schrapt-per-direct-de-laatste-coronaregels.html"
         },
         {
           "feed_id": 1,
           "id": 1,
           "last_updated": "2023-03-10 15:17:35",
           "title": "Nintendo toont nieuwe trailer The Super Mario Bros. Movie",
           "unread": false,
           "url": "https://tweakers.net/geek/207496/nintendo-toont-nieuwe-trailer-the-super-mario-bros-movie.html"
         }
       ],
       "success": true
     }
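
The same walkthrough can be scripted. Below is a sketch using Python's requests library, assuming the app is reachable on localhost (port 80) as in the quick start:

# Sketch: steps 1-5 of the walkthrough above, via the documented API.
import requests

BASE = "http://localhost"

# 1. Follow a feed.
r = requests.post(f"{BASE}/feeds/follow",
                  json={"name": "first",
                        "url": "https://feeds.feedburner.com/tweakers/mixed"})
feed_id = r.json()["message"]["id"]

# 2. Force an update of the feed's items.
requests.post(f"{BASE}/feeds/{feed_id}/update")

# 3. List the feed's items.
items = requests.get(f"{BASE}/feeds/{feed_id}/items").json()["message"]

# 4. Mark the first returned item as read.
requests.patch(f"{BASE}/items/{items[0]['id']}", json={"unread": "false"})

# 5. Ask for read items only.
read = requests.get(f"{BASE}/feeds/{feed_id}/items",
                    params={"unread": "false"})
print(read.json()["message"])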

Development setup

Requirements

First, you need Redis and PostgreSQL servers; you can use the provided docker-compose file again. Don't forget to change the variables in the compose file, e.g. the database credentials.

docker-compose up redis db -d

Install the project requirements and pytest:

pip install -r requirements.txt pytest

Configure environment variables:

cd app
cp env .env
vim .env
cd ../worker
cp env .env
vim .env
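
The variables are the ones shown in the Quick Start section. For example, the two .env files might contain something like the following (illustrative values; adjust hosts and credentials to your setup):

# app/.env (example)
SQLALCHEMY_DATABASE_URI=postgresql://rssapp:rssapp@localhost:5432/rssscraper

# worker/.env (example)
MAX_RETRIES=3
SCHEDULE_INTERVAL_SEC=10
CELERY_BROKER_URL=redis://localhost:6379/0
NOTIFICATION=False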

Start app

Start the app in dev mode:

flask run --reload

Start the Celery worker and beat:

celery -A worker.celery worker -B -l debug

Tests

Be careful: the database is cleaned up after every test, so it's better to use a separate database for test runs. This is simple with the dotenv configuration: just pass the URI as an environment variable (it takes priority). But don't forget to create the database first, for example as shown below.
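
One possible way, using the dockerized PostgreSQL from the compose file (the rssapp user matches the URI in the command below; adjust if your credentials or service name differ):

docker-compose exec db createdb -U rssapp rssscraper_test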

SQLALCHEMY_DATABASE_URI="postgresql://rssapp:rssapp@localhost:5432/rssscraper_test" pytest -s -W ignore::DeprecationWarning worker/tests.py

Result:

====================================== test session starts ==============================
platform darwin -- Python 3.10.1, pytest-7.2.2, pluggy-1.0.0
rootdir: /Users/ish/Git/github.com/ilyashatalov/rss-scraper
collected 5 items

app/tests.py .....

===================================== 5 passed in 0.42s ==================================
