Real Estate Data Scraper Python Package

This Python package serves to collect real estate data from the web. It uses Scrapy to scrape the data, in combination with web-poet and scrapy-poet to decouple the scraping logic (the HTML structure of the different real estate websites) from the spider logic (the logic to collect the data).

Docker Setup

Configure the environment variables to be used by the Docker services:

cp .env.example .env

You can either continue with the default variable values, or modify them to your liking.

Build the images using docker-compose:

docker-compose build

Start the services:

docker-compose up db -d && docker-compose run --rm scraper bash

This opens a Bash session inside the container, in which you can interact with the Scrapy project. For example, to start the scraper on all supported websites, execute the following command:

scrapy crawl real_estate_spider

Local Development Setup

Installation

This project uses Poetry to manage Python packaging and dependencies. To install Poetry itself, please refer to the official docs.

Install the dependencies using the following command:

poetry install

Configuration

In order to persist the scraped items into a PostgreSQL database, please create src/db.cfg with the following contents:

[connection]
database =
host =
port =

[credentials]
user =
password =
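
For instance, a configuration for a local PostgreSQL instance might look as follows. These values are purely illustrative; substitute the settings of your own database (or of the db service from the Docker setup above):

[connection]
database = real_estate
host = localhost
port = 5432

[credentials]
user = postgres
password = postgres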

If you decide not to use the PostgreSQL pipeline, disable it in src/real_estate_scrapers/settings.py by commenting out the pipeline entry:

# src/real_estate_scrapers/settings.py
ITEM_PIPELINES = {
    # "real_estate_scrapers.pipelines.PostgresPipeline": 300,
}

Usage

As this package is a valid Scrapy project at its core, you can use it as you would use any other Scrapy project.

For the concrete use-case of our organization, we use the following command to run the project:

make run

This runs the project locally and persists the scraped items into the configured PostgreSQL database.
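
The Makefile itself is not reproduced here; assuming its run target simply wraps the spider invocation from the src directory (an assumption, not a verified reading of the Makefile), the equivalent manual commands would be:

cd src
scrapy crawl real_estate_spider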

Supported Real Estate Websites

The currently supported real estate websites include immowelt.at; the full, up-to-date list corresponds to the modules in the src/real_estate_scrapers/concrete_items directory.

Adding support for a new website

Thanks to web-poet and scrapy-poet, support for a new website can be added with minimal effort. Create a new .py file in the src/real_estate_scrapers/concrete_items directory and implement the RealEstateListPage and RealEstatePage classes. That's it! The registration of the implementation with the spider happens auto-magically.
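
As a rough illustration, such a file could look like the sketch below. Apart from RealEstateListPage and RealEstatePage, the import path, member names, and CSS selectors are hypothetical: the exact members to override depend on how the base classes are defined in this project, so treat this as a shape, not a drop-in implementation.

# src/real_estate_scrapers/concrete_items/example_site.py
# Hypothetical sketch: the import path, member names, and CSS selectors
# are assumptions; consult the actual base classes for the members to override.
from typing import List

from real_estate_scrapers.items import RealEstateListPage, RealEstatePage


class ExampleComRealEstateListPage(RealEstateListPage):
    """Handles listing-overview pages of the fictional example.com portal."""

    @property
    def real_estate_urls(self) -> List[str]:
        # Assuming the base class derives from web_poet.WebPage,
        # .css() is available as a shortcut on the page's response.
        return self.css("a.listing-link::attr(href)").getall()


class ExampleComRealEstatePage(RealEstatePage):
    """Extracts the fields of a single example.com listing page."""

    @property
    def price(self) -> float:
        raw = self.css("span.price::text").get() or "0"
        return float(raw.replace("€", "").replace(",", "").strip())

Once such a file exists under concrete_items, no further wiring is needed: the spider discovers and registers the implementation as described above.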

Crawling items only from a specific website

To avoid re-crawling every supported website, you can pass the -a only_domain=<domain> argument to the spider. For example, to crawl items only from the immowelt.at website, execute the following command from the src directory:

scrapy crawl real_estate_spider -a only_domain=immowelt.at
