This Python package collects real estate data from the web. It uses Scrapy for crawling, in combination with web-poet and scrapy-poet to decouple the scraping logic (the HTML structure of the different real estate websites) from the spider logic (the logic to collect the data).
Configure the environment variables to be used by the Docker services:
cp .env.example .env
You can either continue with the default variable values, or modify them to your liking.
Build the images using docker-compose:
docker-compose build
Start the services:
docker-compose up db -d && docker-compose run --rm scraper bash
A Bash session will be opened inside the container, in which you can interact with the Scrapy project. For example, to start the scraper on all supported websites, execute the following command:
scrapy crawl real_estate_spider
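For a quick look at the scraped items, Scrapy's standard feed exports also work here; for example, to additionally write the items to a JSON file:
scrapy crawl real_estate_spider -o items.json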
This project uses Poetry to manage Python packaging and dependencies. To install Poetry itself, please refer to the official docs.
Install the dependencies using the following command:
$ poetry install
In order to persist the scraped items into a PostgreSQL database, please create src/db.cfg with the following contents:
[connection]
database =
host =
port =
[credentials]
user =
password =
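For local development, a filled-in file might look like the following. The values are purely illustrative defaults, not settings shipped with the project; use your own connection details:
[connection]
database = real_estate
host = localhost
port = 5432
[credentials]
user = postgres
password = secret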
If you decide not to use the PostgreSQL pipeline, edit src/real_estate_scrapers/settings.py accordingly:
# src/real_estate_scrapers/settings.py
ITEM_PIPELINES = {
# "real_estate_scrapers.pipelines.PostgresPipeline": 300,
}
As this package is a valid Scrapy project at its core, you can use it as you would use any other Scrapy project.
For the concrete use case of our organization, we use the following command to run the project:
make run
This will run the project locally, and will persist the scraped items into the configured PostgreSQL database.
The currently supported real estate websites are:
Thanks to web-poet and scrapy-poet, it is possible to add support for a new website with minimal effort. One needs to create a new .py file in the src/real_estate_scrapers/concrete_items directory and implement the RealEstateListPage and RealEstatePage classes for that site. That's it! The registration of the implementation with the spider is done auto-magically.
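As a rough orientation, a new site module could be structured as follows. This is only a sketch: the import path, the overridden method and property names, and the CSS selectors are assumptions made for illustration (including the assumption that the base classes expose Scrapy-style .css() selector shortcuts, as web-poet page objects typically do), and must be adapted to the actual base-class API and to the target site's HTML.
# src/real_estate_scrapers/concrete_items/example_site.py
# Illustrative sketch only: the import path and the overridden members below
# (domain, start_urls, real_estate_urls, price) are assumptions about the
# base-class API, not its documented interface.
from real_estate_scrapers.items import RealEstateListPage, RealEstatePage


class ExampleSiteListPage(RealEstateListPage):
    """Collects listing URLs from example.com result pages."""

    @staticmethod
    def domain() -> str:
        return "example.com"

    @staticmethod
    def start_urls() -> list[str]:
        return ["https://www.example.com/search/apartments"]

    def real_estate_urls(self) -> list[str]:
        # The selector depends entirely on the site's HTML structure.
        return self.css("a.listing-link::attr(href)").getall()


class ExampleSitePage(RealEstatePage):
    """Extracts the fields of a single example.com listing."""

    @property
    def price(self) -> float:
        # Hypothetical selector and normalization; adapt to the real markup.
        raw = self.css("span.price::text").get(default="0")
        return float(raw.replace("€", "").replace(",", "").strip())
Once such a module exists in concrete_items, the spider should pick it up without further wiring, as described above.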
In order to avoid re-running the crawling for every single supported website, one can pass the -a only_domain=<domain> argument to the spider. For example, to crawl items only from the immowelt.at website, the command to be executed from the src directory is:
scrapy crawl real_estate_spider -a only_domain=immowelt.at