This repository contains the solution developed for the Data Pirates challenge proposed by Neoway. The challenge consists of collecting 500 titles (for each available genre) from The Internet Movie Database (IMDb), sorted by 'Rating', and outputting them as jsonl files (one for each genre).
- Docker
The solution was developed in Python 3.7 using the Scrapy framework. Scrapy is an application framework for writing web spiders that crawl websites and extract data from them [Scrapy FAQ].
The spider `TopRatedByGenreSpider` is responsible for fetching the IMDb Top Rated Movies pages (navigating the pagination until 500 titles are collected for each genre), extracting the necessary data (the titles), and yielding the crawled items to the processing pipeline. This spider receives two inputs: `genre` (the name of the genre to fetch) and `fetch_size` (the number of items to fetch, in descending order of 'Rating').
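A minimal sketch of what such a spider could look like is shown below. The URL pattern, CSS selectors, and item fields are assumptions for illustration only, not the exact implementation in this repository:

```python
# Rough sketch of a genre spider; selectors and URL are illustrative assumptions.
import scrapy


class TopRatedByGenreSpider(scrapy.Spider):
    name = "top_rated_by_genre"

    def __init__(self, genre=None, fetch_size=500, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.genre = genre
        self.fetch_size = int(fetch_size)
        self.collected = 0

    def start_requests(self):
        # IMDb title search for the genre, sorted by user rating (descending).
        url = (
            "https://www.imdb.com/search/title/"
            f"?genres={self.genre}&sort=user_rating,desc"
        )
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for header in response.css(".lister-item-header"):
            if self.collected >= self.fetch_size:
                return
            self.collected += 1
            yield {
                "genre": self.genre,
                "title": header.css("a::text").get(),
            }

        # Follow the pagination until enough titles were collected.
        next_page = response.css("a.lister-page-next::attr(href)").get()
        if next_page and self.collected < self.fetch_size:
            yield response.follow(next_page, callback=self.parse)
```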
In the next step, an item pipeline named `JsonWriterPipeline` receives the items crawled from IMDb and saves them to jsonl files.
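A minimal sketch of such a pipeline, assuming one output file per genre under `output/` (the file naming and item fields are assumptions):

```python
# Sketch of a JSON Lines writer pipeline; output path is an assumption.
import json
import os


class JsonWriterPipeline:
    def open_spider(self, spider):
        os.makedirs("output", exist_ok=True)
        # One file per genre, e.g. output/comedy.jsonl
        self.file = open(f"output/{spider.genre}.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

For the pipeline to run, it has to be enabled in the project's `ITEM_PIPELINES` setting.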
I chose to run the application inside a Docker container in order to simplify the development environment setup and to avoid version mismatches of Python or the libraries used.
docker build -t imdb_crawler .
docker run -v "$(pwd)/output:/usr/src/app/output" imdb_crawler
Since the crawler runs inside the Docker container, the output files are created inside it, and we need to get them out of the container. To do that, I created a shared volume at the root folder of the application, named `output`, and mapped it to the container folder `/usr/src/app/output` (where the application stores the jsonl files inside the container). The `$(pwd)/output` in the `docker run` command specifies the folder on the host (our machine, in this case) where we want the output to appear.
Unit testing with Scrapy is a subject of debate. Some argue that testing against fake HTML responses is not enough to guarantee that the spiders are crawling correctly, since you need to check them against the real state of the HTML.
Scrapy offers an integrated way of testing spiders by means of contracts [Scrapy Contracts]. In this solution I used some contracts in `TopRatedByGenreSpider` to check that the crawling works correctly against real data. In the Dockerfile you can uncomment a `CMD` that runs the `scrapy check` command.
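As an illustration of how contracts work, the sketch below shows a callback annotated in its docstring; `scrapy check` fetches the `@url` page once and verifies the callback's output against the declared ranges. The spider name, URL, and expected counts here are assumptions, not the actual contracts used in this repository:

```python
# Hedged illustration of Scrapy contracts in a callback's docstring.
import scrapy


class ContractExampleSpider(scrapy.Spider):
    name = "contract_example"

    def parse(self, response):
        """Parse one page of IMDb results for a genre.

        @url https://www.imdb.com/search/title/?genres=comedy&sort=user_rating,desc
        @returns items 1 50
        @returns requests 0 1
        """
        for header in response.css(".lister-item-header"):
            yield {"title": header.css("a::text").get()}

        next_page = response.css("a.lister-page-next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```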