About

This is Scrapy downloader middleware that stores response HTMLs to disk.

Usage

Turn downloader on, e.g. specifying it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

None of responses by default are saved to disk. You must select for which requests the response HTMLs will be saved:

def parse(self, response):
     """Processes start urls.

     Args:
         response (HtmlResponse): scrapy HTML response object.
     """
     yield scrapy.Request(
         'http://target.com',
         callback=self.parse_target,
         meta={
           'save_html': True,
         }
     )

The file path where HTML will be stored is resolved with spider method response_html_path. E.g.:

class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """
        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'

Configuration

HTML storage downloader middleware supports such options:

gzip_output (bool) - if True, HTML output will be stored in gzip format. Default is False.
save_html_on_status (list) - if not empty, sets list of response codes whitelisted for html saving. If list is empty or not provided, all response codes will be allowed for html saving.

Sample:

HTML_STORAGE = {
    'gzip_output': True,
    'save_html_on_status': [200, 202]
}

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
requirements		requirements
scrapy_html_storage		scrapy_html_storage
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
CHANGELOG.rst		CHANGELOG.rst
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.rst		README.rst
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Usage

Configuration

About

Releases

Packages

Contributors 2

Languages

License

povilasb/scrapy-html-storage

Folders and files

Latest commit

History

Repository files navigation

About

Usage

Configuration

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages