# scrapy-impersonate

`scrapy-impersonate` is a Scrapy download handler. This project integrates `curl_cffi` to perform HTTP requests, so it can impersonate browsers' TLS signatures and JA3 fingerprints.
```shell
pip install scrapy-impersonate
```
Replace the default `http` and/or `https` download handlers through `DOWNLOAD_HANDLERS`:
```python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
```
Also, be sure to install the asyncio-based Twisted reactor:

```python
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```
Set the `impersonate` `Request.meta` key to download a request using `curl_cffi`:
```python
import scrapy


class ImpersonateSpider(scrapy.Spider):
    name = "impersonate_spider"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_impersonate.ImpersonateDownloadHandler",
            "https": "scrapy_impersonate.ImpersonateDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        for browser in ["chrome110", "edge99", "safari15_5"]:
            yield scrapy.Request(
                "https://tls.browserleaks.com/json",
                dont_filter=True,
                meta={"impersonate": browser},
            )

    def parse(self, response):
        # ja3_hash: 773906b0efdefa24a7f2b8eb6985bf37
        # ja3_hash: cd08e31494f9531f560d64c695473da9
        # ja3_hash: 2fe1311860bc318fc7f9196556a2a6b9
        yield {"ja3_hash": response.json()["ja3_hash"]}
```
You can pass any additional arguments to `curl_cffi` through `impersonate_args`. For example:
```python
yield scrapy.Request(
    "https://tls.browserleaks.com/json",
    dont_filter=True,
    meta={
        "impersonate": browser,
        "impersonate_args": {
            "verify": False,
            "timeout": 10,
        },
    },
)
```
Any browser target supported by `curl_cffi` can be impersonated, such as the `chrome110`, `edge99`, and `safari15_5` values used above.
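When rotating fingerprints across many requests, one common pattern is to pick a target per request and pass it through `meta`. A minimal stdlib-only sketch (the browser list here is a hypothetical subset; consult `curl_cffi` for the current list of supported targets):

```python
import random

# Hypothetical subset of impersonation targets; scrapy-impersonate
# forwards whatever target string curl_cffi accepts.
BROWSERS = ["chrome110", "edge99", "safari15_5"]


def impersonate_meta():
    """Build a Request.meta dict with a randomly chosen browser target."""
    return {"impersonate": random.choice(BROWSERS)}
```

In a spider, this helper could be used as `scrapy.Request(url, meta=impersonate_meta())` so each request presents a different fingerprint.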
This project is inspired by the following projects:

- curl_cffi - Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.
- curl-impersonate - A special build of curl that can impersonate Chrome & Firefox.
- scrapy-playwright - Playwright integration for Scrapy.