aiohttp-like interface to chromium
based on selenium_driverless to bypass cloudflare
working prototype
aiohttp_chromium is a drop-in replacement for aiohttp
import asyncio

#import aiohttp
import aiohttp_chromium as aiohttp

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            print(resp.status)
            print(await resp.text())

asyncio.run(main())
see also
handling file downloads with selenium is too verbose, and too complex to integrate into selenium, so this is a wrapper for selenium
i wanted a "stupid http client", so it has the same interface as aiohttp.client, and handling web pages has lower priority, so the selenium interface is hidden in response._driver
when creating new tabs, or when switching between tabs, the chromium window grabs focus
this is an issue with the window manager
workaround for the KDE plasma desktop: move the chromium window to a different desktop, and focus some window
chromium seems to have no command line switch to disable this focus-grabbing
possible solutions
- run chromium in a LD_PRELOAD wrapper
- binary patching of the chromium executable
- configure the window manager
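The LD_PRELOAD option above could be wired up roughly as follows. This is only a sketch of the launch side: the shim library (the path `/path/to/nofocus.so` in the usage note) is hypothetical and does not exist in this project, and the helper names `preload_env` and `launch_chromium` are made up for illustration. Which functions such a shim would have to override to stop the focus-grabbing is left open here.

```python
import os
import subprocess

def preload_env(shim_path, base_env=None):
    """Return a copy of the environment with shim_path prepended to LD_PRELOAD.

    shim_path is a hypothetical shared library, not something this project ships.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list of libraries
    env["LD_PRELOAD"] = f"{shim_path}:{existing}" if existing else shim_path
    return env

def launch_chromium(shim_path, chromium="chromium", args=()):
    """Start chromium with the shim preloaded (assumes 'chromium' is on PATH)."""
    return subprocess.Popen([chromium, *args], env=preload_env(shim_path))
```

Usage would be something like `launch_chromium("/path/to/nofocus.so")`. Only the process environment is changed, so the chromium executable itself stays unpatched.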
- remove tempfiles on session close and on error
- add support for streams: request streams, response streams
- currently, session.get only works for "short and small" requests and responses, but not for infinite streams - implementing this is non-trivial, because chromium does not expose streams over the Chrome DevTools Protocol (CDP)
- kaliiiiiiiiii/Selenium-Driverless#123
- i guess this is very deliberate sabotage, to prevent "abusing" chromium as a generic http client, which is pretty much what we are trying to do here...
- wkeeling/selenium-wire#656 (comment)
- sounds like we need either a patched version of chromium, or a dynamic analysis tool like frida to insert hooks into the chromium binary, so that all requests and responses are piped through a local http proxy, for passive tracing and active intercepting of https traffic
- tracing https traffic with frida
- https://gaiaslastlaugh.medium.com/frida-as-an-alternative-to-network-tracing-5173cfbd7a0b
- https://andydavies.me/blog/2019/12/12/capturing-and-decrypting-https-traffic-from-ios-apps/
- https://stackoverflow.com/questions/46711786/android-hooking-https-traffic-using-frida
- https://frida.re/docs/frida-trace/
- https://groups.google.com/g/chrome-debugging-protocol/c/w65z0cMqgvc - Fetch.fulfillRequest and (very) long body
- there's no streaming support for Fetch network interception
- there is Fetch.takeResponseBodyAsStream and IO.read, but not Fetch.giveResponseBodyAsStream and IO.write
- there is Network.takeResponseBodyForInterceptionAsStream and IO.read, but not Network.giveResponseBodyForInterceptionAsStream and IO.write
- google has hidden the discussion: "You don't have permission to access this content. For access, try contacting the group's owners and managers"
- see snapshot from archive.org 2024-06-23
- hey google? thanks for reminding us that google is a bunch of fascists, engaging in sabotage and censorship
- https://issues.chromium.org/issues/332570739 - Streaming body for Fetch.fulfillRequest() CDP API
- Fetch.fullfillRequest() only provides an option to set the 'body' response as a base64-encoded string. Of course, this does not work well for larger response body. Similar to the streaming takeResponseBodyAsStream(), it would be great if there was a fullfillRequest() option with a stream, fullfillRequestWithStream()
- Perhaps this could be done by expanding the IO APIs to have a IO.write() option that allows sending a streaming data to the browser. I realize this is probably fairly low-priority, but would make Fetch request interception more efficient, especially when dealing with larger responses/chunked response of unknown size, etc...
- The feature request makes sense but currently it is a low priority for us.
- see snapshot
- graphical interface where the user can solve challenges: captchas, unexpected responses, ...
- integration with captcha solving services
- remove unfree dependencies
- selenium_driverless - cc by-nc-sa license
- selenium_driverless is a high-level wrapper for the Chrome DevTools Protocol (CDP) - NOT based on the chromedriver binary, because chromedriver is detected by cloudflare
- see also Awesome Chrome DevTools # Libraries for driving the protocol (or a layer above)
- https://github.com/pyppeteer/pyppeteer - 3K stars
- https://github.com/fake-name/ChromeController - 200 stars
- https://github.com/chazkii/chromewhip - 120 stars
- selenium_driverless - cc by-nc-sa license
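The asymmetry described in the streaming notes above (a response body can be read in chunks with IO.read, but Fetch.fulfillRequest needs the whole body as one base64 string) can be sketched like this. The `send` callable is an assumption: it stands for any function that sends one CDP command and returns the result dict, and it is not the actual API of selenium_driverless or of this project. The helper names are made up for illustration.

```python
import base64

def iter_cdp_stream(send, handle, chunk_size=65536):
    """Yield the bytes of a CDP stream handle chunk by chunk, using IO.read.

    send(method, params) -> dict is a hypothetical CDP transport.
    handle is the stream handle, e.g. from Fetch.takeResponseBodyAsStream.
    """
    try:
        while True:
            result = send("IO.read", {"handle": handle, "size": chunk_size})
            data = result.get("data", "")
            if data:
                if result.get("base64Encoded"):
                    yield base64.b64decode(data)
                else:
                    yield data.encode("utf-8")
            if result.get("eof"):
                break
    finally:
        # release the stream on the browser side
        send("IO.close", {"handle": handle})

def fulfill_request_params(request_id, body, status=200):
    """Build parameters for Fetch.fulfillRequest.

    There is no streaming variant, so the complete body must be in memory
    and is sent as a single base64 string.
    """
    return {
        "requestId": request_id,
        "responseCode": status,
        "body": base64.b64encode(body).decode("ascii"),
    }
```

Reading works for bodies of any finite size, one chunk at a time. The write side has no counterpart: base64 grows the body by about one third, and a response of unknown or infinite length cannot be expressed at all.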
grep -r -w FIXME src/
grep -r -w TODO src/
- web scraper
- chromium
- aiohttp
- web scraping
- asyncio
- bypass cloudflare
- headful scraper
- headful web scraper
- headful chromium
- gui scripting
- headful webscraper
- selenium driverless