Timeout issues with httpx.AsyncClient on a large number of requests (around 100k domains) #3338
-
Hello, I am encountering a problem with httpx.AsyncClient when making requests to a large number of different domains and subdomains (about 100,000). Many of these requests fail with timeouts, but when I reduce the number of requests to a smaller subset, they succeed without issues (status code 200).

Context:

```python
async with self.client:
    tasks = (self.check_single_domain(domain) for domain in domains)
    return await asyncio.gather(*tasks)
```

```python
async def check_single_domain(self, domain: str) -> list[DomainCheckResult]:
    tasks = (
        self._get(domain, f"{protocol}://{subdomain}{domain}")
        for protocol in self.PROTOCOLS
        for subdomain in self.SUBDOMAINS
    )
    return await asyncio.gather(*tasks)
```

```python
async def _get(self, domain: str, url: str) -> DomainCheckResult:
    try:
        response = await self.client.get(url)
        response.raise_for_status()
    except (httpx.HTTPError, httpx.InvalidURL, httpx.CookieConflict, httpx.StreamError) as exc:
        # Specific error handling
        ...
    except Exception as exc:
        # Generic error handling
        ...
    else:
        return something
```

What I’ve already tried:

```python
async with self.semaphore:
    response = await self.client.get(url)
```

Potential causes I’m considering:

I would greatly appreciate any suggestions on how to diagnose and fix this issue. Are there any specific configurations I should explore, or advice on how to better handle this load with httpx.AsyncClient? Thanks for your help!
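To give a concrete picture of the kind of configuration I am asking about, here is a simplified, self-contained sketch of an explicit timeout/limits setup with a semaphore capping concurrency. The timeout, pool-limit, and semaphore values below are placeholders, not my real settings:

```python
import asyncio

import httpx

# Placeholder values -- the real configuration may differ.
TIMEOUT = httpx.Timeout(10.0, connect=5.0)
LIMITS = httpx.Limits(max_connections=200, max_keepalive_connections=50)
MAX_CONCURRENCY = 500

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)


async def fetch(client: httpx.AsyncClient, url: str) -> int | None:
    # The semaphore caps how many requests are in flight at once, so the
    # connection pool and the remote servers are not hit by everything at once.
    async with semaphore:
        try:
            response = await client.get(url)
            return response.status_code
        except httpx.HTTPError:
            return None


async def check_all(urls: list[str]) -> list[int | None]:
    async with httpx.AsyncClient(timeout=TIMEOUT, limits=LIMITS) as client:
        return await asyncio.gather(*(fetch(client, url) for url in urls))
```

With this pattern the semaphore bounds the number of in-flight requests while the pool limits bound the number of open connections, but I am unsure which values make sense at this scale.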
-
I wasn't entirely convinced about the efficiency of using a single shared client for all requests, so I restructured the code. Here is an excerpt from the updated version, where the semaphore is initialized when the class is instantiated:

```python
self.semaphore = asyncio.Semaphore(1000)
```

Next, here is the `check_multiple_domains` method:

```python
async def check_multiple_domains(self, domains: pl.Series) -> list[list[DomainCheckResult]]:
    tasks = (self.check_single_domain(domain) for domain in domains)
    return await asyncio.gather(*tasks)
```

In `check_single_domain`, a dedicated client is now created for each domain:

```python
async def check_single_domain(self, domain: str) -> list[DomainCheckResult]:
    async with httpx.AsyncClient(
        follow_redirects=True,
        verify=False,
        headers=self.HEADERS,
        timeout=self.TIMEOUT,
        limits=self.LIMITS,
    ) as client:
        tasks = (
            self._get(domain, client, f"{protocol}://{subdomain}{domain}")
            for protocol in self.PROTOCOLS
            for subdomain in self.SUBDOMAINS
        )
        return await asyncio.gather(*tasks)
```

Finally, here is the `_get` method, where the semaphore caps the number of concurrent requests:

```python
async def _get(self, domain: str, client: httpx.AsyncClient, url: str) -> DomainCheckResult:
    try:
        async with self.semaphore:
            response = await client.get(url)
        response.raise_for_status()
    except:
        # Exception handling
        ...
```

It seems that this does improve things a little and the code seems to run faster. I would be happy to hear your feedback on this approach.
I continued investigating. After ruling out errors potentially caused by the operating system, I can confirm that it was indeed a rate-limiting policy imposed by the DNS servers. Aiohttp allows specifying a list of nameservers (via AsyncResolver passed to TCPConnector), which helps bypass the machine's DNS resolution settings and distribute the load across multiple servers.
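For reference, this is the aiohttp pattern I am describing: an `AsyncResolver` with explicit nameservers passed to a `TCPConnector` (it requires the aiodns package). The nameserver addresses below are just examples:

```python
import aiohttp
from aiohttp.resolver import AsyncResolver


async def fetch(url: str) -> int:
    # Resolve hostnames via explicit nameservers instead of the OS defaults,
    # spreading DNS load across several resolvers.
    resolver = AsyncResolver(nameservers=["1.1.1.1", "8.8.8.8"])
    connector = aiohttp.TCPConnector(resolver=resolver)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get(url) as response:
            return response.status
```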
This experience made me realize that httpx could improve by offering clearer error messages, especially for DNS-related issues, and by allowing users to specify custom DNS servers for resolution.