Crawl using Node.js fetch function directly (#1300)
Through `fetch-filecache-for-crawling`, Reffy used to depend on `node-fetch` to
send HTTP requests. The `fetch-filecache-for-crawling` library now uses the
native implementation of `fetch` in Node.js v18 and above. This update bumps the
version of `fetch-filecache-for-crawling`, making Reffy use the native
implementation of `fetch` in Node.js instead.

This update also removes the dependency on the `AbortController` polyfill, since
it is no longer needed.

Apart from a couple of changes needed to account for slight differences in the
way headers are handled, the main changes in this update are test-related: the
`nock` library cannot intercept `fetch` requests and can no longer be used, so
tests now rely on the mock functions of `undici` (which provides the
implementation of `fetch` in Node.js).

From a crawling perspective, this update is a no-op, although it should be noted
that, when certain specs are crawled (MathML Core in practice), Node.js may
report memory leak warnings such as:

```
MaxListenersExceededWarning: Possible EventTarget memory leak detected. 101
abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase
limit
```

These warnings appear because the same `AbortController` is (rightly) connected
to all pending HTTP requests linked to the spec being crawled, and the MathML
Core draft references over 100 embedded resources. In other words, that's all
normal!
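
The pattern described above can be sketched as follows. This is a minimal
illustration, not Reffy's actual code: `fetchAllResources` and its injectable
`fetchFn` parameter are hypothetical names introduced for the example. Each
`fetch` call registers its own `abort` listener on the shared signal, which is
how a spec with more than 100 resources ends up triggering the warning.

```javascript
// Hypothetical helper: fetch every resource of a spec with one shared
// AbortController, so that a single failure cancels all pending requests.
async function fetchAllResources(urls, fetchFn = fetch) {
  const controller = new AbortController();
  try {
    // Every request receives the same signal
    return await Promise.all(
      urls.map(url => fetchFn(url, { signal: controller.signal }))
    );
  } catch (err) {
    // Abort the requests that are still pending, then report the failure
    controller.abort();
    throw err;
  }
}
```

Passing a stub as `fetchFn` makes it possible to exercise the cancellation
logic without touching the network.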

The `AbortSignal` implementation in Node.js does not directly inherit from
`EventEmitter`. As far as I can tell, there is no direct way to call
`setMaxListeners()` as a result.
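
For reference, the `events` module in Node.js also exposes a standalone
`setMaxListeners(n, ...eventTargets)` function (the one named in the warning),
which is documented to accept `EventTarget` instances such as an `AbortSignal`.
The sketch below shows how the limit might be raised on a single signal rather
than globally. `raiseAbortListenerLimit` is a hypothetical helper name, and
this has not been checked against Reffy's crawl: the `fetch` implementation may
apply its own limit to the signal.

```javascript
const { setMaxListeners } = require('node:events');

// Hypothetical helper: raise the listener limit of one AbortSignal only,
// leaving the global default untouched.
function raiseAbortListenerLimit(signal, limit) {
  setMaxListeners(limit, signal);
  return signal;
}

// Usage: allow up to 200 "abort" listeners on a shared controller
const controller = new AbortController();
raiseAbortListenerLimit(controller.signal, 200);
```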

If that proves annoying over time, we could instead adjust
`events.defaultMaxListeners`, but that is a global setting:
https://nodejs.org/dist/latest-v18.x/docs/api/events.html#eventsdefaultmaxlisteners

(Alternatively, it may be worth digging into MathML Core to check whether we
actually need to download all of these resources).
tidoust committed May 30, 2023
1 parent 7574e98 commit e0d13fe
Showing 7 changed files with 821 additions and 282 deletions.