Crawl using Node.js fetch function directly #1300
Merged
Through `fetch-filecache-for-crawling`, Reffy used to depend on `node-fetch` to send HTTP requests. The `fetch-filecache-for-crawling` library now uses the native implementation of `fetch` in Node.js v18 and above. This update bumps the version of `fetch-filecache-for-crawling`, making Reffy use the native implementation of `fetch` in Node.js instead.

This update also removes the dependency on the `AbortController` polyfill, since it is no longer needed.

Apart from a couple of changes needed to account for slight differences in the way headers are handled, the main changes in this update are test-related: the `nock` library can no longer be used because it cannot intercept `fetch` requests, so the tests now rely on the mock functions of `undici` (which provides the implementation of `fetch` in Node.js).

From a crawling perspective, this update is a no-op. Note, however, that when certain specs are crawled (MathML Core in practice), Node.js may report memory leak warnings about the number of listeners attached to an `AbortSignal`.
These warnings are due to the fact that the same `AbortController` is (rightly) connected to all pending HTTP requests linked to the spec being crawled, and the MathML Core draft references over 100 embedded resources. In other words, that's all normal!

The `AbortSignal` implementation in Node.js does not directly inherit from `EventEmitter`. As far as I can tell, there is no direct way to call `setMaxListeners()` as a result.

If that proves annoying over time, we could instead adjust `events.defaultMaxListeners`, but that is a global setting: https://nodejs.org/dist/latest-v18.x/docs/api/events.html#eventsdefaultmaxlisteners

(Alternatively, it may be worth digging into MathML Core to check whether we actually need to download all of these resources.)
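
For anyone who wants to reproduce the listener warning without crawling anything, here is a minimal sketch that uses only Node.js built-ins. It attaches many `abort` listeners to a single shared signal, as a stand-in for many pending `fetch` requests tied to one `AbortController`, and counts the `MaxListenersExceededWarning` warnings that Node.js emits. The helper name `countAbortListenerWarnings` is hypothetical and not part of Reffy. The sketch also relies on the static `events.setMaxListeners(n, ...eventTargets)` function, which the Node.js documentation describes as accepting `EventTarget` instances. That may offer a per-signal alternative to the global setting, but it has not been tried against Reffy itself.

```javascript
const { setMaxListeners } = require('node:events');

// Hypothetical helper (not part of Reffy): attaches `listenerCount` abort
// listeners to one shared signal and resolves with the number of
// MaxListenersExceededWarning warnings that Node.js emitted meanwhile.
async function countAbortListenerWarnings(maxListeners, listenerCount) {
  const controller = new AbortController();

  // Per-signal limit, as opposed to the global events.defaultMaxListeners.
  setMaxListeners(maxListeners, controller.signal);

  let warnings = 0;
  const onWarning = (warning) => {
    if (warning.name === 'MaxListenersExceededWarning') {
      warnings += 1;
    }
  };
  process.on('warning', onWarning);

  // Stand-in for fetch(url, { signal }): each pending request is assumed to
  // register its own "abort" listener on the shared signal.
  for (let i = 0; i < listenerCount; i++) {
    controller.signal.addEventListener('abort', () => {});
  }

  // Warnings are emitted asynchronously; wait one turn of the event loop.
  await new Promise((resolve) => setImmediate(resolve));
  process.off('warning', onWarning);
  return warnings;
}
```

With a limit of 10 and 100 listeners, the helper should report at least one warning. With the limit raised to 200 for that one signal, it should report none. If that holds up in practice, the crawler could raise the limit on the signal it shares across a spec's requests, rather than changing `events.defaultMaxListeners` for the whole process.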