Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crawl using Node.js fetch function directly #1300

Merged
merged 1 commit into from
May 30, 2023
Merged

Crawl using Node.js fetch function directly #1300

merged 1 commit into from
May 30, 2023

Conversation

tidoust
Copy link
Member

@tidoust tidoust commented May 27, 2023

Through fetch-filecache-for-crawling, Reffy used to depend on node-fetch to send HTTP requests. The fetch-filecache-for-crawling library now uses the native implementation of fetch in Node.js v18 and above. This update bumps the version of fetch-filecache-for-crawling, making Reffy use the native implementation of fetch in Node.js instead.

This update also removes the dependency on the AbortController polyfill, since it is no longer needed.

On top of a couple of changes needed to account for slight differences in the way headers are handled, main changes in this update are test-related, to replace the nock library, which can no longer be used because it cannot intercept fetch requests, with the mock functions of undici (which provides the implementation of fetch in Node.js).

From a crawling perspective, this update is a no-op, although it should be noted that, when certain specs are crawled (MathML Core in practice), Node.js may report memory leak warnings such as:

MaxListenersExceededWarning: Possible EventTarget memory leak detected. 101
abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase
limit

These are due to the fact that the same AbortController is (rightly) connected to all pending HTTP requests linked to the spec being crawled, and the MathML Core draft references over 100 embedded resources. In other words, that's all normal!

The AbortSignal implementation in Node.js does not directly inherit from EventEmitter. As far as I can tell, there is no direct way to call setMaxListeners() as a result.

If that proves annoying over time, we could rather adjust events.defaultMaxListeners but that is a global setting: https://nodejs.org/dist/latest-v18.x/docs/api/events.html#eventsdefaultmaxlisteners

(Alternatively, it may be worth digging into MathML Core to check whether we actually need to download all of these resources).

Through `fetch-filecache-for-crawling`, Reffy used to depend on `node-fetch` to
send HTTP requests. The `fetch-filecache-for-crawling` library now uses the
native implementation of `fetch` in Node.js v18 and above. This update bumps the
version of `fetch-filecache-for-crawling`, making Reffy use the native
implementation of `fetch` in Node.js instead.

This update also removes the dependency on the `AbortController` polyfill, since
it is no longer needed.

On top of a couple of changes needed to account for slight differences in the
way headers are handled, main changes in this update are test-related, to
replace the `nock` library, which can no longer be used because it cannot
intercept `fetch` requests, with the mock functions of `undici` (which provides
the implementation of `fetch` in Node.js).

From a crawling perspective, this update is a no-op, although it should be noted
that, when certain specs are crawled (MathML Core in practice), Node.js may
report memory leak warnings such as:

```
MaxListenersExceededWarning: Possible EventTarget memory leak detected. 101
abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase
limit
```

These are due to the fact that the same `AbortController` is (rightly) connected
to all pending HTTP requests linked to the spec being crawled, and the MathML
Core draft references over 100 embedded resources. In other words, that's all
normal!

The `AbortSignal` implementation in Node.js does not directly inherit from
`EventEmitter`. As far as I can tell, there is no direct way to call
`setMaxListeners()` as a result.

If that proves annoying over time, we could rather adjust
`events.defaultMaxListeners` but that is a global setting:
https://nodejs.org/dist/latest-v18.x/docs/api/events.html#eventsdefaultmaxlisteners

(Alternatively, it may be worth digging into MathML Core to check whether we
actually need to download all of these resources).
@tidoust tidoust merged commit e0d13fe into main May 30, 2023
@tidoust tidoust deleted the drop-node-fetch branch May 30, 2023 05:48
tidoust added a commit that referenced this pull request Jun 7, 2023
New feature:
- Crawl using Node.js fetch function directly (#1300)

Dependencies bumped:
- Bump rollup from 3.23.0 to 3.24.0 (#1305)
- Bump puppeteer from 20.3.0 to 20.5.0 (#1299, #1302)
- Bump web-specs from 2.59.0 to 2.59.1 (#1304)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants