Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lychee don't panic, invalid URL's are fine. #539

Closed
vipulgupta2048 opened this issue Mar 3, 2022 · 25 comments
Closed

Lychee don't panic, invalid URL's are fine. #539

vipulgupta2048 opened this issue Mar 3, 2022 · 25 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed waiting-for-feedback

Comments

@vipulgupta2048
Copy link

From the pleasant discussion with @mre here lycheeverse/lychee-action#82, I intend to work on trying to parse the input URL before passing it to reqwest in order to prevent lychee from panicking, skip the invalid URL and move on to the next one.

Bonus:

  • If we can track these invalidated URL's and show them to users at the end so that they are able to fix or ignore them for future runs

Looking for feedback and @vipulgupta2048 will work on this soon (The email will remind me to work on this)

@mre
Copy link
Member

mre commented Mar 3, 2022

Go for it! Just post questions here and we'll try to help. I like the bonus task, but don't stress it too much if it takes longer than expected. We can always to a follow-up PR. 😉

@lebensterben
Copy link
Member

If the cause of panic is known (reqwest), we can use catch-unwind somewhere in the code.

https://doc.rust-lang.org/stable/std/panic/fn.catch_unwind.html

@mre
Copy link
Member

mre commented Mar 3, 2022

Oh that's a really good idea! There's probably no performance overhead except in case of panic, which is perfect for our use-case. @vipulgupta2048 I don't know how far you got already, but could you try wrapping the client call in a catch_unwind as @lebensterben mentioned? 🥺

@vipulgupta2048
Copy link
Author

I will give it a shot soon, haven't checked it out yet.

@mre mre added bug Something isn't working good first issue Good for newcomers labels Mar 10, 2022
@dend
Copy link

dend commented Mar 17, 2022

Heya folks @mre @vipulgupta2048 - any luck with the changes? Looks like the panic is still a thing (and the PR is taking some time to be reviewed). Is there a temporary workaround one can apply?

@vipulgupta2048
Copy link
Author

I don't mean to hold this up, I was expecting to get some time to write a patch for this but haven't been lucky enough. I will unassign this from me in order to unblock it for others to try out till then.

@mre
Copy link
Member

mre commented Mar 18, 2022

@vipulgupta2048, no worries and thanks for looking into it.

Originally I thought this was possible after the discussion with @lebensterben above:

 let result = panic::catch_unwind(|| self.reqwest_client.execute(request));

let fut = match result {
    Ok(s) => s,
    Err(_) => return Status::Error(ErrorKind::Panic),
};

match fut.await {
    Ok(ref response) => Status::new(response, self.accepted.clone()),
    Err(e) => e.into(),
}

but I'm getting an error:

   --> lychee-lib/src/client.rs:482:22
    |
482 |         let result = panic::catch_unwind(|| self.reqwest_client.execute(request));
    |                      ^^^^^^^^^^^^^^^^^^^ `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
    |

So I searched around a bit and this was recommended:

match self.reqwest_client.execute(request).catch_unwind().await? {
    Ok(ref response) => Status::new(response, self.accepted.clone()),
    Err(e) => e.into(),
}

It polls the future in the background.
But I'm getting another error:

error[E0277]: the type `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary
   --> lychee-lib/src/client.rs:482:22
    |
482 |         let result = panic::catch_unwind(|| self.reqwest_client.execute(request));
    |                      ^^^^^^^^^^^^^^^^^^^ `UnsafeCell<tokio::util::linked_list::Pointers<tokio::time::driver::entry::TimerShared>>` may contain interior mutability and a reference may not be safely transferrable across a catch_unwind boundary

There is comment in the original thread:

This works when may_panic, as shown in the example is a future created in this module. However it does not work if the code in may_panic waits on another future which was returned by another method where futures:FutureExt was not brought into scope. In that case, catch_unwind complains that the future is not transferrable across a unwind boundary because the future returned by the async function itself waits on that other future which is not transferrable. Any ideas what can be done to resolve this?

(emphasis mine)

No clue how to resolve this. :/

@mre
Copy link
Member

mre commented Mar 18, 2022

Is there a temporary workaround one can apply?

@dend, the only workaround we have at the moment is to exclude the panicking URLs. 😞

@lebensterben
Copy link
Member

At this moment we need to wait for the upstream to merge the fix or use a patched dependency.

@dend
Copy link

dend commented Mar 18, 2022

ACK - thank you so much, @mre and @lebensterben. There are hundreds of URLs that panic for me (mostly ones from Archive.org that are of the form https://web.archive.org/some_stub/https://my_archived_url), so I'll wait for the upstream change to be merged for now. Thank you for all your hard work and help!

@mre
Copy link
Member

mre commented Mar 18, 2022

Well, the other alternative is to parse the URIs like reqwest would do and return an error before passing it to reqwest. That would be a waste of resources and be useless for most URLs, but at least it wouldn't panic. That was our plan in the beginning.

@mre
Copy link
Member

mre commented Mar 18, 2022

I've created a PR to work around the panics. Would love to get some feedback.

@dend,
@lebensterben mentioned that URLs like https://web.archive.org/some_stub/https://my_archived_url don't panic and I can confirm that.
Perhaps you're seeing a normal status error?
Can you provide a URL that panics for you and attach the lychee output here?

@dend
Copy link

dend commented Mar 19, 2022

@mre here is an example that I saw pop up today:

thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(InvalidUriChar)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/reqwest-0.11.8/src/into_url.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Cannot send request: SendError(Ok(Request { uri: Uri { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("cdn.den.dev")), port: None, path: "/images/postmedia/intercepting-ios-traffic/trust-warning.png", query: None, fragment: None } }, source: FsPath("./dennisdel/content/blog/intercepting-iphone-traffic-mac-for-free.md"), element: Some("img"), attribute: Some("src") }))', lychee-bin/src/commands/check.rs:85:14

Or another one:

thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(InvalidUriChar)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/reqwest-0.11.8/src/into_url.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Cannot send request: SendError(Ok(Request { uri: Uri { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("docs.microsoft.com")), port: None, path: "/dotnet/api/system.diagnostics.debug.print", query: None, fragment: None } }, source: FsPath("./dennisdel/content/blog/using-the-google-weather-api-pros-and-cons-so-far.md"), element: Some("a"), attribute: Some("href") }))', lychee-bin/src/commands/check.rs:85:14

@lebensterben
Copy link
Member

I don't understand why the URLs are invalid.

@dend
Copy link

dend commented Mar 19, 2022

I don't understand why the URLs are invalid.

Absolutely the same - they are resolving properly and work with no issue in the browser.

@mre
Copy link
Member

mre commented Mar 19, 2022

Yeah they look fine. Maybe the issue is with a different link that was sent before. It panics and then Tokio closes the channel and after that it tries to send the legit links above but the channel is already closed.

@dend
Copy link

dend commented Mar 22, 2022

Here's an example of how this is panicking with an Archive.org URL today:

thread 'tokio-runtime-worker' panicked at 'a parsed Url should always be a valid Uri: InvalidUri(InvalidUriChar)', /cargo/registry/src/github.com-1ecc6299db9ec823/reqwest-0.11.8/src/into_url.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'Cannot send request: SendError(Ok(Request { uri: Uri { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("web.archive.org")), port: None, path: "/web/20070312014327/https://developer.yahoo.com/weather/", query: None, fragment: None } }, source: FsPath("./dennisdel/content/blog/using-the-google-weather-api-pros-and-cons-so-far.md"), element: Some("a"), attribute: Some("href") }))', lychee-bin/src/commands/check.rs:85:14
+ exit_code=101

@mre
Copy link
Member

mre commented Mar 22, 2022

Weird, it works for me.

> lychee --version
lychee 0.9.0
> echo 'https://web.archive.org/web/20070312014327/https://developer.yahoo.com/weather/' | lychee -
🔍 1 Total ✅ 1 OK 🚫 0 Errors

Can you try that?

@lebensterben
Copy link
Member

btw, this reminds us we should no longer ignore panic docs. tbh I don't even realize it could panic here.

@mre
Copy link
Member

mre commented Mar 22, 2022

@lebensterben do you mean like this? #561

@lebensterben
Copy link
Member

@mre
Exactly. We should stop ignoring any missing panic docs or error docs in lychee-lib before it gets to 1.0.

@dend
Copy link

dend commented Mar 22, 2022

BTW, it's entirely possible that my GitHub action is just not configured properly because I am trying to exclude a folder from being checked:

- name: Load Excludes
  run: |
    LYCHEE_FILE_STACK=$(find . -not \( -path ./dennisdel/content/blog/drafts -prune \) -name '*.md' -print0 |xargs --null)
    echo "LYCHEE_FILE_STACK=$LYCHEE_FILE_STACK" >> $GITHUB_ENV
- name: Link Checker
  uses: lycheeverse/lychee-action@master
  with:
    args: --verbose --accept=200,429 --format markdown --exclude-file .lycheeignore --exclude-mail -- ${{ env.LYCHEE_FILE_STACK }}
  env:
    GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}

I don't think that's the root cause, but maybe it's part of the challenge?

My exclude file: https://gist.github.com/dend/5da053ec400004776f24b6d2872f8475

@lebensterben
Copy link
Member

I don't think that's the root cause

I don't think that's the cause. The panics consistently results from making a invalid request. though I've no idea why they are invalid...

@mre mre added the help wanted Extra attention is needed label Jul 11, 2022
@mre
Copy link
Member

mre commented Oct 25, 2022

@dend do you still encounter this issue with the latest lycheeverse/lychee-action@master? If so I'll keep this issue open.

@mre
Copy link
Member

mre commented Nov 5, 2022

Closing. Feel free to reopen if the problem still exists.

@mre mre closed this as completed Nov 5, 2022
mre added a commit that referenced this issue Apr 25, 2024
With the upgrade to `reqwest` 0.12, we can finally handle a long-standing
issue, when Urls could not be parsed to Uris. Previously, we would panic, but
we can now handle that situation gracefully and return an error instead.

I've also renamed `Status::is_failure` to `Status::is_error`, because the
notion of failures no longer exists in the codebase and we use the term "error"
consistently throughout the codebase instead. This is technically a breaking
change in the API, but it's fine since we have not released a stable version
yet.

More information about the URI parsing issue:
- #539
- seanmonstar/reqwest#668
mre added a commit that referenced this issue Apr 25, 2024
With the upgrade to `reqwest` 0.12, we can finally handle a long-standing
issue, when Urls could not be parsed to Uris. Previously, we would panic, but
we can now handle that situation gracefully and return an error instead.

I've also renamed `Status::is_failure` to `Status::is_error`, because the
notion of failures no longer exists in the codebase and we use the term "error"
consistently throughout the codebase instead. This is technically a breaking
change in the API, but it's fine since we have not released a stable version
yet.

More information about the URI parsing issue:
- #539
- seanmonstar/reqwest#668
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed waiting-for-feedback
Projects
None yet
Development

No branches or pull requests

4 participants