glob("*") does not support matching non-utf8 filenames #11916

Closed
2 tasks
lilyball opened this issue Jan 29, 2014 · 7 comments
Labels
A-unicode Area: Unicode E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion.

Comments

@lilyball
Contributor

glob::glob() currently has no support for matching non-utf8 filenames. Its patterns are restricted to strings, and it explicitly skips any non-utf8 filename it encounters, even though such a name should at least be able to match a * pattern.

Tasks that need to be done:

  • glob() needs to accept both strings and byte-vectors. It can do this using std::path::BytesContainer.
  • glob() needs to process its pattern as a byte vector instead of a string, which will allow it to process filenames as byte vectors too. This includes matching non-utf8 filenames against the * and ? tokens. For ?, matching a single byte is acceptable; ideally, it would match however many bytes the Unicode standard says should be consumed to produce one U+FFFD REPLACEMENT CHARACTER.
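The byte-oriented matching described in the second task could look roughly like the sketch below. It is written against current Rust rather than the 2014 standard library, and glob_match_bytes is a hypothetical name, not a function of the glob library. It takes the simple option from the task list: ? consumes exactly one byte, with no attempt to consume a whole invalid sequence. Character classes and path separators are ignored.

```rust
/// Hypothetical helper (not part of the glob library): match `name`
/// against `pattern` byte by byte.
/// `*` matches any run of bytes (including none); `?` matches exactly one byte.
fn glob_match_bytes(pattern: &[u8], name: &[u8]) -> bool {
    let (mut p, mut n) = (0, 0);
    // Position of the most recent `*` and the name index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;

    while n < name.len() {
        if p < pattern.len() && pattern[p] == b'*' {
            // Record the star and first try matching it against nothing.
            backtrack = Some((p, n));
            p += 1;
        } else if p < pattern.len() && (pattern[p] == b'?' || pattern[p] == name[n]) {
            // Literal byte or `?`: consume one byte on both sides.
            p += 1;
            n += 1;
        } else if let Some((star_p, star_n)) = backtrack {
            // Mismatch: let the last `*` absorb one more byte and retry.
            backtrack = Some((star_p, star_n + 1));
            p = star_p + 1;
            n = star_n + 1;
        } else {
            return false;
        }
    }
    // Only trailing `*` tokens may remain in the pattern.
    pattern[p..].iter().all(|&b| b == b'*')
}
```

Because both arguments are byte slices, a name such as the single byte 0xFF can match *, and a pattern can also name a specific invalid byte literally, for example b"a*\xFF".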

This is a sub-task of #9639.

@lilyball
Contributor Author

An alternative approach is to wait until std::str::from_utf8() can replace invalid byte sequences with U+FFFD REPLACEMENT CHARACTER, and then simply use that to match filenames against the string pattern. This is deficient in two ways:

  1. You cannot write a pattern that intentionally matches a particular non-utf8 sequence, and
  2. A pattern that embeds a literal U+FFFD REPLACEMENT CHARACTER would match arbitrary non-utf8 sequences, which it should not.

For this reason, the approach outlined in the issue description is recommended.
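Both deficiencies can be seen in a small sketch. It uses String::from_utf8_lossy from current Rust's standard library as a stand-in for the replacement capability discussed above (an assumption; that function is not named in this thread), and the helper names are hypothetical.

```rust
/// Hypothetical helper (not a glob API): the text a lossy-decoding matcher
/// would compare against the pattern for a raw filename.
fn lossy_name(raw: &[u8]) -> String {
    // Every invalid byte sequence is replaced with U+FFFD.
    String::from_utf8_lossy(raw).into_owned()
}

/// True when two different raw filenames become indistinguishable
/// after lossy decoding.
fn collide_after_lossy(a: &[u8], b: &[u8]) -> bool {
    a != b && lossy_name(a) == lossy_name(b)
}
```

Under these assumptions the filenames [0xFF] and [0xFE] collide, so no string pattern can select one and not the other. Each also collides with a file whose name really is U+FFFD (bytes EF BF BD), so a pattern containing a literal U+FFFD would match all three.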

@lilyball
Contributor Author

Test case from @flaper87:

#[test]
#[cfg(not(windows))]
fn test_non_utf8_glob() {
    let dir = tempfile::TempDir::new("").unwrap();
    let p = dir.path().join(&[0xFFu8]);
    fs::mkdir(&p, S_IRWXU as u32);

    let pat = p.with_filename("*");
    assert_eq!(glob(pat.as_str().expect("tmpdir is not utf-8")).collect::<~[Path]>(), ~[p])
}

This also needs to be disabled on OS X, although perhaps we should do the opposite and simply enable it only on Linux.
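For readers on current Rust, the path setup in that test might be sketched as follows, gated to Linux as suggested. This is an illustration under assumptions, not the test that was eventually committed: non_utf8_child is a hypothetical helper, and it relies on std::os::unix::ffi::OsStrExt, which postdates this thread. It only builds the path and does not touch the filesystem.

```rust
/// Hypothetical helper: build `dir/<0xFF>`, a path whose final component
/// is a single byte that is not valid UTF-8.
#[cfg(target_os = "linux")]
fn non_utf8_child(dir: &std::path::Path) -> std::path::PathBuf {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;

    // On Unix an OsStr is an arbitrary byte string, so 0xFF is accepted as-is.
    dir.join(OsStr::from_bytes(&[0xFF]))
}
```

Such a path has no string form (to_str() returns None), while replacing its file name with "*" gives a pattern that is plain UTF-8 whenever the parent directory is. That is why the test can pass the pattern as a string while the expected match cannot be one.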

@flaper87
Contributor

@kballard Thanks for putting this together. As discussed on IRC, I'm setting the mentor tag on you 😄

@flaper87
Contributor

@kballard I'll work on this

@ghost ghost assigned flaper87 Jan 30, 2014
@pzol pzol added the A-unicode label Feb 26, 2014
@flaper87 flaper87 removed their assignment Apr 6, 2014
@alexcrichton
Member

cc @nick29581, could this move to rust-lang/glob?

@rust-highfive
Collaborator

This issue has been moved to the RFCs repo: rust-lang/glob#23

@alexcrichton
Member

Thanks!


5 participants