Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[path] Also match parent dirs in include/exclude #4256

Closed
wants to merge 1 commit into from

Conversation

behnam
Copy link
Contributor

@behnam behnam commented Jul 7, 2017

NOTE: This is a major change in pattern matching for include and
exclude fields, and can result in additional inclusion/exclusion for
some patterns.

Previously, for inclusion/exclusion matters, Cargo only works with paths
of files in a package/repository, and glob pattern matching has been
applying only to these file paths.

The old behavior results in some unexpected behavior. For example,
having:

exclude = ["data"]

in a manifest next to a data directory, it will not exclude the
directory. To make it work, a pattern must be provided that matches the
files under this directory, like:

exclude = ["data/*"]

To make Cargo's inclusion/exclusion behavior more intutional, and bring
it on par with similar systems, like gitignore, we need to also match
these patterns with the directories. The directories are seen
internally as parents of the files. Therefore, this diff expands the
pattern matching to files and their parent directories.

Now, it's easier to exclude all data files:

exclude = ["data"]

or include only the src directory:

include = ["src"]

Fixes #3578

NOTE: This is a major change in pattern matching for `include` and
`exclude` fields, and can result in additional inclusion/exclusion for
some patterns.

Previously, for inclusion/exclusion matters, Cargo only works with paths
of files in a package/repository, and glob pattern matching has been
applying only to these file paths.

The old behavior results in some unexpected behavior. For example,
having:

```toml
exclude = ["data"]
```

in a manifest next to a `data` directory, it will not exclude the
directory. To make it work, a pattern must be provided that matches the
*files* under this directory, like:

```toml
exclude = ["data/*"]
```

To make Cargo's inclusion/exclusion behavior more intutional, and bring
it on par with similar systems, like `gitignore`, we need to also match
these patterns with the *directories*. The directories are seen
internally as *parents* of the files. Therefore, this diff expands the
pattern matching to files and their parent directories.

Now, it's easier to exclude all data files:

```toml
exclude = ["data"]
```

or include only the `src` directory:

```toml
include = ["src"]
```

Fixes <rust-lang#3578>
@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @matklad (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@behnam
Copy link
Contributor Author

behnam commented Jul 7, 2017

CC @alexcrichton, @mbrubeck, @llogiq

@matklad
Copy link
Member

matklad commented Jul 7, 2017

Making exclude work exactly as gitignore seems good to me! This is a potentially backwards incompatible change, but it only affects publishing, so I think it's OK to do this. However, I think that ideally we should do matching using both old and new algorithm, and print a warning if there's any difference.

However, I am worried a bit that we write a bit of a custom matching algorithm here, so there's a high chance that it will work differently from .gitignore for some corner cases. Could you investigate what ripgrep does here? My understanding is that interprets .gitignore exactly right, so we might want just to steal its implementation (I think it uses globset crate instead of glob).

@behnam
Copy link
Contributor Author

behnam commented Jul 7, 2017

I tested out globset and I've got bad news: it agrees with glob on the behavior! And, both of them disagree with gitignore.

Spec for gitignore: https://git-scm.com/docs/gitignore#_pattern_format

Here's my test:
https://github.com/behnam/rust-globs-test/blob/master/src/lib.rs

The lines marked with XXX are the disagreements with gitignore matching.

Basically, there are a few major differences between glob/globset and gitignore matching:

  1. Leading slash does not match with the location of the spec file in glob/globset. In gitignore, /foo matches foo dir/file that's next to the spec. (Do we want to change this? IIUC, there's no reason to support absolute paths outside of the repo in Cargo packaging.)

  2. Directories of a file are not matched, but only the whole file path is matched. That's why foo doesn't match any file under foo/. (This is what this diff changes.)

  3. Trailing slash does not match anything in glob/globset. In gitignore, it enforces match to a dir, but not a file. (This is the case that I added test for in this diff, but have not fixed it. The fix shouldn't be hard.)

Do you think we should put the logic for these cases into one of the upstream repos, maybe behind their Options? @BurntSushi, @alexcrichton?

Can we start with this diff, that puts the logic in one function here in Cargo, and later switch to the upstream functionality when available? If so, I can add to this implementations for both cases 2 and 3, to complete it.

@behnam
Copy link
Contributor Author

behnam commented Jul 7, 2017

I have added a real Cargo.toml test and output of the packing to the test repo:

@matklad
Copy link
Member

matklad commented Jul 8, 2017

Awesome investigation @behnam!

Looks like ripgrep does quite a bit of massaging of .gitignore globs, before passing them to globset: https://github.com/BurntSushi/ripgrep/blob/9b3921098a6804879a4425716bfb72a8a003a313/ignore/src/gitignore.rs#L355-L440. In particular, issue with "foo matches only directoris" seems to be handled here. And looks like this functionality is public, so we probably could use it: https://docs.rs/ignore/0.2.0/ignore/gitignore/index.html :)

Taking a step back, current doc say

The syntax of each element in this array is what rust-lang/glob accepts

So the current behavior matches documentation. What we actually want to do here is to change the behavior to "The syntax is the same as in .gitignore". So this change is not exactly a bug-fix.

@behnam
Copy link
Contributor Author

behnam commented Jul 8, 2017

So that's what you meant, @matklad, by ripgrep interpreting .gitignore. I didn't know about that part. Thanks for the pointers.

And good point about the behavior being documented. Although, it still doesn't explain why glob should not match directories in the first place. I didn't see any explanation about the features in glob/globset docs, and looks like a major application is also referring to the existing behavior blankly.

So, we have these options now, I suppose:

  1. Do nothing and close the PR. We can add a note to the docs about the pattern only matching file paths, and say the for directory matching a /* suffix is needed.

  2. Continue this patch the way it is (maybe also hard-code the trailing slash case), and update docs to say that the glob matching (as specified by the rust-lang/glob crate) is applied to both directories and files. We can do so in the glob side, or in here, either way backwards incompatible for Cargo, and in the former case, all users of glob.

  3. Replace the logic in this patch with the ripgrep implementation. This would make a big backwards incompatible change, although side-effects should be minimal, but still can be surprising for many users.

  4. And, finally, replace include/exclude options with a new option, like publish_files (for positive form) or ignore_files/ignore (for negative form), which works as closely to gitignore as possible (gitignore being a negative form). This new option will error if the old options exist, and we start deprecating the old options slowly in a near future. Also:

    • Is fully backwards compatible.

    • Allows both inclusion and exclusion in a deterministic way. At the moment, users can only include, or exclude, but not both. (I suppose the reason for mutual exclusive lists was to prevent confusion on the order the rules apply.)

    • The syntax is already known to users, reducing falling into traps (like the original reported issue), and making it easier to reason about (being one list, not two separate lists).

    • Being a new option, we can also add-in inheritance for it from a package.workspace Cargo.toml, to make it easier to maintain for large projects.

IMHO, the last option is better than the rest, because not only solves the current problem, also uses a more mainstream solution for the specific problem (since gitignore format is more specific to inclusion/exclusion than UNIX glob). And if we go with that, then we can move the gitignore logic of ripgrep into its own crate, to make it more available, like glob/globset. (UDPATE: Actually the libraries already exists: https://crates.io/crates/gitignore (based on glob), https://crates.io/crates/ignore, https://crates.io/crates/pathmatch.)

What do you think?

@matklad
Copy link
Member

matklad commented Jul 8, 2017

I think we should ask @alexcrichton opinion here :)

First of all, a small clarification:

And if we go with that, then we can move the gitignore logic of ripgrep into its own crate, to make it more available, like glob/globset.

It already lives in a separate crate: https://crates.io/crates/ignore

I think we do want to switch from "The syntax of each element in this array is what rust-lang/glob accepts" which is not very user friendly to "the syntax is the same as in .gitignore".

Between 3 and 4, I like 3 somewhat better: I don't think that backwards compatibility is super important here, because publish is a rare and usually manual operation (having a warning if the set of included files change won't hurt though), and 3 has an advantage that we don't carry legacy options forward.

@behnam
Copy link
Contributor Author

behnam commented Jul 8, 2017

Agreed, @matklad. So let's wait for @alexcrichton now.

Also, if we go with option 3, I would like to propose two additions:

  • For now, patterns cannot be in negation form (start with !) in neither of include or exclude options. This would prevent any confusions about the separate lists, as well as ordering of the rules.

    Later, after the dust is settles down from this change, we can allow negation form in exclude option, making it more similar to gitignore, and without disabling the default SCM-based initial inclusion logic.

  • And, orthogonal to that, allowing both include and exclude to be set in the following form:

    • If include is empty, everything that's not SCM-ignored will be included, otherwise, include sets all the paths for inclusion.

    • Then, exclude will apply on top of that to filter out paths based on exclusion patterns.

@alexcrichton
Copy link
Member

Thanks for all the investigation here! I personally really like the idea of moving to literally "gitignore syntax" as it's intuitive and well spec'd already. I think we should move to the ignore crate (that's the "gitignore syntax interpreter", right?).

I agree though with @matklad that we probably want to do this gradually. I think we should start by warning if globs exclude more than gitignore directives, and then eventually we can switch to gitignore directives warning if globs are different, and then finally later we can remove the glob crate.

@behnam
Copy link
Contributor Author

behnam commented Jul 10, 2017

Awesome!

I filed a new issue here to track the steps and progress: #4268 I wrote it based on all we discussed here, but please check it if you can and make edits if needed.

So I'll close this and submit a new PR for the new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants