cut: allow non utf8 characters for delimiters #6037

Merged
5 commits merged on Mar 10, 2024

@zhitkoff zhitkoff commented Mar 1, 2024

This PR refactors how cut processes delimiters and allows non-UTF-8 values for the delimiter options (-d/--delimiter, --output-delimiter) to align with GNU behavior.
It fixes the 8bit-delim test from the GNU tests/cut/cut.pl test set.
NOTE: There is a TODO left in for when the project's MSRV is bumped to 1.74.0+.
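
To illustrate why the delimiter has to be handled as bytes rather than as a `str`, here is a minimal sketch of byte-oriented field splitting. The helper name `nth_field` is hypothetical and is not taken from the uutils source; it only shows that a byte such as 0xFF, which never occurs in valid UTF-8, works fine as a separator once the line is treated as `&[u8]`.

```rust
/// Returns the 1-based `field` of `line`, split on the single byte `delim`.
/// Hypothetical helper for illustration; not the function used in uutils.
fn nth_field(line: &[u8], delim: u8, field: usize) -> Option<&[u8]> {
    // `checked_sub` rejects field 0, since fields are numbered from 1.
    line.split(|&b| b == delim).nth(field.checked_sub(1)?)
}

fn main() {
    // 0xFF never appears in valid UTF-8, so this line cannot be a `&str`.
    let line: &[u8] = b"alpha\xFFbeta\xFFgamma";
    assert!(std::str::from_utf8(line).is_err());

    assert_eq!(nth_field(line, 0xFF, 2), Some(&b"beta"[..]));
    assert_eq!(nth_field(line, 0xFF, 4), None);
    assert_eq!(nth_field(line, 0xFF, 0), None);
}
```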

@@ -337,6 +347,88 @@ fn cut_files(mut filenames: Vec<String>, mode: &Mode) {
}
}

// This is temporary helper function to convert OsString to &[u8] for unix targets only
// TODO Remove this function and re-implement the functionality in each place that calls it
// for all targets using https://doc.rust-lang.org/nightly/std/ffi/struct.OsStr.html#method.as_encoded_bytes
Contributor

as_encoded_bytes isn't suitable for this: the results for invalid Unicode on Windows aren't meaningful for anything other than passing back into from_encoded_bytes_unchecked. (Specifically, they're WTF-8, which doesn't help users.)

Your implementation is the right way to do it, I think we should keep it (and not unwrap).
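
For readers who have not seen the pattern, a platform-gated helper of the kind discussed here could look roughly like the sketch below. This is an assumption about the general shape, not the PR's actual code: the name `os_str_as_bytes` and the plain `String` error are placeholders (the project uses its own `UResult` type).

```rust
use std::ffi::OsStr;

/// Borrows the raw bytes of a command-line argument.
/// Hypothetical sketch: the error type is a plain `String`, not the project's `UResult`.
#[cfg(unix)]
fn os_str_as_bytes(arg: &OsStr) -> Result<&[u8], String> {
    use std::os::unix::ffi::OsStrExt;
    // On Unix an `OsStr` is an arbitrary byte sequence, so this cannot fail.
    Ok(arg.as_bytes())
}

#[cfg(not(unix))]
fn os_str_as_bytes(arg: &OsStr) -> Result<&[u8], String> {
    // Elsewhere, only accept arguments that are valid UTF-8.
    arg.to_str()
        .map(str::as_bytes)
        .ok_or_else(|| format!("invalid UTF-8 in argument: {}", arg.to_string_lossy()))
}

fn main() {
    assert_eq!(os_str_as_bytes(OsStr::new("|")), Ok(&b"|"[..]));

    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        // A lone 0xFF byte is not valid UTF-8 but is a valid Unix argument.
        let raw = OsStr::from_bytes(b"\xFF");
        assert_eq!(os_str_as_bytes(raw), Ok(&b"\xFF"[..]));
    }
}
```

Returning an error on non-Unix targets, instead of unwrapping, keeps the failure visible to the user when a delimiter cannot be represented.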

Contributor Author

Thanks! I could not get back to it before it got merged. I will update it in the next PR, as there are still a few things to fix in cut for full GNU test compatibility.

// for all targets using https://doc.rust-lang.org/nightly/std/ffi/struct.OsStr.html#method.as_encoded_bytes
// once project's MSRV is bumped up to 1.74.0+ so that function becomes available
// For now - support unix targets only and on non-unix (i.e. Windows) will just return an error if delimiter value is not UTF-8
fn os_string_as_bytes(os_string: &OsString) -> UResult<&[u8]> {
Contributor

You can use &OsStr for this (and also get access to 'static lifetimes that way).
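
A small sketch of what this suggestion buys, using hypothetical function names: a parameter of type `&OsStr` accepts a borrowed `OsString` through deref coercion, and `OsStr::new` on a string literal yields a `&'static OsStr`, which an `&OsString` parameter cannot express without allocating.

```rust
use std::ffi::{OsStr, OsString};

// Hypothetical default; `OsStr::new` on a string literal yields a `&'static OsStr`.
fn default_delimiter() -> &'static OsStr {
    OsStr::new("\t")
}

// Taking `&OsStr` accepts borrowed literals and owned `OsString`s alike.
fn is_single_byte(delim: &OsStr) -> bool {
    delim.len() == 1
}

fn main() {
    let owned: OsString = OsString::from(":");
    assert!(is_single_byte(&owned)); // `&OsString` derefs to `&OsStr`
    assert!(is_single_byte(default_delimiter()));
    assert!(!is_single_byte(OsStr::new("::")));
}
```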

.get_one::<OsString>(options::OUTPUT_DELIMITER)
.map(|os_string| {
if os_string.is_empty() || os_string == "''" {
"\0".as_bytes()
Contributor

Suggested change
- "\0".as_bytes()
+ b"\0"
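
The two spellings produce the same bytes; the byte-string literal simply skips the `str` detour. The sketch below checks that and shows the NUL fallback in isolation. The helper `effective_delimiter` is hypothetical and operates on `&[u8]` for simplicity, whereas the snippet under review works on an `OsString`.

```rust
/// Hypothetical helper mirroring the snippet under review: an empty or
/// literal `''` output delimiter falls back to a single NUL byte.
fn effective_delimiter(arg: &[u8]) -> &[u8] {
    if arg.is_empty() || arg == b"''" {
        b"\0"
    } else {
        arg
    }
}

fn main() {
    // Both spellings denote the same one-byte slice.
    assert_eq!("\0".as_bytes(), b"\0");

    assert_eq!(effective_delimiter(b""), b"\0");
    assert_eq!(effective_delimiter(b"''"), b"\0");
    assert_eq!(effective_delimiter(b":"), b":");
}
```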

GNU testsuite comparison:

Congrats! The gnu test tests/chown/preserve-root is no longer failing!

@sylvestre sylvestre merged commit 156d3f7 into uutils:main Mar 10, 2024
62 checks passed
zhitkoff added a commit to zhitkoff/coreutils that referenced this pull request Mar 12, 2024
* ci: use codecov token in CICD/GnuTests workflows

* tee: fail test if string setup fails

* sort: add skipped test for combined flags

Now that clap#2624 has been resolved, we can and should test both variants.

* cat: don't flake even on exotic pipe buffer sizes

See also 9995c63.

There is a race condition between the writing thread and the command.
It is easily possible that on the developer's machine, the writing
thread is always faster, filling the kernel's buffer of the stdin pipe,
thus succeeding the write. It is also easily possible that on the busy
CI machines, the child command runs first for whatever reason, and exits
early, thus killing the pipe, which causes the later write to fail. This
results in a flaky test. Let's prevent flaky tests.

* numfmt: don't flake even on exotic pipe buffer sizes

* split: don't flake even on exotic pipe buffer sizes

* simulate terminal utility (squash)

* workaround: run builds with retry (a)

* added configurable terminal size

* chore(deps): update rust crate rayon to 1.9

* cargo: fix feature = "cargo-clippy" deprecation

* tests/printf: Fix char_as_byte test, add char and string padding tests

* printf: Change get_char and write_padded to handle bytes instead of chars

* uucore/format: add padlen to spell-checker:ignore

* tests/printf: Verify the correct error behavior of printf when provided with '%0c' or '%0s'

* printf: Raise error on '%0c' and '%0s' formats

* cp: fix flaky test test_cp_arg_interactive_update, document adjacent bug

* chore(deps): update rust crate walkdir to 2.5

* cat: permit repeating command-line flags

* cat: fix -b and -n anti-symmetry

* cat: ignore -u flag, just like GNU does

* tests/common/util.rs: add cfg(feature = "env")

* cat: prefix two test fns with "test_"

* Bump mio from 0.8.10 to 0.8.11

* extend error message for case when writer instantiation fails second time

* Bump chrono from 0.4.34 to 0.4.35

* ls: use chrono::TimeDelta::try_seconds instead of deprecated chrono::TimeDelta::seconds

* touch: replace use of deprecated chrono functions

* chmod: slightly adjust error message when preserve-root is triggered

One of the GNU tests checks for the exact error message.

* chgrp+chown: also trigger preserve-root during dirwalking, fix error message

This is explicitly tested in the GNU tests.

* uucore: drop unused function resolve_relative_path

This function is by necessity ill-defined: Depending on the context,
'..' is either the logical parent directory, sometimes the physical
parent directory. This function can only work for the latter case,
in which case `Path::canonicalize` is often a better approach.

* split: close as much fds as needed for opening new one

* use std::command::pre_exec() to set limits on child before exec

* chore(deps): update softprops/action-gh-release action to v2

* dd: treat arg as bytes if it contains 'B'

* Fix clippy warnings

* tr: stream output instead of buffering

This should lower memory consumption, and fixes OOM in some scenarios.

* shuf: fix and test off-by-one errors around ranges

* shuf: fix error message text on negative-sized ranges

Found by @cakebaker:
uutils#6011 (comment)

* chcon: allow overriding between --dereference and --no-dereference

* chcon: allow repeated flags and arguments

* touch: Respect -h when getting metadata (uutils#5951)

* Add tests that stat symlinks

* Check follow first in stat

* Don't run tests on FreeBSD

It would be possible to get them to run on FreeBSD by avoiding
get_symlink_times, but the behavior we're testing is not
platform-specific, so it's fine to not test it on FreeBSD.

---------

Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>

* pr: fix deprecation warnings & remove comment

* chgrp: fix clippy warning

* cut: allow non utf8 characters for delimiters (uutils#6037)

* cp: improve the support of --attributes-only (uutils#6051)

* cp: improve the support of --attributes-only

* remove useless comments

Co-authored-by: Daniel Hofstetter <daniel.hofstetter@42dh.com>

---------

Co-authored-by: Daniel Hofstetter <daniel.hofstetter@42dh.com>

* cp: Split the copy_file function a bit

* parser: if closing square bracket not found, stop looking for it again

This solves uutils#5584, where the fuzzing would take hours without this.

* Fix install: invalid link at destination
also remove some FixMEs for FreeBsd

* Bump nix from 0.27 to 0.28

* uucore/pipes: adapt to new return type of nix fn

nix 0.28 changed the return type of unistd::pipe() from Result<(RawFd, RawFd), Error> to Result<(OwnedFd, OwnedFd), Error>

* tty: unistd::ttyname takes AsFd instead of RawFd

change introduced by nix 0.28

* stty: remove ofill output flag

flag was removed from nix::sys::termios::OutputFlags in nix 0.28

* cat: adapt to type change of unistd::write()

nix 0.28 changed "write(fd: RawFd, buf: &[u8]) -> Result<usize>" to "write<Fd: AsFd>(fd: Fd, buf: &[u8]) -> Result<usize>"

* chore(deps): update rust crate blake3 to 1.5.1

---------

Co-authored-by: Daniel Hofstetter <daniel.hofstetter@42dh.com>
Co-authored-by: Ben Wiederhake <BenWiederhake.GitHub@gmx.de>
Co-authored-by: Ulrich Hornung <hornunguli@gmx.de>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
Co-authored-by: Dimitris Apostolou <dimitris.apostolou@icloud.com>
Co-authored-by: Dorian Péron <dorianperon.i@gmail.com>
Co-authored-by: Terts Diepraam <terts.diepraam@gmail.com>
Co-authored-by: mhead <mtrxhead@protonmail.com>
Co-authored-by: Yash Thakur <45539777+ysthakur@users.noreply.github.com>
Co-authored-by: Zoltan Kiss <121870572+cj-zoltan-kiss@users.noreply.github.com>
@zhitkoff zhitkoff deleted the cut-delim branch March 12, 2024 21:52