Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lots of polishing, regular maintenance, preparing for 0.4 (or 1.0?) release #96

Merged
merged 23 commits into from
Mar 8, 2020

Conversation

BurntSushi
Copy link
Owner

@BurntSushi BurntSushi commented Feb 21, 2020

High level plans:

fst-bin/Cargo.toml Outdated Show resolved Hide resolved
I stopped using this stuff a very long time ago.
@BurntSushi BurntSushi force-pushed the ag/polish branch 3 times, most recently from 236396b to 67b937a Compare February 25, 2020 12:06
src/raw/mod.rs Outdated
Comment on lines 271 to 273
pub struct Fst<D: AsRef<[u8]>> {
meta: Meta,
data: D,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing the discussion I miss-started on a commit instead of the PR itself.


If a user wants to write a structure that stores a fst::Map<D> he is forced to specify that the D type is AsRef<[u8]> even if he will not be read or used but only stored.

struct CurrentStruct<D: AsRef<[u8]>> {
    fst: fst::Map<D>,
}

struct ExpectedStruct<D> {
    fst: fst::Map<D>,
}

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I understand that as a technical restriction. But why would someone want to do that? Is there a specific use case you have in mind?

Copy link
Contributor

@Kerollmops Kerollmops Feb 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why would someone want to do that?

If you mean: Why would someone want to wrap/store an FST inside of a struct that do not need to read the FST itself?

I think about one little thing that would not be possible if the constraints are on the declaration, for example a generic struct that just displays the size_of a wrapped type, or displays the type name of a struct.

Don't know if it make any sense but common std collections do not trait constrain at the struct level, that makes users lives easier, even if the generic types must impl a bunch of traits to make the collections useful (K: Hash, or K: Ord).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll consider it, thanks. And yes, I know that the std collections don't do it, but it's not clear to me whether that was a mistake or not.

Copy link
Contributor

@Kerollmops Kerollmops Feb 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is many arguments defending my point in this recent issue in the Rust repo. A PR has been merged to relax those bounds. I think that it is important to be the less restrictive possible when unneeded, but that's only my point of view.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, and in that PR, my cohorts on the library team brought up concerns with doing it. That's basically my stance as well. I just really haven't seen a compelling use case. Like I said, I'll consider it.

Copy link
Contributor

@Kerollmops Kerollmops Feb 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to see if I was irrelevant so I checked the API Guideline, the Exception point is on my side. I don't want to fight, I am just trying to make the library more flexible :)

EDIT: I wrote this before I saw your message above.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. That guideline is appropriately circumspect:

There is a grey area around other non-derivable trait bounds that are not strictly required by the structure definition, like Read or Write. They may communicate the intended behavior of the type better in its definition but also limits future extensibility. Including semantically useful trait bounds on data structures is still less problematic than including derivable traits as bounds.

We are spending too much time discussing a very minor point. Like I said, I'll consider it.

@vorner
Copy link
Contributor

vorner commented Feb 25, 2020

I was asked to have a look here, as a production user. So far:

  • The planned changes don't seem to break my use case, which is nice :-)
  • The streaming-iterator might actually make my code a bit nicer as it has some combinators.

It would also be really cool if fst could store other things than just u64 as the values, but that might be asking too much.

@BurntSushi
Copy link
Owner Author

@vorner Thanks! And yeah, I don't have any plans to extent the output value to anything beyond a u64. That's a fairly major change internally and would likely also require major API design work.

@Kerollmops
Copy link
Contributor

Don't know if it is useful to talk about that in here, but the Streaming Iterator the FST rely on could be expressed using GATs when those becomes available (on stable), avoiding much of the traits tricks that are currently used.

Bumping the FST library to 1.0 is probably not a good idea now, I think it would have been better to bump it to this particular version when GATs are available (it seems there is progress) and we rely on them. If it is not something you think that will happen soon enough, so, let's bump to the stable version anyway :)

@BurntSushi
Copy link
Owner Author

Yes, I'm intensely aware that GATs would likely fix the problem. I've been waiting for them (or an equivalent) for five years. GATs don't appear to be anywhere close to landing. A generous estimate would be that they land on stable a year from now. Doing an fst 2.0 release at that point (or even another year after it) would be fine.

Normally I like to make major versions last for a long time, but fst is not a core ecosystem crate, so I'm generally happy to bump its major version at a bit of a higher frequency then, say, regex.

'cargo fix' appears to have choked on fst-bin's use of imports. We patch
it up here along with fixing all remaining warnings.
@BurntSushi BurntSushi force-pushed the ag/polish branch 9 times, most recently from 423534b to 954fa87 Compare March 7, 2020 15:20
@BurntSushi
Copy link
Owner Author

Experiment with something like streaming-iterator instead of the existing trait definition.

As soon as I re-contextualized myself with this problem, it became clear to me that streaming-iterator (or an equivalent) is insufficient for fst. In particular, its main problem is that get() (or next()) return a &Self::Item, which is a particularly restrictive requirement. In some cases this would be okay. For example, Self::Item == [u8] in the case of a fst::set::Stream. But in many cases, we want the thing we return to be (&[u8], u64) or (&[u8], Output) or u64 or (&[u8], u64, A::State) or whatever. That's incompatible with requiring, at the trait level, that the item itself be a borrow.

This is precisely what GATs ("generic associated types") are supposed to help solve. If we had those, then we could in theory write a trait whose Item was itself generic over a (possibly 'static) lifetime.

I think I knew all of this at one point in time, which was how I arrived at the current Streamer trait definition. But it's been so long that I forgot all of it.

I don't see any obviously better solution with current Rust. So wait for GATs continues here.

@BurntSushi BurntSushi force-pushed the ag/polish branch 3 times, most recently from 8bf84bc to 68aa50a Compare March 7, 2020 18:14
@BurntSushi BurntSushi force-pushed the ag/polish branch 8 times, most recently from 7f0749e to e8db372 Compare March 7, 2020 18:55
This is where the Levenshtein automata originally lived. I originally
moved it out of the crate in order to reduce dependencies. But we can
instead just make it an optional feature.

Instead of moving fst-regex back into the crate as well, we instead
recommend using `regex-automata`. `regex-automata` is basically the
productionized version of "compile a regex to a DFA," which is exactly
what we want.
This is good for a modest ~15% performance improvement in the
build/fst/set benchmark.
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
Mostly this just upgrades to the newest versions of everything, and
replaces deprecated crates (tempdir, chan) with their newer variants
(tempfile, crossbeam-channel).

We also use BString in several places instead of String so as to permit
indexing any keys and not just UTF-8 keys.

This is the calm before the storm. The next commit will switch from
Docopt to Clap.
This was a grueling refactor, but it's done. We now correctly handle
non-UTF-8 arguments in most places. We also switch to anyhow for error
handling, so setting RUST_LIB_BACKTRACE when an error occurs will show
something useful (assuming it was built with a nightly compiler).
This adds a couple environment variables that makes it easy to configure
DFA construction with a few more knobs. This is useful for benchmarking
various options.
This provides the ability to map from a value back to its corresponding
key. This only works when values are monotonically increasing, which is
unfortunate, but it's still worth exposing. Since this is only useful is
limited circumstances, we don't make this available on the fst::Map
type, but instead provide it on raw::Fst only.

Closes #16
These mirror the constructors on Set and Map.
We'll re-add once we put out a new release of regex-automata.
This is apparently useful in some cases when access to the underlying
automaton's state can produce useful information about a match.
Regretably, the implementation requires the states to satisfy `Clone`,
or else we would otherwise need to execute the state transition twice.
In practice, states are usually `Copy` and quite small.

Closes #60, Closes #61
This modifies the FST builder to compute the CRC32C checksum as the FST
is being constructed. It is written as the last 4 bytes of the FST.

We also add a new `verify` routine on `raw::Fst` that permits callers to
check whether the FST's integrity is intact. Since verification could be
quite expensive for very large FSTs, we do not do this by default.

We avoid a dependency on a CRC32C crate since hand-rolling it ourselves
is very simple and not much code. I tried using a SIMD version of
CRC32C, but I couldn't benchmark a difference. In particular, I suspect
that the FST writing process is doing a lot of small writes, so the
checksummer doesn't get much opportunity to checksum a lot of bytes at
once.

It's not quite clear how to fix this. We could use our own buffer
internally, but then the caller wouldn't be able to use their own
buffered reader, which is a bit weird. But not without precedent.

In any case, the overhead of checksumming during construction is
virtually negligible, so we don't sweat it for now.
Now that std provides safe APIs for converting between raw bytes and
integers, we can use those instead. We still define a few utility
functions for convenience though.

The packing code is definitely not as ideally efficient as what
byteorder does, since we use safe code. But I was not able to observe a
difference in benchmarks.

fst now has no required dependencies!
This replaces `inline(always)` annotations with `inline` (except for one
case, in which we add a comment).

Notably, inlining the methods on `Bound` appear to provide a nice boost:

    group                           c01                    c02
    -----                           ---                    ---
    search/wikiurls/fst/contains    1.06    428.7±9.53ns   1.00    404.8±4.58ns
    search/words/fst/contains       1.05    199.8±1.74ns   1.00    191.0±3.34ns
Touching up docs, restyling code in places, etc.
This isn't actually necessary. I put them there originally because I
thought that it added more clarity to the intended use of an FST. But
the constructors should make it clear enough.
@BurntSushi BurntSushi merged commit a1c4113 into master Mar 8, 2020
@BurntSushi BurntSushi deleted the ag/polish branch March 8, 2020 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants