-
-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lots of polishing, regular maintenance, preparing for 0.4 (or 1.0?) release #96
Conversation
7098fb8
to
93c17cb
Compare
I stopped using this stuff a very long time ago.
236396b
to
67b937a
Compare
src/raw/mod.rs
Outdated
pub struct Fst<D: AsRef<[u8]>> { | ||
meta: Meta, | ||
data: D, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing the discussion I miss-started on a commit instead of the PR itself.
If a user wants to write a structure that stores a fst::Map<D>
he is forced to specify that the D type is AsRef<[u8]>
even if he will not be read or used but only stored.
struct CurrentStruct<D: AsRef<[u8]>> {
fst: fst::Map<D>,
}
struct ExpectedStruct<D> {
fst: fst::Map<D>,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I understand that as a technical restriction. But why would someone want to do that? Is there a specific use case you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why would someone want to do that?
If you mean: Why would someone want to wrap/store an FST inside of a struct that do not need to read the FST itself?
I think about one little thing that would not be possible if the constraints are on the declaration, for example a generic struct that just displays the size_of
a wrapped type, or displays the type name
of a struct.
Don't know if it make any sense but common std collections do not trait constrain at the struct level, that makes users lives easier, even if the generic types must impl a bunch of traits to make the collections useful (K: Hash
, or K: Ord
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll consider it, thanks. And yes, I know that the std collections don't do it, but it's not clear to me whether that was a mistake or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is many arguments defending my point in this recent issue in the Rust repo. A PR has been merged to relax those bounds. I think that it is important to be the less restrictive possible when unneeded, but that's only my point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, and in that PR, my cohorts on the library team brought up concerns with doing it. That's basically my stance as well. I just really haven't seen a compelling use case. Like I said, I'll consider it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wanted to see if I was irrelevant so I checked the API Guideline, the Exception point is on my side. I don't want to fight, I am just trying to make the library more flexible :)
EDIT: I wrote this before I saw your message above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. That guideline is appropriately circumspect:
There is a grey area around other non-derivable trait bounds that are not strictly required by the structure definition, like Read or Write. They may communicate the intended behavior of the type better in its definition but also limits future extensibility. Including semantically useful trait bounds on data structures is still less problematic than including derivable traits as bounds.
We are spending too much time discussing a very minor point. Like I said, I'll consider it.
I was asked to have a look here, as a production user. So far:
It would also be really cool if fst could store other things than just |
@vorner Thanks! And yeah, I don't have any plans to extent the output value to anything beyond a |
Don't know if it is useful to talk about that in here, but the Streaming Iterator the FST rely on could be expressed using GATs when those becomes available (on stable), avoiding much of the traits tricks that are currently used. Bumping the FST library to 1.0 is probably not a good idea now, I think it would have been better to bump it to this particular version when GATs are available (it seems there is progress) and we rely on them. If it is not something you think that will happen soon enough, so, let's bump to the stable version anyway :) |
Yes, I'm intensely aware that GATs would likely fix the problem. I've been waiting for them (or an equivalent) for five years. GATs don't appear to be anywhere close to landing. A generous estimate would be that they land on stable a year from now. Doing an Normally I like to make major versions last for a long time, but |
'cargo fix' appears to have choked on fst-bin's use of imports. We patch it up here along with fixing all remaining warnings.
423534b
to
954fa87
Compare
As soon as I re-contextualized myself with this problem, it became clear to me that This is precisely what GATs ("generic associated types") are supposed to help solve. If we had those, then we could in theory write a trait whose I think I knew all of this at one point in time, which was how I arrived at the current I don't see any obviously better solution with current Rust. So wait for GATs continues here. |
8bf84bc
to
68aa50a
Compare
7f0749e
to
e8db372
Compare
This is where the Levenshtein automata originally lived. I originally moved it out of the crate in order to reduce dependencies. But we can instead just make it an optional feature. Instead of moving fst-regex back into the crate as well, we instead recommend using `regex-automata`. `regex-automata` is basically the productionized version of "compile a regex to a DFA," which is exactly what we want.
This is good for a modest ~15% performance improvement in the build/fst/set benchmark.
This change was not nearly as bad as I thought it would be. Instead of trying to provide all the different possible ways to store some bytes, we instead make FSTs maximally flexible by accepting any type that can cheaply provide byte slice. This should resolve a number of issues with folks constructing FSTs in ways that weren't supported by the old constructors. As a bonus, we no longer need to directly depend on a specific implementation of memory maps. Conveniently, the `Mmap` type in the `memmap` crate already implements `AsRef<[u8]>`, so using memory maps is as simple as let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap()); let fst = Fst::new(mmap).unwrap(); Fixes #92, Fixes #94, Fixes #97
Mostly this just upgrades to the newest versions of everything, and replaces deprecated crates (tempdir, chan) with their newer variants (tempfile, crossbeam-channel). We also use BString in several places instead of String so as to permit indexing any keys and not just UTF-8 keys. This is the calm before the storm. The next commit will switch from Docopt to Clap.
This was a grueling refactor, but it's done. We now correctly handle non-UTF-8 arguments in most places. We also switch to anyhow for error handling, so setting RUST_LIB_BACKTRACE when an error occurs will show something useful (assuming it was built with a nightly compiler).
This adds a couple environment variables that makes it easy to configure DFA construction with a few more knobs. This is useful for benchmarking various options.
This provides the ability to map from a value back to its corresponding key. This only works when values are monotonically increasing, which is unfortunate, but it's still worth exposing. Since this is only useful is limited circumstances, we don't make this available on the fst::Map type, but instead provide it on raw::Fst only. Closes #16
These mirror the constructors on Set and Map.
We'll re-add once we put out a new release of regex-automata.
This is apparently useful in some cases when access to the underlying automaton's state can produce useful information about a match. Regretably, the implementation requires the states to satisfy `Clone`, or else we would otherwise need to execute the state transition twice. In practice, states are usually `Copy` and quite small. Closes #60, Closes #61
This modifies the FST builder to compute the CRC32C checksum as the FST is being constructed. It is written as the last 4 bytes of the FST. We also add a new `verify` routine on `raw::Fst` that permits callers to check whether the FST's integrity is intact. Since verification could be quite expensive for very large FSTs, we do not do this by default. We avoid a dependency on a CRC32C crate since hand-rolling it ourselves is very simple and not much code. I tried using a SIMD version of CRC32C, but I couldn't benchmark a difference. In particular, I suspect that the FST writing process is doing a lot of small writes, so the checksummer doesn't get much opportunity to checksum a lot of bytes at once. It's not quite clear how to fix this. We could use our own buffer internally, but then the caller wouldn't be able to use their own buffered reader, which is a bit weird. But not without precedent. In any case, the overhead of checksumming during construction is virtually negligible, so we don't sweat it for now.
Now that std provides safe APIs for converting between raw bytes and integers, we can use those instead. We still define a few utility functions for convenience though. The packing code is definitely not as ideally efficient as what byteorder does, since we use safe code. But I was not able to observe a difference in benchmarks. fst now has no required dependencies!
This replaces `inline(always)` annotations with `inline` (except for one case, in which we add a comment). Notably, inlining the methods on `Bound` appear to provide a nice boost: group c01 c02 ----- --- --- search/wikiurls/fst/contains 1.06 428.7±9.53ns 1.00 404.8±4.58ns search/words/fst/contains 1.05 199.8±1.74ns 1.00 191.0±3.34ns
Touching up docs, restyling code in places, etc.
This isn't actually necessary. I put them there originally because I thought that it added more clarity to the intended use of an FST. But the constructors should make it clear enough.
High level plans:
fst-regex
in favor of usingregex-automata
. I already have a PR up onregex-automata
that adds impls forfst::Automaton
when thefst1
feature is enabled: cleanup + switch to GitHub Actions + add fst automata support + massive compile time improvements regex-automata#6fst-levenshtein
and move its code back into afst::levenshtein
sub-module.Fst
,Set
andMap
that requiresAsRef<[u8]>
. Subsequently remove the dependency onmemmap
since callers can then just create their own memory map and provide it.fst-bin
to use clap and bstr.Experiment with something likeSee: lots of polishing, regular maintenance, preparing for 0.4 (or 1.0?) release #96 (comment)streaming-iterator
instead of the existing trait definition.