Let's fix the fish history file format, and maybe extend it to configuration #6493
Conversation
Encapsulate the process of reading from history, in preparation for history file format changes.
This adds the single-header json.hpp file from SimpleJSON to fish. It is not yet built or included. Commit 8dd3e9b84a0b532c46f22f402bafb2c4cf8621eb from https://github.com/nbsdx/SimpleJSON
This adds support to fish for reading history files in "JSON Lines" format. Each entry is a separate JSON object.
Teach SimpleJSON to emit "compact" one-line JSON records. This will allow compatibility with JSON Lines.
This switches the output format from busted YAML to real JSON Lines. To avoid breaking prior versions, we now append the "json" suffix.
Ensure that we can successfully import old fish history
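For illustration, what the "compact" one-line records from the commits above mean in practice can be sketched in Python; the `cmd`/`when` field names are assumptions mirroring the existing history fields:

```python
import json

def to_json_line(entry: dict) -> str:
    """Serialize one history entry as a compact JSON Lines record."""
    # separators=(",", ":") drops the spaces json.dumps inserts by default;
    # ensure_ascii=False keeps non-ASCII commands readable in the file.
    return json.dumps(entry, separators=(",", ":"), ensure_ascii=False)

line = to_json_line({"cmd": "echo hello", "when": 1580000000})
# A newline inside a command is escaped as \n, so each record stays one line.
assert "\n" not in line
```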
SimpleJSON is WTFPL (i.e. literally "do what the fuck you want"), so it should be GPL-compatible. The last commit was in 2016, and it's a single-file lib, so I'm assuming the intention is to vendor it, and that's okay. I think JSON Lines is probably the correct choice, even though I don't think it's ideal for manual adjustment.
But I don't expect manual changes. If anything, users will use tools to change or query the history, and for that jq exists for JSON, but I know of no equivalent for TOML or YAML. Being incompatible is fine; it's necessary and not too odious. But it's a bit of a risk, which is why I'd like to hold off on this for 3.1.
Yes, this would definitely be for 3.2.
I prefer JSON Lines to TOML as well. For me, being able to use … Quick note: jsonlines.org suggests …
Highly prefer JSON Lines. It'd be cool to add a number of fields at the same time with shell state for history items. rash for example stores:
SimpleJSON looks like it is unmaintained and might have unresolved potential security issues. Some alternatives to consider:
If the new history format is reused for configuration data, it would be nice to have support for comments. There are some libraries that have comment support; otherwise maybe add comment support to the vendored library.

Also, it would be nice to have compression for the history file, especially if a lot more fields get added to the format.

Personally, I find it a bit strange that there is a hard requirement for a textual format, while the proposed solution is, for all intents and purposes, not human-parseable without external tooling. Rather than extolling the virtues of …

One problem with the current history implementation is that when your system clock is reset or wrong, fish can't figure out the history. So it would be nice to have good reliability with the new format right out of the box.
This seems to be overkill. We deduplicate history and keep a maximum of 256k items, which is enough for everyone. My history has 33k items and is 2.3M, meaning a "full" history would be about 20M. I don't see us adding enough fields to reach even 100M. So the complexity and additional dependency of gzip just doesn't appear to be worth it.
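The estimate above checks out with quick arithmetic (a sketch, using the figures quoted in the comment):

```python
# Rough size estimate from the numbers quoted above.
items = 33_000
size_mb = 2.3
bytes_per_item = size_mb * 1024 * 1024 / items  # roughly 73 bytes per entry
full_history_mb = 256_000 * bytes_per_item / (1024 * 1024)
# A "full" 256k-item history lands near 18 MB, i.e. roughly the 20M quoted.
```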
I agree with the binary format thing - I'm not sure it'd really be so bad just to use sqlite3 or protobufs, if the …
(Something a user can do currently, if they don't love the size of their history file, is just tell their filesystem to compress it behind the scenes, which macOS can do with HFS or APFS. I don't think ext3 can do this, but fancier Linux filesystems support transparent compression.)
Absolutely, fish should have excellent tools for manipulating history.

Text still has major advantages: fish can easily import bash history, which would have been harder if bash had used bolt or sqlite. Users might try crazy stuff like putting their history in a git repo. Or maybe your fish installation is borked: a textual history file is easier to find and recover. The only potential advantage of binary formats is performance, and I don't anticipate fish history outgrowing what is possible with text.

Regarding the implementation, I started with SimpleJSON because of its size, but the OOB read that @travankor found does look like a serious issue and I agree we should not use it. Of the alternatives, both picojson and jsonx seem viable. Good finds.

The time issue is orthogonal but well spotted; I hadn't thought of that. Filed #6606.
I'm a huge toml fan (rust converted me) but I don't think it's right for the job. I think toml should be used where humans are primarily editing the file by hand, but not when you only need human-readability for inspection purposes: it's too verbose and has less tooling available.

When I saw the title of the issue, before even opening it, my thoughts were SQLite (which I honestly consider to be very much both human-readable and human-editable, dependent on the schema) for robustness and maximum flexibility; LMDB (which I previously forked fish to use for other reasons) for size and speed if human readability is not a concern; and NDJSON (specifically) otherwise.

If you're willing to reconsider, SQLite makes a ton of sense for a history file with the requirements we have. It gives you "free" indexing and full-text search, "free" multi-thread/multi-process access, "free" paged loading of history items, etc. Personally, I prefer SQLite over NDJSON particularly because it enforces a schema (even though it's not a strongly typed database, but we can't have everything in life).

With regards to JSON libraries, I adopted https://github.com/nlohmann/json for a few projects and haven't had any regrets. It's not the fastest but it's very developer friendly. From past experience, I would prefer a native C++ JSON library over an untyped C library.
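For comparison, a minimal sketch of what a SQLite-backed history could look like, using Python's sqlite3 module; the schema and column names here are hypothetical, not anything fish uses:

```python
import sqlite3

# Hypothetical sqlite-backed history schema, for comparison only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE history (
        id      INTEGER PRIMARY KEY,
        cmd     TEXT NOT NULL,
        when_ts INTEGER NOT NULL
    )
""")
# An index on the timestamp makes "newest first" queries cheap.
conn.execute("CREATE INDEX idx_when ON history(when_ts)")

def add_entry(cmd, when_ts):
    conn.execute("INSERT INTO history (cmd, when_ts) VALUES (?, ?)",
                 (cmd, when_ts))

add_entry("git status", 1580000000)
add_entry("make test", 1580000100)

# Substring search, newest first -- what an autosuggestion lookup might do.
rows = conn.execute(
    "SELECT cmd FROM history WHERE cmd LIKE ? ORDER BY when_ts DESC",
    ("%git%",)).fetchall()
```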
A failing history search needs to traverse the entire history. With a large history, this can be expensive, albeit rarely noticeable since autosuggestions don't block. I don't know whether SQLite would be more efficient at this kind of full-text search. Of course, SQLite is much nicer to use when doing more complex queries.
You would need to use the fts extension (which ships with the upstream SQLite amalgamation) to be able to accelerate the search and avoid a full table scan. It's quite a capable module. The default …
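A rough sketch of the fts idea via Python's sqlite3 module (this assumes the bundled SQLite was built with FTS5 enabled; the table and column names are made up):

```python
import sqlite3

# FTS5 builds an inverted index over the text, so MATCH queries avoid
# scanning every row the way LIKE '%word%' does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE hist_fts USING fts5(cmd)")

cmds = ["git commit -m 'fix build'", "grep -r history src/", "make -j8"]
conn.executemany("INSERT INTO hist_fts (cmd) VALUES (?)",
                 [(c,) for c in cmds])

# Find every command containing the word "history".
hits = [row[0] for row in conn.execute(
    "SELECT cmd FROM hist_fts WHERE hist_fts MATCH ?", ("history",))]
```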
I realize there was a case where a failing history search would block the shell: when typing some characters that don't have an autosuggestion and then pressing up, the shell would block because of all those queued autosuggestions. But this has neatly been fixed by e334bec. It seems that linear searches are fine with the current model.

The fts extension looks nice; it seems to be really damn fast at finding all the occurrences of some word. Browsers are all using SQLite for history nowadays, but of course the UI is different here. All in all, I think JSONL works fine.

EDIT: messed up the example, of course
Note that what you are describing is probably with the default set of stop words and tokenization. You can customize how words are tokenized and what constitutes a word, likely able to tweak the results further to match our needs.
Is there any interest in a conversion tool that goes from the existing format to the new format? Obviously, this would not delete the old history, so you would end up with two copies of it.
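Such a converter could be quite small. A rough sketch, assuming only the old format's common `- cmd:` / `  when:` line pairs and ignoring escapes and `paths:` sublists:

```python
import json

def convert(old_text: str) -> str:
    """Convert fish's old pseudo-YAML history into JSON Lines.

    A sketch only: handles `- cmd:` and `  when:` lines, nothing else.
    """
    out, entry = [], None
    for line in old_text.splitlines():
        if line.startswith("- cmd: "):
            if entry:  # flush the previous entry as one compact JSON line
                out.append(json.dumps(entry, separators=(",", ":")))
            entry = {"cmd": line[len("- cmd: "):]}
        elif line.startswith("  when: ") and entry is not None:
            entry["when"] = int(line[len("  when: "):])
    if entry:
        out.append(json.dumps(entry, separators=(",", ":")))
    return "\n".join(out)

old = "- cmd: echo hello\n  when: 1580000000\n- cmd: ls\n  when: 1580000100\n"
new = convert(old)
```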
This might not be a 3.2 thing, but it would be very helpful for the metadata to include the working directory in which the command was executed. I have a lot of pipelines that use identically named scripts, so this feature would be pretty helpful.
Actually, I would prefer reducing data leaked via all these side channels where possible. Would a hash of the directory suffice (so you can match against it, but not fuzzy-search against it)?
fish can always read old formats and will save the history in the new format; would this do the job?
Yes, that was what I was thinking of. But fish already does this? It should only do this if it can't find the new history format.
It would be more useful to me to have whole paths, but I do see potential for insecurity in some applications. Would it be possible for this to be configurable? The new system is extensible by design. The default could be a hash.
Storing the PWD hash is a clever idea, but I'm not sure why the PWD would be more sensitive than the commands which were run.
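For concreteness, hashing the directory might look like the following sketch; SHA-256 and the 16-hex-digit truncation are arbitrary choices here, not anything proposed in the thread:

```python
import hashlib

def pwd_digest(path: str) -> str:
    """Hash a working directory for storage in a history entry.

    Exact matching still works (hash the candidate and compare), but the
    stored value no longer reveals the path, and fuzzy search is out.
    """
    return hashlib.sha256(path.encode("utf-8")).hexdigest()[:16]

stored = pwd_digest("/home/alice/project")
# Matching a candidate directory against the stored value:
assert pwd_digest("/home/alice/project") == stored
assert pwd_digest("/home/alice/other") != stored
```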
Okay, this has bit-rotted enough that it would have to be polished up again, so there's no real use in keeping it open. Closing!
Yeah, sorry! I'll pick it up again!
I have huge
Hope to see this get merged in 3.2.0 |
Fish already deduplicates commands and only keeps the last 256K.
@aca how large is your history?
This weekend I am switching to fish. So far, so great. My thanks to everyone who has contributed to fish!

To drive learning the fish language, I have been rewriting a Bash package which stores command history into an SQLite DB file. Searching for various things brought me to this PR. FWIW, I agree with the arguments against using sqlite for this new fish history file format (JSON Lines sounds good to me), so I'm not here suggesting that. Rather, I want to add a link to the schema I'm using in hopes the new fish history file schema might support saving some (or all?) of the same elements: https://github.com/brettviren/fishql/blob/master/functions/fishql-initdb.fish#L22 (I'm still learning fish, so please forgive any non-"fishy" looking code.)

This is the same schema used by "advanced shell history", so a quasi-standard. The motivation for enlarging the new fish history file schema is to support richer queries and to support sync/dump methods between the new format and this sqlite DB, which may be made less lossy.
I know the problem @aca is referring to; it's part of the reason for my recent patches that brought a 2-3x improvement to …

@ridiculousfish, the problem is orthogonal to history file length; the real issue is that fish builtins buffer their output when piped to another process, while fzf is designed to work with streaming input: i.e. it's a question of latency, not throughput.
Builtins avoid buffering when outputting to non-redirected stdout/stderr since bcfc54f ("Do not buffer builtin output if avoidable").
fish would have to ascertain what is reading out of the pipe (or that something is reading out of the pipe) so as to avoid a deadlock (presumably this is why the buffering exists in the first place). Since the parser is currently still single-threaded, the builtin needs a big enough buffer on the output; otherwise the pipe will saturate if the reading end is not running simultaneously but rather queued to be run internally by fish after the builtin finishes execution (i.e. is a block, function, or another builtin), right?

A few years ago I had a branch that side-stepped the issue by keeping the current behavior but also starting a thread that would try reading from the buffer and writing to the pipe. The codebase was not as amenable to such an approach at the time and I ran into some fundamental issues, but I think it's more doable today.
Let's figure out what the fish history file format should be. There's some discussion in #3341 and also here.
The goal is to settle on a fish history format, and then eventually re-use the format to persist configuration data, like universal variables and abbreviations. The user is never expected to look at or edit these files manually. But they may choose to.
State of the world
fish history is currently a broken pseudo-YAML. Mea culpa: I added this about 8 years ago, without thinking hard enough. It is ad hoc, underspecified and buggy (#6032). This complicates adding new fields while maintaining backwards compatibility, and it prevents the format from being successfully parsed by any other libraries.
The goal here is to settle on a widely-recognized, fully-specified format.
Desirable Properties for fish
These are my opinions only and are totally fair game for discussion.
Textual. Shells are built on text, which rules out binary formats like SQLite or protobuf. It should be obvious to the user how the history file is stored.
Self Describing. fish history should be a sequence of key-value pairs, not a positional list, so that new fields may be added in the future. This rules out formats like CSV or zsh's extended history.
Not Ad-Hoc. The new history format should be a real and commonly understood format, not something that we just make up.
Trivial Appending. The file must not require "closing." fish's normal history-saving is O_APPEND writes with best-effort file locking. This rules out formats like XML (`</xml>`) or JSON (`}`) that require explicit closing.
Streamable Searching. fish should be able to search the history without decoding the entire history file into memory. A SAX-style parser would satisfy this; so would easily identifiable record boundaries.
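The last two properties can be sketched together: append records with O_APPEND, and search by decoding one line at a time (a Python sketch; the `cmd`/`when` field names are assumptions):

```python
import json
import os
import tempfile

# A throwaway history file so the sketch is self-contained.
fd, HIST = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)

def append_entry(path, entry):
    # O_APPEND makes every write land at the current end of the file, so
    # concurrent shells can add records without coordinating offsets.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, (json.dumps(entry, separators=(",", ":")) + "\n").encode())
    finally:
        os.close(fd)

def search(path, needle):
    # Streamable searching: decode one record per line; the file is never
    # loaded into memory as a whole.
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if needle in entry["cmd"]:
                yield entry["cmd"]

append_entry(HIST, {"cmd": "git log", "when": 1580000000})
append_entry(HIST, {"cmd": "ls -la", "when": 1580000100})
matches = list(search(HIST, "git"))
os.remove(HIST)
```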
Candidates
I know of two plausible options: JSON Lines and TOML. I don't have a strong opinion between them - I keep going back and forth, and am open to alternatives. IMO YAML is out because of the complexity of the spec, and to avoid ambiguity with the existing format.
JSON Lines
JSON Lines, also known as JSON Object Streams, is simply newline-delimited JSON objects. Of course JSON is ubiquitous, but newline-delimited JSON objects is itself a fairly common serialization format: it is understood by `jq`, the JSON Swiss Army knife, which can slice and dice, filter, query, etc. fish history in JSON Lines might look like:
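A plausible illustration (field names assumed from fish's existing `cmd`/`when`/`paths` history fields):

```json
{"cmd":"git checkout master","when":1580000000}
{"cmd":"echo $fish_version","when":1580000030,"paths":["/usr/local/bin/fish"]}
```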
Advantages
`jq`.
Disadvantages
TOML
TOML is a configuration file format aimed at human-editability. It supports appending via the "list of tables" syntax. Here we might use its support for real dates:
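A sketch of what the list-of-tables form could look like, using TOML's native datetime type (the table and key names are assumptions):

```toml
[[entry]]
cmd = "git checkout master"
when = 2020-01-26T10:00:00Z

[[entry]]
cmd = "echo $fish_version"
when = 2020-01-26T10:00:30Z
```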
Advantages
Disadvantages
What Other Shells Do
AFAICT fish would be the only shell doing this:
PowerShell stores commands as naive newline-delimited, with no metadata.
ksh stores the commands with two nul bytes as delimiters. There is no metadata.
zsh has a fixed list of positional fields.
elvish uses a binary database called bolt.
Compatibility
The new format will be incompatible with the old format. For that reason, the filename should be different: it will be (for example) `fish_history.json` instead of merely `fish_history`. This avoids the problem where you test the new fish, then revert back, and now your history is mangled.

Ideally this is the last time we wholesale replace the fish history file format. Future changes will be adding new key/value pairs, which will be ignored by previous fish shells.