Let's fix the fish history file format, and maybe extend it to configuration #6493
Conversation
Encapsulate the process of reading from history, in preparation for history file format changes.
This adds the single-header json.hpp file from SimpleJSON to fish. It is not yet built or included. Commit 8dd3e9b84a0b532c46f22f402bafb2c4cf8621eb from https://github.com/nbsdx/SimpleJSON
This adds support to fish for reading history files in "JSON Lines" format. Each entry is a separate JSON object.
Teach SimpleJSON to emit "compact" one-line JSON records. This will allow compatibility with JSON Lines.
This switches the output format from busted YAML to real JSON Lines. To avoid breaking prior versions, we now append the "json" suffix.
Ensure that we can successfully import old fish history
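For illustration, what the "compact" one-line records from the commits above mean in practice can be sketched in Python; the `cmd`/`when` field names are assumptions mirroring the existing history fields:

```python
import json

def to_json_line(entry: dict) -> str:
    """Serialize one history entry as a compact JSON Lines record."""
    # separators=(",", ":") drops the spaces json.dumps inserts by default;
    # ensure_ascii=False keeps non-ASCII commands readable in the file.
    return json.dumps(entry, separators=(",", ":"), ensure_ascii=False)

line = to_json_line({"cmd": "echo hello", "when": 1580000000})
# A newline inside a command is escaped as \n, so each record stays one line.
assert "\n" not in line
```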
SimpleJSON is WTFPL (i.e. literally "do what the fuck you want"), so it should be GPL-compatible. The last commit was in 2016, and it's a single-file lib, so I'm assuming the intention is to vendor it, and that's okay. I think JSON Lines is probably the correct choice, even though I don't think it's ideal for manual adjustment.
But I don't expect manual changes. If anything, users will use tools to change or query the history, and for that jq exists for JSON, but I know of no equivalent for TOML or YAML. Being incompatible is fine; it's necessary and not too odious. But it's a bit of a risk, which is why I'd like to hold off on this for 3.1.
Yes, this would definitely be for 3.2.
I prefer JSON Lines to TOML as well. For me, being able to use … Quick note: jsonlines.org suggests …
Highly prefer JSON Lines. It'd be cool to add a number of fields at the same time with shell state for history items. rash for example stores:
SimpleJSON looks like it is unmaintained and might have unresolved potential security issues. Some alternatives to consider:
If the new history format is reused for configuration data, it would be nice to have support for comments. There are some libraries that have comment support; otherwise maybe add comment support to the vendored library.

Also, it would be nice to have compression for the history file, especially if a lot more fields get added to the format.

Personally, I find it a bit strange that there is a hard requirement for a textual format, while the proposed solution is, for all intents and purposes, not human-parseable without external tooling. Rather than extolling the virtues of …

One problem with the current history implementation is that when your system clock is reset or wrong, fish can't figure out the history. So it would be nice to have good reliability with the new format right out of the box.
This seems to be overkill. We deduplicate history and keep a maximum of 256k items, which is enough for everyone. My history has 33k items and is 2.3M, meaning a "full" history would be about 20M. I don't see us adding enough fields to reach even 100M. So the complexity and additional dependency of gzip just doesn't appear to be worth it.
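The estimate above checks out with quick arithmetic (a sketch, using the figures quoted in the comment):

```python
# Rough size estimate from the numbers quoted above.
items = 33_000
size_mb = 2.3
bytes_per_item = size_mb * 1024 * 1024 / items  # roughly 73 bytes per entry
full_history_mb = 256_000 * bytes_per_item / (1024 * 1024)
# A "full" 256k-item history lands near 18 MB, i.e. roughly the 20M quoted.
```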
I agree with the binary format thing - I'm not sure it'd really be so bad just to use sqlite3 or protobufs, if the …
(Something a user can do currently, if they don't love the size of their history file, is just tell their filesystem to compress it behind the scenes, which macOS can do with HFS or APFS. I don't think ext3 can do this, but fancier Linux filesystems support transparent compression.)
Absolutely, fish should have excellent tools for manipulating history.

Text still has major advantages: fish can easily import bash history, which would have been harder if bash had used bolt or sqlite. Users might try crazy stuff like putting their history in a git repo. Or maybe your fish installation is borked: a textual history file is easier to find and recover. The only potential advantage of binary formats is performance, and I don't anticipate fish history outgrowing what is possible with text.

Regarding the implementation, I started with SimpleJSON because of its size, but the OOB read that @travankor found does look like a serious issue and I agree we should not use it. Of the alternatives, both picojson and jsonx seem viable. Good finds.

The time issue is orthogonal but well spotted; I hadn't thought of that. Filed #6606.
I'm a huge toml fan (rust converted me) but I don't think it's right for the job. I think toml should be used where humans are primarily editing the file by hand, but not when you only need human-readability for inspection purposes: it's too verbose and has less tooling available.

When I saw the title of the issue, before even opening it, my thoughts were SQLite (which I honestly consider to be very much both human-readable and human-editable, dependent on the schema) for robustness and maximum flexibility; LMDB (which I previously forked fish to use for other reasons) for size and speed if human readability is not a concern; and NDJSON (specifically) otherwise.

If you're willing to reconsider, SQLite makes a ton of sense for a history file with the requirements we have. It gives you "free" indexing and full-text search, "free" multi-thread/multi-process access, "free" paged loading of history items, etc. Personally, I prefer SQLite over NDJSON particularly because it enforces a schema (even though it's not a strongly typed database, but we can't have everything in life).

With regards to JSON libraries, I adopted https://github.com/nlohmann/json for a few projects and haven't had any regrets. It's not the fastest but it's very developer friendly. From past experience, I would prefer a native C++ JSON library over an untyped C library.
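For comparison, a minimal sketch of what a SQLite-backed history could look like, using Python's sqlite3 module; the schema and column names here are hypothetical, not anything fish uses:

```python
import sqlite3

# Hypothetical sqlite-backed history schema, for comparison only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE history (
        id      INTEGER PRIMARY KEY,
        cmd     TEXT NOT NULL,
        when_ts INTEGER NOT NULL
    )
""")
# An index on the timestamp makes "newest first" queries cheap.
conn.execute("CREATE INDEX idx_when ON history(when_ts)")

def add_entry(cmd, when_ts):
    conn.execute("INSERT INTO history (cmd, when_ts) VALUES (?, ?)",
                 (cmd, when_ts))

add_entry("git status", 1580000000)
add_entry("make test", 1580000100)

# Substring search, newest first -- what an autosuggestion lookup might do.
rows = conn.execute(
    "SELECT cmd FROM history WHERE cmd LIKE ? ORDER BY when_ts DESC",
    ("%git%",)).fetchall()
```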
A failing history search needs to traverse the entire history. With a large history, this can be expensive, albeit rarely noticeable since autosuggestions don't block. I don't know whether SQLite would be more efficient at this kind of full-text search. Of course, SQLite is much nicer to use when doing more complex queries.
You would need to use the fts extension (which ships with the upstream SQLite amalgamation) to be able to accelerate the search and avoid a full table scan. It's quite a capable module. The default …
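A rough sketch of the fts idea via Python's sqlite3 module (this assumes the bundled SQLite was built with FTS5 enabled; the table and column names are made up):

```python
import sqlite3

# FTS5 builds an inverted index over the text, so MATCH queries avoid
# scanning every row the way LIKE '%word%' does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE hist_fts USING fts5(cmd)")

cmds = ["git commit -m 'fix build'", "grep -r history src/", "make -j8"]
conn.executemany("INSERT INTO hist_fts (cmd) VALUES (?)",
                 [(c,) for c in cmds])

# Find every command containing the word "history".
hits = [row[0] for row in conn.execute(
    "SELECT cmd FROM hist_fts WHERE hist_fts MATCH ?", ("history",))]
```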
I realize there was a case where a failing history search would block the shell: when typing some characters that don't have an autosuggestion and then pressing up, the shell would block because of all those queued autosuggestions. But this has neatly been fixed by e334bec. It seems that linear searches are fine with the current model.

The fts extension looks nice; it seems to be really damn fast at finding all the occurrences of some word. Browsers are all using SQLite for history nowadays, but of course the UI is different here. All in all, I think JSONL works fine.

EDIT: messed up the example, of course
Note that what you are describing is probably with the default set of stop words and tokenization. You can customize how words are tokenized and what constitutes a word, likely able to tweak the results further to match our needs.
Is there any interest in a conversion tool that goes from the existing format to the new format? Obviously, this would not delete the old history, so you would end up with two copies of it.
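Such a converter could be quite small. A rough sketch, assuming only the old format's common `- cmd:` / `  when:` line pairs and ignoring escapes and `paths:` sublists:

```python
import json

def convert(old_text: str) -> str:
    """Convert fish's old pseudo-YAML history into JSON Lines.

    A sketch only: handles `- cmd:` and `  when:` lines, nothing else.
    """
    out, entry = [], None
    for line in old_text.splitlines():
        if line.startswith("- cmd: "):
            if entry:  # flush the previous entry as one compact JSON line
                out.append(json.dumps(entry, separators=(",", ":")))
            entry = {"cmd": line[len("- cmd: "):]}
        elif line.startswith("  when: ") and entry is not None:
            entry["when"] = int(line[len("  when: "):])
    if entry:
        out.append(json.dumps(entry, separators=(",", ":")))
    return "\n".join(out)

old = "- cmd: echo hello\n  when: 1580000000\n- cmd: ls\n  when: 1580000100\n"
new = convert(old)
```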
This might not be a 3.2 thing, but it would be very helpful for the metadata to include the working directory in which the command was executed. I have a lot of pipelines that use identically named scripts, so this feature would be pretty helpful.
Actually, I would prefer reducing data leaked via all these side channels where possible. Would a hash of the directory suffice (so you can match against it, but not fuzzy-search against it)?
fish can always read old formats and will save the history in the new format; would this do the job?
Yes, that was what I was thinking of. But fish already does this? It should only do this if it can't find the new history format.
It would be more useful to me to have whole paths, but I do see potential for insecurity in some applications. Would it be possible for this to be configurable? The new system is extensible by design. The default could be a hash.
Storing the PWD hash is a clever idea, but I'm not sure why the PWD would be more sensitive than the commands which were run.
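For concreteness, hashing the directory might look like the following sketch; SHA-256 and the 16-hex-digit truncation are arbitrary choices here, not anything proposed in the thread:

```python
import hashlib

def pwd_digest(path: str) -> str:
    """Hash a working directory for storage in a history entry.

    Exact matching still works (hash the candidate and compare), but the
    stored value no longer reveals the path, and fuzzy search is out.
    """
    return hashlib.sha256(path.encode("utf-8")).hexdigest()[:16]

stored = pwd_digest("/home/alice/project")
# Matching a candidate directory against the stored value:
assert pwd_digest("/home/alice/project") == stored
assert pwd_digest("/home/alice/other") != stored
```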
Okay, this has bit-rotted enough that it would have to be polished up again, so there's no real use in keeping it open. Closing!
Yeah, sorry! I'll pick it up again!
I have huge
Hope to see this get merged in 3.2.0 |
Fish already deduplicates commands and only keeps the last 256K.
@aca how large is your history?
This weekend I am switching to fish. So far, so great. My thanks to everyone who has contributed to fish!

To drive learning the fish language, I have been rewriting a Bash package which stores command history into an SQLite DB file. Searching for various things brought me to this PR. FWIW, I agree with the arguments against using sqlite for this new fish history file format (JSON Lines sounds good to me), so I'm not here suggesting that. Rather, I want to add a link to the schema I'm using in hopes the new fish history file schema might support saving some (or all?) of the same elements: https://github.com/brettviren/fishql/blob/master/functions/fishql-initdb.fish#L22 (I'm still learning fish, so please forgive any non-"fishy" looking code.)

This is the same schema used by "advanced shell history", so a quasi-standard. The motivation for enlarging the new fish history file schema is to support richer queries and to support sync/dump methods between the new format and this sqlite DB, which may be made less lossy.
I know the problem @aca is referring to; it's part of the reason for my recent patches that brought a 2-3x improvement to …

@ridiculousfish, the problem is orthogonal to history file length; the real issue is that fish builtins buffer their output when piped to another process, while fzf is designed to work with streaming input: i.e. it's a question of latency, not throughput.
Builtins avoid buffering when outputting to non-redirected stdout/stderr since bcfc54f ("Do not buffer builtin output if avoidable").
fish would have to ascertain what is reading out of the pipe (or that something is reading out of the pipe) so as to avoid a deadlock (presumably this is why the buffering exists in the first place). Since the parser is currently still single-threaded, the builtin needs a big enough buffer on the output; otherwise the pipe will saturate if the reading end is not running simultaneously but rather queued to be run internally by fish after the builtin finishes execution (i.e. is a block, function, or another builtin), right?

A few years ago I had a branch that side-stepped the issue by keeping the current behavior but also starting a thread that would try reading from the buffer and writing to the pipe. The codebase was not as amenable to such an approach at the time and I ran into some fundamental issues, but I think it's more doable today.
Let's figure out what the fish history file format should be. There's some discussion in #3341 and also here.
The goal is to settle on a fish history format, and then eventually re-use the format to persist configuration data, like universal variables and abbreviations. The user is never expected to look at or edit these files manually. But they may choose to.
State of the world
fish history is currently a broken pseudo-YAML. Mea culpa: I added this about 8 years ago, without thinking hard enough. It is ad hoc, underspecified and buggy (#6032). This complicates adding new fields while maintaining backwards compatibility, and it prevents the format from being successfully parsed by any other libraries.
The goal here is to settle on a widely-recognized, fully-specified format.
Desirable Properties for fish
These are my opinions only and are totally fair game for discussion.
Textual. Shells are built on text, which rules out binary formats like SQLite or protobuf. It should be obvious to the user how the history file is stored.
Self Describing. fish history should be a sequence of key-value pairs, not a positional list, so that new fields may be added in the future. This rules out formats like CSV or zsh's extended history.
Not Ad-Hoc. The new history format should be a real and commonly understood format, not something that we just make up.
Trivial Appending. The file must not require "closing." fish's normal history-saving is O_APPEND writes with best-effort file locking. This rules out formats like XML (`</xml>`) or JSON (`}`) that require explicit closing.
Streamable Searching. fish should be able to search the history without decoding the entire history file into memory. A SAX-style parser would satisfy this; so would easily identifiable record boundaries.
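The last two properties can be sketched together: append records with O_APPEND, and search by decoding one line at a time (a Python sketch; the `cmd`/`when` field names are assumptions):

```python
import json
import os
import tempfile

# A throwaway history file so the sketch is self-contained.
fd, HIST = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)

def append_entry(path, entry):
    # O_APPEND makes every write land at the current end of the file, so
    # concurrent shells can add records without coordinating offsets.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, (json.dumps(entry, separators=(",", ":")) + "\n").encode())
    finally:
        os.close(fd)

def search(path, needle):
    # Streamable searching: decode one record per line; the file is never
    # loaded into memory as a whole.
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            if needle in entry["cmd"]:
                yield entry["cmd"]

append_entry(HIST, {"cmd": "git log", "when": 1580000000})
append_entry(HIST, {"cmd": "ls -la", "when": 1580000100})
matches = list(search(HIST, "git"))
os.remove(HIST)
```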
Candidates
I know of two plausible options: JSON Lines and TOML. I don't have a strong opinion between them - I keep going back and forth, and am open to alternatives. IMO YAML is out because of the complexity of the spec, and to avoid ambiguity with the existing format.
JSON Lines
JSON Lines, also known as JSON Object Streams, is simply newline-delimited JSON objects. Of course JSON is ubiquitous, but newline-delimited JSON objects is itself a fairly common serialization format: it is understood by `jq`, the JSON Swiss Army knife, which can slice and dice, filter, query, etc. fish history in JSON Lines might look like:
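A plausible illustration (field names assumed from fish's existing `cmd`/`when`/`paths` history fields):

```json
{"cmd":"git checkout master","when":1580000000}
{"cmd":"echo $fish_version","when":1580000030,"paths":["/usr/local/bin/fish"]}
```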
Advantages
`jq`.
Disadvantages
TOML
TOML is a configuration file format aimed at human-editability. It supports appending via the "list of tables" syntax. Here we might use its support for real dates:
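A sketch of what the list-of-tables form could look like, using TOML's native datetime type (the table and key names are assumptions):

```toml
[[entry]]
cmd = "git checkout master"
when = 2020-01-26T10:00:00Z

[[entry]]
cmd = "echo $fish_version"
when = 2020-01-26T10:00:30Z
```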
Advantages
Disadvantages
What Other Shells Do
AFAICT fish would be the only shell doing this:
PowerShell stores commands as naive newline-delimited, with no metadata.
ksh stores the commands with two nul bytes as delimiters. There is no metadata.
zsh has a fixed list of positional fields.
elvish uses a binary database called bolt.
Compatibility
The new format will be incompatible with the old format. For that reason, the filename should be different: it will be (for example) `fish_history.json` instead of merely `fish_history`. This avoids the problem where you test the new fish, then revert back, and now your history is mangled.

Ideally this is the last time we wholesale replace the fish history file format. Future changes will be adding new key/value pairs, which will be ignored by previous fish shells.