Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL-encoded session names #55

Merged
merged 1 commit into from
Jun 11, 2024
Merged

URL-encoded session names #55

merged 1 commit into from
Jun 11, 2024

Conversation

awerebea
Copy link
Contributor

@awerebea awerebea commented May 13, 2024

Add a configuration option to encode/decode session names using percent-encoding syntax, very similar to what is commonly used in web development to encode URLs, to avoid using illegal characters in file names, but still allow them to be used in session names.

When this option is set to true, the session name is encoded for use as part of the session filename before saving and decoded back before displaying in the list of sessions and notification messages.

This change allows us to use CWD paths as the session name to manage sessions based on path, for example:

require("possession").save((vim.fn.getcwd(-1, -1):gsub("^" .. vim.loop.os_homedir(), "~")), { no_confirm = true })
require("possession").load((vim.fn.getcwd(-1, -1):gsub("^" .. vim.loop.os_homedir(), "~")))

Here's what it looks like in practice:
urlencoded_session_names

@jedrzejboczar
Copy link
Owner

Thanks! This looks really good.

As far as I understand enabling url-encoding shouldn't break anything for anyone, am I right? Or rather: it could break user sessions if they have names for which url-encoding is not a no-op, but that should be rare as I guess that most people should be using names without any special characters. If so, then maybe we can go for a breaking change and set the default value to true? Then we can add convenience commands PossessionSaveCwd and PossessionLoadCwd. I believe that there are many users that would want this feature, even though I personally probably won't use it.

@jedrzejboczar
Copy link
Owner

I rebased your changes on top of current master and fixed some stylua issues.

On top of that I implemented some things and pushed to your branch, I hope that it's not a problem. The changes include:

  • url-encode only the file name, session name in JSON stays intact, this reduces the amount of places where we need to use conditional_encode/decode (to only the paths.lua module)
  • renamed the config option to url_encode_name to be more explicit what it is doing
  • added autosave/autoload based on CWD like in the code snippet you provided

Are you okay with these changes? If so than I would wait some time using this to make sure it doesn't break stuff and then I would squash the commits and merge it as a breaking change (even though it should not break anything for 90% case).

Thanks again for the idea and PR, this allowed to add some functionality that has already been requested multiple times.

Related: #12, #45
Some potential TODOs: #45 (comment)

@awerebea
Copy link
Contributor Author

I'm glad you found it useful, and of course, I agree with your changes.

Thank you for maintaining such a great plugin. 👍

@josh-nz josh-nz mentioned this pull request Jun 3, 2024
@josh-nz
Copy link
Contributor

josh-nz commented Jun 3, 2024

There is a lot of good work here, but some issues remain.

Firstly, I think calling it url encoding is a bit misleading since that is not precisely what's happening. Correct url encoding has a different set of characters that must be encoded and different set of reserved characters, and this is also not a full or correct implementation. Perhaps something like filesystem_percent_encoding is a little more intention revealing.

Secondly, part of the algorithm uses simple substitution, which is not sufficient. I've outlined some issues with substitution in a different session plugin here: rmagatti/auto-session#273

Here are the results of some test cases with the "url encoding" algorithm as it currently stands:

Screen Shot 2024-06-03 at 16 56 58

So while it might work for a good number of cases, it's not entirely robust. Space should be able to be encoded as %20 and + as %2B, which removes the substitution issues. Ref:
https://en.wikipedia.org/wiki/Percent-encoding

Alternatives:
I've had a quick look into alternatives, base64 is not suitable due to case sensitivity, but base32 is suitable, and there is a nvim library that supports this:
https://github.com/MisanthropicBit/decipher.nvim

I don't know if you want to take a dependency, but this might be a quick solution.

Another possibility is that so long as any invalid filename chars are stripped or encoded, we can still use this idea provided it is a one way function. That is, so long as we are always "encoding", it can work. For example, currently the loading of the last session attempts to decode the filename. It could instead just read the CWD inside the JSON (which is not encoded). I have yet to consider whether this is possible with the migrate function and whatever M.complete_session in commands.lua is doing, as these are the other two functions that could potentially call decode (if the config flag is set).

Thanks @awerebea for your work on this!

For reference, here is the code I was testing out at https://codapi.org/lua/

function url_encode(str)
    if type(str) == 'string' then
        str = string.gsub(str, '\n', '\r\n')
        str = string.gsub(str, '([^%w %%%+%-%_%.])', function(c)
            return string.format('%%%02X', string.byte(c))
        end)
        str = string.gsub(str, ' ', '+')
    end
    return str
end

function url_decode(str)
    if type(str) == 'string' then
        str = string.gsub(str, '+', ' ')
        str = string.gsub(str, '%%(%x%x)', function(h)
            return string.char(tonumber(h, 16))
        end)
        str = string.gsub(str, '\r\n', '\n')
    end
    return str
end

function enc_dec(string)
  local encoded = url_encode(string)
  local decoded = url_decode(encoded)
  print(string .. " -> " .. encoded .. " -> " .. decoded)
end

enc_dec([[foo\bar]])
enc_dec([[unc:\\bar]])
enc_dec([[c:\foo\bar]])
enc_dec([[foo+bar]])
enc_dec([[foo%bar]])
enc_dec([[foo bar]])
enc_dec([[foo  bar]])
enc_dec([[foo++bar]])
enc_dec([[foo+%bar]])
enc_dec([[foo%%bar]])
enc_dec([=[foo[[]]bar]=])

@josh-nz
Copy link
Contributor

josh-nz commented Jun 3, 2024

After a quick play, it looks like the following works for my above samples.

+ and space don't need encoding since they're valid filename characters, so remove the substitution.

Encode % so it's not caught as a false match during decoding.

What's the point in substituting \n for \r\n and back? I think if we have any of these characters, they should instead be stripped?

function url_encode(str)
    if type(str) == 'string' then
        --str = string.gsub(str, '\n', '\r\n')
        str = string.gsub(str, '([^%w %+%-%_%.])', function(c) -- removed %% to let % encode
            return string.format('%%%02X', string.byte(c))
        end)
        --str = string.gsub(str, ' ', '+')
        
    end
    return str
end

function url_decode(str)
    if type(str) == 'string' then
        --str = string.gsub(str, '+', ' ')
        str = string.gsub(str, '%%(%x%x)', function(h)
            return string.char(tonumber(h, 16))
        end)
        --str = string.gsub(str, '\r\n', '\n')
    end
    return str
end

@awerebea
Copy link
Contributor Author

awerebea commented Jun 3, 2024

@josh-nz, thank you for your feedback and comments.
I fixed substitution issues with space, + and % signs and renamed url_encode to percentage_encode.
percent_encode

@josh-nz
Copy link
Contributor

josh-nz commented Jun 4, 2024

My personal preference would be to only percent encode characters that are not legal filename characters (this be get tricky to define exhaustively and is platform specific, which I wouldn't worry about here, just the general cases). So I would not encode space or +. My reasoning is that there should be a clear rule around when the function encodes a character and when it does not. As it stands, I don't know why it is encoding, for example, space and + but does not encode -, _, or .. It seems arbitrary.

Ultimately, the current implementation still works, and I doubt I'll really be looking at the encoded filenames on disk and dealing with them directly in any capacity, but it does have an impact on code understanding.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 4, 2024

A few other things to consider. Does anyone know how this plays with non-English characters? I'm guessing it depends on how Lua treats the %w match as to whether they are encoded or not? If they are, that might break things, if the character is longer than 1 byte, ie more than %xx.

Also, this encoding transforms a single character into 3 characters now. There might be limits on file system name lengths. Windows in particular can be pretty fussy about this. These limits are probably a minor edge case, but worth keeping in mind.

@awerebea
Copy link
Contributor Author

awerebea commented Jun 4, 2024

The current implementation allows you to create a session with a name of your choice. It can be not only CWD, but also, for example, "My session :*)".
If we encode only the characters that are forbidden to be used in filenames (in the case of Linux, this is, in fact, just a forward slash /), then an attempt will be made to create a file with spaces, a colon and an asterisk.
I don't think it's a good idea to create files with such names, although it's possible and formally allowed by the operating system.

I think it is necessary to escape all potentially unsafe characters.
From my point of view, the safest characters for file names are alphanumeric characters, hyphens and underscores. The dot is also fine, since it is placed before the file extension in any case.

If this approach does not seem reasonable to you, I prefer to encode all characters except alphanumeric, instead of allowing spaces, asterisks, commas, question marks and other legal filename characters.

@jedrzejboczar
Copy link
Owner

Thanks for your research. I read through it and it looks like this is much more complicated than I thought :P

I took a look at how Neovim/Vim do this kind of encoding for undo files (what we get from undofile()). It looks like the only thing that is done is to replace path separators with percent character (undo.c). So all other characters should probably be valid file name characters on all platforms? The main problem I see is that on Windows this is not a one-to-one mapping. We could try some heuristic on how to decide where to put : for names like C:\my\path but maybe it's better to do what Neovim does and what @josh-nz suggested, i.e. to only do mapping CWD->encoded CWD and just always read CWD from JSON instead of relying on file name at all. It would basically mean that we would have to remove the option from require('possession.session').list { no_read = true } and always parse the files (which in general is probably really cheap to do). We can always add some simple caching mechanism. This should solve all the problems with decoding and we would only need an encoding that removes invalid filename characters. I get your concerns about special characters in file names @awerebea but I believe that what undofile() does should be enough, even though some other characters in file names may seem problematic, this is usually the case for poorly implemented file system utilities and most likely there should be no issues with it for libuv (but maybe I'm too naive?).

What do you think about it? If I find some time I'll try to implement this, but if you have time and will to do so, then feel free. Or are there any issues that you see with this idea?

@jedrzejboczar
Copy link
Owner

jedrzejboczar commented Jun 4, 2024

Looks like I was naive, there seem to be a lot of forbidden characters on Windows: https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names. We could:

  • ban most of the problematic characters from session names on all platforms
  • allow all characters and just let users get errors when there is a problem to create such a file (I think it's better)
  • use encoding that will allow to save file with any name

In any case it's probably good to stop relying on file names at all when reading sessions and just parse JSON. This would just need some "health checks" in case for some reason two files with different filenames have the same session name in JSON.

This is up for discussion, I don't know which solution is best, I just think that banning is the worst one.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 4, 2024

I think given the many different filesystems, what @awerebea said about encoding space and + is probably the best choice after further consideration. Perhaps if there could at least be a comment in the encoding algorithm that explains why it's encoding what it is, that would be helpful for future selfs. Even if it's just a link back to this discussion, that would suffice.

  • ban most of the problematic characters from session names on all platforms
  • allow all characters and just let users get errors when there is a problem to create such a file (I think it's better)

Of these two, I agree on the second. Just let Neovim and filesystem deal with the error handling.

The current encoding and decoding I think is working. I wasn't aware it was encoding manually entered session names also. One option is to only encode the automatic CWD names, and leave manually entered names to be dealt with by Neovim natively (the second point above). I would probably be in favour of this, but also a bit ambivalent about the decision. Maybe it is preferable to be consistent and just continue to encode everything. Edit: yeah, I think? I'm leaning towards consistency. Let users have their "My session :*)" if they want it.

Looking forward, I can see my suggestion here being useful: #59 (comment)
which if it goes ahead, means we need to implement reading the JSON. I don't have concerns about performance, and as you said, we can cache this. They probably only need to be read once. But even with this, we still need a way to encode the CWD to a filename, even if it's only a one-way encode. I think the current implementation is suitable, even if it's only used as a one-way encode. The main benefit I see of this over something like what undo.c is doing, is that it is consistent across filesystems, so no special cases are required in the code, which simplifies the reasoning.

@jedrzejboczar
Copy link
Owner

@awerebea @josh-nz
I finally found time to work on this:

  • added some test cases for the encoding/decoding (even though we now don't need decoding)
  • added link to this PR in percent_encode for explanation as you suggested
  • removed and deprecated the session.list { no_read = true } parameter, now we always read file contents
  • deprecated paths.session_name
  • completion now just reads names from files
  • added a check for duplicates in session.list, if for whatever reason there are duplicate names then user is advised to delete incorrect files; I believe this problem should never happen if using our APIs

Please review these changes to see if it's ok. I will then squash this as there are a lot of commits.

When this is merged we can think of other suggestions from #59, e.g. "remove last symlink and use mtime" or "load CWD session based on session.cwd from most recent session".

@awerebea
Copy link
Contributor Author

Hi @jedrzejboczar,

Please review these changes to see if it's ok. I will then squash this as there are a lot of commits.

It looks good to me. Thank you for your attention to this issue.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 10, 2024

Please review these changes to see if it's ok. I will then squash this as there are a lot of commits.

When this is merged we can think of other suggestions from #59, e.g. "remove last symlink and use mtime" or "load CWD session based on session.cwd from most recent session".

I'll have a look at the changes later today.

I've also got some local commits for "remove last symlink and use mtime" or "load CWD session based on session.cwd from most recent session" (this last one is still a work in progress, only started on it last night and will need some adjustments based on your changes). You know the code a lot better than I do, so I'd consider these as proof-of-concepts, there might be a better way to implement them. Happy to share them when I've fixed them up for the latest changes in this branch.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 11, 2024

Looks ok to me. I'm wondering if there is a need to support the percent_encode_name config setting. Can we just force it on? Consider:

  • we are reading CWD from the JSON so no issues here
  • places that display the session filename display it as is, so if you previously had something like %2c as part of your manual session name it will continue to display as the user entered it (since there is no decoding). New, encoded cwd session names will display encoded, which I think is fine?

The only question is do we want to decode session names for display, eg PossessionList. If we do, there is a very slight chance we mis-display a filename if the user had manually entered something like %2c as part of the name. I'm not sure this very edge case is worth the overhead and potential confusion to users (to know what the setting does and when to use it vs not) to have to maintain the percent_encode_name config setting. I also think that if we decode the filenames we are misrepresenting them. The encoded name is what is on disk, not the decoded name. I'm in support of NOT decoding filenames for display. We display filenames exactly as they are on disk, even if that means the CWD session filenames look a little strange.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 11, 2024

A quick test from me looks like the autosave issue is fixed. Closes #60

Edit: Either I don't know the correct syntax or I don't have permission to close #60.

@jedrzejboczar
Copy link
Owner

Thanks for quick test :D

As you suggested @josh-nz I removed the config option and we will just always encode - this should be less confusing for users.

@josh-nz
Copy link
Contributor

josh-nz commented Jun 11, 2024

I think this is good to merge in. Any other issues that arise can be separate issues? I don't think there are any breaking changes that this PR will introduce.

(I've been working on updating my implementations for "remove last symlink and use mtime" or "load CWD session based on session.cwd from most recent session" based on the recent changes here)

…sions

Encode/decode session names using percent-encoding syntax commonly used
in web development to encode URLs. This way we can encode paths as
session file names to save CWD sessions. Now we always decode session
name from JSON data, name->filename is just for generating unique files.
@jedrzejboczar jedrzejboczar merged commit 9851b91 into jedrzejboczar:master Jun 11, 2024
3 checks passed
@jedrzejboczar
Copy link
Owner

Merged. Thanks @awerebea @josh-nz for the idea, implementation and feedback!

@josh-nz
Copy link
Contributor

josh-nz commented Jun 11, 2024

Thanks @jedrzejboczar for your help and creating this plugin, and @awerebea for getting this particular feature started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants