
Don't parse cabal files twice #3615

Merged
merged 6 commits into from
Dec 1, 2017

Conversation

snoyberg
Copy link
Contributor

This improves on the previous warning hack to keep a cache of parsed
GenericPackageDescriptions, and avoid rerunning hpack.

There are some TODOs added in this commit. One further point of concern:
should we opt-out of caching the results of parsing index files? I'm
imagining that when loading a snapshot, this may result in a lot of
memory usage. (Then again, this may already be the case, see #3586.)
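The caching described here might look something like the following minimal sketch, keyed by file path so each cabal file is parsed (and hpack run) at most once per invocation. All names are illustrative, not Stack's actual internals:

```haskell
import qualified Data.Map.Strict as Map
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')

-- A mutable map from keys (e.g. cabal file paths) to parse results
-- (e.g. GenericPackageDescription). Hypothetical helper, not Stack's API.
type ParseCache k v = IORef (Map.Map k v)

newParseCache :: IO (ParseCache k v)
newParseCache = newIORef Map.empty

-- Return the cached value for a key, running the parse action only on a miss
-- and storing its result for subsequent lookups.
withParseCache :: Ord k => ParseCache k v -> k -> IO v -> IO v
withParseCache ref key parse = do
  m <- readIORef ref
  case Map.lookup key m of
    Just v  -> return v
    Nothing -> do
      v <- parse
      atomicModifyIORef' ref (\m' -> (Map.insert key v m', ()))
      return v
```

The memory-usage concern above corresponds to this map retaining every parsed value for the life of the process.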


This is a large change that fell out from trying to clean up the mess
left from the previous commit. The result here should be a significant
simplification of the code paths around parsing cabal files. In fact,
there are a few existing TODOs that got hit by this.
@snoyberg
Copy link
Contributor Author

@borsboom This is the PR that implements the last feature @mgsloan and I had discussed for the 1.6 release. It uses a cache to avoid reparsing cabal files.

Unfortunately this patch became much larger than I'd intended. I'm going to review it myself again tomorrow once I have a clearer head again (been staring at this for way too long). I'm OK with pushback saying that this shouldn't make it into 1.6, I can at the very least take the hpack-related subset of this (relatively tiny) to avoid the most egregious user-facing output.

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Seems like a good improvement and cleanup to me! If the cabal files are cached, is there any point to only running hpack once?

I think with this sort of change, it's hopefully safe if we spend a day using stack built with it. On the other hand, this optimization might not be all that high impact. Good to have, certainly.

If avoiding regressions is the primary concern, then it may make sense to merge the hpack bits, and merge the rest after the release.

There are some TODOs added in this commit. One further point of concern:
should we opt-out of caching the results of parsing index files? I'm
imagining that when loading a snapshot, this may result in a lot of
memory usage. (Then again, this may already be the case, see #3586.)

Could we serve the index cache directly, to avoid processing it at all? Could be nice! Idea here would be to only serve the index cache for the most recent version of stack (within compatibility - the fetch url would have index cache version in it). Older versions would need to fetch and recompute.

@snoyberg
Copy link
Contributor Author

I think with this sort of change, it's hopefully safe if we spend a day using stack built with it.

That's probably true. And the integration tests should do a lot for us as well.

Could we serve the index cache directly, to avoid processing it at all?

I'm not sure what you mean here. What I was concerned about was the possibility that, while processing a snapshot file, we'll end up holding onto too many GenericPackageDescriptions in memory. (I just realized that #3586 is probably irrelevant here.)

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Could we serve the index cache directly, to avoid processing it at all?

I'm not sure what you mean here. What I was concerned about was the possibility that, while processing a snapshot file, we'll end up holding onto too many GenericPackageDescriptions in memory.

This is really more related to #3586 I think. I thought that populateCache did a lot more work than it does, forgetting a lot of info from the Index. My thought was that it might be nice to instead just directly serve the file it ends up storing as a cache here. That'd potentially lag behind hackage, though. Perhaps not worth the complexity.

Seems like it ought to be possible to load the GenericPackageDescriptions into memory lazily.

@snoyberg
Copy link
Contributor Author

Loading into memory lazily would be worse: it would require keeping a closure that retains a bunch of huge ByteStrings.

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Loading into memory lazily would be worse: it would require keeping a closure that retains a bunch of huge ByteStrings.

I was thinking deserialization via this hypothetical store feature mgsloan/store#44 + memory mapped file so it doesn't need to all be in memory. Probably not worth the effort.

I'm not sure how much retention is a problem for stack, since it usually doesn't run for long (except perhaps --file-watch). It might be worth adjusting the RTS to rarely do major GCs, use a big alloc space. Resource constrained environments might not appreciate that, though.

Instead: just cache the results of cabal file parsing, and run hpack
when doing so. This (as the previous few patches) involved much more
overhaul than it seems like it should have. The best way to do this
reliably is to only expose a single function from Stack.Package which
can run hpack. In turn, this ended up requiring a conversion of a bunch
of parts of the code base from passing around Path Abs File (pointing to
the cabal file itself) to instead passing around Path Abs Dir (pointing
to the directory).

I think this is a good change, once again simplifying things a bit
more.
@snoyberg
Copy link
Contributor Author

If the cabal files are cached, is there any point to only running hpack once?

I just pushed another fairly large patch making it unnecessary to check the hpack cache; see the comment on that commit for why this was a bigger change than seems warranted: d18c620

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

LGTM!

@borsboom
Copy link
Contributor

Maybe we should include this whole PR in v1.6, but put out a release candidate and get people to test it for a week or so before we make the full release.

@snoyberg
Copy link
Contributor Author

@borsboom Isn't the plan to make a release candidate for this release regardless?

Regarding status of this patch: all integration tests but one pass. I'm doing stricter testing of cabal file name/package name matching than previously, and the following now fails:

$ stack new ば日本-4本
...
("\12400\26085\26412-4\26412.cabal","\12399\12441\26085\26412-4\26412.cabal")
cabal file path /Users/michael/Documents/stack/test/integration/tests/1336-1337-new-package-names/ば日本-4本/ば日本-4本/ば日本-4本.cabal does not match the package name it defines.
Please rename the file to: ば日本-4本.cabal
For more information, see: https://github.com/commercialhaskell/stack/issues/317

The ("\12400\26085\26412-4\26412.cabal","\12399\12441\26085\26412-4\26412.cabal") bit is something I added for debugging. I'm testing some theories on what's causing this. I'm tempted to consider this an acceptable behavior change, but I'll make a stronger argument once I understand what's happening.

@snoyberg
Copy link
Contributor Author

Alright, this is a bug in either my filesystem or in the GHC character encoding handling. With this program:

#!/usr/bin/env stack
-- stack --resolver lts-9.9 script
import System.Directory (getDirectoryContents)
import Data.List (isPrefixOf)

name :: String
name = "prefix-ば日本-4本"

main :: IO ()
main = do
  writeFile name "foo"
  getDirectoryContents "." >>= mapM_ print . filter ("prefix-" `isPrefixOf`)
  print name

I get the output:

$ ./Main.hs 
"prefix-\12399\12441\26085\26412-4\26412"
"prefix-\12400\26085\26412-4\26412"

Notice the mismatch. And even though previous versions of Stack would pass this integration test, they would immediately fail on trying to build these packages:

$ /usr/local/bin/stack --version
Version 1.5.0, Git revision 63267f94d7c819cbecc2d59aa259d17240838e43 (4845 commits) x86_64 hpack-0.17.1
$ /usr/local/bin/stack new ば日本-4本
...
$ /usr/local/bin/stack build
cabal file path /Users/michael/Desktop/ば日本-4本/ば日本-4本.cabal does not match the package name it defines.
Please rename the file to: ば日本-4本.cabal
For more information, see: https://github.com/commercialhaskell/stack/issues/317

If there's no objection, I'd like to comment out this part of the integration test. Sound reasonable?

@borsboom
Copy link
Contributor

borsboom commented Nov 30, 2017

IIRC there's more than one valid Unicode representation of the same string, and macOS's filesystem uses a different form than GHC. We're using Data.Text.Normalize.normalize in a few places to switch to a "canonical" representation, maybe that could help here?

I'm pretty sure this has caused real problems in the past, so I don't think commenting out the integration test is a good idea.

@borsboom
Copy link
Contributor

Pinging @harendra-kumar, who I believe has debugged and fixed this sort of thing in the past.

@snoyberg
Copy link
Contributor Author

It gets better: from everything I can see, with the "corrected" Unicode points (as my OS is reporting them), I now get an error from the PackageName parser saying that it's an invalid package name.

@snoyberg
Copy link
Contributor Author

Looks like this was covered in #1810, and before commit 12f9d01, this case was disabled on OS X. Pinging @Blaisorblade as well.

@Blaisorblade
Copy link
Collaborator

I’m not sure if you’re seeing a bug or a regression. Either way, a fix might be “easy” enough.

I think @borsboom is simply right (some commits show he looked into this test earlier on): when comparing, you need to normalize both sides to NFC, similarly to https://github.com/commercialhaskell/stack/pull/2397/files#diff-b8b3eca5371c1446794562093981903cL563.

TL;DR: Unicode has multiple normal forms. Normalization form NFD writes è as e + a combining character for the accent, or (in this case) the accented Hiragana character ば as は + a combining dakuten mark, while NFC uses a single precomposed character. OS X normalizes file names to NFD unlike everything else, causing the pain you’re seeing. rsync across platforms is even more fun.

I’ve never seen such problems outside OS X; I’m not sure what’s guaranteed, but NFC seems more common. However, IIUC Linux FSs don’t do any normalization and they make few assumptions on the encoding.

I think there are no guarantees for file contents, on OS X or elsewhere, so Stack should normalize them to NFC.
(There’s also NFKC/NFKD, but normalizing to NFKC/NFKD destroys information, so I’d avoid it.)
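As a concrete illustration of the NFC/NFD difference, the two spellings seen earlier in this thread can be compared with the Data.Text.Normalize.normalize function @borsboom mentioned (from the unicode-transforms package). This is a standalone sketch, not code from the PR:

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.9 script --package unicode-transforms
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

main :: IO ()
main = do
  let nfc = T.pack "\12400"        -- ば as one precomposed character (U+3070)
      nfd = T.pack "\12399\12441"  -- は (U+306F) + combining dakuten (U+3099)
  print (nfc == nfd)               -- False: the raw forms differ
  print (normalize NFC nfd == nfc) -- True once both sides are NFC
  print (normalize NFD nfc == nfd) -- True once both sides are NFD
```

Applying the same normalize call to both sides before comparing a file name against a package name would make the check insensitive to whichever form the filesystem reports.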

It gets better: from everything I can see, with the "corrected" Unicode points (as my OS is reporting them), I now get an error from the PackageName parser saying that it's an invalid package name.

Theory: IIRC that parser demands letters (or numbers), and accented letters are still letters, but combining accents are not, hence the failure. In other words, that parser requires NFC but doesn’t document it.
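That theory is easy to check with GHC's Data.Char (a standalone sketch, independent of Stack's actual parser): the precomposed character counts as a letter, but the combining mark does not, so a parser that accepts only letters and digits rejects the NFD spelling.

```haskell
import Data.Char (generalCategory, isLetter, isMark)

main :: IO ()
main = do
  print (isLetter '\12400')         -- True:  ば (U+3070), precomposed
  print (isLetter '\12441')         -- False: combining dakuten (U+3099)
  print (isMark '\12441')           -- True:  it is a combining mark
  print (generalCategory '\12441')  -- NonSpacingMark
```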

Architectural ideas

Ideally we should standardize what is normalized and what isn’t — Unicode strings, NFC strings and NFD strings aren’t quite the same data type. But I’m not advocating using different Haskell types, especially for this release, and it seems overkill.

  • Your tests show package names are expected to be in NFC.
  • But your tests show that “everything must use NFC” won’t quite work if you want to (naively) look for a file name in, e.g., the getDirectoryContents output. But do you need that? Because openFile seems to do the right thing. I’d still normalize to NFC as soon as I turn file names into package names.

Background on Unicode normalization

https://stackoverflow.com/a/7934397/53974
#1810 (comment)

@Blaisorblade
Copy link
Collaborator

To clarify: I’d advise against disabling the test, especially if you’re seeing regressions (which seems debatable), unless adding the needed calls to normalize is harder than it seems. Please don’t think it’s hard just because I wrote so much to explain the background.

@snoyberg
Copy link
Contributor Author

Thanks for the details. I understand the trade-offs around Unicode normalization. My strong hesitation on pursuing this path is twofold:

  1. As @borsboom mentioned: we want to avoid adding a text-icu dependency. And we definitely want to avoid implementing Unicode normalization ourselves.
  2. I don't think this is a worthwhile goal to strive for. I'd probably go in the opposite direction: add a warning for package names which are outside of the basic Latin character range.
    • The tooling is clearly not well developed for it
    • We're likely to run up against OS-specific bugs like this one
    • It's a great vector for security exploits by using similar characters

But on this specific PR: even though this test was working in the past, that was only because the test wasn't thorough enough. Any attempt to build the generated code would have failed. Do you have an objection to separating off the discussion of fixing the case of building such packages into a separate issue?

@Blaisorblade
Copy link
Collaborator

Blaisorblade commented Nov 30, 2017 via email

@borsboom
Copy link
Contributor

I don't think you need text-icu; we're already using unicode-transforms to normalize Unicode. I agree that if this isn't actually regression, it should be a separate issue.

@snoyberg
Copy link
Contributor Author

I wasn't aware of unicode-transforms, thanks.

To clarify on my "not a regression" claim: previously, stack new X (for some X) would succeed, but stack build inside that directory would fail. Now stack new X will also fail.

I'm going to open a new issue about this, and add a new integration test that demonstrates both the old and the new bug.

snoyberg added a commit that referenced this pull request Nov 30, 2017
This newly added integration test demonstrates the bug described in #3616,
even using Stack 1.5.1. This demonstrates that PR #3615 does not
introduce a regression, and sets a bar for properly fixing this problem.
@harendra-kumar
Copy link
Collaborator

I wrote the unicode-transforms package precisely for #1810. @Blaisorblade fixed the issue later. I was a bit hesitant about making a point fix for it because it is fragile and can break easily. Ideally we need a version of the text package that automatically normalizes all text to a common form, so that the programmer does not need to worry about normalizing before comparison. This should not be too difficult now that we have unicode-transforms; I proposed it in the text package, but responses were slow and then I could not find time to follow it up.

Anyway, I have not looked at the cause of this specific issue, it may be ok to make a point fix for this particular case (and maybe other similar cases that may be lurking around) doing a normalized comparison using unicode-transforms. BTW, there is also a http://hackage.haskell.org/package/normalization-insensitive package based on unicode-transforms.

@snoyberg snoyberg merged commit 56e7ae0 into master Dec 1, 2017
@snoyberg snoyberg deleted the parse-cabal-files-once branch December 1, 2017 08:03