
Don't parse cabal files twice #3615

Merged
merged 6 commits into from
Dec 1, 2017

Conversation

snoyberg
Copy link
Contributor

This improves on the previous warning hack to keep a cache of parsed
GenericPackageDescriptions, and avoid rerunning hpack.

There are some TODOs added in this commit. One further point of concern:
should we opt-out of caching the results of parsing index files? I'm
imagining that when loading a snapshot, this may result in a lot of
memory usage. (Then again, this may already be the case, see #3586.)
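The caching described here might look something like the following minimal sketch, keyed by file path so each cabal file is parsed (and hpack run) at most once per invocation. All names are illustrative, not Stack's actual internals:

```haskell
import qualified Data.Map.Strict as Map
import Data.IORef (IORef, newIORef, readIORef, atomicModifyIORef')

-- A mutable map from keys (e.g. cabal file paths) to parse results
-- (e.g. GenericPackageDescription). Hypothetical helper, not Stack's API.
type ParseCache k v = IORef (Map.Map k v)

newParseCache :: IO (ParseCache k v)
newParseCache = newIORef Map.empty

-- Return the cached value for a key, running the parse action only on a miss
-- and storing its result for subsequent lookups.
withParseCache :: Ord k => ParseCache k v -> k -> IO v -> IO v
withParseCache ref key parse = do
  m <- readIORef ref
  case Map.lookup key m of
    Just v  -> return v
    Nothing -> do
      v <- parse
      atomicModifyIORef' ref (\m' -> (Map.insert key v m', ()))
      return v
```

The memory-usage concern above corresponds to this map retaining every parsed value for the life of the process.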


This is a large change that fell out from trying to clean up the mess
left from the previous commit. The result here should be a significant
simplification of the code paths around parsing cabal files. In fact,
there are a few existing TODOs that got hit by this.
@snoyberg
Copy link
Contributor Author

@borsboom This is the PR that implements the last feature @mgsloan and I had discussed for the 1.6 release. It uses a cache to avoid reparsing cabal files.

Unfortunately this patch became much larger than I'd intended. I'm going to review it myself again tomorrow once I have a clearer head again (been staring at this for way too long). I'm OK with pushback saying that this shouldn't make it into 1.6, I can at the very least take the hpack-related subset of this (relatively tiny) to avoid the most egregious user-facing output.

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Seems like a good improvement and cleanup to me! If the cabal files are cached, is there any point to only running hpack once?

I think with this sort of change, it's hopefully safe if we spend a day using stack built with it. On the other hand, this optimization might not be all that high impact. Good to have, certainly.

If avoiding regressions is the primary concern, then it may make sense to merge the hpack bits, and merge the rest after the release.

There are some TODOs added in this commit. One further point of concern:
should we opt-out of caching the results of parsing index files? I'm
imagining that when loading a snapshot, this may result in a lot of
memory usage. (Then again, this may already be the case, see #3586.)

Could we serve the index cache directly, to avoid processing it at all? Could be nice! Idea here would be to only serve the index cache for the most recent version of stack (within compatibility - the fetch url would have index cache version in it). Older versions would need to fetch and recompute.

@snoyberg
Copy link
Contributor Author

I think with this sort of change, it's hopefully safe if we spend a day using stack built with it.

That's probably true. And the integration tests should do a lot for us as well.

Could we serve the index cache directly, to avoid processing it at all?

I'm not sure what you mean here. What I was concerned about was the possibility that, while processing a snapshot file, we'll end up holding onto too many GenericPackageDescriptions in memory. (I just realized that #3586 is probably irrelevant here.)

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Could we serve the index cache directly, to avoid processing it at all?

I'm not sure what you mean here. What I was concerned about was the possibility that, while processing a snapshot file, we'll end up holding onto too many GenericPackageDescriptions in memory.

This is really more related to #3586 I think. I thought that populateCache did a lot more work than it does, forgetting a lot of info from the Index. My thought was that it might be nice to instead just directly serve the file it ends up storing as a cache here. That'd potentially lag behind hackage, though. Perhaps not worth the complexity.

Seems like it ought to be possible to load the GenericPackageDescriptions into memory lazily.

@snoyberg
Copy link
Contributor Author

Loading into memory lazily would be worse: it would require keeping a closure that retains a bunch of huge ByteStrings.

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

Loading into memory lazily would be worse: it would require keeping a closure that retains a bunch of huge ByteStrings.

I was thinking deserialization via this hypothetical store feature mgsloan/store#44 + memory mapped file so it doesn't need to all be in memory. Probably not worth the effort.

I'm not sure how much retention is a problem for stack, since it usually doesn't run for long (except perhaps --file-watch). It might be worth adjusting the RTS to rarely do major GCs, use a big alloc space. Resource constrained environments might not appreciate that, though.

Instead: just cache the results of cabal file parsing, and run hpack
when doing so. This (as the previous few patches) involved much more
overhaul than it seems like it should have. The best way to do this
reliably is to only expose a single function from Stack.Package which
can run hpack. In turn, this ended up requiring a conversion of a bunch
of parts of the code base from passing around Path Abs File (pointing to
the cabal file itself) to instead passing around Path Abs Dir (pointing
to the directory).

I think this is a good change, once again simplifying things a bit
more.
@snoyberg
Copy link
Contributor Author

If the cabal files are cached, is there any point to only running hpack once?

I just pushed another fairly large patch making it unnecessary to check the hpack cache; see the comment on that commit for why this was a bigger change than seems warranted: d18c620

@mgsloan
Copy link
Contributor

mgsloan commented Nov 30, 2017

LGTM!

@borsboom
Copy link
Contributor

Maybe we should include this whole PR in v1.6, but put out a release candidate and get people to test it for a week or so before we make the full release.

@snoyberg
Copy link
Contributor Author

@borsboom Isn't the plan to make a release candidate for this release regardless?

Regarding status of this patch: all integration tests but one pass. I'm doing stricter testing of cabal file name/package name matching than previously, and the following now fails:

$ stack new ば日本-4本
...
("\12400\26085\26412-4\26412.cabal","\12399\12441\26085\26412-4\26412.cabal")
cabal file path /Users/michael/Documents/stack/test/integration/tests/1336-1337-new-package-names/ば日本-4本/ば日本-4本/ば日本-4本.cabal does not match the package name it defines.
Please rename the file to: ば日本-4本.cabal
For more information, see: https://github.com/commercialhaskell/stack/issues/317

The ("\12400\26085\26412-4\26412.cabal","\12399\12441\26085\26412-4\26412.cabal") bit is something I added for debugging. I'm testing some theories on what's causing this. I'm tempted to consider this an acceptable behavior change, but I'll make a stronger argument once I understand what's happening.

@snoyberg
Copy link
Contributor Author

Alright, this is a bug in either my filesystem or in the GHC character encoding handling. With this program:

#!/usr/bin/env stack
-- stack --resolver lts-9.9 script
import System.Directory (getDirectoryContents)
import Data.List (isPrefixOf)

name :: String
name = "prefix-ば日本-4本"

main :: IO ()
main = do
  writeFile name "foo"
  getDirectoryContents "." >>= mapM_ print . filter ("prefix-" `isPrefixOf`)
  print name

I get the output:

$ ./Main.hs 
"prefix-\12399\12441\26085\26412-4\26412"
"prefix-\12400\26085\26412-4\26412"

Notice the mismatch. And even though previous versions of Stack would pass this integration test, they would immediately fail on trying to build these packages:

$ /usr/local/bin/stack --version
Version 1.5.0, Git revision 63267f94d7c819cbecc2d59aa259d17240838e43 (4845 commits) x86_64 hpack-0.17.1
$ /usr/local/bin/stack new ば日本-4本
...
$ /usr/local/bin/stack build
cabal file path /Users/michael/Desktop/ば日本-4本/ば日本-4本.cabal does not match the package name it defines.
Please rename the file to: ば日本-4本.cabal
For more information, see: https://github.com/commercialhaskell/stack/issues/317

If there's no objection, I'd like to comment out this part of the integration test. Sound reasonable?

@borsboom
Copy link
Contributor

borsboom commented Nov 30, 2017

IIRC there's more than one valid Unicode representation of the same string, and macOS's filesystem uses a different form than GHC. We're using Data.Text.Normalize.normalize in a few places to switch to a "canonical" representation, maybe that could help here?

I'm pretty sure this has caused real problems in the past, so I don't think commenting out the integration test is a good idea.

@borsboom
Copy link
Contributor

Pinging @harendra-kumar, who I believe has debugged and fixed this sort of thing in the past.

@snoyberg
Copy link
Contributor Author

It gets better: from everything I can see, with the "corrected" Unicode points (as my OS is reporting them), I now get an error from the PackageName parser saying that it's an invalid package name.

@snoyberg
Copy link
Contributor Author

Looks like this was covered in #1810, and before commit 12f9d01, this case was disabled on OS X. Pinging @Blaisorblade as well.

@Blaisorblade
Copy link
Collaborator

I’m not sure if you’re seeing a bug or a regression. Either way, a fix might be “easy” enough.

I think @borsboom is simply right (some commits show he looked into this test earlier on): when comparing, you need to normalize both sides to NFC, similarly to https://github.com/commercialhaskell/stack/pull/2397/files#diff-b8b3eca5371c1446794562093981903cL563.

TL;DR: Unicode has multiple normal forms. Normalization form NFD writes è as e + a combining character for the accent, or (in this case) the accented Hiragana character ば as は + a combining dakuten mark, while NFC uses a single precomposed character. OS X normalizes file names to NFD unlike everything else, causing the pain you’re seeing. rsync across platforms is even more fun.

I’ve never seen such problems outside OS X; I’m not sure what’s guaranteed, but NFC seems more common. However, IIUC Linux FSs don’t do any normalization and they make few assumptions on the encoding.

I think there are no guarantees for file contents, on OS X or elsewhere, so Stack should normalize them to NFC.
(There’s also NFKC/NFKD, but normalizing to NFKC/NFKD destroys information, so I’d avoid it.)
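As a concrete illustration of the NFC/NFD difference, the two spellings seen earlier in this thread can be compared with the Data.Text.Normalize.normalize function @borsboom mentioned (from the unicode-transforms package). This is a standalone sketch, not code from the PR:

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.9 script --package unicode-transforms
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

main :: IO ()
main = do
  let nfc = T.pack "\12400"        -- ば as one precomposed character (U+3070)
      nfd = T.pack "\12399\12441"  -- は (U+306F) + combining dakuten (U+3099)
  print (nfc == nfd)               -- False: the raw forms differ
  print (normalize NFC nfd == nfc) -- True once both sides are NFC
  print (normalize NFD nfc == nfd) -- True once both sides are NFD
```

Applying the same normalize call to both sides before comparing a file name against a package name would make the check insensitive to whichever form the filesystem reports.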

It gets better: from everything I can see, with the "corrected" Unicode points (as my OS is reporting them), I now get an error from the PackageName parser saying that it's an invalid package name.

Theory: IIRC that parser demands letters (or numbers), and accented letters are still letters, but combining accents are not, hence the failure. In other words, that parser requires NFC but doesn’t document it.
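That theory is easy to check with GHC's Data.Char (a standalone sketch, independent of Stack's actual parser): the precomposed character counts as a letter, but the combining mark does not, so a parser that accepts only letters and digits rejects the NFD spelling.

```haskell
import Data.Char (generalCategory, isLetter, isMark)

main :: IO ()
main = do
  print (isLetter '\12400')         -- True:  ば (U+3070), precomposed
  print (isLetter '\12441')         -- False: combining dakuten (U+3099)
  print (isMark '\12441')           -- True:  it is a combining mark
  print (generalCategory '\12441')  -- NonSpacingMark
```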

Architectural ideas

Ideally we should standardize what is normalized and what isn’t — Unicode strings, NFC strings and NFD strings aren’t quite the same data type. But I’m not advocating using different Haskell types, especially for this release, and it seems overkill.

  • Your tests show package names are expected to be in NFC.
  • But your tests show that “everything must use NFC” won’t quite work if you want to (naively) look for a file name in, e.g., the getDirectoryContents output. But do you need that? Because openFile seems to do the right thing. I’d still normalize to NFC as soon as I turn file names into package names.

Background on Unicode normalization

https://stackoverflow.com/a/7934397/53974
#1810 (comment)

@Blaisorblade
Copy link
Collaborator

To clarify: I’d advise against disabling the test, especially if you’re seeing regressions (which seems debatable), unless adding the needed calls to normalize is harder than it seems. Please don’t think it’s hard just because I wrote so much to explain the background.

@snoyberg
Copy link
Contributor Author

Thanks for the details. I understand the trade-offs around Unicode normalization. My strong hesitation on pursuing this path is twofold:

  1. As @borsboom mentioned: we want to avoid adding a text-icu dependency. And we definitely want to avoid implementing Unicode normalization ourselves.
  2. I don't think this is a worthwhile goal to strive for. I'd probably go in the opposite direction: add a warning for package names which are outside of the basic Latin character range.
    • The tooling is clearly not well developed for it
    • We're likely to run up against OS-specific bugs like this one
    • It's a great vector for security exploits by using similar characters

But on this specific PR: even though this test was working in the past, that was only because the test wasn't thorough enough. Any attempt to build the generated code would have failed. Do you have an objection to separating off the discussion of fixing the case of building such packages into a separate issue?

@Blaisorblade
Copy link
Collaborator

Blaisorblade commented Nov 30, 2017 via email

@borsboom
Copy link
Contributor

I don't think you need text-icu; we're already using unicode-transforms to normalize Unicode. I agree that if this isn't actually regression, it should be a separate issue.

@snoyberg
Copy link
Contributor Author

I wasn't aware of unicode-transforms, thanks.

To clarify on my "not a regression" claim: previously, stack new X (for some X) would succeed, but stack build inside that directory would fail. Now stack new X will also fail.

I'm going to open a new issue about this, and add a new integration test that demonstrates both the old and the new bug.

snoyberg added a commit that referenced this pull request Nov 30, 2017
This newly added integration test demonstrates the bug described in #3616,
even using Stack 1.5.1. This demonstrates that PR #3615 does not
introduce a regression, and sets a bar for properly fixing this problem.
@harendra-kumar
Copy link
Collaborator

I wrote the unicode-transforms package precisely for #1810. @Blaisorblade fixed the issue later. I was a bit hesitant about making a point fix for it because it is fragile and can break easily. Ideally we need a version of the text package that automatically normalizes all text to a common form, so that the programmer does not need to worry about normalizing before comparison. This should not be too difficult now that we have unicode-transforms; I proposed it in the text package, but responses were slow and then I could not find time to follow it up.

Anyway, I have not looked at the cause of this specific issue, it may be ok to make a point fix for this particular case (and maybe other similar cases that may be lurking around) doing a normalized comparison using unicode-transforms. BTW, there is also a http://hackage.haskell.org/package/normalization-insensitive package based on unicode-transforms.

@snoyberg snoyberg merged commit 56e7ae0 into master Dec 1, 2017
@snoyberg snoyberg deleted the parse-cabal-files-once branch December 1, 2017 08:03