Don't parse cabal files twice #3615
This improves on the previous warning hack to keep a cache of parsed GenericPackageDescriptions, and avoid rerunning hpack. There are some TODOs added in this commit. One further point of concern: should we opt-out of caching the results of parsing index files? I'm imagining that when loading a snapshot, this may result in a lot of memory usage. (Then again, this may already be the case, see #3586.)
This is a large change that fell out from trying to clean up the mess left from the previous commit. The result here should be a significant simplification of the code paths around parsing cabal files. In fact, there are a few existing TODOs that got hit by this.
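As a rough illustration of the caching approach (a hedged sketch, not Stack's actual code: `loadGPD`, the `String` stand-in for a parsed `GenericPackageDescription`, and the parse counter are all invented for this example), keying the cache on the package location means the expensive parse-plus-hpack step runs at most once per location:

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins: in Stack the key would be the package directory
-- and the value a parsed GenericPackageDescription.
type Cache = IORef (Map.Map FilePath String)

-- Return the cached result for a directory, parsing only on a cache miss.
loadGPD :: IORef Int -> Cache -> FilePath -> IO String
loadGPD parseCount cache dir = do
  cached <- Map.lookup dir <$> readIORef cache
  case cached of
    Just gpd -> return gpd
    Nothing  -> do
      modifyIORef' parseCount (+ 1)   -- stands in for the expensive parse/hpack
      let gpd = "parsed:" ++ dir
      modifyIORef' cache (Map.insert dir gpd)
      return gpd

main :: IO ()
main = do
  parseCount <- newIORef (0 :: Int)
  cache <- newIORef Map.empty
  mapM_ (loadGPD parseCount cache) ["pkg-a", "pkg-b", "pkg-a", "pkg-a"]
  readIORef parseCount >>= print  -- 2: each directory parsed only once
```

The memory-usage concern above is visible in this shape too: the cache retains every value it ever produced, which is why opting out of caching for index files may be worth considering.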
@borsboom This is the PR that implements the last feature @mgsloan and I had discussed for the 1.6 release. It uses a cache to avoid reparsing cabal files. Unfortunately this patch became much larger than I'd intended. I'm going to review it myself again tomorrow once I have a clearer head again (been staring at this for way too long). I'm OK with pushback saying that this shouldn't make it into 1.6, I can at the very least take the hpack-related subset of this (relatively tiny) to avoid the most egregious user-facing output.
Seems like a good improvement and cleanup to me! If the cabal files are cached, is there any point to only running hpack once? I think with this sort of change, it's hopefully safe if we spend a day using stack built with it. On the other hand, this optimization might not be all that high impact. Good to have, certainly. If avoiding regressions is the primary concern, then it may make sense to merge the hpack bits, and merge the rest after the release.
Could we serve the index cache directly, to avoid processing it at all? Could be nice! Idea here would be to only serve the index cache for the most recent version of stack (within compatibility - the fetch url would have index cache version in it). Older versions would need to fetch and recompute.
That's probably true. And the integration tests should do a lot for us as well.
I'm not sure what you mean here. What I was concerned about was the possibility that, while processing a snapshot file, we'll end up holding onto too many `GenericPackageDescription`s in memory.
This is really more related to #3586 I think. I thought that populateCache did a lot more work than it does, forgetting a lot of info from the Index. My thought was that it might be nice to instead just directly serve the file it ends up storing as a cache here. That'd potentially lag behind hackage, though. Perhaps not worth the complexity. Seems like it ought to be possible to load the GenericPackageDescriptions into memory lazily.
Loading into memory lazily would be worse: it would require keeping a closure that retains a bunch of huge `GenericPackageDescription`s.
I was thinking deserialization via this hypothetical store feature mgsloan/store#44 + memory mapped file so it doesn't need to all be in memory. Probably not worth the effort. I'm not sure how much retention is a problem for stack, since it usually doesn't run for long (except perhaps --file-watch). It might be worth adjusting the RTS to rarely do major GCs, use a big alloc space. Resource constrained environments might not appreciate that, though.
Instead: just cache the results of cabal file parsing, and run hpack when doing so. This (as with the previous few patches) involved much more overhaul than it seems it should have. The best way to do this reliably is to only expose a single function from Stack.Package which can run hpack. In turn, this ended up requiring a conversion of a bunch of parts of the code base from passing around Path Abs File (pointing to the cabal file itself) to instead pass around Path Abs Dir (pointing to the directory). I think this is a good change, once again simplifying things a bit more.
I just pushed another fairly large patch making it unnecessary to check the hpack cache; see the comment on that commit for why this was a bigger change than it seems should be warranted: d18c620
LGTM!
Maybe we should include this whole PR in v1.6, but put out a release candidate and get people to test it for a week or so before we make the full release.
@borsboom Isn't the plan to make a release candidate for this release regardless? Regarding the status of this patch: all integration tests but one pass. I'm doing stricter testing of cabal file name/package name matching than previously, and the following now fails:
The |
Alright, this is a bug in either my filesystem or in the GHC character encoding handling. With this program:

```haskell
#!/usr/bin/env stack
-- stack --resolver lts-9.9 script
import System.Directory (getDirectoryContents)
import Data.List (isPrefixOf)

name :: String
name = "prefix-ば日本-4本"

main :: IO ()
main = do
  writeFile name "foo"
  getDirectoryContents "." >>= mapM_ print . filter ("prefix-" `isPrefixOf`)
  print name
```

I get the output:
Notice the mismatch. And even though previous versions of Stack would pass this integration test, they would immediately fail on trying to build these packages:
If there's no objection, I'd like to comment out this part of the integration test. Sound reasonable?
IIRC there's more than one valid Unicode representation of the same string, and macOS's filesystem uses a different form than GHC. We're using `Data.Text.Normalize.normalize` in a few places to switch to a "canonical" representation, maybe that could help here? I'm pretty sure this has caused real problems in the past, so I don't think commenting out the integration test is a good idea.
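The kind of mismatch described here is visible with plain string equality, no normalization library required. A minimal sketch, using hard-coded code points for the NFC and NFD spellings of ば from the failing test (assumption: the filesystem reports the NFD form, as HFS+ does):

```haskell
-- U+3070 is precomposed ば (NFC); U+306F U+3099 is は plus the combining
-- voiced-sound mark (NFD). Visually identical, different code points.
main :: IO ()
main = do
  let nfc = "\x3070"          -- as written in the cabal file / package name
      nfd = "\x306F\x3099"    -- as an NFD-normalizing filesystem may report it
  print (nfc == nfd)              -- False: plain comparison sees two strings
  print (length nfc, length nfd)  -- (1,2): different numbers of code points
```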
Pinging @harendra-kumar, who I believe debugged and fixed this sort of thing in the past.
It gets better: from everything I can see, with the "corrected" Unicode points (as my OS is reporting them), I now get an error from the |
Looks like this was covered in #1810, and before commit 12f9d01, this case was disabled on OS X. Pinging @Blaisorblade as well. |
I'm not sure if you're seeing a bug or a regression. Either way, a fix might be "easy" enough. I think @borsboom is simply right (some commits show he looked into this test earlier on): when comparing, you need to normalize both sides to NFC, similarly to https://github.com/commercialhaskell/stack/pull/2397/files#diff-b8b3eca5371c1446794562093981903cL563. TL;DR: Unicode has multiple normal forms. The NFD normalization writes è as e + a combining character for the accent `, or (in this case) the accented Hiragana character ば as は + a combining character. I've never seen such problems outside OS X; I'm not sure what's guaranteed, but NFC seems more common. However, IIUC Linux filesystems don't do any normalization and they make few assumptions about the encoding. I think there are no guarantees for file contents, on OS X or elsewhere, so Stack should normalize their contents to NFC.
Theory: IIRC that parser demands letters (or numbers), and accented letters are still letters, but combining accents are not, hence the failure. In other words, that parser requires NFC but doesn't document it.

**Architectural ideas**

Ideally we should standardize what is normalized and what isn't: Unicode strings, NFC strings and NFD strings aren't quite the same data type. But I'm not advocating using different Haskell types, especially for this release, and it seems overkill.
**Background on Unicode normalization**
To clarify: I'd advise against disabling the test, especially if you're seeing regressions (which seems debatable), unless adding the needed calls to normalize is harder than it seems. Please don't think it's hard just because I wrote so much to explain the background.
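A toy version of the normalize-before-compare fix being advocated here. Real code would call `normalize NFC` from `Data.Text.Normalize` (unicode-transforms); in this self-contained sketch the composition table is reduced to the single pair from this bug, purely for illustration:

```haskell
-- Toy NFC composition: handles only は + U+3099 -> ば. A real fix would use
-- Data.Text.Normalize.normalize NFC instead of this one-pair table.
toyNFC :: String -> String
toyNFC ('\x306F' : '\x3099' : rest) = '\x3070' : toyNFC rest
toyNFC (c : rest)                   = c : toyNFC rest
toyNFC []                           = []

main :: IO ()
main = do
  let fromDisk = "prefix-\x306F\x3099"  -- NFD, as the filesystem reports it
      expected = "prefix-\x3070"        -- NFC, as written by the user
  print (fromDisk == expected)                -- False without normalization
  print (toyNFC fromDisk == toyNFC expected)  -- True once both sides are NFC
```

Normalizing both sides before the comparison is what makes the check robust: it doesn't matter which form the filesystem hands back.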
Thanks for the details. I understand the trade-offs around Unicode normalization. My strong hesitation on pursuing this path is twofold:

1. As @borsboom mentioned: we want to avoid adding a text-icu dependency. And we definitely want to avoid implementing Unicode normalization ourselves.
2. I don't think this is a worthwhile goal to strive for. I'd probably go in the opposite direction: add a warning for package names which are outside of the basic Latin character range.
   - The tooling is clearly not well developed for it
   - We're likely to run up against OS-specific bugs like this one
   - It's a great vector for security exploits by using similar characters
But on this specific PR: even though this test was working in the past, that was only because the test wasn't thorough enough. Any attempt to build the generated code would have failed. Do you have an objection to separating off the discussion of fixing the case of building such packages into a separate issue? |
We're depending on unicode-transforms for this, we shouldn't need text-icu. Still, it sounds like this is not an actual regression, and clearly this isn't the time to *add* support for Unicode package names. It seems worth looking into why there is such a test at all; maybe there's info we're missing. If nothing turns up, I'm fine with splitting out a proper fix.
I don't think you need |
I wasn't aware of unicode-transforms, thanks. To clarify my "not a regression" claim: previously, … I'm going to open a new issue about it, and add a new integration test that demonstrates the old and new bugs.
I wrote the unicode-transforms package precisely for #1810. @Blaisorblade fixed the issue later. I was a bit hesitant about making a point fix for it because it is fragile and can break easily. Ideally we need a version of the text package that automatically normalizes all text to a common form so that the programmer does not need to worry about normalizing before comparison. This should not be too difficult now that we have unicode-transforms; I proposed it in the text package, but responses were slow and then I could not find time to follow up on it. Anyway, I have not looked at the cause of this specific issue; it may be OK to make a point fix for this particular case (and maybe other similar cases that may be lurking around) by doing a normalized comparison using unicode-transforms. BTW, there is also the http://hackage.haskell.org/package/normalization-insensitive package, based on unicode-transforms.