Document that readFields expects input to be UTF8
For example `parseGenericPackageDescription` validates input it passes to `readFields'`.

That is desirable: we want to try to parse non-UTF8 inputs, and report a warning if they are not UTF8. (There are non-UTF8 .cabal files on Hackage, e.g. with German names.)
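
The division of labour described above, where the caller checks the encoding and the parser runs either way, can be sketched roughly as follows. This is only an illustration under assumptions: `isValidUtf8` and `parseWithUtf8Check` are hypothetical names, not part of Cabal, and the parser is passed in as a plain function, so the sketch says nothing about how `parseGenericPackageDescription` actually does it.

```haskell
import qualified Data.ByteString as BS
import Data.Either (isRight)
import qualified Data.Text.Encoding as T

-- | Hypothetical helper (not part of Cabal): True when the bytes decode as UTF8.
isValidUtf8 :: BS.ByteString -> Bool
isValidUtf8 = isRight . T.decodeUtf8'

-- | Hypothetical wrapper: always run the given parser on the raw bytes, and
-- return alongside its result whether the input was valid UTF8, so the
-- caller can turn a 'False' into a warning instead of a hard failure.
parseWithUtf8Check
  :: (BS.ByteString -> Either e a)  -- ^ any parser, e.g. something like 'readFields'
  -> BS.ByteString
  -> (Bool, Either e a)
parseWithUtf8Check parse bs = (isValidUtf8 bs, parse bs)

main :: IO ()
main = do
  -- "foo" in ASCII is valid UTF8.
  print (isValidUtf8 (BS.pack [0x66, 0x6f, 0x6f]))
  -- "foo" followed by a latin1-encoded sharp s (0xdf) is not valid UTF8.
  print (isValidUtf8 (BS.pack [0x66, 0x6f, 0x6f, 0xdf]))
```

Keeping the validity check outside the parser means non-UTF8 files can still be parsed, with the encoding problem surfaced separately.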
phadej authored and ffaf1 committed Jun 22, 2023
1 parent c116195 commit 117ea70
Showing 1 changed file with 30 additions and 1 deletion.
31 changes: 30 additions & 1 deletion Cabal-syntax/src/Distribution/Fields/Parser.hs
@@ -58,6 +58,10 @@ import qualified Data.Text.Encoding as T
import qualified Data.Text.Encoding.Error as T
#endif

-- $setup
-- >>> :set -XOverloadedStrings
-- >>> import Data.Either (isLeft)

-- | The 'LexState'' (with a prime) is an instance of parsec's 'Stream'
-- wrapped around lexer's 'LexState' (without a prime)
data LexState' = LexState' !LexState (LToken, LexState')
@@ -331,10 +335,35 @@ fieldInlineOrBraces name =
)

-- | Parse cabal style 'B8.ByteString' into list of 'Field's, i.e. the cabal AST.
--
-- 'readFields' assumes that the input 'B8.ByteString' is valid UTF8; specifically, it doesn't validate that the input is valid UTF8.
-- Therefore the bytestrings inside the returned 'Field's will be invalid UTF8 if the input was.
--
-- >>> readFields "foo: \223"
-- Right [Field (Name (Position 1 1) "foo") [FieldLine (Position 1 6) "\223"]]
--
-- 'readFields' won't (necessarily) fail on invalid UTF8 data, but the reported positions may be off.
--
-- __You may get weird errors on non-UTF8 input__, for example 'readFields' will fail on latin1 encoded non-breaking space:
--
-- >>> isLeft (readFields "\xa0 foo: bar")
-- True
--
-- That is rejected because the parser thinks @\\xa0@ is a section name,
-- and section arguments may not contain a colon.
-- If there are just latin1 non-breaking spaces, they become part of the name:
--
-- >>> readFields "\xa0\&foo: bar"
-- Right [Field (Name (Position 1 1) "\160foo") [FieldLine (Position 1 7) "bar"]]
--
-- The UTF8 non-breaking space is accepted as an indentation character (but warned about by 'readFields'').
--
-- >>> readFields' "\xc2\xa0 foo: bar"
-- Right ([Field (Name (Position 1 3) "foo") [FieldLine (Position 1 8) "bar"]],[LexWarning LexWarningNBSP (Position 1 1)])
readFields :: B8.ByteString -> Either ParseError [Field Position]
readFields s = fmap fst (readFields' s)

-- | Like 'readFields' but also return lexer warnings.
readFields' :: B8.ByteString -> Either ParseError ([Field Position], [LexWarning])
readFields' s = do
  parse parser "the input" lexSt
