forked from jgm/pandoc
Commit
Parse base64-encoded data URIs more efficiently (in some places)

Very long data: URIs in source documents are causing outsized memory usage due to various parsing inefficiencies, for instance in Network.URI, TagSoup, and T.P.R.Markdown.source. See e.g. jgm#10075.

This change improves the situation in a couple of places we can control relatively easily by using an attoparsec text-specialized parser to consume base64-encoded strings. Attoparsec's takeWhile + inClass functions are designed to chew through long strings like this without doing unnecessary allocation, and the improvements in peak heap allocation are significant.

One of the observations here is that if you parse something as a valid data: URI, it shouldn't need any further escaping, so we can short-circuit various processing steps that may unpack/iterate over the chars in the URI.
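To make the takeWhile + inClass idea concrete, here is a minimal standalone sketch of recognising a base64 data: URI with attoparsec. The names `dataUriBase64` and `parseDataUri` are hypothetical and not part of this commit, and the media-type handling is deliberately simplified (it assumes no extra parameters such as `charset` before `;base64,`).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Attoparsec.Text as A

-- | Hypothetical helper (not in the commit): parse a simplified
-- "data:<mime>;base64,<payload>" URI and return (mime, payload).
dataUriBase64 :: A.Parser (Text, Text)
dataUriBase64 = do
  _    <- A.string "data:"
  -- Simplification: the media type runs up to the first ';' or ','.
  mime <- A.takeWhile (\c -> c /= ';' && c /= ',')
  _    <- A.string ";base64,"
  -- The payload is consumed as slices of the input Text rather than
  -- one character at a time.
  body <- A.takeWhile1 (A.inClass "A-Za-z0-9+/")
  pad  <- A.takeWhile (== '=')
  return (mime, body <> pad)

-- | Run the parser on a complete input; trailing text is ignored.
parseDataUri :: Text -> Either String (Text, Text)
parseDataUri = A.parseOnly dataUriBase64
```

Because the payload is matched against a fixed base64 alphabet, a successful parse already tells the caller that the string contains no characters needing escaping, which is the short-circuit the message describes.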
Showing 6 changed files with 111 additions and 13 deletions.
@@ -0,0 +1,39 @@

    {-# LANGUAGE FlexibleContexts #-}
    {-# LANGUAGE OverloadedStrings #-}
    {- |
       Module      : Text.Pandoc.Parsing.Base64
       Copyright   : © 2024 Evan Silberman
       License     : GPL-2.0-or-later
       Maintainer  : John MacFarlane <jgm@berkeley.edu>
    Parse large base64 strings efficiently within Pandoc's
    normal parsing environment
    -}

    module Text.Pandoc.Parsing.Base64
      ( parseBase64String )

    where

    import Data.Text as T
    import Data.Attoparsec.Text as A
    import Text.Parsec (ParsecT, getInput, setInput, incSourceColumn)
    import Text.Pandoc.Sources
    import Control.Monad (mzero)

    parseBase64String :: Monad m => ParsecT Sources u m Text
    parseBase64String = do
      Sources ((pos, txt):rest) <- getInput
      let r = A.parse pBase64 txt
      case r of
        Done remaining consumed -> do
          let pos' = incSourceColumn pos (T.length consumed)
          setInput $ Sources ((pos', remaining):rest)
          return consumed
        _ -> mzero

    pBase64 :: A.Parser Text
    pBase64 = do
      most <- A.takeWhile1 (A.inClass "A-Za-z0-9+/")
      rest <- A.takeWhile (== '=')
      return $ most <> rest
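The `_ -> mzero` branch above covers every attoparsec result other than `Done`. The following sketch, which copies `pBase64` from the diff and adds a hypothetical `classify` helper with an `Outcome` type (neither is in the commit), shows which inputs land in which case when the parser is run with the incremental `A.parse`. Since `A.parse` asks for more input when a `takeWhile` reaches the end of the chunk, text that ends in base64 characters yields `Partial` rather than `Done`, so under this reading `parseBase64String` would only succeed when some other character follows the payload.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Attoparsec.Text as A

-- Copied from the diff above.
pBase64 :: A.Parser Text
pBase64 = do
  most <- A.takeWhile1 (A.inClass "A-Za-z0-9+/")
  rest <- A.takeWhile (== '=')
  return $ most <> rest

-- | Hypothetical type for illustration only.
data Outcome
  = Finished Text Text  -- ^ consumed payload, remaining input
  | NeedsMore           -- ^ attoparsec returned Partial
  | Failed              -- ^ attoparsec returned Fail
  deriving (Eq, Show)

-- | Hypothetical helper: run pBase64 incrementally on one chunk.
classify :: Text -> Outcome
classify txt = case A.parse pBase64 txt of
  A.Done remaining consumed -> Finished consumed remaining
  A.Partial _               -> NeedsMore
  A.Fail{}                  -> Failed
```

For example, `classify "QUJD) tail"` finishes with `"QUJD"` consumed and `") tail"` remaining, while `classify ")"` fails because `takeWhile1` requires at least one base64 character.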