Sanitizer exception for IMG SRC attribute not being applied #16020

mjfs · 2021-05-29T16:15:19Z

Gitea version (or commit ref): 1.13.7
Git version: 2.31.1
Operating system: Linux (Gitea installed from Arch repository)
Database (use [x]):
- PostgreSQL
- MySQL
- MSSQL
- SQLite
Can you reproduce the bug at https://try.gitea.io: Not Applicable (custom configuration)
Log gist: Not Applicable (not visible in log)

Description

When using an external markup renderer, the sanitizer exception is not applied. The attribute is consequently removed from the output.

I am using Pandoc to render Office Open XML documents (.docx extension). No matter which combination of sanitizer configuration and markup renderer I choose, the data URI value of the src attribute on img elements is always removed from Gitea's final HTML output for any docx file previewed in the browser (i.e. only <img/> remains).

As I understand the Gitea documentation (as well as the cheat sheet), the configuration below should work:

[markup.sanitizer.docx]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = ^.*$

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

I was not able to find any workaround for this scenario (one that achieves the desired end result) in the documentation, so if another solution is generally used as an alternative for this use case (e.g. externalizing document resources), that will also do.

matthewlootens · 2021-06-01T15:56:37Z

I'm having the same issue as described by @mjfs: I cannot get src attributes on img elements through the sanitizer. In my case, I'm rendering Jupyter Notebook files (.ipynb) with nbconvert. The src values are base64-encoded data URIs, so I also added the data URI scheme to the app.ini config:

[markup.sanitizer.rule1]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = 

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.jupyter]
ENABLED = true
FILE_EXTENSIONS = .ipynb
RENDER_COMMAND = "/home/user/.venv/bin/jupyter-nbconvert --stdout --to html --template basic "
IS_INPUT_FILE = true

Gitea version: 1.14.2

Eugene-1984 · 2021-06-06T11:28:47Z

The following bluemonday issue, microcosm-cc/bluemonday#51 (comment), suggests that the policy for allowing src must be implemented as something like

	p := bluemonday.NewPolicy()
	p.AllowImages()
	p.AllowDataURIImages()

rather than the straightforward

gitea/modules/markup/sanitizer.go

Line 114 in b3ef6a6

for _, rule := range setting.ExternalSanitizerRules {

Issue #3025 suggests that a valid configuration exists, and it includes a request for an example to be added to the docs. It would be great if the solution (now or after a bugfix) were added as an example to https://docs.gitea.io/en-us/external-renderers/#appini-file-configuration (it currently has only a TeX example).

KN4CK3R · 2021-06-06T12:31:18Z

This works for me:

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

It is not the src attribute that is blocked, but the data URL. With this configuration the images are present in the output, but they are not rendered for me in Firefox. The standalone pandoc output works, but not when embedded into Gitea. That may be a separate problem.

mjfs · 2021-06-06T15:00:50Z

@KN4CK3R: Your proposal does produce a non-empty IMG SRC attribute. Unfortunately, the data URI gets corrupted, probably in the sanitizing phase. The result is an invalid image, since the content cannot be Base64-decoded into a valid JPG (or whatever format was used as input). It appears that the payload is still treated as a regular URI during processing and is therefore normalized (e.g. multiple consecutive slashes are reduced to a single one).
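The suspected failure mode can be reproduced in isolation. The sketch below is not Gitea's sanitizer code; it only assumes, as guessed above, that some processing step treats the Base64 payload like a URL path and collapses repeated slashes. The helper names and the sample bytes are made up for illustration.

```python
import base64
import binascii
import re


def collapse_slashes(data_uri: str) -> str:
    """Mimic path-style normalization: runs of '/' become a single '/'."""
    return re.sub(r"/{2,}", "/", data_uri)


def payload_is_decodable(data_uri: str) -> bool:
    """Return True if the part after the first comma is valid Base64."""
    payload = data_uri.split(",", 1)[1]
    try:
        base64.b64decode(payload, validate=True)
        return True
    except binascii.Error:
        return False


# Hypothetical image bytes, chosen so that their Base64 form contains "//".
image_bytes = b"\xff\xff\xffJPG"
uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")

mangled = collapse_slashes(uri)

print(uri, payload_is_decodable(uri))
print(mangled, payload_is_decodable(mangled))
```

Because "/" is an ordinary character of the Base64 alphabet, any normalization that is harmless for URL paths silently changes the encoded bytes, which would match the symptom of an image that is present but cannot be decoded.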

The instructions below are not directly related to this issue, but they might help someone else who is trying to work out how to use Pandoc as a filter or who is testing the setup.

To avoid composing an entire HTML document when only the BODY is needed for the preview, you can define an empty template and reference it in the Gitea configuration as well. In addition, to avoid the warning, also set the title metadata:

pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html

The HTML file Blank.html in /usr/bin/ (use a more appropriate location) contains just the following content:

$body$

To test it outside Gitea on the command line, you can use the following (with Sample.docx and Sample.html being the input and output):

cat Sample.docx | pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html > Sample.html
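When comparing the standalone output with what Gitea serves, it can help to check the img elements mechanically. The following is a minimal sketch using only the Python standard library; the class and function names are hypothetical, and it assumes the images are embedded as ";base64," data URIs. It reports, for each img element, whether a src is present and whether its payload decodes.

```python
import base64
import binascii
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute (or None) of every img element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img/>.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def check_images(html_text: str) -> list:
    """Return (src, ok) pairs; ok is True only for decodable Base64 data URIs."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    results = []
    for src in collector.sources:
        ok = False
        if src and src.startswith("data:") and ";base64," in src:
            payload = src.split(";base64,", 1)[1]
            try:
                base64.b64decode(payload, validate=True)
                ok = True
            except binascii.Error:
                ok = False
        results.append((src, ok))
    return results


# Tiny stand-in for real output: one intact image and one stripped <img/>.
sample_html = '<p><img src="data:image/png;base64,aGVsbG8=" /><img/></p>'
for src, ok in check_images(sample_html):
    print(ok, src)
```

To run it against real files, read Sample.html (or a saved copy of the Gitea preview page) into a string and pass it to check_images.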

Instead of the above, one could also cut the redundant lines from the Pandoc output in a wrapper (which I used before). The alternative with an empty template was suggested by @jgm as a workaround in a somewhat related Pandoc issue (jgm/pandoc#7331).

KN4CK3R · 2021-06-08T15:19:40Z

fyi #16098 and #16110

The problem with some Jupyter files is invalid data URI images. If the input file contains base64 image data that is split across lines with newlines, the images will be dropped by the sanitizer, because a data URI should not contain control characters. You may need to convert the Jupyter input or output and strip those newlines.

Sample input with \n in the image data:

"outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEZCAYAAACervI0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGdBJREFUeJzt3Xu0lXWd+PH3B7xfR5vUUrTMrPw1SjlpiubJ8lJqeBlN\nx5Hy15AzK9JxrRydzGAcXWo3tVrmXUETUUsZNRNdejQUEkmDStL6KZQheUnxkqjw+f3xbOLiAfbB\ns/fz7P28X2vtxT777IfzYQPfz..."
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],

You could use a wrapper script which replaces the newlines before passing the file to nbconvert.
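A minimal sketch of the cleaning step such a wrapper could perform, assuming the notebook follows the usual layout of cells, outputs and a data dictionary keyed by MIME type. The function name and the list of MIME types are illustrative choices, not part of Gitea or nbconvert.

```python
import json

# MIME types whose notebook payload is Base64 text. Assumption: SVG is left
# alone because it is stored as XML text, where newlines are harmless.
BASE64_IMAGE_TYPES = {"image/png", "image/jpeg", "image/gif"}


def strip_image_newlines(notebook: dict) -> dict:
    """Remove line breaks from Base64 image outputs of a parsed .ipynb file."""
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            for mime in BASE64_IMAGE_TYPES.intersection(data):
                value = data[mime]
                # Notebook values may be one string or a list of strings.
                if isinstance(value, list):
                    value = "".join(value)
                data[mime] = value.replace("\r", "").replace("\n", "")
    return notebook


# Shortened, made-up notebook fragment with "\n" inside the image data.
raw = (
    '{"cells": [{"cell_type": "code", "outputs": [{"data": '
    '{"image/png": "iVBORw0KGgo\\nAAAANSUhEUg"}, '
    '"metadata": {}, "output_type": "display_data"}]}]}'
)
cleaned = strip_image_newlines(json.loads(raw))
print(json.dumps(cleaned))
```

A full wrapper would load the file that Gitea passes to RENDER_COMMAND with json.load, apply this function, and hand the cleaned JSON to nbconvert. How the cleaned notebook is fed to nbconvert (temporary file or standard input) is left open here.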

KN4CK3R · 2021-06-16T21:35:07Z

A wrapper is no longer needed once we upgrade bluemonday (see microcosm-cc/bluemonday#123).

noerw added the type/bug label May 30, 2021

KN4CK3R mentioned this issue Jun 7, 2021

Fix data URI scramble #16098

Merged

6543 closed this as completed in #16098 Jun 7, 2021

go-gitea locked and limited conversation to collaborators Oct 19, 2021
