Sanitizer exception for IMG SRC attribute not being applied #16020

Closed
1 of 4 tasks
mjfs opened this issue May 29, 2021 · 6 comments · Fixed by #16098
@mjfs

mjfs commented May 29, 2021

  • Gitea version (or commit ref): 1.13.7
  • Git version: 2.31.1
  • Operating system: Linux (Gitea installed from Arch repository)
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io: Not Applicable (custom configuration)
  • Log gist: Not Applicable (not visible in log)

Description

When using an external markup renderer, the sanitizer exception is not applied, so the attribute is removed from the output.

I am using Pandoc to render Office Open XML documents (.docx extension). No matter what combination of sanitizer configuration and markup renderer I choose, the data URI value of the src attribute on the img element is always removed from Gitea's final HTML output for any docx file previewed in the browser (i.e. only <img/> remains).

As I understand the Gitea documentation (as well as the cheat sheet), the configuration below should work:

[markup.sanitizer.docx]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = ^.*$

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

I was not able to find any workaround for this scenario (one that achieves the desired end result) in the documentation, so if another solution is generally used for this use case (e.g. externalizing document resources), that will also do.

@noerw noerw added the type/bug label May 30, 2021
@matthewlootens

I'm having the same issue as described by @mjfs: I can't get src attributes on img elements through the sanitizer. In my case, I'm rendering Jupyter Notebook files (.ipynb) with nbconvert. The src values are base64-encoded data URIs, so I also added the data URI scheme to the app.ini config:

[markup.sanitizer.rule1]
ELEMENT = img
ALLOW_ATTR = src
REGEXP = 

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.jupyter]
ENABLED = true
FILE_EXTENSIONS = .ipynb
RENDER_COMMAND = "/home/user/.venv/bin/jupyter-nbconvert --stdout --to html --template basic "
IS_INPUT_FILE = true
  • Gitea version: 1.14.2

@Eugene-1984

The following bluemonday issue, microcosm-cc/bluemonday#51 (comment), suggests that the implementation of the policy allowing src must be something like

	p := bluemonday.NewPolicy()
	p.AllowImages()
	p.AllowDataURIImages()

rather than the straightforward

for _, rule := range setting.ExternalSanitizerRules {

And this issue, #3025, suggests that a valid configuration exists, and it has a request for an example to be added to the docs. It would be great if the solution (now or after a bugfix) were added as an example to https://docs.gitea.io/en-us/external-renderers/#appini-file-configuration (it currently has only a TeX example).

@KN4CK3R
Member

KN4CK3R commented Jun 6, 2021

This works for me:

[markdown]
CUSTOM_URL_SCHEMES = data

[markup.docx]
ENABLED = true
FILE_EXTENSIONS = .docx
RENDER_COMMAND = "pandoc --from docx --to html --self-contained"
IS_INPUT_FILE = false

It is not the src attribute that is blocked but the data URL. Now the images are there, but they are not rendered for me in Firefox. The standalone pandoc output works, but not when embedded into Gitea. But that may be another problem.

@mjfs
Author

mjfs commented Jun 6, 2021

@KN4CK3R: Your proposal does actually produce a non-empty IMG SRC attribute. Unfortunately, the data URI gets corrupted, probably in the sanitizing phase. This results in an invalid image, since the content cannot be Base64-decoded into a valid JPG (or whatever format was used as input). It appears that the payload is still treated as a regular URI during processing and is therefore shortened (e.g. multiple slashes get reduced to a single one).
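
To illustrate why collapsing slashes would break the image, here is a minimal Python sketch. It assumes the corruption is exactly "runs of `/` become a single `/`", which is only the guess made above and not confirmed behavior of the sanitizer. The helper names `collapse_slashes` and `survives_slash_collapse` are hypothetical.

```python
import base64
import binascii
import re


def collapse_slashes(text: str) -> str:
    """Mimic the assumed corruption: runs of '/' shrink to a single '/'."""
    return re.sub(r"/{2,}", "/", text)


def survives_slash_collapse(data: bytes) -> bool:
    """Return True if the Base64 form of `data` still decodes to `data`
    after repeated slashes have been collapsed."""
    encoded = base64.b64encode(data).decode("ascii")
    collapsed = collapse_slashes(encoded)
    try:
        return base64.b64decode(collapsed, validate=True) == data
    except binascii.Error:
        # The collapsed payload is no longer well-formed Base64.
        return False
```

Since `/` is a regular character of the Base64 alphabet, any payload whose encoding happens to contain `//` would be damaged by such a normalization, while payloads without consecutive slashes would pass through unchanged.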

The instructions below are not directly related to the open issue, but might be helpful to someone else trying to determine how to use Pandoc as a filter or testing the setup.

To avoid composing an entire HTML document when we just need the BODY for the preview, you can define an empty template and reference it in the Gitea configuration as well. In addition, to avoid the warning, also set the title metadata:

pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html

The HTML file Blank.html at /usr/bin/ (use a more appropriate location) contains just the following content:

$body$

To test it outside Gitea on the command line, you can use the following (with Sample.docx and Sample.html being the input and output):

cat Sample.docx | pandoc --from docx --to html --metadata title=" " --self-contained --template /usr/bin/Blank.html > Sample.html

Instead of the above, one could also cut the redundant lines from the Pandoc output in a wrapper (which I used before). The alternative with an empty template was suggested by @jgm as a workaround in a somewhat related Pandoc issue (jgm/pandoc#7331).
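
For the wrapper alternative, a rough Python sketch of the "cut redundant lines" step could look like the following. This is not the wrapper mentioned above, only one possible shape for it: the function name `extract_body` is hypothetical, and it assumes the Pandoc output contains at most one `<body>` element.

```python
import re

# Capture everything between the opening and closing body tags.
_BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return only the contents of <body>, or the input unchanged
    when no body element is found."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html
```

A wrapper would run Pandoc, pass its standard output through `extract_body`, and print the result. The empty-template approach avoids this post-processing entirely.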

@KN4CK3R
Member

KN4CK3R commented Jun 8, 2021

fyi #16098 and #16110

The problem with some Jupyter files is the invalid data URI images. If the input file contains base64 images whose lines are separated by newlines, the images will be dropped by the sanitizer because a data URI should not contain control characters. You may need to convert the Jupyter input or output and strip those newlines.

Sample input with \n in the image data:

"outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEZCAYAAACervI0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGdBJREFUeJzt3Xu0lXWd+PH3B7xfR5vUUrTMrPw1SjlpiubJ8lJqeBlN\nx5Hy15AzK9JxrRydzGAcXWo3tVrmXUETUUsZNRNdejQUEkmDStL6KZQheUnxkqjw+f3xbOLiAfbB\ns/fz7P28X2vtxT777IfzYQPfz..."
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],

You could use a wrapper script which replaces the newlines before passing the file to nbconvert.
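
A minimal sketch of the cleaning step of such a wrapper, in Python, might look like this. It assumes the usual notebook JSON layout (`cells` → `outputs` → `data` keyed by MIME type, as in the sample above) and that image payloads may be stored either as one string or as a list of strings. The function names are hypothetical, and handing the cleaned notebook to nbconvert is left out.

```python
import json


def strip_image_newlines(notebook: dict) -> dict:
    """Remove whitespace (including newlines) from base64 image payloads."""
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            data = output.get("data", {})
            for mime, payload in list(data.items()):
                # SVG is text markup, not base64, so leave it untouched.
                if not mime.startswith("image/") or mime == "image/svg+xml":
                    continue
                if isinstance(payload, list):
                    payload = "".join(payload)
                data[mime] = "".join(payload.split())
    return notebook


def clean_notebook_text(raw: str) -> str:
    """Parse notebook JSON text, clean its image outputs, and re-serialize."""
    return json.dumps(strip_image_newlines(json.loads(raw)))
```

In a wrapper, this could be used as `sys.stdout.write(clean_notebook_text(sys.stdin.read()))`, with the result then passed on to nbconvert.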

@KN4CK3R
Member

KN4CK3R commented Jun 16, 2021

A wrapper is not needed anymore after we upgrade bluemonday (see microcosm-cc/bluemonday#123)

@go-gitea go-gitea locked and limited conversation to collaborators Oct 19, 2021