[v0.9.1] Formatting issues while rendering code #1337

Open
adhishthite opened this issue Jul 10, 2024 · 8 comments

@adhishthite
Contributor

image

@nsarrazin Whenever I ask chat-ui to explain / generate code, the < does not get rendered correctly. Can you please take a look?

@nsarrazin
Collaborator

If you still have access, could you send me the raw conversation that shows this behaviour?
image
there's a download button next to user messages in the UI

@evalstate
Contributor

OK. Think I can explain this one, and offer an improvement.

Code blocks in markdown can either be fenced (```html) or indented by 4 spaces.

The issue arises when the LLM responds with a code block that is both fenced AND indented.

In this case I think the correct behaviour is to show a code block, with the fences displayed as part of the code. VSCode and https://markdownlivepreview.com/ do this.
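As a rough illustration of the "fenced AND indented" case, the sketch below flags fence lines that start with four or more spaces. The helper name `findIndentedFences` is hypothetical and not part of Chat-UI or marked, and the check is only a heuristic: it looks at absolute indentation and ignores list nesting, where some indentation is legitimate.

```javascript
// Hypothetical helper (not from Chat-UI or marked): returns the 0-based
// line numbers of fence lines that are indented by four or more spaces.
// Heuristic only: it does not account for indentation required by list items.
function findIndentedFences(markdown) {
  const hits = [];
  markdown.split("\n").forEach((line, index) => {
    // Four or more leading spaces followed by a backtick or tilde fence.
    if (/^ {4,}(?:`{3,}|~{3,})/.test(line)) {
      hits.push(index);
    }
  });
  return hits;
}
```

A check like this could be used to log suspicious LLM output before it reaches the renderer, rather than to rewrite it.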

What is happening in Chat-UI seems to be:

  • The marked lexer does not pick this up as a code block, meaning that <CodeBlock> isn't used.
  • The marked renderer does, and emits <pre> and <code> tags, so the styling looks similar to a correctly rendered code block and the &lt; goes through as-is. Note that the Copy to Clipboard button is not present because it hasn't been rendered by CodeBlock.
  • The behaviour is incorrect, as in this case it should include the triple backticks as part of its display (although I'd expect that in >99% of cases the user would prefer standard CodeBlock behaviour and the LLM has made a mistake).
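One plausible way for a literal &lt; to end up on screen is double escaping: the text is escaped once before lexing and again by the renderer's own HTML escaping. The sketch below illustrates that mechanism with self-contained stand-ins. The functions `escapeLt`, `escapeHtml` and `decodeEntities` are hypothetical and are not copied from Chat-UI or marked, so treat this as an assumption about the cause rather than a confirmed trace.

```javascript
// Hypothetical stand-in for a pre-lexing step that escapes "<" (assumed behaviour).
function escapeLt(markdown) {
  return markdown.replaceAll("<", "&lt;");
}

// Minimal HTML escaper of the kind a markdown renderer applies to code text.
function escapeHtml(text) {
  return text
    .replaceAll("&", "&amp;")
    .replaceAll("<", "&lt;")
    .replaceAll(">", "&gt;");
}

// Decodes the entities used here, roughly as a browser would for display.
// "&amp;" is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
function decodeEntities(html) {
  return html
    .replaceAll("&lt;", "<")
    .replaceAll("&gt;", ">")
    .replaceAll("&amp;", "&");
}

const source = "<foo />";
const preEscaped = escapeLt(source);        // "&lt;foo />"
const rendered = escapeHtml(preEscaped);    // "&amp;lt;foo /&gt;"
const displayed = decodeEntities(rendered); // "&lt;foo />"
```

Under this assumption, skipping either of the two escaping steps for code text would make the displayed result match the source again.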

While looking at this, I bumped the marked library to 13.0.3 and then 14.0 to see whether this fix (markedjs/marked#3264) would make a difference. It doesn't. The upgrade changes the interface a little, but it is fairly easy to update.

In the meantime, adding this to ChatMessage turns the &lt; back into < and renders the fences, although it doesn't benefit from the syntax highlighting and Copy to Clipboard button that <CodeBlock> provides.

renderer.code = (code) => {
   return `<pre><code>${sanitizeMd(code.raw)}</code></pre>`;
};

@nsarrazin - let me know if you want this as a PR. I think it improves the behaviour but isn't a proper fix. In the meantime, I'm going to tweak my system prompt to advise the LLM against indenting code blocks. Ultimately, I think this might be a defect in the marked library.

:(
image

:) with additional render hook
image

@evalstate
Contributor

evalstate commented Aug 15, 2024

I'm leaving the previous comment in place, but after more testing I'm not 100% confident that indents are the only cause. In the output below, <CodeBlock> is used from list entry 10 onward...

        "updates": [
            {
                "type": "status",
                "status": "started"
            },
            {
                "type": "title",
                "title": "💻 HTML tags"
            },
            {
                "type": "finalAnswer",
                "text": "Certainly! Here are some common HTML tags with a brief description:\n\n1. **`<html>`**: The root element of an HTML page. It encloses all the content of the page.\n\n   ```html\n   <html>\n   </html>\n   ```\n\n2. **`<head>`**: Contains meta-information about the HTML document, like title and links to scripts and stylesheets.\n\n   ```html\n   <head>\n   </head>\n   ```\n\n3. **`<title>`**: Defines the title of the document, shown in the browser's title bar or tab.\n\n   ```html\n   <title>Page Title</title>\n   ```\n\n4. **`<body>`**: Contains the content of the document, such as text, images, and hyperlinks.\n\n   ```html\n   <body>\n   </body>\n   ```\n\n5. **`<h1>` to `<h6>`**: Header tags, `<h1>` is the highest level and `<h6>` is the lowest. They define headings.\n\n   ```html\n   <h1>Main Heading</h1>\n   ```\n\n6. **`<p>`**: Represents a paragraph of text.\n\n   ```html\n   <p>This is a paragraph.</p>\n   ```\n\n7. **`<a>`**: The anchor tag, used to create hyperlinks.\n\n   ```html\n   <a href=\"https://www.example.com\">Visit Example</a>\n   ```\n\n8. **`<img>`**: Embeds an image in the document. It should include the `src` (source) attribute and often an `alt` (alternative text) attribute.\n\n   ```html\n   <img src=\"image.jpg\" alt=\"Description of image\">\n   ```\n\n9. **`<ul>` / `<ol>`**: Unordered (bulleted) and ordered (numbered) list containers, respectively.\n\n   ```html\n   <ul>\n       <li>List item 1</li>\n       <li>List item 2</li>\n   </ul>\n   ```\n\n   ```html\n   <ol>\n       <li>First item</li>\n       <li>Second item</li>\n   </ol>\n   ```\n\n10. **`<li>`**: Represents a list item, used within `<ul>` or `<ol>`.\n\n   ```html\n   <li>A list item</li>\n   ```\n\n11. **`<div>`**: A generic container for content, often used for styling or layout purposes.\n\n   ```html\n   <div>This is a division.</div>\n   ```\n\n12. **`<span>`**: A generic inline container, typically used to apply styles or scripts.\n\n   ```html\n   <span style=\"color:blue\">This is a blue text.</span>\n   ```\n\n13. **`<input>`**: Represents an input field in a form, where data can be entered.\n\n   ```html\n   <input type=\"text\" name=\"username\">\n   ```\n\n14. **`<button>`**: Represents a clickable button.\n\n   ```html\n   <button>Click me</button>\n   ```\n\nRemember, these are just foundational tags, and HTML supports many more elements you can learn about as you build more complex pages.",
                "interrupted": false,
                "usage": {
                    "input_tokens": 88,
                    "output_tokens": 691
                }
            }
        ],

@evalstate
Contributor

evalstate commented Aug 16, 2024

Here is a snippet that shows the issue:

The handling of code blocks in lists changes; asking the LLM via Chat-UI to repeat all or part of the block verbatim shows the behaviour.

The GFM spec recommends using a blank HTML comment to disambiguate indented blocks: https://github.github.com/gfm/#example-288


## Inside a List

- This is a test (normal fences)

```html
<foo />
  • This is another test (indented block)

  • This is a further test (indents and fences)

    <foo />
       <bar />
    
  • Test complete

Outside a List

This is a test (normal fences)

<foo />

This is another test (indented block)

<foo />
    <bar />

This is another test (indents and fences)

```
<foo />
   <bar />
```

Test complete

@evalstate
Contributor

evalstate commented Aug 16, 2024

Final update on this for the moment: the issue also occurs when code blocks are children of lists. The list is a single top-level token, so marked.parse(token.raw) renders the child code block instead of it being caught by the token.type === "code" clause here:

{#each tokens as token}
  {#if token.type === "code"}
    <CodeBlock lang={token.lang} code={unsanitizeMd(token.text)} />
  {:else}
    {#await marked.parse(token.raw, options) then parsed}
      <!-- eslint-disable-next-line svelte/no-at-html-tags -->
      {@html DOMPurify.sanitize(parsed)}
    {/await}
  {/if}
{/each}

Can't see an obvious quick way to fix this.
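To make the nesting problem concrete, here is a minimal sketch of walking a token tree recursively so that code tokens inside lists are found as well. The token shape is an assumption loosely modelled on marked's lexer output (children under `tokens`, list entries under `items`), the tree is hand-built for illustration, and `collectCodeTokens` is a hypothetical helper. It is not a proposed patch for ChatMessage.

```javascript
// Assumed minimal token shape (loosely modelled on marked's lexer output):
//   { type, text?, lang?, tokens?: Token[], items?: Token[] }

// Collects every "code" token, including ones nested inside lists.
function collectCodeTokens(tokens, found = []) {
  for (const token of tokens) {
    if (token.type === "code") {
      found.push(token);
      continue;
    }
    // Lists keep their entries under `items`; other containers use `tokens`.
    if (Array.isArray(token.items)) collectCodeTokens(token.items, found);
    if (Array.isArray(token.tokens)) collectCodeTokens(token.tokens, found);
  }
  return found;
}

// Hand-built tree standing in for a lexed list item that contains a code block,
// followed by a top-level code block.
const exampleTokens = [
  { type: "paragraph", tokens: [{ type: "text", text: "Intro" }] },
  {
    type: "list",
    items: [
      {
        type: "list_item",
        tokens: [
          { type: "text", text: "Loop through each item" },
          { type: "code", lang: "cpp", text: "delete suggestions[i];" },
        ],
      },
    ],
  },
  { type: "code", lang: "html", text: "<foo />" },
];

const topLevelOnly = exampleTokens.filter((token) => token.type === "code");
const nested = collectCodeTokens(exampleTokens);
```

Here `topLevelOnly` contains just the html block, which mirrors what the `{#if token.type === "code"}` branch sees, while `nested` also contains the cpp block from inside the list. Rendering the nested ones through CodeBlock would still require splitting the surrounding list markup, which is the hard part.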

@rotemdan
Contributor

rotemdan commented Nov 12, 2024

Getting this issue with Qwen2.5-Coder-32B-Instruct:

Screenshot_1

The raw markdown looks like:

### Explanation of the Code

1. **Loop through each `char*` and delete it:**
   ```cpp
   for (size_t i = 0; i < count; i++) {
       delete suggestions[i];
       suggestions[i] = 0;
   }
   ```

It seems the code block produced by Qwen is indented, which is usually uncommon but appears to happen more often with this particular model.

@evalstate
Contributor

evalstate commented Nov 12, 2024 via email

@evalstate
Contributor

My last reply wasn't helpful. There are 2 separate issues:

  1. Code blocks that are children of lists don't get rendered via the CodeBlock component.
  2. Those code blocks render "<" symbols incorrectly.

I can produce a PR for the second issue (I fixed this in my fork but left it as it's not a "complete" fix).

Adding this to ChatMessage fixes the <'s.

renderer.code = (code) => {
   return `<pre><code>${sanitizeMd(code.raw)}</code></pre>`;
};
