Suggestion: Insert white space when stripping tags #33

Closed
crantok opened this issue Aug 23, 2016 · 5 comments
Comments

@crantok

crantok commented Aug 23, 2016

I'm using StrictPolicy() to strip tags from text before feeding it to MongoDB full-text search. The text content of adjacent elements may be visually separated by HTML rendering even though there is no whitespace between them in the source. Stripping the tags therefore merges words, potentially altering search results. Here's an example:

package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    userInput := "<p>Why oh why</p><p>she swallowed a fly</p>"
    searchableText := bluemonday.StrictPolicy().Sanitize(userInput)

    fmt.Println(searchableText) // Why oh whyshe swallowed a fly
}

I can easily solve this in my own code, e.g. by inserting a space before or after every block-level HTML element before stripping the tags.

I wondered whether this would be a generally useful feature. A general case might need configuration given that even adjacent inline elements can be visually separated through CSS.
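The workaround described above can be sketched with the standard library alone. This is a minimal illustration, not bluemonday code: the names blockTag, anyTag and searchableText are hypothetical, the list of block-level elements is deliberately incomplete, and the regex-based tag stripping is only a stand-in for a real sanitizer such as StrictPolicy().

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// blockTag matches opening or closing tags of a few common block-level
// elements. The list is illustrative, not exhaustive.
var blockTag = regexp.MustCompile(`(?i)</?(?:p|div|li|ul|ol|h[1-6]|br|tr|td|blockquote)\b[^>]*>`)

// anyTag is a naive stand-in for the sanitizer's tag stripping.
// It is NOT safe for untrusted HTML; use a real sanitizer for that.
var anyTag = regexp.MustCompile(`<[^>]*>`)

// searchableText appends a space to every block-level tag, strips all
// tags, and then collapses the resulting whitespace.
func searchableText(html string) string {
	spaced := blockTag.ReplaceAllString(html, "${0} ")
	stripped := anyTag.ReplaceAllString(spaced, "")
	return strings.Join(strings.Fields(stripped), " ")
}

func main() {
	userInput := "<p>Why oh why</p><p>she swallowed a fly</p>"
	fmt.Println(searchableText(userInput)) // Why oh why she swallowed a fly
}
```

In real use the space-insertion step would run first and the actual sanitizer would replace the anyTag step.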

@crantok
Author

crantok commented Aug 30, 2016

Just updated and tested in my own code. I like the way you reduced my suggestion to the simplest possible feature.

Thank you :)

@grafana-dee
Contributor

> I like the way you reduced my suggestion to the simplest possible feature.

I figure that:

  • too much whitespace left in is irrelevant (browsers collapse it to one space when rendering)
  • if one cared about internal space (string lengths) it would be trivial for you to run a regex to replace \s\s+ with a single space
  • if one cared about leading and trailing space it would be trivial for you to run strings.TrimSpace()
  • HTML minification processes are likely to nuke extraneous spaces anyway

Plus... I'm lazy :)
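The two clean-up steps from the list above, collapsing internal runs of whitespace and trimming the ends, might look like this in caller code. The names spaceRun and normalize are hypothetical, and the input string simply imitates sanitizer output that has extra spaces in it.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// spaceRun matches any run of two or more whitespace characters.
var spaceRun = regexp.MustCompile(`\s\s+`)

// normalize collapses each whitespace run to one space, then removes
// leading and trailing whitespace.
func normalize(s string) string {
	return strings.TrimSpace(spaceRun.ReplaceAllString(s, " "))
}

func main() {
	fmt.Printf("%q\n", normalize(" Why oh why  she swallowed a fly "))
	// "Why oh why she swallowed a fly"
}
```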

@crantok
Author

crantok commented Aug 30, 2016

Awesome :)

@alltom

alltom commented Apr 15, 2017

I wanted this for the same reason! Thanks!

For the purposes of indexing, it's a little unfortunate that AddSpaceWhenStrippingTag(true) also inserts spaces when it removes inline tags. So sanitizing "<div>Go with<em>out</em></div><div>me</div>" yields "Go with out me" instead of "Go without me".

Not a blocker for me, but thought I'd point it out. :)

@dmitshur
Contributor

dmitshur commented Apr 15, 2017

It's probably not possible to know what is an inline tag in a general case, unfortunately. Even <em> can be a block tag if CSS includes em { display: block; } or em { padding: 20px; }.
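Since only the caller knows how the page's CSS lays elements out, one option is to let the caller say which element names act as word separators. The sketch below is standard-library Go and not part of bluemonday; tagRe and stripWithSeparators are hypothetical names, and the regex-based stripping is a naive stand-in that should not be used to sanitize untrusted HTML.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe captures the element name of any opening or closing tag.
var tagRe = regexp.MustCompile(`</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// stripWithSeparators removes every tag. A tag becomes a space only
// when its element name is in separators; all other tags vanish.
func stripWithSeparators(html string, separators map[string]bool) string {
	out := tagRe.ReplaceAllStringFunc(html, func(tag string) string {
		name := strings.ToLower(tagRe.FindStringSubmatch(tag)[1])
		if separators[name] {
			return " "
		}
		return ""
	})
	// Collapse the whitespace left behind by the replacements.
	return strings.Join(strings.Fields(out), " ")
}

func main() {
	in := "<div>Go with<em>out</em></div><div>me</div>"

	// <em> treated as inline: the word stays whole.
	fmt.Println(stripWithSeparators(in, map[string]bool{"div": true})) // Go without me

	// <em> treated as a separator, e.g. when CSS sets em { display: block; }.
	fmt.Println(stripWithSeparators(in, map[string]bool{"div": true, "em": true})) // Go with out me
}
```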

nixypanda added a commit to trilogy-group/kayako-bluemonday that referenced this issue Dec 19, 2017
* 'master' of github.com:microcosm-cc/bluemonday: (21 commits)
  Resolves microcosm-cc#51 Adjusted to be safe go pre-go1.8
  Resolves microcosm-cc#51 by permitting spaces in URLs within HTML
  Travis tests go1.1 to go1.9 and tip
  Rename LICENCE.md to LICENSE.md
  Add Go1.9 to Travis CI
  Remove .gitignore.
  Testing on go1.8 and go1.9rc2 tip
  Do not vendor dependencies
  Fixed build conditional for < go1.8
  Fixes 42 by using conditional compilation of tests
  Add center and marquee to whitelist of elements allowed without attributes
  Issue 37 case tag erroneously was 'javascript' not 'script'
  Added tip back to travis with allow failures
  tip is weird sometimes. It's not me, it's you.
  fmt -s, Makefile cleanup
  Resolves microcosm-cc#35
  Updated to reflect recent changes
  Add Go1.7 to testing
  Resolves microcosm-cc#33
  Added Gufran to credits
  ...