getAggregateText function sticks together text between different nodes #29

ilya-khorev · 2015-03-24T15:09:25Z

Hi

Please, look at my test case
https://jsfiddle.net/weirrrdy/xLcfnrpk/

I'd like to match the exact word "Highlighted", that's why there is a special metacharacter "\b" wrapping the regex expression. Semantically it doesn't necessary to put '\b' in this case but I'm using it in my application and it's required for the test case.

The expected behaviour here would be selecting both 'highlighted' words, however only the second phrase has matched.

The problem is the getAggregateText function has combined text of first two paragraphs into the following text: "HighlightedPhrase". Therefore the current regexp expression has skipped the word "Highlighted".

I would really appreciate if you fix this.

padolsey · 2015-04-20T12:58:25Z

Sorry for the massive delay on this. Am investigating now.

ethantw · 2015-04-20T14:16:51Z

A paragraph can be styled presentationally as an inline element in CSS. Likewise, an emphasis text run can be styled as a block element too.

Two adjacent ps should be handled just like two adjacent ems by default. Maybe it's better to add an additional option (element/selector-wise) for such cases.

padolsey · 2015-04-20T14:26:36Z

This is a tough one...

I'm happy to make it configurable, but it seems like the default behaviour should be what people typically expect, and people probably don't typically expect <p> elements to be treated as inline. But I'm very wary of making this assumption because it could lead people to a lot of undesirable behaviour.

padolsey · 2015-04-20T14:28:26Z

So, yeh, defaulting to "everything is inline" makes sense. Just trying to figure out how to make this easily configurable and obvious.

padolsey · 2015-04-20T14:35:40Z

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

ethantw · 2015-04-20T16:32:43Z

The alternative is to assume everything is a block-level element by default, and then provide the opportunity to configure inline elements.

Do you mean with the boundary matcher \b only?

padolsey · 2015-04-20T16:47:14Z

Ok, so, this is what I've come up with: (Please let me know what you both think)

PROPOSAL

Everything stays the same as it currently is, by default. The findAndReplaceDOMText function will assume that it can bleed matches through ALL elements.

If someone wants to explicitly disallow this, e.g. in the case of block-level <p> elements, then they can set a custom boundary function that returns true in the case that either of its arguments are paragraphs (the boundary function will always be passed two elements so you can decide whether to create a boundary between them):

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: function(a, b) {
    return a.nodeName.toLowerCase() === 'p' || b.nodeName.toLowerCase() === 'p';
  },
  wrap: 'em'
});

You can also set boundary to true which will mean that NO matches bleed through elements, e.g. a match must exist in contiguous text nodes.

Additionally there will be an easy way to set ALL block-level borders as boundaries:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,
  wrap: 'em'
});

(I imagine this will be the most common usage)

ethantw · 2015-04-20T17:17:39Z

Would you care to use Element.prototype.matches method instead of the fairly long nodeName comparison?

https://github.com/ethantw/fibre.js/blob/6857deac21/dist/fibre.js#L18-28

padolsey · 2015-04-20T17:20:09Z

Oo, yep, didn't realize that had such good support. Still a shame about IE8.

ilya-khorev · 2015-04-20T20:00:22Z

Thanks for the answer.

The second option is the best one, at least to me. That's what I'm looking for, if I understood you correctly.
boundary: findAndReplaceDOMText.BLOCK_LEVEL_ELEMENTS,

their own matching contexts. E.g. useful with block-level elements like P and DIV. CC #29

padolsey · 2015-04-22T14:41:47Z

Some progress: d41ae22 introduces something similar to my proposal above. Instead of 'boundaries' I've chosen the term 'contexts' and specifically a new option: forceContext which accepts either a boolean or a function that returns a boolean. With this concept you can prescribe that any specific element creates a new matching context (so that no matches bleed through the element in either direction).

So, for the block-level use-case, you can do:

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.BLOCK_LEVEL_MATCH,
  wrap: 'em'
});

This means that any block-level elements will create new matching contexts. The findAndReplaceDOMText.BLOCK_LEVEL_MATCH function returns true if passed a traditionally block-level element.

You can set forceContext to your own function too, and even use Element.matches to figure out which elements to force a new context on.

If this looks good to you, I'll merge into master and push a new release. (Shall timeout in 48h and assume we're all happy to go ahead with it if I don't hear anything).

@weirdy: Here's your JSFiddle with the change added: https://jsfiddle.net/8dcqgcoa/1/

ilya-khorev · 2015-04-22T15:42:55Z

Awesome! Thank you, good work.

ethantw · 2015-04-22T15:43:29Z

Great! Thanks for the significant fix. I have two questions though.

Even without the \b matcher they'll still be considered as another context?
Also, I believe that br, details, dt, img, li, summary, tbody, td, th, thead, tr (, etc) are accidentally missed out?

The br and img (like hr) are necessary for cases like,

Highlight<img src="./path"><span>xxx</span>

https://developer.mozilla.org/en/docs/Web/HTML/Element

padolsey · 2015-04-22T15:50:14Z

Ah, good point; I'll review the list of block elements. Actually, I think what we want is not a list of block elements, per se, but instead a list of all elements that are not inline textual elements (including <img> and <br> and <hr> etc.). I'll work on this.

ethantw · 2015-04-22T15:56:22Z

Agree. 👍

padolsey · 2015-04-25T16:28:42Z

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

findAndReplaceDOMText(document.getElementById('test'), {
  find: RegExp('\\b' + 'highlighted' + '\\b', 'gi'),
  forceContext: findAndReplaceDOMText.NON_INLINE_PROSE,
  wrap: 'em'
});

This will force a context on all of the following elements:

NON_PROSE_ELEMENTS = {
    // Block Elements
    address:1, article:1, aside:1, blockquote:1, canvas:1, dd:1, div:1,
    dl:1, fieldset:1, figcaption:1, figure:1, footer:1, form:1, h1:1, h2:1, h3:1,
    h4:1, h5:1, h6:1, header:1, hgroup:1, hr:1, main:1, nav:1, noscript:1, ol:1,
    output:1, p:1, pre:1, section:1, ul:1,
    // Other misc. elements that are not part of continuous inline prose:
    br:1, li: 1, summary: 1, dt:1, details:1,
    // Media / Source elements:
    script:1, style:1, img:1, video:1, audio:1, canvas:1, svg:1, map:1, object:1,
    // Input elements
    input:1, textarea:1, select:1, option:1, optgroup: 1, button:1,
    // Table related elements:
    table:1, tbody:1, thead:1, th:1, tr:1, td:1, caption:1, col:1, tfoot:1, colgroup:1
};

It's probably missing a few (let me know if you notice any). Thanks!

PR happening in #30 (shall merge soon I hope). I also need to add a chunk to the readme to explain what on earth this is all about :P

ethantw · 2015-04-25T17:03:27Z

Until I find a better name for this it's gonna be called "NON_INLINE_PROSE":

Awesome, I think that's comprehensible enough. 👍

It's probably missing a few (let me know if you notice any).

I would like to suggest we add rp, rt and rtc, but exclude ruby and rb. The rt element is used to wrap the annotation for its corresponding inline text run, while rp is used to provide fallback. Their box display attribute would not be inline as well. For example,

He invented <ruby>WWW<rp>(<rt>world wide web<rp>)</ruby> that helps people connect to one another.

Well, maybe they belong to those who need ignoring instead?

As for ruby and rb, they are both inline and part of the main content. Would make more sense for them to remain outside of the list.

And what do you think about the elements designed for web component such as template?

padolsey · 2015-04-25T20:05:34Z

Cool, I've added rp, rt and rtc.

Right now there's no easy way to filter out non-prose elements completely. So, even though they can now have "forced contexts" via NON_INLINE_PROSE, things like <style> and <script> are, by default, still searched.

So... I think it'd be worth adding a "prose" preset, like:

findAndReplaceDOMText(..., {
  preset: 'prose'
});

And it could do the following:

Only searches textual elements (not STYLE, SCRIPT, OBJECT, etc.)
- I.e. they're completely avoided/skipped
Forces contexts using the NON_INLINE_PROSE set of elements

It would also avoid <template>.

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

ilya-khorev · 2015-04-26T09:28:18Z

Excellent!

I thought styles, scripts, etc are already skipped by your function. If not, it would be a good option.
It's worth making this as a default behavior.

ethantw · 2015-04-26T12:48:04Z

Seems like this is the most sensible/common usage of the lib, and it wouldn't be a breaking change since it's a brand new option. Thoughts?

Agree. 🌈

padolsey · 2015-04-28T11:07:16Z

Ok, all shipped (0.4.3). 🐑

padolsey self-assigned this Apr 20, 2015

padolsey added a commit that referenced this issue Apr 22, 2015

Implement forceContext option to force certain elements to have

d41ae22

their own matching contexts. E.g. useful with block-level elements like P and DIV. CC #29

padolsey mentioned this issue Apr 25, 2015

forceContext option #30

Merged

padolsey added a commit that referenced this issue Apr 27, 2015

Implement preset:prose as per discussion in #29

6a09a71

padolsey closed this as completed Apr 28, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

getAggregateText function sticks together text between different nodes #29

getAggregateText function sticks together text between different nodes #29

ilya-khorev commented Mar 24, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

padolsey commented Apr 20, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

ilya-khorev commented Apr 20, 2015

padolsey commented Apr 22, 2015

ilya-khorev commented Apr 22, 2015

ethantw commented Apr 22, 2015

padolsey commented Apr 22, 2015

ethantw commented Apr 22, 2015

padolsey commented Apr 25, 2015

ethantw commented Apr 25, 2015

padolsey commented Apr 25, 2015

ilya-khorev commented Apr 26, 2015

ethantw commented Apr 26, 2015

padolsey commented Apr 28, 2015

getAggregateText function sticks together text between different nodes #29

getAggregateText function sticks together text between different nodes #29

Comments

ilya-khorev commented Mar 24, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

padolsey commented Apr 20, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

ethantw commented Apr 20, 2015

padolsey commented Apr 20, 2015

ilya-khorev commented Apr 20, 2015

padolsey commented Apr 22, 2015

ilya-khorev commented Apr 22, 2015

ethantw commented Apr 22, 2015

padolsey commented Apr 22, 2015

ethantw commented Apr 22, 2015

padolsey commented Apr 25, 2015

ethantw commented Apr 25, 2015

padolsey commented Apr 25, 2015

ilya-khorev commented Apr 26, 2015

ethantw commented Apr 26, 2015

padolsey commented Apr 28, 2015