-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading encoded emojis in HTML attributes #2074
Comments
The provided Parser.unescapeEntities util has similar issues
|
Thanks for the clear report, fixed! This was failing to decode and returning the replacement character due to an explicit range check it was hitting: (>= 0xD800 && <= 0xDFFF). That check has been there since 8749726 (12 years ago!) when I reimplemented to follow the HTML5 spec. But I can't see why that range was explicitly excluded - perhaps it was previously reserved in Unicode and then opened later. The single-point version ( Have fixed by removing that range as it is now valid and not excluded in the current spec. I note that Chrome seemingly treats this as we previously did; |
Fixing commit is: 954c46a |
This MR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [flow-bin](https://github.com/flowtype/flow-bin) ([changelog](https://github.com/facebook/flow/blob/master/Changelog.md)) | devDependencies | minor | [`^0.224.0` -> `^0.225.0`](https://renovatebot.com/diffs/npm/flow-bin/0.224.0/0.225.1) | | [org.liquibase.ext:liquibase-hibernate5](https://github.com/liquibase/liquibase-hibernate/wiki) ([source](https://github.com/liquibase/liquibase-hibernate)) | build | patch | `4.25.0` -> `4.25.1` | | [org.liquibase:liquibase-maven-plugin](http://www.liquibase.org/liquibase-maven-plugin) ([source](https://github.com/liquibase/liquibase)) | build | patch | `4.25.0` -> `4.25.1` | | [org.jsoup:jsoup](https://jsoup.org/) ([source](https://github.com/jhy/jsoup)) | compile | patch | `1.17.1` -> `1.17.2` | | [io.hypersistence:hypersistence-utils-hibernate-62](https://github.com/vladmihalcea/hypersistence-utils) | compile | minor | `3.6.1` -> `3.7.0` | | [io.quarkus:quarkus-maven-plugin](https://github.com/quarkusio/quarkus) | build | patch | `3.6.3` -> `3.6.4` | | [io.quarkus:quarkus-universe-bom](https://github.com/quarkusio/quarkus-platform) | import | patch | `3.6.3` -> `3.6.4` | | [org.apache.maven.plugins:maven-compiler-plugin](https://maven.apache.org/plugins/) | build | minor | `3.11.0` -> `3.12.1` | --- ### Release Notes <details> <summary>flowtype/flow-bin</summary> ### [`v0.225.1`](flow/flow-bin@62a43fb...23ec616) [Compare Source](flow/flow-bin@62a43fb...23ec616) ### [`v0.225.0`](flow/flow-bin@e6104a1...62a43fb) [Compare Source](flow/flow-bin@e6104a1...62a43fb) </details> <details> <summary>liquibase/liquibase</summary> ### [`v4.25.1`](https://github.com/liquibase/liquibase/blob/HEAD/changelog.txt#Liquibase-4251-is-a-patch-release) [Compare Source](liquibase/liquibase@v4.25.0...v4.25.1) </details> <details> <summary>jhy/jsoup</summary> ### [`v1.17.2`](https://github.com/jhy/jsoup/blob/HEAD/CHANGES.md#​1172-2023-Dec-29) ##### Improvements - **Attribute object accessors**: Added `Element.attribute(String)` and `Attributes.attribute(String)` to more simply obtain an `Attribute` object. [2069](jhy/jsoup#2069) - **Attribute source tracking**: If source tracking is on, and an Attribute's key is changed ( via `Attribute.setKey(String)`), the source range is now still tracked in `Attribute.sourceRange()`. [2070](jhy/jsoup#2070) - **Wildcard attribute selector**: Added support for the `[*]` element with any attribute selector. And also restored support for selecting by an empty attribute name prefix (`[^]`). [2079](jhy/jsoup#2079) ##### Bug Fixes - **Mixed-cased source position**: When tracking the source position of attributes, if the source attribute name was mix-cased but the parser was lower-case normalizing attribute names, the source position for that attribute was not tracked correctly. [2067](jhy/jsoup#2067) - **Source position NPE**: When tracking the source position of a body fragment parse, a null pointer exception was thrown. [2068](jhy/jsoup#2068) - **Multi-point emoji entity**: A multi-point encoded emoji entity may be incorrectly decoded to the replacement character. [2074](jhy/jsoup#2074) - **Selector sub-expressions**: (Regression) in a selector like `parent [attr=va], other`, the `, OR` was binding to `[attr=va]` instead of `parent [attr=va]`, causing incorrect selections. The fix includes a EvaluatorDebug class that generates a sexpr to represent the query, allowing simpler and more thorough query parse tests. [2073](jhy/jsoup#2073) - **XML CData output**: When generating XML-syntax output from parsed HTML, script nodes containing (pseudo) CData sections would have an extraneous CData section added, causing script execution errors. Now, the data content is emitted in a HTML/XML/XHTML polyglot format, if the data is not already within a CData section. [2078](jhy/jsoup#2078) - **Thread safety**: The `:has` evaluator held a non-thread-safe Iterator, and so if an Evaluator object was shared across multiple concurrent threads, a NoSuchElement exception may be thrown, and the selected results may be incorrect. Now, the iterator object is a thread-local. [2088](jhy/jsoup#2088) *** Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in [change-archive.txt](./change-archive.txt). </details> <details> <summary>vladmihalcea/hypersistence-utils</summary> ### [`v3.7.0`](https://github.com/vladmihalcea/hypersistence-utils/blob/HEAD/changelog.txt#Version-370---December-18-2023) \================================================================================ Oracle Interval Type does not support negative intervals [#​682](vladmihalcea/hypersistence-utils#682) Return original object if target and original are the same when merging [#​677](vladmihalcea/hypersistence-utils#677) Add a hypersistence-utils-hibernate-63 module for Hibernate 6.3 [#​657](vladmihalcea/hypersistence-utils#657) </details> <details> <summary>quarkusio/quarkus</summary> ### [`v3.6.4`](https://github.com/quarkusio/quarkus/releases/tag/3.6.4) [Compare Source](quarkusio/quarkus@3.6.3...3.6.4) ##### Complete changelog - [#​37808](quarkusio/quarkus#37808) - CLI - Rework how missing commands are detected - [#​37803](quarkusio/quarkus#37803) - Dev mode: add null checks to TimestampSet.isRestartNeeded() - [#​37798](quarkusio/quarkus#37798) - Only update ~/.docker/config.json if it exists - [#​37787](quarkusio/quarkus#37787) - Take priority into account in ConfigurationImpl - [#​37775](quarkusio/quarkus#37775) - Docs: fix typo in rabbitmq reference documentation - [#​37770](quarkusio/quarkus#37770) - Add SequencedCollection to BANNED_INTERFACE_TYPES - [#​37768](quarkusio/quarkus#37768) - Running application build with JDK21 and target Java 17 crash with NoClassDefFoundError: java/util/SequencedCollection - [#​37731](quarkusio/quarkus#37731) - Query logging is being done in io.quarkus.mongodb.panache.common.runtime.MongoOperations - [#​37723](quarkusio/quarkus#37723) - Do not use CSRF cookie as the next token value - [#​37717](quarkusio/quarkus#37717) - Docs: Fix incorrect link reference in Cross-Site Request Forgery Prevention guide - [#​37714](quarkusio/quarkus#37714) - Remove the driver property in the documentation for Cloud SQL - [#​37710](quarkusio/quarkus#37710) - Use NoStackTraceException in metrics - [#​37677](quarkusio/quarkus#37677) - Bump io.quarkus:quarkus-platform-bom-maven-plugin from 0.0.100 to 0.0.101 - [#​37654](quarkusio/quarkus#37654) - Make sure dev mode is properly written in doc - [#​36848](quarkusio/quarkus#36848) - CSRF Token with HMAC signature gets double signed </details> <details> <summary>quarkusio/quarkus-platform</summary> ### [`v3.6.4`](quarkusio/quarkus-platform@3.6.3...3.6.4) [Compare Source](quarkusio/quarkus-platform@3.6.3...3.6.4) </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever MR is behind base branch, or you tick the rebase/retry checkbox. 👻 **Immortal**: This MR will be recreated if closed unmerged. Get [config help](https://github.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box --- This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC4yNC4wIiwidXBkYXRlZEluVmVyIjoiMzQuMjQuMCJ9-->
I'm using jsoup to parse images that link to emoji images. These images have the actual emoji characters encoded in the alt attributes:
When I parse the attribute using jsoup, the resulting string is garbage:
Is there a configuration setting that I'm missing which could help? Failing that, is there a way that I can get the raw value of the attribute so I can parse it myself? For instance, Apache commons text has a utility which can decode the value:
The text was updated successfully, but these errors were encountered: