- Raise compile target to Java 11.
- Prepare jwarcex for public release.
- Refactor text extraction interface.
- Raised log4j version to 2.16.0.
- Streaming mode works correctly again. (Most notable in combination with piping on the command line.)
- Jwarcex now requires Java 8.
- Improved whitespace handling so that the same text is extracted from equivalent HTML, regardless of the HTML code's indentation. As a byproduct, this reduces the number of spaces caused by table elements.
- The threads now use a better waiting mechanism (wait instead of sleep). This makes jwarcex slightly faster while using less CPU.
- The text extraction implementation was changed to work iteratively instead of recursively.
- Prevent PeekingMetaTagEncodingDetector from returning unavailable charsets.
- Title tag is now handled correctly (and not extracted by default).
- Updated jsoup version to 1.11.3.
- Separate consecutive links by a space.
- The detected encoding will be written to the source file headers again.
- The encoding error detection mechanism now correctly checks against the UTF-8 replacement character (instead of the UTF-16 one).
- The description of command line parameter -e was corrected.
- Fixed HTML in the Javadoc of TextExtractorImpl.
- Added auto detection of encoding errors with a configurable threshold (-e or --max_encoding_errors). If a document contains more than 3 replacement characters, it will be dropped.
- Added parameter for minimum document length (-n or --document_length). See the program help (-h) for more details.
- Ignore entries ending with robots.txt.
- Make the minimum line length parameter accessible via the command line (-m or --line_length).
- Improved handling of tables, which results in more readable output of table cell text.
- Fixed the usage of the maven-shade-plugin. The resulting jwarcex-standalone*.jar can again be used with the java -jar jwarcex-standalone*.jar syntax.
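
The encoding error threshold mentioned above (-e or --max_encoding_errors) can be illustrated with a minimal sketch. This is not the jwarcex implementation: the class name EncodingErrorCheck and both method names are hypothetical, and the sketch only assumes that an encoding error shows up as the Unicode replacement character U+FFFD in the decoded text and that a document is dropped once the count exceeds the threshold.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the jwarcex API.
final class EncodingErrorCheck {

    // U+FFFD is the character Java's UTF-8 decoder substitutes for malformed input.
    static final char REPLACEMENT_CHAR = '\uFFFD';

    // Counts how many replacement characters the decoded text contains.
    static int countReplacementChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT_CHAR) {
                count++;
            }
        }
        return count;
    }

    // A document is dropped when it has MORE than maxEncodingErrors replacement characters.
    static boolean shouldDrop(String text, int maxEncodingErrors) {
        return countReplacementChars(text) > maxEncodingErrors;
    }

    public static void main(String[] args) {
        // Two bytes that are never valid in UTF-8, surrounded by plain ASCII.
        byte[] broken = {'a', (byte) 0xFF, 'b', (byte) 0xFE, 'c'};
        String decoded = new String(broken, StandardCharsets.UTF_8);

        System.out.println("replacement chars: " + countReplacementChars(decoded));
        System.out.println("drop with threshold 3: " + shouldDrop(decoded, 3));
    }
}
```

With a threshold of 3, a document containing exactly three replacement characters would be kept and one containing four would be dropped, matching the "more than 3" wording above.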