Skip to content

Latest commit

 

History

History
69 lines (41 loc) · 2.17 KB

CHANGELOG.md

File metadata and controls

69 lines (41 loc) · 2.17 KB

Changelog

[next] - unreleased

  • Raise compile target to Java 11.
  • Prepare jwarcex for public release.

[3.0.1] - 16.12.2021

  • Refactor text extraction interface.
  • Raised log4j version to 2.16.0.

[2.0.0] - 16.07.2019

Fixed

  • Streaming mode works correctly again. (Most notable in combination with piping on the command line.)

Changed

  • Jwarcex now requires Java 8.
  • Improved whitespace handling so that the same text is extracted from equivalent html, regardless of the html code's indentation. As a byproduct this reduces the amount of spaces caused by table elements.
  • The different threads use a better waiting mechanism. This makes it slightly faster while using less CPU. (Use wait instead of sleep.)
  • The text extraction implementation was changed to work iteratively instead of recursively.

[1.3.1] - 02.04.2019

Fixed

  • Prevent PeekingMetaTagEncodingDetector from returning unavailable charsets.
  • Title tag is now handled correctly (and not extracted by default).

Changed

  • Updated jsoup version to 1.11.3.

[1.3.0] - 26.11.2018

Added

  • Separate consecutive links by a space.

[1.2.0] - 25.10.2018

Added

  • The detected encoding will be written to the source file headers again.

Fixed

  • The encoding error detection mechanism now correctly checks against the utf-8 character (instead of utf-16).
  • The description of command line parameter -e was corrected.
  • Fixed html in javadocs of TextExtractorImpl.

[1.1.0] - 23.10.2017

Added

  • Added auto detection of encoding errors with a configurable threshold (-e or --max_encoding_errors. If a document contains more than 3 replacement chars, it will be dropped.
  • Added parameter for minimum document length (-n or --document_length). See the program help (-h) for more details.
  • Ignore entries ending with robots.txt.

Changed

  • Make the minimum line length parameter accessible via the command line (-m or --line_length).
  • Improved handling of tables, which results in more readable output of table cell text.

Fixed

  • Fixed the usage of the maven-shade-plugin. The resulting jwarcex-standalone*.jar can now be again used with the java -jar jwarcex-standalone*.jar syntax.