Releases: aphp/edspdf
v0.9.3
Changelog
- Support pydantic v2
Pull Requests
- Support pydantic v2 by @percevalw in #31
Full Changelog: v0.9.2...v0.9.3
v0.9.2
Changelog
Changed
- Default to fp16 when inferring with gpu
- Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
- We now check that the user isn't trying to write a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None) and raise an error if they do
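The split-write check described in the last item can be pictured with a minimal sketch. This is not edspdf's actual code: the helper name `check_output_path` and the rule "a path with a file extension is a single file" are assumptions made for illustration. Only the `write_in_worker` and `num_rows_per_file` names come from the changelog.

```python
from pathlib import Path
from typing import Optional


def check_output_path(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Raise if a single-file target is combined with a split write.

    Hypothetical helper: treating "has a file extension" as "is a single
    file" is an assumption of this sketch, not the library's rule.
    """
    # A write is "split" when each worker writes its own part, or when
    # the output is chunked into files of a fixed number of rows.
    split_write = write_in_worker or num_rows_per_file is not None
    looks_like_single_file = Path(path).suffix != ""
    if split_write and looks_like_single_file:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion; "
            "pass a directory instead."
        )
```

Failing early like this avoids several workers silently overwriting the same output file.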
Fixed
- Batches full of empty content boxes no longer crash the huggingface-embedding component
- Ensure models are always loaded in non training mode
- Improved performance of edspdf.data methods over a filesystem (fs parameter)
Pull Requests
- Fix empty batches & update data API by @percevalw in #28
- chore: bump version to 0.9.2 by @percevalw in #30
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Changelog
Fixed
- It is now possible to recursively retrieve pdf files in a directory using edspdf.data.read_files
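For readers unfamiliar with recursive file discovery, the idea can be sketched with the standard library alone. The `find_pdfs` helper and its `recursive` flag are hypothetical names used for illustration. They do not describe the signature of `edspdf.data.read_files`.

```python
from pathlib import Path
from typing import List


def find_pdfs(root: str, recursive: bool = True) -> List[Path]:
    """Return the PDF files under `root`, sorted for reproducibility."""
    root_path = Path(root)
    pattern = "*.pdf"
    # rglob descends into sub-directories; glob only looks at the top level.
    matches = root_path.rglob(pattern) if recursive else root_path.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```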
What's Changed
- fix: allow recursive pdf file searching by @percevalw and @acalliger in #26
Full Changelog: v0.9.0...v0.9.1
v0.9.0
What's Changed ?
Added
- New unified edspdf.data API (pdf files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the edsnlp.data API.
- New unified processing API to select the execution backend via data.set_processing(...) to replace the old accelerators API (which is now deprecated, but still available).
- huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
- It is now possible to convert a label to multiple labels in the simple-aggregator component:
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
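To make the effect of such a one-to-many mapping concrete, here is a small standalone sketch. It is not the simple-aggregator implementation: the `aggregate_lines` function, the `(label, text)` input format and the newline joining are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

LabelMap = Dict[str, Union[str, List[str]]]


def aggregate_lines(
    lines: List[Tuple[str, str]],  # (label, text) pairs, in reading order
    label_map: LabelMap,
) -> Dict[str, str]:
    """Join the text of the lines feeding each output field."""
    # Invert the map: source label -> every output field it contributes to.
    targets = defaultdict(list)
    for out_field, sources in label_map.items():
        for source in [sources] if isinstance(sources, str) else sources:
            targets[source].append(out_field)

    collected: Dict[str, List[str]] = {out_field: [] for out_field in label_map}
    for label, text in lines:
        # Lines whose label is not mapped anywhere are simply skipped.
        for out_field in targets.get(label, []):
            collected[out_field].append(text)
    return {out_field: "\n".join(texts) for out_field, texts in collected.items()}


label_map = {
    "text": ["title", "body", "table"],
    "title": "title",
}
lines = [("title", "Report"), ("body", "First line."), ("footer", "Page 1")]
fields = aggregate_lines(lines, label_map)
```

In this sketch the "title" line contributes to both the "text" and "title" fields, while the unmapped "footer" line is dropped.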
Fixed
- huggingface-embedding now resizes bbox features for large PDFs, instead of making the model crash
- huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly
Pull Requests
- API update (data & processing) by @percevalw in #25
Full Changelog: v0.8.1...v0.9.0
v0.8.1
Changelog
Fixed
- Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
- Removed multiprocessing accelerator debug output
- Fixed absolute links in github-pages docs (e.g. image assets)
Changed
- Added auto-links to components in the docs (by comparing span contents with entry points)
Pull Requests
- v0.8.1 by @percevalw in #23
Full Changelog: v0.8.0...v0.8.1
v0.8.0
What's changed
Added
- Add multi-modal transformers (huggingface-embedding) with windowing options
- Add render_page option to pdfminer extractor, for multi-modal PDF features
- Add inference utilities (accelerators), with simple mono process support and multi gpu / cpu support
- Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline
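The "windowing options" mentioned for the transformer component refer to the general technique of cutting a long sequence into overlapping chunks that fit the model's maximum input length. The sketch below illustrates that technique only. The `sliding_windows` function and its `window` / `stride` parameter names are assumptions, not a description of how huggingface-embedding is implemented.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(items: Sequence[T], window: int, stride: int) -> List[Sequence[T]]:
    """Split `items` into overlapping chunks of at most `window` elements.

    Consecutive chunks start `stride` elements apart, so they overlap by
    `window - stride` elements when `stride < window`.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(items) <= window:
        # Short (or empty) inputs need no splitting.
        return [items] if len(items) else []
    chunks = []
    start = 0
    while True:
        chunks.append(items[start : start + window])
        if start + window >= len(items):
            break  # this chunk reached the end of the sequence
        start += stride
    return chunks
```

A stride smaller than the window gives each element some context on both sides in at least one chunk, at the cost of processing overlapping elements more than once.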
Changed
- Updated API to follow EDS-NLP's refactoring
- Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
- Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
- Better test coverage
- Use hatch instead of setuptools to build the package / docs and run the tests
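Since pipeline.score was removed, scoring becomes a plain function applied to the pipeline's predictions. A custom scorer can be as small as the sketch below. The `label_accuracy` name and the choice of comparing flat label lists are assumptions for illustration. How gold and predicted labels are extracted from documents depends on the pipeline and is left out here.

```python
from typing import Dict, Sequence


def label_accuracy(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Compare aligned gold and predicted labels, one entry per box."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        # Nothing to score: report zero support instead of dividing by zero.
        return {"accuracy": 0.0, "support": 0.0}
    correct = sum(g == p for g, p in zip(gold, pred))
    return {"accuracy": correct / len(gold), "support": float(len(gold))}
```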
Fixed
- Fixed attrs dependency only being installed in dev mode
Pull Requests
- Huggingface multi-modal transformers by @percevalw in #15
- Dev install documentation and dependencies fix by @ian-fox in #16
- Huggingface by @percevalw in #17
- Accelerators by @percevalw in #19
- Scoring by @percevalw in #20
- Packaging utils by @percevalw in #18
- chore: bump version to 0.8.0 by @percevalw in #21
- feat: switch to hatch package manager by @percevalw in #22
Full Changelog: v0.7.0...v0.8.0
v0.7.0
What's changed
This public release comes with a major overhaul of the library since v0.5.3
Core features
- new pipeline system whose API is inspired by spaCy
- first-class support for pytorch
- hybrid model inference and training (rules + deep learning)
- moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
- new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...
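To give a feel for the move from DataFrame rows to typed objects, here is a rough sketch of a document / page / box hierarchy. It uses the standard library's `dataclasses` as a stand-in for attrs, and every field name and the `boxes_with_label` helper are assumptions made for illustration. The real PDFDoc, Page and Box classes may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Box:
    # Field names below are illustrative, not the library's definitions.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""
    label: Optional[str] = None


@dataclass
class Page:
    page_num: int
    width: float
    height: float
    boxes: List[Box] = field(default_factory=list)


@dataclass
class PDFDoc:
    id: str
    pages: List[Page] = field(default_factory=list)

    def boxes_with_label(self, label: str) -> List[Box]:
        """Collect the boxes carrying `label`, in page order."""
        return [b for p in self.pages for b in p.boxes if b.label == label]
```

Compared with a flat table, nested objects like these make the page / box relationship explicit and let type checkers catch misspelled attribute names.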
Functional features
- new extractors: pymupdf and poppler (separate packages for licensing reasons)
- many deep learning layers (box-transformer, 2d attention with relative position information, ...)
- trainable deep learning classifier
- training recipes for deep learning models
Full Changelog: v0.5.3...v0.7.0
v0.5.3
What's Changed
Added
- Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
- Improved line aggregation formula
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
- ci: remove unnecessary poppler dependency by @bdura in #7
- Fix aggregation for empty documents by @percevalw in #8
Full Changelog: v0.5.1...v0.5.2