Dossier is a Go library for extracting textual information from documents.
PDF is currently the only supported format; it is parsed using MuPDF. Support for other formats can be added with custom parsers or by amending the library.
Sketches provide a declarative approach to locating information as an alternative to imperative/procedural access.
A sketch is defined using protocol buffers, and the sketch protobuf definition documents the available configuration options. Sketches are usually written in the textproto format.
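To illustrate the declarative idea in general terms, here is a minimal, self-contained Go sketch. It does not use Dossier's API or its sketch schema: the `Rule` type, the `Extract` function, and the sample text are all hypothetical names invented for this example. The point is only that the caller describes what to find (a name and a pattern) and a generic routine does the locating.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule is a hypothetical declarative description of one value to locate.
// It is NOT Dossier's sketch schema; it only illustrates the concept.
type Rule struct {
	Name    string // label for the extracted value
	Pattern string // regular expression; the first capture group is the value
}

// Extract applies each rule to text and returns the first captured value
// per rule name. Rules without a match are left out of the result.
func Extract(rules []Rule, text string) (map[string]string, error) {
	out := make(map[string]string)
	for _, r := range rules {
		re, err := regexp.Compile(r.Pattern)
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", r.Name, err)
		}
		if m := re.FindStringSubmatch(text); len(m) > 1 {
			out[r.Name] = m[1]
		}
	}
	return out, nil
}

func main() {
	// Sample text standing in for content extracted from a document.
	text := "Invoice no: INV-1234\nTotal due: 99.50 EUR\n"

	rules := []Rule{
		{Name: "invoice_number", Pattern: `Invoice no:\s*(\S+)`},
		{Name: "total", Pattern: `Total due:\s*([0-9.]+)`},
	}

	values, err := Extract(rules, text)
	if err != nil {
		panic(err)
	}
	// Iterate over the rules rather than the map to keep a stable order.
	for _, r := range rules {
		fmt.Printf("%s = %s\n", r.Name, values[r.Name])
	}
}
```

The imperative alternative would be hand-written code that walks the text and picks out each value step by step. With rules expressed as data, the lookup logic stays in one place and the rules can live in a configuration file.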
A web-based viewer is included in the command line utility. Screenshot of the viewer with an example sketch for invoices:
Invocation:
$ dossiercli web ./invoice.pdf ./sketch.textproto
2023/12/31 00:00:00 HTTP server listening on http://[::1]:8080
To add the library to a Go module:
go get github.com/hansmi/dossier
Command line utility:
go install github.com/hansmi/dossier/cmd/dossiercli@latest