paperless-titles-from-ai

This project generates meaningful titles for ingested paperless-ngx documents using AI. It sends the document's OCR text to the OpenAI API (or any OpenAI-compatible endpoint, such as a locally hosted LLM), generates a title, and saves the title back to the document's metadata.
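At a high level, the post-consume flow looks like the sketch below. This is a simplified illustration rather than the actual main.py: it assumes paperless-ngx passes the new document's ID to the script via the DOCUMENT_ID environment variable, and the PAPERLESS_URL, PAPERLESS_KEY, and OPENAI_KEY names are placeholders for whatever the real script reads from its .env file.

#!/usr/bin/env python3
# Simplified sketch of the post-consume flow; illustrative only, not the real main.py.
import os
import requests
from openai import OpenAI

# paperless-ngx exposes the consumed document's ID to post-consume scripts.
doc_id = os.environ["DOCUMENT_ID"]
paperless_url = os.environ["PAPERLESS_URL"].rstrip("/")              # hypothetical variable name
headers = {"Authorization": f"Token {os.environ['PAPERLESS_KEY']}"}  # hypothetical variable name

# Fetch the OCR text ("content" field) of the freshly consumed document.
doc = requests.get(f"{paperless_url}/api/documents/{doc_id}/", headers=headers).json()

# Ask the model for a short, filename-friendly title based on the OCR text.
client = OpenAI(api_key=os.environ["OPENAI_KEY"])                    # hypothetical variable name
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "Generate a short, descriptive, filename-friendly title for this document."},
        {"role": "user", "content": doc["content"][:8000]},
    ],
)
title = response.choices[0].message.content.strip()

# Save the generated title back to the document's metadata.
requests.patch(f"{paperless_url}/api/documents/{doc_id}/", headers=headers, json={"title": title})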

Examples of Generated Titles

  • taxreturntranscript2022
  • aigclaims_check_20230313
  • prosharesultrabloombergtax2023

Setup

Clone Repository

git clone https://github.com/sjafferali/paperless-titles-from-ai.git

Create .env file

cp -av paperless-titles-from-ai/.env.example paperless-titles-from-ai/.env
# Update .env file with the correct values

Update docker-compose.yml file

Update docker compose file with the correct path to the project directory.

services:
  # ...
  paperless-webserver:
    # ...
    volumes:
      - /path/to/paperless-titles-from-ai:/usr/src/paperless/scripts
      - /path/to/paperless-titles-from-ai/init:/custom-cont-init.d:ro
    environment:
      # ...
      PAPERLESS_POST_CONSUME_SCRIPT: /usr/src/paperless/scripts/app/main.py

The init folder (used to ensure the openai package is installed) must be owned by root.
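For example, assuming the checkout lives at the path used in the compose file above:

sudo chown root:root /path/to/paperless-titles-from-ai/init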

Back-filling Titles on Existing Documents

To back-fill titles on existing documents, run the helper CLI from the project directory:

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] [single|all]

Arguments

| Option | Required | Default | Description |
|---|---|---|---|
| --paperlessurl [URL] | Yes | https://paperless.local:8080 | Sets the URL of the paperless API endpoint. |
| --paperlesskey [KEY] | Yes | | Sets the API key to use when authenticating to paperless. |
| --openaimodel [MODEL] | No | gpt-4-turbo | Sets the OpenAI model used to generate the title. The full list of supported models is available in the OpenAI models documentation. |
| --openaibaseurl [API Endpoint] | No | | Sets the OpenAI-compatible endpoint used to generate the title. |
| --openaikey [KEY] | Yes | | Sets the OpenAI key used to generate the title. |
| --dry | No | False | Enables a dry run, which only prints the changes that would be made. |
| --loglevel [LEVEL] | No | INFO | Sets the desired log level. |

To run on all documents

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] all [filter_args]

Arguments

| Option | Required | Default | Description |
|---|---|---|---|
| --exclude [ID] | No | | Excludes the document ID specified from being updated. This argument may be specified multiple times. |
| --filterstr [FILTERSTRING] | No | | Filters the documents to be updated based on the URL filter string. |
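For example, a dry run over all documents that skips two document IDs might look like this (the URL, keys, and IDs below are placeholders; substitute your own values):

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh \
  --paperlessurl https://paperless.local:8080 \
  --paperlesskey YOUR_PAPERLESS_API_KEY \
  --openaikey YOUR_OPENAI_API_KEY \
  --dry \
  all --exclude 12 --exclude 34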

To run on a single document

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh [args] single (document_id)
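For example, to regenerate the title of document 42 (a placeholder ID; URL and keys as above):

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh \
  --paperlessurl https://paperless.local:8080 \
  --paperlesskey YOUR_PAPERLESS_API_KEY \
  --openaikey YOUR_OPENAI_API_KEY \
  single 42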

Additional Notes

  • The default OpenAI model used for generation is gpt-4-turbo. For slightly less accurate titles at a drastically reduced cost, use a GPT-3.5 model.
  • The number of characters of OCR text sent varies depending on the model being used; the script tries to send the maximum the model supports to get the best generation possible (see the rough sketch below).
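As an illustration of that last point, the per-model truncation boils down to something like the following. The character budgets and model names here are made up for the example; the actual limits used by the script may differ.

# Rough sketch of per-model OCR truncation; budgets below are illustrative only.
CHAR_BUDGETS = {
    "gpt-4-turbo": 400_000,    # large context window, send a lot of text
    "gpt-3.5-turbo": 48_000,   # smaller context window, send less
}

def truncate_ocr_text(text: str, model: str) -> str:
    # Send as much of the OCR text as the chosen model can reasonably accept.
    return text[:CHAR_BUDGETS.get(model, 12_000)]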

Privacy Concerns

The OpenAI API privacy documentation states that data sent to the OpenAI API is not used for training. If you would rather not send your documents to a third party at all, this post-consume script also supports other OpenAI-compatible API endpoints, which lets you use a locally hosted LLM (such as Ollama) to generate titles.
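For example, the back-fill CLI can be pointed at a local Ollama server, which exposes an OpenAI-compatible API. The endpoint and model name below are illustrative; Ollama ignores the API key, so any placeholder value satisfies the required flag, and the host must be reachable from inside the container:

docker run --rm -v ./app:/app python:3 /app/scripts/backfill.sh \
  --paperlessurl https://paperless.local:8080 \
  --paperlesskey YOUR_PAPERLESS_API_KEY \
  --openaibaseurl http://ollama.local:11434/v1 \
  --openaimodel llama3 \
  --openaikey ollama \
  all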

Contact, Support and Contributions

  • Create a GitHub issue for bug reports, feature requests, or questions.
  • Add a ⭐️ star on GitHub.
  • PRs are welcome.
