PDF to JPEG Azure Function (Custom docker runtime)

This Azure function uses the pdf2image library to convert PDF files to JPEG format.

There are multiple ways this project can be used:

The master branch will convert every page to JPEG and concatenate them togetehr into a single .jblob file. The .jblob file can then be converted back into multiple JPEGs by parsing it (example parser).
The first_page_only branch will only convert the first page to JPEG and save it as a .jpg file.
The one_file_per_page branch will convert every page of the PDF input into its own JPEG file. This version is a bit special and doesn't use the built-in Azure Function set() method to create the BLOBs, as it doesn't seem to support a variable number of BLOB outputs yet (with Python at least). Thus, another library is used and the name pattern in the output blob binding is ignored. The resulting JPEG files will be stored at the root of the destination container with the pattern {filename}-{pageNumber}.png.

The function is triggered each time a file is dropped in the pdfs BLOB container and puts the resulting .jblob in the images BLOB container. (see Configuration).

Usage

To use this function, you must create an Azure Function App and select Docker Container in the "Publish" section.

Once the Function App is created, select "Container settings" is the app dashboard and enter one of these:

dockerutils/pypdftojpg:latest is the master branch, with the jblob file containing all the images concatenated.
dockerutils/pypdftojpg:first_page_only is the first_page_only branch.
dockerutils/pypdftojpg:one_file_per_page is the one_file_per_page branch.

See the section above for more details on these variants.

Configuration

This function uses two variables called InputStorage and OutputStorage. In the app dashboard, select "Configuration" and provide a value for both InputStorage and OutputStorage (The Storage Accounts access key). Both variable can designate the same Azure Data Storage

See the application settings documentation.

{
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "name": "myBlob",
            "type": "blobTrigger",
            "direction": "in",
            "path": "pdfs/{name}",
            "connection": "InputStorage"
        },
        {
            "name": "myOutputBlob",
            "type": "blob",
            "path": "images/{name}.jblob",
            "connection": "OutputStorage",
            "direction": "out"
        }
    ]
}

Why Docker

The pdf2image library requires the poppler tool to be installed on the host. This tool is not included in the default Microsoft Python runtime, so this runtime needs to be extended.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
pdftojpeg		pdftojpeg
.dockerignore		.dockerignore
.funcignore		.funcignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
host.json		host.json
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF to JPEG Azure Function (Custom docker runtime)

Usage

Configuration

Why Docker

About

Languages

SoTrx/function-pdf-to-jpeg

Folders and files

Latest commit

History

Repository files navigation

PDF to JPEG Azure Function (Custom docker runtime)

Usage

Configuration

Why Docker

About

Topics

Resources

Stars

Watchers

Forks

Languages