Optimal performance with scribe.js (workers/threads) #14

ErwinAI · 2024-11-04T02:09:54Z

Hello there!

I've come across scribe.js while working with tesseract.js. My current challenge is converting about 15,000 PDFs, averaging ~25 pages each and written in Dutch ('nld'), to text.

My tesseract.js implementation uses the scheduler and workers that the package provides. What I'm wondering is how to use scribe in a similarly efficient way. Right now, it seems to be a bit faster than tesseract in some instances.

Sample of processing times:

File1.pdf: 22 pages, 223 KB, 344.0 seconds
File2.pdf: 11 pages, 126 KB, 158.7 seconds
File3.pdf: 33 pages, 262 KB, 514.3 seconds
File4.pdf: 27 pages, 211 KB, 416.1 seconds
File5.pdf: 46 pages, 526 KB, 665.3 seconds
File6.pdf: 13 pages, 140 KB, 187.0 seconds

This comes down to ~15 seconds per page.
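
As a rough cross-check of that figure, and to put the full job in perspective, the numbers above can be plugged into a few lines of JavaScript. This is only a back-of-the-envelope sketch: it assumes the six sample files are representative and that the per-page time stays constant across all ~15,000 PDFs.

```javascript
// [pages, seconds] for the six sample files listed above.
const samples = [
  [22, 344.0], [11, 158.7], [33, 514.3],
  [27, 416.1], [46, 665.3], [13, 187.0],
];

const pages = samples.reduce((sum, [p]) => sum + p, 0);
const seconds = samples.reduce((sum, [, s]) => sum + s, 0);
const secondsPerPage = seconds / pages;

// Corpus size from the post: ~15,000 PDFs at ~25 pages each.
const totalPages = 15000 * 25;
const sequentialDays = (totalPages * secondsPerPage) / 86400;

console.log(`${secondsPerPage.toFixed(1)} s/page`); // 15.0 s/page
console.log(`${sequentialDays.toFixed(0)} days for ${totalPages} pages, one file at a time`); // 65 days
```

At roughly two months of wall-clock time, processing the files strictly one after another is not practical, which is why resource usage matters here.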

The server I'm running this on has 256 GB of RAM and 24 CPU cores.
What I've found so far is that, with a simple one-by-one implementation, scribe uses about 50% of one CPU core (~2% of the total available CPU resources) and about 1 GB of RAM.

Implementation:

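// Imports this excerpt relies on.
import path from 'path';
import scribe from 'scribe.js-ocr';

// Assumption: `logger` is the application's own logger (not shown here); `console` works as a stand-in.
const logger = console;

async function processFile(pdfPath) {
    try {
        const startTime = Date.now();
        logger.info(`Starting OCR for ${path.basename(pdfPath)}...`);

        // One PDF per call, Dutch language data, plain-text output.
        // Both skip options are off, so recognition runs even if the PDF already carries a text layer.
        const result = await scribe.extractText([pdfPath], ['nld'], 'txt', {
            skipRecPDFTextNative: false,
            skipRecPDFTextOCR: false
        });

        const duration = (Date.now() - startTime) / 1000;
        logger.info(`Completed OCR for ${path.basename(pdfPath)} in ${duration.toFixed(1)}s`);

        return result;
    } catch (error) {
        logger.error(`Error processing ${pdfPath}:`, error);
        throw error;
    }
}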

Concisely, I'm wondering:

- whether scribe uses all, or some percentage, of the resources available on the machine
- if not, how resource allocation works
- whether anything can speed scribe up further by making use of the available resources

I've tried using worker_threads to run multiple jobs at once, but scribe appears to depend on node-canvas, and my attempt stalled there:

Error processing /opt/project/files/file.pdf: Error: node-canvas is not currently supported on worker threads.
    at initCanvasNode (file:///opt/project/node_modules/scribe.js-ocr/js/worker/compareOCRModule.js:98:30)
    at async runFontOptimization (file:///opt/project/node_modules/scribe.js-ocr/js/fontEval.js:170:7)
    at async recognize (file:///opt/project/node_modules/scribe.js-ocr/js/recognizeConvert.js:595:5)
    at async Object.extractText (file:///opt/project/node_modules/scribe.js-ocr/scribe.js:99:35)
    at async processFile (file:///opt/project/workers/scribe_worker.js:29:24)
    at async MessagePort.<anonymous> (file:///opt/project/workers/scribe_worker.js:47:24)

Any response or info would be appreciated 🙏

@Balearica Also, do you have sponsor options?

Balearica (Maintainer) · 2024-11-04T04:19:46Z

While the browser version is configured to be highly parallelized and runs everything in workers, the Node.js version essentially uses only one core. In theory, the Node.js version should be able to work similarly to the browser version using worker_threads; however, as you have already discovered, the node-canvas dependency still does not support worker threads. This is the main issue: if node-canvas were updated, every computationally expensive operation could run in a separate thread (see Automattic/node-canvas#1394).

On a technical level, it is still possible to run most parts of the code in parallel on Node.js, which should speed things up considerably. No option for this exists yet, but I can add one easily.

> I've come across scribe.js while working with tesseract.js. My current challenge is converting about 15,000 PDFs, averaging ~25 pages each and written in Dutch ('nld'), to text.

If this is an "offline" task (i.e. you're not attaching scribe.js to a web server), the most efficient approach would be to parallelize at the coarse-grained level and run many processes at the same time using a utility such as GNU Parallel.
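
For anyone who prefers to stay inside Node.js, the same coarse-grained idea could be sketched as a small job queue that keeps at most N jobs in flight. This is a minimal sketch, not part of scribe.js: `runWithLimit`, `task`, and `fakeOcr` are illustrative names, and the demo uses a timer as a stand-in for the real work.

```javascript
// Run `task(item)` for every item, with at most `limit` tasks in flight.
// `runWithLimit` and `task` are illustrative names, not part of scribe.js.
async function runWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each "lane" pulls the next unclaimed item until none remain.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, lane));
  return results;
}

// Demo with a stand-in task; a real task would start one OS process per PDF.
const fakeOcr = (name) =>
  new Promise((resolve) => setTimeout(() => resolve(`${name}: done`), 10));

runWithLimit(['a.pdf', 'b.pdf', 'c.pdf'], 2, fakeOcr).then((out) => console.log(out));
```

For actual parallelism, `task` would need to launch a separate `node` process per file, for example with `execFile` from Node's `child_process` module wrapped in a promise, so that each OCR run gets its own main thread and node-canvas never runs inside a worker thread.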

> Also, do you have sponsor options?

I don't have any sponsorship page set up for Scribe.js; we made one for Tesseract.js but it was not very popular. If you're interested in sponsoring the project in general or a particular development goal we can open a page and/or discuss any specific goals.

ErwinAI (Author) · 2024-11-04T05:07:05Z

Thank you for the fast response @Balearica!

That makes a lot of sense. I considered trying child_process and might do so at a later stage, but for now going down the coarse-grained road is the better call.

As for the sponsorship, I was curious because scribe.js might become an important dependency, and it's in my interest to have this package/project developed further. If nothing is set up yet, just let me know if that changes at a later stage and I'll chime in.

ErwinAI (Author) · 2024-11-04T07:51:02Z

@Balearica Small update: I reached the following benchmark with 15 parallel processes (a small extract of the full log, covering a variety of page counts):

File1.pdf: 13 pages, 82.1 seconds
File2.pdf: 7 pages, 41.6 seconds
File3.pdf: 5 pages, 30.1 seconds
File4.pdf: 9 pages, 51.9 seconds
File5.pdf: 11 pages, 68.0 seconds
File6.pdf: 11 pages, 66.5 seconds
File7.pdf: 23 pages, 141.7 seconds
File8.pdf: 41 pages, 259.0 seconds
File9.pdf: 40 pages, 239.9 seconds
File10.pdf: 44 pages, 253.1 seconds

This averages out to ~6 seconds per page.
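
Note that the ~6 seconds per page is measured per process, while 15 processes run side by side. A rough sketch of what that means for the whole job, assuming all 15 jobs stay busy, the per-page time holds, and the corpus is the ~15,000 PDFs at ~25 pages each from the first post:

```javascript
const parallelJobs = 15;                // 15 parallel processes, as stated above
const perProcessSecondsPerPage = 6;     // from the log extract above
const corpusPages = 15000 * 25;         // corpus size from the first post

const pagesPerSecond = parallelJobs / perProcessSecondsPerPage;
const totalHours = corpusPages / pagesPerSecond / 3600;

console.log(`${pagesPerSecond} pages/s overall`);                            // 2.5 pages/s overall
console.log(`about ${totalHours.toFixed(1)} hours for ${corpusPages} pages`); // about 41.7 hours
```

That is an estimate rather than a measurement, but it suggests the full run fits in under two days instead of weeks.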

Very impressed with this approach. Thank you for pointing me in the right direction!

Below is the code, in case anyone comes across this in the future and needs some pointers.
Be aware that my PDF files were sorted into one folder per year of creation, which might explain some of the more specific parts of the code:

Shell script

#!/bin/bash

# Configuration
MAX_PARALLEL_JOBS=15
SCRIPT_PATH="src/document_processor.js"
PDF_DIR="data/documents"
OCR_DATA_DIR="/usr/share/tessdata"

check_requirements() {
    if ! command -v tesseract > /dev/null; then
        echo "Error: Tesseract is not installed"
        echo "Install with: apt-get install tesseract-ocr"
        exit 1
    fi

    if [ ! -f "$OCR_DATA_DIR/eng.traineddata" ]; then
        echo "Error: Language data not found at $OCR_DATA_DIR/eng.traineddata"
        exit 1
    fi

    if ! command -v parallel > /dev/null; then
        echo "Error: GNU Parallel is not installed"
        echo "Install with: apt-get install parallel"
        exit 1
    fi

    echo "Checking Node.js dependencies..."
    if ! node -e "require('canvas')" 2>/dev/null; then
        echo "Rebuilding canvas module for current Node.js version..."
        if ! npm rebuild canvas; then
            echo "Error: Failed to rebuild canvas module"
            exit 1
        fi
    fi
}

process_documents() {
    if [ ! -d "$PDF_DIR" ]; then
        echo "Error: Documents directory not found: $PDF_DIR"
        exit 1
    fi

    echo "Finding PDF files..."
    find "$PDF_DIR" -name "*.pdf" > /tmp/pdf_files.txt
    
    total_files=$(wc -l < /tmp/pdf_files.txt)
    echo "Found $total_files PDF files to process"
    
    if [ "$total_files" -eq 0 ]; then
        echo "No PDF files found in $PDF_DIR"
        rm /tmp/pdf_files.txt
        exit 1
    fi
    
    # Process files in parallel; --halt now,fail=1 stops the whole batch on the first failure.
    # GNU Parallel shell-quotes {} itself, so it is left unquoted inside the command.
    cat /tmp/pdf_files.txt | parallel -j "$MAX_PARALLEL_JOBS" --bar --halt now,fail=1 \
        "node $SCRIPT_PATH {} 2>&1;
        exit_code=\$?;
        if [ \$exit_code -eq 1 ]; then
            echo Failed processing {} >&2;
            exit 1;
        elif [ \$exit_code -eq 2 ]; then
            echo Skipped {} '(already processed)';
            exit 0;
        elif [ \$exit_code -eq 0 ]; then
            echo Successfully processed {};
            exit 0;
        else
            echo \"Unknown exit code \$exit_code for\" {} >&2;
            exit 1;
        fi"
    
    process_status=$?
    rm /tmp/pdf_files.txt
    
    if [ $process_status -ne 0 ]; then
        echo "Error: One or more files failed to process"
        exit 1
    fi
    
    echo "Processing completed successfully"
}

# Display help if requested
if [ "$1" == "-h" ] || [ "$1" == "--help" ]; then
    echo "Usage: $0 [--help]"
    echo "Process all PDF files in the documents directory using parallel OCR processing"
    echo ""
    echo "Options:"
    echo "  -h, --help    Show this help message"
    exit 0
fi

# Main execution
echo "Starting document processing..."
check_requirements
process_documents

src/document_processor.js

/*
This script processes PDF documents using Scribe OCR.
It reads PDFs and extracts their text content.
*/

import fs from 'fs/promises';
import path from 'path';
import { fileURLToPath } from 'url';
import scribe from 'scribe.js-ocr';

const CONFIG = {
    ocr: {
        language: 'eng',  // Change as needed
    }
};

async function processFile(pdfPath) {
    try {
        const startTime = Date.now();
        console.log(`Processing: ${path.basename(pdfPath)}...`);
        
        const result = await scribe.extractText([pdfPath], [CONFIG.ocr.language], 'txt', {
            skipRecPDFTextNative: false,
            skipRecPDFTextOCR: false
        });
        
        const duration = (Date.now() - startTime) / 1000;
        console.log(`OCR completed in ${duration.toFixed(1)}s`);
        
        return result;
    } catch (error) {
        console.error(`Error processing ${pdfPath}:`, error);
        throw error;
    }
}

async function processDocument(options = {}) {
    try {
        if (!options.file) {
            throw new Error('No file specified');
        }

        const documentId = path.basename(options.file, '.pdf');
        
        // Initialize OCR
        try {
            await scribe.init({ pdf: true });
        } catch (error) {
            console.error('Failed to initialize Scribe:', error);
            throw error;
        }

        // Process the file
        const startTime = Date.now();
        const result = await processFile(options.file);
        const processingTime = (Date.now() - startTime) / 1000;

        // Here you would save the result to your preferred storage
        // For example:
        // await saveToDatabase(result);
        // or
        // await fs.writeFile(`output/${documentId}.txt`, result);
        
        console.log(`OK: ${documentId} [${processingTime.toFixed(1)}s]`);
        return 'processed';
    } catch (error) {
        console.error(`ERROR: ${error.message}`);
        throw error;
    }
}

// Main execution
if (process.argv[1] === fileURLToPath(import.meta.url)) {
    processDocument({
        file: process.argv[2]
    }).then(result => {
        // Exit code 0: Successfully processed
        // Exit code 1: Error
        process.exit(0);
    }).catch(error => {
        console.error(`Unhandled error: ${error.message}`);
        process.exit(1);
    });
}

// Basic error handlers
process.on('uncaughtException', (error) => {
    console.error('Uncaught Exception:', error);
    process.exit(1);
});

process.on('unhandledRejection', (reason, promise) => {
    console.error('Unhandled Rejection at:', promise, 'reason:', reason);
    process.exit(1);
});

export { processDocument };
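
The shell script above also reacts to exit code 2 ("already processed"), which this script never produces. A minimal sketch of how the two could be lined up, assuming `processDocument` were extended to return a hypothetical `'skipped'` status alongside `'processed'` (the names `EXIT_CODES` and `exitCodeFor` are illustrative as well):

```javascript
// Exit codes the shell wrapper understands: 0 = processed, 1 = error, 2 = skipped.
// 'skipped' is a hypothetical status; processDocument above only returns 'processed'.
const EXIT_CODES = { processed: 0, skipped: 2 };

function exitCodeFor(status) {
  // Anything unexpected is treated as a failure so the batch halts.
  return Object.hasOwn(EXIT_CODES, status) ? EXIT_CODES[status] : 1;
}

console.log(exitCodeFor('processed'), exitCodeFor('skipped'), exitCodeFor('oops')); // 0 2 1
```

In the main block this would replace the hard-coded `process.exit(0)` with something like `.then((status) => process.exit(exitCodeFor(status)))`.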
