Replies: 3 comments
While the browser version is configured to be highly parallelized and runs everything using workers, the Node.js version essentially only uses one core. In theory, the Node.js version should be able to work similarly to the browser version. On a technical level, it is still possible to run most parts of the code in parallel on Node.js, which should speed things up considerably. No option for this exists yet; however, I can add one easily.
If this is an "offline" task (i.e. you're not attaching scribe.js to a web server), the most efficient way to handle it would be to parallelize at the "coarse-grained" level and run many processes at the same time using a utility such as GNU Parallel.
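For readers who would rather stay inside Node.js than install GNU Parallel, the same coarse-grained idea can be sketched as a small process pool. This is only a sketch and not part of Scribe.js: `runPool` and `runScript` are hypothetical helper names, the script path and the limit of 15 are placeholders, and it is written as a standalone CommonJS snippet.

```javascript
// Sketch: run one Node.js process per PDF, at most `limit` at a time.
const { spawn } = require('node:child_process');

// Generic concurrency-limited map (hypothetical helper).
// Runs worker(item, index) for every item, never more than `limit` at once,
// and returns the results in input order.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}

// Hypothetical worker: spawns `node <scriptPath> <pdfPath>` as a separate
// process and resolves with its exit code.
function runScript(scriptPath, pdfPath) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, [scriptPath, pdfPath], { stdio: 'inherit' });
    child.on('error', reject);
    child.on('close', (code) => resolve(code));
  });
}

// Usage (placeholder paths):
// runPool(pdfPaths, 15, (pdf) => runScript('src/document_processor.js', pdf))
//   .then((exitCodes) => console.log(exitCodes));
```

Because every PDF gets its own process, each job has its own memory and its own Scribe.js instance, which is the same isolation GNU Parallel provides.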
I don't have a sponsorship page set up for Scribe.js; we made one for Tesseract.js, but it was not very popular. If you're interested in sponsoring the project in general or a particular development goal, we can open a page and/or discuss specific goals.
Thank you for the fast response @Balearica! That makes a lot of sense. I considered trying out [...] As for the sponsorship, I was curious because scribe.js might become an important dependency, and it's in my interest to see this project developed further. If nothing is set up yet, just let me know if that changes at a later stage and I'll chime in.
@Balearica Small update: I reached the following benchmark with 15 parallel processes (a small extract of the entire log, covering a variety of page counts):
This averages ~6 seconds per page. Very impressed with this approach. Thank you for pointing me in the right direction! Below is the code, in case anyone comes across this in the future and needs some pointers.
Shell script
#!/bin/bash
# Configuration
MAX_PARALLEL_JOBS=15
SCRIPT_PATH="src/document_processor.js"
PDF_DIR="data/documents"
OCR_DATA_DIR="/usr/share/tessdata"

# Verify that all external tools and Node.js modules are available
check_requirements() {
    if ! command -v tesseract > /dev/null; then
        echo "Error: Tesseract is not installed"
        echo "Install with: apt-get install tesseract-ocr"
        exit 1
    fi

    if [ ! -f "$OCR_DATA_DIR/eng.traineddata" ]; then
        echo "Error: Language data not found at $OCR_DATA_DIR/eng.traineddata"
        exit 1
    fi

    if ! command -v parallel > /dev/null; then
        echo "Error: GNU Parallel is not installed"
        echo "Install with: apt-get install parallel"
        exit 1
    fi

    echo "Checking Node.js dependencies..."
    if ! node -e "require('canvas')" 2>/dev/null; then
        echo "Rebuilding canvas module for current Node.js version..."
        npm rebuild canvas
        if [ $? -ne 0 ]; then
            echo "Error: Failed to rebuild canvas module"
            exit 1
        fi
    fi
}

# Find all PDFs and run one Node.js process per file, in parallel
process_documents() {
    if [ ! -d "$PDF_DIR" ]; then
        echo "Error: Documents directory not found: $PDF_DIR"
        exit 1
    fi

    echo "Finding PDF files..."
    find "$PDF_DIR" -name "*.pdf" > /tmp/pdf_files.txt
    total_files=$(wc -l < /tmp/pdf_files.txt)
    echo "Found $total_files PDF files to process"

    if [ "$total_files" -eq 0 ]; then
        echo "No PDF files found in $PDF_DIR"
        rm /tmp/pdf_files.txt
        exit 1
    fi

    # Process files in parallel with error handling.
    # GNU Parallel shell-quotes {} itself, so {} is left unquoted here;
    # wrapping it in extra quotes breaks paths that contain spaces.
    cat /tmp/pdf_files.txt | parallel -j "$MAX_PARALLEL_JOBS" --bar --halt now,fail=1 \
        "node $SCRIPT_PATH {} 2>&1;
        exit_code=\$?;
        if [ \$exit_code -eq 1 ]; then
            echo 'Failed processing' {} >&2;
            exit 1;
        elif [ \$exit_code -eq 2 ]; then
            echo 'Skipped (already processed):' {};
            exit 0;
        elif [ \$exit_code -eq 0 ]; then
            echo 'Successfully processed' {};
            exit 0;
        else
            echo \"Unknown exit code \$exit_code for\" {} >&2;
            exit 1;
        fi"
    process_status=$?
    rm /tmp/pdf_files.txt

    if [ $process_status -ne 0 ]; then
        echo "Error: One or more files failed to process"
        exit 1
    fi

    echo "Processing completed successfully"
}

# Display help if requested
if [ "$1" == "-h" ] || [ "$1" == "--help" ]; then
    echo "Usage: $0 [--help]"
    echo "Process all PDF files in the documents directory using parallel OCR processing"
    echo ""
    echo "Options:"
    echo "  -h, --help    Show this help message"
    exit 0
fi

# Main execution
echo "Starting document processing..."
check_requirements
process_documents
src/document_processor.js
/*
  This script processes PDF documents using Scribe OCR.
  It reads PDFs and extracts their text content.
*/
import fs from 'fs/promises';
import path from 'path';
import { fileURLToPath } from 'url';
import scribe from 'scribe.js-ocr';

const CONFIG = {
  ocr: {
    language: 'eng', // Change as needed
  }
};

// Run text extraction on a single PDF and log how long it took
async function processFile(pdfPath) {
  try {
    const startTime = Date.now();
    console.log(`Processing: ${path.basename(pdfPath)}...`);
    const result = await scribe.extractText([pdfPath], [CONFIG.ocr.language], 'txt', {
      skipRecPDFTextNative: false,
      skipRecPDFTextOCR: false
    });
    const duration = (Date.now() - startTime) / 1000;
    console.log(`OCR completed in ${duration.toFixed(1)}s`);
    return result;
  } catch (error) {
    console.error(`Error processing ${pdfPath}:`, error);
    throw error;
  }
}

// Initialize Scribe, process the file given in options.file, and report the result
async function processDocument(options = {}) {
  try {
    if (!options.file) {
      throw new Error('No file specified');
    }
    const documentId = path.basename(options.file, '.pdf');

    // Initialize OCR
    try {
      await scribe.init({ pdf: true });
    } catch (error) {
      console.error('Failed to initialize Scribe:', error);
      throw error;
    }

    // Process the file
    const startTime = Date.now();
    const result = await processFile(options.file);
    const processingTime = (Date.now() - startTime) / 1000;

    // Here you would save the result to your preferred storage
    // For example:
    // await saveToDatabase(result);
    // or
    // await fs.writeFile(`output/${documentId}.txt`, result);

    console.log(`OK: ${documentId} [${processingTime.toFixed(1)}s]`);
    return 'processed';
  } catch (error) {
    console.error(`ERROR: ${error.message}`);
    throw error;
  }
}

// Main execution (only when run directly, not when imported)
if (process.argv[1] === fileURLToPath(import.meta.url)) {
  processDocument({
    file: process.argv[2]
  }).then(result => {
    // Exit code 0: Successfully processed
    // Exit code 1: Error
    process.exit(0);
  }).catch(error => {
    console.error(`Unhandled error: ${error.message}`);
    process.exit(1);
  });
}

// Basic error handlers
process.on('uncaughtException', (error) => {
  console.error('Uncaught Exception:', error);
  process.exit(1);
});
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
  process.exit(1);
});

export { processDocument };
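One detail worth noting: the shell wrapper treats exit code 2 as "skipped (already processed)", but the script as posted only ever exits with 0 or 1. A minimal sketch of how a skip check could be wired in follows. It assumes results are written to `output/<documentId>.txt`, as in the commented-out `fs.writeFile` example; `EXIT_CODES`, `outputPathFor`, `alreadyProcessed`, and `exitCodeFor` are hypothetical names, and the snippet is written as standalone CommonJS rather than as an edit to the ESM script.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Exit codes the shell wrapper distinguishes.
const EXIT_CODES = { processed: 0, error: 1, skipped: 2 };

// Assumed convention: the text for foo.pdf is stored at <outputDir>/foo.txt.
function outputPathFor(pdfPath, outputDir = 'output') {
  return path.join(outputDir, `${path.basename(pdfPath, '.pdf')}.txt`);
}

// A document counts as processed if its output file already exists.
function alreadyProcessed(pdfPath, outputDir = 'output') {
  return fs.existsSync(outputPathFor(pdfPath, outputDir));
}

// Map a status string to an exit code; anything unknown is treated as an error.
function exitCodeFor(status) {
  return Object.prototype.hasOwnProperty.call(EXIT_CODES, status)
    ? EXIT_CODES[status]
    : EXIT_CODES.error;
}

// Usage idea: return 'skipped' from processDocument when alreadyProcessed(file)
// is true, then call process.exit(exitCodeFor(status)) in the main block.
```

With a check like this, an interrupted batch could be restarted without repeating OCR on files that already have output.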
Hello there!
I've come across scribe.js while working with tesseract.js. My current challenge is converting about 15,000 PDFs, with ~25 pages each on average, in Dutch ('nld') to text.
My tesseract.js implementation uses the scheduler and workers that the package provides. What I'm wondering is how to use scribe in the same, efficient way. Right now, it seems to be a bit faster than tesseract in some instances.
Sample of processing times:
This comes down to ~15 seconds per page.
The server that I am running this on has 256GB RAM and 24 CPU cores.
What I have found so far is that when running scribe with a simple, one-by-one implementation, it uses about 50% of one CPU core (~2% of the total available CPU resources) and about 1GB of RAM.
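To put these figures in perspective, here is a back-of-the-envelope estimate using the numbers from this post (15,000 PDFs, ~25 pages each, ~15 seconds per page). It is a rough sketch only: `estimateDays` is a hypothetical helper, and it assumes throughput scales linearly with the number of processes, which real workloads rarely achieve.

```javascript
// Rough estimate of total wall-clock time, assuming ideal linear scaling
// across processes (an optimistic assumption, not a measurement).
function estimateDays({ documents, pagesPerDocument, secondsPerPage, processes }) {
  const pages = documents * pagesPerDocument;
  const sequentialSeconds = pages * secondsPerPage;
  const days = sequentialSeconds / processes / 86400; // 86400 seconds per day
  return { pages, days };
}

const base = { documents: 15000, pagesPerDocument: 25, secondsPerPage: 15 };

const oneByOne = estimateDays({ ...base, processes: 1 });
console.log(oneByOne.pages);           // 375000
console.log(oneByOne.days.toFixed(1)); // "65.1"

// One process per core on the 24-core server, under the same assumption.
const perCore = estimateDays({ ...base, processes: 24 });
console.log(perCore.days.toFixed(1));  // "2.7"
```

At roughly 1GB of RAM per process, 24 processes would need about 24GB of the 256GB available, so on these figures CPU cores rather than memory would be the limiting factor.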
Implementation:
Concisely, I'm wondering if:
I've tried using worker_threads to run multiple jobs at once, but it seems scribe uses node-canvas, and my attempt stalled there.
Any response or info would be appreciated 🙏
@Balearica Also, do you have sponsor options?