Text block in PDF-to-text conversion being broken up into multiple lines #6504

Closed
idea-launch-lab opened this issue Oct 5, 2015 · 1 comment

@idea-launch-lab

Hi,
I am trying to recover text from the reference section of a PDF article, but a text block sometimes breaks up into multiple lines. A somewhat related issue might be #4629.

Incorrect text splitting:
pdfjs_text_splitting_issue

Correct text splitting:
pdfjs_correct_text_splitting

I would like to recover the text without the extra line delimiters. Below is the code I use to get text from the PDF:

// PDFJS.version = '1.0.85';
// PDFJS.build = '094d0e2';

context.getPDFText = function pico_getPDFText() {

    // pdfObj: global reference to PDF file object
    // Collect one promise per page so the caller can wait for all of them
    var pagePromises = [];

    // Get each page and extract text content of the page
    for (var p = 1; p <= context.pdfObj.numPages; p++) {

        // Asynchronous processing
        pagePromises.push(context.pdfObj.getPage(p).then(function(page) {
            return page.getTextContent();
        }).then(function(textContent) {
            var pageLines = [];

            for (var i = 0; i < textContent.items.length; i++) {
                var line = textContent.items[i].str.trim().toLowerCase();

                if (line === 'references' || line === 'bibliography') {
                    // print references / bibliographic citations
                    console.log(line);
                }
                pageLines.push(line);
            }
            return pageLines.join('\n');
        }));
    }

    // Resolves with the PDF text data once every page has been processed
    return Promise.all(pagePromises).then(function(pageTexts) {
        return pageTexts.join('\n');
    });
};
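
For illustration only: if the extra breaks come from one visual line being returned as several entries in textContent.items, one possible workaround is to merge consecutive items that share a baseline. The sketch below assumes each item carries a `transform` array whose last element is the y position, as pdf.js text items do. The helper name `groupItemsIntoLines` and the tolerance of 2 units are hypothetical choices, not part of pdf.js, and this has not been checked against the PDF in question.

```javascript
// Hypothetical helper (not a pdf.js API): merges consecutive text items that
// sit on (nearly) the same baseline into a single line of text.
// Assumes items shaped like { str: string, transform: [a, b, c, d, x, y] }.
function groupItemsIntoLines(items, tolerance) {
    // Assumed default: baselines within 2 units count as the same line
    tolerance = tolerance === undefined ? 2 : tolerance;

    var lines = [];
    var current = null;

    items.forEach(function (item) {
        var y = item.transform[5]; // vertical position of the item's baseline

        if (current && Math.abs(current.y - y) <= tolerance) {
            // Same baseline as the previous item: extend the current line
            current.parts.push(item.str);
        } else {
            // Baseline changed: start a new line
            current = { y: y, parts: [item.str] };
            lines.push(current);
        }
    });

    // Join the pieces of each line and collapse repeated whitespace
    return lines.map(function (line) {
        return line.parts.join(' ').replace(/\s+/g, ' ').trim();
    });
}
```

It could be called as groupItemsIntoLines(textContent.items) inside the getTextContent() callback. Because it only merges neighbouring items, it relies on the items arriving in reading order, and it will not rejoin a reference that wraps across several visual lines.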

Thank you.
Sid

@Snuffleupagus
Collaborator

Closing as incomplete. Please provide a link to the PDF file (or attach it to the issue, since GitHub now supports that) so that the issue can be re-opened.
