Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detecting language page-wise #18

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

rachitavya
Copy link
Collaborator

@rachitavya rachitavya commented Sep 24, 2024

Concerning ticket: https://github.com/BharatSahAIyak/pdf-parser/issues/210

Summary by CodeRabbit

  • New Features

    • Enhanced language detection capabilities for PDFs, including detection for all pages and specific page detection using OCR.
    • Introduction of a streamlined language detection process consolidating multiple checks into a single function.
  • Bug Fixes

    • Improved handling of unknown languages and refined criteria for OCR processing.
  • Documentation

    • Updated code formatting for improved readability and consistency.

Copy link

coderabbitai bot commented Sep 24, 2024

Walkthrough

Walkthrough

The changes involve a restructuring of the language detection logic within the convert_single_pdf function, consolidating the process into a new language_detection function. The marker/ocr/langdetect.py file sees enhancements to language detection capabilities, including new functions for processing PDF pages. The run_ocr function in marker/ocr/recognition.py is updated to refine which pages require OCR processing based on language metadata. Overall, the modifications aim to streamline language detection and improve code readability.

Changes

File Change Summary
marker/convert.py Restructured language detection logic in convert_single_pdf, added languages_meta parameter, and replaced conditional checks with a call to language_detection.
marker/ocr/langdetect.py Enhanced language detection functions, added detect_language_ocr_page, modified detect_language_ocr to process all pages, and updated string formatting for consistency.
marker/ocr/recognition.py Updated run_ocr function to include languages_meta and refined criteria for determining pages requiring OCR processing.

Possibly related PRs

  • unknown lang bug fix #7: Enhancements to language handling in convert_single_pdf are relevant to the restructuring of language detection logic in the main PR.
  • Removing all invalid languages #8: Modifications to convert_single_pdf for improved language validation align with the focus on language detection improvements in the main PR.
  • Table Detection and Parser #11: Changes to convert_single_pdf related to table detection directly connect to the logic and parameters modified in the main PR.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    -- I pushed a fix in commit <commit_id>, please review it.
    -- Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    -- @coderabbitai generate unit testing code for this file.
    -- @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    -- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    -- @coderabbitai read src/utils.ts and generate unit testing code.
    -- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    -- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Outside diff range and nitpick comments (4)
marker/ocr/langdetect.py (4)

104-104: Remove unused variable e or use it in exception handling.

The variable e is assigned but never used in the except block. If you don't need the exception details, you can remove as e. If you do, consider using it in your error message.

Apply this diff to remove e if not needed:

-except Exception as e:
+except Exception:
    print("Error while detecting language.")
    language = "en"

Or use e in the error message:

-    print("Error while detecting language.")
+    print(f"Error while detecting language: {e}")
Tools
Ruff

104-104: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


152-152: Remove unused variable n_pages.

The variable n_pages is assigned but never used in the function detect_language_ocr_page. Unused variables can be removed to clean up the code.

Apply this diff to remove the unused variable:

-n_pages = pdf_document.page_count
Tools
Ruff

152-152: Local variable n_pages is assigned to but never used

Remove assignment to unused variable n_pages

(F841)


148-180: Add a docstring to detect_language_ocr_page function for clarity.

Adding a docstring will improve code readability by explaining the purpose, parameters, and return value of the function.

Here's an example docstring:

def detect_language_ocr_page(pdf_path, page_number):
    """
    Detect the language of a specific page in a PDF using OCR.

    Parameters:
        pdf_path (str): The path to the PDF file.
        page_number (int): The zero-based index of the page to process.

    Returns:
        str: The detected language code, or 'unknown' if detection fails.
    """
    # Function implementation...
Tools
Ruff

152-152: Local variable n_pages is assigned to but never used

Remove assignment to unused variable n_pages

(F841)


201-217: Add a docstring to language_detection function for better understanding.

Including a docstring helps other developers understand the function's purpose, inputs, and outputs.

Example docstring:

def language_detection(pages: List[Page], pdf_path, valid_langs):
    """
    Detect the languages of pages in a PDF, using preliminary text detection and OCR as a fallback.

    Parameters:
        pages (List[Page]): List of Page objects containing preliminary text.
        pdf_path (str): Path to the PDF file.
        valid_langs (List[str]): List of valid language codes.

    Returns:
        Dict[str, str]: A dictionary mapping page numbers (as strings) to detected language codes.
    """
    # Function implementation...
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5b52a59 and d72ae16.

Files selected for processing (3)
  • marker/convert.py (3 hunks)
  • marker/ocr/langdetect.py (5 hunks)
  • marker/ocr/recognition.py (1 hunks)
Additional context used
Ruff
marker/convert.py

32-32: marker.ocr.langdetect.get_text imported but unused

Remove unused import

(F401)


33-33: marker.ocr.langdetect.detect_language_text imported but unused

Remove unused import

(F401)


34-34: marker.ocr.langdetect.detect_language_ocr imported but unused

Remove unused import

(F401)


35-35: marker.ocr.langdetect.keep_most_frequent_element imported but unused

Remove unused import

(F401)

marker/ocr/langdetect.py

15-15: langdetect.detect_langs imported but unused

Remove unused import: langdetect.detect_langs

(F401)


17-17: marker.settings.settings imported but unused

Remove unused import: marker.settings.settings

(F401)


71-71: Do not use bare except

(E722)


104-104: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


152-152: Local variable n_pages is assigned to but never used

Remove assignment to unused variable n_pages

(F841)

Additional comments not posted (2)
marker/ocr/recognition.py (1)

54-54: Verify Conditional Logic for OCR Page Selection

After correcting the definitions of ocr_langs and unknown_langs, please verify that the conditional statement accurately reflects the intended logic for determining which pages require OCR processing:

if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:

Ensure that this condition correctly identifies pages needing OCR, especially with the updated unknown_langs.

marker/ocr/langdetect.py (1)

99-102: ⚠️ Potential issue

Consider not defaulting to 'en' on exception in detect_language_page.

In the except block, the language is set to "en" when an error occurs. This might not reflect the actual language and could lead to incorrect processing. Consider setting the language to "unknown" instead.

Apply this diff to set the language to "unknown":

        print(llm_response)
        language = json.loads(llm_response)["language"]

    except Exception as e:
        print("Error while detecting language.")
-       language = "en"
+       language = "unknown"

Likely invalid or redundant comment.

Comment on lines +46 to +51
ocr_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
unknown_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Logical Error: Duplicate Definitions of ocr_langs and unknown_langs

The variables ocr_langs and unknown_langs are both assigned the same list comprehension, resulting in identical lists. Given their intended purposes, this seems unintended. Specifically, ocr_langs should contain page indices where the language is neither "en" nor "unknown", while unknown_langs should contain indices where the language is "unknown".

Apply this diff to correct the definition of unknown_langs:

 ocr_langs = [
     int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
 ]
 unknown_langs = [
-    int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    int(k) for k, v in languages_meta.items() if v == "unknown"
 ]
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ocr_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
unknown_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
ocr_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
unknown_langs = [
int(k) for k, v in languages_meta.items() if v == "unknown"
]

@@ -76,41 +84,10 @@ def convert_single_pdf(
}
)

valid_langs=["en","hi","or"]
valid_langs = ["en", "hi", "or"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider making valid_langs configurable for flexibility.

Instead of hard-coding valid_langs = ["en", "hi", "or"], consider retrieving the valid languages from configuration settings or passing them as a parameter. This approach enhances flexibility and maintainability, allowing for easier updates and broader language support in the future.

Comment on lines +32 to +35
get_text,
detect_language_text,
detect_language_ocr,
keep_most_frequent_element,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused imports to clean up code.

The functions get_text, detect_language_text, detect_language_ocr, and keep_most_frequent_element are imported but not used in this file. Removing these unused imports will improve code readability and maintainability.

Apply this diff to remove the unused imports:

 from marker.ocr.langdetect import (
-    get_text,
-    detect_language_text,
-    detect_language_ocr,
-    keep_most_frequent_element,
     language_detection,
 )
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
get_text,
detect_language_text,
detect_language_ocr,
keep_most_frequent_element,
from marker.ocr.langdetect import (
language_detection,
)
Tools
Ruff

32-32: marker.ocr.langdetect.get_text imported but unused

Remove unused import

(F401)


33-33: marker.ocr.langdetect.detect_language_text imported but unused

Remove unused import

(F401)


34-34: marker.ocr.langdetect.detect_language_ocr imported but unused

Remove unused import

(F401)


35-35: marker.ocr.langdetect.keep_most_frequent_element imported but unused

Remove unused import

(F401)

Comment on lines +108 to +114
doc,
pages,
langs,
ocr_model,
OCR_ALL_PAGES,
languages_meta,
batch_multiplier=batch_multiplier,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Remove the languages_meta parameter from the run_ocr function call in marker/convert.py (lines 108-114) as it is not utilized within the function.

Analysis chain

Verify the necessity of passing both langs and languages_meta to run_ocr.

Both langs and languages_meta are being passed to the run_ocr function. Ensure that both parameters are required and used appropriately within the function. If they serve similar purposes, consider consolidating them to streamline the code and reduce potential confusion.

Run the following script to inspect the run_ocr function's parameters and usage:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition to verify parameter usage.

# Test: Search for the `run_ocr` function definition.
rg --type python -A 5 -B 2 '^def run_ocr\('

# Expected result: The function definition showing how `langs` and `languages_meta` are utilized.

Length of output: 76


Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition without specifying file type to verify parameter usage.

# Search for the `run_ocr` function definition across all files.
rg '^def run_ocr\(' -A 5 -B 2

Length of output: 342

Comment on lines +159 to +174
with tempfile.NamedTemporaryFile(
suffix=".png", delete=False
) as temp_image_file:
image_path = temp_image_file.name
pix.save(image_path)

# language = detect_language_page(image_path)

# results.append(language)

text = extract_text_from_image(image_path)
result = detect_language_text(text)

# Clean up the temporary image file
os.remove(image_path)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure temporary files are deleted even if an exception occurs.

If an exception occurs after the temporary image file is created but before it's deleted, the temporary file may not be cleaned up, causing resource leakage. Use a context manager or a try...finally block to ensure the file is deleted.

Refactor the code to use a context manager:

with tempfile.NamedTemporaryFile(suffix=".png") as temp_image_file:
    image_path = temp_image_file.name
    pix.save(image_path)
    # Process the image
    text = extract_text_from_image(image_path)
    result = detect_language_text(text)
    # The temporary file is automatically deleted when the block exits

@@ -12,33 +12,35 @@
from pdf2image import convert_from_path
import pytesseract

from langdetect import detect, DetectorFactory
from langdetect import detect, DetectorFactory, detect_langs
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused import detect_langs.

The detect_langs function is imported but never used in the code. Removing unused imports helps keep the code clean and maintainable.

Apply this diff to remove the unused import:

-from langdetect import detect, DetectorFactory, detect_langs
+from langdetect import detect, DetectorFactory
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from langdetect import detect, DetectorFactory, detect_langs
from langdetect import detect, DetectorFactory
Tools
Ruff

15-15: langdetect.detect_langs imported but unused

Remove unused import: langdetect.detect_langs

(F401)

Comment on lines 71 to 74
except:
print("Error while detecting language")
language = ""

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid using bare except; specify the exception type.

Using a bare except clause can catch unintended exceptions and make debugging difficult. It's better to specify the exception you intend to catch or use except Exception as e.

Apply this diff to specify the exception and handle it appropriately:

-except:
+except Exception as e:
    print("Error while detecting language")
+   print(e)
    language = ""
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except:
print("Error while detecting language")
language = ""
except Exception as e:
print("Error while detecting language")
print(e)
language = ""
Tools
Ruff

71-71: Do not use bare except

(E722)

@@ -12,33 +12,35 @@
from pdf2image import convert_from_path
import pytesseract

from langdetect import detect, DetectorFactory
from langdetect import detect, DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException
from marker.settings import settings
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove unused import settings.

The settings module from marker.settings is imported but never used. Cleaning up unused imports improves code readability.

Apply this diff to remove the unused import:

-from marker.settings import settings
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from marker.settings import settings
Tools
Ruff

17-17: marker.settings.settings imported but unused

Remove unused import: marker.settings.settings

(F401)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant