Detecting language page-wise #18

rachitavya · 2024-09-24T11:15:24Z

Concerning ticket: https://github.com/BharatSahAIyak/pdf-parser/issues/210

Summary by CodeRabbit

New Features
- Enhanced language detection capabilities for PDFs, including detection for all pages and specific page detection using OCR.
- Introduction of a streamlined language detection process consolidating multiple checks into a single function.
Bug Fixes
- Improved handling of unknown languages and refined criteria for OCR processing.
Documentation
- Updated code formatting for improved readability and consistency.

coderabbitai · 2024-09-24T11:15:31Z

Walkthrough

The changes involve a restructuring of the language detection logic within the convert_single_pdf function, consolidating the process into a new language_detection function. The marker/ocr/langdetect.py file sees enhancements to language detection capabilities, including new functions for processing PDF pages. The run_ocr function in marker/ocr/recognition.py is updated to refine which pages require OCR processing based on language metadata. Overall, the modifications aim to streamline language detection and improve code readability.

Changes

| File | Change Summary |
| --- | --- |
| marker/convert.py | Restructured language detection logic in `convert_single_pdf`, added `languages_meta` parameter, and replaced conditional checks with a call to `language_detection`. |
| marker/ocr/langdetect.py | Enhanced language detection functions, added `detect_language_ocr_page`, modified `detect_language_ocr` to process all pages, and updated string formatting for consistency. |
| marker/ocr/recognition.py | Updated `run_ocr` function to include `languages_meta` and refined criteria for determining pages requiring OCR processing. |

Possibly related PRs

unknown lang bug fix #7: Enhancements to language handling in convert_single_pdf are relevant to the restructuring of language detection logic in the main PR.
Removing all invalid languages #8: Modifications to convert_single_pdf for improved language validation align with the focus on language detection improvements in the main PR.
Table Detection and Parser #11: Changes to convert_single_pdf related to table detection directly connect to the logic and parameters modified in the main PR.

coderabbitai

Actionable comments posted: 8

Outside diff range and nitpick comments (4)

marker/ocr/langdetect.py (4)
104-104: Remove unused variable e or use it in exception handling.

The variable e is assigned but never used in the except block. If you don't need the exception details, you can remove as e. If you do, consider using it in your error message.

Apply this diff to remove e if not needed:
-except Exception as e:
+except Exception:
    print("Error while detecting language.")
    language = "en"
Or use e in the error message:
-    print("Error while detecting language.")
+    print(f"Error while detecting language: {e}")
Tools

Ruff

104-104: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

152-152: Remove unused variable n_pages.

The variable n_pages is assigned but never used in the function detect_language_ocr_page. Unused variables can be removed to clean up the code.

Apply this diff to remove the unused variable:
-n_pages = pdf_document.page_count
Tools

Ruff

152-152: Local variable n_pages is assigned to but never used

Remove assignment to unused variable n_pages

(F841)

148-180: Add a docstring to detect_language_ocr_page function for clarity.

Adding a docstring will improve code readability by explaining the purpose, parameters, and return value of the function.

Here's an example docstring:
def detect_language_ocr_page(pdf_path, page_number):
    """
    Detect the language of a specific page in a PDF using OCR.

    Parameters:
        pdf_path (str): The path to the PDF file.
        page_number (int): The zero-based index of the page to process.

    Returns:
        str: The detected language code, or 'unknown' if detection fails.
    """
    # Function implementation...

201-217: Add a docstring to language_detection function for better understanding.

Including a docstring helps other developers understand the function's purpose, inputs, and outputs.

Example docstring:
def language_detection(pages: List[Page], pdf_path, valid_langs):
    """
    Detect the languages of pages in a PDF, using preliminary text detection and OCR as a fallback.

    Parameters:
        pages (List[Page]): List of Page objects containing preliminary text.
        pdf_path (str): Path to the PDF file.
        valid_langs (List[str]): List of valid language codes.

    Returns:
        Dict[str, str]: A dictionary mapping page numbers (as strings) to detected language codes.
    """
    # Function implementation...
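
The behaviour this docstring describes (text-based detection first, OCR as a fallback, one entry per page) can be sketched as follows. This is only an illustration under assumptions: `detect_from_text` and `detect_with_ocr` are hypothetical callables standing in for the real detectors, and reporting out-of-list results as "unknown" is assumed rather than taken from the PR.

```python
from typing import Callable, Dict, List


def language_detection_sketch(
    page_texts: List[str],
    valid_langs: List[str],
    detect_from_text: Callable[[str], str],
    detect_with_ocr: Callable[[int], str],
) -> Dict[str, str]:
    """Map each page index (as a string) to a language code."""
    languages_meta: Dict[str, str] = {}
    for pnum, text in enumerate(page_texts):
        # Try the cheap text-based detector first (hypothetical callable).
        lang = detect_from_text(text) if text.strip() else "unknown"
        if lang not in valid_langs:
            # Fall back to OCR-based detection for this page only.
            lang = detect_with_ocr(pnum)
        # Assumption: anything still outside valid_langs is reported as "unknown".
        languages_meta[str(pnum)] = lang if lang in valid_langs else "unknown"
    return languages_meta


if __name__ == "__main__":
    # Stand-in detectors, used only to exercise the control flow.
    fake_text = lambda text: "en" if text.isascii() else "xx"
    fake_ocr = lambda pnum: "hi" if pnum == 1 else "fr"
    pages = ["hello", "", "caf\u00e9"]
    print(language_detection_sketch(pages, ["en", "hi", "or"], fake_text, fake_ocr))
```

Keeping the detectors as parameters makes the fallback order testable without a PDF or an OCR engine.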

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5b52a59 and d72ae16.

Files selected for processing (3)

marker/convert.py (3 hunks)
marker/ocr/langdetect.py (5 hunks)
marker/ocr/recognition.py (1 hunks)

Additional context used

Ruff

marker/convert.py

32-32: marker.ocr.langdetect.get_text imported but unused

Remove unused import

(F401)

33-33: marker.ocr.langdetect.detect_language_text imported but unused

Remove unused import

(F401)

34-34: marker.ocr.langdetect.detect_language_ocr imported but unused

Remove unused import

(F401)

35-35: marker.ocr.langdetect.keep_most_frequent_element imported but unused

Remove unused import

(F401)

marker/ocr/langdetect.py

15-15: langdetect.detect_langs imported but unused

Remove unused import: langdetect.detect_langs

(F401)

17-17: marker.settings.settings imported but unused

Remove unused import: marker.settings.settings

(F401)

71-71: Do not use bare except

(E722)

104-104: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

152-152: Local variable n_pages is assigned to but never used

Remove assignment to unused variable n_pages

(F841)

Additional comments not posted (2)

marker/ocr/recognition.py (1)
54-54: Verify Conditional Logic for OCR Page Selection

After correcting the definitions of ocr_langs and unknown_langs, please verify that the conditional statement accurately reflects the intended logic for determining which pages require OCR processing:
if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:
Ensure that this condition correctly identifies pages needing OCR, especially with the updated unknown_langs.
marker/ocr/langdetect.py (1)
99-102: ⚠️ Potential issue

Consider not defaulting to 'en' on exception in detect_language_page.

In the except block, the language is set to "en" when an error occurs. This might not reflect the actual language and could lead to incorrect processing. Consider setting the language to "unknown" instead.

Apply this diff to set the language to "unknown":
        print(llm_response)
        language = json.loads(llm_response)["language"]

    except Exception as e:
        print("Error while detecting language.")
-       language = "en"
+       language = "unknown"
Likely invalid or redundant comment.
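
A minimal sketch of the "unknown" fallback suggested above, assuming the response is a JSON string with a "language" key as in the diff. The helper name `parse_language` is hypothetical and not part of the PR.

```python
import json


def parse_language(llm_response: str, fallback: str = "unknown") -> str:
    """Extract the "language" field from a JSON response, or return the fallback."""
    try:
        language = json.loads(llm_response)["language"]
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        # Report the actual failure instead of silently assuming English.
        print(f"Error while detecting language: {e}")
        return fallback
    return language
```

Returning "unknown" lets later stages decide how to treat the page, whereas defaulting to "en" would hide the failure.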

coderabbitai · 2024-09-24T11:21:20Z

marker/ocr/recognition.py

+    ocr_langs = [
+        int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    ]
+    unknown_langs = [
+        int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    ]


⚠️ Potential issue

Logical Error: Duplicate Definitions of ocr_langs and unknown_langs

The variables ocr_langs and unknown_langs are both assigned the same list comprehension, resulting in identical lists. Given their intended purposes, this seems unintended. Specifically, ocr_langs should contain page indices where the language is neither "en" nor "unknown", while unknown_langs should contain indices where the language is "unknown".

Apply this diff to correct the definition of unknown_langs:

 ocr_langs = [
     int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
 ]
 unknown_langs = [
-    int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+    int(k) for k, v in languages_meta.items() if v == "unknown"
 ]
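
To see how the corrected definitions interact with the page-selection condition quoted in the review, here is a small self-contained sketch. The function name `pages_needing_ocr` and the `ocr_needed_flags` list are assumptions made for illustration. In the PR the per-page `ocr_needed` value is computed inside `run_ocr`.

```python
from typing import Dict, List


def pages_needing_ocr(
    languages_meta: Dict[str, str], ocr_needed_flags: List[bool]
) -> List[int]:
    """Return the page indices that the condition would send to OCR."""
    # Pages whose detected language is neither English nor unknown.
    ocr_langs = [
        int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
    ]
    # Pages whose language could not be determined.
    unknown_langs = [int(k) for k, v in languages_meta.items() if v == "unknown"]

    selected = []
    for pnum, ocr_needed in enumerate(ocr_needed_flags):
        if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:
            selected.append(pnum)
    return selected


if __name__ == "__main__":
    meta = {"0": "en", "1": "hi", "2": "unknown", "3": "en"}
    print(pages_needing_ocr(meta, [False, False, True, True]))
```

With this input the sketch selects pages 1 and 3. With the duplicated definition, `unknown_langs` would equal `ocr_langs`, so page 1 would be skipped and the unknown page 2 would be selected instead.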


coderabbitai · 2024-09-24T11:21:21Z

marker/convert.py

@@ -76,41 +84,10 @@ def convert_single_pdf(
        }
    )

-    valid_langs=["en","hi","or"]
+    valid_langs = ["en", "hi", "or"]


🛠️ Refactor suggestion

Consider making valid_langs configurable for flexibility.

Instead of hard-coding valid_langs = ["en", "hi", "or"], consider retrieving the valid languages from configuration settings or passing them as a parameter. This approach enhances flexibility and maintainability, allowing for easier updates and broader language support in the future.
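
One way this could look, as a sketch only: an explicit argument wins, then an environment variable, then the current default. The helper `resolve_valid_langs` and the `VALID_LANGS` variable name are hypothetical. The project may prefer its existing settings module instead.

```python
import os
from typing import List, Optional

# Default taken from the value currently hard-coded in convert_single_pdf.
DEFAULT_VALID_LANGS = ["en", "hi", "or"]


def resolve_valid_langs(
    valid_langs: Optional[List[str]] = None, env_var: str = "VALID_LANGS"
) -> List[str]:
    """Pick valid language codes from an argument, an env var, or the default."""
    if valid_langs:
        return list(valid_langs)
    # Hypothetical env var holding a comma-separated list such as "en,hi,or".
    raw = os.environ.get(env_var, "")
    from_env = [code.strip() for code in raw.split(",") if code.strip()]
    return from_env or list(DEFAULT_VALID_LANGS)
```

A caller could then pass `resolve_valid_langs()` wherever the hard-coded list is used today, without changing the default behaviour.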

coderabbitai · 2024-09-24T11:21:21Z

marker/convert.py

+    get_text,
+    detect_language_text,
+    detect_language_ocr,
+    keep_most_frequent_element,


⚠️ Potential issue

Remove unused imports to clean up code.

The functions get_text, detect_language_text, detect_language_ocr, and keep_most_frequent_element are imported but not used in this file. Removing these unused imports will improve code readability and maintainability.

Apply this diff to remove the unused imports:

 from marker.ocr.langdetect import (
-    get_text,
-    detect_language_text,
-    detect_language_ocr,
-    keep_most_frequent_element,
     language_detection,
 )


coderabbitai · 2024-09-24T11:21:21Z

marker/convert.py

+        doc,
+        pages,
+        langs,
+        ocr_model,
+        OCR_ALL_PAGES,
+        languages_meta,
+        batch_multiplier=batch_multiplier,


💡 Codebase verification

Remove the languages_meta parameter from the run_ocr function call in marker/convert.py (lines 108-114) as it is not utilized within the function.

Analysis chain

Verify the necessity of passing both langs and languages_meta to run_ocr.

Both langs and languages_meta are being passed to the run_ocr function. Ensure that both parameters are required and used appropriately within the function. If they serve similar purposes, consider consolidating them to streamline the code and reduce potential confusion.

Run the following script to inspect the run_ocr function's parameters and usage:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition to verify parameter usage.

# Test: Search for the `run_ocr` function definition.
rg --type python -A 5 -B 2 '^def run_ocr\('
# Expected result: The function definition showing how `langs` and `languages_meta` are utilized.

Length of output: 76

Script:

#!/bin/bash
# Description: Inspect the `run_ocr` function definition without specifying file type to verify parameter usage.

# Search for the `run_ocr` function definition across all files.
rg '^def run_ocr\(' -A 5 -B 2

Length of output: 342

coderabbitai · 2024-09-24T11:21:21Z

marker/ocr/langdetect.py

+        with tempfile.NamedTemporaryFile(
+            suffix=".png", delete=False
+        ) as temp_image_file:
+            image_path = temp_image_file.name
+            pix.save(image_path)
+
+        # language = detect_language_page(image_path)
+
+        # results.append(language)
+
+        text = extract_text_from_image(image_path)
+        result = detect_language_text(text)
+
+        # Clean up the temporary image file
+        os.remove(image_path)
+


⚠️ Potential issue

Ensure temporary files are deleted even if an exception occurs.

If an exception occurs after the temporary image file is created but before it's deleted, the temporary file may not be cleaned up, causing resource leakage. Use a context manager or a try...finally block to ensure the file is deleted.

Refactor the code to use a context manager:

with tempfile.NamedTemporaryFile(suffix=".png") as temp_image_file:
    image_path = temp_image_file.name
    pix.save(image_path)
    # Process the image
    text = extract_text_from_image(image_path)
    result = detect_language_text(text)
# The temporary file is automatically deleted when the block exits
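
The try...finally alternative mentioned in this comment can be sketched as below. It may be the safer choice when another library has to reopen the file by name, since reopening a still-open `NamedTemporaryFile` is not possible on every platform (notably Windows). The helper `process_via_temp_png` and its two callables are hypothetical stand-ins for `pix.save` and the text-extraction step.

```python
import os
import tempfile
from typing import Callable


def process_via_temp_png(
    save_image: Callable[[str], None], process: Callable[[str], str]
) -> str:
    """Write an image to a temporary PNG, process it, and always clean up."""
    # Create the file and close the handle so other code can reopen it by name.
    fd, image_path = tempfile.mkstemp(suffix=".png")
    os.close(fd)
    try:
        save_image(image_path)
        return process(image_path)
    finally:
        # Runs on success and on failure, so the file is not left behind.
        os.remove(image_path)
```

In the PR's context, `save_image` would correspond to `pix.save` and `process` to extracting text and detecting its language.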

coderabbitai · 2024-09-24T11:21:21Z

marker/ocr/langdetect.py

@@ -12,33 +12,35 @@
 from pdf2image import convert_from_path
 import pytesseract

-from langdetect import detect, DetectorFactory
+from langdetect import detect, DetectorFactory, detect_langs


⚠️ Potential issue

Remove unused import detect_langs.

The detect_langs function is imported but never used in the code. Removing unused imports helps keep the code clean and maintainable.

Apply this diff to remove the unused import:

-from langdetect import detect, DetectorFactory, detect_langs
+from langdetect import detect, DetectorFactory


coderabbitai · 2024-09-24T11:21:21Z

marker/ocr/langdetect.py

    except:
        print("Error while detecting language")
        language = ""



⚠️ Potential issue

Avoid using bare except; specify the exception type.

Using a bare except clause can catch unintended exceptions and make debugging difficult. It's better to specify the exception you intend to catch or use except Exception as e.

Apply this diff to specify the exception and handle it appropriately:

-except:
+except Exception as e:
     print("Error while detecting language")
+    print(e)
     language = ""


coderabbitai · 2024-09-24T11:21:21Z

marker/ocr/langdetect.py

@@ -12,33 +12,35 @@
 from pdf2image import convert_from_path
 import pytesseract

-from langdetect import detect, DetectorFactory
+from langdetect import detect, DetectorFactory, detect_langs
 from langdetect.lang_detect_exception import LangDetectException
 from marker.settings import settings


⚠️ Potential issue

Remove unused import settings.

The settings module from marker.settings is imported but never used. Cleaning up unused imports improves code readability.

Apply this diff to remove the unused import:

-from marker.settings import settings

