-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detecting language page-wise #18
base: master
Are you sure you want to change the base?
Conversation
WalkthroughWalkthroughThe changes involve a restructuring of the language detection logic within the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Outside diff range and nitpick comments (4)
marker/ocr/langdetect.py (4)
104-104
: Remove unused variablee
or use it in exception handling.The variable
e
is assigned but never used in theexcept
block. If you don't need the exception details, you can removeas e
. If you do, consider using it in your error message.Apply this diff to remove
e
if not needed:-except Exception as e: +except Exception: print("Error while detecting language.") language = "en"Or use
e
in the error message:- print("Error while detecting language.") + print(f"Error while detecting language: {e}")Tools
Ruff
104-104: Local variable
e
is assigned to but never usedRemove assignment to unused variable
e
(F841)
152-152
: Remove unused variablen_pages
.The variable
n_pages
is assigned but never used in the functiondetect_language_ocr_page
. Unused variables can be removed to clean up the code.Apply this diff to remove the unused variable:
-n_pages = pdf_document.page_count
Tools
Ruff
152-152: Local variable
n_pages
is assigned to but never usedRemove assignment to unused variable
n_pages
(F841)
148-180
: Add a docstring todetect_language_ocr_page
function for clarity.Adding a docstring will improve code readability by explaining the purpose, parameters, and return value of the function.
Here's an example docstring:
def detect_language_ocr_page(pdf_path, page_number): """ Detect the language of a specific page in a PDF using OCR. Parameters: pdf_path (str): The path to the PDF file. page_number (int): The zero-based index of the page to process. Returns: str: The detected language code, or 'unknown' if detection fails. """ # Function implementation...Tools
Ruff
152-152: Local variable
n_pages
is assigned to but never usedRemove assignment to unused variable
n_pages
(F841)
201-217
: Add a docstring tolanguage_detection
function for better understanding.Including a docstring helps other developers understand the function's purpose, inputs, and outputs.
Example docstring:
def language_detection(pages: List[Page], pdf_path, valid_langs): """ Detect the languages of pages in a PDF, using preliminary text detection and OCR as a fallback. Parameters: pages (List[Page]): List of Page objects containing preliminary text. pdf_path (str): Path to the PDF file. valid_langs (List[str]): List of valid language codes. Returns: Dict[str, str]: A dictionary mapping page numbers (as strings) to detected language codes. """ # Function implementation...
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- marker/convert.py (3 hunks)
- marker/ocr/langdetect.py (5 hunks)
- marker/ocr/recognition.py (1 hunks)
Additional context used
Ruff
marker/convert.py
32-32:
marker.ocr.langdetect.get_text
imported but unusedRemove unused import
(F401)
33-33:
marker.ocr.langdetect.detect_language_text
imported but unusedRemove unused import
(F401)
34-34:
marker.ocr.langdetect.detect_language_ocr
imported but unusedRemove unused import
(F401)
35-35:
marker.ocr.langdetect.keep_most_frequent_element
imported but unusedRemove unused import
(F401)
marker/ocr/langdetect.py
15-15:
langdetect.detect_langs
imported but unusedRemove unused import:
langdetect.detect_langs
(F401)
17-17:
marker.settings.settings
imported but unusedRemove unused import:
marker.settings.settings
(F401)
71-71: Do not use bare
except
(E722)
104-104: Local variable
e
is assigned to but never usedRemove assignment to unused variable
e
(F841)
152-152: Local variable
n_pages
is assigned to but never usedRemove assignment to unused variable
n_pages
(F841)
Additional comments not posted (2)
marker/ocr/recognition.py (1)
54-54
: Verify Conditional Logic for OCR Page SelectionAfter correcting the definitions of
ocr_langs
andunknown_langs
, please verify that the conditional statement accurately reflects the intended logic for determining which pages require OCR processing:if (ocr_needed or pnum in ocr_langs) and pnum not in unknown_langs:Ensure that this condition correctly identifies pages needing OCR, especially with the updated
unknown_langs
.marker/ocr/langdetect.py (1)
99-102
:⚠️ Potential issueConsider not defaulting to
'en'
on exception indetect_language_page
.In the
except
block, the language is set to"en"
when an error occurs. This might not reflect the actual language and could lead to incorrect processing. Consider setting the language to"unknown"
instead.Apply this diff to set the language to
"unknown"
:print(llm_response) language = json.loads(llm_response)["language"] except Exception as e: print("Error while detecting language.") - language = "en" + language = "unknown"Likely invalid or redundant comment.
ocr_langs = [ | ||
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"] | ||
] | ||
unknown_langs = [ | ||
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"] | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logical Error: Duplicate Definitions of ocr_langs
and unknown_langs
The variables ocr_langs
and unknown_langs
are both assigned the same list comprehension, resulting in identical lists. Given their intended purposes, this seems unintended. Specifically, ocr_langs
should contain page indices where the language is neither "en"
nor "unknown"
, while unknown_langs
should contain indices where the language is "unknown"
.
Apply this diff to correct the definition of unknown_langs
:
ocr_langs = [
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
]
unknown_langs = [
- int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"]
+ int(k) for k, v in languages_meta.items() if v == "unknown"
]
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ocr_langs = [ | |
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"] | |
] | |
unknown_langs = [ | |
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"] | |
] | |
ocr_langs = [ | |
int(k) for k, v in languages_meta.items() if v not in ["en", "unknown"] | |
] | |
unknown_langs = [ | |
int(k) for k, v in languages_meta.items() if v == "unknown" | |
] |
@@ -76,41 +84,10 @@ def convert_single_pdf( | |||
} | |||
) | |||
|
|||
valid_langs=["en","hi","or"] | |||
valid_langs = ["en", "hi", "or"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider making valid_langs
configurable for flexibility.
Instead of hard-coding valid_langs = ["en", "hi", "or"]
, consider retrieving the valid languages from configuration settings or passing them as a parameter. This approach enhances flexibility and maintainability, allowing for easier updates and broader language support in the future.
get_text, | ||
detect_language_text, | ||
detect_language_ocr, | ||
keep_most_frequent_element, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused imports to clean up code.
The functions get_text
, detect_language_text
, detect_language_ocr
, and keep_most_frequent_element
are imported but not used in this file. Removing these unused imports will improve code readability and maintainability.
Apply this diff to remove the unused imports:
from marker.ocr.langdetect import (
- get_text,
- detect_language_text,
- detect_language_ocr,
- keep_most_frequent_element,
language_detection,
)
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
get_text, | |
detect_language_text, | |
detect_language_ocr, | |
keep_most_frequent_element, | |
from marker.ocr.langdetect import ( | |
language_detection, | |
) |
Tools
Ruff
32-32:
marker.ocr.langdetect.get_text
imported but unusedRemove unused import
(F401)
33-33:
marker.ocr.langdetect.detect_language_text
imported but unusedRemove unused import
(F401)
34-34:
marker.ocr.langdetect.detect_language_ocr
imported but unusedRemove unused import
(F401)
35-35:
marker.ocr.langdetect.keep_most_frequent_element
imported but unusedRemove unused import
(F401)
doc, | ||
pages, | ||
langs, | ||
ocr_model, | ||
OCR_ALL_PAGES, | ||
languages_meta, | ||
batch_multiplier=batch_multiplier, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Remove the languages_meta
parameter from the run_ocr
function call in marker/convert.py
(lines 108-114) as it is not utilized within the function.
Analysis chain
Verify the necessity of passing both langs
and languages_meta
to run_ocr
.
Both langs
and languages_meta
are being passed to the run_ocr
function. Ensure that both parameters are required and used appropriately within the function. If they serve similar purposes, consider consolidating them to streamline the code and reduce potential confusion.
Run the following script to inspect the run_ocr
function's parameters and usage:
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Inspect the `run_ocr` function definition to verify parameter usage.
# Test: Search for the `run_ocr` function definition.
rg --type python -A 5 -B 2 '^def run_ocr\('
# Expected result: The function definition showing how `langs` and `languages_meta` are utilized.
Length of output: 76
Script:
#!/bin/bash
# Description: Inspect the `run_ocr` function definition without specifying file type to verify parameter usage.
# Search for the `run_ocr` function definition across all files.
rg '^def run_ocr\(' -A 5 -B 2
Length of output: 342
with tempfile.NamedTemporaryFile( | ||
suffix=".png", delete=False | ||
) as temp_image_file: | ||
image_path = temp_image_file.name | ||
pix.save(image_path) | ||
|
||
# language = detect_language_page(image_path) | ||
|
||
# results.append(language) | ||
|
||
text = extract_text_from_image(image_path) | ||
result = detect_language_text(text) | ||
|
||
# Clean up the temporary image file | ||
os.remove(image_path) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure temporary files are deleted even if an exception occurs.
If an exception occurs after the temporary image file is created but before it's deleted, the temporary file may not be cleaned up, causing resource leakage. Use a context manager or a try...finally
block to ensure the file is deleted.
Refactor the code to use a context manager:
with tempfile.NamedTemporaryFile(suffix=".png") as temp_image_file:
image_path = temp_image_file.name
pix.save(image_path)
# Process the image
text = extract_text_from_image(image_path)
result = detect_language_text(text)
# The temporary file is automatically deleted when the block exits
@@ -12,33 +12,35 @@ | |||
from pdf2image import convert_from_path | |||
import pytesseract | |||
|
|||
from langdetect import detect, DetectorFactory | |||
from langdetect import detect, DetectorFactory, detect_langs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import detect_langs
.
The detect_langs
function is imported but never used in the code. Removing unused imports helps keep the code clean and maintainable.
Apply this diff to remove the unused import:
-from langdetect import detect, DetectorFactory, detect_langs
+from langdetect import detect, DetectorFactory
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from langdetect import detect, DetectorFactory, detect_langs | |
from langdetect import detect, DetectorFactory |
Tools
Ruff
15-15:
langdetect.detect_langs
imported but unusedRemove unused import:
langdetect.detect_langs
(F401)
except: | ||
print("Error while detecting language") | ||
language = "" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using bare except
; specify the exception type.
Using a bare except
clause can catch unintended exceptions and make debugging difficult. It's better to specify the exception you intend to catch or use except Exception as e
.
Apply this diff to specify the exception and handle it appropriately:
-except:
+except Exception as e:
print("Error while detecting language")
+ print(e)
language = ""
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
except: | |
print("Error while detecting language") | |
language = "" | |
except Exception as e: | |
print("Error while detecting language") | |
print(e) | |
language = "" |
Tools
Ruff
71-71: Do not use bare
except
(E722)
@@ -12,33 +12,35 @@ | |||
from pdf2image import convert_from_path | |||
import pytesseract | |||
|
|||
from langdetect import detect, DetectorFactory | |||
from langdetect import detect, DetectorFactory, detect_langs | |||
from langdetect.lang_detect_exception import LangDetectException | |||
from marker.settings import settings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import settings
.
The settings
module from marker.settings
is imported but never used. Cleaning up unused imports improves code readability.
Apply this diff to remove the unused import:
-from marker.settings import settings
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from marker.settings import settings |
Tools
Ruff
17-17:
marker.settings.settings
imported but unusedRemove unused import:
marker.settings.settings
(F401)
Concerning ticket: https://github.com/BharatSahAIyak/pdf-parser/issues/210
Summary by CodeRabbit
New Features
Bug Fixes
Documentation