
Table Detection and Parser #11

Merged: 1 commit merged into master on Aug 21, 2024

Conversation

TakshPanchal
Collaborator

@TakshPanchal TakshPanchal commented Aug 21, 2024

Summary by CodeRabbit

  • New Features

    • Introduced advanced table detection and processing capabilities in PDF conversion.
    • Added image processing functions for detecting lines and intersections in images.
    • Created a new utility for saving cropped images based on detected table areas.
  • Bug Fixes

    • Improved logic for table statistics and detection accuracy within converted documents.
  • Documentation

    • Enhanced docstrings for newly implemented functions to clarify their purposes and usage.


coderabbitai bot commented Aug 21, 2024

Walkthrough

The recent changes enhance the functionality of the marker module, primarily focusing on improved table detection and processing within PDFs. New image processing techniques have been implemented to identify lines, intersections, and text clusters. Additionally, class structures have been introduced to represent geometric entities, facilitating better organization and manipulation of detected elements. These updates collectively aim to improve the accuracy and efficiency of document analysis processes.

Changes

Files and change summary:
  • marker/convert.py: Modified convert_single_pdf to include table_detection; commented out previous table counting logic.
  • marker/tables/detections.py, marker/tables/intersections.py, marker/tables/schema.py: Introduced new functions for detecting and processing lines and intersections in images; implemented geometric classes.
  • marker/tables/table.py: Added table_detection function for enhanced table extraction from PDFs; modified existing functions to integrate new logic.
  • marker/tables/utils.py: Introduced save_table_image for cropping and saving images based on rectangular coordinates.

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between commits 3d07842 and ee9069f.

Files selected for processing (6)
  • marker/convert.py (4 hunks)
  • marker/tables/detections.py (1 hunks)
  • marker/tables/intersections.py (1 hunks)
  • marker/tables/schema.py (1 hunks)
  • marker/tables/table.py (3 hunks)
  • marker/tables/utils.py (1 hunks)
Additional context used
Ruff
marker/tables/intersections.py

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

marker/tables/detections.py

111-111: Test for object identity should be is not

Convert to is not

(E714)


117-117: Test for object identity should be is not

Convert to is not

(E714)


228-228: Ambiguous variable name: l

(E741)


249-249: Ambiguous variable name: l

(E741)


251-251: Local variable start_point is assigned to but never used

Remove assignment to unused variable start_point

(F841)

marker/tables/table.py

213-213: Ambiguous variable name: l

(E741)


214-214: Ambiguous variable name: l

(E741)


300-300: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


316-316: Ambiguous variable name: l

(E741)
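The three recurring findings above (B007, E714, E741) each have a mechanical fix. A minimal sketch with made-up data (the boxes and h_lines values here are illustrative, not the project's variables):

```python
# Hypothetical data to illustrate the three recurring Ruff fixes.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
h_lines = [(0, 0, 100, 2), (0, 50, 100, 2)]
vertical_lines_width = None

# B007: prefix an unused loop variable with an underscore.
areas = []
for _box_id, box in enumerate(boxes):
    x1, y1, x2, y2 = box
    areas.append((x2 - x1) * (y2 - y1))

# E714: write `x is not None` rather than `not x is None`.
if vertical_lines_width is not None:
    print("overriding line width")

# E741: avoid single-letter names like `l` that read as `1` or `I`.
widths = [line[2] for line in h_lines]

print(areas, widths)
```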

Additional comments not posted (7)
marker/tables/utils.py (1)

39-40: Verify import necessity.

Ensure that the PIL import is necessary for the current implementation. If it's only used for type hinting, consider using from PIL import Image to improve clarity.

Verification successful

Refine the PIL import for clarity.

The PIL import is used solely for type hinting in the save_table_image function. Consider changing the import statement to from PIL import Image for improved clarity.

  • Line 39: Change import PIL to from PIL import Image.
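As a hedged sketch of what the refined import could look like (the box argument and the exact signature of save_table_image are assumptions for illustration, not the file's actual code):

```python
from PIL import Image  # preferred over `import PIL` when only Image is needed

def save_table_image(image: Image.Image, box, out_path: str) -> None:
    """Crop `image` to `box` (left, upper, right, lower) and save it.

    Illustrative sketch; the real function's signature may differ.
    """
    # PIL's crop expects integer pixel coordinates in (left, upper, right, lower) order.
    left, upper, right, lower = (int(c) for c in box)
    image.crop((left, upper, right, lower)).save(out_path)
```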
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the PIL import is necessary beyond type hinting.

# Test: Search for the usage of PIL in the file. Expect: Only type hinting usage.
rg --type python 'PIL' marker/tables/utils.py  # fails: ripgrep's Python type is `py`, not `python`

Length of output: 76


Script:

#!/bin/bash
# Description: Check if the PIL import is necessary beyond type hinting.

# Test: Search for the usage of PIL in the file.
rg 'PIL' marker/tables/utils.py

Length of output: 131

marker/convert.py (1)

158-158: Verify integration of table_detection.

Ensure that the table_detection function is correctly integrated and that its parameters are appropriate for the intended functionality.

Verification successful

Integration of table_detection is correct.

The table_detection function is properly defined and integrated in the codebase. It is defined in marker/tables/table.py and used correctly in marker/convert.py with the appropriate parameters.

  • Function Definition: marker/tables/table.py
  • Function Usage: marker/convert.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of `table_detection` in the codebase.

# Test: Search for the `table_detection` function definition and its usage.
ast-grep --lang python --pattern $'def table_detection($_, $_, $_):'

Length of output: 69


Script:

#!/bin/bash
# Search for any definition of the `table_detection` function in the codebase.
rg 'def table_detection' --glob '*.py' -A 5

# Search for all usages of `table_detection` in the codebase.
rg 'table_detection' --glob '*.py' -A 2

Length of output: 1005

marker/tables/table.py (5)

1-22: Imports look good.

The imports are necessary for the new functionality and image processing techniques.


Line range hint 24-48: Function get_table_surya looks good.

The logic for processing table data using bounding boxes and spans is well-implemented.


Line range hint 51-97: Function get_table_pdftext looks good.

The logic for processing table data using character blocks and spans is well-implemented.


Line range hint 100-176: Function format_tables looks good.

The logic for formatting tables into GitHub-flavored markdown is well-implemented.


278-283: Function ocr_img looks good.

The OCR implementation using Tesseract is well-configured.



Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
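A minimal .coderabbit.yaml sketch; only the schema comment comes from the note above, and the reviews.profile key is an assumption inferred from the "Review profile: CHILL" shown earlier in this review:

```yaml
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
reviews:
  profile: "chill"  # assumed key; matches the review profile used for this PR
```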

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Gautam-Rajeev Gautam-Rajeev merged commit bbf24fd into master Aug 21, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Outside diff range, codebase verification and nitpick comments (5)
marker/tables/utils.py (1)

46-49: Remove or clarify commented-out code.

The commented-out lines suggest functionality that is not currently implemented. If these lines are not needed, consider removing them to improve code clarity. If they are placeholders for future work, add a comment explaining their purpose.

-    # box = [int(coord) for coord in box.tolist()]
-    # Crop format: (left, upper, right, lower)
-    # left, upper, right, lower = box
marker/tables/schema.py (1)

51-52: Remove or clarify commented-out code.

The commented-out __str__ method in the Rectangle class should be removed if not needed, or uncommented if it serves a purpose. Consider adding a comment to explain its intended use.

-    # def __str__(self) -> str:
-    #    return f"x: {self.x}, y: {self.y}, width:{self.width}, height:{self.height}"
marker/tables/intersections.py (2)

Line range hint 228-228: Avoid ambiguous variable names.

The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.

-    for l in h_lines:
+    for line in h_lines:

-    for l in v_lines:
+    for line in v_lines:

Also applies to: 249-249

Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)


Line range hint 251-251: Remove unused variable start_point.

The variable start_point is assigned but never used. Consider removing it to clean up the code.

-        start_point = x - LEN_THRESHOLD, y - line_width
Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

marker/convert.py (1)

171-172: Clarify or remove commented-out code.

The commented-out lines related to format_tables suggest a change in strategy. If this logic is no longer needed, consider removing it. Otherwise, add comments to explain why it's commented out.

-    # table_count = format_tables(pages)
-    # out_meta["block_stats"]["table"] = table_count

Comment on lines +32 to +35
def fromCoords(x1, y1, x2, y2):
    return Rectangle(x1, y1, x2 - x1, y2 - y1)

def fromPoints(top_left_point: Point, bottom_right_point: Point):

Consider making fromCoords and fromPoints class methods.

The fromCoords and fromPoints methods could be converted to class methods using the @classmethod decorator. This would make it clear that they are alternative constructors for the Rectangle class.

+    @classmethod
     def fromCoords(cls, x1, y1, x2, y2):
         return cls(x1, y1, x2 - x1, y2 - y1)

+    @classmethod
     def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
         return cls(
             top_left_point.x,
             top_left_point.y,
             abs(top_left_point.x - bottom_right_point.x),
             abs(top_left_point.y - bottom_right_point.y),
         )
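A self-contained sketch of the suggested classmethod constructors (the Point and Rectangle fields are assumed from the diff above; the project's schema may carry more state):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Rectangle:
    x: float
    y: float
    width: float
    height: float

    @classmethod
    def fromCoords(cls, x1, y1, x2, y2):
        # Alternative constructor: corner coordinates -> origin + size.
        return cls(x1, y1, x2 - x1, y2 - y1)

    @classmethod
    def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
        # Alternative constructor: two corner points -> origin + size.
        return cls(
            top_left_point.x,
            top_left_point.y,
            abs(top_left_point.x - bottom_right_point.x),
            abs(top_left_point.y - bottom_right_point.y),
        )

r1 = Rectangle.fromCoords(10, 20, 110, 70)
r2 = Rectangle.fromPoints(Point(10, 20), Point(110, 70))
print(r1, r2)
```

With @classmethod, both constructors work on subclasses too, since they build `cls` rather than hard-coding Rectangle.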

new_cells = []
GAP = 10

for box_id, box in enumerate(boxes):

Use is not for object identity checks.

The conditions should use is not for checking object identity.

-    if not vertical_lines_width is None:
+    if vertical_lines_width is not None:

-    if not horizontal_lines_height is None:
+    if horizontal_lines_height is not None:

Also applies to: 117-117

Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

original_vertical_lines = vertical_lines.copy()

# Modify the dimensions for intersection checking if specified
if not vertical_lines_width is None:

Tools
Ruff

111-111: Test for object identity should be is not

Convert to is not

(E714)

img_h, img_w, _ = img.shape
checking_lines_left = []
checking_lines_right = []
for l in h_lines:

Avoid ambiguous variable names.

The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.

-    for l in h_lines:
+    for line in h_lines:

-    for l in v_lines:
+    for line in v_lines:

Also applies to: 249-249

Tools
Ruff

228-228: Ambiguous variable name: l

(E741)

checking_lines_bottom = []
for l in v_lines:
    x, y, w, h = l.tolist()
    start_point = x - LEN_THRESHOLD, y - line_width

Remove unused variable start_point.

The variable start_point is assigned but never used. Consider removing it to clean up the code.

-        start_point = x - LEN_THRESHOLD, y - line_width
Tools
Ruff

251-251: Local variable start_point is assigned to but never used

Remove assignment to unused variable start_point

(F841)

Comment on lines +270 to +275
# TODO: Add page check if page_num is valid
page = pdf.load_page(page_num)
pix = page.get_pixmap(dpi=180)
image = PIL.Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
return image

Reminder: Add page validation.

The TODO comment indicates that page validation is missing.

Do you want me to implement the page validation logic or open a GitHub issue to track this task?

Comment on lines +286 to +352
image = get_page(pdf, page_num)

encodings = encoder(image, return_tensors="pt")

with torch.no_grad():
    outputs = detector(**encodings)
width, height = image.size
target_size = [(height, width)]
results = encoder.post_process_object_detection(
    outputs, threshold=THRESHOLD, target_sizes=target_size
)[0]

if len(results["boxes"]) > 0:
    for idx, ex_box in enumerate(results["boxes"]):
        box = Rectangle.fromCoords(*ex_box)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
            image_path = temp_file.name
        save_table_image(image, box, image_path)
        image = PIL.Image.open(image_path)
        img_cv2 = cv2.imread(image_path)
        ocr_data = ocr_img(image)

        horizontal_lines_bbox, vertical_line_bboxes = detect_borderlines(
            image_path
        )
        text_line_bbox, text_line_height = detect_horizontal_textlines(
            ocr_data, image
        )

        text_line_bbox = [l for l in text_line_bbox if l.width > l.height]

        _, _, _, new_horizontal_lines_bbox = filter_non_intersecting_lines(
            text_line_bbox,
            horizontal_lines_bbox,
            horizontal_lines_height=text_line_height / 4,
        )

        horizontal_lines_bbox_clustered = cluster_horizontal_lines(
            new_horizontal_lines_bbox, text_line_height
        )
        vertical_line_bboxes_clustered = cluster_vertical_lines(
            vertical_line_bboxes, text_line_height
        )

        horizontal_lines_bbox_clustered_filt, _, _, _ = (
            filter_non_intersecting_lines(
                horizontal_lines_bbox_clustered,
                vertical_line_bboxes_clustered,
                vertical_lines_width=text_line_height / 8,
            )
        )

        h_lines, v_lines = extend_lines(
            img_cv2,
            horizontal_lines_bbox_clustered_filt,
            vertical_line_bboxes_clustered,
            text_line_height / 8,
        )
        for h in h_lines:
            h.draw(img_cv2)

        for v in v_lines:
            v.draw(img_cv2)

        cv2.imwrite(f"out/pg_{page_num}.png", img_cv2)
        return h_lines, v_lines, ocr_data, ex_box, image_path

Consider renaming ambiguous variable names and addressing unused loop variable.

The function extract_table is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity. Also, rename the unused loop variable idx to _.

- for idx, ex_box in enumerate(results["boxes"]):
+ for _idx, ex_box in enumerate(results["boxes"]):

- text_line_bbox = [l for l in text_line_bbox if l.width > l.height]
+ text_line_bbox = [line for line in text_line_bbox if line.width > line.height]
Tools
Ruff

300-300: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


316-316: Ambiguous variable name: l

(E741)

Comment on lines +178 to +266
from transformers import (
    TableTransformerForObjectDetection,
    DetrFeatureExtractor
)

# Model setup
THRESHOLD = 0.8
table_detect_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)
feature_extractor = DetrFeatureExtractor()

doc = fitz.open(filename)
length = len(doc)
if max_pages:
    length = min(length, max_pages)

for i_pg in range(length):
    extracted_data = extract_table(doc, i_pg, feature_extractor, table_detect_model, THRESHOLD=THRESHOLD)
    if extracted_data is None:
        continue

    (
        h_lines,
        v_lines,
        ocr_data,
        table_bbox,
        table_img_path,
    ) = extracted_data
    img = cv2.imread(table_img_path)

    print("H Lines: ", len(h_lines))
    print("V Lines: ", len(v_lines))  # label fixed: was "H Lines" for the vertical count

    h_lines.sort(key=lambda l: l.y)
    v_lines.sort(key=lambda l: l.x)
    rowwise_intersections = detect_rowwise_intersection(h_lines, v_lines)
    for p_r in rowwise_intersections:
        for p in p_r:
            p.draw(img)
    boxes = detect_boxes(rowwise_intersections)
    cells = get_cells(boxes, h_lines, v_lines)

    words_original = [
        {"x": x, "y": y, "width": width, "height": height, "text": text}
        for x, y, width, height, text in zip(
            ocr_data["left"],
            ocr_data["top"],
            ocr_data["width"],
            ocr_data["height"],
            ocr_data["text"],
        )
        if text.strip()
    ]
    fill_text_in_cells(words_original, cells, img)
    # row wise cell segregation
    rows = defaultdict(dict)
    for cell in cells:
        rows[cell.r][cell.c] = cell.text

    table_df = pd.DataFrame.from_dict(rows, orient="index")
    if table_df.empty:
        print("The DataFrame is empty")
        continue
    table_df = table_df.drop_duplicates(keep='last')
    table_df.columns = table_df.iloc[0]
    table_df = table_df[1:].reset_index(drop=True)
    table_df.to_csv(f"{i_pg}_table.csv", index=False)
    md = table_df.to_markdown(index=False)

    table_block = Block(
        bbox=table_bbox,
        block_type="Table",
        pnum=i_pg,
        lines=[Line(
            bbox=table_bbox,
            spans=[Span(
                bbox=table_bbox,
                span_id=f"{i_pg}_table",
                font="Table",
                font_size=0,
                font_weight=0,
                block_type="TABLE",
                text=md
            )]
        )]
    )
    pages[i_pg].blocks.append(table_block)

Consider renaming ambiguous variable names.

The function table_detection is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity.

- h_lines.sort(key=lambda l: l.y)
- v_lines.sort(key=lambda l: l.x)
+ h_lines.sort(key=lambda line: line.y)
+ v_lines.sort(key=lambda line: line.x)
Tools
Ruff

213-213: Ambiguous variable name: l

(E741)


214-214: Ambiguous variable name: l

(E741)
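The "row wise cell segregation" step in table_detection can be illustrated in isolation (Cell here is a minimal stand-in for the project's cell objects, and the sample values are invented):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Cell:
    # Minimal stand-in: row index, column index, and the OCR text in the cell.
    r: int
    c: int
    text: str

cells = [
    Cell(0, 0, "Name"), Cell(0, 1, "Qty"),
    Cell(1, 0, "Bolt"), Cell(1, 1, "40"),
    Cell(1, 0, "Bolt (rev)"),  # a later write to the same slot wins, as in the original loop
]

# Row-wise segregation: rows[r][c] = text, exactly as in table_detection.
rows = defaultdict(dict)
for cell in cells:
    rows[cell.r][cell.c] = cell.text

print(dict(rows))
```

The resulting nested dict is what pd.DataFrame.from_dict(rows, orient="index") consumes before the first row is promoted to the header.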
