
Table Detection and Parser #11

Merged: 1 commit merged into master on Aug 21, 2024

Conversation

TakshPanchal
Collaborator

@TakshPanchal TakshPanchal commented Aug 21, 2024

Summary by CodeRabbit

  • New Features

    • Introduced advanced table detection and processing capabilities in PDF conversion.
    • Added image processing functions for detecting lines and intersections in images.
    • Created a new utility for saving cropped images based on detected table areas.
  • Bug Fixes

    • Improved logic for table statistics and detection accuracy within converted documents.
  • Documentation

    • Enhanced docstrings for newly implemented functions to clarify their purposes and usage.


coderabbitai bot commented Aug 21, 2024

Walkthrough

The recent changes enhance the functionality of the marker module, primarily focusing on improved table detection and processing within PDFs. New image processing techniques have been implemented to identify lines, intersections, and text clusters. Additionally, class structures have been introduced to represent geometric entities, facilitating better organization and manipulation of detected elements. These updates collectively aim to improve the accuracy and efficiency of document analysis processes.

Changes

Files and change summary:
  • marker/convert.py: Modified convert_single_pdf to include table_detection; commented out previous table counting logic.
  • marker/tables/detections.py, marker/tables/intersections.py, marker/tables/schema.py: Introduced new functions for detecting and processing lines and intersections in images; implemented geometric classes.
  • marker/tables/table.py: Added table_detection function for enhanced table extraction from PDFs; modified existing functions to integrate new logic.
  • marker/tables/utils.py: Introduced save_table_image for cropping and saving images based on rectangular coordinates.

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between commits 3d07842 and ee9069f.

Files selected for processing (6)
  • marker/convert.py (4 hunks)
  • marker/tables/detections.py (1 hunks)
  • marker/tables/intersections.py (1 hunks)
  • marker/tables/schema.py (1 hunks)
  • marker/tables/table.py (3 hunks)
  • marker/tables/utils.py (1 hunks)
Additional context used
Ruff
marker/tables/intersections.py

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

marker/tables/detections.py

111-111: Test for object identity should be is not

Convert to is not

(E714)


117-117: Test for object identity should be is not

Convert to is not

(E714)


228-228: Ambiguous variable name: l

(E741)


249-249: Ambiguous variable name: l

(E741)


251-251: Local variable start_point is assigned to but never used

Remove assignment to unused variable start_point

(F841)

marker/tables/table.py

213-213: Ambiguous variable name: l

(E741)


214-214: Ambiguous variable name: l

(E741)


300-300: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


316-316: Ambiguous variable name: l

(E741)
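The three recurring findings above (B007, E714, E741) each have a mechanical fix. A minimal sketch with made-up data (the boxes and h_lines values here are illustrative, not the project's variables):

```python
# Hypothetical data to illustrate the three recurring Ruff fixes.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
h_lines = [(0, 0, 100, 2), (0, 50, 100, 2)]
vertical_lines_width = None

# B007: prefix an unused loop variable with an underscore.
areas = []
for _box_id, box in enumerate(boxes):
    x1, y1, x2, y2 = box
    areas.append((x2 - x1) * (y2 - y1))

# E714: write `x is not None` rather than `not x is None`.
if vertical_lines_width is not None:
    print("overriding line width")

# E741: avoid single-letter names like `l` that read as `1` or `I`.
widths = [line[2] for line in h_lines]

print(areas, widths)
```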

Additional comments not posted (7)
marker/tables/utils.py (1)

39-40: Verify import necessity.

Ensure that the PIL import is necessary for the current implementation. If it's only used for type hinting, consider using from PIL import Image to improve clarity.

Verification successful

Refine the PIL import for clarity.

The PIL import is used solely for type hinting in the save_table_image function. Consider changing the import statement to from PIL import Image for improved clarity.

  • Line 39: Change import PIL to from PIL import Image.
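As a hedged sketch of what the refined import could look like (the box argument and the exact signature of save_table_image are assumptions for illustration, not the file's actual code):

```python
from PIL import Image  # preferred over `import PIL` when only Image is needed

def save_table_image(image: Image.Image, box, out_path: str) -> None:
    """Crop `image` to `box` (left, upper, right, lower) and save it.

    Illustrative sketch; the real function's signature may differ.
    """
    # PIL's crop expects integer pixel coordinates in (left, upper, right, lower) order.
    left, upper, right, lower = (int(c) for c in box)
    image.crop((left, upper, right, lower)).save(out_path)
```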
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the PIL import is necessary beyond type hinting.

# Test: Search for the usage of PIL in the file. Expect: Only type hinting usage.
rg --type python 'PIL' marker/tables/utils.py  # fails: ripgrep's Python type is `py`, not `python`

Length of output: 76


Script:

#!/bin/bash
# Description: Check if the PIL import is necessary beyond type hinting.

# Test: Search for the usage of PIL in the file.
rg 'PIL' marker/tables/utils.py

Length of output: 131

marker/convert.py (1)

158-158: Verify integration of table_detection.

Ensure that the table_detection function is correctly integrated and that its parameters are appropriate for the intended functionality.

Verification successful

Integration of table_detection is correct.

The table_detection function is properly defined and integrated in the codebase. It is defined in marker/tables/table.py and used correctly in marker/convert.py with the appropriate parameters.

  • Function Definition: marker/tables/table.py
  • Function Usage: marker/convert.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of `table_detection` in the codebase.

# Test: Search for the `table_detection` function definition and its usage.
ast-grep --lang python --pattern $'def table_detection($_, $_, $_):'

Length of output: 69


Script:

#!/bin/bash
# Search for any definition of the `table_detection` function in the codebase.
rg 'def table_detection' --glob '*.py' -A 5

# Search for all usages of `table_detection` in the codebase.
rg 'table_detection' --glob '*.py' -A 2

Length of output: 1005

marker/tables/table.py (5)

1-22: Imports look good.

The imports are necessary for the new functionality and image processing techniques.


Line range hint 24-48: Function get_table_surya looks good.

The logic for processing table data using bounding boxes and spans is well-implemented.


Line range hint 51-97: Function get_table_pdftext looks good.

The logic for processing table data using character blocks and spans is well-implemented.


Line range hint 100-176: Function format_tables looks good.

The logic for formatting tables into GitHub-flavored markdown is well-implemented.


278-283: Function ocr_img looks good.

The OCR implementation using Tesseract is well-configured.



Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
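A minimal .coderabbit.yaml sketch; only the schema comment comes from the note above, and the reviews.profile key is an assumption inferred from the "Review profile: CHILL" shown earlier in this review:

```yaml
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
reviews:
  profile: "chill"  # assumed key; matches the review profile used for this PR
```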

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Gautam-Rajeev Gautam-Rajeev merged commit bbf24fd into master Aug 21, 2024

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Outside diff range, codebase verification and nitpick comments (5)
marker/tables/utils.py (1)

46-49: Remove or clarify commented-out code.

The commented-out lines suggest functionality that is not currently implemented. If these lines are not needed, consider removing them to improve code clarity. If they are placeholders for future work, add a comment explaining their purpose.

-    # box = [int(coord) for coord in box.tolist()]
-    # Crop format: (left, upper, right, lower)
-    # left, upper, right, lower = box
marker/tables/schema.py (1)

51-52: Remove or clarify commented-out code.

The commented-out __str__ method in the Rectangle class should be removed if not needed, or uncommented if it serves a purpose. Consider adding a comment to explain its intended use.

-    # def __str__(self) -> str:
-    #    return f"x: {self.x}, y: {self.y}, width:{self.width}, height:{self.height}"
marker/tables/intersections.py (2)

Line range hint 228-228: Avoid ambiguous variable names.

The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.

-    for l in h_lines:
+    for line in h_lines:

-    for l in v_lines:
+    for line in v_lines:

Also applies to: 249-249

Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)


Line range hint 251-251: Remove unused variable start_point.

The variable start_point is assigned but never used. Consider removing it to clean up the code.

-        start_point = x - LEN_THRESHOLD, y - line_width
Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

marker/convert.py (1)

171-172: Clarify or remove commented-out code.

The commented-out lines related to format_tables suggest a change in strategy. If this logic is no longer needed, consider removing it. Otherwise, add comments to explain why it's commented out.

-    # table_count = format_tables(pages)
-    # out_meta["block_stats"]["table"] = table_count

Comment on lines +32 to +35
def fromCoords(x1, y1, x2, y2):
    return Rectangle(x1, y1, x2 - x1, y2 - y1)

def fromPoints(top_left_point: Point, bottom_right_point: Point):

Consider making fromCoords and fromPoints class methods.

The fromCoords and fromPoints methods could be converted to class methods using the @classmethod decorator. This would make it clear that they are alternative constructors for the Rectangle class.

+    @classmethod
     def fromCoords(cls, x1, y1, x2, y2):
         return cls(x1, y1, x2 - x1, y2 - y1)

+    @classmethod
     def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
         return cls(
             top_left_point.x,
             top_left_point.y,
             abs(top_left_point.x - bottom_right_point.x),
             abs(top_left_point.y - bottom_right_point.y),
         )
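A self-contained sketch of the suggested classmethod constructors (the Point and Rectangle fields are assumed from the diff above; the project's schema may carry more state):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Rectangle:
    x: float
    y: float
    width: float
    height: float

    @classmethod
    def fromCoords(cls, x1, y1, x2, y2):
        # Alternative constructor: corner coordinates -> origin + size.
        return cls(x1, y1, x2 - x1, y2 - y1)

    @classmethod
    def fromPoints(cls, top_left_point: Point, bottom_right_point: Point):
        # Alternative constructor: two corner points -> origin + size.
        return cls(
            top_left_point.x,
            top_left_point.y,
            abs(top_left_point.x - bottom_right_point.x),
            abs(top_left_point.y - bottom_right_point.y),
        )

r1 = Rectangle.fromCoords(10, 20, 110, 70)
r2 = Rectangle.fromPoints(Point(10, 20), Point(110, 70))
print(r1, r2)
```

With @classmethod, both constructors work on subclasses too, since they build `cls` rather than hard-coding Rectangle.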

new_cells = []
GAP = 10

for box_id, box in enumerate(boxes):

Use is not for object identity checks.

The conditions should use is not for checking object identity.

-    if not vertical_lines_width is None:
+    if vertical_lines_width is not None:

-    if not horizontal_lines_height is None:
+    if horizontal_lines_height is not None:

Also applies to: 117-117

Tools
Ruff

64-64: Loop control variable box_id not used within loop body

Rename unused box_id to _box_id

(B007)

original_vertical_lines = vertical_lines.copy()

# Modify the dimensions for intersection checking if specified
if not vertical_lines_width is None:

Tools
Ruff

111-111: Test for object identity should be is not

Convert to is not

(E714)

img_h, img_w, _ = img.shape
checking_lines_left = []
checking_lines_right = []
for l in h_lines:

Avoid ambiguous variable names.

The variable name l is ambiguous and can be confused with the number 1 or the letter I. Consider renaming it to something more descriptive.

-    for l in h_lines:
+    for line in h_lines:

-    for l in v_lines:
+    for line in v_lines:

Also applies to: 249-249

Tools
Ruff

228-228: Ambiguous variable name: l

(E741)

checking_lines_bottom = []
for l in v_lines:
    x, y, w, h = l.tolist()
    start_point = x - LEN_THRESHOLD, y - line_width

Remove unused variable start_point.

The variable start_point is assigned but never used. Consider removing it to clean up the code.

-        start_point = x - LEN_THRESHOLD, y - line_width
Tools
Ruff

251-251: Local variable start_point is assigned to but never used

Remove assignment to unused variable start_point

(F841)

Comment on lines +270 to +275
# TODO: Add page check if page_num is valid
page = pdf.load_page(page_num)
pix = page.get_pixmap(dpi=180)
image = PIL.Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
return image

Reminder: Add page validation.

The TODO comment indicates that page validation is missing.

Do you want me to implement the page validation logic or open a GitHub issue to track this task?

Comment on lines +286 to +352
image = get_page(pdf, page_num)

encodings = encoder(image, return_tensors="pt")

with torch.no_grad():
    outputs = detector(**encodings)
width, height = image.size
target_size = [(height, width)]
results = encoder.post_process_object_detection(
    outputs, threshold=THRESHOLD, target_sizes=target_size
)[0]

if len(results["boxes"]) > 0:
    for idx, ex_box in enumerate(results["boxes"]):
        box = Rectangle.fromCoords(*ex_box)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as temp_file:
            image_path = temp_file.name
        save_table_image(image, box, image_path)
        image = PIL.Image.open(image_path)
        img_cv2 = cv2.imread(image_path)
        ocr_data = ocr_img(image)

        horizontal_lines_bbox, vertical_line_bboxes = detect_borderlines(
            image_path
        )
        text_line_bbox, text_line_height = detect_horizontal_textlines(
            ocr_data, image
        )

        text_line_bbox = [l for l in text_line_bbox if l.width > l.height]

        _, _, _, new_horizontal_lines_bbox = filter_non_intersecting_lines(
            text_line_bbox,
            horizontal_lines_bbox,
            horizontal_lines_height=text_line_height / 4,
        )

        horizontal_lines_bbox_clustered = cluster_horizontal_lines(
            new_horizontal_lines_bbox, text_line_height
        )
        vertical_line_bboxes_clustered = cluster_vertical_lines(
            vertical_line_bboxes, text_line_height
        )

        horizontal_lines_bbox_clustered_filt, _, _, _ = (
            filter_non_intersecting_lines(
                horizontal_lines_bbox_clustered,
                vertical_line_bboxes_clustered,
                vertical_lines_width=text_line_height / 8,
            )
        )

        h_lines, v_lines = extend_lines(
            img_cv2,
            horizontal_lines_bbox_clustered_filt,
            vertical_line_bboxes_clustered,
            text_line_height / 8,
        )
        for h in h_lines:
            h.draw(img_cv2)

        for v in v_lines:
            v.draw(img_cv2)

        cv2.imwrite(f"out/pg_{page_num}.png", img_cv2)
        return h_lines, v_lines, ocr_data, ex_box, image_path

Consider renaming ambiguous variable names and addressing unused loop variable.

The function extract_table is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity. Also, rename the unused loop variable idx to _.

- for idx, ex_box in enumerate(results["boxes"]):
+ for _idx, ex_box in enumerate(results["boxes"]):

- text_line_bbox = [l for l in text_line_bbox if l.width > l.height]
+ text_line_bbox = [line for line in text_line_bbox if line.width > line.height]
Tools
Ruff

300-300: Loop control variable idx not used within loop body

Rename unused idx to _idx

(B007)


316-316: Ambiguous variable name: l

(E741)

Comment on lines +178 to +266
from transformers import (
    TableTransformerForObjectDetection,
    DetrFeatureExtractor
)

# Model setup
THRESHOLD = 0.8
table_detect_model = TableTransformerForObjectDetection.from_pretrained(
    "microsoft/table-transformer-detection"
)
feature_extractor = DetrFeatureExtractor()

doc = fitz.open(filename)
length = len(doc)
if max_pages:
    length = min(length, max_pages)

for i_pg in range(length):
    extracted_data = extract_table(doc, i_pg, feature_extractor, table_detect_model, THRESHOLD=THRESHOLD)
    if extracted_data is None:
        continue

    (
        h_lines,
        v_lines,
        ocr_data,
        table_bbox,
        table_img_path,
    ) = extracted_data
    img = cv2.imread(table_img_path)

    print("H Lines: ", len(h_lines))
    print("V Lines: ", len(v_lines))  # label fixed: was "H Lines" for the vertical count

    h_lines.sort(key=lambda l: l.y)
    v_lines.sort(key=lambda l: l.x)
    rowwise_intersections = detect_rowwise_intersection(h_lines, v_lines)
    for p_r in rowwise_intersections:
        for p in p_r:
            p.draw(img)
    boxes = detect_boxes(rowwise_intersections)
    cells = get_cells(boxes, h_lines, v_lines)

    words_original = [
        {"x": x, "y": y, "width": width, "height": height, "text": text}
        for x, y, width, height, text in zip(
            ocr_data["left"],
            ocr_data["top"],
            ocr_data["width"],
            ocr_data["height"],
            ocr_data["text"],
        )
        if text.strip()
    ]
    fill_text_in_cells(words_original, cells, img)
    # row wise cell segregation
    rows = defaultdict(dict)
    for cell in cells:
        rows[cell.r][cell.c] = cell.text

    table_df = pd.DataFrame.from_dict(rows, orient="index")
    if table_df.empty:
        print("The DataFrame is empty")
        continue
    table_df = table_df.drop_duplicates(keep='last')
    table_df.columns = table_df.iloc[0]
    table_df = table_df[1:].reset_index(drop=True)
    table_df.to_csv(f"{i_pg}_table.csv", index=False)
    md = table_df.to_markdown(index=False)

    table_block = Block(
        bbox=table_bbox,
        block_type="Table",
        pnum=i_pg,
        lines=[Line(
            bbox=table_bbox,
            spans=[Span(
                bbox=table_bbox,
                span_id=f"{i_pg}_table",
                font="Table",
                font_size=0,
                font_weight=0,
                block_type="TABLE",
                text=md
            )]
        )]
    )
    pages[i_pg].blocks.append(table_block)

Consider renaming ambiguous variable names.

The function table_detection is well-implemented, but consider renaming the variable l to something more descriptive to avoid ambiguity.

- h_lines.sort(key=lambda l: l.y)
- v_lines.sort(key=lambda l: l.x)
+ h_lines.sort(key=lambda line: line.y)
+ v_lines.sort(key=lambda line: line.x)
Tools
Ruff

213-213: Ambiguous variable name: l

(E741)


214-214: Ambiguous variable name: l

(E741)
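The "row wise cell segregation" step in table_detection can be illustrated in isolation (Cell here is a minimal stand-in for the project's cell objects, and the sample values are invented):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Cell:
    # Minimal stand-in: row index, column index, and the OCR text in the cell.
    r: int
    c: int
    text: str

cells = [
    Cell(0, 0, "Name"), Cell(0, 1, "Qty"),
    Cell(1, 0, "Bolt"), Cell(1, 1, "40"),
    Cell(1, 0, "Bolt (rev)"),  # a later write to the same slot wins, as in the original loop
]

# Row-wise segregation: rows[r][c] = text, exactly as in table_detection.
rows = defaultdict(dict)
for cell in cells:
    rows[cell.r][cell.c] = cell.text

print(dict(rows))
```

The resulting nested dict is what pd.DataFrame.from_dict(rows, orient="index") consumes before the first row is promoted to the header.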
