UnsupervisedReadingOrder orders 2 blocks on the same row out of order #836

Closed
davebrokit opened this issue May 20, 2024 · 2 comments

@davebrokit
Contributor

davebrokit commented May 20, 2024

The unsupervised reading order detector may return two blocks on the same row in the wrong order. It does not appear to reorder blocks that share a row; instead it keeps the order in which the elements were passed in.

The problem seems to be caused here:

Due to the ordering of the if statements, the code will always select IntervalRelations.PrecedesI or IntervalRelations.Precedes and never reach the other cases.
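
To illustrate the kind of shadowing I mean, here is a minimal self-contained sketch. It is not the PdfPig source: the `Relation` enum, the method names and the exact conditions are hypothetical, chosen only to show how a too-broad second test can make every later branch unreachable.

```csharp
using System;

// Hypothetical, simplified set of interval relations (not PdfPig's IntervalRelations).
public enum Relation { Precedes, PrecedesI, Meets, Overlaps }

public static class BranchOrderDemo
{
    // Assumed buggy shape: the second test is the complement of the first,
    // so for ordinary numbers no later branch can ever run.
    public static Relation ClassifyShadowed(double aStart, double aEnd, double bStart, double bEnd, double t)
    {
        if (aEnd < bStart - t) return Relation.Precedes;
        if (aEnd >= bStart - t) return Relation.PrecedesI; // too broad: catches everything else
        if (Math.Abs(aEnd - bStart) <= t) return Relation.Meets; // never reached
        return Relation.Overlaps;                                // never reached
    }

    // Same branches, but the inverse case only fires when a really starts after b ends.
    public static Relation ClassifyFixed(double aStart, double aEnd, double bStart, double bEnd, double t)
    {
        if (aEnd < bStart - t) return Relation.Precedes;
        if (aStart > bEnd + t) return Relation.PrecedesI;
        if (Math.Abs(aEnd - bStart) <= t) return Relation.Meets;
        return Relation.Overlaps;
    }

    public static void Main()
    {
        // a = [0, 10] and b = [5, 15] overlap on this axis.
        Console.WriteLine(ClassifyShadowed(0, 10, 5, 15, 1)); // prints "PrecedesI"
        Console.WriteLine(ClassifyFixed(0, 10, 5, 15, 1));    // prints "Overlaps"
    }
}
```

In the shadowed version, two intervals that overlap are still reported as one preceding the other, which matches the symptom of same-row blocks never being compared properly.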

I can raise a PR with a fix if this is the case, but I wanted to check with the experts first. I would prefer that you make the change yourselves, as it potentially has a big impact.

Minimum test to check scenario

using System.Collections.Generic;
using System.Linq;
using NUnit.Framework;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Core;

namespace ReadingOrderDetectorTests
{
    public class ReadingOrderDetectorTest
    {
        [Test]
        public void ReadingOrderDoesNotOrderRowContents()
        {
            var letterA = new Letter("a",
                new PdfRectangle(new PdfPoint(0, 0), new PdfPoint(10, 10)),
                new PdfPoint(0, 0),
                new PdfPoint(10, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0); // Remaining arguments are irrelevant to this test
            var leftTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterA }) }) });

            var letterB = new Letter("b",
                new PdfRectangle(new PdfPoint(100, 0), new PdfPoint(110, 10)),
                new PdfPoint(100, 0),
                new PdfPoint(110, 0),
                10, 1, null, TextRenderingMode.NeitherClip, null, null, 0, 0); // Remaining arguments are irrelevant to this test
            var rightTextBlock = new TextBlock(new[] { new TextLine(new[] { new Word(new[] { letterB }) }) });

            // We deliberately submit in the wrong order
            var textBlocks = new List<TextBlock>() { rightTextBlock, leftTextBlock };

            var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(5, UnsupervisedReadingOrderDetector.SpatialReasoningRules.RowWise);
            var orderedBlocks = unsupervisedReadingOrderDetector.Get(textBlocks);

            var ordered = orderedBlocks.OrderBy(x => x.ReadingOrder).ToList();
            Assert.That(ordered[0].BoundingBox.Left, Is.EqualTo(0));
            Assert.That(ordered[1].BoundingBox.Left, Is.EqualTo(100));
        }
    }
}
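
For reference, the behaviour the test expects can be sketched without PdfPig at all. This is only an illustration under stated assumptions, not how the detector is implemented: `Box`, `SameRow` and `Order` are hypothetical names, and "same row" is assumed to mean that the vertical ranges overlap within a tolerance `t`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a block's bounding box; not a PdfPig type.
public record Box(string Name, double Left, double Bottom, double Right, double Top);

public static class RowWiseSketch
{
    // Assumption: two boxes share a row when their vertical ranges overlap, allowing tolerance t.
    private static bool SameRow(Box a, Box b, double t) =>
        a.Bottom <= b.Top + t && b.Bottom <= a.Top + t;

    public static List<Box> Order(IEnumerable<Box> boxes, double t)
    {
        var rows = new List<List<Box>>();

        // PDF y coordinates grow upwards, so a higher Top means earlier on the page.
        foreach (var box in boxes.OrderByDescending(b => b.Top))
        {
            var row = rows.FirstOrDefault(r => r.Any(other => SameRow(box, other, t)));
            if (row == null)
            {
                row = new List<Box>();
                rows.Add(row);
            }
            row.Add(box);
        }

        // Within each row, read left to right regardless of the input order.
        return rows.SelectMany(r => r.OrderBy(b => b.Left)).ToList();
    }

    public static void Main()
    {
        var right = new Box("b", 100, 0, 110, 10);
        var left = new Box("a", 0, 0, 10, 10);

        // Submitted in the wrong order, as in the test above.
        var ordered = Order(new[] { right, left }, 5);
        Console.WriteLine(string.Join(", ", ordered.Select(b => b.Name))); // prints "a, b"
    }
}
```

The point is only that blocks sharing a row should come out sorted by their left edge, whatever order they were passed in.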
@BobLd
Collaborator

BobLd commented May 20, 2024

Hi @davebrokit thanks a lot for that, I haven't checked but what you say seems to make sense. Happy for you to create a PR, I'll review it. If you can add tests, that'd be amazing. Thx!

@davebrokit
Contributor Author

PR raised

BobLd pushed a commit that referenced this issue May 29, 2024
… of order (#841)

* #836 Fix UnsupervisedReadingOrder orders 2 blocks on the same row out of order
Add images for documentation

* Update Documentation: Additional example, Reference to wiki

* Change code formating to C# on documentation

* Fix link in documentation

* Fix Spelling

---------

Co-authored-by: David <David@david>