[BUG] Fatal parse error analysing document using invoice model - "not recognized as a valid DateTime" #27137

Closed
kweebtronic opened this issue Feb 22, 2022 · 9 comments
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Cognitive - Form Recognizer customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Service This issue points to a problem in the service.

Comments

@kweebtronic

kweebtronic commented Feb 22, 2022

Library name and version

Azure.AI.FormRecognizer 4.0.0-beta.3

Describe the bug

Send a document to the analyser for 'invoice' processing.
Service responds without error but SDK throws exception due to parsing issue.

Expected behavior

The SDK should return the analysed document information, making a best effort at recognising data types.
This should be a TRY parse; one dubious value should not fail everything.
The analysis model should be flexible enough to return values as plain text if they are only 'date-ish' or 'number-ish'.
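
To illustrate the "try parse" behaviour requested above, here is a minimal sketch using only the .NET base class library. It is not the SDK's implementation; the `LenientDate` class and `TryParseDate` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Globalization;

public static class LenientDate
{
    // Hypothetical helper: returns the parsed date, or null when the text
    // is not a complete date (for example "yyyy-08-21"), instead of throwing.
    public static DateTimeOffset? TryParseDate(string value)
    {
        return DateTimeOffset.TryParse(
            value,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal,
            out DateTimeOffset parsed)
            ? parsed
            : (DateTimeOffset?)null;
    }
}
```

With a helper like this, a caller can fall back to the raw text of a field when the typed value is unavailable, rather than losing the whole result.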

Actual behavior

SDK throws System.FormatException:

The string 'yyyy-08-21' was not recognized as a valid DateTime. There is an unknown word starting at index '0'.
   at System.DateTimeParse.Parse(ReadOnlySpan`1 s, DateTimeFormatInfo dtfi, DateTimeStyles styles, TimeSpan& offset)
   at System.DateTimeOffset.Parse(String input, IFormatProvider formatProvider, DateTimeStyles styles)
   at Azure.Core.TypeFormatters.ParseDateTimeOffset(String value, String format)
   at Azure.AI.FormRecognizer.DocumentAnalysis.DocumentField.DeserializeDocumentField(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.DocumentField.DeserializeDocumentField(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.DocumentField.DeserializeDocumentField(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.AnalyzedDocument.DeserializeAnalyzedDocument(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.AnalyzeResult.DeserializeAnalyzeResult(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.AnalyzeResultOperation.DeserializeAnalyzeResultOperation(JsonElement element)
   at Azure.AI.FormRecognizer.DocumentAnalysis.DocumentAnalysisRestClient.<GetAnalyzeDocumentResultAsync>d__11.MoveNext()

Reproduction Steps

Submitting financial document so won't provide the source data, but the stack trace should be sufficient to trace the root cause... this is a vanilla call to the client

            try
            {
                var apiResponse = await _documentAnalysisClient.StartAnalyzeDocumentFromUriAsync("prebuilt-invoice", uri);

                await apiResponse.WaitForCompletionAsync();

                return apiResponse.Value;
            }
            catch (Exception e)
            {
                log.LogError($"{e.GetType()}\n{e.Message}\n{e.StackTrace}");
            }

Environment

.NET SDK (reflecting any global.json):
Version: 6.0.200
Commit: 4c30de7899

Runtime Environment:
OS Name: Windows
OS Version: 10.0.22000
OS Platform: Windows
RID: win10-x64
Base Path: C:\Program Files\dotnet\sdk\6.0.200\

Host (useful for support):
Version: 6.0.2
Commit: 839cdfb0ec

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Feb 22, 2022
@azure-sdk azure-sdk added Client This issue points to a problem in the data-plane of the library. Cognitive - Form Recognizer needs-team-triage Workflow: This issue needs the team to triage. labels Feb 22, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 22, 2022
@jsquire jsquire added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-team-triage Workflow: This issue needs the team to triage. labels Feb 22, 2022
@jsquire
Member

jsquire commented Feb 22, 2022

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@kinelski
Member

Hello @kweebtronic.

Thank you for bringing this matter to our attention. This issue has been reported by other customers in the past, and we are currently considering the possibility of adding Try methods as you suggested (here's the GitHub issue tracking this work: #24596).

For the time being, have you tried accessing the DocumentField.Content property? It contains the text of the field in string format so you can try parsing it in your code.

@kinelski kinelski added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Feb 22, 2022
@kweebtronic
Author

To access DocumentField.Content I need to navigate sub-properties of the response class (response.Value.Documents[].Fields[])

Am I supposed to write some sort of try-catch logic for resolving the API response classes?
Doesn't seem like a robust API if that's the case

Still fails:

                var apiResponse = await _documentAnalysisClient.StartAnalyzeDocumentFromUriAsync("prebuilt-invoice", uri);

                await apiResponse.WaitForCompletionAsync();

                foreach (var (fieldName, field) in apiResponse.Value.Documents[0].Fields)
                {
                    Console.WriteLine($"{fieldName}: {field.Content}");
                }

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Feb 22, 2022
@kinelski
Member

@kweebtronic Now I see I misunderstood the issue, so please disregard my previous reply. The behavior you're describing seems to be a bug.

From my understanding, the service is returning a document field with "type": "Date" and "valueDate": "yyyy-08-21". Is that correct?

@kinelski kinelski added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Feb 23, 2022
@kweebtronic
Author

kweebtronic commented Feb 24, 2022

@kinelski yes, traced the HTTP response and yes for some reason the API has returned that as a date value.
The SDK should be robust enough to still provide me a value response without falling over on this one dubious item though

Extract of detected invoice item (from HTTP trace of SDK call to API - request still failed to return value to my class):

                     {
                        "type": "object",
                        "valueObject":
                        {
                           "Amount":
                           {
                              "type": "currency",
                              "valueCurrency":
                              {
                                 "amount": 233.73
                              },
                              "content": "233.73",
                              "boundingRegions": [
                                 {
                                    "pageNumber": 3,
                                    "boundingBox": [6.6843, 5.4088, 7.0591, 5.4088, 7.0591, 5.4989, 6.6843, 5.4989]
                                 }
                              ],
                              "confidence": 0.954,
                              "spans": [
                                 {
                                    "offset": 6305,
                                    "length": 6
                                 }
                              ]
                           },
                           "Date":
                           {
                              "type": "date",
                              "valueDate": "yyyy-08-21",
                              "content": "Aug 21",
                              "boundingRegions": [
                                 {
                                    "pageNumber": 3,
                                    "boundingBox": [0.9255, 5.4069, 1.2953, 5.4069, 1.2953, 5.524, 0.9255, 5.524]
                                 }
                              ],
                              "confidence": 0.946,
                              "spans": [
                                 {
                                    "offset": 6269,
                                    "length": 6
                                 }
                              ]
                           },
                           "Description":
                           {
                              "type": "string",
                              "valueString": "HILLS FRESH",
                              "content": "HILLS FRESH",
                              "boundingRegions": [
                                 {
                                    "pageNumber": 3,
                                    "boundingBox": [1.573, 5.3962, 2.4617, 5.3962, 2.4617, 5.4982, 1.573, 5.4982]
                                 }
                              ],
                              "confidence": 0.955,
                              "spans": [
                                 {
                                    "offset": 6276,
                                    "length": 11
                                 }
                              ]
                           }
                        },
                        "content": "14 Aug 21 HILLS FRESH MUNDARING DC AUS 233.73",
                        "boundingRegions": [
                           {
                              "pageNumber": 3,
                              "boundingBox": [0.7626, 5.3962, 7.0591, 5.3962, 7.0591, 5.524, 0.7626, 5.524]
                           }
                        ],
                        "confidence": 0.964,
                        "spans": [
                           {
                              "offset": 6266,
                              "length": 45
                           }
                        ]
                     }
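
For anyone inspecting a captured response like the extract above, the following sketch walks the JSON with `System.Text.Json` and lists every `valueDate` string that is not a complete `yyyy-MM-dd` date. It is a diagnostic aid only, not part of the SDK; `ValueDateScanner` and `FindUnparseableDates` are hypothetical names, and the assumption that complete dates use the `yyyy-MM-dd` shape is inferred from the trace.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class ValueDateScanner
{
    // Recursively collects "valueDate" strings that are not complete yyyy-MM-dd dates.
    public static List<string> FindUnparseableDates(JsonElement element)
    {
        var found = new List<string>();
        Collect(element, found);
        return found;
    }

    private static void Collect(JsonElement element, List<string> found)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty property in element.EnumerateObject())
                {
                    if (property.NameEquals("valueDate") &&
                        property.Value.ValueKind == JsonValueKind.String)
                    {
                        string text = property.Value.GetString() ?? string.Empty;
                        bool complete = DateTime.TryParseExact(
                            text, "yyyy-MM-dd", CultureInfo.InvariantCulture,
                            DateTimeStyles.None, out _);
                        if (!complete)
                        {
                            found.Add(text);
                        }
                    }
                    else
                    {
                        // Descend into nested objects such as "valueObject".
                        Collect(property.Value, found);
                    }
                }
                break;

            case JsonValueKind.Array:
                foreach (JsonElement item in element.EnumerateArray())
                {
                    Collect(item, found);
                }
                break;
        }
    }
}
```

Usage would look like `using JsonDocument doc = JsonDocument.Parse(json);` followed by `ValueDateScanner.FindUnparseableDates(doc.RootElement)`, where `json` is the traced response body.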

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Feb 24, 2022
@kinelski
Member

kinelski commented Mar 10, 2022

@kweebtronic Apologies for the delayed response. I have discussed this matter with the service team and confirmed it's a bug. They already have a fix but deployment is expected to take around two weeks, so I'll get back to you when that happens.

Once the fix is in place, you won't be able to read the date value of such a field with DocumentField.AsDate as our samples suggest. This only affects "incomplete" dates that the SDK can't parse, such as "yyyy-08-21". In these cases, you'll need to read the string representation of the date from DocumentField.Content and parse it in your code if necessary.
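
As a rough sketch of parsing the `Content` text in caller code, the helper below tries a short list of formats with `DateTime.TryParseExact`. The format list is an assumption based on the values seen in this thread ("Aug 21" for the field and "14 Aug 21" for the whole row), and `ContentDateParser` is a hypothetical name. Note that "Aug 21" is ambiguous: in the traced row, "21" may well be the year rather than the day, so real code should decide which reading applies to its documents.

```csharp
using System;
using System.Globalization;

public static class ContentDateParser
{
    // Assumed formats, tried in order: "14 Aug 21" (day, month, two-digit year),
    // then "Aug 21" read as month and day (the current year is filled in).
    private static readonly string[] AssumedFormats = { "d MMM yy", "MMM d" };

    // Returns false instead of throwing when the text matches none of the formats.
    public static bool TryParseContent(string content, out DateTime date)
    {
        return DateTime.TryParseExact(
            content,
            AssumedFormats,
            CultureInfo.InvariantCulture,
            DateTimeStyles.AllowWhiteSpaces,
            out date);
    }
}
```

A caller would pass `field.Content` to `TryParseContent` and keep the raw string when it returns false.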

If you're blocked by this bug and need a fix asap, you could use an HTTP policy to intercept the service response and manually remove the dates causing the bug:

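using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using Azure.Core;
using Azure.Core.Pipeline;

internal class DateFixPolicy : HttpPipelineSynchronousPolicy
{
    public override void OnReceivedResponse(HttpMessage message)
    {
        if (message.Response.ContentStream != null)
        {
            // Read the buffered response body as UTF-8 text.
            byte[] bytes = message.Response.Content.ToArray();
            string content = Encoding.UTF8.GetString(bytes);

            // Null out incomplete dates such as "yyyy-08-21". The \s* parts tolerate
            // whitespace around the colon in case the body is not compact JSON.
            string modifiedContent = Regex.Replace(content, "\"valueDate\"\\s*:\\s*\"yyyy-[0-9]{2}-[0-9]{2}\"", "\"valueDate\":null");

            // Replace the original stream with one that holds the patched body.
            message.Response.ContentStream.Dispose();
            message.Response.ContentStream = new MemoryStream(Encoding.UTF8.GetBytes(modifiedContent));
        }

        base.OnReceivedResponse(message);
    }
}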

You need to set it in the client options like this:

var options = new DocumentAnalysisClientOptions();
options.AddPolicy(new DateFixPolicy(), HttpPipelinePosition.PerCall);

var client = new DocumentAnalysisClient(<endpoint>, <credential>, options);
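
The text substitution that the policy performs can be tried in isolation, without the Azure SDK. The sketch below uses a variant of the pattern that also tolerates whitespace around the colon; whether the service response contains such whitespace is an assumption, and `ValueDateScrubber` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

public static class ValueDateScrubber
{
    // Matches "valueDate" entries whose year is the literal placeholder "yyyy".
    private static readonly Regex IncompleteDate = new Regex(
        "\"valueDate\"\\s*:\\s*\"yyyy-[0-9]{2}-[0-9]{2}\"",
        RegexOptions.CultureInvariant);

    // Replaces each incomplete date with a JSON null and leaves complete dates untouched.
    public static string Scrub(string json)
    {
        return IncompleteDate.Replace(json, "\"valueDate\":null");
    }
}
```

Because only the `valueDate` entry is rewritten, the field's `content` string (for example "Aug 21") remains available in the response.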

@kinelski kinelski added bug This issue requires a change to an existing behavior in the product in order to be resolved. and removed question The issue doesn't require a change to the product in order to be resolved. Most issues start as that needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Mar 10, 2022
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 10, 2022
@kinelski kinelski added Service This issue points to a problem in the service. and removed Client This issue points to a problem in the data-plane of the library. labels Mar 10, 2022
@kinelski kinelski removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 10, 2022
@kinelski
Member

@kweebtronic The fix has been deployed. Could you confirm if you can still repro the bug?

@kinelski kinelski added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Mar 16, 2022
@kweebtronic
Author

Thanks @kinelski , I am now able to retrieve the document analysis value with the SDK, without error

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Mar 17, 2022
@kinelski
Member

That's great! Feel free to open another issue if you have any other issues.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2023