
onResult words being reset while still listening #552

Open

righteoustales opened this issue Sep 4, 2024 · 69 comments

@righteoustales

Context: Flutter 3.16.2 on iOS (iPhone 12 running 17.6.1), using speech_to_text (6.6.2).

With a listen call set with options as follows:
```dart
SpeechListenOptions options = SpeechListenOptions(
  listenMode: ListenMode.dictation,
  partialResults: true,
  onDevice: true,
);
await _speechToText.listen(
    onResult: _onSpeechResult, listenOptions: options);
```

I am seeing the buffer of words returned via `void _onSpeechResult(SpeechRecognitionResult result)` get reset (all words deleted) before the listen times out. This happens if there is a short pause between words spoken - not a long pause at all, maybe 2 seconds at most.

For example, if I speak "add 1+2+3+4 (brief pause)+5", the words returned up until the pause are "add 1+2+3+4", but after the pause the SpeechRecognitionResult is reset and returns "+5" only.

The listen is active throughout this (i.e. it didn't stop).
I check result.isFinal and it is 'false' for each callback above as well.
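
For reference, a minimal sketch of the kind of handler I'm using (simplified; print stands in for my app's logging):

```dart
import 'package:speech_to_text/speech_recognition_result.dart';

// Logs each partial result. After the brief pause, recognizedWords drops
// from "add 1+2+3+4" to "+5" while finalResult stays false.
void _onSpeechResult(SpeechRecognitionResult result) {
  print('words="${result.recognizedWords}" final=${result.finalResult}');
}
```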

Is this normal? Any idea how to prevent it, or, if preventing it isn't possible, how to recognize when it is occurring so I can code around it?

Thanks in advance.
-Gerald

@sowens-csd
Contributor

Do you get a callback with the final set of results (isFinal: true)? If so, what's the full content of the results at that point? Is the "+5" the only element in the recognition results list? I haven't yet tried to reproduce this, but I haven't seen this behaviour before.

@righteoustales
Author

righteoustales commented Sep 4, 2024

> Do you get a callback with the final set of results (isFinal: true)?

No. But the listen was still active so I didn't expect to.

> If so, what's the full content of the results at that point? Is the "+5" the only element in the recognition results list?

Yes.

> I haven't yet tried to reproduce this, but I haven't seen this behaviour before.

Here's some more info that I think may be helpful. If I change my SpeechListenOptions to specify only onDevice: true, then I see exactly the behavior reported above. But if I change onDevice to false, I can wait longer than 2 minutes and not see this behavior, and that is true even if I turn off the wifi on my laptop. I tested this several times this morning and was able to pause before the "+5" for at least 2 minutes and still not see the returned string reset to empty. I also did not see the listen time out for the duration of the 2+ minutes that I paused before saying "+5". Switching back to onDevice: true causes the problem to recur every time.

How is onDevice different such that it might cause this? I couldn't find a definition of what setting it to true does in the API doc, but I intuited that it would keep voice recognition from calling the cloud. But given that it works with onDevice: false even with no internet connectivity, perhaps I intuited wrongly?

Btw, I both downgraded to 6.5.1 and upgraded to 7.0.0 as part of my testing. All have the same behavior.

@sowens-csd
Contributor

Thanks for the details. I'll try to reproduce and let you know what I find. If you have a chance to try stopping the recognition and finding out what the recognition result is when final is true, that would be interesting.

You're mostly correct about the behaviour of onDevice; I should update the docs to provide more details. With onDevice: true, recognition MUST be done on device and will fail completely if the device cannot do that. When false, it is up to the device to decide; some or most recognition may happen on device, particularly with newer devices, but no guarantees are made.
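
In plugin terms, the difference looks like this:

```dart
import 'package:speech_to_text/speech_to_text.dart';

// onDevice: true -> recognition MUST happen locally; listening fails
// completely if the device can't do it.
final strictlyLocal = SpeechListenOptions(onDevice: true);

// onDevice: false -> the device decides; some or most recognition may
// still happen locally, particularly on newer devices, but there are
// no guarantees.
final deviceDecides = SpeechListenOptions(onDevice: false);
```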

@righteoustales
Author

If I issue stop(), the final result (i.e. when the SpeechRecognitionResult's finalResult == true) is the "+5".

@flutterocks

flutterocks commented Sep 12, 2024

I'm also experiencing the exact same thing. I've been using this plugin for about a month now and only recently started noticing this issue. Maybe something changed on Apple's side? Also on version 6.6.2, on iOS.

I am using something close to the example app, but I keep track of every time isFinal is true and keep a history of the whole sentence, roughly as in the sketch below. For some reason, now when I pause and it clears, isFinal is false.
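
A simplified sketch of that bookkeeping (names are mine, not from the example app):

```dart
import 'package:speech_to_text/speech_recognition_result.dart';

// History of finalized phrases so the whole dictation survives pauses.
final List<String> _history = [];

void _onSpeechResult(SpeechRecognitionResult result) {
  if (result.finalResult) {
    _history.add(result.recognizedWords);
  }
  // Full sentence so far: finalized phrases plus the live partial.
  final sentence =
      [..._history, if (!result.finalResult) result.recognizedWords].join(' ');
  print(sentence);
}
```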

Seems to be an apple issue: https://forums.developer.apple.com/forums/thread/762952
and https://forums.developer.apple.com/forums/thread/731761

@righteoustales
Author

It's unfortunate that the first thread above (../762952) reports that this issue does not occur on iOS 17 but started with 18. That is not what we are seeing. I submitted a comment over there informing them that I am seeing the same behavior on iOS 17.

@flutterocks What version of iOS are you running?

@sowens-csd
Contributor

Thanks @flutterocks, this is helpful. @righteoustales, is this a problem that started for you relatively recently? The 731761 thread implies the issue only happens with on-device recognition; is that what you are seeing?

If it's an iOS issue then I doubt I can do anything useful at the plugin level to help resolve it unless there's been an API change that I missed.

@righteoustales
Author

> Thanks @flutterocks, this is helpful. @righteoustales, is this a problem that started for you relatively recently? The 731761 thread implies the issue only happens with on-device recognition; is that what you are seeing?

That's what I documented above in this thread as well. I only recently started using this Flutter library, so I can't speak to the history of it working or not. I noticed it was broken immediately after setting the on-device flag to true.

@flutterocks

@righteoustales I've recently updated to iOS 18, which is probably why I am only experiencing this now. Though I have onDevice set to false.

Perhaps this 'broken' experience happens when onDevice is true on older iOS versions, and as of iOS 18 it happens in either case. Or maybe for some reason iOS 18 heavily favours on-device recognition regardless of the flag's value.

Regardless of what might be causing this, I agree @sowens-csd, there isn't much this package can do to resolve it. Though I will implement the suggestion from 731761: using the timestamp to help determine if the result is 'final', likely in combination with comparing against the previous result (to prevent accidentally marking it final if there is latency).

Some pseudo code of what I'm thinking:

likelyFinal = (prevResult.recognizedWords.length > currResult.recognizedWords.length) && ((currResult.timeStamp - prevResult.timeStamp) > X)

where I will experiment with X to find what works; it will likely be ~1-2 seconds. I'll implement it this weekend on my end. A fuller sketch follows.
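
Expanded into runnable Dart (note: timeStamp isn't a property of this plugin's SpeechRecognitionResult, so I approximate it here with result arrival times; treat this as a sketch):

```dart
import 'package:speech_to_text/speech_recognition_result.dart';

const resetGap = Duration(seconds: 2); // the "X" above; needs tuning

SpeechRecognitionResult? _prevResult;
DateTime? _prevArrival;

// Heuristic: if the new partial is shorter than the previous one and it
// arrived after a long gap, the engine has probably started a fresh
// utterance rather than reinterpreting the old one.
bool likelyFinal(SpeechRecognitionResult curr) {
  final prev = _prevResult;
  final prevAt = _prevArrival;
  final now = DateTime.now();
  var likely = false;
  if (prev != null && prevAt != null) {
    likely = prev.recognizedWords.length > curr.recognizedWords.length &&
        now.difference(prevAt) > resetGap;
  }
  _prevResult = curr;
  _prevArrival = now;
  return likely;
}
```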

Given this seems to be impacting a lot of users, I could see value in having this directly inside speech_to_text in addition to finalResult, but I'll leave that up to @sowens-csd given the bloat it would introduce.

@righteoustales
Author

I'm disappointed to hear that the workaround of setting the flag to false doesn't even work on iOS 18. @flutterocks, are you the person who reported it over on the Apple forum and to whom I replied? It's a different name there, but I'm sure we all have multiple names that we use spanning various forums over the years.

@flutterocks

@righteoustales Not me, no. I just found the threads from some googling, to see if the issue was Flutter-specific or Apple's.

@sowens-csd
Contributor

@flutterocks so your thinking in that workaround you suggested is that Apple is essentially starting a new recognition? So the goal would be to deliver the previous final results in some way so that the user knows they should be stored and that a new set of results will start? It's an interesting idea. Naively I was hoping that Apple would fix their implementation, but that could of course take a while. One problem is that I've seen some fairly long delays to the final results, and the speech recognition engine will not infrequently reinterpret previous results based on new context, which could result in false positives from that test. Also, it would have to be iOS-specific since the other engines don't have the same failure mode.

I agree that the impact of the failure is fairly large, it would be good to be able to help mitigate it.

@righteoustales
Author

righteoustales commented Sep 13, 2024

@sowens-csd @flutterocks I was going to point out something similar to the "One problem" comment above. It doesn't work to save what was there previously for comparison, as the recognition logic frequently reinterprets the text it first (and second, and third) delivered the more that you speak. For example, if I had said:

"add 347.12 + 1"

You can watch in real time as the first number is recognized as 300, then 347, and so on as the recognition logic processes. Given that, it can become very difficult to use comparison to distinguish between a reinterpretation of everything said so far and the engine simply throwing away all of the preceding text and starting fresh. Have either of you found a way to tell the difference between the two? Does looking at the segment timestamp as proposed above actually work? I don't think that comparing to the previous result is going to work.

This feels like an Apple bug, unless they surface data with the returned results that can reliably be used to prevent the loss of previously spoken text.
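
To make that concrete, here's a contrived sequence of partials (invented values, but representative of what I see):

```dart
// Partials while saying "add 347.12 + 1" with a pause before "+5".
// "add three" -> "add 347" shrinks because of reinterpretation; "+5"
// shrinks because of the reset. Length/prefix comparisons alone can't
// tell the two cases apart.
const partials = [
  'add three',      // first guess at what becomes "347..."
  'add 347',        // reinterpreted, same utterance
  'add 347.12 + 1', // refined again
  '+5',             // reset: everything before the pause is gone
];

void main() {
  for (var i = 1; i < partials.length; i++) {
    final shrank = partials[i].length < partials[i - 1].length;
    print('"${partials[i]}" shrank=$shrank'); // true for rewrite AND reset
  }
}
```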

@flutterocks

Just downloaded Apple's SpokenWord demo mentioned in 762952 and I'm experiencing the same issue, so it's fully confirmed this has to do with Apple.

I inspected the results and here's some interesting findings:

  • isFinal is always false
  • speechRecognitionMetadata is null except for when I take a few seconds pause, aka when I would expect isFinal to be true
  • when the bestTranscription clears, timestamp becomes 0 for the new partial transcription

note: onDevice is true in this demo

There are a few options to explore:

  • infer isFinal from the presence of speechRecognitionMetadata (though I'm not sure if this behaviour is consistent on older iOS versions, or when not onDevice)
  • infer isFinal from succeeding transcriptions having timestamp = 0

In any case, these solutions would likely be temporary until/if Apple fixes the bug. I'll probably wait until iOS 18 officially releases next week to see if this still happens, but @righteoustales, you're seeing this on 17.6.1. Can you download the Apple sample and see if the same behaviour I described above happens?

@righteoustales
Author

Sure.

Without any change, it manifests the problem discussed in this thread. Broken.

With only this change:

[Screenshot 2024-09-13 at 5 40 55 PM: the one-line change in Apple's SpokenWord sample, setting requiresOnDeviceRecognition = false]

it does not. Not broken.

@righteoustales
Author

Just to confirm, are you (@flutterocks) saying that the one-line change I made above does not help at all on iOS 18? I.e., that it drops the text equally whether that flag is set to true or false? And if so, have you also tried testing with network connectivity/wifi completely disabled? Any difference then?

@flutterocks

flutterocks commented Sep 14, 2024

Correct, even with requiresOnDeviceRecognition = false I'm experiencing the text-dropping behaviour, and isFinal is always false. That is the case with wifi connected.

I just tested without wifi and it's the same thing, which is expected given your scenario is on-device.

@righteoustales
Author

Thanks for confirming and trying that additional test.

Btw, I also updated https://developer.apple.com/forums/thread/731761 with my own comments/test experience.

@righteoustales
Author

That Apple forum update I did has not yet been approved for some reason. Slackers. LOL.

I also messed around with setting the task hint between unspecified, dictation, search, and confirmation. None of them help.

@flutterocks

flutterocks commented Sep 14, 2024

@righteoustales are you able to confirm if the following behaves the same for you? (specifically the last two points)

> I inspected the results and here's some interesting findings:
>
>   • isFinal is always false
>   • speechRecognitionMetadata is null except for when I take a few seconds pause, aka when I would expect isFinal to be true
>   • when the bestTranscription clears, timestamp becomes 0 for the new partial transcription

@righteoustales
Author

All three of the above assertions are true for me as well.

@righteoustales
Author

Given how old those two forum questions are on the Apple developer forums, and the complete absence of any acknowledgment from Apple on either, I'm not feeling very hopeful that they will do anything about this. But I don't frequent their forums much. Any experience otherwise that is more hopeful than my conclusion here?

My current plan is to see how things look when iOS 18 is released and decide accordingly. I think I read that that release is imminent, maybe next week.

@flutterocks

flutterocks commented Sep 15, 2024

@righteoustales It's meant to release on the 16th, I believe. I too will wait for that and hope for the best...

Btw, do the speechRecognitionMetadata and timestamp behave the same for you regardless of what requiresOnDeviceRecognition is set to?

@righteoustales
Author

righteoustales commented Sep 15, 2024 via email

@righteoustales
Author

Quick update after upgrading to iOS 18, since it was released today:

This API is now broken as described herein regardless of whether the requiresOnDeviceRecognition flag is set to true or false. Goodbye, friendly workaround.

@righteoustales
Author

righteoustales commented Sep 16, 2024

I also updated the two Apple forum issues. Maybe a bit of activity there will flush them out of the woodwork to comment on it, but I doubt it.

Summary of where we are from my perspective:

I question whether a developer would ever want this throw-away behavior, but I will say with considerable certainty that they would definitely not want it if their task hint was set to "dictation".

Given that, I'm wondering if it is worthwhile for this speech_to_text (Flutter) plugin to deal with this (what I'm calling a) bug by noticing the deletion/start-over and then mitigating it by (re)prepending the words thrown away. And, if not comfortable with doing that for all cases, then perhaps doing so when the developer indicates they want it (via task hint or other). A sketch of what I mean follows the list below.

Without something of this nature, the speech-to-text results seem pretty unusable, because as noted earlier in this thread the caller of this Flutter API:

  1. does not have access to the metadata properties (e.g. the timestamp reset to 0, visible only via iOS API objects) that indicate the reset occurred;
  2. cannot simply compare the current results to the previous results to see what was dropped, due to the ongoing changes that occur as recognition refines the words recognized;
  3. for those that don't enable partial results, will never see the words tossed, because they are deleted before listening is stopped.
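
Here's the kind of app-level mitigation I mean (a sketch only; because of points 1 and 2 the detection heuristic is guesswork, and it will misfire on reinterpretations, which is exactly why the plugin seeing the timestamp reset would be more reliable):

```dart
import 'package:speech_to_text/speech_recognition_result.dart';

final List<String> _lostPhrases = [];
String _lastPartial = '';

// If the new partial is shorter than the last one and the last one does
// not start with it, assume the engine reset: bank the lost phrase and
// prepend everything banked so far to the live text.
String mitigatedWords(SpeechRecognitionResult result) {
  final words = result.recognizedWords;
  final looksLikeReset = _lastPartial.isNotEmpty &&
      words.length < _lastPartial.length &&
      !_lastPartial.startsWith(words);
  if (looksLikeReset) {
    _lostPhrases.add(_lastPartial);
  }
  _lastPartial = words;
  return [..._lostPhrases, words].join(' ');
}
```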

Thoughts?

@sowens-csd
Contributor

@righteoustales I'd have to agree with everything you said. Sure seems like a bug to me, at least a pretty major breaking behaviour change if it's not a bug. Supporting some mitigation in the plugin seems like the right path forward. Should Apple fix this then I'd think the mitigation would revert to a no-op since hopefully the timestamp reset would stop happening. I'll try to put together a beta and hopefully some folks can give it a try.

@sowens-csd added the bug and iOS labels on Sep 17, 2024
@flutterocks

flutterocks commented Sep 20, 2024

I've loaded the beta version, but it seems the experience is the same broken one; there might be some caching going on, so I'll test around some more.

Edit: still experiencing the broken behaviour

@righteoustales
Author

righteoustales commented Sep 20, 2024

> I've loaded the beta version, but it seems the experience is the same broken one; there might be some caching going on, so I'll test around some more.
>
> Edit: still experiencing the broken behaviour

Interesting. I'm not seeing that. In my testing, I didn't have any dropped words at all so far. I'm wondering what is different.

UPDATE: does 'flutter pub deps' show the correct library version included as specified in your pubspec.yaml?

@kakobayashi

Thank you for addressing this issue.
I tested with version 7.1.0-beta.1 on iOS 17.6.1 and iOS 18.0, and I did not experience any word drops.
It worked well in my environment.

@righteoustales
Author

@sowens-csd This popped up today on Stack Overflow. Sharing in case it is useful for comparison.

https://stackoverflow.com/questions/79005416/sfspeechrecognitionresult-discards-previous-transcripts-when-making-long-pauses/79005417#79005417

@flutterocks

flutterocks commented Sep 25, 2024

@righteoustales deps is showing 7.1.0-beta.1. recognizedWords is still only showing the last words I say after taking a pause. Same broken behaviour.

@righteoustales, are you accessing the words from a different variable?

Update: 🤦 I had partialResults set to false from some earlier testing. With partialResults set to true, the beta works as expected.

@flutterocks

@sowens-csd I suggest a user-configurable separator. My use case has the user dictating a long string of text, many sentences. For me, a separator of ". " would make the results a lot more accurate. I.e., if a user is taking a long pause, it's likely the start of a new sentence.

I.e. the phrase
"The cat jumps off of the table" .... "he landed on both feet"

in the current implementation becomes: "The cat jumps off of the table He landed on both feet"

whereas it would make more sense as: "The cat jumps off of the table. He landed on both feet"

This is for my use case; I can see other use cases wanting a different separator, hence the suggestion to make it user-configurable. But I would argue that a period separator should be the default, especially given the current capitalization behaviour.
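
Conceptually, something like this (the function and parameter names are just illustrative):

```dart
// 'separator' is the user-configurable piece; with ' ' you get the
// current behaviour, with '. ' each pause starts a new sentence.
String joinPhrases(List<String> phrases, {String separator = '. '}) =>
    phrases.join(separator);

void main() {
  final phrases = ['The cat jumps off of the table', 'He landed on both feet'];
  print(joinPhrases(phrases, separator: ' ')); // current behaviour
  print(joinPhrases(phrases)); // proposed default
}
```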

@sowens-csd
Contributor

@flutterocks interesting suggestion, I'll think about how to implement it. Your result about partial results is very interesting. I think it should generate the same result whether partialResults is true or false, so I'll check that.

@righteoustales
Author

I set partialResults = true in order to display what is being said, as it is said, to the app user, so I'm not testing the 'false' option there.

Regarding the request for the addition of punctuation at pauses: I can see scenarios where that is beneficial and a lot where it is not. If you decide to do that, please allow for disabling it. I suspect you would do less harm by un-capitalizing the first word after the pause bug is detected than by adding punctuation in reaction to the Apple-bug-induced capitalizing.

@flutterocks

@sowens-csd To implement this, can't we use a configurable variable in place of += " " here: https://github.com/csdcorp/speech_to_text/blame/main/speech_to_text/darwin/Classes/SpeechToTextPlugin.swift#L895-L904

@righteoustales
Author

Clearly doable, but @sowens-csd stated earlier in this thread that he would like the mitigation(s) to not require changing the API in a way that would no longer make sense when/if Apple fixes their bug. That's the challenging part.

@flutterocks

@righteoustales Not sure where your hostility is coming from, just looking to find a solution here :)

Stephen says

> Should Apple fix this then I'd think the mitigation would revert to a no-op since hopefully the timestamp reset would stop happening.

That is not the same as keeping the API unchanged.

It should be perfectly acceptable to temporarily add options to the API now to fix the issue, as long as they're a no-op once (if) Apple fixes the bug, especially given that some users will not update their OS and will continue to be on the bugged versions of iOS.

@righteoustales
Author

Hostility? I thought we were discussing the pros and cons of potential changes, as is commonly done in API discussions. I wasn't aware that talking about tradeoffs on GitHub was now considered hostile sparring among combatants.

Dang it. I forgot to put on my best medieval armor too. I can never get the timing of these things right.

@lukyanov

lukyanov commented Sep 27, 2024

I noticed that the behavior depends on the locale. If I set localeId as en, it works as expected, meaning recognizedWords would always contain the whole phrase spoken while listening.

If I change the locale to en_US, it starts resetting in the middle of the phrase when you pause shortly (as described in the original post).

It also works as expected for fr, but not for ru.

That might be the reason why some people can't reproduce the issue.

P.S. I think I saw the same behavior even before upgrading to iOS 18, but I'm not 100% sure.

@lukyanov

Confirming that the beta fixes the issue. Thank you!

@righteoustales
Author

@lukyanov It's bizarre that the localeId affects this as reported. Interesting find.

@lukyanov

@sowens-csd One issue (?) I stumbled upon with the beta: when you don't speak anything, the onError callback used to return error_no_match; now it's error_retry.

@sowens-csd
Contributor

@lukyanov that behaviour differs from the official release? That's very odd. I can see how it could be an OS difference, but I'm not sure what I changed in the plugin that would change that behaviour. I'll check it out.

@sowens-csd
Contributor

So I have an idea for the phrase aggregation that would allow it to be more customizable. The idea is to add an optional function parameter to the listen method. That function would receive an array of phrases and return an aggregate using whatever rules it likes. By default, the beta's current behaviour of joining with a space would be used; if a function is provided, it would override that behaviour. To support that I'd change the definition of the SpeechRecognitionWords class to add an optional List<String>? recognizedPhrases property.

This would allow good customization of the behaviour, but at the cost of more potential complexity for users of the plugin, even in just understanding the use of the new parameter and property. Add to that, they might not even be useful for long if Apple fixes the bug. The new property can be pretty much ignored by users because it will almost always be null, and even when not null it should be redundant with the recognizedWords property. For these reasons I've been hesitant to suggest it. However, I think it is the best solution for providing customization of the phrase aggregation.

The signature for the new aggregation function would be something like:

```dart
String aggregate(List<String> recognizedPhrases);
```

So listen would add a new optional named onAggregate parameter with that signature.
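
As a sketch of how it might look from the caller's side (onAggregate doesn't exist yet; this is just the proposal above):

```dart
// The phrases the engine delivered as separate utterances, joined with
// '. ' instead of the default ' '. Matches the proposed signature.
String sentenceAggregate(List<String> recognizedPhrases) =>
    recognizedPhrases.join('. ');

// Hypothetical call site, if listen gained the parameter:
// await speech.listen(onResult: _onSpeechResult, onAggregate: sentenceAggregate);
```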

Thoughts?

@righteoustales
Author

righteoustales commented Oct 2, 2024

I think it is a thoughtful compromise that provides for different behaviors using the same mechanism that you use for the default behavior while also not requiring code changes by existing users of the plugin. The only downside of it that occurs to me is the likelihood that you would want to deprecate it if Apple agrees that this is a bug and fixes it. I am also curious why you would specify it there versus in SpeechListenOptions?

Another possible approach would be to add a completely different method that could be used to install that same optional function and not modify the listen() method at all.

Advantages of the separate method (top of mind) could be:

  • you could give it a name that indicates the reason it exists - i.e. something that communicates that it is a mitigation and might go away if Apple fixes the bug
  • doesn't require changing listen at all
  • could itself be extended in interesting ways going forward, without reservation, since it is not part of the listen method per se

I've seen this approach used in libraries historically (like OpenGL) when new functionality needs to be exposed short-term as an extension call, but where that same extension call will later become unnecessary.

I think your proposal works fine too. Just brainstorming a bit with you since you asked for thoughts.

@sowens-csd
Contributor

Good thought. Normally I'd avoid that style of implementation because it's a bit hidden and not directly correlated with the action, but in this case that seems like a feature not a bug. Thanks!

@righteoustales
Author

> Good thought. Normally I'd avoid that style of implementation because it's a bit hidden and not directly correlated with the action, but in this case that seems like a feature not a bug. Thanks!

Agree. I'm happy to converse on pros/cons/ideas/brain-farts, especially given you are doing the heavy lifting. I appreciate both your work and your thoughtful approach.

Funny anecdote:
When I worked on the SQL Server team at MSFT, one of my peers who worked on DB metadata created a temporary function early in the dev cycle with a name like "UpgradeMyDatabaseMetaDataWithNoPossibilityOfRevertingItEver()" for all of us to use. The hilarity of that checkin as folks across the org noticed it was sheer gold. He was a legend.

@sowens-csd
Contributor

lol, I love it! That function name def gets an 11/10. Would love to know who won the prize for being the first person to complain about their database upgrade not reverting.

@righteoustales
Author

They were probably drowned out by the dozens of faux requests from the rest of us amidst hallway guffaws. It was a fun time.

@righteoustales
Author

@sowens-csd Btw, did you see this? It was posted 4 hours ago.

https://developer.apple.com/forums/thread/762952?page=2

@righteoustales
Author

I just added an update on that Apple thread based on upgrading to the latest iOS beta. You can read it there, but the short story is this: the behavior seems to have reverted to what I observed originally on 17.6.

  • requiresOnDeviceRecognition = true: same bug
  • requiresOnDeviceRecognition = false: no bug

@sowens-csd
Contributor

I had not seen that, thanks for pointing it out. Looks like they are actively working on it, that's good news.

@sowens-csd
Contributor

7.0.0-beta.2 is now live on pub.dev. It has the new aggregator behaviour that can be overridden using SpeechToText.unexpectedPhraseAggregator. See the example app main.dart for a usage example.
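
Roughly like this, assuming unexpectedPhraseAggregator is a settable property with the aggregate signature discussed above (main.dart in the example app is the authoritative reference):

```dart
import 'package:speech_to_text/speech_to_text.dart';

final speech = SpeechToText();

// Assumption: the aggregator takes the list of phrases and returns the
// combined string, as in the earlier proposal.
void configureAggregator() {
  speech.unexpectedPhraseAggregator =
      (List<String> phrases) => phrases.join('. ');
}
```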

@sowens-csd
Contributor

sowens-csd commented Oct 3, 2024

@lukyanov I just tried to reproduce your result with en_US and fr_CA and in both cases it behaved as expected. For both locales I spoke and paused and saw the app properly aggregate multiple phrases. This was on iOS 18 with onDevice false.

I also just tried to reproduce the error_retry issue without success. I am getting error_no_match.

Any other tips to reproduce?

@righteoustales
Author

Quick update: I was prompted to upgrade to the latest iOS beta (22B5069a) last night and installed it this morning. The issue still exists when requiresOnDeviceRecognition is true. Setting it to false still seems to mitigate the word-loss behavior.
