User Report: Learning Isn't Happening #215
Comments
Are the logs Processing Any Log File? foyle/app/pkg/analyze/analyzer.go Line 200 in 4288e91
That is logged in debug mode so we'd have to bump the log level. @sourishkrout can you try the following
Explanation
The Analyzer is the component that processes the logs. It produces traces, which are what we learn from. The message in Discord showing no "Build block log" indicated that no logs for blocks were being produced. So now we are trying to verify that the Analyzer is processing logs so we can narrow down the problem.
Negative. I don't see any https://gist.github.com/sourishkrout/bc7cdccaa828056b3f5c622679ac49bb By the way, ☝️ would make a good troubleshooting notebook inside the
@jlewi I am not sure what this logic does, but it appears my logs always wind up in this condition (see below): foyle/app/pkg/analyze/analyzer.go Lines 207 to 209 in 4288e91
Oh, looks like that's just tracking "savepoints" to avoid double processing.
Q:
Here we go. Logs! Still not learning, though. https://gist.github.com/sourishkrout/3cdecee8b123a436d35775cfbf4b8dac
@sourishkrout You're 100% correct: it should be the Foyle Logs, not the RunMe logs. I should have caught that.
That's correct. I should have done that in the first place rather than copying and pasting buggy queries. Here you go. It's a bit rough, but try running through it and then sharing the output.
Here are my results @jlewi: https://gist.github.com/sourishkrout/0f77867bb35df94d1b0eb4490c8fc6ff
Thanks @sourishkrout It looks like you set Does the formula in the notebook not work for you?
Your earlier trace indicates that a BlockLog was properly computed for your cell, so that's good. In this particular case I wouldn't expect learning to occur. If you look at the BlockLog you can see that the contents of the generatedBlock and the executedBlock are the same. The only difference is some whitespace. You can see here that when processing examples (BlockLogs) we strip whitespace and then compare the contents. If they are the same then we don't learn from that example. So here are the things to try
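The whitespace check described above can be sketched roughly as follows. This is only an illustration and not Foyle's actual code: `stripWhitespace` and `shouldLearn` are hypothetical names, and it assumes all whitespace is removed before comparing, which may differ from what the learner really does.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripWhitespace removes every Unicode whitespace rune from s.
// Hypothetical helper; Foyle's own normalization may differ.
func stripWhitespace(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

// shouldLearn reports whether the executed cell differs from the generated
// cell by more than whitespace. Hypothetical name, for illustration only.
func shouldLearn(generated, executed string) bool {
	return stripWhitespace(generated) != stripWhitespace(executed)
}

func main() {
	// Only whitespace differs, so there is nothing to learn.
	fmt.Println(shouldLearn("ls -la\n", "ls  -la"))
	// The user changed the command, so this is a learning candidate.
	fmt.Println(shouldLearn("ls -la", "ls -la /tmp"))
}
```

Under these assumptions, a cell that was accepted and executed with only spacing changes would not produce an example, which matches the behavior described for this BlockLog.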
Okay, so now again with Foyle & Runme on latest. I also changed the troubleshooting notebook slightly to include a full trace of my most recent log. Is there anything you can spot in the gist, @jlewi?
@sourishkrout so thanks to your report I was able to identify two bugs that I fixed in #233
So please retry using the latest on main, which has the fix. The updated command for checking the blocklog is here. Your logs indicate the BlockLog is being generated
So I suspect the blocklog is there, so once we start using the updated curl command we should be able to fetch it and see what's going on. Note that the location of the troubleshooting guide has changed since I moved it into the docs website.
Where? Did you forget to include the link? Here's an updated Runme session gist: https://gist.github.com/sourishkrout/babd0e9f32ad8079869d4534f8b3bb4b#file-troubleshoot_learning-01j7mayp580arr60njbes5fbq6-md
Oh sorry. I did forget the link;
Oh great, looks like you found the correct link. Sorry about that. It looks like things are working now.
So it looks like #233 fixed the issue with example.binpb files not being computed. Are you observing something that leads you to think learning isn't working?
Sorry, the example showing up in the output isn't from my machine. The cell is echoing the static content from the original doc since it's plain text. I should have skipped executing it. The issue I'm observing is still the same: the previously accepted, corrected, and successfully executed cell does not seem to get retained as part of learning.
Ahh my mistake. So it looks like cell
So there are a couple possibilities
Blocks get enqueued here by invoking the blockNotifier. That will not happen if the blockNotifier is nil. Block notifier gets set here. Analyzer.Run is called here. I can't see any reason why Learner.Enqueue wouldn't be getting invoked.
Next Steps
At this point, probably the best way to debug this would be to set a couple of breakpoints and see what code paths are getting executed. My suggestion would be to start as follows
Let me know if you want to pair.
* We need to enqueue blocks in the Learner when we detect an execution event
* If we don't then blocks don't get checked to see if we can use them for learning.
* The reason we were occasionally seeing learning occur without this is there's some small probability
  * The generation event for the block will be processed by an Analyzer and enqueue the item in the learner
  * The execution event is processed by the Analyzer
  * The learner reconciles the generation event
  * i.e. since the Analyzer and learner run in parallel it's possible the execution event gets detected before the learner has had a chance to handle the block, in which case the execution event will be part of the blocklog when it is processed by the learner.
* We shouldn't notify the learner after processing a log entry about how a block was generated.
  * This is a legacy from when block generation was logged in Foyle logs but cell executions were logged in RunMe logs
  * In that case we didn't have any guarantee in the order in which block generation and execution events would be processed
  * Now that execution events are logged in Foyle Logs we can be reasonably assured that log entries for cell execution will be logged after the log for how the cell was generated. Therefore we can wait until encountering an execution event to process the block in the learner.
* This is most likely the most recent bug in #215
@sourishkrout I found and fixed another bug #241 due to the logging changes. Fingers crossed that fixes things for you.
🤞 testing this now.
No dice. However, the additional cells in the troubleshooting guide might be helpful, @jlewi. I'll try the breakpoints and see what happens. I could pair tmrw afternoon if that's easier.
It looks like I never get past this line, even though one of the two occasions was definitely a generated block: https://github.com/jlewi/foyle/blob/main/app/pkg/learn/learner.go#L158
I'm sorry this is so frustrating. I really appreciate your patience.
Yes. I think I know what is going on. "learner_enqueued_total" is only 1. The learner gets notified here. My assumption was that by the time the execution event gets processed the BlockLog has already been processed. But that's not guaranteed because we read the log lines in chunks and only update the blocklog after reading the entire chunk foyle/app/pkg/analyze/analyzer.go Line 307 in 6035823
So why was it working for me and not you?
So here are some examples that were produced for me
Two things are true in my case
I think those two things contribute to the bug being masked. In particular, the bug means that the first time the learner gets notified, due to the first execution event, the generated block won't be part of the log. The subsequent cell executions lead to additional learner notifications being fired. Debug logs increase the number of logs in between the two execution events. As a result, we're more likely to hit the chunk limit in between execution events and therefore combine the log entries and update the BlockLog with the generated block.
The problem is described in detail here #215 (comment)

In #241 I changed where the learner gets notified of a block to process. Rather than notifying it when updating a blockLog with the generated block, I moved the notification so that it happens after an execution event is processed. My assumption was that by the time the execution event gets processed the BlockLog has already been processed. But that's not guaranteed because we read the log lines in chunks and only update the blocklog with the generated block after reading the entire chunk. However, the executed field of the blockLog is updated as soon as the line is processed. As a result, we were notifying the learner after the first execution event, at which point the block log hasn't been updated with the generated block.

The solution is to notify the learner after execution events are processed and after generated events are processed. The learner will check if all required fields are present (i.e. both generated and executed block fields). If not, the item is ignored. Therefore, to make our code less brittle and more robust, we can just fire notifications to the learner after either event.

Related to #215
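The "notify after either event, let the learner decide" approach described above can be sketched like this. It is a simplified model and not the code from #244: the `BlockLog` fields, the `Learner` type, and its `Enqueue` signature here are hypothetical stand-ins chosen for illustration.

```go
package main

import "fmt"

// BlockLog is a simplified, hypothetical stand-in for Foyle's block log.
type BlockLog struct {
	Generated string // filled in once the whole chunk has been merged
	Executed  string // filled in as soon as the execution line is read
}

// Learner is a hypothetical model of the component that learns from blocks.
type Learner struct {
	logs    map[string]*BlockLog
	learned map[string]bool
}

// Enqueue may be called after any event. It ignores block logs that do not
// yet have both fields, so an early notification is harmless.
func (l *Learner) Enqueue(id string) bool {
	b, ok := l.logs[id]
	if !ok || b.Generated == "" || b.Executed == "" {
		return false // incomplete: skip it, a later notification will retry
	}
	l.learned[id] = true // a set, so repeated notifications are idempotent
	return true
}

func main() {
	logs := map[string]*BlockLog{"b1": {}}
	l := &Learner{logs: logs, learned: map[string]bool{}}

	// The execution line is applied immediately, before the chunk is merged.
	logs["b1"].Executed = "ls -la /tmp"
	fmt.Println(l.Enqueue("b1")) // generated block is still missing

	// The generated block arrives when the chunk is complete.
	logs["b1"].Generated = "ls -la"
	fmt.Println(l.Enqueue("b1")) // both fields present now
}
```

The design choice in this sketch is that the notifier does not need to know which event completes the block log. Ordering assumptions move out of the Analyzer and into a single completeness check in the learner.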
@sourishkrout Latest fix is #244
I can definitely see more activity in the logs now. However, my If you're available tomorrow afternoon, we could pair up to work through it. Looking over my shoulder might help, too.
So we're making progress. I think I found the next bug. Your counters show that there were two learning examples.
So we know the code reached here foyle/app/pkg/learn/learner.go Line 176 in d67a104
If something went wrong in the subsequent processing we should have seen an error. I think the bug is that if exampleFiles is empty then the for loop never iterates and we don't try to create any example files foyle/app/pkg/learn/learner.go Line 177 in d67a104
getExampleFiles will return an empty list if getTrainingDirs is empty. getTrainingDirs returns empty if Learner in your configuration is empty foyle/app/pkg/config/config.go Line 247 in d67a104
This is legacy code. Prior to #211 you had to set the location of the RunMe logs, so learner was always set. Post #211 you shouldn't need to set learner, so it can be nil. My config looks like
Fix incoming.
* The bug is described in #215 (comment)
* The bug is that `Config.getTrainingDirs` returns no training directories if config.learner == nil
* Prior to #211 config.learner would be non-nil because we had to set the path of the RunMe logs
* However, now that we no longer depend on RunMe logs, config.learner could be nil and this would return no training directories. In that case learner.Reconcile would not attempt to save any examples
* The fix is to allow config.Learner to be nil in GetTrainingDirs and return a suitable default. We also need to ensure the training directory gets created if it doesn't exist.
* This is fixed by jlewi/monogo#23 which updates LocalFileHelper to create the directory if it doesn't exist.
* My suspicion is that I never hit this bug because I originally created my ~/.foyle/training directory using a version of the code which wasn't using FileHelper and explicitly checked and created the directory. I suspect that when I refactored the code to support saving examples to GCS, that's when the code to ensure the directory exists got dropped.
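The shape of that fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from #245 or jlewi/monogo#23: the `Config` and `LearnerConfig` structs, the `HomeDir` and `ExampleDirs` fields, and the `trainingDirs` and `ensureDirs` helpers are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// LearnerConfig and Config are simplified, hypothetical stand-ins for
// Foyle's configuration types.
type LearnerConfig struct {
	ExampleDirs []string
}

type Config struct {
	Learner *LearnerConfig
	HomeDir string // hypothetical: base directory used for the default
}

// trainingDirs tolerates a nil Learner section and falls back to a default
// directory instead of returning an empty list.
func (c *Config) trainingDirs() []string {
	if c.Learner == nil || len(c.Learner.ExampleDirs) == 0 {
		return []string{filepath.Join(c.HomeDir, "training")}
	}
	return c.Learner.ExampleDirs
}

// ensureDirs creates each directory if it does not already exist.
func ensureDirs(dirs []string) error {
	for _, d := range dirs {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return fmt.Errorf("creating %s: %w", d, err)
		}
	}
	return nil
}

func main() {
	tmp, err := os.MkdirTemp("", "foyle-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)

	cfg := &Config{HomeDir: tmp} // Learner deliberately left nil
	dirs := cfg.trainingDirs()
	if err := ensureDirs(dirs); err != nil {
		panic(err)
	}
	fmt.Println(len(dirs)) // one default directory rather than none
}
```

With a non-empty list of directories, a loop over example files has somewhere to write, so the silent "no iterations, no error" path described above no longer applies in this model.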
@sourishkrout latest fix is #245
@sourishkrout I'm marking this as fixed. Let me know if you're still having problems.
See this thread.
https://discord.com/channels/1102639988832735374/1278048402567069728
User queried the logs and didn't see any entries for BuildLog.