User Report: Learning Isn't Happening #215

jlewi · 2024-08-28T19:55:25Z

See this thread.
https://discord.com/channels/1102639988832735374/1278048402567069728

User queried the logs and didn't see any entries for BuildLog.

jlewi · 2024-08-28T19:56:30Z

Are the logs Processing Any Log File?

foyle/app/pkg/analyze/analyzer.go

Line 200 in 4288e91

log.V(logs.Debug).Info("Processing log file", "path", path)

That is logged in debug mode so we'd have to bump the log level.

@sourishkrout can you try the following

Run Foyle with debug logs enabled
```
foyle serve --log=debug
```
Execute a couple cells / accept a suggestion

Search the logs for the above message

 jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"

Explanation

The Analyzer is the component that processes the logs. It produces traces which are then what we learn from. The message in discord showing no "Build block log" indicated no logs for blocks were being produced. So now we are trying to verify that the Analyzer is processing logs so we can narrow down the problem.

sourishkrout · 2024-08-28T21:11:41Z

Negative. I don't see any Processing log file logs:

https://gist.github.com/sourishkrout/bc7cdccaa828056b3f5c622679ac49bb

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

sourishkrout · 2024-08-28T21:19:02Z

@jlewi I am not sure what this logic does, but it appears my logs always wind up in this condition (see below):

foyle/app/pkg/analyze/analyzer.go

Lines 207 to 209 in 4288e91

    
           if len(lines) == 0 { 
        
           	return nil 
        
           }

sourishkrout · 2024-08-28T21:21:23Z

Oh, looks like that's just tracking "savepoints" to avoid double processing.

sourishkrout · 2024-08-28T21:22:24Z

3. Search the logs for the above message
jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"

Q: Processing log file should appear in Foyle's log and not in the Runme AI's logs, no?

sourishkrout · 2024-08-28T21:27:03Z

Here we go. Logs! Still not learning, though.

https://gist.github.com/sourishkrout/3cdecee8b123a436d35775cfbf4b8dac

jlewi · 2024-08-28T22:26:58Z

@sourishkrout. Your 100% correct it should be the Foyle Logs not the RunMe logs. I should have caught that.

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

That's correct. I should have done that in the first place rather than copying and pasting buggy queries.

Here you go.

https://github.com/jlewi/foyle/blob/ff414043c8996f30a7885ccaf85ac4b26b5916ed/developer_guides/troubleshoot_learning.md

Its a bit rough but try running through it and then sharing the output.

sourishkrout · 2024-08-28T23:26:58Z

Here are my results @jlewi:

https://gist.github.com/sourishkrout/0f77867bb35df94d1b0eb4490c8fc6ff

jlewi · 2024-08-29T02:45:22Z

Thanks @sourishkrout

It looks like you set LASTLOG to the RunMe logs here
So all log statements below that are checking the wrong log file.

Does the formula in the notebook not work for you?

export LASTLOG=~/.foyle/logs/raw/$(ls -t ~/.foyle/logs/raw | head -n 1 )

Your earlier trace indicates that a BlockLog was properly computed for your cell so that's good.

In this particular case I wouldn't expect learning to occur. If you look at the BlockLog you can see that contents of the generatedBlock and the executedBlock are the same. The only difference is some whitespace.

You can see here that when processing examples (BlockLogs) we strip whitespace and then compare the contents. If they are the same then we don't learn from that example.

So here's the things to try

Enable debug logs by running with "--level=debug"
Accept a suggested edit cell
Make a non whitespace change to the cell
Use the notebook to get the block trace for that cell
- Verify the executed and generated contents are different in more than just whitespace
Check the logs for that cell ; there should be a message saying it learned from that example

Related to #215 and #214

sourishkrout · 2024-09-12T16:38:24Z

Okay, so now again with Foyle & Runme on latest main:

https://gist.github.com/sourishkrout/f075e1a7eb7b23e84854383323f666ec#file-troubleshoot_learning-01j7kh2c6177p74pfkcd6w3ns8-md

Changed the troubleshooting notebook slightly also to include a full trace of my most recent log. Is there anything you can spot in the gist, @jlewi?

jlewi · 2024-09-12T20:38:51Z

@sourishkrout so thanks to your report I was able to identify two bugs that I fixed in #233

The first is a bug in the documentation; the command to check the blocklogs has changed because we now use the ConnectProtocol
- The updated command is here
The second is there was a bug in adding execution events to the blocklog

So please retry using the latest on main which has the fix. The updated command for checking the blocklog is here
https://foyle.io/docs/learning/troubleshoot_learning/#did-block-logs-get-created

Your logs indicate the BlockLog is being generated

{"severity":"info","time":1726158688.428434,"caller":"analyze/analyzer.go:658","function":"github.com/jlewi/foyle/app/pkg/analyze.buildBlockLog","message":"Building block log","blockId":"01J7KH6B5YJ111KEXAAHD97XY8","block":{"id":"01J7KH6B5YJ111KEXAAHD97XY8","gen_trace_id":"da7a66391bad2f64f6f66e056eed6772","exit_code":0,"eval_mode":false,"resource_version":"c7db69f7-869b-4439-92e2-09ba254bc364","suggestion_status":"SuggestionStatusUnknown"}}

So I suspect the blocklog is there so once we start using the updated CURL command we should be able to fetch it and see what's going on.

Note that the location of the troubleshooting guide has changed since I moved it into the docs website.

sourishkrout · 2024-09-13T00:05:38Z

The updated command is here

Where? Did you forget to include the link?

Here's an updated Runme session gist: https://gist.github.com/sourishkrout/babd0e9f32ad8079869d4534f8b3bb4b#file-troubleshoot_learning-01j7mayp580arr60njbes5fbq6-md

jlewi · 2024-09-13T01:10:21Z

Oh sorry. I did forget the link;
https://foyle.io/docs/learning/troubleshoot_learning/#did-block-logs-get-created

jlewi · 2024-09-13T01:14:38Z

Oh great looks like you found the correct link sorry about that.

It looks like things are working now.

The output shows that you have "example.binpb"

So it looks like #233 fixed the issue with example.binpb files not being computed.

Are you observing something that leads you to think learning isn't working?
Are you just not observing the quality you'd expect?

sourishkrout · 2024-09-13T03:41:50Z

Sorry the example showing up in the output isn't from my machine. The cell is echoing the static content from the original doc since it's plain text. Should have skipped executing it.

The issue I'm observing is still the same where the previously accepted, corrected, and successfully executed does not seem to get retained as part of learning.

jlewi · 2024-09-13T19:36:14Z

Ahh my mistake.

So it looks like cell 01J7KQPBYCT9VM2KFBY48JC7J0 is your cell?

it looks like the block log is available and indicates different commands
- Generated command was jq
- Executed was a CURl
There is no log message corresponding to this line indicating a new examples was detected

So there are a couple possibilities

The block is never enqueued
The block is enqueued but it is never reconciled
Reconcile is invoked but execution is aborted before reaching this line

Blocks get enqueued here by invoking the blockNotifier. That will not happen if the blockNotifier is nil.

Block notifier gets set here

Analyzer.Run is called here.

I can't see any reason why Learner.Enqueue wouldn't be getting invoked.

Next Steps

At this point the best way to probably debug this would be to set a couple breakpoints and see what code paths are getting executed. My suggestion would be to start as follows

Set a breakpoint here (

foyle/app/pkg/analyze/analyzer.go

Line 642 in 8c57b00

if a.blockNotifier != nil {
- This will tell us whether the block is being enqueued
Set a breakpoint in learner.Reconcile

foyle/app/pkg/learn/learner.go

Line 124 in 8c57b00

log := logs.FromContext(ctx)

Let me know if you want to pair.

* We need to enqueue blocks in the Learner when we detect an execution event * If we don't then blocks don't get checked to see if we can use them for learning. * The reason we were occasionally seeing learning occur without this is there's some small probabity * The generation event for the block will be processed by an Analyzer and enqueue the item in the learner * The execution event is processed by the Analyzer * The learner reconciles the generation event * i.e. since the Analyzer and learner run in parallel its possible the execution event gets detected before the learner has had a chance to handle the block in which case the execution event will be part of the blocklog when it is processed by the learner. * We shouldn't notify the learner after processing a log entry about how a block was generated. * This is a legacy from when block generation was logged in Foyle logs but cell executions were logged in RunMe logs * In that case we didn't have any guarantee in the order in which block generation and execution events would be processed * Now that execution events are logged in Foyle Logs we can be reasonably assured that log entries for cell execution will be logged after the log for how the cell was generated. Therefore we can wait until encountering an execution event to process the block in the learner. * This is most likely the most recent bug in #215

jlewi · 2024-09-14T21:18:47Z

@sourishkrout I found and fixed another bug #241 due to the logging changes. Fingers crossed that fixes things for you.

sourishkrout · 2024-09-17T18:54:25Z

🤞 testing this now.

sourishkrout · 2024-09-17T19:01:15Z

No dice. However, the additional cells in the troubleshooting guide might be helpful, @jlewi.

https://gist.github.com/sourishkrout/cceb035482da5da3ec5e4ce294da9207#file-troubleshoot_learning-01j80nc4fqpm7c5ffxyfpx5r5s-md

I'll try the breakpoints and see what happens. I could pair tmrw afternoon if that's easier.

sourishkrout · 2024-09-17T21:16:32Z

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

https://github.com/jlewi/foyle/blob/main/app/pkg/learn/learner.go#L158

jlewi · 2024-09-17T21:43:13Z

I'm sorry this is so frustrating. I really appreciate your patience.

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

Yes.

I think I know what is going on. "learner_enqueued_total" is only 1.
In #241 I changed when we notify the learner because we should only need to notify the learner after processing the executed log entry because that should necessarily be after the generated log entry.

The learner gets notified here. My assumption was that by the time the execution event gets processed the BlockLog has already been processed

But that's not guaranteed because we read the log lines in chunks and only update the blocklog after reading the entire chunk

foyle/app/pkg/analyze/analyzer.go

Line 307 in 6035823

a.combineAndCheckpoint(ctx, path, offset, traceIDs)

So why was it working for me and not you?

So here are some examples that were produced for me

-rw-r--r--    1 jlewi  staff  30077 Sep 16 18:24 01J7YMFGD47MEG6QN46NN0GXKQ.example.binpb
-rw-r--r--    1 jlewi  staff  33214 Sep 16 18:25 01J7YQ57WRP73B0ND35NZY3MVG.example.binpb
-rw-r--r--    1 jlewi  staff   7953 Sep 17 14:34 01J80YESW7NYV5N6DX0168HNAQ.example.binpb

Two things are true in my case

I ended up executing the cell multiple times because I needed to iterate to get to the correct command
I have debug logs enabled.

I think those two things contribute to the bug being masked. In particular, the bug means the first time the learner gets notified due to the first execution event the generated block won't be part of the log. The subsequent cell executions lead to additional learner notifications being fired. Debug logs increase the number of logs in between the two execution events. As a result, were more likely to hit the chunk limit in between execution events and therefore combine the log entries and update the BlockLog with the generated block.

The problem is described in detail here #215 (comment) In #241 I changed where the learner gets notified of a block to process. Rather notifiying it when updating a blockLog with the generated block, I moved the update to do it after an execution event was processed. My assumption was that by the time the execution event gets processed the BlockLog has already been processed. But that's not guaranteed because we read the log lines in chunks and only update the blocklog with the generated block after reading the entire chunk. However, the executed field of the blockLog is updated as soon as the line is processed. As a result, we were notifying the learner after the first execution event at which point the block log hasn't been updated with the generated log. The solution is to notify the learner after execution events are processed and after generated events are processed. The learner will check if all required fields are present (i.e. both generated and executed block fields). If not the item is ignored. Therefore to make our code less brittle and more robust we can just fire notifications to the learner after either event. Related to #215

jlewi · 2024-09-17T22:19:38Z

@sourishkrout Latest fix is #244

sourishkrout · 2024-09-17T23:27:53Z

@sourishkrout Latest fix is #244

I can definitely see more activity in the logs now. However, my ~/.foyle/training remains empty, and learning still does not take effect. I made sure that in the example below, I would edit the cell a few times to achieve the desired outcome.

If you're available tomorrow afternoon, we could pair up to work through it. Looking over my shoulder might help, too.

https://gist.github.com/sourishkrout/f42a1b63578f4c95701b8ca332360caf#file-troubleshoot_learning-01j814mfadm7dw0m5b1jdgny3w-md

jlewi · 2024-09-18T00:54:15Z

So we're making progress. I think I found the next bug.

Your counters show that there were two learning examples.

learner_blocks_processed{status="learn"} 2
learner_blocks_processed{status="nochange"} 1
learner_blocks_processed{status="notgenerated"} 1
learner_blocks_processed{status="unexecuted"} 10

So we know the code reached here

foyle/app/pkg/learn/learner.go

Line 176 in d67a104

cellsProcessed.WithLabelValues("learn").Inc()

If something went wrong in the subsequent processing we should have seen an error.

I think the bug is that exampleFiles is empty then the for loop never iterates and we don't try to create any exampleFiles

foyle/app/pkg/learn/learner.go

Line 177 in d67a104

expectedFiles := l.getExampleFiles(b.GetId())

getExampleFiles will return an empty list if getTrainingDirs is empty
https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/learn/learner.go#L252C29-L252C44

getTrainingDirs returns empty if Learner in your configuration is empty

foyle/app/pkg/config/config.go

Line 247 in d67a104

if c.Learner == nil {

This is legacy code. Prior to #211 you had to set the location of the RunMe logs and learner could be nil. Post #211 you shouldn't need to set learner.

My config looks like

learner:
    logDirs: []

Fix incoming.

* The bug is described in #215 (comment) * The bug is that `Config.getTrainingDirs` returns no training directories if config.learner == nil * Prior to #211 config.learner would be non-nil because we had to set the path of the RunMe logs * However, now that we no longer depend on RunMe logs config.learner could be nil and this would return no training directories. In which case learner.Reconcile would not attempt to save any examples * The fix is to allow config.Learner to be nil in GetTrainingDirs and return a suitable default. We also need to ensure the training directory gets created if it doesn't exist. * This is fixed by jlewi/monogo#23 which updates LocalFileHelper to create the directory if it doesn't exist. * My suspicion is that I never hit this bug because i originally created my ~/.foyle/training directory using a version of the code which wasn't using FileHelper and explicitly checked and created the directory. I suspect when I refactored the code to support saving examples to GCS thats when the code to ensure the directory exists got dropped.

jlewi · 2024-09-18T04:18:08Z

@sourishkrout latest fix is #245

jlewi · 2024-09-20T22:53:37Z

Two other issues we ran into during our pair programming
#249
#248

jlewi · 2024-10-02T22:40:25Z

@sourishkrout I'm marking this as fixed. Let me know if your still having problems.

jlewi mentioned this issue Aug 28, 2024

Troubleshooting guide for learning. #216

Merged

jlewi mentioned this issue Aug 29, 2024

Add RPC to Query the Logs #217

Open

jlewi added a commit that referenced this issue Aug 30, 2024

Troubleshooting guide for learning. (#216)

0a7b477

Related to #215 and #214

This was referenced Sep 13, 2024

OTEL for monitoring learning process #238

Open

Learning Bug Fix: Need to enqueue blocks after execution events #241

Merged

jlewi mentioned this issue Sep 17, 2024

Fix Race Condition In Learning #244

Merged

jlewi mentioned this issue Sep 18, 2024

Bug Fix: Learning doesn't happen because exampleDirs is empty #245

Merged

jlewi closed this as completed Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User Report: Learning Isn't Happening #215

User Report: Learning Isn't Happening #215

jlewi commented Aug 28, 2024

jlewi commented Aug 28, 2024 •

edited

Loading

sourishkrout commented Aug 28, 2024 •

edited

Loading

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

jlewi commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

jlewi commented Aug 29, 2024

sourishkrout commented Sep 12, 2024 •

edited

Loading

jlewi commented Sep 12, 2024 •

edited

Loading

sourishkrout commented Sep 13, 2024

jlewi commented Sep 13, 2024

jlewi commented Sep 13, 2024 •

edited

Loading

sourishkrout commented Sep 13, 2024 •

edited

Loading

jlewi commented Sep 13, 2024

jlewi commented Sep 14, 2024

sourishkrout commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

jlewi commented Sep 17, 2024

jlewi commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

jlewi commented Sep 18, 2024

jlewi commented Sep 18, 2024

jlewi commented Sep 20, 2024

jlewi commented Oct 2, 2024

User Report: Learning Isn't Happening #215

User Report: Learning Isn't Happening #215

Comments

jlewi commented Aug 28, 2024

jlewi commented Aug 28, 2024 • edited Loading

Explanation

sourishkrout commented Aug 28, 2024 • edited Loading

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

jlewi commented Aug 28, 2024

sourishkrout commented Aug 28, 2024

jlewi commented Aug 29, 2024

sourishkrout commented Sep 12, 2024 • edited Loading

jlewi commented Sep 12, 2024 • edited Loading

sourishkrout commented Sep 13, 2024

jlewi commented Sep 13, 2024

jlewi commented Sep 13, 2024 • edited Loading

sourishkrout commented Sep 13, 2024 • edited Loading

jlewi commented Sep 13, 2024

Next Steps

jlewi commented Sep 14, 2024

sourishkrout commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

jlewi commented Sep 17, 2024

So why was it working for me and not you?

jlewi commented Sep 17, 2024

sourishkrout commented Sep 17, 2024

jlewi commented Sep 18, 2024

jlewi commented Sep 18, 2024

jlewi commented Sep 20, 2024

jlewi commented Oct 2, 2024

jlewi commented Aug 28, 2024 •

edited

Loading

sourishkrout commented Aug 28, 2024 •

edited

Loading

sourishkrout commented Sep 12, 2024 •

edited

Loading

jlewi commented Sep 12, 2024 •

edited

Loading

jlewi commented Sep 13, 2024 •

edited

Loading

sourishkrout commented Sep 13, 2024 •

edited

Loading