Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User Report: Learning Isn't Happening #215

Closed
jlewi opened this issue Aug 28, 2024 · 27 comments
Closed

User Report: Learning Isn't Happening #215

jlewi opened this issue Aug 28, 2024 · 27 comments

Comments

@jlewi
Copy link
Owner

jlewi commented Aug 28, 2024

See this thread.
https://discord.com/channels/1102639988832735374/1278048402567069728

User queried the logs and didn't see any entries for BuildLog.

@jlewi
Copy link
Owner Author

jlewi commented Aug 28, 2024

Are the logs Processing Any Log File?

log.V(logs.Debug).Info("Processing log file", "path", path)

That is logged in debug mode so we'd have to bump the log level.

@sourishkrout can you try the following

  1. Run Foyle with debug logs enabled

    foyle serve --log=debug
    
  2. Execute a couple cells / accept a suggestion

  3. Search the logs for the above message

     jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"
    

Explanation

The Analyzer is the component that processes the logs. It produces traces which are then what we learn from. The message in discord showing no "Build block log" indicated no logs for blocks were being produced. So now we are trying to verify that the Analyzer is processing logs so we can narrow down the problem.

@sourishkrout
Copy link

sourishkrout commented Aug 28, 2024

Negative. I don't see any Processing log file logs:

https://gist.github.com/sourishkrout/bc7cdccaa828056b3f5c622679ac49bb

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

@sourishkrout
Copy link

@jlewi I am not sure what this logic does, but it appears my logs always wind up in this condition (see below):

if len(lines) == 0 {
return nil
}

image

@sourishkrout
Copy link

Oh, looks like that's just tracking "savepoints" to avoid double processing.

@sourishkrout
Copy link

3. Search the logs for the above message
jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"

Q: Processing log file should appear in Foyle's log and not in the Runme AI's logs, no?

@sourishkrout
Copy link

Here we go. Logs! Still not learning, though.

https://gist.github.com/sourishkrout/3cdecee8b123a436d35775cfbf4b8dac

@jlewi
Copy link
Owner Author

jlewi commented Aug 28, 2024

@sourishkrout. Your 100% correct it should be the Foyle Logs not the RunMe logs. I should have caught that.

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

That's correct. I should have done that in the first place rather than copying and pasting buggy queries.

Here you go.

https://github.com/jlewi/foyle/blob/ff414043c8996f30a7885ccaf85ac4b26b5916ed/developer_guides/troubleshoot_learning.md

Its a bit rough but try running through it and then sharing the output.

@sourishkrout
Copy link

@jlewi
Copy link
Owner Author

jlewi commented Aug 29, 2024

Thanks @sourishkrout

It looks like you set LASTLOG to the RunMe logs here
So all log statements below that are checking the wrong log file.

Does the formula in the notebook not work for you?

export LASTLOG=~/.foyle/logs/raw/$(ls -t ~/.foyle/logs/raw | head -n 1 )

Your earlier trace indicates that a BlockLog was properly computed for your cell so that's good.

In this particular case I wouldn't expect learning to occur. If you look at the BlockLog you can see that contents of the generatedBlock and the executedBlock are the same. The only difference is some whitespace.

You can see here that when processing examples (BlockLogs) we strip whitespace and then compare the contents. If they are the same then we don't learn from that example.

So here's the things to try

  1. Enable debug logs by running with "--level=debug"
  2. Accept a suggested edit cell
  3. Make a non whitespace change to the cell
  4. Use the notebook to get the block trace for that cell
    • Verify the executed and generated contents are different in more than just whitespace
  5. Check the logs for that cell ; there should be a message saying it learned from that example

jlewi added a commit that referenced this issue Aug 30, 2024
@sourishkrout
Copy link

sourishkrout commented Sep 12, 2024

Okay, so now again with Foyle & Runme on latest main:

https://gist.github.com/sourishkrout/f075e1a7eb7b23e84854383323f666ec#file-troubleshoot_learning-01j7kh2c6177p74pfkcd6w3ns8-md

Changed the troubleshooting notebook slightly also to include a full trace of my most recent log. Is there anything you can spot in the gist, @jlewi?

@jlewi
Copy link
Owner Author

jlewi commented Sep 12, 2024

@sourishkrout so thanks to your report I was able to identify two bugs that I fixed in #233

  1. The first is a bug in the documentation; the command to check the blocklogs has changed because we now use the ConnectProtocol

    • The updated command is here
  2. The second is there was a bug in adding execution events to the blocklog

So please retry using the latest on main which has the fix. The updated command for checking the blocklog is here
https://foyle.io/docs/learning/troubleshoot_learning/#did-block-logs-get-created

Your logs indicate the BlockLog is being generated

{"severity":"info","time":1726158688.428434,"caller":"analyze/analyzer.go:658","function":"github.com/jlewi/foyle/app/pkg/analyze.buildBlockLog","message":"Building block log","blockId":"01J7KH6B5YJ111KEXAAHD97XY8","block":{"id":"01J7KH6B5YJ111KEXAAHD97XY8","gen_trace_id":"da7a66391bad2f64f6f66e056eed6772","exit_code":0,"eval_mode":false,"resource_version":"c7db69f7-869b-4439-92e2-09ba254bc364","suggestion_status":"SuggestionStatusUnknown"}}

So I suspect the blocklog is there so once we start using the updated CURL command we should be able to fetch it and see what's going on.

Note that the location of the troubleshooting guide has changed since I moved it into the docs website.

@sourishkrout
Copy link

  • The updated command is here

Where? Did you forget to include the link?

Here's an updated Runme session gist: https://gist.github.com/sourishkrout/babd0e9f32ad8079869d4534f8b3bb4b#file-troubleshoot_learning-01j7mayp580arr60njbes5fbq6-md

@jlewi
Copy link
Owner Author

jlewi commented Sep 13, 2024

@jlewi
Copy link
Owner Author

jlewi commented Sep 13, 2024

Oh great looks like you found the correct link sorry about that.

It looks like things are working now.

  • The output shows that you have "example.binpb"

So it looks like #233 fixed the issue with example.binpb files not being computed.

Are you observing something that leads you to think learning isn't working?
Are you just not observing the quality you'd expect?

@sourishkrout
Copy link

sourishkrout commented Sep 13, 2024

Sorry the example showing up in the output isn't from my machine. The cell is echoing the static content from the original doc since it's plain text. Should have skipped executing it.

The issue I'm observing is still the same where the previously accepted, corrected, and successfully executed does not seem to get retained as part of learning.

@jlewi
Copy link
Owner Author

jlewi commented Sep 13, 2024

Ahh my mistake.

So it looks like cell 01J7KQPBYCT9VM2KFBY48JC7J0 is your cell?

  1. it looks like the block log is available and indicates different commands
    • Generated command was jq
    • Executed was a CURl
  2. There is no log message corresponding to this line indicating a new examples was detected

So there are a couple possibilities

  1. The block is never enqueued
  2. The block is enqueued but it is never reconciled
  3. Reconcile is invoked but execution is aborted before reaching this line

Blocks get enqueued here by invoking the blockNotifier. That will not happen if the blockNotifier is nil.

Block notifier gets set here

Analyzer.Run is called here.

I can't see any reason why Learner.Enqueue wouldn't be getting invoked.

Next Steps

At this point the best way to probably debug this would be to set a couple breakpoints and see what code paths are getting executed. My suggestion would be to start as follows

  1. Set a breakpoint here (

    if a.blockNotifier != nil {

    • This will tell us whether the block is being enqueued
  2. Set a breakpoint in learner.Reconcile

    log := logs.FromContext(ctx)

Let me know if you want to pair.

jlewi added a commit that referenced this issue Sep 14, 2024
* We need to enqueue blocks in the Learner when we detect an execution
event
* If we don't then blocks don't get checked to see if we can use them
for learning.
* The reason we were occasionally seeing learning occur without this is
there's some small probabity
* The generation event for the block will be processed by an Analyzer
and enqueue the item in the learner
   * The execution event is processed by the Analyzer
   * The learner reconciles the generation event
* i.e. since the Analyzer and learner run in parallel its possible the
execution event gets detected before the learner has had a chance to
handle the block in which case the execution event will be part of the
blocklog when it is processed by the learner.

* We shouldn't notify the learner after processing a log entry about how
a block was generated.
* This is a legacy from when block generation was logged in Foyle logs
but cell executions were logged in RunMe logs
* In that case we didn't have any guarantee in the order in which block
generation and execution events would be processed
* Now that execution events are logged in Foyle Logs we can be
reasonably assured that log entries for cell execution will be logged
after the log for how the cell was generated. Therefore we can wait
until encountering an execution event to process the block in the
learner.

* This is most likely the most recent bug in #215
@jlewi
Copy link
Owner Author

jlewi commented Sep 14, 2024

@sourishkrout I found and fixed another bug #241 due to the logging changes. Fingers crossed that fixes things for you.

@sourishkrout
Copy link

🤞 testing this now.

@sourishkrout
Copy link

No dice. However, the additional cells in the troubleshooting guide might be helpful, @jlewi.

https://gist.github.com/sourishkrout/cceb035482da5da3ec5e4ce294da9207#file-troubleshoot_learning-01j80nc4fqpm7c5ffxyfpx5r5s-md

I'll try the breakpoints and see what happens. I could pair tmrw afternoon if that's easier.

@sourishkrout
Copy link

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

https://github.com/jlewi/foyle/blob/main/app/pkg/learn/learner.go#L158

@jlewi
Copy link
Owner Author

jlewi commented Sep 17, 2024

I'm sorry this is so frustrating. I really appreciate your patience.

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

Yes.

I think I know what is going on. "learner_enqueued_total" is only 1.
In #241 I changed when we notify the learner because we should only need to notify the learner after processing the executed log entry because that should necessarily be after the generated log entry.

The learner gets notified here. My assumption was that by the time the execution event gets processed the BlockLog has already been processed

But that's not guaranteed because we read the log lines in chunks and only update the blocklog after reading the entire chunk

a.combineAndCheckpoint(ctx, path, offset, traceIDs)

So why was it working for me and not you?

So here are some examples that were produced for me

-rw-r--r--    1 jlewi  staff  30077 Sep 16 18:24 01J7YMFGD47MEG6QN46NN0GXKQ.example.binpb
-rw-r--r--    1 jlewi  staff  33214 Sep 16 18:25 01J7YQ57WRP73B0ND35NZY3MVG.example.binpb
-rw-r--r--    1 jlewi  staff   7953 Sep 17 14:34 01J80YESW7NYV5N6DX0168HNAQ.example.binpb

Two things are true in my case

  • I ended up executing the cell multiple times because I needed to iterate to get to the correct command
  • I have debug logs enabled.

I think those two things contribute to the bug being masked. In particular, the bug means the first time the learner gets notified due to the first execution event the generated block won't be part of the log. The subsequent cell executions lead to additional learner notifications being fired. Debug logs increase the number of logs in between the two execution events. As a result, were more likely to hit the chunk limit in between execution events and therefore combine the log entries and update the BlockLog with the generated block.

jlewi added a commit that referenced this issue Sep 17, 2024
The problem is described in detail here
#215 (comment)

In #241 I changed where the learner gets notified of a block to process. Rather notifiying it when
updating a blockLog with the generated block, I moved the update to do it after an execution event
was processed.

My assumption was that by the time the execution event gets processed the BlockLog has already been processed.

But that's not guaranteed because we read the log lines in chunks and only update the blocklog with the
generated block after reading the entire chunk. However, the executed field of the blockLog is updated as soon
as the line is processed. As a result, we were notifying the learner after the first execution event at which
point the block log hasn't been updated with the generated log.

The solution is to notify the learner after execution events are processed and after generated events are processed. The learner will check if all required fields are present (i.e. both generated and executed block fields). If not the item is ignored.

Therefore to make our code less brittle and more robust we can just fire notifications to the learner after either event.

Related to #215
jlewi added a commit that referenced this issue Sep 17, 2024
The problem is described in detail here
#215 (comment)

In #241 I changed where the learner gets notified of a block to process.
Rather notifiying it when updating a blockLog with the generated block,
I moved the update to do it after an execution event was processed.

My assumption was that by the time the execution event gets processed
the BlockLog has already been processed.

But that's not guaranteed because we read the log lines in chunks and
only update the blocklog with the generated block after reading the
entire chunk. However, the executed field of the blockLog is updated as
soon as the line is processed. As a result, we were notifying the
learner after the first execution event at which point the block log
hasn't been updated with the generated log.

The solution is to notify the learner after execution events are
processed and after generated events are processed. The learner will
check if all required fields are present (i.e. both generated and
executed block fields). If not the item is ignored.

Therefore to make our code less brittle and more robust we can just fire
notifications to the learner after either event.

Related to #215
@jlewi
Copy link
Owner Author

jlewi commented Sep 17, 2024

@sourishkrout Latest fix is #244

@sourishkrout
Copy link

@sourishkrout Latest fix is #244

I can definitely see more activity in the logs now. However, my ~/.foyle/training remains empty, and learning still does not take effect. I made sure that in the example below, I would edit the cell a few times to achieve the desired outcome.

If you're available tomorrow afternoon, we could pair up to work through it. Looking over my shoulder might help, too.

https://gist.github.com/sourishkrout/f42a1b63578f4c95701b8ca332360caf#file-troubleshoot_learning-01j814mfadm7dw0m5b1jdgny3w-md

@jlewi
Copy link
Owner Author

jlewi commented Sep 18, 2024

So we're making progress. I think I found the next bug.

Your counters show that there were two learning examples.

learner_blocks_processed{status="learn"} 2
learner_blocks_processed{status="nochange"} 1
learner_blocks_processed{status="notgenerated"} 1
learner_blocks_processed{status="unexecuted"} 10

So we know the code reached here

cellsProcessed.WithLabelValues("learn").Inc()

If something went wrong in the subsequent processing we should have seen an error.

I think the bug is that exampleFiles is empty then the for loop never iterates and we don't try to create any exampleFiles

expectedFiles := l.getExampleFiles(b.GetId())

getExampleFiles will return an empty list if getTrainingDirs is empty
https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/learn/learner.go#L252C29-L252C44

getTrainingDirs returns empty if Learner in your configuration is empty

if c.Learner == nil {

This is legacy code. Prior to #211 you had to set the location of the RunMe logs and learner could be nil. Post #211 you shouldn't need to set learner.

My config looks like

learner:
    logDirs: []

Fix incoming.

jlewi added a commit that referenced this issue Sep 18, 2024
* The bug is described in
#215 (comment)

* The bug is that `Config.getTrainingDirs` returns no training
directories if config.learner == nil
* Prior to #211 config.learner
would be non-nil because we had to set the path of the RunMe logs
* However, now that we no longer depend on RunMe logs config.learner
could be nil and this would return no training directories. In which
case learner.Reconcile would not attempt to save any examples
* The fix is to allow config.Learner to be nil in GetTrainingDirs and
return a suitable default.

We also need to ensure the training directory gets created if it doesn't
exist.
* This is fixed by jlewi/monogo#23 which updates
LocalFileHelper to create the directory if it doesn't exist.
* My suspicion is that I never hit this bug because i originally created
my ~/.foyle/training directory using a version of the code which wasn't
using FileHelper and explicitly checked and created the directory. I
suspect when I refactored the code to support saving examples to GCS
thats when the code to ensure the directory exists got dropped.
@jlewi
Copy link
Owner Author

jlewi commented Sep 18, 2024

@sourishkrout latest fix is #245

@jlewi
Copy link
Owner Author

jlewi commented Sep 20, 2024

Two other issues we ran into during our pair programming
#249
#248

@jlewi
Copy link
Owner Author

jlewi commented Oct 2, 2024

@sourishkrout I'm marking this as fixed. Let me know if your still having problems.

@jlewi jlewi closed this as completed Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants