-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data 2816/get training job logs #534
Conversation
…16/get-training-job-logs
@@ -134,3 +137,19 @@ message DeleteCompletedTrainingJobRequest { | |||
} | |||
|
|||
message DeleteCompletedTrainingJobResponse {} | |||
|
|||
message TrainingJobLogEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed in the scope we used the common log entry [0] but here we redefine it. Is there a reason why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's actually a whole slew of differences between them and us. Namely, these are the fields we have available to us:
{
"insertId": "1ie943xfro8sio",
"jsonPayload": {
"attrs": {
"tag": "workerpool0-0"
},
"levelname": "ERROR",
"message": "2024-07-09 19:47:33.018974: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero"
},
"labels": {
"compute.googleapis.com/resource_id": "6425219297779396081",
"compute.googleapis.com/resource_name": "cmle-training-12984007680247343913",
"compute.googleapis.com/zone": "us-central1-f",
"ml.googleapis.com/job_id/log_area": "root",
"ml.googleapis.com/tpu_worker_id": "",
"ml.googleapis.com/trial_id": "",
"ml.googleapis.com/trial_type": ""
},
"logName": "projects/staging-cloud-web-app/logs/workerpool0-0",
"receiveTimestamp": "2024-07-09T19:48:00.02068423Z",
"resource": {
"labels": {
"job_id": "6458394034401443840",
"project_id": "staging-cloud-web-app",
"task_name": "workerpool0-0"
},
"type": "ml_job"
},
"severity": "ERROR",
"timestamp": "2024-07-09T19:47:56.597393583Z"
}
If you compare this with the LogEntry
which looks like
type LogEntry struct {
state protoimpl.MessageState
sizeCache protoimpl.SizeCache
unknownFields protoimpl.UnknownFields
Host string `protobuf:"bytes,1,opt,name=host,proto3" json:"host,omitempty"`
Level string `protobuf:"bytes,2,opt,name=level,proto3" json:"level,omitempty"`
Time *timestamppb.Timestamp `protobuf:"bytes,3,opt,name=time,proto3" json:"time,omitempty"`
LoggerName string `protobuf:"bytes,4,opt,name=logger_name,json=loggerName,proto3" json:"logger_name,omitempty"`
Message string `protobuf:"bytes,5,opt,name=message,proto3" json:"message,omitempty"`
Caller *structpb.Struct `protobuf:"bytes,6,opt,name=caller,proto3" json:"caller,omitempty"`
Stack string `protobuf:"bytes,7,opt,name=stack,proto3" json:"stack,omitempty"`
Fields []*structpb.Struct `protobuf:"bytes,8,rep,name=fields,proto3" json:"fields,omitempty"`
}
we don't have a Stack, Fields, Caller, Host, LoggerName.
We could use the same type and just pass in the fields we have, but I figured it was worth making our own object so that we would have our own control over what it looked like.
Like, if GCP changes what they offer us, we can change our type no problemo.
But curious to hear your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the increased control over training logs. My main concern was related to if we made a tool/feature for robot logs and wanted to use it for training logs, but I think there are reasonable ways around that and also not a concern for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 👍
* main: (22 commits) DATA-3094: Add CLI args to SubmitCustomTrainingJobRequest (#550) bump min golang version (#548) DATA-3054: Added UpdateBoundingBox (#547) RSDK-8595 update docs with new behavior (#546) APP-3604 build endpoints for external oauth and repos (#544) add last updated to fragment proto (#540) APP-5851: Add optional visibility field to `CreateFragmentRequest` (#543) APP-5314 - add dates and limit to GetRobotPartLogsRequest (#538) RSDK-7903 Add unhealthy state and error field (#539) RSDK-8381 add viam server version to API (#537) [APP-5417]: Undo APP-5417 (#536) [APP-5417]: Accommodate pagination for GetRobotPartHistory in api (#535) Data 2816/get training job logs (#534) DATA-2792 - Add URL to UpdateRegistryItemRequest, but make it optional (#532) RSDK-8228 Add config status to machine status API (#531) Add revision field to config (#530) [APP-5420] Add ListMachineFragments request and response (#525) [APP-5400]: Add GetFragmentHistory proto (#524) RSDK-7899 Add MachineStatus to RobotService (#527) RSDK-8083: add LogPatternConfig field to RobotConfig API (#526) ...
* main: (22 commits) DATA-3094: Add CLI args to SubmitCustomTrainingJobRequest (#550) bump min golang version (#548) DATA-3054: Added UpdateBoundingBox (#547) RSDK-8595 update docs with new behavior (#546) APP-3604 build endpoints for external oauth and repos (#544) add last updated to fragment proto (#540) APP-5851: Add optional visibility field to `CreateFragmentRequest` (#543) APP-5314 - add dates and limit to GetRobotPartLogsRequest (#538) RSDK-7903 Add unhealthy state and error field (#539) RSDK-8381 add viam server version to API (#537) [APP-5417]: Undo APP-5417 (#536) [APP-5417]: Accommodate pagination for GetRobotPartHistory in api (#535) Data 2816/get training job logs (#534) DATA-2792 - Add URL to UpdateRegistryItemRequest, but make it optional (#532) RSDK-8228 Add config status to machine status API (#531) Add revision field to config (#530) [APP-5420] Add ListMachineFragments request and response (#525) [APP-5400]: Add GetFragmentHistory proto (#524) RSDK-7899 Add MachineStatus to RobotService (#527) RSDK-8083: add LogPatternConfig field to RobotConfig API (#526) ...
Context: https://viam.atlassian.net/browse/DATA-2816?atlOrigin=eyJpIjoiYjE3ZDE4NTk1MThhNDk0Mjk5NzcyOWMxMDYwNjUyZDkiLCJwIjoiaiJ9
This adds the endpoint to GetTrainingJobLogs with the following configs:
App PR: https://github.com/viamrobotics/app/pull/5433