Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data 2816/get training job logs #534

Merged
merged 9 commits into from
Jul 23, 2024

Conversation

etai-shuchatowitz
Copy link
Member

@etai-shuchatowitz etai-shuchatowitz commented Jul 22, 2024

Context: https://viam.atlassian.net/browse/DATA-2816?atlOrigin=eyJpIjoiYjE3ZDE4NTk1MThhNDk0Mjk5NzcyOWMxMDYwNjUyZDkiLCJwIjoiaiJ9

This adds the endpoint to GetTrainingJobLogs with the following configs:

message TrainingJobLogEntry {
  string level = 1;
  google.protobuf.Timestamp time = 2;
  string message = 3;
}

message GetTrainingJobLogsRequest {
  string id = 1;
  optional string page_token = 2;
}

message GetTrainingJobLogsResponse {
  repeated TrainingJobLogEntry logs = 1;
  string next_page_token = 2;
}

App PR: https://github.com/viamrobotics/app/pull/5433

@etai-shuchatowitz etai-shuchatowitz added the ready-for-protos add this when you want protos to compile on every commit label Jul 22, 2024
@github-actions github-actions bot added the safe to test committer is a member of this org label Jul 22, 2024
@etai-shuchatowitz etai-shuchatowitz added ready-for-protos add this when you want protos to compile on every commit and removed ready-for-protos add this when you want protos to compile on every commit labels Jul 22, 2024
@etai-shuchatowitz etai-shuchatowitz added safe to test committer is a member of this org and removed safe to test committer is a member of this org labels Jul 22, 2024
@@ -134,3 +137,19 @@ message DeleteCompletedTrainingJobRequest {
}

message DeleteCompletedTrainingJobResponse {}

message TrainingJobLogEntry {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in the scope we used the common log entry [0] but here we redefine it. Is there a reason why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's actually a whole slew of differences between them and us. Namely, these are the fields we have available to us:

{
  "insertId": "1ie943xfro8sio",
  "jsonPayload": {
	"attrs": {
  	"tag": "workerpool0-0"
	},
	"levelname": "ERROR",
	"message": "2024-07-09 19:47:33.018974: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero"
  },
  "labels": {
	"compute.googleapis.com/resource_id": "6425219297779396081",
	"compute.googleapis.com/resource_name": "cmle-training-12984007680247343913",
	"compute.googleapis.com/zone": "us-central1-f",
	"ml.googleapis.com/job_id/log_area": "root",
	"ml.googleapis.com/tpu_worker_id": "",
	"ml.googleapis.com/trial_id": "",
	"ml.googleapis.com/trial_type": ""
  },
  "logName": "projects/staging-cloud-web-app/logs/workerpool0-0",
  "receiveTimestamp": "2024-07-09T19:48:00.02068423Z",
  "resource": {
	"labels": {
  	"job_id": "6458394034401443840",
  	"project_id": "staging-cloud-web-app",
  	"task_name": "workerpool0-0"
	},
	"type": "ml_job"
  },
  "severity": "ERROR",
  "timestamp": "2024-07-09T19:47:56.597393583Z"
}

If you compare this with the LogEntry which looks like

type LogEntry struct {
	state         protoimpl.MessageState
	sizeCache     protoimpl.SizeCache
	unknownFields protoimpl.UnknownFields

	Host       string                 `protobuf:"bytes,1,opt,name=host,proto3" json:"host,omitempty"`
	Level      string                 `protobuf:"bytes,2,opt,name=level,proto3" json:"level,omitempty"`
	Time       *timestamppb.Timestamp `protobuf:"bytes,3,opt,name=time,proto3" json:"time,omitempty"`
	LoggerName string                 `protobuf:"bytes,4,opt,name=logger_name,json=loggerName,proto3" json:"logger_name,omitempty"`
	Message    string                 `protobuf:"bytes,5,opt,name=message,proto3" json:"message,omitempty"`
	Caller     *structpb.Struct       `protobuf:"bytes,6,opt,name=caller,proto3" json:"caller,omitempty"`
	Stack      string                 `protobuf:"bytes,7,opt,name=stack,proto3" json:"stack,omitempty"`
	Fields     []*structpb.Struct     `protobuf:"bytes,8,rep,name=fields,proto3" json:"fields,omitempty"`
}

we don't have a Stack, Fields, Caller, Host, LoggerName.

We could use the same type and just pass in the fields we have, but I figured it was worth making our own object so that we would have our own control over what it looked like.

Like, if GCP changes what they offer us, we can change our type no problemo.

But curious to hear your thoughts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the increased control over training logs. My main concern was related to if we made a tool/feature for robot logs and wanted to use it for training logs, but I think there are reasonable ways around that and also not a concern for now.

Copy link
Member

@tahiyasalam tahiyasalam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@etai-shuchatowitz etai-shuchatowitz merged commit 92f7346 into main Jul 23, 2024
6 checks passed
@etai-shuchatowitz etai-shuchatowitz deleted the DATA-2816/get-training-job-logs branch July 23, 2024 16:54
maxhorowitz added a commit that referenced this pull request Sep 16, 2024
* main: (22 commits)
  DATA-3094: Add CLI args to SubmitCustomTrainingJobRequest (#550)
  bump min golang version (#548)
  DATA-3054: Added UpdateBoundingBox (#547)
  RSDK-8595 update docs with new behavior (#546)
  APP-3604 build endpoints for external oauth and repos (#544)
  add last updated to fragment proto (#540)
  APP-5851: Add optional visibility field to `CreateFragmentRequest` (#543)
  APP-5314 - add dates and limit to GetRobotPartLogsRequest (#538)
  RSDK-7903 Add unhealthy state and error field (#539)
  RSDK-8381 add viam server version to API (#537)
  [APP-5417]: Undo APP-5417 (#536)
  [APP-5417]: Accommodate pagination for GetRobotPartHistory in api (#535)
  Data 2816/get training job logs (#534)
  DATA-2792 - Add URL to UpdateRegistryItemRequest, but make it optional (#532)
  RSDK-8228 Add config status to machine status API (#531)
  Add revision field to config (#530)
  [APP-5420] Add ListMachineFragments request and response (#525)
  [APP-5400]: Add GetFragmentHistory proto (#524)
  RSDK-7899 Add MachineStatus to RobotService (#527)
  RSDK-8083: add LogPatternConfig field to RobotConfig API (#526)
  ...
maxhorowitz added a commit that referenced this pull request Sep 16, 2024
* main: (22 commits)
  DATA-3094: Add CLI args to SubmitCustomTrainingJobRequest (#550)
  bump min golang version (#548)
  DATA-3054: Added UpdateBoundingBox (#547)
  RSDK-8595 update docs with new behavior (#546)
  APP-3604 build endpoints for external oauth and repos (#544)
  add last updated to fragment proto (#540)
  APP-5851: Add optional visibility field to `CreateFragmentRequest` (#543)
  APP-5314 - add dates and limit to GetRobotPartLogsRequest (#538)
  RSDK-7903 Add unhealthy state and error field (#539)
  RSDK-8381 add viam server version to API (#537)
  [APP-5417]: Undo APP-5417 (#536)
  [APP-5417]: Accommodate pagination for GetRobotPartHistory in api (#535)
  Data 2816/get training job logs (#534)
  DATA-2792 - Add URL to UpdateRegistryItemRequest, but make it optional (#532)
  RSDK-8228 Add config status to machine status API (#531)
  Add revision field to config (#530)
  [APP-5420] Add ListMachineFragments request and response (#525)
  [APP-5400]: Add GetFragmentHistory proto (#524)
  RSDK-7899 Add MachineStatus to RobotService (#527)
  RSDK-8083: add LogPatternConfig field to RobotConfig API (#526)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-protos add this when you want protos to compile on every commit safe to test committer is a member of this org
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants