This page lists the component-level metrics emitted by Diego. Contributors interested in adding new metrics should see our contributor doc for the code conventions we follow.

Metric | Description | Unit |
---|---|---|
AuctioneerFailedCellStateRequests | Cumulative number of cells the auctioneer failed to query for state. Emitted during each auction. | number |
AuctioneerFetchStatesDuration | Time the auctioneer took to fetch state from all the cells when running its auction. Emitted during each auction. | ns |
AuctioneerLRPAuctionsFailed | Cumulative number of LRP instances that the auctioneer failed to place on Diego cells. Emitted during each auction. | number |
AuctioneerLRPAuctionsStarted | Cumulative number of LRP instances that the auctioneer successfully placed on Diego cells. Emitted during each auction. | number |
AuctioneerTaskAuctionsFailed | Cumulative number of Tasks that the auctioneer failed to place on Diego cells. Emitted during each auction. | number |
AuctioneerTaskAuctionsStarted | Cumulative number of Tasks that the auctioneer successfully placed on Diego cells. Emitted during each auction. | number |
LockHeld | Whether an auctioneer holds the auctioneer lock (in Locket): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active auctioneer. | 0 or 1 (boolean) |
LockHeld.v1-locks-auctioneer_lock | Whether an auctioneer holds the auctioneer lock (in Consul): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active auctioneer. | 0 or 1 (boolean) |
LockHeldDuration.v1-locks-auctioneer_lock | Time the active auctioneer has held the auctioneer lock. Emitted periodically by the active auctioneer. | ns |
RequestCount | Cumulative number of requests the auctioneer has handled through its API. Emitted periodically. | number |
RequestLatency | Time the auctioneer took to handle requests to its API endpoints. Emitted when the auctioneer handles requests. | ns |
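
Several of the auctioneer metrics above are cumulative counters, so a consumer usually cares about the change between two samples rather than the raw value. The following is a minimal sketch of that calculation, not part of Diego: the function name `counter_delta` and the sample values are hypothetical, and it assumes that a counter which goes down was reset by a process restart.

```python
def counter_delta(previous: int, current: int) -> int:
    """Return the increase of a cumulative counter between two samples.

    Assumption: if the counter went down, the emitting process restarted
    and began counting from zero, so the current value is the increase.
    """
    if current < previous:
        return current
    return current - previous


# Hypothetical AuctioneerLRPAuctionsFailed samples from successive auctions.
samples = [12, 12, 15, 2]
deltas = [counter_delta(a, b) for a, b in zip(samples, samples[1:])]
print(deltas)  # [0, 3, 2]
```

A sustained non-zero delta for a failure counter is generally more informative for alerting than the absolute total, which only ever grows between restarts.
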

Metric | Description | Unit |
---|---|---|
BBSMasterElected | Emitted once when the BBS is elected as master. | number (always 1) |
ConvergenceLRPDuration | Time the BBS took to run the entire LRP convergence pass. Emitted periodically. | ns |
ConvergenceLRPRuns | Cumulative number of times the BBS has run its LRP convergence pass. Emitted periodically. | number |
ConvergenceTaskDuration | Time the BBS took to run the entire Task convergence pass. Emitted periodically. | ns |
ConvergenceTaskRuns | Cumulative number of times the BBS has run its Task convergence pass. Emitted periodically. | number |
ConvergenceTasksKicked | Cumulative number of times the BBS has updated a Task during its Task convergence pass. Emitted periodically. | number |
ConvergenceTasksPruned | Cumulative number of times the BBS has deleted a malformed Task Definition during its Task convergence pass. Emitted periodically. | number |
CrashedActualLRPs | Total number of LRP instances that have crashed. Emitted periodically. | number |
CrashingDesiredLRPs | Total number of DesiredLRPs that have at least one crashed instance. Emitted periodically. | number |
DBOpenConnections | Number of open connections to the SQL database. Emitted every 60 seconds. | number |
DBQueriesFailed | Cumulative number of SQL queries that failed. Emitted every 60 seconds. | number |
DBQueriesInFlight | Maximum number of concurrent in-flight queries in the last 60 seconds. Emitted every 60 seconds. | number |
DBQueriesTotal | Cumulative number of SQL queries executed, including BEGIN, COMMIT, and ROLLBACK statements. Emitted every 60 seconds. | number |
DBQueriesSucceeded | Cumulative number of SQL queries that finished successfully. Emitted every 60 seconds. | number |
DBQueryDurationMax | Maximum duration of all queries that have run in the last 60 seconds. Emitted every 60 seconds. | ns |
DBWaitDuration | Total time blocked waiting for a new connection. Emitted every 60 seconds. | ns |
DBWaitCount | Total number of connections waited for. Emitted every 60 seconds. | number |
Domain.<domain-name> | Whether the <domain-name> domain is up-to-date, so that instances from that domain have been synchronized with DesiredLRPs for Diego to run. 1 means the domain is up-to-date; no data means it is not. Emitted periodically. | always 1 when present |
EncryptionDuration | Time the BBS took to ensure all BBS records are encrypted with the current active encryption key. Emitted each time a BBS becomes the active master. | ns |
LRPsClaimed | Total number of LRP instances that have been claimed by some cell. Emitted periodically. | number |
LRPsDesired | Total number of LRP instances desired across all LRPs. Emitted periodically. | number |
LRPsExtra | Total number of LRP instances that are no longer desired but still have a BBS record. Emitted periodically. | number |
LRPsMissing | Total number of LRP instances that are desired but have no record in the BBS. Emitted periodically. | number |
LRPsRunning | Total number of LRP instances that are running on cells. Emitted periodically. | number |
LRPsUnclaimed | Total number of LRP instances that have not yet been claimed by a cell. Emitted periodically. | number |
LockHeld | Whether a BBS holds the BBS lock (in Locket): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active BBS server. | 0 or 1 (boolean) |
LockHeld.v1-locks-bbs_lock | Whether a BBS holds the BBS lock (in Consul): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active BBS server. | 0 or 1 (boolean) |
LockHeldDuration.v1-locks-bbs_lock | Time the active BBS has held the BBS lock (in Consul). Emitted periodically by the active BBS server. | ns |
MigrationDuration | Time the BBS took to run migrations against its persistence store. Emitted each time a BBS becomes the active master. | ns |
OpenFileDescriptors | Current (non-cumulative) number of open file descriptors held by the BBS. Emitted periodically. | number |
PresentCells | Total number of cells that are maintaining presence with Locket. Emitted periodically. | number |
RequestCount | Cumulative number of requests the BBS has handled through its API. Emitted periodically. | number |
RequestLatency | Maximum amount of time the BBS took to handle a request to one of its API endpoints over a 60-second interval. Emitted every 60 seconds. | ns |
SuspectCells | Total number of cells that are not maintaining their presence with Locket but for which the BBS has a record of at least one ActualLRP. Emitted periodically. | number |
SuspectClaimedActualLRPs | Total number of Suspect LRP instances that have been claimed by some cell. Emitted periodically. | number |
SuspectRunningActualLRPs | Total number of Suspect LRP instances that are running on cells. Emitted periodically. | number |
TasksCompleted | Total number of Tasks that have completed. Emitted periodically. | number |
TasksPending | Total number of Tasks that have not yet been placed on a cell. Emitted periodically. | number |
TasksResolving | Total number of Tasks locked for deletion. Emitted periodically. | number |
TasksRunning | Total number of Tasks running on cells. Emitted periodically. | number |
TasksSucceeded | Cumulative number of Tasks that completed successfully. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |
TasksFailed | Cumulative number of Tasks that failed. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |
TasksStarted | Cumulative number of Tasks that have started so far. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |
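
Duration metrics in these tables are reported in nanoseconds (`ns`), which is awkward to read on a dashboard. The sketch below shows one way to convert them to seconds for display. The helper name `ns_to_seconds` and the sample values are hypothetical and only illustrate the unit conversion.

```python
NS_PER_SECOND = 1_000_000_000


def ns_to_seconds(ns: int) -> float:
    """Convert a duration in nanoseconds to seconds."""
    return ns / NS_PER_SECOND


# Hypothetical samples, keyed by metric name, with values in nanoseconds.
sample = {
    "ConvergenceLRPDuration": 2_500_000_000,
    "RequestLatency": 40_000_000,
}

for name, ns in sample.items():
    print(f"{name}: {ns_to_seconds(ns):.3f} s")
```
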

Metric | Description | Unit |
---|---|---|
ActiveLocks | Total number of active locks. Emitted periodically. | number |
ActivePresences | Total number of active presences. Emitted periodically. | number |
DBOpenConnections | Number of open connections to the SQL database. Emitted every 60 seconds. | number |
DBQueriesFailed | Cumulative number of SQL queries that failed. Emitted every 60 seconds. | number |
DBQueriesInFlight | Maximum number of concurrent in-flight queries in the last 60 seconds. Emitted every 60 seconds. | number |
DBQueriesTotal | Cumulative number of SQL queries executed, including BEGIN, COMMIT, and ROLLBACK statements. Emitted every 60 seconds. | number |
DBQueriesSucceeded | Cumulative number of SQL queries that finished successfully. Emitted every 60 seconds. | number |
DBQueryDurationMax | Maximum duration of all queries that have run in the last 60 seconds. Emitted every 60 seconds. | ns |
LocksExpired | Cumulative number of locks that have expired. Emitted every 60 seconds. | number |
PresenceExpired | Cumulative number of presences that have expired. Emitted every 60 seconds. | number |
RequestsCancelled | Cumulative number of requests of a particular type that have been cancelled by the client. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
RequestsStarted | Cumulative number of requests of a particular type that have been made. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
RequestsSucceeded | Cumulative number of requests of a particular type that have completed successfully. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
RequestsFailed | Cumulative number of requests of a particular type that have failed for any reason. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
RequestsInFlight | Number of requests of a particular type currently being handled by Locket. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
RequestLatencyMax | Maximum request latency emitted by a request of a particular type in the last 60 seconds. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |

Metric | Description | Unit |
---|---|---|
AppInstanceExceededLogRateLimitCount | Number of application instances that have exceeded the app log rate limit. Emitted once for each application instance that exceeds the log rate limit within the last 5-minute interval. Only emitted if an app log rate limit has been set and an app instance has exceeded that limit. | number |
CapacityAllocatedDisk | Amount of disk allocated to containers on this cell. Emitted periodically. | mebibytes |
CapacityAllocatedMemory | Amount of memory allocated to containers on this cell. Emitted periodically. | mebibytes |
CapacityRemainingContainers | Remaining number of containers this cell can host. Emitted periodically. | number |
CapacityRemainingDisk | Remaining amount of disk available for this cell to allocate to containers. Emitted periodically. | mebibytes |
CapacityRemainingMemory | Remaining amount of memory available for this cell to allocate to containers. Emitted periodically. | mebibytes |
CapacityTotalContainers | Total number of containers this cell can host. Emitted periodically. | number |
CapacityTotalDisk | Total amount of disk available for this cell to allocate to containers. Emitted periodically. | mebibytes |
CapacityTotalMemory | Total amount of memory available for this cell to allocate to containers. Emitted periodically. | mebibytes |
CellUnhealthy | Whether the cell has reached the healthcheck timeout against the Garden backend. 1 signifies unhealthy. Emitted once. | 1 |
ContainerCompletedCount | Number of containers that exited on this cell. Emitted after a container exits. | number |
ContainerCount | Number of containers hosted on the cell. Emitted periodically. | number |
ContainerExitedOnTimeoutCount | Number of containers on this cell that exited after the graceful shutdown interval. Emitted after a container exits. | number |
ContainerUsageDisk | Amount of disk used by containers on this cell. Emitted periodically. | mebibytes |
ContainerUsageMemory | Amount of memory used by containers on this cell. Emitted periodically. | mebibytes |
CredCreationFailedCount | Count of failed instance identity credential creations. Emitted after every failed credential creation. | number |
CredCreationSucceededCount | Count of successful instance identity credential creations. Emitted after every successful credential creation. | number |
CredCreationSucceededDuration | Time the rep took to create instance identity credentials. Emitted after every successful credential creation. | ns |
C2CCredCreationFailedCount | Count of failed C2C credential creations. Emitted after every failed credential creation. | number |
C2CCredCreationSucceededCount | Count of successful C2C credential creations. Emitted after every successful credential creation. | number |
C2CCredCreationSucceededDuration | Time the rep took to create C2C credentials. Emitted after every successful credential creation. | ns |
ContainerSetupSucceededDuration | Time the rep took to set up a container with the Garden backend. Emitted after every successful container setup. | ns |
ContainerSetupFailedDuration | Time the rep took to set up a container with the Garden backend. Emitted after every failed container setup. | ns |
GardenContainerCreationFailedDuration | Time the rep's Garden backend took to create a container. Emitted after every failed container creation. | ns |
GardenContainerCreationSucceededDuration | Time the rep's Garden backend took to create a container. Emitted after every successful container creation. | ns |
GardenContainerDestructionFailedDuration | Time the rep's Garden backend took to destroy a container. Emitted after every failed container destruction. | ns |
GardenContainerDestructionSucceededDuration | Time the rep's Garden backend took to destroy a container. Emitted after every successful container destruction. | ns |
GardenHealthCheckFailed | Whether the cell has failed to pass its healthcheck against the Garden backend. 0 signifies healthy, and 1 signifies unhealthy. Emitted periodically. | 0 or 1 (boolean) |
RepBulkSyncDuration | Time the cell rep took to synchronize the ActualLRPs it has claimed with its actual Garden containers. Emitted periodically by each rep. | ns |
RequestsStarted | Cumulative number of requests of a particular type that have been made. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
RequestsSucceeded | Cumulative number of requests of a particular type that have completed successfully. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
RequestsFailed | Cumulative number of requests of a particular type that have failed for any reason. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
RequestsInFlight | Number of requests of a particular type that are currently in flight on the rep. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
RequestLatencyMax | Maximum request latency emitted by a request of a particular type in the last 60 seconds. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
StalledGardenDuration | Time the rep spends waiting on its Garden backend to become healthy during startup. Emitted only if Garden is not responsive when the rep starts up. | ns |
StartingContainerCount | Number of containers currently in a Reserved, Initializing, or Created state. Emitted periodically. | number |
StrandedEvacuatingActualLRPs | Number of evacuating ActualLRPs that timed out during the evacuation process. Emitted when evacuation does not complete successfully. | number |
VolmanMountDuration | Time volman took to mount a volume. Emitted by each rep when volumes are mounted. | ns |
VolmanMountDurationFor | Time volman took to mount a volume with a specific volume driver. Emitted by each rep when volumes are mounted. | ns |
VolmanMountErrors | Count of failed volume mounts. Emitted periodically by each rep. | number |
VolmanUnmountDuration | Time volman took to unmount a volume. Emitted by each rep when volumes are unmounted. | ns |
VolmanUnmountDurationFor | Time volman took to unmount a volume with a specific volume driver. Emitted by each rep when volumes are unmounted. | ns |
VolmanUnmountErrors | Count of failed volume unmounts. Emitted periodically by each rep. | number |
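
The capacity metrics above are per-cell values in mebibytes, so a common derived value is the fraction of a cell's memory or disk that is still available to allocate. Here is a minimal sketch of that calculation. The function name `remaining_fraction` and the sample numbers are hypothetical, and how the samples are collected is left out.

```python
def remaining_fraction(remaining_mib: float, total_mib: float) -> float:
    """Return the share of a cell's capacity that is still allocatable."""
    if total_mib <= 0:
        raise ValueError("total capacity must be positive")
    return remaining_mib / total_mib


# Hypothetical sample for one cell, values in mebibytes.
cell = {"CapacityTotalMemory": 32768, "CapacityRemainingMemory": 8192}

frac = remaining_fraction(
    cell["CapacityRemainingMemory"], cell["CapacityTotalMemory"]
)
print(f"{frac:.0%} of memory still allocatable")  # 25% of memory still allocatable
```

The same helper applies to `CapacityRemainingDisk` with `CapacityTotalDisk`, and to `CapacityRemainingContainers` with `CapacityTotalContainers`.
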

Metric | Description | Unit |
---|---|---|
AddressCollisions | Number of detected conflicting routes. A conflicting route is a set of two distinct instances with the same IP address on the routing table. | number |
ConsulDownMode | Whether the route-emitter is able to connect to Consul correctly. | 0 or 1 (boolean) |
HTTPRouteCount | Number of HTTP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically when the emitter is in local mode. | number |
HTTPRouteNATSMessagesEmitted | Cumulative number of HTTP routing messages the route-emitter sends over NATS to the gorouter. | number |
InternalRouteNATSMessagesEmitted | Cumulative number of internal routing messages the route-emitter sends over NATS to the service discovery controller. | number |
LockHeld.v1-locks-route_emitter_lock | Whether a route-emitter holds its Consul lock: 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active route-emitter. | 0 or 1 (boolean) |
LockHeldDuration.v1-locks-route_emitter_lock | Time the active route-emitter has held the Consul lock. Emitted periodically by the active route-emitter. | ns |
RouteEmitterSyncDuration | Time the route-emitter took to perform its synchronization pass. Emitted periodically. | ns |
RoutesRegistered | Cumulative number of NATS route registrations emitted from the route-emitter as it reacts to changes to LRPs. | number |
RoutesSynced | Cumulative number of route registrations emitted from the route-emitter during its periodic route-table emission. | number |
RoutesTotal | Number of combined HTTP and TCP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically. | number |
RoutesUnregistered | Cumulative number of NATS route unregistrations emitted from the route-emitter as it reacts to changes to LRPs. | number |
TCPRouteCount | Number of TCP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically when the emitter is in local mode. | number |

Metric | Description | Unit |
---|---|---|
ssh-connections | Total number of SSH connections an SSH proxy has established. Emitted periodically by each SSH proxy. | number |

These metrics are automatically emitted by all Diego components.

Metric | Description | Unit |
---|---|---|
memoryStats.lastGCPauseTimeNS | Amount of time the Golang process paused for garbage collection. | ns |
memoryStats.numBytesAllocatedHeap | Number of bytes the Golang process has allocated on the heap. | bytes |
memoryStats.numBytesAllocatedStack | Number of bytes the Golang process has allocated on the stack. | bytes |
numGoRoutines | Number of goroutines the Golang process is running. | number |