Rollbar Handbook
In this guide we will cover the basic concepts behind Rollbar, show you how to manage errors in the system, and give general guidance for Point Guards who need to curate messages in the system.
In order to manage errors in Rollbar you will need an account. If you have recently joined, ask someone for access.
Rollbar is essentially a logging system that records errors, tallies them, and helps us understand the state of our system from the perspective of failure scenarios. It also includes tools that allow us to watch, resolve, and change the criticality of error messages. Used correctly, it lets us quickly identify, track, and fix errors in our production system.
Before you begin managing our error reporting through Rollbar, you must first understand the levels of criticality it provides. The levels, in order of priority, are:
- Critical: Fatal errors that cause servers (or the entire app) to go down; these should be addressed immediately.
- Error: Unexpected failures (most 500s); these should be addressed the same day.
- Warn: Expected and recoverable failures (e.g. most 400s, never 500s); these can usually be ignored. All services should be fault tolerant to deployments or intermittent network failures.
- Info: Information about the service; these should never be reported.
- Debug: The lowest level. Information about the service; these should never be reported.
Notice that these levels are very similar to those of our production logging. This is no accident, as Rollbar is essentially a log tallying system that provides a focused view on errors. Keep this in mind when managing errors, as it will help you decide which errors should be marked as warnings and which should be kept at error or critical.
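As a rough illustration of the level guidance above, the sketch below maps an HTTP response status to a level and decides whether that level should be reported at all. The names `Level`, `level_for_status`, and `should_report` are hypothetical and not part of Rollbar or our codebase, and the numeric values only encode the priority order.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels from the list above; a higher value means higher priority."""
    DEBUG = 10
    INFO = 20
    WARNING = 30
    ERROR = 40
    CRITICAL = 50


def level_for_status(status: int) -> Level:
    """Pick a reporting level for an HTTP response status (hypothetical helper)."""
    if status >= 500:
        # Unexpected failure: most 500s belong at the error level.
        return Level.ERROR
    if status >= 400:
        # Expected, recoverable failure: most 400s are warnings.
        return Level.WARNING
    # Anything else is ordinary information about the service.
    return Level.INFO


def should_report(level: Level) -> bool:
    """Info and debug stay in the logs and are never reported."""
    return level >= Level.WARNING
```

Critical is deliberately not derived from a status code here, since the guide reserves it for failures that take down a server or the whole app.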
As of March 9th, 2016 we're following a Stranger Danger policy. All new critical and error items must be identified, triaged, assigned, and fixed. Fixes include bug fixes, removing error conditions, changing the error level in code, etc. Each project must be triaged by the lead once a week.
As a project lead you will be responsible for managing and curating the errors in Rollbar. Since each of our projects reports every error it encounters to Rollbar, it is hard to separate the signal from the noise. To help, we limit the default views to the last 15 days of production.
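To make the default view concrete, here is a minimal sketch of the same filter applied to a list of items: keep only those seen in production within the last 15 days. The `Item` class and its fields are hypothetical stand-ins rather than Rollbar's actual data shape, and the environment name `"production"` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Item:
    """Hypothetical stand-in for one Rollbar item; not the real API shape."""
    title: str
    environment: str
    last_seen: datetime


def default_view(items, now, days=15, environment="production"):
    """Keep items seen in `environment` within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [
        item for item in items
        if item.environment == environment and item.last_seen >= cutoff
    ]
```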
As the lead you will be responsible for performing the following actions:
- Categorize any new items to the correct level (critical, error, warning, etc.)
- Attempt to discern any correlations to recent deployments on the project by cross-referencing with GitHub
- Mark "resolved" any known spurious issues
- Assign all critical and error level items for immediate attention
- Keep the Rollbar token and library up to date for your project
To start you can adopt the following basic script to use when performing your duties:
- Log in to Rollbar and access the "Dashboard" for the API
- Note any active production errors
- Investigate any new errors first; they might represent critical failures in the system
- Determine if any of the new errors being reported are non-serious and set their level
- Assign issues to responsible parties or yourself
The workflow given above is only one of many possible scripts one could follow to get the job done. As you gain more experience with the tool and our system you will probably find a method that fits you best.
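One way to picture the script above is as an ordering over the items you are about to review: new items first, then by level priority, followed by a check for urgent items that still have no owner. This is only a sketch. The `TriageItem` record and its field names are assumptions made for illustration and do not mirror Rollbar's API.

```python
from dataclasses import dataclass
from typing import Optional

# Priority order from the levels list: a lower rank is handled first.
LEVEL_RANK = {"critical": 0, "error": 1, "warning": 2, "info": 3, "debug": 4}


@dataclass
class TriageItem:
    """Hypothetical record for one Rollbar item; field names are assumptions."""
    title: str
    level: str
    is_new: bool = False
    assignee: Optional[str] = None


def triage_queue(items):
    """Order items for review: new items first, then by level priority."""
    return sorted(items, key=lambda i: (not i.is_new, LEVEL_RANK[i.level]))


def unassigned_urgent(items):
    """Return critical and error items that still need an owner."""
    return [
        i for i in items
        if i.level in ("critical", "error") and i.assignee is None
    ]
```

Sorting new items ahead of older ones, even older critical ones, follows the advice to investigate new errors first. Swap the two parts of the sort key if your project prefers strict level order.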
And that's basically it. The tool is pretty good so it makes the job of exploring and getting a grasp on the errors a piece of cake. All you have to do is make sure you're not skimping on your duty and get the job done! The rest of this article contains a reference (with images) for each of the major aspects of the tool.
When you first log in to Rollbar you should see a page that looks like this:
This is the Rollbar dashboard and it shows you basic trends about errors for a particular project in our infrastructure. From the dashboard you have the following basic actions you can perform:
- View trend information for project errors
- Dig deeper into a particular error
- Monitor account usage
The dashboard will only show errors for one project at a time. As a Point Guard you will need to monitor errors across all of the projects in our infrastructure. To do so use the project selection dropdown near the top left of the page.
The Rollbar interface uses dark red to denote errors and yellow to denote warnings.
Dotted throughout the interface you will find "occurrences charts". These charts give you a quick visual representation of the number of errors occurring over time (usually a 24-hour period). While useful for seeing a short trend, the charts themselves can often be a bit misleading. Thus, you will need to use your brain and investigate what is going on before screaming "RED ALERT".
The Runnable infrastructure currently has five environments (at the time of this writing). They are:
- delta (production)
- gamma (infrastructure testing)
- epsilon (infrastructure testing)
- staging (our internal dogfooding environment)
As such, it is important to note which environment you are looking at when exploring the errors in Rollbar. The environment dropdown (shown above) allows you to choose specific environments to investigate. Mostly you'll want to keep an eye on production, but it may also be fruitful to occasionally look at beta and staging to get an idea of what errors might be coming up (usually these environments are slightly ahead of production, code-wise).
When you click on an error in the dashboard you will be brought to the "Error Details" page. This page will give you a basic overview of the occurrences of the error (via charts) and a stack trace for the error (sometimes this will be missing).
These controls, located at the top of the error details page, give you the ability to change the level for an error, mark errors as resolved, mute errors, and create new PagerDuty incidents from errors. This is the meat and potatoes of error curation, and you will use these tools to help separate the signal from the noise.
Finally, the items view gives you a way to query across multiple errors at a time. This can give you an understanding of what is happening from a particular perspective when grappling with the information presented by Rollbar.
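To give a feel for the kind of cross-item question the items view answers, the sketch below tallies a list of items by environment and level. The input shape (dicts with `environment` and `level` keys) is an assumption for illustration, not Rollbar's export format, and `tally_by_level` is a hypothetical helper.

```python
from collections import Counter


def tally_by_level(items):
    """Count items per (environment, level) pair.

    `items` is an iterable of dicts with "environment" and "level" keys.
    This shape is assumed for illustration only.
    """
    return Counter((item["environment"], item["level"]) for item in items)
```

For example, a tally that shows many error-level items in staging but none yet in delta may hint at a problem that has not reached production.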