We'd love your help! Please join our weekly SIG meeting.
The OpenTelemetry Collector has two main target audiences:
- End-users, aiming to use an OpenTelemetry Collector binary.
- Collector distributions, consuming the APIs exposed by the OpenTelemetry core repository. Distributions can be an official OpenTelemetry community project, such as the OpenTelemetry Collector "core" and "contrib", or external distributions, such as other open-source projects building on top of the Collector or vendor-specific distributions.
End-users are the target audience for our binary distributions, as made available via the
opentelemetry-collector-releases repository. To
them, stability in the behavior is important, be it runtime or configuration. They are more numerous and harder to get
in touch with, making our changes to the collector more disruptive to them than to other audiences. As a general rule,
whenever you are developing OpenTelemetry Collector components (extensions, receivers, processors, exporters), you
should have end-users' interests in mind. Similarly, changes to code within packages like config
will have an impact
on this audience. Make sure to cause minimal disruption when doing changes here.
In this capacity, the opentelemetry-collector repository's public Go types, functions, and interfaces act as an API for other projects. In addition to the end-user aspect mentioned above, this audience also cares about API compatibility, making them susceptible to our refactorings, even though such changes wouldn't cause any impact to end-users. See the "Breaking changes" in this document for more information on how to perform changes affecting this audience.
This audience might use tools like the opentelemetry-collector-builder as part of their delivery pipeline. Be mindful that changes there might cause disruption to this audience.
We recommend that any PR (unless it is trivial) to be smaller than 500 lines (excluding go mod/sum changes) in order to help reviewers to do a thorough and reasonably fast reviews.
Consider submitting different PRs for (more details about adding new components here) :
- First PR should include the overall structure of the new component:
- Readme, configuration, and factory implementation usually using the helper factory structs.
- This PR is usually trivial to review, so the size limit does not apply to it.
- Second PR should include the concrete implementation of the component. If the size of this PR is larger than the recommended size consider splitting it in multiple PRs.
- Last PR should enable the new component and add it to the
otelcontribcol
binary by updating thecomponents.go
file. The component must be enabled only after sufficient testing, and there is enough confidence in the stability and quality of the component. - Once a new component has been added to the executable, please add the component to the OpenTelemetry.io registry.
Any refactoring work must be split in its own PR that does not include any behavior changes. It is important to do this to avoid hidden changes in large and trivial refactoring PRs.
Reporting bugs is an important contribution. Please make sure to include:
- Expected and actual behavior
- OpenTelemetry version you are running
- If possible, steps to reproduce
Please read project contribution guide for general practices for OpenTelemetry project.
Select a good issue from the links below (ordered by difficulty/complexity):
Comment on the issue that you want to work on so we can assign it to you and clarify anything related to it.
If you would like to work on something that is not listed as an issue (e.g. a new feature or enhancement) please first read our vision and roadmap to make sure your proposal aligns with the goals of the Collector, then create an issue and describe your proposal. It is best to do this in advance so that maintainers can decide if the proposal is a good fit for this repository. This will help avoid situations when you spend significant time on something that maintainers may decide this repo is not the right place for.
Follow the instructions below to create your PR.
In the interest of keeping this repository clean and manageable, you should
work from a fork. To create a fork, click the 'Fork' button at the top of the
repository, then clone the fork locally using git clone git@github.com:USERNAME/opentelemetry-collector.git
.
You should also add this repository as an "upstream" repo to your local copy, in order to keep it up to date. You can add this as a remote like so:
git remote add upstream https://github.com/open-telemetry/opentelemetry-collector.git
Verify that the upstream exists:
git remote -v
To update your fork, fetch the upstream repo's branches and commits, then merge
your main
with upstream's main
:
git fetch upstream
git checkout main
git merge upstream/main
Remember to always work in a branch of your local copy, as you might otherwise
have to contend with conflicts in main
.
Please also see GitHub workflow section of general project contributing guide.
Working with the project sources requires the following tools:
Fork the repo, checkout the upstream repo to your GOPATH by:
$ git clone git@github.com:open-telemetry/opentelemetry-collector.git
Add your fork as an origin:
$ cd opentelemetry-collector
$ git remote add fork git@github.com:YOUR_GITHUB_USERNAME/opentelemetry-collector.git
Run tests, fmt and lint:
$ make install-tools # Only first time.
$ make
Note: the default build target requires tools that are installed at $(go env GOPATH)/bin
, ensure that $(go env GOPATH)/bin
is included in your PATH
.
Checkout a new branch, make modifications, build locally, and push the branch to your fork to open a new PR:
$ git checkout -b feature
# edit
$ make
$ make fmt
$ git commit
$ git push fork feature
This project uses Go 1.17.* and CircleCI.
CircleCI uses the Makefile with the ci
target, it is recommended to
run it before submitting your PR. It runs gofmt -s
(simplify) and golint
.
The dependencies are managed with go mod
if you work with the sources under your
$GOPATH
you need to set the environment variable GO111MODULE=on
.
Although OpenTelemetry project as a whole is still in Alpha stage we consider OpenTelemetry Collector to be close to production quality and the quality bar for contributions is set accordingly. Contributions must have readable code written with maintainability in mind (if in doubt check Effective Go for coding advice). The code must adhere to the following robustness principles that are important for software that runs autonomously and continuously without direct interaction with a human (such as this Collector).
To keep naming patterns consistent across the project, naming patterns are enforced to make intent clear by:
- Methods that return a variable that uses the zero value or values provided via the method MUST have the prefix
New
. For example:func NewKinesisExporter(kpl aws.KinesisProducerLibrary)
allocates a variable that uses the variables passed on creation.func NewKeyValueBuilder()
SHOULD allocate internal variables to a safe zero value.
- Methods that return a variable that uses non zero value(s) that impacts business logic MUST use the prefix
NewDefault
. For example:func NewDefaultKinesisConfig()
would return a configuration that is the suggested default and can be updated without concern of a race condition.
- Methods that act upon an input variable MUST have a signature that reflect concisely the logic being done. For example:
func FilterAttributes(attrs []Attribute, match func(attr Attribute) bool) []Attribute
MUST only filter attributes out of the passed input slice and return a new slice with values thatmatch
returns true. It may not do more work than what the method name implies, ie, it must not key a global history of all the slices that have been filtered.
- Variable assigned in a package's global scope that is preconfigured with a default set of values MUST use
Default
as the prefix. For example:var DefaultMarshallers = map[string]pdata.Marshallers{...}
is defined with an exporters package that allows for converting an encoding name,zipkin
, and return the preconfigured marshaller to be used in the export process.
In order to simplify developing within the project, library recommendations have been set and should be followed.
Scenario | Recommended | Rationale |
---|---|---|
Hashing | "hashing/fnv" | The project adopted this as the default hashing method due to the efficiency and is reasonable for non cryptographic use |
Testing | Use t.Parallel() where possible |
Enabling more test to be run in parallel will speed up the feedback process when working on the project. |
Within the project, there are some packages that are yet to follow the recommendations and are being address, however, any new code should adhere to the recommendations.
Verify configuration during startup and fail fast if the configuration is invalid. This will bring the attention of a human to the problem as it is more typical for humans to notice problems when the process is starting as opposed to problems that may arise sometime (potentially long time) after process startup. Monitoring systems are likely to automatically flag processes that exit with failure during startup, making it easier to notice the problem. The Collector should print a reasonable log message to explain the problem and exit with a non-zero code. It is acceptable to crash the process during startup if there is no good way to exit cleanly but do your best to log and exit cleanly with a process exit code.
Do not crash or exit outside the main()
function, e.g. via log.Fatal
or os.Exit
,
even during startup. Instead, return detailed errors to be handled appropriately
by the caller. The code in packages other than main
may be imported and used by
third-party applications, and they should have full control over error handling
and process termination.
Do not crash or exit the Collector process after the startup sequence is finished. A running Collector typically contains data that is received but not yet exported further (e.g. is stored in the queues and other processors). Crashing or exiting the Collector process will result in losing this data since typically the receiver has already acknowledged the receipt for this data and the senders of the data will not send that data again.
Do not crash on bad input in receivers or elsewhere in the pipeline. Crash-only software is valid in certain cases; however, this is not a correct approach for Collector (except during startup, see above). The reason is that many senders from which Collector receives data have built-in automatic retries of the same data if no acknowledgment is received from the Collector. If you crash on bad input chances are high that after the Collector is restarted it will see the same data in the input and will crash again. This will likely result in infinite crashing loop if you have automatic retries in place.
Typically bad input when detected in a receiver should be reported back to the sender. If it is elsewhere in the pipeline it may be too late to send a response to the sender (particularly in processors which are not synchronously processing data). In either case it is recommended to keep a metric that counts bad input data.
Be rigorous in error handling. Don't ignore errors. Think carefully about each error and decide if it is a fatal problem or a transient problem that may go away when retried. Fatal errors should be logged or recorded in an internal metric to provide visibility to users of the Collector. For transient errors come up with a retrying strategy and implement it. Typically you will want to implement retries with some sort of exponential back-off strategy. For connection or sending retries use jitter for back-off intervals to avoid overwhelming your destination when network is restored or the destination is recovered. Exponential Backoff is a good library that provides all this functionality.
Log your component startup and shutdown, including successful outcomes (but don't overdo it, keep the number of success message to a minimum). This can help to understand the context of failures if they occur elsewhere after your code is successfully executed.
Use logging carefully for events that can happen frequently to avoid flooding the logs. Avoid outputting logs per a received or processed data item since this can amount to very large number of log entries (Collector is designed to process many thousands of spans and metrics per second). For such high-frequency events instead of logging consider adding an internal metric and increment it when the event happens.
Make log message human readable and also include data that is needed for easier understanding of what happened and in what context.
The components should avoid executing arbitrary external processes with arbitrary command line arguments based on user input, including input received from the network or input read from the configuration file. Failure to follow this rule can result in arbitrary remote code execution, compelled by malicious actors that can craft the input.
The following limitations are recommended:
- If an external process needs to be executed limit and hard-code the location where the executable file may be located, instead of allowing the input to dictate the full path to the executable.
- If possible limit the name of the executable file to be one from a hard-coded list defined at compile time.
- If command line arguments need to be passed to the process do not take the arguments from the user input directly. Instead, compose the command line arguments indirectly, if necessary, deriving the value from the user input. Limit as much as possible the possible space of values for command line arguments.
Out of the box, your users should be able to observe the state of your component.
The collector exposes an OpenMetrics endpoint at http://localhost:8888/metrics
where your data will land.
When using the regular helpers, you should have some metrics added around key
events automatically. For instance, exporters should have otelcol_exporter_sent_spans
tracked without your exporter doing anything.
Limit usage of CPU, RAM or other resources that the code can use. Do not write code that consumes resources in an uncontrolled manner. For example if you have a queue that can contain unprocessed messages always limit the size of the queue unless you have other ways to guarantee that the queue will be consumed faster than items are added to it.
Performance test the code for both normal use-cases under acceptable load and also for abnormal use-cases when the load exceeds acceptable many times. Ensure that your code performs predictably under abnormal use. For example if the code needs to process received data and cannot keep up with the receiving rate it is not acceptable to keep allocating more memory for received data until the Collector runs out of memory. Instead have protections for these situations, e.g. when hitting resource limits drop the data and record the fact that it was dropped in a metric that is exposed to users.
Collector does not yet support graceful shutdown but we plan to add it. All components
must be ready to shutdown gracefully via Shutdown()
function that all component
interfaces require. If components contain any temporary data they need to process
and export it out of the Collector before shutdown is completed. The shutdown process
will have a maximum allowed duration so put a limit on how long your shutdown
operation can take.
Cover important functionality with unit tests. We require that contributions do not decrease overall code coverage of the codebase - this is aligned with our goal to increase coverage over time. Keep track of execution time for your unit tests and try to keep them as short as possible.
To keep testing practices consistent across the project, it is advised to use these libraries under these circumstances:
- For assertions to validate expectations, use
"github.com/stretchr/testify/assert"
- For assertions that are required to continue the test, use
"github.com/stretchr/testify/require"
- For mocking external resources, use
"github.com/stretchr/testify/mock"
- For validating HTTP traffic interactions,
"net/http/httptest"
Integration testing is encouraged throughout the project, container images can be used in order to facilitate a local version. In their absence, it is strongly advised to mock the integration.
Using CGO is prohibited due to the lack of portability and complexity that comes with managing external libaries with different operating systems and configurations. However, if the package MUST use CGO, this should be explicitly called out within the readme with clear instructions on how to install the required libraries. Furthermore, if your package requires CGO, it MUST be able to compile and operate in a no op mode or report a warning back to the collector with a clear error saying CGO is required to work.
Whenever possible, we adhere to semver as our minimum standards. Even before v1, we strive to not break compatibility without a good reason. Hence, when a change is known to cause a breaking change, it MUST be clearly marked in the changelog, and SHOULD include a line instructing users how to move forward.
We also strive to perform breaking changes in two stages, deprecating it first (vM.N
) and breaking it in a subsequent
version (vM.N+1
).
- when we need to remove something, we MUST mark a feature as deprecated in one version, and MAY remove it in a subsequent one
- when renaming or refactoring types, functions, or attributes, we MUST create the new name and MUST deprecate the old one in one version (step 1), and MAY remove it in a subsequent version (step 2). For simple renames, the old name SHALL call the new one.
- when a feature is being replaced in favor of an existing one, we MUST mark a feature as deprecated in one version, and MAY remove it in a subsequent one.
Deprecation notice SHOULD contain a version starting from which the deprecation takes effect for tracking purposes. For
example, if GetFoo
function is going to be deprecated in v0.45.0
version, it gets the following godoc line:
// Deprecated: [v0.45.0] Use MustGetFoo instead.
func GetFoo() Bar {
When deprecating a feature affecting end-users, consider first deprecating the feature in one version, then hiding it behind a feature flag in a subsequent version, and eventually removing it after yet another version. This is how it would look like, considering that each of the following steps is done in a separate version:
- Mark the feature as deprecated, add a short lived feature flag with the feature enabled by default
- Change the feature flag to disable the feature by default, deprecating the flag at the same time
- Remove the feature and the flag
- Current version
v0.N
hasfunc GetFoo() Bar
- We now decided that
GetBar
is a better name. As such, onv0.N+1
we add a newfunc GetBar() Bar
function, changing the existingfunc GetFoo() Bar
to be an alias of the new function. Additionally, a log entry with a warning is added to the old function, along with an entry to the changelog. - On
v0.N+2
, we MAY removefunc GetFoo() Bar
.
- Current version
v0.N
hasfunc GetFoo() Foo
- We now need to also return an error. We do it by creating a new function that will be equivalent to the existing one,
so that current users can easily migrate to that:
func MustGetFoo() Foo
, which panics on errors. We release this inv0.N+1
, deprecating the existingfunc GetFoo() Foo
with it, adding an entry to the changelog and perhaps a log entry with a warning. - On
v0.N+2
, we changefunc GetFoo() Foo
tofunc GetFoo() (Foo, error)
.
- Current version
v0.N
hasfunc GetFoo() Foo
- We now decided to do something that might be blocking as part of
func GetFoo() Foo
, so, we start accepting a context:func GetFooWithContext(context.Context) Foo
. We release this inv0.N+1
, deprecating the existingfunc GetFoo() Foo
with it, adding an entry to the changelog and perhaps a log entry with a warning. The existingfunc GetFoo() Foo
is changed to callfunc GetFooWithContext(context.Background()) Foo
. - On
v0.N+2
, we changefunc GetFoo() Foo
tofunc GetFoo(context.Context) Foo
if desired or remove it entirely if needed.
An entry into the Changelog is required for the following reasons:
- Changes made to the behaviour of the component
- Changes to the configuration
- Changes to default settings
- New components being added
It is reasonable to omit an entry to the changelog under these circuimstances:
- Updating test to remove flakiness or improve coverage
- Updates to the CI/CD process
If there is some uncertainty with regards to if a changelog entry is needed, the recomendation is to create an entry to in the event that the change is important to the project consumers.
See release for details.
If you are adding any new images, please use Excalidraw. It's a free and open source web application and doesn't require any account to get started. Once you've created the design, while exporting the image, make sure to tick "Embed scene into exported file" option. This allows the image to be imported in an editable format for other contributors later.
Build fails due to dependency issues, e.g.
go: github.com/golangci/golangci-lint@v1.31.0 requires
github.com/tommy-muehle/go-mnd@v1.3.1-0.20200224220436-e6f9a994e8fa: invalid pseudo-version: git fetch --unshallow -f origin in /root/go/pkg/mod/cache/vcs/053b1e985f53e43f78db2b3feaeb7e40a2ae482c92734ba3982ca463d5bf19ce: exit status 128:
fatal: git fetch-pack: expected shallow list
go env GOPROXY
should return https://proxy.golang.org,direct
. If it does not, set it as an environment variable:
export GOPROXY=https://proxy.golang.org,direct
While the rules in this and other documents in this repository is what we strive to follow, we acknowledge that rules may be unfeasible to be applied in some situations. Exceptions to the rules on this and other documents are acceptable if consensus can be obtained from approvers in the pull request they are proposed. A reason for requesting the exception MUST be given in the pull request. Until unanimity is obtained, approvers and maintainers are encouraged to discuss the issue at hand. If a consensus (unanimity) cannot be obtained, the maintainers are then tasked in getting a decision using its regular means (voting, TC help, ...).