Merge branch 'main' into collector_architecture
Showing 35 changed files with 714 additions and 127 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
--- | ||
title: | ||
"Making observability fun: How we increased engineers' confidence in incident | ||
management using a game" | ||
linkTitle: Skyscanner using OTel Demo | ||
date: 2024-02-26 | ||
author: >- | ||
[Jordi Bisbal Ansaldo](https://github.com/jordibisbal8) (Skyscanner) | ||
cSpell:ignore: Ansaldo Bisbal Jordi runbooks Skyscanner upskilled Yankova | ||
--- | ||
|
||
At [Skyscanner](https://www.skyscanner.net), as in many organizations, teams | ||
tend to follow specific runbooks for individual failure modes. In modern, | ||
complex distributed systems, however, most failures are unknowns, which makes | ||
runbooks only partially applicable. | ||
|
||
After migrating our telemetry data to the OpenTelemetry standards at Skyscanner, | ||
we now have richer instrumentation and can rely on observability directly. As a | ||
result, we are ready to adopt a new | ||
[observability mindset](https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/), | ||
which requires training our engineers to work effectively with the new | ||
ecosystem. This allows them to react efficiently to any known or unknown issues, | ||
even under pressure. | ||
|
||
We believe that the best way to gain this knowledge isn’t through one-time | ||
viewings of documents or videos. Instead, it’s through practical exercises | ||
that include situations with never-before-seen (or at least rarely seen) | ||
problems. This helps the company reduce the time to mitigate (TTM) an issue, | ||
measured from when a first responder acknowledges the incident until users | ||
stop suffering from it. | ||
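
As a minimal sketch of the TTM definition above, the metric is the difference between two timestamps: when the first responder acknowledges the incident and when users stop being affected. The function name and the sample timestamps below are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timedelta


def time_to_mitigate(acknowledged_at: datetime, mitigated_at: datetime) -> timedelta:
    """Return the TTM: time from acknowledgement until users stop being affected."""
    if mitigated_at < acknowledged_at:
        raise ValueError("mitigation cannot precede acknowledgement")
    return mitigated_at - acknowledged_at


# Hypothetical incident timestamps, for illustration only.
acknowledged = datetime(2024, 2, 26, 10, 5)
mitigated = datetime(2024, 2, 26, 10, 47)
ttm = time_to_mitigate(acknowledged, mitigated)
print(ttm)  # 0:42:00
```

Tracking this value per session is one way to see whether repeated practice shortens mitigation time.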
|
||
## Environment | ||
|
||
To begin with, we need to set up an environment that demonstrates the best | ||
practices for monitoring and debugging using OpenTelemetry instrumentation and | ||
observability. For this, we propose the use of the official | ||
[OpenTelemetry Demo](/docs/demo/), which is a realistic example of a distributed | ||
system called Astronomy Shop. Thanks to the | ||
[OpenTelemetry Protocol](/docs/specs/otlp/) (OTLP), it allows us to simply point | ||
the standard OTLP exporter in the Collector to | ||
[New Relic](https://newrelic.com/), our chosen observability platform at | ||
Skyscanner which, like other platforms, is fully embracing open standards to | ||
ingest telemetry data. | ||
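
To illustrate what "pointing the OTLP exporter" can look like, here is a rough sketch that builds a dictionary shaped like a Collector configuration with a single OTLP/HTTP exporter. The exporter instance name, the endpoint URL, the `api-key` header name, and the environment variable name are assumptions made for this example, not the configuration used at Skyscanner. Check your vendor's documentation for the actual values.

```python
import json


def otlp_exporter_config(endpoint: str, api_key_env: str) -> dict:
    """Build a dict shaped like a Collector config with one OTLP/HTTP exporter."""
    exporter_name = "otlphttp/vendor"  # hypothetical exporter instance name
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {
            exporter_name: {
                "endpoint": endpoint,
                # Assumed header name; the ${env:NAME} reference keeps the
                # key itself out of the configuration file.
                "headers": {"api-key": "${env:" + api_key_env + "}"},
            }
        },
        "service": {
            "pipelines": {
                signal: {"receivers": ["otlp"], "exporters": [exporter_name]}
                for signal in ("traces", "metrics", "logs")
            }
        },
    }


# Assumed endpoint and variable name, for illustration only.
config = otlp_exporter_config("https://otlp.nr-data.net", "NEW_RELIC_LICENSE_KEY")
print(json.dumps(config, indent=2))
```

Switching to a different OTLP-compatible backend would then only mean changing the endpoint and the authentication header, which is the portability the protocol is meant to provide.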
|
||
This system contains regressions that can be injected into the platform and | ||
helps us demonstrate the importance of Service Level Objectives (SLOs), | ||
tracing, logs, metrics, etc. For instance, we can observe traffic flow through | ||
various components, as shown in the image below. Since the demo is part of the | ||
open source OpenTelemetry ecosystem, we can also propose new features, which | ||
are then reviewed by OpenTelemetry contributors. | ||
|
||
![Distributed tracing example in Astronomy shop](tracing-example.png) | ||
|
||
## Observability game day | ||
|
||
Once the environment is set up, we can introduce the Observability Game Day, an | ||
initiative based on the Wheel of Misfortune practices that Google uses and | ||
describes in the [Site Reliability Engineering book](https://sre.google/books/). | ||
|
||
This game simulates a production incident, where a moderator known as the game | ||
master (GM) conducts the session and someone from the audience spins the wheel | ||
and explains an incident or outage. The participants are then divided into teams | ||
and tasked with identifying and resolving the issue as quickly as possible. If | ||
the solution is not optimal, the GM can help by introducing a new tool or view, | ||
which gives a different perspective on how to tackle the incident (knowledge | ||
sharing). This exercise can be repeated multiple times for different incidents. | ||
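
The "spin the wheel" step can be sketched in a few lines: pick an incident at random and remove it from the pool so that it is not replayed in the same session. The scenario names below are hypothetical. The real list depends on which regressions your demo environment can inject.

```python
import random

# Hypothetical incident scenarios, for illustration only.
INCIDENTS = [
    "Checkout requests start failing",
    "Product pages load slowly",
    "Recommendations return errors",
    "Cart contents disappear",
]


def spin_wheel(incidents: list, rng: random.Random) -> str:
    """Pick one incident at random and remove it so it is not replayed."""
    choice = rng.choice(incidents)
    incidents.remove(choice)
    return choice


rng = random.Random(42)  # fixed seed so a session can be replayed
remaining = list(INCIDENTS)
while remaining:
    print(f"Next incident: {spin_wheel(remaining, rng)}")
```

Passing the random generator in explicitly keeps the draw reproducible, which helps if the GM wants to rehearse a session beforehand.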
|
||
![Wheel of misfortune example](wheel.png) | ||
|
||
## Results | ||
|
||
The Observability Game Day has already been completed by multiple Skyscanner | ||
teams, with each team's observability expert (ambassador) running the session. | ||
The participants have given extremely positive feedback: 90% of respondents | ||
say that after the Game Day they feel more confident debugging production | ||
systems and would love to have further sessions. | ||
|
||
- Hugely valuable to run against real services and to compare and contrast | ||
different debugging methods. I'm certain everyone, regardless of skill level, | ||
will have got something out of the session - I know I did! Thank you for | ||
taking the time to set this up and promoting it for us - | ||
[Dominic Fraser](https://github.com/dominicfraser) (Senior Software Engineer) | ||
- It is a really great (company-wide) initiative to get people upskilled in | ||
observability and OpenTelemetry/New Relic and I personally found it very | ||
useful, as well as a lot of fun! :D - Polly Yankova (Software Engineer) | ||
|
||
In addition, we learned that: | ||
|
||
1. OTLP makes it incredibly simple to integrate a standard application with an | ||
observability vendor. Just point it to the right endpoint and job done. | ||
2. Our winning teams relied primarily on tracing data to analyze regressions, | ||
which helped them understand the root cause faster. Tracing FTW! | ||
3. Front-end engineers found the Game Day lacked focus on client-side | ||
observability, so we decided to contribute upstream (see next steps below). | ||
This was my first contribution to the project, and it was a great experience! | ||
Maintainers were very welcoming and helped me to test and release. Thanks! | ||
|
||
## Next steps | ||
|
||
The next action is to run sessions for all the engineering teams in the company | ||
and convert them into a Skyscanner learning course. This way, the content can be | ||
used during the onboarding process for new joiners or even reviewed at any time | ||
as a refresher for those who have been in the company longer. In addition, after | ||
observing common feedback, we identified that it would be beneficial to extend | ||
the current incidents to include more front-end-specific ones, such as incidents | ||
triggered by browser traffic. To achieve this, we have contributed to the | ||
OpenTelemetry Demo and enabled these features for other interested parties. For | ||
more information, please have a look at the | ||
[raised PR](https://github.com/open-telemetry/opentelemetry-demo/pull/1345). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,147 @@ | ||
--- | ||
title: | ||
Join us for OpenTelemetry Talks and Activities at KubeCon + CloudNativeCon | ||
Europe 2024 | ||
linkTitle: KubeCon EU '24 | ||
date: 2024-02-28 | ||
# prettier-ignore | ||
cSpell:ignore: Aiven Alexandre Anusha Arbiv Beemer Benedikt Blanco Bongartz Chekuri Coralogix Cosmonic Dyrmishi Jiekun Joonas Kanal Kolachala Kowall Machado Magno Marcin Matej Mirabella Narapureddy Nenashev Oleg Oluwalolope Outshift Pismo Purvi Quwan Reddy Ridwan Rollouts Ryanair Skyscanner Sodkiewicz Soluções Srikanth Tecnológicas Yosef | ||
author: '[Severin Neumann](https://github.com/svrnm) (Cisco)' | ||
--- | ||
|
||
The OpenTelemetry project maintainers, members of the governance committee, and | ||
technical committee are thrilled to be at [KubeCon + CloudNativeCon Europe][] | ||
and at the co-located | ||
[Observability Day](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/co-located-events/observability-day/) | ||
in Paris from March 19 - 22, 2024. | ||
|
||
Read on to learn about all the things related to OpenTelemetry during KubeCon. | ||
|
||
This post may be updated as we receive notice of other activities, so please | ||
check it again right before KubeCon! | ||
|
||
## KubeCon Talks and Maintainer Sessions | ||
|
||
- **[OpenTelemetry: Project Updates, Next Steps, and AMA](https://sched.co/1R2mK)**<br> | ||
by Severin Neumann, Cisco; Austin Parker, Honeycomb; Trask Stalnaker, | ||
Microsoft; Daniel Gomez Blanco, Skyscanner; Alolita Sharma, Apple<br> | ||
Wednesday, March 20 • 11:15 - 11:50 | ||
- **[Distributed Tracing with Jaeger and OpenTelemetry](https://sched.co/1YhfT)**<br> | ||
by Pavol Loffay, Red Hat & Jonah Kowall, Aiven<br> Wednesday, March 20 • | ||
12:10 - 12:45 | ||
- **[Disintegrated Telemetry: The Pains of Monitoring Asynchronous Workflows](https://sched.co/1YeNV)**<br> | ||
by Johannes Tax, Grafana Labs<br> Wednesday, March 20 • 16:30 - 17:05 | ||
- **[From RUM to Front-End Observability with OpenTelemetry](https://sched.co/1YeOH)**<br> | ||
by Purvi Kanal, Honeycomb<br> Thursday, March 21 • 11:00 - 11:35 | ||
- **[Tutorial: Exploring the Power of Distributed Tracing with OpenTelemetry on Kubernetes](https://sched.co/1YePA)**<br> | ||
by Pavol Loffay & Benedikt Bongartz, Red Hat; Matej Gera, Coralogix; Anthony | ||
Mirabella, AWS; Anusha Reddy Narapureddy, Apple<br> Thursday, March 21 • | ||
14:30 - 16:00 | ||
- **[Prometheus and OpenTelemetry: Better Together](https://sched.co/1YePz)**<br> | ||
by Adriana Villela, ServiceNow Cloud Observability & Reese Lee, New Relic<br> | ||
Thursday, March 21 • 16:30 - 17:05 | ||
- **[Observable Feature Rollouts with OpenTelemetry and OpenFeature](https://sched.co/1YeSC)**<br> | ||
by Daniel Dyla & Michael Beemer, Dynatrace<br> Friday, March 22 • 16:00 - | ||
16:35 | ||
|
||
## Observability Day | ||
|
||
_[Observability Day][] fosters collaboration, discussion, and knowledge sharing | ||
of cloud-native observability projects_. This event will be held on March 19, | ||
2024 from 9:00 - 17:35. There will be several sessions on OpenTelemetry as well: | ||
|
||
- **[Welcome + Project Updates](https://sched.co/1YGT9)**<br> by Eduardo Silva, | ||
FluentBit & Austin Parker, honeycomb.io<br> Tuesday, March 19th • 09:00 - | ||
09:20 | ||
- **[Dude, Where’s My Error?: How OpenTelemetry Records Errors, and Why It Does It Like That](https://sched.co/1YFeM)**<br> | ||
by Adriana Villela, ServiceNow Cloud Observability (formerly Lightstep) & | ||
Reese Lee, New Relic<br> Tuesday, March 19th • 10:00 - 10:25 | ||
- **[How to Think About Instrumentation Overhead](https://sched.co/1YFfb)**<br> | ||
by Jason Plumb, Splunk<br> Tuesday, March 19th • 11:05 - 11:30 | ||
- **[TTChat’s Story: Connect Metrics, Logs and Traces with eBPF](https://sched.co/1YFfe)**<br> | ||
by Zhu Jiekun, Quwan<br> Tuesday, March 19th • 11:05 - 11:30 | ||
- **[Panel: OpenTelemetry: Realizing the Value of Open Standards](https://sched.co/1YFgW)**<br> | ||
by Daniel Gomez Blanco, Skyscanner; Marcin Sodkiewicz, Ryanair; Iris Dyrmishi, | ||
Miro; Hope Oluwalolope, Microsoft<br> Tuesday, March 19th • 12:15 - 12:50 | ||
- **[Telemetry Showdown: Fluent Bit Vs. OpenTelemetry Collector - a Comprehensive Benchmark Analysis](https://sched.co/1YFhI)**<br> | ||
by Henrik Rexed, Dynatrace<br> Tuesday, March 19th • 13:30 - 13:55 | ||
- **[Monitoring Serverless Workloads with OpenTelemetry and Prometheus](https://sched.co/1YFhh)**<br> | ||
by Ridwan Sharif, Google<br> Tuesday, March 19th • 14:05 - 14:30 | ||
- **[Observability at the Edge: Instrumenting WebAssembly with OpenTelemetry](https://sched.co/1YFik)**<br> | ||
by Dan Norris & Joonas Bergius, Cosmonic<br> Tuesday, March 19th • 15:15 - | ||
15:40 | ||
- **[Real-World Sampling – Lessons Learned After Reducing ~80% of Our O11y Costs](https://sched.co/1YFii)**<br> | ||
by Juraci Paixão Kröhling, Grafana Labs & Alexandre Magno Prado Machado, Pismo | ||
Soluções Tecnológicas<br> Tuesday, March 19th • 15:15 - 15:40 | ||
- **[⚡ Lightning Talk: Not Just Enterprise. Modern Java App CI/CD Observability with OTel, Quarkus and Gradle](https://sched.co/1YFin)**<br> | ||
by Oleg Nenashev, WireMock<br> Tuesday, March 19th • 15:45 - 15:50 | ||
- **[Shift Into an Observability Mindset with OpenTelemetry](https://sched.co/1YFjB)**<br> | ||
by Daniel Gomez Blanco, Skyscanner<br> Tuesday, March 19th • 15:45 - 16:15 | ||
- **[⚡ Lightning Talk: Federated Search Over Distributed Observability Data](https://sched.co/1YFjC)**<br> | ||
by Kalyan Kolachala, Intuit<br> Tuesday, March 19th • 15:55 - 16:00 | ||
- **[⚡ Lightning Talk: Application Security Through the Lens of OpenTelemetry](https://sched.co/1YFf5)**<br> | ||
by Yosef Arbiv, Outshift by Cisco<br> Tuesday, March 19th • 16:05 - 16:10 | ||
- **[Lazy Robots: Telemetry Buffering on Android](https://sched.co/1YFk3)**<br> | ||
by Cesar Munoz, Elastic & Jason Plumb, Splunk<br> Tuesday, March 19th • | ||
17:00 - 17:25 | ||
- **[OpAMP in Action: User Configurable Observability Pipelines](https://sched.co/1YFk6)**<br> | ||
by Srikanth Chekuri, SigNoz<br> Tuesday, March 19th • 17:00 - 17:25 | ||
|
||
{{% alert title="Important access note" color="danger" %}} | ||
|
||
You need an _in-person all-access_ pass for on-site access to **Observability | ||
Day**. For details, see [KubeCon registration][]. If you have a virtual ticket, | ||
you will be able to follow **Observability Day** through a live stream. | ||
|
||
[kubecon registration]: | ||
https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/register/ | ||
|
||
{{% /alert %}} | ||
|
||
## OpenTelemetry Observatory | ||
|
||
Drop by and say _"Hi!"_ at OpenTelemetry Observatory presented by Splunk in the | ||
Expo Hall. This will be a place for informal chats, meetups, and other | ||
discussions led by OpenTelemetry community members and maintainers. Check out | ||
the schedule of activities [here](https://shorturl.at/qEUX1). | ||
|
||
If you’d like to participate and lead a discussion or short presentation, reach | ||
out to the | ||
[OpenTelemetry End User Working Group](https://cloud-native.slack.com/archives/C01RT3MSWGZ) | ||
to indicate your interest. | ||
|
||
You can help us improve the project by sharing your thoughts and feedback about | ||
your OpenTelemetry adoption, implementation, and usage. | ||
|
||
To join a feedback session, book online below: | ||
|
||
- [End User Feedback Sessions 1](https://calendly.com/otel-euwg/end-user-feedback-sessions-1?month=2024-03) | ||
- [End User Feedback Sessions 2](https://calendly.com/otel-euwg/end-user-feedback-sessions-2?month=2024-03) | ||
- [End User Feedback Sessions 3](https://calendly.com/otel-euwg/end-user-feedback-sessions-3?month=2024-03) | ||
- [End User Feedback Sessions 4](https://calendly.com/otel-euwg/end-user-feedback-sessions-4?month=2024-03) | ||
- [End User Feedback Sessions 5](https://calendly.com/otel-euwg/end-user-feedback-sessions-5?month=2024-03) | ||
|
||
A maximum of 5 participants will join one SIG maintainer to provide feedback for | ||
that SIG. Sessions will be recorded and posted on the | ||
[OTel YouTube channel](https://youtube.com/@otel-official). The final SIG list | ||
is still TBD, so check back here often! | ||
|
||
We will create action items from your comments as appropriate. Check | ||
[#otel-user-research][] in CNCF's Slack instance for results and action item | ||
updates to come after KubeCon EU. | ||
|
||
Back by popular demand! We'll be recording | ||
[Humans of OTel interviews](/blog/2023/humans-of-otel/) at the OTel Observatory. | ||
If you'd like to share your experiences as an OpenTelemetry practitioner or | ||
maintainer, sign up for an interview session | ||
[here](https://calendly.com/otel-euwg/humans-of-otel). | ||
|
||
Come join us to listen, learn, and get involved in OpenTelemetry. | ||
|
||
See you in Paris! | ||
|
||
[#otel-user-research]: https://cloud-native.slack.com/archives/C01RT3MSWGZ | ||
[KubeCon + CloudNativeCon Europe]: | ||
https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/ | ||
[Observability Day]: | ||
https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/co-located-events/observability-day/ |