---
title: About Druid
layout: simple_page
sectionid: druid
canonical:
---
Apache Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications.
### Sub-second OLAP Queries

Druid’s unique architecture enables rapid multi-dimensional filtering, ad hoc attribute groupings, and extremely fast aggregations.
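As an illustration, an OLAP-style query in Druid SQL filters, groups, and aggregates over a table of events. The datasource and column names below are hypothetical, chosen only to sketch the shape of such a query:

```sql
-- Hypothetical "edits" datasource: one row per event, with a
-- __time column, dimensions (channel, countryName), and metrics (added).
SELECT
  channel,
  COUNT(*)   AS edit_count,
  SUM(added) AS lines_added
FROM edits
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  AND countryName = 'United States'
GROUP BY channel
ORDER BY edit_count DESC
LIMIT 10
```

Queries like this one exercise exactly the operations Druid's column format is optimized for: time-range pruning, dimension filtering, and fast aggregation.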
### Real-time Streaming Ingestion

Druid employs lock-free ingestion so that high-dimensional, high-volume data sets can be ingested and queried at the same time. Explore events immediately after they occur.
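For streaming ingestion, Druid can consume directly from a message bus such as Apache Kafka. A minimal supervisor spec might look like the sketch below; the topic, datasource, and column names are hypothetical, and the exact fields vary by Druid version:

```json
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "edits",
    "timestampSpec": { "column": "timestamp", "format": "iso" },
    "dimensionsSpec": { "dimensions": ["channel", "countryName"] },
    "granularitySpec": {
      "segmentGranularity": "hour",
      "queryGranularity": "minute"
    }
  },
  "ioConfig": {
    "topic": "edits",
    "consumerProperties": { "bootstrap.servers": "localhost:9092" }
  }
}
```

Once a spec like this is submitted, Druid ingests events continuously, and those events become queryable without waiting for a batch load to complete.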
### Power Analytic Applications

Druid has numerous features built for multi-tenancy, letting you power user-facing analytic applications designed for thousands of concurrent users.
### Cost Effective

Druid is extremely cost effective at scale, with numerous built-in features for cost reduction. Trade off cost and performance with simple configuration knobs.
### Highly Available

Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates, so your data remains available and queryable during software upgrades, and you can scale up or down without data loss.
### Scalable

Existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second.
Organizations have deployed Druid to analyze user, server, and marketplace events across a variety of industries including media, telecommunications, security, banking, health care, and retail. Druid is a good fit if you have the following requirements:
- You are building an application that requires fast aggregations and OLAP queries
- You want to do real-time analysis
- You have lots of data (trillions of events, petabytes of data)
- You need a data store that is always available with no single point of failure
Druid is partially inspired by existing analytic data stores such as Google's BigQuery/Dremel, Google's PowerDrill, and search infrastructure. Druid indexes all ingested data in a custom column format optimized for aggregations and filters. A Druid cluster is composed of various types of processes (called nodes), each designed to do a small set of things very well.
For a comprehensive look at Druid architecture, please read our white paper from 2014.
For comparisons of Druid with other data stores, see:

- Druid vs Elasticsearch
- Druid vs Key/Value Stores (HBase/Cassandra)
- Druid vs Redshift
- Druid vs Spark
- Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto)
Existing production Druid clusters have scaled to the following:
- 3+ trillion events/month
- 3M+ events/sec through Druid's real-time ingestion
- 100+ PB of raw data
- 50+ trillion events
- Thousands of queries per second for applications used by thousands of users
- Tens of thousands of cores