Skip to content

Wind-Gone/awesome-olap-paper

Repository files navigation

Awesome-OLAP-Paper 666

Awesome OLAP Paper visitor badge GitHub Repo stars GitHub Repo forks

Introduction

A curated paper list of awesome Online Analytical Processing database systems, theory, frameworks, resources, tools and other awesomeness, for database researchers/engineers.

Contributing

The repository is under construction. Welcome new PR, please conform to the committed rules:

paperName(with pdf link) [MeetingName Year] Github link if it has open-sourced code (optional)

Acknowledge

Thanks to all authors of the paper/repository I cite :D

Table of Content

Query-Aware Database Generation

  1. QAGen: Generating Query-Aware Test Databases [SIGMOD 07]
  2. Generating Targeted Queries for Database Testing [SIGMOD 08]
  3. Generating Databases for Query Workloads [VLDB 10]
  4. Data Generation using Declarative Constraints [SIGMOD 11]
  5. MyBenchmark: generating databases for query workloads [VLDB 14]
  6. Scalable and Dynamic Regeneration of Big Data Volumes [EDBT 18]
  7. Touchstone: Generating Enormous Query-Aware Test Databases [OSDI 18]
  8. Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
  9. Projection-Compliant Database Generation [VLDB 22]
  10. SAM: Database Generation from Query Workloads with Supervised Autoregressive Models [SIGMOD 22]
  11. Mirage: Generating Enormous Databases for Complex Workloads [ICDE 24]

Privacy

  1. PrivSyn: Differentially Private Data Synthesis [ATC 21]
  2. Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
  3. Data Synthesis via Differentially Private Markov Random Fields [VLDB 21]
  4. PrivLava: Synthesizing Relational Data with Foreign Keys under Differential Privacy [SIGMOD 23]
  5. Privacy-Enhanced Database Synthesis for Benchmark Publishing [arXiv 24]

Survey

  1. Synthetic Data Generation for Enterprise DBMS [ICDE 23]

Query Schedule

  1. Self-Tuning Query Scheduling for Analytical Workloads [SIGMOD 21]
  2. Memory Efficient Scheduling of Query Pipeline Execution [CIDR 22]
  3. LSched: A Workload-Aware Learned Query Scheduler for Analytical Database Systems [SIGMOD 22]
  4. Rotary: A Resource Arbitration Framework for Progressive Iterative Analytics [ICDE 23]

Query Optimization

  1. Sampling-Based Query Re-Optimization [SIGMOD 16]
  2. Kepler: Robust Learning for Parametric Query Optimization [SIGMOD 23]
  3. Rethink Query Optimization in HTAP Databases [SIGMOD 24]
  4. Optimizing Nested Recursive Queries [SIGMOD 24]
  5. Efficient Enumeration of Recursive Plans in Transformation-based Query Optimizers [VLDB 24]
  6. ROME: Robust Query Optimization via Parallel Multi-Plan Execution [SIGMOD 24]
  7. Presto’s History-based Query Optimizer [VLDB 24]

Query Rewrite

  1. QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting [VLDB 23]
  2. SlabCity: Whole-Query Optimization using Program Synthesis [VLDB 23]
  3. GEqO: ML-Accelerated Semantic Equivalence Detection [SIGMOD 24]
  4. Proving Query Equivalence Using Linear Integer Arithmetic [SIGMOD 24]
  5. QED: A Powerful Query Equivalence Decider for SQL [VLDB 24]
  6. VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints [OOPSLA 24]

Cardinality Estimation

Histogram

  1. Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries [None 87]
  2. Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results [ACM Transactions on Database Systems 93]
  3. On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications [ICDT 99]
  4. Independence is good: Dependency-based histogram synopses for high-dimensional data [SIGMOD 01]
  5. STHoles: a multidimensional workload-aware histogram [SIGMOD 01]
  6. A multi-dimensional histogram for selectivity estimation and fast approximate query answering [CASCON 03]
  7. The history of histograms (abridged) [VLDB 03]
  8. ISOMER: Consistent histogram construction using query feedback [ICDE 06]
  9. Join Over Histograms [Alberto Dell'Era 07]
  10. Improving accuracy and robustness of self-tuning histograms by subspace clustering [ICDE 16]
  11. LHist: Towards Learning Multidimensional Histogram for Massive Spatial Data [ICDE 21]

Sampling

  1. Two-Level Sampling for Join Size Estimation [SIGMOD 17]
  2. Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing [SIGMOD 21]

Others

  1. Access path selection in a relational database management system [SIGMOD 79]
  2. Approximating multi-dimensional aggregate range queries over real attributes [SIGMOD 00]
  3. Selectivity estimators for multidimensional range queries over real attributes [VLDB 05]
  4. Plan Bouquets: Query Processing without Selectivity Estimation [SIGMOD 14]
  5. Exact Cardinality Query Optimization with Bounded Execution Cost [SIGMOD 19]
  6. JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation [SIGMOD 23]
  7. Efficient and Effective Cardinality Estimation for Skyline Family [SIGMOD 23]

Survey

  1. Preventing bad plans by bounding the impact of cardinality estimation errors [VLDB 09]
  2. Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsof SQL Server [VLDB 23]

Join Order

  1. Join Order Selection with Deep Reinforcement Learning: Fundamentals, Techniques, and Challenges [VLDB 23]
  2. Efficiently Computing Join Orders with Heuristic Search [SIGMOD 23]
  3. Ready to Leap (by Co-Design)? Join Order Optimisation on Quantum Hardware [SIGMOD 23]
  4. Quantum-Inspired Digital Annealing for Join Ordering [VLDB 24]
  5. POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance [VLDB 24]
  6. Sub-optimal Join Order Identification with L1-error [SIGMOD 24]

Join Algorithms

  1. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems [VLDB 12]
  2. Leapfrog Triejoin: a worst-case optimal join algorithm [International Conference on Database Theory 12]
  3. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory [SIGMOD 16]
  4. Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems [SIGMOD 18]
  5. Adopting Worst-Case Optimal Joins in Relational Database Systems [VLDB 20]
  6. Free Join: Unifying Worst-Cast Optimal and Traditional Joins [arXiv 23]
  7. Reservoir Sampling over Joins [SIGMOD 24]

Cost Model

  1. LEO – DB2’s LEarning Optimizer [VLDB 11]
  2. Predicting query execution time: are optimizer cost models really unusable? [ICDE 13]
  3. Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads [VLDB 13]
  4. Forecasting the cost of processing multi-join queries via hashing for main-memory databases [SoCC 15]
  5. Query Performance Prediction for Concurrent Queries using Graph Embedding [VLDB 20]
  6. Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload [arXiv 21]
  7. Rethinking Learned Cost Models: Why Start from Scratch? [SIGMOD 24]
  8. Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD 24]

View

  1. Foreign Keys Open the Door for Faster Incremental View Maintenance [SIGMOD 23]

Survey

  1. How Good Are Query Optimizers, Really? [VLDB 15]
  2. Cardinality Estimation: An Experimental Survey [VLDB 17]
  3. A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration [VLDB 21]
  4. Have query optimizers hit the wall? [VLDB Journal 22]
  5. Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [VLDB 22]
  6. Data dependencies for query optimization: a survey [VLDB Journal 22]
  7. Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis [VLDB 23]

Index

  1. SQL Server Column Store Indexes [SIGMOD 11]
  2. Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation [SIGMOD 18]

Query Execution

  1. MonetDB/X100: Hyper-Pipelining Query Execution [CIDR 05]
  2. Materialization Strategies in the Vertica Analytic Database: Lessons Learned [ICDE 13]
  3. Rethinking SIMD Vectorization for In-Memory Databases [SIGMOD 15]
  4. Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? [SIGMOD 17]
  5. Building Advanced SQL Analytics From Low-Level Plan Operators [SIGMOD 21]
  6. SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms [VLDB 22]
  7. ChainedFilter: Combining Membership Filters by Chain Rule [SIGMOD 24]
  8. Saving Money for Analytical Workloads in the Cloud [VLDB 24]
  9. Adaptive and Robust Query Execution for Lakehouses at Scale [VLDB 24]

Data Dependency Search

  1. Discovering Functional Dependencies through Hitting Set Enumeration [SIGMOD 24]

Query Compilation

  1. How to Architect a Query Compiler [SIGMOD 16]
  2. Adaptive Execution of Compiled Queries [ICDE 18]

Bugs Detection

  1. Detecting Logic Bugs of Join Optimizations in DBMS [SIGMOD 23 Best Paper]
  2. Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database Construction [VLDB 24]
  3. Keep It Simple: Testing Databases via Differential Query Plans [SIGMOD 24]
  4. Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer [ICSE 24]
  5. Plume: Efficient and Complete Black-Box Checking of Weak Isolation Levels [OOPSLA2 2024]
  6. PUPPY: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction [ICSE 25]
  7. Understanding and Detecting SQL Function Bugs [EuroSys 25]
  8. Understanding and Reusing Test Suites Across Database Systems [SIGMOD 25]

Static Analysis

  1. Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach [PACMPL 24]

Storage

  1. What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines [VLDB 23]
  2. An Empirical Evaluation of Columnar Storage Formats [VLDB 24]

LSM-Tree

  1. Dissecting, Designing, and Optimizing LSM-based Data Stores [SIGMOD 22 Tutorial]
  2. Magma: A High Data Density Storage Engine Used in Couchbase [VLDB 22]
  3. CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure [SIGMOD 24]
  4. CAMAL: Optimizing LSM-trees via Active Learning [SIGMOD 25]

Proxy

  1. Tigger: A Database Proxy That Bounces With User-Bypass [VLDB 23]

Data Loading

  1. ConnectorX: Accelerating Data Loading From Databases to Dataframes [VLDB 22]

Database Kernel

  1. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics [CIDR 21]
  2. Disaggregated Database Systems [VLDB 23 Tutorial]
  3. GPU Database Systems Characterization and Optimization [VLDB 24]
  4. The Art of Latency Hiding in Modern Database Engines [VLDB 24]
  5. DoppelGanger++: Towards Fast Dependency Graph Generation for Database Replay [SIGMOD 24]

Others

MVCC

  1. Scalable Garbage Collection for In-Memory MVCC Systems [VLDB 13]
  2. Rethinking serializable multiversion concurrency control [VLDB 15]
  3. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control [VLDB 17]
  4. Accelerating Analytical Processing in MVCC using Fine-Granular High-Frequency Virtual Snapshotting [SIGMOD 18]
  5. Long-lived Transactions Made Less Harmful [SIGMOD 20]
  6. Rethink the Scan in MVCC Databases [SIGMOD 21]
  7. Diva: Making MVCC Systems HTAP-Friendly [SIGMOD 22]
  8. Memory-Optimized Multi-Version Concurrency Control for Disk-Based Database Systems [VLDB 22]
  9. Scalable and Robust Snapshot Isolation for High-Performance Storage Engines [VLDB 23]
  10. One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version Storage [SIGMOD 23]

HTAP

System Architecture

Linear Consistency
  1. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots [ICDE 12]
  2. TiDB: A raft-based htap database [VLDB 20]
  3. OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster [VLDB 23]
Sequential Consistency
  1. BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications [SIGMOD 17]
  2. F1 Lightning: HTAP as a Service [VLDB 20]
  3. Retrofitting High Availability Mechanism to Tame Hybrid Transaction/Analytical Processing [ATC 21]
  4. ByteHTAP: ByteDance’s HTAP System with High Data Freshness and Strong Data Consistency [VLDB 22]
Session Consistency
  1. PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database [VLDB 18]
  2. PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba [SIGMOD 23]
Survey
  1. HTAP Databases: What is New and What is Next [SIGMOD 22]
  2. Data Sharing Model and Optimization Strategies in HTAP Database Systems [Journal of Software 23]
  3. HTAP Databases: A Survey [TKDE 24]
  4. A survey on hybrid transactional and analytical processing [VLDB Journal 24]
  5. Survey on Benchmarking Ability of HTAP Benchmarks [Journal of Software 24]

Kernel Optimization

  1. TiQuE: Improving the Transactional Performance of Analytical Systems for True Hybrid Workloads [VLDB 23]
  2. Log Replaying for Real-Time HTAP: An Adaptive Epoch-based Two-Stage Framework [ICDE 24]
  3. Two Birds With One Stone: Designing a Hybrid Cloud Storage Engine for HTAP [VLDB 24]

Result Replay

  1. DoppelGanger++: Towards Fast Dependency Graph Generation for Database Replay [SIGMOD 24]

Benchmark

OLTP

  1. Dike: A Benchmark Suite for Distributed Transactional Databases [SIGMOD 23]
  2. DBPA: A Benchmark for Transactional Database Performance Anomalies [SIGMOD 23]

OLAP

  1. Why You Should Run TPC-DS: A Workload Analysis [VLDB 07]
  2. The Making of TPC-DS [VLDB 06]
  3. TPC-DS, Taking Decision Support Benchmarking to the Next Level [SIGMOD 02]
  4. Generating Thousands of Benchmark Queries in Seconds [VLDB 04]

HTAP

  1. How Good is My HTAP System? [SIGMOD 22]
  2. OLxPBench: Real-time, Semantically Consistent, and Domain-specific are Essential in Benchmarking, Designing, and Implementing HTAP Systems [ICDE 22]

Others

  1. M2Bench: A Database Benchmark for Multi-Model Analytic Workloads [VLDB 23]
  2. Cloud Analytics Benchmark [VLDB 23]
  3. Pollock: A Data Loading Benchmark [VLDB 23]
  4. VeriBench: Analyzing the Performance of Database Systems with Verifiability [VLDB 23]
  5. TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications [VLDB 23]
  6. CDSBen: Benchmarking the Performance of Storage Services in Cloud-native Database System at ByteDance [VLDB 23]
  7. FEBench: A Benchmark for Real-Time Relational Data Feature Extraction [VLDB 23]
  8. TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems [VLDB 23]
  9. ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems [VLDB 23]

Multi-Model

  1. Multi-model Databases: A New Journey to Handle the Variety of Data [CSUR 19]
  2. M2Bench: A Database Benchmark for Multi-Model Analytic Workloads [VLDB 23]
  3. MMSBench-Net: Scenario-Based Evaluation of Multi-Model Database Systems [23]
  4. MMDBench: A Benchmark for Hybrid Query in Multimodal Database [24]

Time Series

  1. An Experimental Evaluation of Anomaly Detection in Time Series [VLDB 24]

Vector Database

Survey

  1. Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL [ICDE 24]
  2. Survey of Vector Database Management Systems [VLDBJ 24]
  3. Vector Database Management Techniques and Systems [SIGMOD 24]

Algorithm

  1. FlowWalker: A Memory-efficient and High-performance GPU-based Dynamic Graph Random Walk Framework [VLDB 24]

Distributed Systems

  1. Consistency in Non-Transactional Distributed Storage Systems [arXiv 15]
  2. NOC-NOC: Towards Performance-optimal Distributed Transactions [SIGMOD 24]
  3. Native Distributed Databases: Problems, Challenges and Opportunities [VLDB 24 Tutorial]

Star History

Star History Chart