A curated list of awesome papers on Online Analytical Processing (OLAP) database systems, theory, frameworks, resources, tools and other awesomeness, for database researchers and engineers.
This repository is under construction. New PRs are welcome; please follow this entry format:
Paper name (with PDF link) [Venue Year] GitHub link if the code is open source (optional)
Thanks to all authors of the papers and repositories I cite :D
- Awesome-OLAP-Paper
- QAGen: Generating Query-Aware Test Databases [SIGMOD 07]
- Generating Targeted Queries for Database Testing [SIGMOD 08]
- Generating Databases for Query Workloads [VLDB 10]
- Data Generation using Declarative Constraints [SIGMOD 11]
- MyBenchmark: generating databases for query workloads [VLDB 14]
- Scalable and Dynamic Regeneration of Big Data Volumes [EDBT 18]
- Touchstone: Generating Enormous Query-Aware Test Databases [ATC 18]
- Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
- Projection-Compliant Database Generation [VLDB 22]
- SAM: Database Generation from Query Workloads with Supervised Autoregressive Models [SIGMOD 22]
- Mirage: Generating Enormous Databases for Complex Workloads [ICDE 24]
- PrivSyn: Differentially Private Data Synthesis [USENIX Security 21]
- Synthesizing Linked Data Under Cardinality and Integrity Constraints [SIGMOD 21]
- Data Synthesis via Differentially Private Markov Random Fields [VLDB 21]
- PrivLava: Synthesizing Relational Data with Foreign Keys under Differential Privacy [SIGMOD 23]
- Privacy-Enhanced Database Synthesis for Benchmark Publishing [arXiv 24]
- Self-Tuning Query Scheduling for Analytical Workloads [SIGMOD 21]
- Memory Efficient Scheduling of Query Pipeline Execution [CIDR 22]
- LSched: A Workload-Aware Learned Query Scheduler for Analytical Database Systems [SIGMOD 22]
- Rotary: A Resource Arbitration Framework for Progressive Iterative Analytics [ICDE 23]
- Sampling-Based Query Re-Optimization [SIGMOD 16]
- Kepler: Robust Learning for Parametric Query Optimization [SIGMOD 23]
- Rethink Query Optimization in HTAP Databases [SIGMOD 24]
- Optimizing Nested Recursive Queries [SIGMOD 24]
- Efficient Enumeration of Recursive Plans in Transformation-based Query Optimizers [VLDB 24]
- ROME: Robust Query Optimization via Parallel Multi-Plan Execution [SIGMOD 24]
- Presto’s History-based Query Optimizer [VLDB 24]
- QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting [VLDB 23]
- SlabCity: Whole-Query Optimization using Program Synthesis [VLDB 23]
- GEqO: ML-Accelerated Semantic Equivalence Detection [SIGMOD 24]
- Proving Query Equivalence Using Linear Integer Arithmetic [SIGMOD 24]
- QED: A Powerful Query Equivalence Decider for SQL [VLDB 24]
- VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints [OOPSLA 24]
- Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries [SIGMOD 88]
- Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results [TODS 93]
- On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications [ICDT 99]
- Independence is good: Dependency-based histogram synopses for high-dimensional data [SIGMOD 01]
- STHoles: a multidimensional workload-aware histogram [SIGMOD 01]
- A multi-dimensional histogram for selectivity estimation and fast approximate query answering [CASCON 03]
- The history of histograms (abridged) [VLDB 03]
- ISOMER: Consistent histogram construction using query feedback [ICDE 06]
- Join Over Histograms [Alberto Dell'Era 07]
- Improving accuracy and robustness of self-tuning histograms by subspace clustering [ICDE 16]
- LHist: Towards Learning Multidimensional Histogram for Massive Spatial Data [ICDE 21]
- Two-Level Sampling for Join Size Estimation [SIGMOD 17]
- Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing [SIGMOD 21]
- Access path selection in a relational database management system [SIGMOD 79]
- Approximating multi-dimensional aggregate range queries over real attributes [SIGMOD 00]
- Selectivity estimators for multidimensional range queries over real attributes [VLDB 05]
- Plan Bouquets: Query Processing without Selectivity Estimation [SIGMOD 14]
- Exact Cardinality Query Optimization with Bounded Execution Cost [SIGMOD 19]
- JoinSketch: A Sketch Algorithm for Accurate and Unbiased Inner-Product Estimation [SIGMOD 23]
- Efficient and Effective Cardinality Estimation for Skyline Family [SIGMOD 23]
- Preventing bad plans by bounding the impact of cardinality estimation errors [VLDB 09]
- Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server [VLDB 23]
- Join Order Selection with Deep Reinforcement Learning: Fundamentals, Techniques, and Challenges [VLDB 23]
- Efficiently Computing Join Orders with Heuristic Search [SIGMOD 23]
- Ready to Leap (by Co-Design)? Join Order Optimisation on Quantum Hardware [SIGMOD 23]
- Quantum-Inspired Digital Annealing for Join Ordering [VLDB 24]
- POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance [VLDB 24]
- Sub-optimal Join Order Identification with L1-error [SIGMOD 24]
- Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems [VLDB 12]
- Leapfrog Triejoin: a worst-case optimal join algorithm [ICDT 12]
- An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory [SIGMOD 16]
- Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems [SIGMOD 18]
- Adopting Worst-Case Optimal Joins in Relational Database Systems [VLDB 20]
- Free Join: Unifying Worst-Case Optimal and Traditional Joins [arXiv 23]
- Reservoir Sampling over Joins [SIGMOD 24]
- LEO – DB2’s LEarning Optimizer [VLDB 01]
- Predicting query execution time: are optimizer cost models really unusable? [ICDE 13]
- Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads [VLDB 13]
- Forecasting the cost of processing multi-join queries via hashing for main-memory databases [SoCC 15]
- Query Performance Prediction for Concurrent Queries using Graph Embedding [VLDB 20]
- Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload [arXiv 21]
- Rethinking Learned Cost Models: Why Start from Scratch? [SIGMOD 24]
- Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD 24]
- How Good Are Query Optimizers, Really? [VLDB 15]
- Cardinality Estimation: An Experimental Survey [VLDB 17]
- A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration [VLDB 21]
- Have query optimizers hit the wall? [VLDB Journal 22]
- Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [VLDB 22]
- Data dependencies for query optimization: a survey [VLDB Journal 22]
- Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis [VLDB 23]
- SQL Server Column Store Indexes [SIGMOD 11]
- Column Sketches: A Scan Accelerator for Rapid and Robust Predicate Evaluation [SIGMOD 18]
- MonetDB/X100: Hyper-Pipelining Query Execution [CIDR 05]
- Materialization Strategies in the Vertica Analytic Database: Lessons Learned [ICDE 13]
- Rethinking SIMD Vectorization for In-Memory Databases [SIGMOD 15]
- Access Path Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe? [SIGMOD 17]
- Building Advanced SQL Analytics From Low-Level Plan Operators [SIGMOD 21]
- SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms [VLDB 22]
- ChainedFilter: Combining Membership Filters by Chain Rule [SIGMOD 24]
- Saving Money for Analytical Workloads in the Cloud [VLDB 24]
- Adaptive and Robust Query Execution for Lakehouses at Scale [VLDB 24]
- How to Architect a Query Compiler [SIGMOD 16]
- Adaptive Execution of Compiled Queries [ICDE 18]
- Detecting Logic Bugs of Join Optimizations in DBMS [SIGMOD 23 Best Paper]
- Detecting Metadata-Related Logic Bugs in Database Systems via Raw Database Construction [VLDB 24]
- Keep It Simple: Testing Databases via Differential Query Plans [SIGMOD 24]
- Sedar: Obtaining High-Quality Seeds for DBMS Fuzzing via Cross-DBMS SQL Transfer [ICSE 24]
- Plume: Efficient and Complete Black-Box Checking of Weak Isolation Levels [OOPSLA 24]
- PUPPY: Finding Performance Degradation Bugs in DBMSs via Limited-Optimization Plan Construction [ICSE 25]
- Understanding and Detecting SQL Function Bugs [EuroSys 25]
- Understanding and Reusing Test Suites Across Database Systems [SIGMOD 25]
- What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines [VLDB 23]
- An Empirical Evaluation of Columnar Storage Formats [VLDB 24]
- Dissecting, Designing, and Optimizing LSM-based Data Stores [SIGMOD 22 Tutorial]
- Magma: A High Data Density Storage Engine Used in Couchbase [VLDB 22]
- CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure [SIGMOD 24]
- CAMAL: Optimizing LSM-trees via Active Learning [SIGMOD 25]
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics [CIDR 21]
- Disaggregated Database Systems [VLDB 23 Tutorial]
- GPU Database Systems Characterization and Optimization [VLDB 24]
- The Art of Latency Hiding in Modern Database Engines [VLDB 24]
- DoppelGanger++: Towards Fast Dependency Graph Generation for Database Replay [SIGMOD 24]
- Scalable Garbage Collection for In-Memory MVCC Systems [VLDB 19]
- Rethinking serializable multiversion concurrency control [VLDB 15]
- An Empirical Evaluation of In-Memory Multi-Version Concurrency Control [VLDB 17]
- Accelerating Analytical Processing in MVCC using Fine-Granular High-Frequency Virtual Snapshotting [SIGMOD 18]
- Long-lived Transactions Made Less Harmful [SIGMOD 20]
- Rethink the Scan in MVCC Databases [SIGMOD 21]
- Diva: Making MVCC Systems HTAP-Friendly [SIGMOD 22]
- Memory-Optimized Multi-Version Concurrency Control for Disk-Based Database Systems [VLDB 22]
- Scalable and Robust Snapshot Isolation for High-Performance Storage Engines [VLDB 23]
- One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version Storage [SIGMOD 23]
- HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots [ICDE 11]
- TiDB: A Raft-based HTAP Database [VLDB 20]
- OceanBase Paetica: A Hybrid Shared-Nothing/Shared-Everything Database for Supporting Single Machine and Distributed Cluster [VLDB 23]
- BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications [SIGMOD 17]
- F1 Lightning: HTAP as a Service [VLDB 20]
- Retrofitting High Availability Mechanism to Tame Hybrid Transaction/Analytical Processing [OSDI 21]
- ByteHTAP: ByteDance’s HTAP System with High Data Freshness and Strong Data Consistency [VLDB 22]
- PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database [VLDB 18]
- PolarDB-IMCI: A Cloud-Native HTAP Database System at Alibaba [SIGMOD 23]
- HTAP Databases: What is New and What is Next [SIGMOD 22]
- Data Sharing Model and Optimization Strategies in HTAP Database Systems [Journal of Software 23]
- HTAP Databases: A Survey [TKDE 24]
- A survey on hybrid transactional and analytical processing [VLDB Journal 24]
- Survey on Benchmarking Ability of HTAP Benchmarks [Journal of Software 24]
- TiQuE: Improving the Transactional Performance of Analytical Systems for True Hybrid Workloads [VLDB 23]
- Log Replaying for Real-Time HTAP: An Adaptive Epoch-based Two-Stage Framework [ICDE 24]
- Two Birds With One Stone: Designing a Hybrid Cloud Storage Engine for HTAP [VLDB 24]
- Dike: A Benchmark Suite for Distributed Transactional Databases [SIGMOD 23]
- DBPA: A Benchmark for Transactional Database Performance Anomalies [SIGMOD 23]
- Why You Should Run TPC-DS: A Workload Analysis [VLDB 07]
- The Making of TPC-DS [VLDB 06]
- TPC-DS, Taking Decision Support Benchmarking to the Next Level [SIGMOD 02]
- Generating Thousands of Benchmark Queries in Seconds [VLDB 04]
- How Good is My HTAP System? [SIGMOD 22]
- OLxPBench: Real-time, Semantically Consistent, and Domain-specific are Essential in Benchmarking, Designing, and Implementing HTAP Systems [ICDE 22]
- M2Bench: A Database Benchmark for Multi-Model Analytic Workloads [VLDB 23]
- Cloud Analytics Benchmark [VLDB 23]
- Pollock: A Data Loading Benchmark [VLDB 23]
- VeriBench: Analyzing the Performance of Database Systems with Verifiability [VLDB 23]
- TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications [VLDB 23]
- CDSBen: Benchmarking the Performance of Storage Services in Cloud-native Database System at ByteDance [VLDB 23]
- FEBench: A Benchmark for Real-Time Relational Data Feature Extraction [VLDB 23]
- TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems [VLDB 23]
- ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems [VLDB 23]
- Multi-model Databases: A New Journey to Handle the Variety of Data [CSUR 19]
- M2Bench: A Database Benchmark for Multi-Model Analytic Workloads [VLDB 23]
- MMSBench-Net: Scenario-Based Evaluation of Multi-Model Database Systems [23]
- MMDBench: A Benchmark for Hybrid Query in Multimodal Database [24]
- Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL [ICDE 24]
- Survey of Vector Database Management Systems [VLDB Journal 24]
- Vector Database Management Techniques and Systems [SIGMOD 24]
- FlowWalker: A Memory-efficient and High-performance GPU-based Dynamic Graph Random Walk Framework [VLDB 24]
- Consistency in Non-Transactional Distributed Storage Systems [arXiv 15]
- NOC-NOC: Towards Performance-optimal Distributed Transactions [SIGMOD 24]
- Native Distributed Databases: Problems, Challenges and Opportunities [VLDB 24 Tutorial]