```xml
<dependency>
    <groupId>io.datarouter</groupId>
    <artifactId>datarouter-storage</artifactId>
    <version>0.0.126</version>
</dependency>
```
Datarouter Storage holds the classes and interfaces that let you persist your Databeans to different datastores via runtime Client implementations such as datarouter-mysql, datarouter-aws-sqs, datarouter-memcached, datarouter-redis, etc.
datarouter-storage was the original motivation for Datarouter. The design decisions behind it are explained below.
The storage framework is composed of several constructs:
- Databean: the primary unit of storage
- PrimaryKey: an object that uniquely identifies a databean
- IndexEntry: a databean representing a secondary index entry
- Node: an interface through which you put, get, scan, and delete databeans
- Client: a physical connection to a datastore
- Datarouter: a container for clients and nodes
- Dao: a class that contains and interacts with nodes
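As a rough sketch of how these constructs fit together (the class and method shapes below are simplified illustrations, not the exact Datarouter API):

```java
import java.util.List;
import java.util.Optional;

// Illustrative shapes only; the real base classes live in datarouter-storage.
// PrimaryKey: the fields that uniquely identify and sort a databean.
record PhoneBookEntryKey(String lastName, String firstName){}

// Databean: the primary unit of storage, pairing a PK with other fields.
record PhoneBookEntry(PhoneBookEntryKey key, String phoneNumber){}

// Node: the interface the application calls to put, get, scan, and delete.
interface MapStorageNode<PK, D>{
    void put(D databean);
    Optional<D> get(PK key);
    List<D> getMulti(List<PK> keys);
    void delete(PK key);
}
```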
Guiding principles behind the design:
- internally rely more on normal compile-safe Java constructs like interfaces and composition rather than “convenient” reflection and annotations. This makes datarouter code easier for users to understand and debug
- encourage strong typing that helps refactoring and keeping a nimble code base, both in user applications and datarouter itself
- subclasses of Databean are the primary data definition language
- take advantage of refactoring tools like Eclipse when the Databeans change
- a schema-update utility helps to create tables and add and drop columns and indexes
- keep all field and index definitions together in the Databean file to minimize mistakes when adding, removing, or modifying fields
- encourage database access through simple methods that are common to many datastores
- put, get, delete, getMulti, getWithPrefix, scanKeys, scan, etc
- discourage joins or other complex queries that may not scale predictably
- but allow any type of query to the underlying datastore if it’s deemed necessary
- allow Databeans to be easily moved between or straddle datastores like MySQL, HBase and Memcached
- to keep an application modern as database technologies advance
- to avoid lock-in
- to transparently span multiple technologies, like easily putting a memcached cache in front of certain mysql tables
- hide tiered storage hierarchies behind a simple “Node” interface
- a PhysicalNode represents a storage location
- PhysicalNodes can be combined to form complex Node hierarchies:
- caching, replication, audit logging, buffering, redundant storage, horizontal or vertical partitioning, redundant read requests, etc
- these hierarchies are built using compiler-checked Java, but are more like configuration than code. Changes to Node hierarchies or the PhysicalNodes backing them are transparent to the application code using the Nodes (see the sketch after this list)
- natively support secondary indexes common to relational and other databases
- define indexes with strongly typed IndexEntry databeans
- enforce proper index usage by developers and make it explicit where code relies on an index for performance
- allow iteration of Databeans in order of some indexed columns
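As a hedged sketch of that Node-composition idea (the Node interface here is hypothetical; Datarouter ships its own caching and replication wrappers):

```java
import java.util.Optional;

// Hypothetical minimal node interface, for illustration only.
interface Node<PK, D>{
    Optional<D> get(PK key);
    void put(D databean);
}

// A read-through cache composed in front of a backing node. To application
// code this is just another Node, so inserting or removing the cache layer
// is closer to configuration than a business-logic change.
class CachingNode<PK, D> implements Node<PK, D>{
    private final Node<PK, D> cache;   // e.g. a memcached-backed node
    private final Node<PK, D> backing; // e.g. a mysql-backed node

    CachingNode(Node<PK, D> cache, Node<PK, D> backing){
        this.cache = cache;
        this.backing = backing;
    }

    @Override
    public Optional<D> get(PK key){
        Optional<D> cached = cache.get(key);
        if(cached.isPresent()){
            return cached;
        }
        Optional<D> loaded = backing.get(key);
        loaded.ifPresent(cache::put); // populate the cache on a miss
        return loaded;
    }

    @Override
    public void put(D databean){
        backing.put(databean);
        cache.put(databean); // keep the cache fresh on writes
    }
}
```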
Datarouter is an opinionated framework. This section explains the reasons behind some of the design choices, including why datarouter-storage interfaces seem very restrictive compared to the SQL language. To help explain, it describes the internals of MySQL in a simplified way and compares with other Java ORMs.
A schema definition is held in memory for each table. Each row, containing numbers, strings, dates, etc., is serialized into a byte array. The rows are sorted by the primary key fields. Sorting is interpreted through the schema, applying data types, charsets, collations, etc.
Rows are grouped into pages with a max size of 16 KB, roughly the size of a page in a phone book. Any modification causes the whole page to be rewritten and stored in the InnoDB page cache (the buffer pool). The page cache can be many GB, and dirty pages are flushed to disk in the background. A row insertion that pushes a page past 16 KB splits the page in half, leaving ~8 KB of empty space at the end of each page. Picture a phone book where the bottom 1/4 of each page is blank. This fragmentation is rarely corrected; fixing it requires a manual and expensive OPTIMIZE TABLE.
Queries served by the page cache can be answered relatively quickly, while queries needing many disk reads are much slower. A query reading one row will load the whole page into memory. Pages are located using a B+ Tree index. The B+ Tree index is composed of more pages.
If 100 primary keys fit in a page, then a 5,000,000 row table will have:
- 50,000 leaf pages
- 500 level 1 pages
- 5 level 2 pages
- 1 level 3 page containing 5 PKs
The higher level PK index pages tend to be accessed frequently and remain in the InnoDB page cache.
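A quick back-of-the-envelope check of those numbers (a sketch assuming a uniform 100 entries per page):

```java
// Page counts for a B+ Tree over 5,000,000 rows with ~100 entries per page.
public class BtreePageMath{
    public static void main(String[] args){
        long rows = 5_000_000;
        long entriesPerPage = 100;
        long leafPages = rows / entriesPerPage;          // 50,000
        long level1Pages = leafPages / entriesPerPage;   // 500
        long level2Pages = level1Pages / entriesPerPage; // 5
        // level 3: a single root page holding 5 PKs
        System.out.println(leafPages + " / " + level1Pages + " / " + level2Pages);
    }
}
```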
A secondary index is a separate physical table governed by the same schema as the primary table but with a subset of the columns and a different sort order. Each secondary index row must contain all of the columns of the PK; InnoDB silently appends any PK columns not explicitly listed in the index. A “covering” index query selects data only from the index fields. A non-covering query will use the PK fields to do a B+ Tree lookup in the parent table.
Older disks (HDDs) can read sequentially at 160 MBps, or 10,000 InnoDB pages/sec, while random access is limited to ~100 pages/sec. SSDs are much better at random access, but the OS is still optimized for sequential reads, which remain roughly 10x faster. The database can pre-fetch pages that will be needed by a large query, and the OS can combine many small fetches into fewer large fetches and do further pre-fetching. SSDs will also last longer if you write to them sequentially.
Example query costs with 100 rows per page:
- Fetching 10,000 sequential rows will fetch 100 pages
- Fetching 10,000 random rows may fetch 10,000 pages
- Fetching 10,000 covering index rows will fetch 100 index pages
- Fetching 10,000 non-covering index rows will fetch 100 index pages plus as many as 10,000 parent pages
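Combining those page counts with the HDD figures above gives a rough sense of latency (a sketch; real numbers vary with hardware and cache state):

```java
// Rough HDD fetch-time estimates: ~10,000 sequential pages/sec vs ~100 random pages/sec.
public class QueryCostMath{
    public static void main(String[] args){
        double seqPagesPerSec = 10_000;
        double randPagesPerSec = 100;
        // 10,000 sequential rows -> 100 pages
        System.out.printf("sequential: ~%.0f ms%n", 100 / seqPagesPerSec * 1000); // ~10 ms
        // 10,000 random rows -> up to 10,000 pages
        System.out.printf("random: ~%.0f s%n", 10_000 / randPagesPerSec);         // ~100 s
    }
}
```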
Datarouter always brings the full PK or Databean over the network, even when you don’t need all of the columns, so use only the PK when possible. This simplifies the application by dealing only in strongly-typed databeans rather than specifying individual columns all over the code. InnoDB is a row-oriented datastore, keeping all columns of a row serialized together: if you select only one column, you still incur the expense of fetching all other columns from the database’s disk and storing them in the page cache. Since the work of fetching the unneeded columns from disk and storing them in memory is already done, we might as well send them back to the client.
What about very large columns? InnoDB has an optimization to store these in special pages that can be lazily fetched. We could write custom SQL that omits the large columns, but this breaks our databean abstraction. A better option can be to move the large columns to a separate databean/table that is accessed explicitly when needed.
Network usage is not generally an issue, but if it is, it’s likely due to one or just a few out of hundreds or thousands of queries. The problematic query could be optimized with custom SQL leaving the many other queries with the benefits of the databean abstraction.
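A hedged sketch of the separate-databean approach to large columns (all names here are hypothetical):

```java
// Keep the large body out of the hot table so common queries never pay to
// fetch it; load it explicitly only when needed.
record ArticleKey(String articleId){}
record Article(ArticleKey key, String title, String author){} // small, hot columns

record ArticleBodyKey(String articleId){}
record ArticleBody(ArticleBodyKey key, String htmlBody){}     // large, fetched explicitly
```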
Let’s use a phone book as an example:
- Primary key fields: lastName, firstName
- Other fields: phoneNumber
- Bytes per record: ~25
- Records per page: ~400
- Bytes per page: ~10,000
Supported methods:
MapStorage
- exists: return true if there is a record for the person
- getMultiKeys: return the PKs that exist
- get: get a single person by PK (lastName, firstName) including the phone number
- getMulti: get multiple records by PK (this may read from many database pages)
- scanMulti: getMulti returned as a Scanner
SortedStorage
- scan: return all records from startKey to stopKey
- scanKeys: return all (lastName, firstName) PKs from startKey to stopKey
- scanRanges: return all records from a collection of key ranges
- scanWithPrefix: return records based on one of these options:
  - lastName with wildcard suffix
  - full lastName
  - full lastName plus firstName with wildcard suffix
  - full lastName plus full firstName
Note that there is not a built-in method to return records by specifying the firstName. That’s because the query is not efficiently supported by the database - it would need to skip rows looking for a firstName match. While a query to collect the first 10 people named “Matthew” may execute quickly, a similar query for “Calixte” may scan millions of rows without finding 10 matches. Because of that, datarouter doesn’t support queries without knowing the leftmost portion of the PK.
Imagine executing the firstName query manually on a phone book - you’d have to look at every row on every page. It’s not fun for a person nor a database. Relational databases can magically answer any complex query you throw at them, but under the covers they may have to do a lot of expensive work.
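To see why the leftmost-prefix restriction matters, here is a sketch using a plain TreeMap as a stand-in for a sorted table (the real storage is InnoDB pages, but the access patterns are analogous):

```java
import java.util.Map;
import java.util.TreeMap;

public class PrefixScanDemo{
    public static void main(String[] args){
        // Sorted by (lastName, firstName), like the phone book PK.
        TreeMap<String, String> book = new TreeMap<>();
        book.put("Smith|John", "555-0101");
        book.put("Smith|Mary", "555-0102");
        book.put("Young|Calixte", "555-0103");

        // Prefix scan on the leftmost PK field: one seek, then sequential reads.
        for(Map.Entry<String, String> e : book.subMap("Smith|", "Smith|\uffff").entrySet()){
            System.out.println(e.getKey() + " " + e.getValue());
        }

        // Query by firstName only: no usable prefix, so every entry must be examined.
        book.entrySet().stream()
                .filter(e -> e.getKey().endsWith("|Calixte"))
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```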
Triggering expensive queries like this can quickly consume all the CPU and disk I/O on the machine, which does not scale well for a website with many users. These queries may appear to execute quickly during development with small data volumes, but they become more expensive as the data grows. If your phone book has only 15 people, the query appears fast, but as you get to millions or billions of people it becomes extremely inefficient.
Because the database only supports a limit on the number of matches, not on the number of skipped rows or total rows scanned, we can’t limit the cost of such a query. Further, because queries are executed in a transaction, this query may cause a lot of locking and/or versioning work that slows down other queries that touch the same rows. It’s stressful to worry about query performance degradation, and it’s time consuming to monitor query performance in a large application.
Datarouter applications will generally have no long-running queries.
While you can manually create secondary indexes in the database, it can be tricky to make sure they are kept up to date in an evolving application. Datarouter deliberately doesn't make it easy to trigger expensive queries; instead it focuses on helping you design the application to scale from the beginning.
There are two ways to query by a field that isn't a leftmost portion of the PK:
- Stream all rows in the table back to the client
- No extra code is needed
- This will return rows in batches to the client where further filtering can happen in Java
- Use this for infrequent queries that can wait for a potentially very slow response
- The downside is that this method will consume a full CPU and a lot of disk I/O, so you should limit the number of these table scans running against the database at any given time.
- Add a secondary index
  - Use this for a fast B+ Tree lookup in the secondary index
  - Create a special databean by subclassing IndexEntry. The schema-update utility will ensure that this index gets created in the database and is used at runtime.
  - As a developer using the IndexEntry, you’ll now have a clearer understanding of how much work you are asking the database to do.
  - Downsides:
    - The index must also exist, consuming disk and memory
    - It must also be maintained, causing updates to be slower
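A hedged sketch of the IndexEntry idea (class and field names here are hypothetical):

```java
// Hypothetical index entry for looking people up by phone number.
// The index key leads with phoneNumber but must also carry every PK field,
// so a non-covering read can hop back to the parent row by PK.
record PhoneBookByPhoneKey(String phoneNumber, String lastName, String firstName){}
record PhoneBookByPhoneEntry(PhoneBookByPhoneKey key){}

// Lookup path: scan the index by phoneNumber, then, if non-indexed columns
// are needed, fetch the parent databean by its PK (lastName, firstName).
```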
When creating a node, you generally want to specify the smallest interface with the operations you need.
- Start with MapStorage if you know the full keys of the data you want to retrieve. In the phone book example, if you always know the lastName and firstName, you have the full keys and can retrieve data with get(..) or getMulti(..).
- Expand the interface to SortedStorage if you don't know the full keys. For example, if you have a lastName but don't know the firstName, you can scan(..) through all people with that lastName.
- Expand the interface to IndexedStorage if you need to retrieve data without knowing the leftmost key fields. For example if you want to find a person with phoneNumber=1234567, you can create an IndexedNode with an index called byPhoneNumber, and use the index to locate the primary databean.
Choosing the smallest interface makes the access patterns of the table easier to understand when reading the code later. The smaller interfaces are compatible with more database implementations. For example, if you declare a MySQL node as MapStorage then you can easily infer that a Memcached layer can be added in front of the MySQL node. Adding Memcached in front of a SortedStorage node is possible but slightly more complicated.
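A sketch of declaring the narrowest interface (the interfaces and types below are simplified stand-ins for Datarouter's real MapStorage and node classes):

```java
import java.util.Optional;

// Simplified stand-ins for the storage interface and phone book types.
record PhoneBookEntryKey(String lastName, String firstName){}
record PhoneBookEntry(PhoneBookEntryKey key, String phoneNumber){}

interface MapStorage<PK, D>{
    Optional<D> get(PK key);
    void put(D databean);
}

class PhoneBookDao{
    // Declared as MapStorage: readers can tell at a glance that this table is
    // only ever accessed by full key, so a cache layer can slot in easily.
    private final MapStorage<PhoneBookEntryKey, PhoneBookEntry> node;

    PhoneBookDao(MapStorage<PhoneBookEntryKey, PhoneBookEntry> node){
        this.node = node;
    }

    Optional<PhoneBookEntry> find(String lastName, String firstName){
        return node.get(new PhoneBookEntryKey(lastName, firstName));
    }
}
```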
Traditional SQL databases encourage you to create foreign keys between tables, letting you write queries that can fetch data from multiple tables and return a combined result set. The number of possible result set formats can grow large and is often handled with weakly typed objects like Lists of rows containing Maps to column values. Under the hood, the database is still accessing the same data pages as if you ran two selects and joined the data in Java. The more complicated the join, the more work the database must do.
A single database is generally harder to scale than the application servers that use it, so it’s beneficial to move as much of the processing as possible to the application servers. With SQL it’s easy to add a few order-by clauses to the end of a query, but that sorting can be very expensive for the database to do. It can be better to pre-sort the data via primary and secondary indexes or to bring the data back to the scalable application tier before sorting.
All of the permutations of column selection, joins, filtering (skipping rows), and ordering lead to a large number of SQL statements that are generally harder to maintain than strongly-typed Java objects.
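A sketch of a join done at the application layer (databeans and field names are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical databeans living in two tables, possibly on different servers.
record UserKey(String userId){}
record User(UserKey key, String name){}
record OrderKey(String orderId){}
record Order(OrderKey key, String userId, long cents){}

public class AppJoin{
    // Two simple fetches combined in the scalable application tier instead of
    // a SQL join inside the database.
    static List<String> describeOrders(List<Order> orders, List<User> users){
        Map<String, User> usersById = users.stream()
                .collect(Collectors.toMap(u -> u.key().userId(), u -> u));
        return orders.stream()
                .map(o -> usersById.get(o.userId()).name() + " spent " + o.cents())
                .collect(Collectors.toList());
    }
}
```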
Traditional Java ORMs like Hibernate require that all database operations go through a session cache that holds a copy of all databeans involved with the session. This is required so that the ORM can follow links between databeans and perform fancier features like updating all the relatives of a databean. It’s designed for operating on small numbers of databeans involved in a transaction and can become prohibitively slow when many databeans are touched within a session. Batch processing a large number of records can bring the session cache to a halt, forcing the application to take responsibility for invalidating the cache during processing. Session cache management complicates the application code in terms of clearing the cache, but more importantly in terms of knowing the consequences of clearing the cache in a complicated data model. Lazy-loading of a databean's relatives is also complex, leading to slow queries in places you don’t expect them.
Datarouter loads and stores your data more explicitly and lets you hold onto a collection of databeans in the application if you choose, which is more predictable and debuggable and doesn’t tie your code to a specific ORM.
Traditional ORMs make the assumption that the application uses a single RDBMS. Many successful applications may start with a single RDBMS but soon find that they need to add replicated databases for extra read capacity. To get further performance, they need to add a caching layer like memcached or redis and a messaging system like SQS. These caching and messaging layers don’t speak SQL, forcing the developers to write custom code for each cache scenario or use more complicated ORM extensions or plugins. As data grows further, they need to move some large or frequently accessed tables to separate database machines.
These replicated databases, caches, messaging queues, and multiple databases break the original assumptions of the ORM. Hand-crafted SQL that joins two tables together must be rewritten into multiple queries so the join is done in the application. The links between databeans that were configured into the data model (usually via annotations) must be removed, and code that relied on them must be modified.
A feature of the ORM is to make these links transparent to the application code, but this also makes them harder to remove as lazy-loading can happen in unexpected places outside of your DAO layer. Because the ORM does most of its work inside transactions, the application is using transactions everywhere even though many of them aren’t needed, so during this splitting of databases you must be careful to identify where transactions were actually necessary.
Datarouter advocates for assuming that all tables are on separate database servers from the start. Joins are done at the application layer, which doesn't care which machines the data resides on or whether those machines change. Transactions, where necessary, are explicitly coded into the application, forcing you to think ahead of time about where they are needed and making it easier to reason about which tables can be split apart. Datarouter provides a replication abstraction that can be inserted without changing business logic. Queries that can tolerate stale data can pass an anyDelay parameter even if no replicated databases exist yet. The caching layer is accessed with the same put, get, and delete operations as the RDBMS, making it easy to insert caching without changing the business logic.
This library is licensed under the Apache License, Version 2.0 - see LICENSE for details.