Tuesday, 25 August 2015

Uber: Migrating from Postgres to Sharded Datastore on top of Mysql + Lambda Architecture (append only updates)

Project Mezzanine: The Great Migration


The first design decision was the choice of database for the tripstore. Our short list of requirements were:
  • Operationally robust (no data loss, supports backup, replication to secondary data centers, easy to troubleshoot, predictable, operational expertise).
  • Horizontally scalable both in storage capacity and IOPS.
  • High write-availability. We always want to be able to persist a trip to stable storage. It’s okay to trade-off short-term read-availability, as the backend is working mostly in a batch-oriented fashion.
  • Secondary index support. Trips are looked up by user, city, and summarized in various ways.
  • No downtime for any operation (expanding storage, backup, adding indexes, adding data, and so forth).
The last item in the list was addressing a very immediate pain point. The trips table in PostgreSQL had grown so big that any operation that needed to add a new column or add a new index caused downtime. This made it increasingly cumbersome to develop new features.
We decided that a column-oriented, schemaless approach where data (JSON blobs) are organized in a grid indexed by trip-UUID, column name, and optionally a timestamp would work well as an overall data model. The model lends itself naturally to horizontal scaling by partitioning the rows across multiple shards, and supports our rapid development culture by being schemaless. New columns can be added, and new fields can be added to a column with no reconfiguration.
We evaluated various NoSQL-style databases with the above characteristics. However, we didn’t feel confident they were a good fit for storing our trip data, because of either our operational experience or the product’s maturity.
Inspired by blog posts, such as this one from FriendFeed, we decided to build our own simple, sharded datastore on top of MySQL. The key characteristics of the system we built are:
  • Sharding: Rows are sharded into a fixed set of shards, decided at setup time. Typically, we use 4096. Each shard corresponds to a MySQL tablespace, and the shards are distributed across a number of MySQL servers. Shards can be moved between MySQL servers for load-balancing, and the capacity can be increased online. We typically expand by splitting each MySQL server in two.
  • Append-only (no updates) data model: It only supports an append-only data model where a cell can never be modified after it is written. This is very useful for a system that stores transactional data and wants to guard against data corruption. By being append-only, modifications are naturally idempotent and commutative. The latter means that we can replay updates in any order and get the same result. (We learned later that the append-only style is also advocated by the lambda architecture.)
  • Buffered writes. If the shard where a cell needs to be written to is unavailable (or slow), we write the data to a pending table in any other available MySQL server. These are then later replayed once the shard becomes available. Due to the idempotent and commutative data model, this is always safe and does not require cross-host coordination.
  • Sharded secondary indexes: Indexes can be created on multiple fields in the columns and are sharded on a specific key (e.g., user uuid). They are implemented as MySQL tables and backfilled in the background. In case we need to change the index (e.g., adding a field), we can create a new version, backfill it, and then switch to the new version by changing an index alias, all without application downtime.
The whole system, which we simply call Schemaless in homage to its design, is written in Python. The initial version took about 5 months from idea to production deployment, and we will describe specific implementation details in future blog posts