Tuesday, 25 August 2015

Uber: Migrating from Postgres to a Sharded Datastore on top of MySQL + Lambda Architecture (append-only updates)

Project Mezzanine: The Great Migration


The first design decision was the choice of database for the tripstore. Our short list of requirements was:
  • Operationally robust (no data loss, supports backup, replication to secondary data centers, easy to troubleshoot, predictable, operational expertise).
  • Horizontally scalable both in storage capacity and IOPS.
  • High write-availability. We always want to be able to persist a trip to stable storage. It’s okay to trade-off short-term read-availability, as the backend is working mostly in a batch-oriented fashion.
  • Secondary index support. Trips are looked up by user, city, and summarized in various ways.
  • No downtime for any operation (expanding storage, backup, adding indexes, adding data, and so forth).
The last item in the list addressed a very immediate pain point. The trips table in PostgreSQL had grown so big that any operation that needed to add a new column or add a new index caused downtime. This made it increasingly cumbersome to develop new features.
We decided that a column-oriented, schemaless approach where data (JSON blobs) are organized in a grid indexed by trip-UUID, column name, and optionally a timestamp would work well as an overall data model. The model lends itself naturally to horizontal scaling by partitioning the rows across multiple shards, and supports our rapid development culture by being schemaless. New columns can be added, and new fields can be added to a column with no reconfiguration.
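As a rough, in-memory illustration of that model (the class and method names below are hypothetical, not Uber's actual API), each cell is addressed by (row UUID, column name, timestamp) and stores an opaque JSON blob, so adding new columns or new fields requires no schema change:

```python
import json
import time
import uuid

# A minimal in-memory sketch of the cell-oriented model described above.
class CellStore:
    def __init__(self):
        self.cells = {}  # (row_uuid, column, timestamp) -> JSON blob

    def put(self, row_uuid, column, body, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.cells[(row_uuid, column, ts)] = json.dumps(body)

    def get_latest(self, row_uuid, column):
        matches = [(ts, blob) for (r, c, ts), blob in self.cells.items()
                   if r == row_uuid and c == column]
        return json.loads(max(matches)[1]) if matches else None


trip_id = uuid.uuid4()
store = CellStore()
store.put(trip_id, "BASE", {"city": "SF", "status": "completed"})
store.put(trip_id, "RATING", {"stars": 5})   # a new column, no reconfiguration
print(store.get_latest(trip_id, "RATING"))   # {'stars': 5}
```

In the real system these cells live in MySQL tables partitioned across shards rather than in a Python dict, but the addressing scheme is the same.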
We evaluated various NoSQL-style databases with the above characteristics. However, we didn’t feel confident they were a good fit for storing our trip data, either because we lacked operational experience with them or because of doubts about their maturity.
Inspired by blog posts, such as this one from FriendFeed, we decided to build our own simple, sharded datastore on top of MySQL. The key characteristics of the system we built are:
  • Sharding: Rows are sharded into a fixed set of shards, decided at setup time. Typically, we use 4096. Each shard corresponds to a MySQL tablespace, and the shards are distributed across a number of MySQL servers. Shards can be moved between MySQL servers for load-balancing, and the capacity can be increased online. We typically expand by splitting each MySQL server in two (see the sketch after this list).
  • Append-only (no updates) data model: A cell can never be modified after it is written. This is very useful for a system that stores transactional data and wants to guard against data corruption. Being append-only makes modifications naturally idempotent and commutative; the latter means we can replay updates in any order and get the same result, as the short demonstration further below shows. (We learned later that the append-only style is also advocated by the lambda architecture.)
  • Buffered writes: If the shard a cell needs to be written to is unavailable (or slow), we write the data to a pending table on any other available MySQL server (also covered in the sketch after this list). These writes are replayed once the shard becomes available again. Due to the idempotent and commutative data model, this is always safe and does not require cross-host coordination.
  • Sharded secondary indexes: Indexes can be created on multiple fields in the columns and are sharded on a specific key (e.g., user uuid). They are implemented as MySQL tables and backfilled in the background. In case we need to change the index (e.g., adding a field), we can create a new version, backfill it, and then switch to the new version by changing an index alias, all without application downtime.
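Below is a minimal Python sketch of how the sharding, buffered-write, and index-aliasing ideas above fit together. Everything in it is an illustrative assumption (the hash function, the table names, the fake host class), not the actual Schemaless implementation:

```python
import hashlib
import json
import uuid

NUM_SHARDS = 4096  # fixed at setup time, as described in the sharding bullet


class FakeMySQLHost:
    """Stand-in for a MySQL server; real code would go through a MySQL client."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.rows = []  # (table, shard, row_uuid, column, blob)

    def insert(self, table, shard, row_uuid, column, blob):
        self.rows.append((table, shard, row_uuid, column, blob))


def shard_for(row_uuid):
    # Map a row key onto one of the fixed shards; any stable hash works here.
    return int(hashlib.md5(str(row_uuid).encode()).hexdigest(), 16) % NUM_SHARDS


hosts = [FakeMySQLHost("mysql-%d" % i) for i in range(8)]
# shard -> host placement; load-balancing moves shards between hosts.
shard_map = {shard: hosts[shard % len(hosts)] for shard in range(NUM_SHARDS)}


def write_cell(row_uuid, column, body):
    """Write a cell to the owning shard's host; if that host is down, buffer
    the write in a pending table on any other available host for later replay."""
    shard = shard_for(row_uuid)
    blob = json.dumps(body)
    primary = shard_map[shard]
    if primary.available:
        primary.insert("cells", shard, row_uuid, column, blob)
        return
    for host in hosts:
        if host is not primary and host.available:
            host.insert("pending_cells", shard, row_uuid, column, blob)
            return
    raise RuntimeError("no MySQL host available for buffered write")


# Secondary indexes are themselves MySQL tables, reached through an alias so a
# new index version can be backfilled in the background and then swapped in
# without application downtime (table names here are hypothetical).
index_aliases = {"trips_by_user": "trips_by_user_v1"}


def switch_index(alias, new_table):
    # Called only after the new version has been fully backfilled.
    index_aliases[alias] = new_table


write_cell(uuid.uuid4(), "BASE", {"city": "SF", "status": "completed"})
switch_index("trips_by_user", "trips_by_user_v2")
```

Keeping the shard count fixed (and generously large) up front is what makes "splitting each MySQL server in two" a pure data move: half of the shards go to the new server, and no rows ever need to be re-hashed.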
The whole system, which we simply call Schemaless in homage to its design, is written in Python. The initial version took about 5 months from idea to production deployment, and we will describe specific implementation details in future blog posts.
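To make the "replay in any order" claim from the append-only bullet concrete, here is a tiny, hypothetical demonstration: because every write targets a unique (row, column, version) cell and never overwrites an existing one, buffered writes can be re-applied in any order and any number of times with the same end state.

```python
import json

# Hypothetical buffered writes: unique (row, column, version) keys, opaque blobs.
writes = [
    (("trip-1", "BASE", 1), json.dumps({"status": "requested"})),
    (("trip-1", "BASE", 2), json.dumps({"status": "completed"})),
    (("trip-1", "RATING", 1), json.dumps({"stars": 5})),
]

def replay(pending):
    cells = {}
    for key, blob in pending:
        cells.setdefault(key, blob)  # idempotent: re-applying a write is a no-op
    return cells

# Commutative: replaying in reverse (or any) order yields the same cells.
assert replay(writes) == replay(list(reversed(writes)))
# Idempotent: replaying duplicates changes nothing.
assert replay(writes + writes) == replay(writes)
```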





How FriendFeed uses MySQL to store schema-less data

We use MySQL for storing all of the data in FriendFeed. Our database has grown a lot as our user base has grown. We now store over 250 million entries and a bunch of other data, from comments and "likes" to friend lists.
As our database has grown, we have tried to iteratively deal with the scaling issues that come with rapid growth. We did the typical things, like using read slaves and memcache to increase read throughput and sharding our database to improve write throughput. However, as we grew, scaling our existing features to accommodate more traffic turned out to be much less of an issue than adding new features.
In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time. Removing old indexes takes just as much time, and not removing them hurts performance because the database will continue to read and write to those unused blocks on every INSERT, pushing important blocks out of memory. There are complex operational procedures you can do to circumvent these problems (like setting up the new index on a slave, and then swapping the slave and the master), but those procedures are so error prone and heavyweight, they implicitly discouraged our adding features that would require schema/index changes. Since our databases are all heavily sharded, the relational features of MySQL like JOIN have never been useful to us, so we decided to look outside of the realm of RDBMS.
Lots of projects exist that are designed to tackle the problem of storing data with flexible schemas and building new indexes on the fly (e.g., CouchDB). However, none of them seemed widely used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs (see this somewhat outdated article on CouchDB, for example). MySQL works. It doesn't corrupt data. Replication works. We understand its limitations already. We like MySQL for storage, just not RDBMS usage patterns.
After some deliberation, we decided to implement a "schema-less" storage system on top of MySQL rather than use a completely new storage system. This post attempts to describe the high-level details of the system. We are curious how other large sites have tackled these problems, and we thought some of the design work we have done might be useful to other developers.


Lambda Architecture

Nathan Marz came up with the term Lambda Architecture (LA) for a generic, scalable and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at BackType and Twitter.
The LA aims to satisfy the need for a robust system that is fault-tolerant against both hardware failures and human mistakes, is able to serve a wide range of workloads and use cases, and supports the low-latency reads and updates those cases require. The resulting system should be linearly scalable, and it should scale out rather than up.
Here’s what it looks like from a high-level perspective:
[Figure: Lambda Architecture overview]
  1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
  2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
  3. The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
  4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
  5. Any incoming query can be answered by merging results from batch views and real-time views.
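To make point 5 concrete, here is a toy sketch of answering a query by merging a pre-computed batch view with the speed layer's real-time view. The metric name and the in-memory dictionaries are assumptions for illustration; a real LA deployment would use dedicated batch and stream-processing systems rather than Python dicts.

```python
from collections import Counter

# Pre-computed by the batch layer from the immutable master dataset; it lags
# behind real time by however long the last batch run took.
batch_view = Counter({"trips_completed": 1000000})

# Maintained incrementally by the speed layer for data that arrived after the
# last batch run started.
realtime_view = Counter({"trips_completed": 1342})

def query(metric):
    # Point 5: merge the batch view with the real-time view at query time.
    return batch_view[metric] + realtime_view[metric]

print(query("trips_completed"))  # 1001342
```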

