GitHub - donnemartin/system-design-primer: Learn how to design large-scale systems. Prep for the system design interview.: "Motivation
Learn how to design large scale systems.
Prep for the system design interview.
Learn how to design large scale systems
Learning how to design scalable systems will help you become a better engineer.
System design is a broad topic.
There is a vast amount of resources scattered throughout the web on system design principles.
This repo is an organized collection of resources to help you learn how to build systems at scale."
'via Blog this'
Be warned that this is mostly just a collection of links to articles and demos by smarter people than I. Areas of interest include Java, C++, Scala, Go, Rust, Python, Networking, Cloud, Containers, Machine Learning, the Web, Visualization, Linux, System Performance, Software Architecture, Microservices, Functional Programming....
Showing posts with label database.
Friday, 27 May 2016
Heroku Metrics: There and Back Again
Heroku Metrics: There and Back Again: "No system is perfect, and ours isn’t some magical exception. We’ve learned some things in this exercise that we think are worth pointing out.
Sharding and Partitioning Strategies Matter
Our strategy of using the owner column for our shard/partitioning key was a bit unfortunate, and now hard to change. While we don’t currently see any ill effects from this, there are hypothetical situations in which this could pose a problem. For now, we have dashboards and metrics which we watch to ensure that this doesn’t happen and a lot of confidence that the systems we’ve built upon will actually handle it in stride.
Even still, a better strategy, likely, would have been to shard on owner + process_type (e.g. web), which would have spread the load more evenly across the system. In addition to the more even distribution of data, from a product perspective it would mean that in a partial outage, some of an application’s metrics would remain available.
Extensibility Comes Easily with Kafka
The performance of our Postgres cluster doesn’t worry us. As mentioned, it’s acceptable for now, but our architecture makes it trivial to swap out, or simply add another data store to increase query throughput when it becomes necessary. We can do this by spinning up another Heroku app that uses shareable addons, starts consuming the summary topics and writes them to a new data store, with no impact to the Postgres store!
Our system is more powerful and more extensible because of Kafka."
'via Blog this'
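A rough sketch of that shard-key idea, not Heroku's actual code: if each metric event is published to Kafka keyed by owner plus process_type (the topic name and broker address below are made up), load spreads more evenly than keying by owner alone, and a partial outage only hides part of an app's metrics.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetricPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String owner = "app-1234";   // hypothetical app owner id
            String processType = "web";  // e.g. web, worker

            // Keying by owner + process_type spreads one app's metrics across
            // partitions, so losing a partition only hides part of its data.
            String key = owner + ":" + processType;
            String value = "{\"metric\":\"memory_total_mb\",\"value\":512}";

            producer.send(new ProducerRecord<>("metrics-summary", key, value));
        }
    }
}
```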
Wednesday, 18 May 2016
Stream processing, Event sourcing, Reactive, CEP… and making sense of it all — Martin Kleppmann’s blog
Stream processing, Event sourcing, Reactive, CEP… and making sense of it all — Martin Kleppmann’s blog: "Some people call it stream processing. Others call it Event Sourcing or CQRS. Some even call it Complex Event Processing. Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies who want to sell you stuff. But sometimes, they contain a kernel of wisdom which can really help us design better systems.
In this talk, we will go in search of the wisdom behind the buzzwords. We will discuss how event streams can help make your application more scalable, more reliable and more maintainable. Founded in the experience of building large-scale data systems at LinkedIn, and implemented in open source projects like Apache Kafka and Apache Samza, stream processing is finally coming of age."

'via Blog this'
Wednesday, 11 May 2016
Could PostgreSQL 9.5 be your next JSON database?
Could PostgreSQL 9.5 be your next JSON database?: "You can use PostgreSQL to create rich, complex JSON/JSONB documents within the database. But then if you are doing that, you may want to consider whether you are using PostgreSQL well. If the richness and complexity of those documents comes from relating the documents to each other then the relational model is often the better choice for data models that have intertwined data. The relational model also has the advantage that it handles that requirement without large scale duplication within the actual data. It also has literally decades of engineering expertise backing up design decisions and optimizations.
What JSON support in PostgreSQL is about is removing the barriers to processing JSON data within an SQL based relational environment. The new 9.5 features take down another barrier, adding just enough accessible, built-in and efficient functions and operators to manipulate JSONB documents.
PostgreSQL 9.5 isn't your next JSON database, but it is a great relational database with a fully fledged JSON story. The JSON enhancements arrive alongside numerous other improvements in the relational side of the database, "upsert", skip locking and better table sampling to name a few.
It may not be your next JSON database, but PostgreSQL could well be the next database you use to work with relational and JSON data side by side."
'via Blog this'
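To make the "JSON story" concrete, here is a hedged JDBC sketch (connection URL, credentials, and the orders table are invented, not from the article) that keeps relational columns and a JSONB document side by side and uses the 9.5 jsonb_set() function plus the ->> operator:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonbExample {
    public static void main(String[] args) throws Exception {
        // Assumed local database and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {

            // Relational columns and a JSONB document in the same table.
            try (PreparedStatement st = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS orders (" +
                    "  id serial PRIMARY KEY," +
                    "  customer text NOT NULL," +
                    "  details jsonb NOT NULL)")) {
                st.execute();
            }

            try (PreparedStatement st = conn.prepareStatement(
                    "INSERT INTO orders (customer, details) VALUES (?, ?::jsonb)")) {
                st.setString(1, "acme");
                st.setString(2, "{\"items\": 3, \"express\": false}");
                st.executeUpdate();
            }

            // jsonb_set() arrived in 9.5: update one field inside the document.
            try (PreparedStatement st = conn.prepareStatement(
                    "UPDATE orders SET details = jsonb_set(details, '{express}', 'true') " +
                    "WHERE customer = ?")) {
                st.setString(1, "acme");
                st.executeUpdate();
            }

            // The ->> operator extracts a field as text.
            try (PreparedStatement st = conn.prepareStatement(
                    "SELECT id, details ->> 'items' FROM orders WHERE customer = ?")) {
                st.setString(1, "acme");
                try (ResultSet rs = st.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + " items=" + rs.getString(2));
                    }
                }
            }
        }
    }
}
```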
Wednesday, 24 February 2016
Diagnosing Common Database Performance Hotspots in our Java Code
Diagnosing Common Database Performance Hotspots in our Java Code:
For this article I focus on the database as I am sure all of your apps are suffering from one of these access patterns! You can use pretty much any profiling, tracing or APM tool available in the market, but I am using the free Dynatrace Personal License. Java also comes with great tools such as Java Mission Control. Many frameworks that access data – such as Hibernate or Spring – also offer diagnostics options typically through logging output.
Using these tracing tools doesn’t require any code changes as they all leverage the JVMTI (JVM Tool Interface) to capture code-level information and even to trace calls across remoting tiers. This is very useful in distributed, (micro)service-oriented applications; just modify the startup command-line options for your JVM to get the tools loaded. Some tool vendors provide IDE integration where you can simply say “run with XYZ profiling turned on”. I have a short YouTube tutorial demonstrating how to trace an app launched from Eclipse!
Identify Database Performance Hotspots
When it turns out that the database is the main contributor to the overall response time of requests to your application, be careful about blaming the database and finger pointing at the DBAs! There might be several reasons that would cause the database to be that busy:
- Inefficient use of the database: wrong query design, poor application logic, incorrect configuration of data access framework
- Poor design and data structure in the database: table relations, slow stored views, missing or wrong indexes, outdated table statistics
- Inappropriate database configuration: memory, disk, tablespaces, connection pools
In this article I mainly want to focus on what you can do from the application side to minimize the time spent in the database:
Diagnose Bad Database Access Patterns
When diagnosing applications I have several database access patterns I always check for. I look at individual requests and put them into the following DB Problem Pattern categories:
- Excessive SQLs: Executing a large number (> 500) of different SQL statements
- N+1 Query Problem: Executing the same SQL statement multiple times (> 20); see the JDBC sketch after this list
- Slow Single SQL Issue: Executing a single SQL that contributes > 80% of response time
- Data-Driven Issue: Same request executes different SQL depending on input parameters
- Database Heavy: Database Contribution Time is > 60% of overall response time
- Unprepared Statements: Executing the same SQL without preparing the statement
- Pool Exhaustion: Impacted by high connection acquisition time (time spent in getConnection exceeds time spent in executeStatement)
- Inefficient Pool Access: Excessive access to connection pool (calling getConnection > 50% of executeStatement count)
- Overloaded Database Server: Database server is simply overloaded with too many requests from different apps
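To make a couple of these patterns concrete, here is a small hypothetical JDBC sketch (table and column names are invented): the first method produces the N+1 pattern and skips statement preparation; the second collapses the work into one prepared join.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class OrderQueries {

    // Anti-pattern: one query for the orders, then one more per order (N+1).
    // Building the SQL by string concatenation also skips statement preparation.
    static void loadOrdersNPlusOne(Connection conn, List<Long> orderIds) throws SQLException {
        for (Long id : orderIds) {
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT product, quantity FROM order_lines WHERE order_id = " + id)) {
                while (rs.next()) { /* map row */ }
            }
        }
    }

    // Better: a single prepared statement that fetches all lines in one round trip.
    static void loadOrdersInOneQuery(Connection conn, long customerId) throws SQLException {
        String sql = "SELECT o.id, l.product, l.quantity " +
                     "FROM orders o JOIN order_lines l ON l.order_id = o.id " +
                     "WHERE o.customer_id = ?";
        try (PreparedStatement st = conn.prepareStatement(sql)) {
            st.setLong(1, customerId);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) { /* map row */ }
            }
        }
    }
}
```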
Tuesday, 22 December 2015
db-engines.com Database Popularity Rankings (updated)
Database Rankings
| Rank | Rank (prev. month) | Rank (prev. year) | DBMS | Database Model | Score | +/- (month) | +/- (year) |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Oracle | Relational DBMS | 1497.55 | +16.61 | +37.76 |
| 2 | 2 | 2 | MySQL | Relational DBMS | 1298.54 | +11.70 | +29.96 |
| 3 | 3 | 3 | Microsoft SQL Server | Relational DBMS | 1123.16 | +0.83 | -76.89 |
| 4 | 4 | | MongoDB | Document store | 301.39 | -3.22 | +54.87 |
| 5 | 5 | | PostgreSQL | Relational DBMS | 280.09 | -5.60 | +26.09 |
| 6 | 6 | 6 | DB2 | Relational DBMS | 196.13 | -6.40 | -14.13 |
| 7 | 7 | 7 | Microsoft Access | Relational DBMS | 140.21 | -0.75 | +0.31 |
| 8 | 8 | | Cassandra | Wide column store | 130.84 | -2.08 | +36.78 |
| 9 | 9 | | SQLite | Relational DBMS | 100.85 | -2.60 | +6.15 |
| 10 | 10 | 10 | Redis | Key-value store | 100.54 | -1.87 | +12.66 |
| 11 | 11 | 11 | SAP Adaptive Server | Relational DBMS | 81.47 | -2.24 | -4.52 |
| 12 | 12 | 12 | Solr | Search engine | 79.15 | -0.63 | +0.73 |
| 13 | | | Elasticsearch | Search engine | 76.57 | +1.79 | +30.67 |
| 14 | | | Teradata | Relational DBMS | 75.72 | -1.37 | +8.32 |
| 15 | | | Hive | Relational DBMS | 55.27 | +0.36 | +18.90 |
| 16 | | | HBase | Wide column store | 54.25 | -2.21 | +3.17 |
| 17 | 17 | | FileMaker | Relational DBMS | 50.12 | -1.61 | -2.10 |
| 18 | 18 | | Splunk | Search engine | 43.86 | -0.76 | +12.39 |
| 19 | 19 | | SAP HANA | Relational DBMS | 38.86 | -0.76 | +11.05 |
| 20 | 20 | | Informix | Relational DBMS | 36.40 | -2.05 | +1.28 |
| 21 | 21 | | Neo4j | Graph DBMS | 33.18 | -0.86 | +8.02 |
Tuesday, 25 August 2015
Uber: Migrating from Postgres to Sharded Datastore on top of MySQL + Lambda Architecture (append-only updates)
Project Mezzanine: The Great Migration
The first design decision was the choice of database for the tripstore. Our short list of requirements were:
- Operationally robust (no data loss, supports backup, replication to secondary data centers, easy to troubleshoot, predictable, operational expertise).
- Horizontally scalable both in storage capacity and IOPS.
- High write-availability. We always want to be able to persist a trip to stable storage. It’s okay to trade-off short-term read-availability, as the backend is working mostly in a batch-oriented fashion.
- Secondary index support. Trips are looked up by user, city, and summarized in various ways.
- No downtime for any operation (expanding storage, backup, adding indexes, adding data, and so forth).
We decided that a column-oriented, schemaless approach where data (JSON blobs) are organized in a grid indexed by trip-UUID, column name, and optionally a timestamp would work well as an overall data model. The model lends itself naturally to horizontal scaling by partitioning the rows across multiple shards, and supports our rapid development culture by being schemaless. New columns can be added, and new fields can be added to a column with no reconfiguration.
We evaluated various NoSQL-style databases with the above characteristics. However, we didn’t feel confident they were a good fit for storing our trip data, because of either our operational experience or the product’s maturity.
Inspired by blog posts, such as this one from FriendFeed, we decided to build our own simple, sharded datastore on top of MySQL. The key characteristics of the system we built are:
- Sharding: Rows are sharded into a fixed set of shards, decided at setup time. Typically, we use 4096. Each shard corresponds to a MySQL tablespace, and the shards are distributed across a number of MySQL servers. Shards can be moved between MySQL servers for load-balancing, and the capacity can be increased online. We typically expand by splitting each MySQL server in two.
- Append-only (no updates) data model: It only supports an append-only data model where a cell can never be modified after it is written. This is very useful for a system that stores transactional data and wants to guard against data corruption. By being append-only, modifications are naturally idempotent and commutative. The latter means that we can replay updates in any order and get the same result. (We learned later that the append-only style is also advocated by the lambda architecture.)
- Buffered writes. If the shard where a cell needs to be written to is unavailable (or slow), we write the data to a pending table in any other available MySQL server. These are then later replayed once the shard becomes available. Due to the idempotent and commutative data model, this is always safe and does not require cross-host coordination.
- Sharded secondary indexes: Indexes can be created on multiple fields in the columns and are sharded on a specific key (e.g., user uuid). They are implemented as MySQL tables and backfilled in the background. In case we need to change the index (e.g., adding a field), we can create a new version, backfill it, and then switch to the new version by changing an index alias, all without application downtime.
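A toy illustration of the routing described above, not Uber's implementation (the hash function and table layout are assumptions): a trip UUID is mapped to one of the 4096 logical shards, and cells are only ever inserted, never updated, so replays stay idempotent.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.CRC32;

public class TripStoreRouter {

    static final int NUM_SHARDS = 4096; // fixed at setup time, as in the post

    // Map a trip UUID to one of the logical shards. Any stable hash works;
    // CRC32 is used here purely for illustration.
    static int shardFor(UUID tripUuid) {
        CRC32 crc = new CRC32();
        crc.update(tripUuid.toString().getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_SHARDS);
    }

    // Append-only cell write: (trip_uuid, column_name, written_at) identifies a cell,
    // and an existing cell is never modified afterwards.
    static String insertCellSql(int shard) {
        return "INSERT INTO shard_" + shard + ".cells " +
               "(trip_uuid, column_name, written_at, body) VALUES (?, ?, ?, ?)";
    }

    public static void main(String[] args) {
        UUID trip = UUID.randomUUID();
        int shard = shardFor(trip);
        System.out.println("trip " + trip + " -> shard " + shard);
        System.out.println(insertCellSql(shard));
    }
}
```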
Thursday, 23 April 2015
How B-tree indexing works in MySQL
http://stackoverflow.com/questions/2362667/how-b-tree-indexing-works-in-mysql
Q. When I create an index for a table in MySQL, I see that the index_type is BTREE. Now, although I understand about B-Trees, I do not quite understand how it stores the index and how the database searches the records based on it.
I mean, a B-Tree is excellent for databases that read and write large blocks of data. When we create an index on a PRIMARY KEY column, what I understand is that it creates a tree, splitting the values for the root based on the value type of the root.
Now, does it store only the primary key ID under the tree, or the whole data associated with that primary key?
After finding the wanted primary ID, how does the database extract the record?
A. The database stores the indexed value as a B-Tree key, and the record pointer as a B-Tree value.
Whenever you search for a record holding a certain value of an indexed column, the engine locates the key holding this value in the B-Tree, retrieves the pointer to the record and fetches the record.
What exactly a "record pointer" is depends on the storage engine:
- In MyISAM, the record pointer is an offset to the record in the MYI file.
- In InnoDB, the record pointer is the value of the PRIMARY KEY.
In InnoDB, the table itself is a B-Tree with the PRIMARY KEY as the B-Tree key. This is what is called a "clustered index" or "index-organized table", and all other fields are stored as the B-Tree value.
In MyISAM, the records are stored without any special order. This is called "heap storage".
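A toy model of the InnoDB case, with ordinary in-memory sorted maps standing in for on-disk B-Trees: the secondary index maps the indexed value to the PRIMARY KEY (the "record pointer"), and the clustered index maps the PRIMARY KEY to the full row, so a lookup by a secondary column is really two tree lookups.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ClusteredIndexModel {

    record Row(long id, String email, String name) {}

    public static void main(String[] args) {
        // Clustered index: PRIMARY KEY -> whole row (the table itself in InnoDB).
        NavigableMap<Long, Row> clustered = new TreeMap<>();
        // Secondary index on email: indexed value -> PRIMARY KEY (the "record pointer").
        NavigableMap<String, Long> emailIndex = new TreeMap<>();

        Row row = new Row(42L, "ada@example.com", "Ada");
        clustered.put(row.id(), row);
        emailIndex.put(row.email(), row.id());

        // A query like SELECT * FROM users WHERE email = ? does two lookups:
        Long pk = emailIndex.get("ada@example.com"); // 1. secondary index -> primary key
        Row found = clustered.get(pk);               // 2. clustered index -> row
        System.out.println(found);
    }
}
```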
Wednesday, 22 April 2015
General database links
tl;dr: ACID and NewSQL databases rarely provide true ACID guarantees by default, if they are supported at all. See the table in the linked article.
Friday, 17 April 2015
Dean Wampler: SQL Strikes Back! Recent Trends in Data Persistence and Analysis
Traditional Data Warehouse
Pros:
- Mature
- Rich SQL, analytics functions
- Scales to “mid-size” data
Cons:
- Expensive per TB
- Can’t scale to Hadoop-sized data sets
Data Warehouse vs. Hadoop?
Data Warehouse:
+ Mature
+ Rich SQL, analytics
- Scalability
- $$/TB
Hadoop:
- Maturity vs. DWs
+ Growing SQL
+ Massive scalability
+ Excellent $$/TB
Facebook had data in Hadoop. Facebook’s Data Analysts needed access to it...
so they created Hive...
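For a flavour of what that SQL access to Hadoop data looks like from Java (the host, credentials, table, and partition column below are placeholders), Hive speaks plain JDBC through HiveServer2:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; requires the hive-jdbc driver on the classpath.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             PreparedStatement st = conn.prepareStatement(
                     "SELECT page, COUNT(*) AS views FROM page_views " +
                     "WHERE dt = ? GROUP BY page ORDER BY views DESC LIMIT 10")) {
            st.setString(1, "2015-04-17"); // hypothetical partition value
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " " + rs.getLong("views"));
                }
            }
        }
    }
}
```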
Tuesday, 30 December 2014
vladmihalcea.com: A BEGINNER’S GUIDE TO TRANSACTION ISOLATION LEVELS IN ENTERPRISE JAVA
http://vladmihalcea.com/2014/12/23/a-beginners-guide-to-transaction-isolation-levels-in-enterprise-java/
A relational database’s strong consistency model is based on ACID transaction properties. In this post we are going to unravel the reasons behind using different transaction isolation levels and the various configuration patterns for both resource-local and JTA transactions.
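As a minimal resource-local sketch (the driver URL, credentials, and accounts table are placeholders), JDBC lets you choose the isolation level per connection before doing any work:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IsolationLevelExample {
    public static void main(String[] args) throws Exception {
        // Assumed local database; any JDBC driver exposes the same API.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "secret")) {

            conn.setAutoCommit(false);
            // Stricter than the common READ_COMMITTED default: no non-repeatable reads.
            conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

            try (PreparedStatement st = conn.prepareStatement(
                    "SELECT balance FROM accounts WHERE id = ?")) {
                st.setLong(1, 1L);
                try (ResultSet rs = st.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("balance = " + rs.getLong(1));
                    }
                }
            }
            conn.commit();
        }
    }
}
```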