Tuesday, 26 April 2016

InfoQ: Top 10 Performance Mistakes

Top 10 Performance Mistakes: "Martin Thompson, co-founder of LMAX, keynoted on performance at QCon São Paulo 2016. Initially entitled “Top Performance Myths and Folklore”, Thompson renamed the presentation “Top 10 Performance Mistakes” because “we all make mistakes and it’s very easy to do that.”

This is a digest of the top 10 performance related mistakes he has seen in production, including advice on avoiding them."

8. Data Dependent Loads. Thompson presented the results of a benchmark measuring the time needed to perform one operation when attempting to sum up all longs in a 1GB array located in memory (RAM). The time depends on how the memory is accessed, and is presented in the following table:
[Table not recovered: time per operation for different memory access patterns]
The results of this benchmark show that not all memory operations are equal, and one needs to be careful how they are handled. Thompson considers it important to know some basics regarding the performance of various data structures, noting that Java's HashMap is over ten times slower than .NET's Dictionary for structures larger than 2GB. He added that there are also cases where .NET is much slower than Java.
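Thompson's original benchmark ran on the JVM with a 1 GB array; the sketch below shows the same access-pattern idea at toy scale in Python. Note that CPython's interpreter overhead mutes the gap considerably compared to compiled code, where a cache-hostile walk can be an order of magnitude slower.

```python
import array
import random
import time

N = 1_000_000  # toy scale; Thompson's benchmark used a 1 GB array
data = array.array('q', range(N))

def sum_in(order):
    # Identical arithmetic for both runs; only the memory access
    # pattern differs.
    total = 0
    for i in order:
        total += data[i]
    return total

sequential = list(range(N))    # prefetcher-friendly linear walk
shuffled = sequential[:]
random.shuffle(shuffled)       # cache-hostile, data-dependent walk

t0 = time.perf_counter(); s1 = sum_in(sequential); t1 = time.perf_counter()
s2 = sum_in(shuffled); t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s, random: {t2 - t1:.3f}s")
```

Both runs compute the same sum; any difference in elapsed time comes purely from how the memory is accessed.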

7. Too Much Allocation. While allocating memory is almost free in many cases, reclaiming that memory is not, because the garbage collector needs significant time when working on large sets of data. When lots of data is allocated, the cache fills up and older data is evicted, making operations on that data take 90ns/op instead of 7ns/op, which is more than an order of magnitude slower.
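A common remedy is to allocate up front and recycle, rather than allocating fresh objects per operation. The pool below is an illustrative sketch (not Thompson's code): it keeps a fixed set of buffers alive so the hot path neither allocates nor creates garbage.

```python
class BufferPool:
    """Fixed-size buffer pool: allocate up front, recycle afterwards."""

    def __init__(self, buf_size: int, count: int):
        self.buf_size = buf_size
        self._free = [bytearray(buf_size) for _ in range(count)]

    def acquire(self) -> bytearray:
        # Reuse a pooled buffer when possible; allocate only on exhaustion.
        return self._free.pop() if self._free else bytearray(self.buf_size)

    def release(self, buf: bytearray) -> None:
        self._free.append(buf)


pool = BufferPool(buf_size=4096, count=8)
buf = pool.acquire()
buf[:5] = b"hello"   # work with the buffer in place
pool.release(buf)    # hand it back instead of leaving it for the GC
```

The same pattern appears in managed runtimes generally (e.g. object pools and flyweights on the JVM), trading a little bookkeeping for a stable, cache-resident working set.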
6. Going Parallel. While going parallel is very attractive for certain algorithms, there are some limitations and overhead associated with it. Thompson cited the paper Scalability! But at what COST? in which the authors compare parallel systems with single threaded ones by introducing COST (Configuration that Outperforms a Single Thread), defined as
The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.
The authors have analyzed the measurements of various data-parallel systems and concluded that “many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.”
In this context Thompson remarked that there is a certain communication and synchronization overhead associated with parallel tasks, and that some of the activity is intrinsically serial and cannot be parallelized. According to Amdahl's Law, if 5% of a system's activity needs to be serial, then the system's speedup will be at most 20x no matter how many processors are used.
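The 20x ceiling follows directly from the formula. A quick worked example:

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Amdahl's Law: speedup = 1 / (s + (1 - s) / n)."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / processors)


# With 5% serial work, adding cores hits diminishing returns fast:
for n in (1, 8, 64, 4096):
    print(f"{n:5d} cores -> {amdahl_speedup(0.05, n):5.2f}x")
# The ceiling is 1 / 0.05 = 20x, no matter how many cores are added.
```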

5. Not Understanding TCP. On this one Thompson remarked that many teams consider a microservices architecture without a solid understanding of TCP. In certain cases it is possible to experience delayed ACKs limiting the number of packets sent over the wire to 2-5 per second. This is due to a deadlock created by the interplay of two algorithms introduced in TCP: Nagle's algorithm and TCP Delayed Acknowledgement. A 200-500 ms timeout eventually breaks the deadlock, but communication between microservices is severely affected by it. The recommended solution is to set the TCP_NODELAY option, which disables Nagle's algorithm so that multiple smaller packets can be sent one after the other. The difference is between 5 and 500 req/sec, according to Thompson.
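Setting TCP_NODELAY is a one-line socket option in most languages; here it is in Python (shown on an unconnected socket just to demonstrate the call, in practice you set it on the connected socket):

```python
import socket

# Disabling Nagle's algorithm: small writes then go out immediately
# instead of waiting to be coalesced with later data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Confirm the option took effect.
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```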
4. Synchronous Communications. Synchronous communication between a client and a server incurs a time penalty which becomes problematic in a system where machines need fast communication. The solution is not buying more expensive and faster hardware but using asynchronous communication, said Thompson. In this case, a client can send multiple requests to the server without having to wait on the response between them. This approach requires a change in how the client deals with responses, but it is worthwhile.
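The effect of pipelining requests instead of waiting on each response can be sketched with a simulated round trip (the 50 ms latency and the doubling "server" are made up for illustration):

```python
import asyncio
import time

LATENCY = 0.05  # pretend each request costs a 50 ms network round trip

async def request(i: int) -> int:
    await asyncio.sleep(LATENCY)  # stand-in for a real server round trip
    return i * 2

async def one_at_a_time(n: int) -> list:
    # Synchronous style: wait for each response before sending the next.
    return [await request(i) for i in range(n)]

async def pipelined(n: int) -> list:
    # Asynchronous style: send all requests up front; responses
    # arrive while other requests are still in flight.
    return list(await asyncio.gather(*(request(i) for i in range(n))))

t0 = time.perf_counter()
sync_res = asyncio.run(one_at_a_time(5))
sync_t = time.perf_counter() - t0

t0 = time.perf_counter()
async_res = asyncio.run(pipelined(5))
async_t = time.perf_counter() - t0

print(f"one-at-a-time: {sync_t:.2f}s, pipelined: {async_t:.2f}s")
```

Five sequential round trips cost roughly five times the latency; five pipelined ones cost roughly one, which is the change Thompson is advocating.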
3. Text Encoding. Developers often choose to send data over the wire in a text encoding format such as JSON, XML or Base64 because "it is human readable." But Thompson noted that no human reads it when two systems talk to each other. It may be easier to debug with a simple text editor, but there is a high CPU penalty related to converting binary data to text and back. The solution is using better tools that understand binary; Thompson mentioned Wireshark as an example.
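A small comparison makes the overhead concrete (the payload of 1000 large 64-bit values is a made-up example, e.g. nanosecond timestamps):

```python
import json
import struct

values = [10**18 + i for i in range(1000)]  # hypothetical 64-bit counters

# Text encoding: each number is formatted to ~19 decimal digits,
# plus separators, and must be parsed back digit by digit.
text = json.dumps(values).encode("utf-8")

# Binary encoding: a flat 8 bytes per long, no formatting or parsing.
binary = struct.pack(f"<{len(values)}q", *values)

print(f"JSON: {len(text)} bytes, binary: {len(binary)} bytes")
```

Beyond the size difference, the CPU cost of formatting and parsing the text representation is what Thompson is warning about; the binary form is copied, not converted.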
1. Logging. For the #1 performance culprit Thompson listed the time spent logging. He showed a graph depicting how the average time spent on a logging operation grows as the number of threads increases.
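The cost grows with threads because they contend on the shared log sink. One common mitigation (a sketch, not Thompson's recommendation verbatim) is asynchronous logging: application threads only enqueue records, and a single background listener does the slow I/O.

```python
import logging
import logging.handlers
import queue

# Application threads hand records to a queue and return immediately,
# so they never block on file/console I/O or contend on its lock.
log_queue: "queue.Queue" = queue.Queue(-1)

log = logging.getLogger("app")
log.setLevel(logging.INFO)
log.addHandler(logging.handlers.QueueHandler(log_queue))

# A single background thread drains the queue into the real sink.
sink = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

log.info("hot path enqueues and returns immediately")
listener.stop()  # flushes any remaining records
```

High-performance logging libraries (e.g. asynchronous appenders on the JVM) apply the same idea with lock-free queues to keep the hot path cheap.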



