why i like netstat (h/t @icco) pic.twitter.com/PunGzBqjLg
— Julia Evans (@b0rk) July 19, 2016
Be warned that this is mostly just a collection of links to articles and demos by smarter people than I. Areas of interest include Java, C++, Scala, Go, Rust, Python, Networking, Cloud, Containers, Machine Learning, the Web, Visualization, Linux, System Performance, Software Architecture, Microservices, Functional Programming....
Showing posts with label performance. Show all posts
Tuesday, 19 July 2016
Monday, 18 July 2016
some words about perf (this one's gonna need more than one page) pic.twitter.com/F4eqpCaaWe
— Julia Evans (@b0rk) July 18, 2016
a sketch about why I like dstat pic.twitter.com/lgQHm8Sq7J
— Julia Evans (@b0rk) July 18, 2016
Saturday, 11 June 2016
— Julian Warszawski (@hundredmondays) June 7, 2016
Tuesday, 26 April 2016
InfoQ: Top 10 Performance Mistakes
Top 10 Performance Mistakes: "Martin Thompson, co-founder of LMAX, keynoted on performance at QCon São Paulo 2016. Initially entitled “Top Performance Myths and Folklore”, Thompson renamed the presentation “Top 10 Performance Mistakes” because “we all make mistakes and it’s very easy to do that.”
This is a digest of the top 10 performance related mistakes he has seen in production, including advice on avoiding them."
8. Data Dependent Loads. Thompson presented the results of a benchmark measuring the time needed to perform one operation when attempting to sum up all longs in a 1GB array located in memory (RAM). The time depends on how the memory is accessed, and is presented in the following table:

The results of this benchmark show that not all memory operations are equal, and one needs to be careful about how memory is accessed. Thompson considers it important to know some basics about the performance of various data structures, noting that Java’s HashMap is over ten times slower than .NET’s Dictionary for structures larger than 2GB. He added that there are cases where .NET is much slower than Java.
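Thompson’s point about access patterns can be sketched even in a high-level language. The toy benchmark below is an illustration, not Thompson’s actual benchmark: it sums the same values in sequential and in shuffled order. On most machines the cache-hostile shuffled pass is measurably slower, though interpreter overhead in Python mutes the gap compared to native-code numbers.

```python
import random
import time

# A large array of integers (much smaller than Thompson's 1 GB example,
# but large enough to spill out of CPU caches).
N = 2_000_000
data = list(range(N))

def timed_sum(indices):
    start = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start

sequential = list(range(N))   # predictable, prefetch-friendly
shuffled = sequential[:]
random.shuffle(shuffled)      # data-dependent, cache-hostile

total_seq, t_seq = timed_sum(sequential)
total_rnd, t_rnd = timed_sum(shuffled)

assert total_seq == total_rnd  # same work, different access pattern
print(f"sequential: {t_seq:.3f}s  random: {t_rnd:.3f}s")
```

Both passes touch every element exactly once, so any time difference comes purely from the order of access.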
7. Too Much Allocation. While allocation is almost free in many cases, reclaiming that memory is not, because the garbage collector needs significant time when working on large sets of data. When lots of data is allocated, the cache fills up and older data is evicted, causing operations on that data to take 90 ns/op instead of 7 ns/op, which is more than an order of magnitude slower.
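A minimal sketch of the idea, using a hypothetical chunk-processing workload (not an example from the talk): the two functions below compute the same result, but one allocates a fresh list per chunk while the other reuses a single preallocated buffer.

```python
def checksum_alloc(chunks):
    # Allocates a fresh list for every chunk: churn for the allocator/GC.
    out = []
    for chunk in chunks:
        squared = [x * x for x in chunk]   # new list each iteration
        out.append(sum(squared))
    return out

def checksum_reuse(chunks, size):
    # Reuses one preallocated buffer: same result, far less allocation.
    buf = [0] * size
    out = []
    for chunk in chunks:
        for i, x in enumerate(chunk):
            buf[i] = x * x
        out.append(sum(buf[i] for i in range(len(chunk))))
    return out

chunks = [list(range(100))] * 1000
assert checksum_alloc(chunks) == checksum_reuse(chunks, 100)
```

In a garbage-collected runtime the reuse variant also keeps its working set hot in cache, which is exactly the effect behind Thompson’s 7 ns vs 90 ns figures.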
6. Going Parallel. While going parallel is very attractive for certain algorithms, it comes with limitations and overhead. Thompson cited the paper Scalability! But at what COST?, in which the authors compare parallel systems with single-threaded ones by introducing COST (Configuration that Outperforms a Single Thread), defined as:
The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
COST weighs a system’s scalability against the overheads the system introduces, and indicates the system’s actual performance gains without rewarding systems that bring substantial but parallelizable overheads. The authors analyzed measurements of various data-parallel systems and concluded that “many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.”
In this context Thompson remarked that parallel tasks carry communication and synchronization overhead, and that some of the work is intrinsically serial and cannot be parallelized. According to Amdahl’s Law, if 5% of a system’s work must be serial, then the maximum possible speedup is 20x, no matter how many processors are used.
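Amdahl’s Law reduces to a one-line formula: with serial fraction s and n processors, speedup = 1 / (s + (1 - s) / n), which approaches 1/s as n grows. A quick check of the 5% figure:

```python
def amdahl_speedup(serial_fraction, processors):
    """Maximum speedup with the given serial fraction (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# With 5% serial work, adding processors hits a hard ceiling of 1/0.05 = 20x.
for n in (2, 16, 128, 4096):
    print(f"{n:5d} processors -> {amdahl_speedup(0.05, n):6.2f}x")
```

Even 4096 processors only get close to the 20x ceiling; the serial 5% dominates long before that.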
5. Not Understanding TCP. Here Thompson remarked that many teams consider a microservices architecture without a solid understanding of TCP. In certain cases, delayed ACKs can limit the number of packets sent over the wire to 2-5 per second, due to a deadlock between two algorithms built into TCP: Nagle’s algorithm and TCP Delayed Acknowledgement. A 200-500 ms timeout eventually breaks the deadlock, but communication between microservices is severely affected by it. The recommended solution is to set TCP_NODELAY, which disables Nagle’s algorithm so that multiple smaller packets can be sent one after the other. The difference can be between 5 and 500 req/sec, according to Thompson.
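Disabling Nagle’s algorithm is a one-line socket option. A minimal sketch in Python:

```python
import socket

# Create a TCP socket and disable Nagle's algorithm so small writes go
# out immediately instead of waiting to coalesce with later data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify the option took effect (non-zero means Nagle is disabled).
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The same option exists in essentially every networking stack (e.g. `setNoDelay` on Java/Node sockets); it trades a little extra packet overhead for lower latency on small request/response exchanges.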
4. Synchronous Communications. Synchronous communication between a client and a server incurs a time penalty which becomes problematic in a system where machines need fast communication. The solution is not buying more expensive and faster hardware but using asynchronous communication, said Thompson. In this case, a client can send multiple requests to the server without having to wait on the response between them. This approach requires a change in how the client deals with responses, but it is worthwhile.
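The shift Thompson describes can be sketched with asyncio; the `fake_request` coroutine below is a stand-in for a real network call, not a specific API from the talk:

```python
import asyncio

async def fake_request(i, latency=0.05):
    # Stand-in for a network round trip with the given latency.
    await asyncio.sleep(latency)
    return i * 2

async def pipelined(n):
    # Issue all requests up front; the responses are awaited together
    # instead of paying one full round trip per request.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

results = asyncio.run(pipelined(10))
print(results)
```

Ten simulated round trips complete in roughly one latency period instead of ten, which is the whole argument for asynchronous communication over faster hardware.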
3. Text Encoding. Developers often choose to send data over the wire in a text encoding such as JSON, XML or Base64 because “it is human readable.” But Thompson noted that no human reads it when two systems talk to each other. It may be easier to debug with a simple text editor, but there is a high CPU penalty for converting binary data to text and back. The solution is using better tools that understand binary, Wireshark being one Thompson mentioned.
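The difference is easy to see with Python’s standard library. The sketch below compares JSON text against fixed-width binary for a batch of large 64-bit values (small integers can actually be shorter as decimal text, which is why the example uses large ones; the real cost Thompson emphasizes is the CPU spent parsing digit strings, not just the bytes):

```python
import json
import struct

values = [10**18 + i for i in range(1000)]

# Text encoding: human readable, but each number becomes ~19 ASCII digits.
as_json = json.dumps(values).encode()

# Binary encoding: fixed 8 bytes per value, no digit-string parsing.
as_binary = struct.pack(f"<{len(values)}q", *values)

print(f"JSON:   {len(as_json)} bytes")
print(f"binary: {len(as_binary)} bytes")

# Round-trip to show both carry exactly the same information.
assert list(struct.unpack(f"<{len(values)}q", as_binary)) == values
```

Decoding the binary form is a straight memory copy per value, whereas the JSON path allocates strings and parses each number character by character.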
1. Logging. For the #1 performance culprit Thompson listed the time spent for logging. He showed a graph depicting the average time spent for a logging operation when the number of threads increases:
'via Blog this'
Monday, 11 April 2016
Brendan Gregg: SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs:
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
Saturday, 9 April 2016
WebPagetest - Website Performance and Optimization Test
WebPagetest - Website Performance and Optimization Test: "Run a free website speed test from multiple locations around the globe using real browsers (IE and Chrome) and at real consumer connection speeds. You can run simple tests or perform advanced testing including multi-step transactions, video capture, content blocking and much more. Your results will provide rich diagnostic information including resource loading waterfall charts, Page Speed optimization checks and suggestions for improvements."
'via Blog this'
Thursday, 7 April 2016
The revenge of the listening sockets
The revenge of the listening sockets: "Back in November we wrote a blog post about one latency spike. Today I'd like to share a continuation of that story. As it turns out, the misconfigured rmem setting wasn't the only source of added latency."
'via Blog this'
The story of one latency spike
The story of one latency spike: "A customer reported an unusual problem with our CloudFlare CDN: our servers were responding to some HTTP requests slowly. Extremely slowly. 30 seconds slowly. This happened very rarely and wasn't easily reproducible. To make things worse all our usual monitoring hadn't caught the problem. At the application layer everything was fine: our NGINX servers were not reporting any long running requests."
'via Blog this'
Saturday, 2 April 2016
bcc: Dynamic Tracing Tools for Linux (IOvisor github)
bcc: "bcc is more than just tools. The BPF enhancements that bcc uses were originally intended for software defined networking (SDN). In bcc, there are examples of this with distributed bridges, HTTP filters, fast packet droppers, and tunnel monitors. BPF was enhanced to support more than just networking, and has general tracing support in the Linux 4.x series. bcc is really a compiler for BPF, that comes with many sample tools. So far bcc has both Python and lua front ends. bcc/BPF, or just BPF, should become a standard resource for performance monitoring and analysis tools, to provide detailed metrics beyond /proc. Latency heat maps, flame graphs, and more should become commonplace in performance GUIs, powered by BPF."
'via Blog this'
Thursday, 31 March 2016
Probing the JVM with BPF/BCC | All Your Base Are Belong To Us
Probing the JVM with BPF/BCC | All Your Base Are Belong To Us: "Probing the JVM with BPF/BCC
Now that BCC has support for USDT probes, another thing I wanted to try is look at OpenJDK probes and extract some useful examples. To follow along, install a recent OpenJDK (I used 1.8) that has USDT probes enabled.
On my Fedora 22, sudo dnf install java was just enough for everything. Conveniently, OpenJDK ships with a set of .stp files that contain probe definitions. Here’s an example — and there are many more in your $JAVA_HOME/tapset directory:"
'via Blog this'
Monday, 28 March 2016
Brendan D. Gregg: Linux perf Examples
Linux perf Examples:
"These are some examples of using the perf Linux profiler, which has also been called Performance Counters for Linux (PCL), Linux perf events (LPE), or perf_events. Like Vince Weaver, I'll call it perf_events so that you can search on that term later. Searching for just "perf" finds sites on the police, petroleum, weed control, and a T-shirt. This is not an official perf page, for either perf_events or the T-shirt."
Julia Evans: How does perf work? (in which we read the Linux kernel source)
Friday, 18 March 2016
tcpdump is amazing - Julia Evans
tcpdump is amazing - Julia Evans:
Let's suppose you have some slow HTTP requests happening on your machine, and you want to get a distribution of how slow they are. You could add some monitoring somewhere inside your program. Or! You could use tcpdump. Here's how that works!
The secret here is that we can use tcpdump to record network traffic, and then use a tool that we're less scared of (Wireshark) to analyze it on our laptop after.
- Use tcpdump to record network traffic on the machine for 10 minutes
- analyze the recording with Wireshark
- be a wizard
Friday, 17 April 2015
Aruoba/Fernández-Villaverde: Comparison of Programming Languages in Economics
http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf
We solve the stochastic neoclassical growth model, the workhorse of modern macroeconomics, using C++11, Fortran 2008, Java, Julia, Python, Matlab, Mathematica, and R. We implement the same algorithm, value function iteration with grid search, in each of the languages. We report the execution times of the codes on a Mac and on a Windows computer and comment on the strengths and weaknesses of each language.
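The algorithm the paper benchmarks, value function iteration with grid search, can be sketched compactly. The snippet below solves a simplified deterministic version of the growth model (log utility, full depreciation; the parameter values and grid are illustrative, not the paper’s calibration):

```python
import math

# Value function iteration with grid search for the deterministic
# neoclassical growth model with log utility and full depreciation:
#   V(k) = max_{k'} [ ln(k^alpha - k') + beta * V(k') ]
alpha, beta = 0.3, 0.95
grid = [0.05 + 0.45 * i / 59 for i in range(60)]   # capital grid
V = [0.0] * len(grid)

for _ in range(600):
    V_new = []
    for k in grid:
        y = k ** alpha                 # output available this period
        # Grid search over feasible next-period capital (c = y - k' > 0).
        best = max(
            math.log(y - kp) + beta * V[j]
            for j, kp in enumerate(grid) if kp < y
        )
        V_new.append(best)
    delta = max(abs(a - b) for a, b in zip(V_new, V))
    V = V_new
    if delta < 1e-6:                   # sup-norm convergence check
        break
```

The contraction property (factor beta per sweep) guarantees convergence; the paper’s comparison is essentially about how fast each language executes this triple loop.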