NY Times: For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

Several start-ups are trying to break through these big data bottlenecks by developing software to automate the gathering, cleaning and organizing of disparate data, which is plentiful but messy. The modern Wild West of data needs to be tamed somewhat so it can be recognized and exploited by a computer program.

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.
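To make the “janitor work” concrete, here is a minimal sketch of the kind of cleaning that typically has to happen before any analysis can start. The file name, column names and cleaning rules are all hypothetical, and pandas is just one common choice of tool, not anything prescribed by the article.

```python
import pandas as pd

# Hypothetical raw export: inconsistent column names, mixed date formats,
# duplicated rows and free-text numeric fields are typical "janitor work".
df = pd.read_csv("raw_export.csv")

# Normalise column names, e.g. "Signup Date " -> "signup_date".
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Parse dates that arrive in several formats; unparseable values become NaT.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Strip currency symbols and thousands separators from a numeric field.
df["revenue"] = (
    df["revenue"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
)

# Drop exact duplicates and rows missing the fields we actually need.
df = df.drop_duplicates().dropna(subset=["signup_date", "revenue"])
```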
Peadar Coyle: Data Science as a Process
Practical Probabilistic Programming (http://t.co/NpecSd6MIP) looks interesting. Uses a Scala toolkit called Figaro, https://t.co/r2TRM6jUFm
— Dean Wampler (@deanwampler) August 11, 2014
Making Sense of Performance in Data Analytics Frameworks
The authors studied Spark workloads and had to add the needed instrumentation themselves. They perform ‘blocked task analysis’. This enables them to measure the time spent blocked on disk I/O and network I/O, and thus calculate the theoretical maximum improvement from eliminating all such blocking. With respect to the conventional wisdom, Ousterhout et al. find that:
Network optimization can reduce job completion time by a median of 2% at best.
Optimizing or eliminating disk accesses can reduce job completion time by a median of 19% at best.
Optimizing stragglers can reduce job completion time by a median of 10% at best.
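The intuition behind blocked time analysis can be illustrated with a small back-of-the-envelope calculation: if we know, for every task, how long it spent blocked on the network or on disk, we can ask how much faster the job could possibly get if that blocked time vanished. The sketch below is only an assumed simplification (it treats job time as proportional to total task time); the paper's actual analysis replays the per-machine schedule with the blocked time removed.

```python
from dataclasses import dataclass

@dataclass
class Task:
    duration_s: float         # wall-clock time of the task
    network_blocked_s: float  # time blocked waiting on the network
    disk_blocked_s: float     # time blocked waiting on disk I/O

def max_improvement(tasks, resource):
    """Crude upper bound on the job speed-up from eliminating all time
    blocked on one resource, assuming job time scales with total task time.
    Purely illustrative, not the instrumentation used in the paper."""
    total = sum(t.duration_s for t in tasks)
    blocked = sum(getattr(t, f"{resource}_blocked_s") for t in tasks)
    return blocked / total  # fraction of job time that could be saved

# Hypothetical per-task measurements for one job.
tasks = [
    Task(duration_s=10.0, network_blocked_s=0.2, disk_blocked_s=1.5),
    Task(duration_s=12.0, network_blocked_s=0.1, disk_blocked_s=2.0),
    Task(duration_s=8.0,  network_blocked_s=0.3, disk_blocked_s=1.0),
]

print(f"network: at most {max_improvement(tasks, 'network'):.1%} faster")
print(f"disk:    at most {max_improvement(tasks, 'disk'):.1%} faster")
```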
Makélélé and Linear Algebra
http://graphicallinearalgebra.net/
The importance of Makélélé’s role was difficult to appreciate for the non-specialist. But football insiders regularly described him as the work-horse, the engine room, the battery of the team. He sat deep in midfield, was always in the right place to disrupt opposition attacks, recovered possession, and got the ball out quickly to his teammates, turning defence into attack. Without Makélélé, the galácticos didn’t look quite so galactic.
Similarly, linear algebra does not get very much time in the spotlight. But many galáctico subjects of modern scientific research (artificial intelligence and machine learning, control theory, solving systems of differential equations, computer graphics, “big data”, and even quantum computing) have a dirty secret: their engine rooms are powered by linear algebra.
Linear algebra is not very glamorous. It is normally taught to science undergraduates in their first year, to prepare for the more exciting stuff ahead. It is background knowledge. Everyone has to learn what a matrix is, and how to add and multiply matrices. The question “what is a matrix?” is typically answered in a particularly boring way: a matrix is a double array of numbers.
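For anyone who has not seen it since that first-year course, here is the “boring” definition in action: matrices add entrywise and multiply by combining rows with columns, for example

```latex
\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix}
1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\
3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8
\end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
\]
```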
Python versus R for Data Science
I’m a huge fan of doing exploratory data analysis/visualization in R using ggplot2 and dplyr. Using these tools, I find it straightforward to translate my ideas from English into code and visualizations. The analogous process in Python is usually more convoluted, and the results are not nearly as pleasing to the eye or as informative. I’ve found that the ease of working in R greatly enhances my creativity and general happiness.
That being said, R also has some pretty huge drawbacks relative to Python. The language is byzantine and weird, and anytime I leave the magical world of dplyr and ggplot I start to feel the pain. Furthermore, I find it hard to automate workflows or build reusable code. My current strategy is to leverage the best of both worlds — do early-stage data analysis in R, then switch to Python when it’s time to get serious, be a team player, and ship some real code and data products.
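For comparison, here is a minimal sketch of how the same exploratory step might look on the Python side of that workflow, using pandas and matplotlib. The data file and column names are hypothetical; the point is only to show the rough equivalent of dplyr's group_by() + summarise() followed by a ggplot2 bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: one row per session, with a group label and a score.
df = pd.read_csv("sessions.csv")

# Roughly the pandas equivalent of group_by() + summarise() in dplyr.
summary = (
    df.groupby("group", as_index=False)
      .agg(mean_score=("score", "mean"), sessions=("score", "size"))
)

# A bar chart where ggplot2 would use geom_col().
summary.plot.bar(x="group", y="mean_score", legend=False)
plt.ylabel("mean score")
plt.tight_layout()
plt.show()
```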
Simpson's paradox
Simpson's paradox, or the Yule–Simpson effect, is a paradox in probability and statistics in which a trend appears in different groups of data but disappears or reverses when these groups are combined. It is sometimes given the impersonal title reversal paradox or amalgamation paradox.[1] This result is often encountered in social-science and medical-science statistics,[2] and is particularly confounding when frequency data are unduly given causal interpretations.[3] Simpson's paradox disappears when causal relations are brought into consideration. Many statisticians believe that the mainstream public should be informed of counter-intuitive results in statistics such as Simpson's paradox.[4][5]
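A tiny, entirely made-up example makes the reversal concrete: treatment A has the higher success rate inside each subgroup, yet the lower rate once the subgroups are pooled, because A is mostly applied to the harder cases.

```python
# Hypothetical counts (successes, trials), chosen only to illustrate the effect.
data = {
    "easy cases": {"A": (8, 10),   "B": (70, 100)},
    "hard cases": {"A": (20, 100), "B": (1, 10)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for group, arms in data.items():
    for arm, (successes, trials) in arms.items():
        totals[arm][0] += successes
        totals[arm][1] += trials
        print(f"{group:10s} {arm}: {successes / trials:.0%}")

for arm, (successes, trials) in totals.items():
    print(f"combined   {arm}: {successes / trials:.0%}")
```

Here A wins in both subgroups (80% vs 70%, and 20% vs 10%) but loses overall (about 25% vs 65%), precisely because the arms are unevenly spread across easy and hard cases — the kind of confounding the quoted paragraph warns about when frequency data are given causal interpretations.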
Peadar Coyle: Interview with a Data Scientist (Hadley Wickham)
Peadar Coyle: A Bayesian Hierarchical model for the Six Nations