
Friday, 18 September 2015

Hadoop, Spark, Storm (and ecosystem) links



What MapReduce can't do

We discuss here a large class of big data problems where MapReduce can't be used - not in a straightforward way, at least - and we propose a rather simple analytic, statistical solution.

MapReduce is a technique that splits big data sets into many smaller ones, processes each small data set separately (but simultaneously) on different servers or computers, then gathers and aggregates the results of all the sub-processes to produce the final answer. Such a distributed architecture allows you to process big data sets 1,000 times faster than traditional (non-distributed) designs, if you use 1,000 servers and split the main process into 1,000 sub-processes.

MapReduce works very well in contexts where variables or observations are processed one by one. For instance, you analyze 1 terabyte of text data, and you want to compute the frequencies of all keywords found in your data. You can divide the 1 terabyte into 1,000 data sets, each 1 gigabyte. Now you produce 1,000 keyword frequency tables (one for each subset) and aggregate them to produce the final table.

However, when you need to process variables or data sets jointly, that is 2 by 2 or 3 by 3, MapReduce offers no benefit over non-distributed architectures. One must come up with a more sophisticated solution.
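A minimal sketch of the keyword-frequency example above, using Python's built-in multiprocessing pool to stand in for a cluster (the toy corpus and one-line "chunks" are placeholders; a real deployment would use Hadoop or Spark):

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count keyword frequencies in one chunk of the corpus."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: aggregate the per-chunk frequency tables."""
    total = Counter()
    for c in partials:
        total += c
    return total

if __name__ == "__main__":
    corpus = ["big data is big", "map reduce splits big data"]  # toy stand-in
    chunks = [corpus[i:i + 1] for i in range(len(corpus))]      # one "server" per chunk
    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials))
```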

Here we are talking about joining 10,000 data sets with one another, each data set being a time series of 5 or 10 observations. The point of the article is that the distributed architecture of MapReduce does not bring any smart processing to these computations: just brute force (better than a non-parallel approach, to be sure), but it is incapable of removing the massive redundancy involved, both in computation and in storage, since the same data set must be copied onto many servers to allow all cross-correlations to be computed.
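A small illustration of the combinatorial cost described above: all pairwise correlations of n short time series. The data below is random and the demo size is shrunk to 50 series, but with the article's figures (10,000 series) a naive map-over-pairs job faces roughly 50 million tasks, each needing both series shipped to it:

```python
import itertools
import math
import random

n, length = 10_000, 10
print(math.comb(n, 2))  # 49,995,000 pairs for 10,000 series

# Exploiting symmetry (corr(x, y) == corr(y, x)) halves the naive n*n sweep,
# but the quadratic growth - and the cross-node data duplication - remains.
series = [[random.random() for _ in range(length)] for _ in range(50)]  # small demo

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

pairs = {(i, j): corr(series[i], series[j])
         for i, j in itertools.combinations(range(len(series)), 2)}
```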


This underscores the importance of a major upcoming change to Hadoop: the introduction of YARN, which will open up Hadoop's distributed data and compute framework for use with non-MapReduce paradigms and frameworks. One such example is RushAnalytics, which, incidentally, also adds fine-grained parallelism within each node of a cluster.

Monday, 8 June 2015

Machine Learning links

A Visual Introduction to Machine Learning

10 more lessons learned from building Machine Learning systems


A Tour of Machine Learning Algorithms

There are only a few main learning styles or learning models that an algorithm can have, and we'll go through them here with a few examples of algorithms and problem types that they suit. This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process, and to select the approach that is most appropriate for your problem in order to get the best result.
  • Supervised Learning: Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a given time. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data. Example problems are classification and regression. Example algorithms are Logistic Regression and the Back Propagation Neural Network.
  • Unsupervised Learning: Input data is not labelled and does not have a known result. A model is prepared by deducing structures present in the input data. Example problems are association rule learning and clustering. Example algorithms are the Apriori algorithm and k-means.
  • Semi-Supervised Learning: Input data is a mixture of labelled and unlabelled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression. Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabelled data.
  • Reinforcement Learning: Input data is provided as stimulus to a model from an environment to which the model must respond and react. Feedback is provided not through a teaching process as in supervised learning, but as punishments and rewards from the environment. Example problems are systems and robot control. Example algorithms are Q-learning and temporal difference learning.
When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods. A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labelled examples. Reinforcement learning is more likely to turn up in robotic control and other control systems development.
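As a concrete illustration of the first two learning styles above, here is a short sketch using scikit-learn (the library is an assumption; the post names the algorithms, not an implementation). The same toy points are fit with logistic regression against known labels, then clustered with k-means with the labels withheld:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # known labels -> supervised

# Supervised: the model is trained against the labels and corrected.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: the same points, no labels - structure is deduced instead.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```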

Wednesday, 22 April 2015

General Big Data, Big Analytics links

5 technologies that will help big data cross the chasm

Apache Spark

Spark’s popularity is aided by the YARN resource manager for Hadoop and the Apache Mesos cluster-management software, both of which make it possible to run Spark, MapReduce and other processing engines on the same cluster using the same Hadoop storage layer. I wrote in 2012 about the move away from MapReduce as one of five big trends helping us rethink big data, and Spark has stepped up as the biggest part of that migration.
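A minimal PySpark sketch of the "same cluster, same storage" point above: the script reads from the Hadoop storage layer and can be submitted to a YARN cluster with spark-submit --master yarn wordcount.py (the HDFS paths here are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///data/corpus.txt")   # read from the Hadoop storage layer
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))      # aggregate counts per word
counts.saveAsTextFile("hdfs:///data/wordcounts")
sc.stop()
```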

2015: Predictions for Big Analytics (Thomas W. Dinsmore)

Apache Spark usage will explode.
Analytics in the cloud will take off.
Python will continue to gain on R as the preferred open source analytics platform.
H2O will continue to win respect and customers in the Big Analytics market.
SAS customers will continue to seek alternatives.