Monday, 8 June 2015

Machine Learning links

A Visual Introduction to Machine Learning

10 more lessons learned from building Machine Learning systems


A Tour of Machine Learning Algorithms

There are only a few main learning styles or learning models that an algorithm can have, and we’ll go through them here with a few examples of algorithms and problem types that they suit. This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process, and to select the one that is most appropriate for your problem in order to get the best result.
  • Supervised Learning: Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data. Example problems are classification and regression. Example algorithms are Logistic Regression and the Back Propagation Neural Network.
  • Unsupervised Learning: Input data is not labelled and does not have a known result. A model is prepared by deducing structures present in the input data. Example problems are association rule learning and clustering. Example algorithms are the Apriori algorithm and k-means.
  • Semi-Supervised Learning: Input data is a mixture of labelled and unlabelled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression. Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabelled data.
  • Reinforcement Learning: Input data is provided as stimulus to a model from an environment to which the model must respond and react. Feedback is provided not from a teaching process as in supervised learning, but as punishments and rewards in the environment. Example problems are control systems and robot control. Example algorithms are Q-learning and Temporal difference learning.
When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods. A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labelled examples. Reinforcement learning is more likely to turn up in robotic control and other control systems development.
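
To make the taxonomy concrete, here is a minimal scikit-learn sketch (toy data, purely illustrative) contrasting the two paradigms you will meet most often: Logistic Regression trained on labelled data, versus k-means deducing clusters from unlabelled data.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # known labels -> supervised setting

    # Supervised: fit on (X, y), then predict labels for unseen points.
    clf = LogisticRegression().fit(X, y)
    print(clf.predict(rng.randn(3, 2)))

    # Unsupervised: no labels given; k-means deduces two clusters on its own.
    km = KMeans(n_clusters=2, random_state=0).fit(X)
    print(km.labels_[:10])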




What is machine learning?


Implementing a Distributed Deep Learning Network over Spark – Data Science Central





Graphics in reverse: Probabilistic programming does in 50 lines of code what used to take thousands.
In a probabilistic programming language, the heavy lifting is done by the inference algorithm — the algorithm that continuously readjusts probabilities on the basis of new pieces of training data. In that respect, Kulkarni and his colleagues had the advantage of decades of machine-learning research. Built into Picture are several different inference algorithms that have fared well on computer-vision tasks. Time permitting, it can try all of them out on any given problem, to see which works best.
Moreover, Kulkarni says, Picture is designed so that its inference algorithms can themselves benefit from machine learning, modifying themselves as they go to emphasize strategies that seem to lead to good results. “Using learning to improve inference will be task-specific, but probabilistic programming may alleviate re-writing code across different problems,” he says. “The code can be generic if the learning machinery is powerful enough to learn different strategies for different tasks.”
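
The article doesn't show Picture's code, but the core loop it describes, an inference algorithm that continuously readjusts probabilities against observed data, can be sketched in a few lines. Here is a toy Metropolis-Hastings sampler (my own illustration, not Picture's API or algorithms) that infers a coin's bias from some made-up flips:

    import math
    import random

    random.seed(0)
    flips = [1, 1, 0, 1, 1, 1, 0, 1]  # observed training data (made up)

    def log_likelihood(p):
        # how well a candidate bias p explains the observed flips
        return sum(math.log(p if f else 1.0 - p) for f in flips)

    p, samples = 0.5, []
    for _ in range(10000):
        proposal = p + random.gauss(0, 0.1)
        if 0.0 < proposal < 1.0:
            # accept proposals that explain the data better; occasionally
            # accept worse ones, in proportion to relative probability
            if math.log(random.random()) < log_likelihood(proposal) - log_likelihood(p):
                p = proposal
        samples.append(p)

    print(sum(samples[1000:]) / len(samples[1000:]))  # posterior mean, ~0.7 here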

Peter Norvig: Machine Learning for Programming

Q: Can we learn complex nontraditional programs from examples? 
A: Not yet, maybe someday.
Q: Can we learn to optimize programs? 
A: Yes, short parts 




17 Great Machine Learning Libraries

Python

  • Scikit-learn: comprehensive and easy to use; I wrote a whole article on why I like this library.
  • PyBrain: neural networks are one thing missing from scikit-learn, and this module makes up for it.
  • nltk: really useful if you’re doing anything NLP or text-mining related.
  • Theano: efficient computation of mathematical expressions using the GPU. Excellent for deep learning (see the short sketch after this list).
  • Pylearn2: machine learning toolbox built on top of Theano - in very early stages of development.
  • MDP (Modular toolkit for Data Processing): a framework that is useful when setting up workflows.
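
As a taste of the Theano entry above: you build a symbolic expression, ask for its gradient, and compile both into a callable function (CPU here; the same code can be pointed at a GPU via Theano's device flags). A tiny illustrative example:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dvector('x')   # symbolic vector
    y = T.sum(x ** 2)    # symbolic expression
    gy = T.grad(y, x)    # gradient derived automatically

    f = theano.function([x], [y, gy])
    print(f(np.array([1.0, 2.0, 3.0])))  # -> [14.0, [2.0, 4.0, 6.0]]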

Java

  • Spark: Apache’s new upstart, supposedly up to a hundred times faster than Hadoop, now includes MLlib, which contains a good selection of machine learning algorithms, including classification, clustering and recommendation generation. Currently undergoing rapid development. Development can be in Python as well as JVM languages (a quick PySpark sketch follows this list).
  • Mahout: Apache’s machine learning framework built on top of Hadoop; this looks promising, but comes with all the baggage and overhead of Hadoop.
  • Weka: a Java-based library with a graphical user interface that allows you to run experiments on small datasets. This is great if you restrict yourself to playing around to get a feel for what is possible with machine learning. However, I would avoid using this in production code at all costs: the API is very poorly designed, the algorithms are not optimised for production use and the documentation is often lacking.
  • Mallet: another Java-based library with an emphasis on document classification. I’m not so familiar with this one, but if you have to use Java this is bound to be better than Weka.
  • JSAT: stands for “Java Statistical Analysis Tool”; created by Edward Raff, it was born out of his frustration with Weka (I know the feeling). Looks pretty cool.
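
As noted in the Spark entry above, MLlib work can be done in Python. A hedged sketch of its k-means clustering (the toy points and app name are mine; run it in a pyspark shell or via spark-submit):

    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="mllib-kmeans-sketch")  # made-up app name
    points = sc.parallelize([
        array([0.0, 0.0]), array([0.1, 0.1]),   # one cluster near the origin
        array([9.0, 9.0]), array([9.1, 9.1]),   # another far away
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    sc.stop()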

http://www.reddit.com/r/MachineLearning/
AMA Andrew Ng and Adam Coates:
Linear/logistic regression and k-means clustering are probably the dominant paradigms in ML, and likely will always be. There's just too much bang for the buck.

@andrewyng
What kind of self-projects and follow-up courses would you recommend after the Coursera ML course?

andrewyng replied:
Here're a few common paths:
  1. Many people are applying ML to projects by themselves at home, or in their companies. This helps both with your learning and with building up a portfolio of ML projects for your resume (if that is your goal). If you're not sure what projects to work on, Kaggle competitions can be a great way to start, though if you have your own ideas I'd encourage you to pursue those as well. If you're looking for ideas, check out also the machine learning projects my Stanford class did last year: http://cs229.stanford.edu/projects2014.html (I'm always blown away by the creativity and diversity of the students' ideas; I hope this also helps inspire ideas in others!)
  2. If you're interested in a career in data science, many people go on from the machine learning MOOC to take the Data Science specialization: https://www.coursera.org/specialization/jhudatascience/1 Many students are successfully using this combination to start off data science careers.


Mark Hall on Data Mining & Weka: Weka and Spark

Spark is smokin' at the moment. Every being and his/her/its four-legged canine-like companion seems to be scrambling to provide some kind of support for running their data processing platform under Spark. Hadoop/YARN is the old big data dinosaur, whereas Spark is like Zaphod Beeblebrox - so hip it has trouble seeing over its own pelvis; so cool you could keep a side of beef in it for a month :-) Certainly, Spark has advantages over first generation map-reduce when it comes to machine learning. Having data (or as much as possible) in memory can't be beat for iterative algorithms. Perhaps less so for incremental, single pass methods.
Anyhow, the coolness factor alone was sufficient to get me to have a go at seeing whether Weka's general-purpose distributed processing package - distributedWekaBase - could be leveraged inside of Spark. To that end, there is now a distributedWekaSpark package which, aside from a few small modifications to distributedWekaBase (mainly to support retrieval of outputs in-memory rather than from files), proved fairly straightforward to produce. In fact, because develop/test cycles seemed so much faster in Spark than Hadoop, I prototyped Weka's distributed k-means|| implementation in Spark before coding it for Hadoop. 

Introduction to Data Science with Apache Spark: Get started with Zeppelin on HDP

Apache Spark provides a lot of valuable tools for data science. With our release of Apache Spark 1.3.1 Technical Preview, the powerful Data Frame API is available on HDP.
Data scientists use data exploration and visualization to help frame the question and fine-tune the learning. Apache Zeppelin helps with this.
Zeppelin is a web-based notebook server built around the concept of an interpreter that can be bound to any language or data-processing backend. Spark is one of its backends, and other interpreters, such as Hive, Markdown, and D3, are also available.
In a series of blog posts, we will describe how Zeppelin, Spark SQL and MLlib can be combined to simplify exploratory data science. In the first post of this series, we describe how to install/build Zeppelin for HDP 2.2 and uncover some basic capabilities for data exploration that Zeppelin offers. 
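
To give a flavour of that workflow, a Zeppelin note might hold paragraphs like these (illustrative only; %pyspark and %sql are Zeppelin interpreter directives, and the data and table name are made up):

    %pyspark
    # build a small DataFrame and register it so the SQL interpreter can query it
    rows = [("alice", 34), ("bob", 28)]
    df = sqlContext.createDataFrame(rows, ["name", "age"])
    df.registerTempTable("people")

    %sql
    -- Zeppelin renders the result as a table or chart automatically
    SELECT name, age FROM people ORDER BY age DESC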

Machine learning: an overview

Let me tell you something: you can't really use Machine Learning if you don't know the statistical/mathematical basis. I get really upset when I see a YouTube video of some guy in a T-shirt, probably working at a large organization, ranting about Machine Learning and Data Science and telling programmers that the maths is easy to grasp. Everybody knows how to press a button and, if pressed, almost everybody knows how to fix something in their Windows control panel, but that does not mean we can trust them to build a secure payment system. Likewise, everybody can use Mahout or the like, but that does not mean they know what they are doing when they use Naive Bayes to predict the class from three variables (x, y, z) where z = x^2 and x belongs to the range [-1, 1].
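
To make that Naive Bayes jab concrete: the model assumes features are conditionally independent given the class, so feeding it a feature alongside a deterministic transform of itself double-counts the evidence. A sketch of the pitfall (using an exact duplicate of the feature instead of z = x^2, which makes the effect easy to see; all data is made up):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.RandomState(0)
    x = np.concatenate([rng.normal(-1, 1, 500), rng.normal(1, 1, 500)])
    y = np.array([0] * 500 + [1] * 500)

    nb_one = GaussianNB().fit(x.reshape(-1, 1), y)
    nb_dup = GaussianNB().fit(np.column_stack([x, x]), y)  # dependent "features"

    # duplicating the feature doubles its log-likelihood ratio, so the
    # predicted probabilities get pushed toward 0 or 1 (overconfidence)
    print(nb_one.predict_proba(np.array([[0.5]])))       # roughly [0.27, 0.73]
    print(nb_dup.predict_proba(np.array([[0.5, 0.5]])))  # more extreme, ~[0.12, 0.88]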
Machine Learning is just a fancy term for the statistical/mathematical tools lying underneath, whose objective is to extract something that we may loosely call knowledge (something that we understand) from data (something chaotic that we do not understand), so that computers can take action based on the inferred knowledge. An example would be a robot arm or humanoid: instead of programming its actions through direction/velocity/acceleration vectors based on an established model, we may put sensors on a subject's joints and, from those data points, learn a regression model on the manifold of natural movements. Another example is in Business Intelligence: we may learn groups of customers (market segmentation) so that we can engage each group with policies or offers targeted at them.
Machine Learning is applied Statistics/Mathematics. It is of little use, and quite impractical, without Optimization/Operations Research, from both the algorithmic and the practical/scalable points of view.
I've come to the conclusion that there exist two main approaches to ML, regardless of the specific technique we are dealing with and its target (i.e., supervised or unsupervised), plus one in between:
  • Functional approach (Mathematical)
  • Neural Network/Deep Learning approach (Middle way)
  • Probabilistic approach (Statistical)


Deep Learning with Spark and TensorFlow 
