Monday, 6 April 2015

Trey Causey: Getting started in data science: My thoughts



http://treycausey.com/getting_started.html


One of the primary things that separates a data scientist from someone just building models is the ability to think carefully about things like endogeneity, causal inference, and experimental and quasi-experimental design. Data scientists must understand and think about things like data generating processes and reason through how misspecifying them affects the conclusions they draw from their analyses.

It takes a long time and a lot of training for this to come naturally. I don't think I gave much thought to selecting on the dependent variable and how endemic it is until I got to grad school. Now it sticks out like a sore thumb everywhere I look. Similarly, thinking carefully about outliers (extreme values) or about the process by which your data came to have missing values often gets swept aside by tutorials showing you how to use R. This isn't to say you have to go to grad school (you probably shouldn't) or even to college; it just means that data science is not simply a series of programs and tutorials that automatically make inferences from your data. Oftentimes, what isn't in your data has significant implications for inference. Your software package isn't going to tell you what they are.
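To make "selecting on the dependent variable" concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from a real dataset: the data are synthetic, the true slope is set to 1.0, the cutoff is arbitrary, and the helper names `ols_slope` and `simulate` are made up for this example. It fits the same simple regression twice, once on all the data and once only on cases where the outcome is high.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


def simulate(n=20000, true_slope=1.0, cutoff=1.0, seed=0):
    """Return (slope on full sample, slope on sample selected on y)."""
    rng = random.Random(seed)
    # Synthetic data generating process: y = true_slope * x + noise.
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0, 1) for x in xs]
    # Selecting on the dependent variable: keep only "successes" (high y).
    kept = [(x, y) for x, y in zip(xs, ys) if y > cutoff]
    kept_xs = [x for x, _ in kept]
    kept_ys = [y for _, y in kept]
    return ols_slope(xs, ys), ols_slope(kept_xs, kept_ys)


full_slope, selected_slope = simulate()
print(f"slope on full sample:     {full_slope:.2f}")
print(f"slope on selected sample: {selected_slope:.2f}")
```

Under these assumptions the full-sample slope lands close to the true value, while the slope estimated only from high-outcome cases comes out noticeably smaller. Nothing in the regression output warns you that the sample was truncated; you have to know how the data were collected.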

All this being said, I do think we live in an extremely exciting time for democratizing education. I hope some good comes out of it. Enough doom and gloom; let's get on to the links.





* Math. There is no getting around it. You simply must study math and statistics. I use linear algebra in my work daily. If you have never taken calculus and can only take one math course right now, I highly recommend Gilbert Strang's linear algebra course on MIT's OpenCourseWare. edX is also currently offering an introductory linear algebra course that is well put-together, but Strang is the gold standard. Of course, you should also have taken a basic multivariate calculus course if you want to read research papers that implement new algorithms, but I use pure calculus far less in my day-to-day work. I don't recommend it, but many social scientists make it all the way through the PhD and into academic positions having never taken a calculus course.

* Statistics. The vast majority of my job revolves around statistical inference. As mentioned above, linear regressions are incredibly simple to estimate, yet there are some core assumptions that, if not met, can render your results sketchy at best and completely invalid at worst. Training in statistics will teach you to know these assumptions, to understand what happens when they're not met, and to know what to do about it. In fact, training in statistics usually takes the path of "here's some simple linear models" in a course or two. Then nearly every course following that tries to figure out how to estimate models that violate the assumptions of linear models, but in different ways: autocorrelation in time series data, non-independent observations due to time or spatial clustering, dependent variables that are counts with lots of zeros, and so on.

* Experiments and causal inference. You should also be well-versed in thinking about research design. If you're going to be in charge of your company's split tests and experiments, you'll want to master this stuff. Judea Pearl's Causality is probably the most well-known and referenced work, but it's not for beginners. You could do really well for yourself by starting with a basic research methods textbook, especially from the social sciences, as they're often concerned with doing experiments when you're not in a laboratory setting. Designing Social Inquiry, referred to by many as "KKV", is a really good starting point for some of this.

* Machine learning. Machine learning and statistics have significant overlap, but while statistics is often concerned with precise and unbiased estimates of parameters, machine learning is usually focused on making accurate predictions on unseen data. Andrew Ng's course runs routinely on Coursera and is a good first step. The barriers to entry in reading about machine learning are significantly higher than many other topics, as machine learning is an applied subfield of computer science. Since the math requirements for CS majors are often non-trivial, a working knowledge of multivariate calculus is often assumed. In fact, one of the fundamental estimation techniques used in many machine learning algorithms, stochastic gradient descent, assumes you know what a gradient is.

For a more applied, less math-heavy introduction to the concepts in R, I highly recommend Machine Learning for Hackers by Drew Conway and John Myles White. Full disclosure: they are my friends, but don't hold that against them. Machine Learning in Action is also good, for those who prefer Python.
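For readers wondering what "knowing what a gradient is" buys you, here is a minimal sketch of stochastic gradient descent fitting a one-variable linear model in plain Python. It is an illustration under assumed settings, not how any particular library implements it: the data are synthetic (true slope 3.0, intercept 2.0), and the function name `sgd_linear_regression`, the learning rate, and the epoch count are all choices made up for this example.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.01, epochs=50, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        # "Stochastic": visit the observations one at a time in random order.
        rng.shuffle(indices)
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Loss for one observation is 0.5 * error**2, so its gradient is
            # error * x with respect to w and error with respect to b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(500)]
ys = [3.0 * x + 2.0 + data_rng.gauss(0, 0.1) for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(f"estimated slope: {w:.2f}, estimated intercept: {b:.2f}")
```

The two update lines are the whole algorithm: each step nudges the parameters a small distance against the gradient of the loss for a single observation. With these settings the estimates should end up close to the slope and intercept used to generate the data.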
