This article originally appeared at Sharp Sight Labs, founded by Joshua Ebner.
Over and over, when talking with people who are starting to learn data science, there’s a frustration that comes up:
“I don’t know which programming language to start with.”
And it’s not just programming languages; it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages, and it’s difficult to know which one to select.
I get it. When I started focusing heavily on data science a few years ago, I reviewed all of the popular programming languages and tools at the time: Python, R, SAS, D3, not to mention a few that, in hindsight, really aren’t that great for analytics, like Perl, Bash, and Java. Even today, I just read a suggestion (by a well-known data science blogger) to use somewhat arcane tools like UNIX’s Awk and Sed. (Don’t worry if you have no idea what Awk and Sed are, because you shouldn’t learn them. Not in the beginning.)
There are so many suggestions, so much material, and so many options that it becomes difficult to know what to learn first. There’s a mountain of content, and it’s difficult to know where to find the “gold nuggets”: the things to learn that will bring you a high return on your time investment.
And that’s the crux of the problem. The fact is, your time is limited. Learning a new programming language is a large investment of your time, so you need to be strategic about which one you select.
To be clear, some languages will yield a very high return on your investment (your investment of both time and money). Other languages are purely auxiliary tools that you might use only a few times per year.
Let me make this easy for you: learn R first.
Focus on one language
Before describing why you should learn R, I want to emphasize that you should learn one language as you start learning data science.
As I’ve published R tutorials here at Sharp Sight Labs, I’ve had several people ask me whether or not they should learn Python (at the same time). My answer to this is essentially “no.” Unless you have a direct need for more than one language, you should select one.
The reason to focus on one programming language is that you need to focus much more on process and technique, not syntax. You need to learn how to think about data and how to solve problems using the tools of data science. As it turns out, I think that R is the best programming language for doing this.
Learn R
Almost without reservation, I recommend that you learn R as your first “data science programming language.” While there are exceptions (e.g. if you have a specific project need), I think that R is the best choice when you’re getting started.
Here’s why:
R is becoming the “lingua franca” of data science
R is becoming the lingua franca for data science. That’s not to say that it’s the only language, or that it’s the best tool for every job. It is, however, the most widely used and it is rising in popularity.
As I’ve noted before, O’Reilly Media conducted a survey in 2014 to understand the tools that data scientists were using. They found that R is the most popular programming language (if you don’t count SQL as a “proper” programming language).
Looking more broadly, there are other rankings that look at programming language popularity in general (not just among data scientists). For example, Redmonk measures programming language popularity by examining discussion (on Stack Overflow) and usage (on GitHub). In their latest rankings, R placed 13th, the highest of any statistical programming language. Redmonk also noted that R has been rising in popularity over time.
A similar ranking by TIOBE (which ranks programming languages by the number of search engine results that mention them) indicates a strong year-over-year rise for R.
Keep in mind that the Redmonk and TIOBE rankings are for all programming languages. When you look at these, R is now ranking among the most popular and most commonly used overall.
Companies using R
R is in heavy use at several of the best companies that are hiring data scientists. Google and Facebook, which I consider to be two of the best companies to work for in our modern economy, both have data scientists using R.
(To get an idea of how a company like Facebook uses R, I would definitely check out Solomon Messing’s blog. Solomon is a data scientist at Facebook, and his blog posts demonstrating R are excellent.)
As Revolution Analytics recently noted, “R is also the tool of choice for data scientists at Microsoft, who apply machine learning to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments.”
Beyond tech giants like Google, Facebook, and Microsoft, R is in use at a wide range of companies including Bank of America, Ford, TechCrunch, Uber, and Trulia.
R is popular in academia
R isn’t just a tool for industry. It is also very popular among academic scientists and researchers, a fact attested to in a recent profile of the R programming language in the prestigious journal Nature.
R’s popularity in academia is important because that creates a pool of talent that feeds industry.
Said differently, if the best and brightest people are trained in R at university, then that will increase the importance of R in industry. The supply of academics, PhDs, and researchers who leave academia for business will create its own demand for people with R skills.
Moreover, as data science matures, data scientists in the business world will need to communicate more with academic scientists. We will need to borrow techniques and share ideas. As we instrument the planet and transform the world into data-flows, the lines between academic science and business-oriented data science will likely blur.
Learning the “skills of data science” is easiest in R
The popularity of R isn’t the only reason to learn R, however.
Ultimately, to really learn data science, you need to learn the “core” skill areas: data manipulation, data visualization, and machine learning.
In selecting a language, you need a language that has significant capabilities in each of these areas. You need tools for performing each of these tasks, as well as resources for learning them in the language you choose.
As I noted above, you need to focus much more on process and technique, not syntax.
You need to learn how to think about solving problems.
You need to learn how to find insight in data.
To do this, you’ll need to master the 3 core skill areas of data science: data manipulation, data visualization, and machine learning. Mastering these skill areas will be easier in R than almost any other language.
Data wrangling
It’s often said that 80% of the work in data science is data manipulation. More often than not, you’ll need to spend significant amounts of your time “wrangling” your data: putting it into the shape you want. R has some of the best data management tools you’ll find.
The dplyr package in R makes data manipulation easy. It is the tool I wish I had years ago. When you “chain” the basic dplyr verbs together, you can dramatically simplify your data manipulation workflow.
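As a rough illustration of what chaining looks like, here is a minimal sketch using R’s built-in mtcars dataset and the %>% pipe that dplyr makes available. The dataset, the 100-horsepower cutoff, and the column names hp_per_wt, n_cars, and mean_hp_per_wt are assumptions chosen for this example, not something taken from a real project.

```r
library(dplyr)

# mtcars ships with R; it is used here purely as stand-in data.
power_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep cars with more than 100 horsepower
  mutate(hp_per_wt = hp / wt) %>%            # horsepower per 1000 lbs (wt is in 1000 lbs)
  group_by(cyl) %>%                          # one group per cylinder count
  summarise(
    n_cars = n(),                            # cars in each group
    mean_hp_per_wt = mean(hp_per_wt)         # average power-to-weight in each group
  ) %>%
  arrange(desc(mean_hp_per_wt))              # highest average first

print(power_by_cyl)
```

Each verb takes a data frame and returns a data frame, so the steps read top to bottom in the order you would describe them out loud.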
Data visualization
ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization.
I’ve said numerous times that there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.
Moreover, when you combine ggplot2 and dplyr together (using the chaining methodology), finding insight in your data becomes almost effortless.
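To make that concrete, here is a small sketch of a dplyr chain feeding directly into ggplot2, again using R’s built-in mtcars dataset. The choice of dataset, the horsepower filter, and the axis labels are assumptions made for illustration only.

```r
library(dplyr)
library(ggplot2)

# Filter the data with dplyr, then hand the result straight to ggplot2.
p <- mtcars %>%
  filter(hp > 100) %>%                                   # keep cars with more than 100 horsepower
  ggplot(aes(x = wt, y = mpg, colour = factor(cyl))) +   # map variables to visual properties
  geom_point() +                                         # draw one point per car
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

print(p)  # render the plot on the active graphics device
```

Notice how the plot is described in pieces: the data, the mapping from variables to aesthetics, and the geometric object used to draw them. That is the structured framework in action. Note also that the dplyr steps are joined with %>% while the ggplot2 layers are joined with +.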
Machine learning
Finally, there’s machine learning. While I think most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools.
When you’re ready to start using (and learning) machine learning, R has some of the best tools and resources.
One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.
Learn more languages and tools later
To be clear, eventually you’ll want to learn more programming languages. Just like there’s no single best tool in a toolbox, there’s no single programming language that’s perfect for every data problem you want to solve. So after you master the core skills of data science in R, you’ll probably want to learn other languages to solve specific problems.
Here’s a quick review of tools you might consider after you learn R:
Python
Python is a great multi-purpose programming language that you should definitely consider at some point. To be clear, in O’Reilly’s recent survey, Python was the second most popular programming language among data scientists. It has excellent visualization tools, as well as tools for machine learning. For most people, I consider Python to be the second language to learn.
D3
I love D3. The visualizations created in D3 are beautiful, and the interactivity of D3 visualizations is perfect for building dashboards. My issue with it is that it doesn’t scale well. To me, D3 is much more of a “craftsman’s tool.” It’s great for building an elegant data visualization, but creating such things more-or-less “by hand” won’t scale under circumstances where you have to support dozens of partners with new analyses and ad-hoc requests.
I’m also optimistic that R’s ggvis will allow R users to create highly dynamic and interactive visualizations, so at some point, R users may be able to learn R’s ggvis instead of D3.
Summary: Learn R, and focus your efforts
So to reiterate, choose one language. If you’re starting out, R is almost certainly the best choice. And, really focus on learning the skills of data science.
Additionally, once you start to learn R, don’t get “shiny new object” syndrome.
You’re likely to see demonstrations of new techniques and tools. Just look at some of the dazzling data visualizations that people are creating.
Seeing other people create great work (and finding out that they’re using a different tool) might lead you to try something else. Trust me on this: you need to focus. Don’t get “shiny new object” syndrome. You need to be able to devote a few months (or longer) to really diving into one tool.
And as I noted above, you really want to build up your competence in skills across the data science workflow. You need to have solid skills at least in data visualization and data manipulation. You need to be able to do some serious data exploration in R before you start moving on.
Spending 100 hours on R will yield vastly better returns than spending 10 hours each on 10 different tools. In the end, your time ROI will be higher by concentrating your efforts. Don’t get distracted by the “latest, sexy new thing.”
(Image credit: Mandelbrot Creation Animation, generated using R, via Wikimedia Commons)