I shied away from R for quite some time. My background is in C++, Java, and later C# with the major flavors of database engines thrown into the mix. I dabbled with R, but could never understand why it was so appealing. It simply wasn't intuitive in a classic programming sense. At least, this is the way I thought back then.
As I shifted away from pure programming to the field of data science, however, I realized it was time to commit to learning R. I took some free classes and started hunkering down. I have finally broken through the barrier and can call myself a qualified R programmer.
I find myself doing everything in R, from command-line calculations to scraping the web for data. Recently, I took on a project analyzing baseball data (from a package that can be installed directly within R). It's been fun going back in time, exploring the history of players and games. Giving yourself actual projects to work on is one of the best ways to learn any language, and R is no exception.
Why do I believe you will become hooked on the R language? It has two features that I find quite appealing and cannot imagine living without: element-wise processing and enhanced subsetting. I'll get to why these features are great below.
Admittedly, R is going to require a mindset shift. If you are not a fan of interpreted languages, R could rub you the wrong way, because that is exactly what it is. R doesn't have a particularly great user interface, although GUI tools such as RStudio help. However, don't expect to develop frontend applications using R, at least not in the traditional sense of development.
I will describe the two features of R that I feel have the biggest potential to convert you, starting with element-wise processing. Suppose you are given an assignment to read a CSV file and summarize the data it contains. Your boss wants to know the average salary across all countries, and the CSV file contains average salaries by country. How would you process this?
In a language such as VBA, for example, you would read in the file using some built-in file construct. Then, you would need to parse the file (usually line-by-line) and further parse each line into a structure of some form. Then, you either need to perform the required summarization (average) tasks on the data or store the structure for use later in the program. Let's suppose you are going to process it later. This helps you modularize the program, making it easier to read and maintain.
You have created that structure (an array, perhaps?) and it's time to process it for summarization. You would need to loop through the array to obtain the data you need. For this example, you would find the salary column and add each value to a running total. Once you have summed all of the values, you would divide by the number of items in the array.
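To make the contrast concrete, here is a sketch of that manual, loop-based approach written in R itself. The file name and its two-column layout are assumptions for illustration, and the sample data is made up:

```r
# Write a tiny sample file so the sketch is self-contained
# (hypothetical country/salary data).
writeLines(c("country,salary",
             "Freedonia,50000",
             "Sylvania,70000"), "population.csv")

# The loop-based style you would use in VBA or C: read, parse, accumulate.
lines <- readLines("population.csv")
total <- 0
count <- 0
for (line in lines[-1]) {                 # skip the header row
  fields <- strsplit(line, ",")[[1]]      # split the line into columns
  total  <- total + as.numeric(fields[2]) # column 2 holds the salary
  count  <- count + 1
}
averageSalary <- total / count
```

Note all the bookkeeping: a running total, a counter, and manual parsing. R makes every bit of it unnecessary, as we're about to see.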
I understand the above task wouldn't be difficult in most traditional languages. It's clear-cut what needs to be done. However, it still takes several lines of code. I am going to show how this is done in R.
For this discussion, let's say I have a file named "population.csv" that contains the salary data of citizens by country. We want to take the average of those salaries. This example is good enough to illustrate the point. Here are the lines of code to process this in R:
salaries <- read.csv("population.csv")
averageSalaries <- mean(salaries$salary)
That is all it took to satisfy the requirement. Of course, this is an oversimplification: I would need to specify where the file is located, just like in any other programming language. In this case, it exists in my working directory, but I could just as easily have supplied a path. That path could even be a URL to a website that contains the CSV file.
You'll notice there is no looping in this code. This is the power of element-wise processing. It simply took the mean of all of the elements in the salary column of the salaries structure (in this case, a data frame). This is yet another benefit of R, by the way: the read.csv() function took care of parsing the file and creating a data frame (similar to a database object in memory, for lack of a better term).
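Element-wise processing goes well beyond mean(). Any arithmetic or comparison applied to a vector applies to every element at once; the numbers below are made up purely for illustration:

```r
# A hypothetical vector of salaries.
salaries <- c(52000, 61000, 47500, 80000)

raised     <- salaries * 1.05            # a 5% raise, applied to every element
deviations <- salaries - mean(salaries)  # each salary minus the overall mean
high       <- salaries > 55000           # a logical TRUE/FALSE per element
```

Each line replaces a loop you would otherwise have to write by hand.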
If element-wise processing is not enough to convince you, then take a look at another wonderful feature of the language: subsetting. For this example, I'll use the baseball package that you can install as part of your R instance. It's called Lahman, so named for the creator and maintainer of the package, Sean Lahman.
The package contains several tables. For our purposes, we'll only deal with a few. I could download the data from seanlahman.com, but when you install the package and load it (using the library or require commands), the tables are available in your workspace without reading any files.
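Loading the package looks like this; the install line only needs to run once per machine:

```r
# install.packages("Lahman")  # one-time install from CRAN
library(Lahman)               # load the package for this session

# The tables are now in your workspace as data frames -- no file reading.
head(Batting)                 # peek at the first rows of the batting table
```

One caveat: newer releases of the Lahman package rename the Master table (used below) to People, so substitute accordingly if Master is not found.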
Suppose you wanted to know Babe Ruth's batting average for the last year he played for Boston before moving to the Yankees (the year was 1919). This information is contained in the Batting data frame (I'll call it a table going forward). You will need Babe Ruth's key for this table, which you can find by looking him up in the Master table. You can easily accomplish this using the subsetting feature as follows:
babe_ruth <- Master[Master$nameLast == "Ruth" & Master$nameFirst == "Babe", ]
Again, that's all there is to it. In other languages, you would need to loop through the Master table, searching for the two elements that contained the last name and the first name.
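If the bracket syntax feels dense at first, base R's subset() function expresses the same row selection a bit more readably. This is a sketch assuming the Lahman package is installed and its Master table is available:

```r
library(Lahman)

# Same lookup as the bracket version, written with subset():
# keep the rows where both name conditions hold.
babe_ruth <- subset(Master, nameLast == "Ruth" & nameFirst == "Babe")
babe_ruth$playerID   # the key we need for the Batting table
```

Either form works; the bracket notation generalizes to columns as well, which is why the article sticks with it.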
The object babe_ruth contains the playerID, which is what we need to find his batting average. However, we have a minor problem: the Batting table does not contain the batting average. It does contain the components that make up the calculation, namely hits (H) and at-bats (AB). You can define a new column in Batting that holds this calculation (and apply it using element-wise processing - WooHoo!). Let's call the calculated field BA, for batting average.
Batting$BA <- Batting$H / Batting$AB
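One caveat worth knowing: any player row with zero at-bats produces NaN from the 0/0 division. A still fully element-wise way to guard against that is ifelse(); the small vectors here are hypothetical stand-ins for the H and AB columns:

```r
# Hypothetical hit/at-bat values, including a player with no at-bats.
H  <- c(139, 0)
AB <- c(432, 0)

# ifelse() is element-wise too: NA instead of NaN wherever AB is zero.
BA <- ifelse(AB > 0, H / AB, NA)
```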
We now have the batting average for all players in one fell swoop. Let's find what that is for the Babe:
Batting[Batting$playerID == babe_ruth$playerID & Batting$yearID == 1919, "BA"]
This will give you the answer of .3217593, or a 322 batting average.
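Putting the pieces together, the entire analysis reads as follows (assuming the Lahman package is installed; in newer package releases the Master table is named People):

```r
library(Lahman)  # makes Master, Batting, and the other tables available

# Look up Babe Ruth's playerID, compute BA for every player at once,
# then pull out his 1919 season.
babe_ruth  <- Master[Master$nameLast == "Ruth" & Master$nameFirst == "Babe", ]
Batting$BA <- Batting$H / Batting$AB
Batting[Batting$playerID == babe_ruth$playerID & Batting$yearID == 1919, "BA"]
```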
As you can see, R accomplished all of this in just three lines of code. If you want to count the library command that makes the Lahman database available, then it's four lines of code. I'm okay with that!
I don't believe R is going to replace other languages. That is not what it was designed to do. It was designed by statisticians for statistics. It gives answers to problems quickly. Think of it as the SWAT team of computer programming languages: it does what it needs to do and then gets out!
This article doesn't even touch upon all that R can do. It gives you a basic idea of two powerful concepts. R is well-supported and growing in popularity. It does take a bit of a paradigm shift when learning the language. A few months back (at the time of this writing), I created a website called DataScienceReview.com. It's in its infancy, so there isn't much to it right now. But, you can bet it will contain plenty of material on how to work with R, including tutorials, samples, fun projects, etc. It will also have updates on the data science industry.