Data science is a blend of advanced statistics, mathematical expertise, problem-solving, data inference, algorithm development, business acumen, and real-world programming ability, and it is all about capturing data in ingenious ways. Machine learning (ML) is a key area of data science: a set of algorithms that train on a data set in order to make predictions or to classify the data.
Plenty of tools are built around a single purpose, such as data visualization or business intelligence, and offer full-service solutions. These tools work well with well-organized data but run into trouble otherwise. So, as a developer, if you want unique results from your experiments, writing your own code is the best way to go about it.
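To make "train on a data set to make predictions" concrete, here is a minimal sketch in plain Python. The data is made up, and the helper names fit_line and predict are hypothetical; it fits a straight line by ordinary least squares, which is only one of many possible learning algorithms.

```python
# Toy training data: hours studied -> exam score (made-up numbers).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """'Train': fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """'Predict': apply the fitted line to an unseen input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(hours, scores)
print(predict(model, 6.0))  # prediction for an input not in the training data
```

Real projects would normally reach for a library rather than hand-rolled arithmetic, but the two steps, fitting on known data and then predicting on new data, stay the same.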
Python vs R
When choosing a programming language for data science, two of the most popular options, R and Python, come to mind, and choosing between them is a recurring dilemma for data scientists.
Python was started in 1989 and first released in 1991 as an open source scripting language, and it has grown vigorously since. Object-oriented programming is built in. Today it has libraries (numpy, scipy and matplotlib) and functions for almost any statistical operation or model-building task. Since the introduction of pandas it has become very strong at operations on structured data, because data frames and time series are now easy to work with. Anaconda from Continuum Analytics has made package management very easy, and the IPython / Jupyter notebook environment is also very good.
R was started in 1992, shortly after Python's first release. This open source counterpart to SAS has traditionally been used in academia and research and is a very cost-effective option. Rcpp makes it easy to extend R with C++, and RStudio is a mature and excellent IDE. Because R is open source, updates are released quickly.
In a poll conducted by KDnuggets among data scientists, Python held the majority of support, with R at 44% and SAS at only 3%.
The difference between Python and R is largely philosophical. One is a full-service language developed by a Unix scripter; the other is a tool for data analysis designed and built by stat heads, big data junkies, and social scientists. Let's discuss the criteria for choosing the right language for data science and machine learning.
1. Ease of Learning
Python is renowned in the programming world for its simplicity, which makes it a first choice for data analysts. It has no widespread GUI front end for analysis, so notebooks are the mainstream way of working, and they come with good features for documentation and sharing.
R, on the other hand, is more challenging to learn and apply. It is a high-level language, but its syntax and data structures take time to understand, and newcomers often find that even simple procedures require long code.
2. Python is Evolving / R is Staying Pure
Python is continuously evolving and getting better. At the time of writing, the latest release is Python 3.6.3. The 3.6 series brought many new features and optimizations, such as preserving keyword argument order, simpler customization of class creation, and local time disambiguation.
It would be wrong to say that R isn't changing, but it has stayed close to its roots: even recent S-style code, written to keep large code bases cleaner, can easily be run in an R interpreter.
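The three Python 3.6 features named above can be seen in a few lines. This is a minimal sketch that needs Python 3.6 or later; the names record, Plugin, CsvLoader, and JsonLoader are hypothetical and exist only for illustration.

```python
from datetime import datetime


def record(**kwargs):
    # Preserving keyword argument order: kwargs keeps the order in which
    # the caller passed the arguments.
    return list(kwargs)


class Plugin:
    registry = []

    def __init_subclass__(cls, **kwargs):
        # Simpler customization of class creation: this hook runs whenever
        # a subclass is defined, with no metaclass required.
        super().__init_subclass__(**kwargs)
        Plugin.registry.append(cls.__name__)


class CsvLoader(Plugin):
    pass


class JsonLoader(Plugin):
    pass


# Local time disambiguation: the fold attribute distinguishes the two
# occurrences of a wall-clock time that repeats when clocks are set back.
first = datetime(2017, 11, 5, 1, 30)   # fold defaults to 0
second = first.replace(fold=1)         # the repeated occurrence

print(record(b=1, a=2))
print(Plugin.registry)
print(first.fold, second.fold)
```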
3. Data Handling / Graphical Capabilities
Both languages have good data handling capabilities and options for parallel computation, but R has more advanced graphical capabilities than Python. That said, Seaborn has made custom plots in Python much easier.
4. Collecting Data
Both languages have had weaknesses when it comes to data processing, but developments over the past few years have significantly alleviated these problems. Packages such as Feather and readr have reduced the resource footprint. We cover a few of these below.
- Feather (Fast reading and writing of data to disk)
Feather is a fast, lightweight, easy-to-use binary file format for data frames. It makes pushing data frames in and out of memory as simple as possible. Compared to CSV, Feather has high read and write performance: about 600 MB/s versus 70 MB/s for CSVs. It also helps in passing data from one language to another, since the R and Python versions have almost identical features.
- Ibis (Pythonic way of accessing datasets)
Ibis bridges the gap between local environments and remote storage such as Hadoop or SQL databases. It also integrates with the rest of the Python ecosystem.
- readr (Re-implements read.csv into something better)
R's built-in read.csv performs poorly: it is slow, and it has traditionally converted strings into factors by default. readr is a faster replacement that avoids this.
- haven (Interacts with SAS, Stata, SPSS data)
Reads SAS, Stata, and SPSS files and brings them into a data frame.
- jsonlite (Handles JSON data)
Intelligently turns JSON into matrices or data frames.
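jsonlite is an R package, but the idea of turning JSON into a table is easy to sketch in Python with only the standard library. This is a rough analogue, not jsonlite's actual algorithm; the sample records and the helper name flatten are hypothetical.

```python
import json

# Made-up JSON: a list of records, one of them with a nested object.
raw = """
[
  {"name": "Ada", "lang": "Python", "stats": {"years": 5}},
  {"name": "Bo", "lang": "R"}
]
"""


def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names, e.g. 'stats.years'."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(raw)]

# Collect column names in first-seen order, then build one row per record,
# filling fields that a record lacks with None.
columns = []
for r in records:
    for key in r:
        if key not in columns:
            columns.append(key)
rows = [[r.get(c) for c in columns] for r in records]

print(columns)
for row in rows:
    print(row)
```

The columns and rows lists together form a simple table that could then be handed to a data frame library.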
In the past few years, Python has definitely overtaken R when it comes to programming and applications for analytics, data science, and machine learning. Most of the common tasks that used to be easy in only one of the two languages are now doable in both. They are similar enough that if you know one of them, picking up the other won't be hard.
Once you have mastered both languages, you are well on your way to mastering data science. Make the best of both worlds, as many data scientists already do. Use Python for the first stage of data aggregation and then feed the data into R, which applies the well-tested, optimized statistical analysis routines built into the language. This way you use R as a library for Python, or Python as a preprocessing library for R. Build a layer cake, with Python as the cake and R as a layer, or vice versa. Is Python the frosting and R the cake? Or is it the other way around? You decide.
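The Python half of such a pipeline can be sketched as follows. This is a minimal example under stated assumptions: the event data, the region and total column names, and the choice of CSV as the hand-off format are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up raw events: (region, sale amount).
raw_events = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 30.0),
    ("south", 20.0),
]

# Stage 1 (Python): aggregate the raw events into one total per region.
totals = defaultdict(float)
for region, amount in raw_events:
    totals[region] += amount

# Stage 2: write the aggregate as CSV, a format R reads natively.
# An in-memory buffer stands in for a file on disk here.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["region", "total"])
for region in sorted(totals):
    writer.writerow([region, totals[region]])

csv_text = buffer.getvalue()
print(csv_text)
```

On the R side, the resulting file could be loaded with read.csv or readr's read_csv and passed to R's statistical routines. For larger data, Feather would be a faster hand-off format than CSV.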