Author: Lalit Kumar

Posted On Sep 03, 2015   |   3 min

Every minute, 72 hours of video are uploaded to YouTube, Google receives 4 million search queries, Twitter gets 300,000 tweets, and 50,000 apps are downloaded from the App Store. These figures give an indication of how quickly the volume of digital data is growing. Organizations now want to use the knowledge hidden in this data to make quick decisions backed by actionable insights.

This is where the discipline of data science comes into the picture. It takes a multidisciplinary approach that combines analytics, data mining, machine learning and programming. Data scientists try to understand and extract knowledge from data (structured and unstructured) in the form of predictions, patterns, trends, etc. When it comes to tools and programming languages, R and Python are among the most popular choices for data scientists.
In this blog, we will look at the aspects of Python that make it a good fit for data science. First, let’s cover the general features, and then talk about the libraries available in Python for supporting data science.

  • Easy to Learn: Python is an easy-to-learn programming language with a simple, clean syntax that can be picked up quickly by someone with a little programming background. Python also comes with excellent built-in data types (such as lists, dictionaries and sets), which let programmers accomplish more in less time and with less code. As a result, data scientists can spend less time on the programming language and focus on the data science itself.
  • Quick Prototyping: Data scientists often face situations where they need a quick script or program for a one-off task, such as extracting data, formatting data or building a proof of concept (POC). Python's interactive interpreter, coupled with the power of its libraries and data structures, comes in very handy in such situations. One can quickly implement new methods and create prototypes to validate concepts and ideas.
  • Pluggable: Python is already a popular choice for developing web applications. For any analytics that needs to be done on top of these applications, Python presents a strong case: using it keeps the analytics and the application closely integrated on a common technology stack.
  • Freely Available: Python is free to download and is supported by a large community.
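
To make the "quick prototyping" point concrete, here is a minimal sketch of a one-off script that uses only Python's built-in data types and standard library. The records in raw_lines are made up for illustration; in practice they might come from a log file or a CSV export.

```python
from collections import Counter

# Hypothetical raw records in the form "date, city, count".
raw_lines = [
    "2015-09-01, london, 12",
    "2015-09-01, paris, 7",
    "2015-09-02, london, 5",
    "2015-09-02, , 3",  # record with a missing city
]

totals = Counter()
for line in raw_lines:
    # Split each record into its three fields and trim stray whitespace.
    date, city, count = [field.strip() for field in line.split(",")]
    if not city:
        # Skip records that have no city.
        continue
    totals[city] += int(count)

# Pick the city with the highest total.
top_city, top_total = totals.most_common(1)[0]
print(dict(totals))
print(top_city, top_total)
```

A handful of lines is enough to parse, filter and aggregate the records, which is the kind of throwaway task the bullet above describes.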

Once the business objectives are defined and the data is made available, the first thing a data scientist typically does is clean the data. This includes dealing with missing values and removing useless or unwanted data. The next step is data analysis and modeling, followed by presenting the results in visual form. To cater to these needs, Python has a wide range of libraries. Some of them are discussed below:

  • NumPy: A Python package for handling large multi-dimensional arrays and matrices. It also provides mathematical functions such as trigonometric, hyperbolic, exponential and logarithmic functions, and much more.
  • Pandas: A software library written in Python for data analysis. Series and DataFrame are its two most important data structures: Series handles one-dimensional data and DataFrame handles two-dimensional data. Indexing data, reading and writing data from various sources, handling missing data, slicing and pivoting are some of the tasks that can be handled easily with Pandas.
  • Scikit-learn: A software library written in Python that focuses on machine learning. For activities like training and modeling of data, Scikit-learn offers phenomenal support. For extracting, cleaning and manipulating data, it is used together with existing libraries such as NumPy and Pandas.
  • Matplotlib: Visualizing data through informative statistical graphs is an important part of working with data, and this is where Matplotlib helps engineers and scientists. Matplotlib, a software library written in Python, supports different types of 2D graphs such as line graphs, histograms, bar charts, pie charts and scatter plots. It also provides very granular control over graph properties like lines, fonts, axes and colors. For 3D graphs, it comes with a toolkit called mplot3d.
  • Seaborn: Another data visualization library, built on top of Matplotlib.
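
The clean-then-model workflow described above can be sketched with three of these libraries. This is only an illustration under stated assumptions: the tiny "hours studied vs. exam score" dataset is invented, filling gaps with the median is just one of several reasonable strategies, and the plotting step is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: one missing "hours" value and one completely empty row.
df = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
    "score": [52.0, 54.0, 56.0, 58.0, 60.0, np.nan],
})

# Cleaning with Pandas: drop rows that carry no data at all,
# then fill the remaining gap in "hours" with the column median.
clean = df.dropna(how="all")
clean = clean.assign(hours=clean["hours"].fillna(clean["hours"].median()))

# Hand the cleaned columns to Scikit-learn as NumPy arrays.
X = clean[["hours"]].to_numpy()  # 2-D feature matrix, shape (n_samples, 1)
y = clean["score"].to_numpy()    # 1-D target vector

# Modeling: fit a simple linear regression and predict an unseen value.
model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
prediction = model.predict(np.array([[6.0]]))[0]

print(clean)
print(slope, intercept, prediction)
```

The toy data were constructed to follow score = 50 + 2 × hours, so the fitted slope and intercept should come out close to 2 and 50. Real data would of course be noisier and would need a proper train/test split before trusting any prediction.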

The points above present a strong case for Python, but before you make a decision, research thoroughly what you need. If the focus is on building a product, or a module in a product, around data analytics, then Python makes the cut; if the focus is on a research problem, you may have to think twice. Considering the great strides Python has made in recent years, it won’t be surprising to see Python making a strong case in every domain.

References:
• https://www.youtube.com/yt/press/statistics.html
• http://www.internetlivestats.com/google-search-statistics/
• http://pandas.pydata.org/pandas-docs/stable/index.html
• http://www.numpy.org/
• http://matplotlib.org/
• https://pypi.python.org/pypi/seaborn/
• http://scikit-learn.org/