Author: Asheesh Choksi

Posted On Jul 08, 2013   |   7 Mins Read

Decision making is one of the most crucial steps in creating an efficient, high-quality system. Healthcare systems, for example, are constantly making decisions, such as:

  • Medical insurance companies estimating the cost of insurance.
  • Government bodies wanting to know the distribution of expenses so that, based on utilization and effectiveness, they can allocate resources across demographics.
  • Pharmaceutical companies deciding which formulations to manufacture.
  • Doctors deciding on a treatment for a patient.
  • Patients choosing a health care provider (clinic, hospital, or doctor).

In the absence of technology, healthcare providers make decisions based on experience or intuition and hope for the best, which leaves considerable room for error. According to one report, an average of 195,000 people in the US die annually due to potentially preventable, in-hospital medical errors[1]. Another study estimated the cost of waste in health care spending at $1.2 trillion[2]. The numbers may vary, but the biggest area of excess was identified as defensive medicine (i.e. tests and procedures), followed by inefficient healthcare administration. Another major share of health care waste goes to conditions like obesity, which can be considered preventable through changes in lifestyle.

Role of Technology

Technology can play a vital role in improving the quality, cost, and timeliness of healthcare by delivering the right information to the right people at the right time. Having recognized that data science can help improve the dynamics of health care[3], I can think of the following goals:

  • Reduce cost and improve efficiency
  • Provide an improved understanding of the effectiveness of a clinical treatment
  • Reduce medical errors and improve patient safety
  • Identify areas of expertise and the people associated with them
  • Provide insight into the origin of a disease
  • Detect problems early
  • Detect fraud and malpractice

It is evident that a data analytics solution should be based not only on patients’ health records but also on their social and demographic data. This, however, is easier said than done. Most often, the electronic records created by clinics and hospitals are too localized and do not conform to any given standard. Such records are termed EMR (Electronic Medical Records) and are usually not in a shareable state because they are not structured. The issue becomes more complex with legal constraints on information sharing, such as HIPAA (Health Insurance Portability and Accountability Act).

As the sharing of medical records became more of a necessity, the concept of EMR evolved into EHR (Electronic Health Records). The EHR is a superset of health records. It may include a whole range of data in comprehensive or summary form, including demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal stats like age and weight, and billing information[4].

Though EHRs are sharable across facilities, their formats still vary, and hence there was a need to standardize the format of medical records. HL7 is one such standardization authority[5]. HL7 provides a framework for the exchange, integration, sharing, and retrieval of electronic health information. It is, however, not an open standard, i.e. information through this framework is accessible to its members only.

Besides, some health care providers maintain their records in legacy databases like MUMPS. These are often desktop-type systems, and the retrieval and porting of records can be extremely slow.

Need for Technology – Big Data Framework

It is easy to see that the health care problem space requires massive data sets to be processed to extract meaningful information in a specific context. The number of records for millions of people could run into tens of billions. Therefore, the computing infrastructure should provide a cost-effective implementation of:

  • Storage of massive unstructured data sets
  • Unconstrained parallelized data processing
  • Some degree of fault tolerance with high availability of the system

Big Data frameworks like Hadoop are designed to meet the above challenges. Hadoop’s distributed file system (HDFS) and MapReduce engine are capable of processing hundreds of terabytes of data. Further, Hadoop’s ability to run on cheap commodity hardware makes it a cost-effective solution.

Business Use Cases of Data Analytics with Hadoop

We have seen above that many health providers maintain medical records in localized silos. Besides, large data sets are now available on the internet. Thus there may be many use cases for big data analytics to create cost-effective systems with high accuracy and efficiency. However, we may want to focus on the following:

Healthcare Intelligence

The business of healthcare insurance works by aggregating risk (and the associated costs) and dividing it over the members of the risk group. The data and the results are constantly being updated. For example, we may be interested in finding, for a certain demographic, the age below which people are not prone to a certain disease. Such information is of great help in computing the cost of insurance. So again we need to parse large data sets to extract information about diseases, symptoms, medicines, medical advice and opinions, and social/geographical demographics. Hadoop is a framework of choice for this use case.
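The idea of aggregating risk and dividing it over a group can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Member` fields, the `pooled_premium` function, the optional `loading` margin, and all the numbers are hypothetical and not taken from any real insurer's model.

```python
from dataclasses import dataclass


@dataclass
class Member:
    age: int
    claim_probability: float  # assumed yearly probability of making a claim
    claim_cost: float         # assumed average cost of one claim


def pooled_premium(members, loading=0.0):
    """Spread the group's total expected cost evenly over its members.

    `loading` is a hypothetical margin for administration and uncertainty.
    """
    if not members:
        raise ValueError("risk group must not be empty")
    # Expected cost of the whole group: sum of probability * cost per member.
    expected_total = sum(m.claim_probability * m.claim_cost for m in members)
    return expected_total * (1.0 + loading) / len(members)


# Illustrative, made-up risk group.
group = [
    Member(age=25, claim_probability=0.05, claim_cost=2000.0),
    Member(age=45, claim_probability=0.10, claim_cost=4000.0),
    Member(age=65, claim_probability=0.25, claim_cost=8000.0),
]
premium = pooled_premium(group)
```

In practice the probabilities and costs would themselves be estimated from the large data sets discussed here, which is where a framework like Hadoop would come in.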

Another important use case for analyzing and detecting patterns in large data sets is fraud detection. With Hadoop’s ability to store massive data sets, it is possible to monitor constantly by storing and analyzing data for conformance to good practices, and to detect abnormal patterns in medical treatment processes that might otherwise go unnoticed. Thus, MapReduce programming on a Hadoop cluster can help us derive intelligence for creating apriori models for predictive analysis in healthcare insurance.
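As a rough illustration of what detecting an abnormal pattern might look like, the sketch below flags claims billed far above the typical amount for the same procedure. It is a deliberately simple stand-in, not a real fraud model: the record layout, the `flag_unusual_claims` function, and the `factor` threshold are all assumptions made for this example, and it runs on a single machine rather than a Hadoop cluster.

```python
from collections import defaultdict
from statistics import median


def flag_unusual_claims(claims, factor=3.0):
    """Return claims billed at more than `factor` times the median
    amount for the same procedure (a hypothetical rule of thumb)."""
    # Group billed amounts by procedure code.
    amounts_by_procedure = defaultdict(list)
    for claim in claims:
        amounts_by_procedure[claim["procedure"]].append(claim["amount"])
    # The median is used because it is not skewed by the outliers we look for.
    medians = {p: median(a) for p, a in amounts_by_procedure.items()}
    return [c for c in claims if c["amount"] > factor * medians[c["procedure"]]]


# Illustrative, made-up claim records.
claims = [
    {"provider": "A", "procedure": "xray", "amount": 100.0},
    {"provider": "B", "procedure": "xray", "amount": 110.0},
    {"provider": "C", "procedure": "xray", "amount": 95.0},
    {"provider": "D", "procedure": "xray", "amount": 900.0},
    {"provider": "A", "procedure": "mri", "amount": 1200.0},
    {"provider": "B", "procedure": "mri", "amount": 1100.0},
]
flagged = flag_unusual_claims(claims)
```

A flagged claim is only a candidate for review; deciding whether it is actually fraudulent would need far more context than a single billed amount.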

Recommendation Engine

The effectiveness of certain medicines in particular demographics with given symptomatic conditions needs to be indexed. Such an application can be of immense value to the ongoing focus on Evidence-Based Medicine (EBM). EBM is defined here as the retrieval of evidence by parsing a wide range of documents or records and applying strict criteria for the validity of the search. This requires indexing every document and record, and a lot of machine learning to categorize them appropriately. Such a system could be expected to support searching medical records, treatments, and the effectiveness of treatments with high accuracy.

A Hadoop cluster is well suited to indexing and to creating inverted index structures. Machine learning algorithms can be developed to systematically rank the indexed documents so as to return the evidence accurately in response to search queries.
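To make the inverted index idea concrete, here is a minimal single-machine sketch in Python that mimics the map and reduce steps. It does not use Hadoop itself; the function names, the simple tokenization, and the sample records are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit a (term, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        term = word.strip(".,;:()")  # naive punctuation stripping
        if term:
            yield term, doc_id


def reduce_postings(pairs):
    """Shuffle and reduce steps: group document ids by term."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    # Sorted posting lists make lookups and merges predictable.
    return {term: sorted(ids) for term, ids in postings.items()}


def build_inverted_index(documents):
    """Run the map step over all documents, then reduce the emitted pairs."""
    pairs = (pair
             for doc_id, text in documents.items()
             for pair in map_document(doc_id, text))
    return reduce_postings(pairs)


# Illustrative, made-up records.
documents = {
    "rec1": "Patient reports fever and cough.",
    "rec2": "Fever resolved after treatment.",
    "rec3": "Persistent cough, no fever.",
}
index = build_inverted_index(documents)
```

Looking up `index["cough"]` then returns the records that mention the term. On a cluster, the map step would run in parallel over blocks of documents and the framework would handle the grouping by term before the reduce step.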

Recruitment Analytics

Health providers want to identify the right people for the right job to handle these challenges. The process of identifying professionals in the healthcare domain is very slow. This is because the industry wants to identify people based not only on their knowledge and work interactions but also on their social experiences, which make them an effective and smart workforce.

These days many people express themselves on social networking sites and other portals, for example via blogging. It is now possible to create a robust web analytics engine that crawls several web sites and constantly collects people’s views and sentiments. Machine learning algorithms can analyze this unstructured data to categorize and rate the professionals concerned. Such intelligence can enable healthcare providers to recruit a smart workforce, leading to improved customer satisfaction and stronger relationships.


Decision making in healthcare can be improved substantially by using Big Data technologies. Using the Hadoop ecosystem, health care providers can now process massive data sets to see the evidence and arrive at the correct decision faster than before. In this post we examined a few use cases where health care providers can benefit from big data technology.

In the next blog in this series, we will explore several tools of the Hadoop ecosystem for analyzing a wide variety of data to bring out what is called the apriori model. These models could be used to predict trends and events, so that the system can take preventive measures in advance. Such a system can bring major improvements in operational efficiency and cost reduction.