A chemist studies chemical properties of objects. A biologist studies living beings. A data scientist studies data. Data is real. It has real properties. Study of data leads to information and knowledge. Answering questions with data leads to revelation, understanding, and wisdom.
Data Science is the science of extracting knowledge from data and presenting the findings in an audience-friendly format. Governments and corporations are sitting on massive quantities of data. Hidden in this data is information that can revolutionize the way governments and corporations serve their clients and earn revenue. Hidden there is information that is of great scientific interest. For businesses, hidden in data is knowledge that can give a business the competitive advantage it needs.
Data science involves the application of mathematics (Linear Algebra, Calculus, Graph Theory, Statistics), expert technical skills, and the ability to ask insightful questions. A data scientist does not need to be an expert in everything, but the more they know the better. The most important characteristics of a data scientist are curiosity and the ability to ask insightful questions. Without this ability, technical and math skills cannot be used resourcefully.
Why now
There is suddenly a high demand for data scientists. Why is there a new field called Data Science, and why is it in such high demand? The following are some of the more important reasons:
- Storage is very cheap and it is getting cheaper
- Software for data science is improving and getting cheaper
- Almost all medium to large organizations have generated large collections of data, and massive amounts of data are available online
- There are many algorithms for data science and the number is growing
- The market is finally beginning to appreciate the value of data science to improve service and profitability
In terms of technology, the cloud has been the game changer. All of a sudden, everyone can access virtually unlimited storage space and processing power at very little cost. In addition, most providers, such as Amazon, offer numerous value-added tools to facilitate data science work. The cloud allows data scientists to easily bypass their local technical limitations.
Profile of a successful data scientist
A successful data scientist is curious and passionate about their data. Data scientists make assumptions about data, learn from the data, and then modify their hypotheses over several iterations until they reach a conclusion strongly supported by their data. They are comfortable with analytics platforms, programming, and mathematics. Once data scientists complete their study, they present their findings in easy-to-understand narratives.
R and Python are the preferred skills for predictive analysis. Python is better suited for working with unstructured data. Expert knowledge of SQL is required for most work. Hadoop, Cassandra, and Spark are often used for big data.
Career prospects
What are the career prospects for someone going into data science? There is a high demand for data scientists as more and more organizations realize the benefits of data science. Therefore, there is an urgent need for data scientists. Currently, the average salary of a data scientist is higher than that of programmers or mathematicians.
Skills used in Data Science
Not all data scientists are alike. Some do general work while others focus on domains such as health, technology, or business. They can work in predictive analysis, big data, spatial data, or any combination of these. Depending on their focus, they need different technical skills.
Technologies
The following are some technology trends that I have observed:
- global skills: SQL
- unstructured data: Python is better than R
- structured data: R or Python
- predictive analysis: R or Python
- big data: Hadoop, Stats, SPSS, Pig, Spark
- spatial data: Maptitude, MapInfo
- machine learning: Weka
- data visualization: Tableau, JavaScript D3, Qlik
Different mathematical tools are required depending on what the data and the analysis call for. Regression is used frequently.
When you have good assumptions, many tools are available to test those assumptions. When you don’t have good assumptions and just need to explore the data, use machine learning.
Python, R, Java, Scala, and Clojure are the dominant technologies used in Data Science. R has been the dominant language, but it is rapidly being overtaken by Python. If you only have time to learn one language, learn Python.
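Since regression comes up so often, here is a minimal sketch of a simple linear regression (ordinary least squares with one predictor) in plain Python. The function name fit_line and the advertising/sales numbers are made up for illustration; in practice you would more likely reach for a statistics library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend vs. sales, built to follow y = 2x + 1.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

Because the sample data lie exactly on a line, the fit recovers that line. Real data will not be this tidy, which is where validating and refining the model comes in.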
Mathematics skills
A solid understanding of mathematics is required in data science. You should be comfortable with the following:
- Linear Algebra (Matrix Factorization)
- Calculus
- Graph Theory
- Statistics
  - Distributions (Binomial, Poisson)
  - Summary statistics
  - Hypothesis testing
  - Bayesian Analysis
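To make the two distributions named above concrete, here is a small sketch of their probability mass functions using only the Python standard library. The function names binomial_pmf and poisson_pmf are my own, and the example numbers are arbitrary.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(exactly k events in an interval when the average rate is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Hypothetical questions:
# - chance of exactly 3 heads in 10 fair coin flips
# - chance of exactly 2 support calls in an hour when the average is 3 per hour
print(binomial_pmf(3, 10, 0.5))
print(poisson_pmf(2, 3.0))
```

A quick sanity check for either function is that the probabilities over all possible values of k add up to 1.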
Artificial Intelligence
Many programmers assume that if you are doing data science you are doing SQL and machine learning. Machine learning is only one aspect of data science. Many questions do not require machine learning solutions. As a data scientist, you should be able to understand what the following machine learning algorithms are and how to apply them to your data.
- Supervised Learning
  - SVM
  - Random Forest
- Unsupervised Learning
  - k-means
  - LDA
- NLP / Information Retrieval
People often think of machine learning as a magical solution: you feed some data to a piece of software and it magically finds everything there is to know from that data. The reality is very different. Machine learning algorithms help under very specific circumstances and accomplish very specific tasks. An NLP algorithm will not help you describe the relationship between sales and profits. You need to have a clear understanding of the data and the answers you are looking for before you run the algorithm. Any algorithm will return some kind of result regardless of whether that result provides any useful information. Suppose I run a predictive model on the works of Shakespeare and Charles Dickens, and the results come back showing that there is a 100% chance that both writers will use a vowel in every word of every sentence. That is not very useful information, since we already know this and it does not answer any questions we might have about those writings.
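As an illustration of one algorithm from the list above, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function name kmeans, the fixed starting centroids, and the sample points are assumptions made for this example; real projects would normally use a library implementation with better initialization.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Cluster points around the given starting centroids using Lloyd's algorithm."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Note that k-means always returns as many clusters as it was given centroids, whether or not the data contains that many meaningful groups. It is up to you to decide whether the clusters answer a question you actually care about.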
Data science workflow
There are many methodologies used by data scientists. Regardless of the steps and terms used, most of them conform to the following:
- Plan your project
  - define goals
  - organize and coordinate resources
  - start your project
- Prepare data for analysis (iterative process)
  - acquire data
  - clean data
  - explore and refine data
- Model your problem (iterative process)
  - create model
  - validate model
  - evaluate model
  - refine model
- Wrap up
  - present your findings
  - revisit your model
  - archive and document
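The "clean data" and "explore and refine data" steps can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the record layout (dictionaries with an "amount" field) and the function names clean and explore are hypothetical.

```python
import statistics


def clean(records):
    """Keep only records whose 'amount' field can be read as a number."""
    cleaned = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # missing, empty, or non-numeric value: drop the record
        cleaned.append({**rec, "amount": amount})
    return cleaned


def explore(records):
    """Return a few summary statistics for the cleaned 'amount' values."""
    amounts = [r["amount"] for r in records]
    return {
        "count": len(amounts),
        "mean": statistics.mean(amounts),
        "median": statistics.median(amounts),
    }


# Hypothetical raw data, as it might arrive from a CSV export.
raw = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},
    {"id": 3, "amount": "abc"},
    {"id": 4, "amount": "20"},
    {"id": 5},
    {"id": 6, "amount": "14.5"},
]

summary = explore(clean(raw))
print(summary)
```

In a real project you would run this loop several times: each look at the summary tends to reveal another problem in the data that needs cleaning.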
Planning a Data Science Project
When planning a data science project, you need to spend time seeking clarification about the project. The important questions are:
- What are the goals of the project?
- What resources are required for this project?
  - data
  - software
  - personnel
- Are there any important deadlines?
- Who are the stakeholders?
Once you have sufficiently gained clarity on your project:
- define project metrics
- define analytical approach
- organize and coordinate your resources
- start the project.
Your project’s metrics
Data science projects are goal-oriented. Project metrics measure success of the project against the specified goals. Project metrics must be defined before starting the data science project.
Metrics are measured using KPIs (key performance indicators). KPIs should be simple and easy to understand, and should highlight significant indicators. For example, financial KPIs could include cost, profit, and sales. KPIs need to be SMART (Specific, Measurable, Assignable, Realistic, Time-bound). For example, cost is a very specific number. It is measurable and assignable. It is not a guesstimate but a realistic number. It is also time-bound because the cost is associated with a time period.
If the work involves classification, classification accuracy measurements are important. These measures include sensitivity, specificity, positive predictive value, and negative predictive value.
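The four measures named above all come from counting true and false positives and negatives. Here is a minimal sketch for a binary classifier; the function name classification_metrics and the labels are invented for the example.

```python
def classification_metrics(actual, predicted):
    """Compute sensitivity, specificity, PPV, and NPV from binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives

    def ratio(num, den):
        # Undefined when the denominator is zero (e.g. no positives in the data).
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of actual positives that were found
        "specificity": ratio(tn, tn + fp),  # share of actual negatives that were found
        "ppv": ratio(tp, tp + fp),          # share of positive predictions that were right
        "npv": ratio(tn, tn + fn),          # share of negative predictions that were right
    }


# Hypothetical labels: 4 actual positives and 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which of these measures matters most depends on the project goals. A screening test, for instance, usually cares more about sensitivity than about positive predictive value.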
There is a saying in industry, “What gets measured gets fixed”. Metrics show what is going right and what is going wrong. They highlight the successes and failures and make the information available to decision makers and to the people who can fix the problems.
Analytical Approach – concept
Selection of the analytical approach depends on the question being asked. The data scientist needs to ask stakeholders for clarification. To solve a problem, the analytical approach needs to be selected in the context of the business requirements.
What is the nature of the problem? Is it a probability of occurrence problem, discovery of relationships problem, or summary of counts and frequencies problem? Some examples:
- What is the probability of someone moving to a larger residence after becoming a parent? Use a predictive model.
- What impact does stormy weather have on worker productivity? Use a descriptive approach to find a possible correlation.
- What patterns appear in a survey involving multiple-choice questions? Use statistical analysis such as a classification approach.
- What patterns exist in a genome sequence? Use machine learning to explore the data and identify patterns, relationships, and trends.
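For the stormy weather question above, a descriptive first step could be a Pearson correlation coefficient. The sketch below computes it in plain Python; the function name pearson_r and the weekly figures are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally sized samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sample has no variation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical weekly data: hours of stormy weather vs. units produced per worker.
storm_hours = [0, 2, 4, 6, 8]
units_per_worker = [50, 47, 45, 40, 38]

r = pearson_r(storm_hours, units_per_worker)
print(f"correlation: {r:.3f}")
```

A value near -1 or +1 suggests a strong linear relationship and a value near 0 suggests none. Either way, correlation alone does not show that the weather causes the change in productivity.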
Conclusion
The goal of this post was to help you understand what Data Science is and is not, and to give you a bird’s-eye view of the field. You should now be able to dive into different areas of this exciting new field while understanding where each skill set fits in the overall picture.