Skills used in Data Science

Not all data scientists are alike. Some do general work while others focus on domains such as health, technology, or business. They can work in predictive analysis, big data, spatial data, or any combination of these. Each kind of work calls for different technical skills.


The following are some technology trends that I have observed.

  • global skills: SQL
  • unstructured data: Python is better than R
  • structured data: R or Python
  • predictive analysis: R or Python
  • big data: Hadoop, Stats, SPSS, Pig, Spark
  • spatial data: Maptitude, MapInfo
  • machine learning: Weka
  • data visualization: Tableau, D3 (JavaScript), Qlik

Different data and different kinds of analysis call for different mathematical tools. Regression is used frequently.
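Since regression comes up so often, here is a minimal sketch of a simple linear regression fitted by ordinary least squares, using only the Python standard library. The function name fit_line and the sales and profit figures are hypothetical, made up purely for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: monthly sales and profit, both in thousands.
sales = [10.0, 20.0, 30.0, 40.0, 50.0]
profit = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(sales, profit)
print(f"profit is roughly {slope:.3f} * sales + {intercept:.2f}")
```

In practice you would probably reach for a statistics package rather than writing this by hand, but the arithmetic underneath is the same.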

When you have good assumptions, many tools are available to test those assumptions. When you don't have good assumptions and just need to explore the data, use machine learning.
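As one example of testing an assumption, the sketch below runs a permutation test on the assumption that two groups have the same mean. It is only one of many possible tests, and the group values, the function name permutation_test, and the number of permutations are all hypothetical choices for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose difference in means
    is at least as large as the observed one (an approximate p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements for two groups.
group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]

p_value = permutation_test(group_a, group_b)
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value suggests the "same mean" assumption does not hold for this data. Note that the test only makes sense because we started with a specific assumption to check.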

Python, R, Java, Scala, and Clojure are the dominant technologies used in Data Science. R has been the dominant language, but it is rapidly being overtaken by Python. If you only have time to learn one language, learn Python.

Mathematics skills

A solid understanding of mathematics is required in data science. You should be comfortable with the following:

  • Linear Algebra (Matrix Factorization)
  • Calculus
  • Graph Theory
  • Statistics
    • Distributions (Binomial, Poisson)
    • Summary statistics
    • Hypothesis testing
    • Bayesian Analysis
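
To make the statistics items a little more concrete, here is a small sketch of the Binomial and Poisson probability mass functions along with a few summary statistics, using only the Python standard library. The helper names binomial_pmf and poisson_pmf and the sample data are my own, chosen for illustration.

```python
import math
import statistics


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Chance of exactly 2 heads in 4 fair coin flips.
two_heads = binomial_pmf(2, 4, 0.5)

# Chance of seeing no events when 2 are expected on average.
no_events = poisson_pmf(0, 2.0)

# Summary statistics for a small hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "population_stdev": statistics.pstdev(data),
}
print(two_heads, no_events, summary)
```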

Artificial Intelligence

Many programmers assume that if you are doing data science you are doing SQL and machine learning. Machine learning is only one aspect of data science. Many questions do not require machine learning solutions. As a data scientist, you should be able to understand what the following machine learning algorithms are and how to apply them to your data.

  • Supervised Learning
    • SVM
    • Random Forest
  • Unsupervised Learning
    • k-means
    • LDA
    • NLP / Information Retrieval
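
To give a feel for what one of these algorithms actually does, here is a minimal sketch of k-means in plain Python. It is a teaching version under simple assumptions (2-D points, random initial centroids drawn from the data, a fixed iteration cap), not a substitute for a library implementation. The function names and the sample points are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, n_iterations=20, seed=0):
    """Return (centroids, labels) after running k-means on the points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(n_iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments will no longer change
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.1), (0.9, 0.8), (1.2, 1.0), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

Notice that you have to choose k yourself. The algorithm will happily split the data into however many groups you ask for, whether or not that number means anything.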

People often think of machine learning as a magical solution: you feed some data to a piece of software and it magically finds everything there is to know from that data. The reality is very different. Machine learning algorithms help only under very specific circumstances and accomplish very specific tasks. An NLP algorithm will not help you describe the relationship between sales and profits. You need to have a clear understanding of the data and the answers you are looking for before you run the algorithm. Any algorithm will return some kind of result regardless of whether that result provides any useful information. Suppose I run a predictive model on the works of Shakespeare and Charles Dickens, and the results come back showing that there is a 100% chance that both writers will use a vowel in every word of every sentence. That is not very useful information, since we already know this and it does not answer any questions we might have about those writings.
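
The Shakespeare and Dickens example can be sketched in a few lines. This is a simple count rather than a predictive model, and it runs on one short, well-known line from each author instead of their complete works. The function name share_of_words_with_vowel is my own.

```python
import re

VOWELS = set("aeiou")


def share_of_words_with_vowel(text):
    """Fraction of words in the text that contain at least one vowel."""
    words = re.findall(r"[a-z]+", text.lower())
    with_vowel = sum(1 for word in words if VOWELS & set(word))
    return with_vowel / len(words)


# One short quotation per author, standing in for the full works.
samples = {
    "Shakespeare": "To be, or not to be, that is the question",
    "Dickens": "It was the best of times, it was the worst of times",
}

results = {
    author: share_of_words_with_vowel(text) for author, text in samples.items()
}
for author, share in results.items():
    print(f"{author}: {share:.0%} of words contain a vowel")
```

The code runs and produces a confident-looking number for each author, yet it tells us nothing we did not already know. The same thing happens with far more sophisticated algorithms when they are run without a clear question in mind.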