Planning a data science project

When planning a data science project, you need to spend time seeking clarification about the project. The important questions are:

  • What are the goals of the project?
  • What resources are required for this project?
    • data
    • software
    • personnel
  • Are there any important deadlines?
  • Who are the stakeholders?

Once you have gained sufficient clarity on your project:

  • define project metrics
  • define analytical approach
  • organize and coordinate your resources
  • start the project.

Your project's metrics

Data science projects are goal-oriented. Project metrics measure the success of the project against its specified goals, so they must be defined before the project starts.

Metrics are measured using KPIs (Key Performance Indicators). KPIs should be simple and easy to understand, and should highlight significant indicators. For example, financial KPIs could include cost, profit, and sales. KPIs need to be SMART (Specific, Measurable, Assignable, Realistic, Time-bound). For example, cost is a very specific number. It is measurable and assignable. It is not a guesstimate but a realistic number. It is also time-bound because the cost is associated with a time period.
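As a minimal sketch of how the SMART attributes might be recorded for a single KPI, the Python below uses a small data class. The class name `Kpi`, its fields, and the example values (cost figure, owner, dates) are all hypothetical and chosen only for illustration; they are not part of any standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Kpi:
    """Hypothetical record of one KPI, with one field per SMART attribute."""
    name: str            # Specific: what exactly is being measured
    unit: str            # Measurable: the unit the value is reported in
    owner: str           # Assignable: who is responsible for it
    target: float        # Realistic: an agreed, achievable target
    period_start: date   # Time-bound: start of the reporting period
    period_end: date     # Time-bound: end of the reporting period
    lower_is_better: bool = False  # True for KPIs such as cost

    def is_met(self, actual: float) -> bool:
        """Return True if the actual value satisfies the target."""
        if self.lower_is_better:
            return actual <= self.target
        return actual >= self.target


# Illustrative values only (not real data).
cost_kpi = Kpi(
    name="Monthly project cost",
    unit="USD",
    owner="Finance lead",
    target=10_000.0,
    period_start=date(2024, 1, 1),
    period_end=date(2024, 1, 31),
    lower_is_better=True,
)

print(cost_kpi.is_met(9_500.0))
print(cost_kpi.is_met(12_000.0))
```

Keeping the owner and reporting period next to the target is one way to make sure each KPI stays assignable and time-bound rather than being a bare number.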

If the work involves classification, measures of classification accuracy are important. These include sensitivity, specificity, positive predictive value, and negative predictive value, each derived from the counts of true and false positives and negatives.
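The four classification measures can be computed directly from a binary confusion matrix. The sketch below assumes the counts of true positives, false positives, true negatives, and false negatives are already known; the function name `classification_metrics` and the example counts are illustrative assumptions.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary classification measures from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        # Share of actual positives that were predicted positive.
        "sensitivity": ratio(tp, tp + fn),
        # Share of actual negatives that were predicted negative.
        "specificity": ratio(tn, tn + fp),
        # Share of positive predictions that were correct.
        "positive_predictive_value": ratio(tp, tp + fp),
        # Share of negative predictions that were correct.
        "negative_predictive_value": ratio(tn, tn + fn),
        # Share of all predictions that were correct.
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (not real results).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all four measures alongside overall accuracy helps when the classes are imbalanced, since accuracy alone can hide poor performance on the rarer class.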

There is a saying in industry: "What gets measured gets fixed". Metrics show what is going right and what is going wrong. They highlight successes and failures and make that information available to decision makers and to the people who can fix the problems.

Analytical Approach - concept

Selection of the analytical approach depends on the question being asked. The data scientist needs to ask stakeholders for clarification, because the analytical approach must be selected in the context of the business requirements.

What is the nature of the problem? Is it a probability-of-occurrence problem, a discovery-of-relationships problem, or a summary-of-counts-and-frequencies problem? Some examples:

  • What is the probability of someone moving to a larger residence after becoming a parent? Use a predictive model.
  • What impact does stormy weather have on worker productivity? Use a descriptive approach to find a possible correlation.
  • What patterns appear in a survey involving multiple-choice questions? Use statistical analysis, such as a classification approach.
  • What patterns appear in a genome sequence? Use machine learning to explore the data and identify patterns, relationships, and trends.
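
To make the descriptive example above concrete, the sketch below computes a Pearson correlation coefficient between hours of stormy weather and a productivity measure. The function name `pearson_r`, the variable names, and the data values are invented for illustration only. A correlation found this way suggests a relationship but does not establish cause.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented illustrative values: storm hours per day and units of work completed.
storm_hours = [0, 1, 2, 3, 4]
units_completed = [50, 48, 45, 44, 40]

r = pearson_r(storm_hours, units_completed)
print(f"Pearson r = {r:.3f}")  # a value near -1 indicates a strong negative relationship
```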