
@innovatism
Created November 30, 2021 20:49

The Machine Learning (ML) Team

A machine learning project starts with a lot of questions: What is the goal of the project? What application are we building? What results do we expect? Which tasks do we need to perform? What approach can we use? To answer these questions, we need to build a robust team. Each member of the team plays an important role and cooperates closely with the others.

Role definitions

A fully mature machine learning team consists of the following core roles:

  • Data Analysts
  • Data Engineers
  • Data Scientists
  • Research Scientists
  • ML Engineers
  • Developers

There are also other supporting roles:

  • QA
  • TA
  • Annotators

Data Analysts

  • Get insights from user data with:
    • descriptive statistics: summarize what the data looks like, for example customer demographics, landing page conversion rates, loyalty and retention rates
    • inferential statistics: deduce the characteristics of users as a population, for example user trends and statistical hypothesis testing (e.g. A/B testing)
  • Cooperate with the product and business teams and create the product roadmap from these insights
  • Define the model evaluation procedure and acceptance criteria
  • Analyse feedback data from the deployed model
  • Tools: Excel, SQL, Tableau, Power BI …
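
As a rough illustration of the descriptive and inferential statistics listed above, the sketch below computes a conversion rate per landing page variant (descriptive) and then runs a two-sided two-proportion z-test on an A/B experiment (inferential). The visitor and conversion counts and the function names are invented for the example; a real analysis would more likely rely on a statistics package.

```python
import math


def conversion_rate(conversions: int, visitors: int) -> float:
    """Descriptive statistic: share of visitors who converted."""
    return conversions / visitors


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z statistic, p-value), using the pooled-proportion standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical A/B test data: (conversions, visitors) per variant.
variant_a = (120, 2400)
variant_b = (156, 2400)

rate_a = conversion_rate(*variant_a)
rate_b = conversion_rate(*variant_b)
z, p_value = two_proportion_z_test(*variant_a, *variant_b)

print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  z={z:.2f}  p={p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone; the significance threshold itself is a decision the analyst agrees on with the product team before the experiment runs.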

Data Engineers

  • Build and maintain the infrastructure used to collect, transform, and store data, for example an ETL (Extract, Transform, Load) process
  • Develop annotation tools that help collect labeled data
  • Manage and orchestrate the pipelines through which data is ingested and moved across different data stores
  • As the volume of data increases, they need skills in distributed computing and storage (also known as big data)
  • Tools: Data storage, Message brokers, Pipeline management tools, Data warehouse …
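
The ETL process mentioned above can be sketched in a few lines. This is only a toy version under stated assumptions: the raw CSV text, the `users` table, and the cleaning rule are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, an API, or a message broker.
RAW_CSV = """user_id,country,signup_date
1, us ,2021-11-01
2,FR,2021-11-03
3,,2021-11-05
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: normalize country codes and drop rows without a country."""
    cleaned = []
    for row in rows:
        country = row["country"].strip().upper()
        if not country:
            continue  # skip incomplete records
        cleaned.append((int(row["user_id"]), country, row["signup_date"]))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id INTEGER, country TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database stands in for a warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT user_id, country FROM users ORDER BY user_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes each one easy to test and to swap out, which is the same idea pipeline management tools apply at a larger scale.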

Data Scientists

  • Analyse, process, and interpret data
  • Find features and insights in the data with statistical methods (feature engineering)
  • Communicate findings to the business team, product team, and other stakeholders
  • Build machine learning models that serve as prototypes or are deployed in production
  • Tools: statistics, databases, machine learning methods and frameworks …

Research Scientists

This role involves developing new algorithms for product-related fields. This can lead to breakthroughs and a competitive edge over competitors.

Machine Learning Engineers

  • Build and maintain tools and infrastructure to deploy, serve, monitor, and update models
  • Develop prediction interfaces (client side or cloud service endpoints)
  • Handle scalability with containers and orchestration platforms like Kubernetes
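
To make the idea of a prediction interface more concrete, here is a minimal sketch of the request-handling logic that could sit behind a cloud service endpoint, with the web framework left out on purpose. The `predict` function, its fixed weights, and the JSON request shape are all assumptions made for this example.

```python
import json


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.4, 0.6]
    score = sum(w * x for w, x in zip(weights, features))
    return {"score": score, "label": "positive" if score >= 0.5 else "negative"}


def handle_predict_request(raw_body):
    """Return (HTTP status code, JSON response body) for one prediction request."""
    try:
        payload = json.loads(raw_body)
        features = payload["features"]
        valid = (
            isinstance(features, list)
            and len(features) == 2
            and all(isinstance(x, (int, float)) for x in features)
        )
        if not valid:
            raise ValueError("features must be a list of two numbers")
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError, so malformed JSON lands here too.
        return 400, json.dumps({"error": "expected a JSON body with two numeric features"})
    return 200, json.dumps(predict(features))


status, body = handle_predict_request('{"features": [1.0, 0.5]}')
print(status, body)

bad_status, bad_body = handle_predict_request("not json")
print(bad_status, bad_body)
```

Separating validation and prediction from the transport layer lets the same function be wrapped by whichever HTTP framework or serving platform the team picks.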

Developers

  • Integrate the machine learning product with the main application
  • Abstract the machine learning predictions behind user-friendly features

Full Stack Data Scientist

Finding staff for each of these specific roles is sometimes challenging and not cost-effective in a small company. This requires one or a few data scientists to handle several roles at the same time; they are Full Stack Data Scientists. They are thus required to have a wider range of knowledge and skills.

QA

  • Evaluate the model when it is deployed based on the acceptance criteria
  • Perform regression tests to ensure the model matches real use cases
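
One way to picture the two checks above is a small release gate: the candidate model must reach a minimum accuracy (the acceptance criterion) and must not score noticeably worse than the previously released model (the regression check). The metric, the thresholds, and the spam/ham test set below are hypothetical; the actual criteria are whatever the team has defined.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate_release(predictions, labels, min_accuracy, baseline_accuracy, tolerance=0.01):
    """Check a candidate model against acceptance and regression criteria.

    - acceptance: accuracy must reach `min_accuracy`
    - regression: accuracy must not fall more than `tolerance` below the
      previously released model's `baseline_accuracy`
    """
    acc = accuracy(predictions, labels)
    return {
        "accuracy": acc,
        "meets_acceptance": acc >= min_accuracy,
        "no_regression": acc >= baseline_accuracy - tolerance,
    }


# Hypothetical test set drawn from real use cases, and the candidate model's outputs.
labels = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham", "ham"]
predictions = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham", "ham"]

report = evaluate_release(predictions, labels, min_accuracy=0.85, baseline_accuracy=0.92)
print(report)
```

In this made-up run the candidate passes the acceptance threshold but is flagged as a regression against the baseline, which is the kind of case QA would escalate before release.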

TA

  • Work closely with the ML Engineers to define automatic testing scenarios
  • Schedule and run performance tests with the model deployed on servers
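
A very small performance test might look like the sketch below: it calls a prediction function repeatedly, records the latency of each call, and compares the 95th percentile to a budget. `fake_predict` is a hypothetical stand-in for a request to the deployed model, and the 200 ms budget is an arbitrary example value. The loop is sequential and single-client, so it says nothing about behaviour under concurrent load.

```python
import math
import statistics
import time


def fake_predict(features):
    """Hypothetical stand-in for a request to the deployed model."""
    time.sleep(0.001)  # simulate a small amount of work per request
    return sum(features) > 1.0


def measure_latency(predict, payload, n_requests=50):
    """Call `predict` n_requests times and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    latencies_ms.sort()
    # Nearest-rank 95th percentile of the sorted samples.
    p95_index = math.ceil(0.95 * n_requests) - 1
    return {
        "count": n_requests,
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": latencies_ms[p95_index],
        "max_ms": latencies_ms[-1],
    }


def within_budget(stats, p95_budget_ms):
    """True if the measured 95th percentile latency fits the budget."""
    return stats["p95_ms"] <= p95_budget_ms


stats = measure_latency(fake_predict, [0.7, 0.6])
print(stats, within_budget(stats, p95_budget_ms=200.0))
```

Reporting a percentile rather than only the average keeps occasional slow requests visible, which usually matters more to users than the typical case.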

Annotators

They collect labeled data according to the data requirements, using third-party tools or tools built by the in-house data engineers. They can be:

  • Qualified in-house annotators
  • Contract annotators
  • Outsourced annotation services

Data quality and quantity vary depending on the group of annotators. In-house annotators usually produce the best data quality, but the volume may not be enough for the application at hand. Outsourced labeled data can come in large amounts but will need quality control.
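
One possible form of quality control is to have an in-house annotator label a sample of the outsourced data and measure how well the two sets of labels agree. The sketch below uses Cohen's kappa for this, which is a choice made for the example rather than something the workflow above prescribes; the label lists are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample labeled by both groups.
in_house = ["cat", "dog", "dog", "cat", "dog", "cat", "dog", "dog"]
outsourced = ["cat", "dog", "cat", "cat", "dog", "cat", "dog", "cat"]

kappa = cohen_kappa(in_house, outsourced)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the agreement is about what chance would produce; the threshold below which a batch is sent back for relabeling is a project decision.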
