A machine learning project starts with a lot of questions: What is the goal of the project? What application are we building? What results do we expect? Which tasks do we need to perform? What approach can we use? To answer these questions, we need to build a robust team. Each member of the team plays an important role and cooperates closely with the others.
A fully matured machine learning team consists of the following core roles:
- Data Analysts
- Data Engineers
- Data Scientists
- Research Scientists
- ML Engineers
- Developers
There are also other supporting roles:
- QA
- TA
- Annotators
Data Analysts are responsible for the following:
- Get insights from user data with:
  - Descriptive statistics: information that describes the data, for example customer demographics, landing page conversion rates, and loyalty and retention rates
  - Inferential statistics: deduce the characteristics of users as a population, for example user trends and statistical hypothesis testing (e.g. A/B testing)
- Cooperate with the product and business teams and create the product roadmap from these insights
- Define the model evaluation procedure and acceptance criteria
- Analyse feedback data from the deployed model
- Tools: Excel, SQL, Tableau, Power BI …
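
The A/B testing mentioned above can be illustrated with a two-proportion z-test on landing page conversion rates. This is a minimal sketch using only the Python standard library; the function name and the conversion counts are hypothetical, and a real analysis would also consider sample size planning and the chosen significance level.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 120 of 2400 visitors converted on page A, 156 of 2400 on page B.
z, p_value = two_proportion_z_test(120, 2400, 156, 2400)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone, which is the kind of insight that can feed into the product roadmap.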
Data Engineers are responsible for the following:
- Build and maintain the infrastructure used to collect, transform, and store data, for example the ETL (Extract, Transform, Load) process
- Develop annotation tools that help collect labeled data
- Manage and orchestrate the pipeline through which data is ingested and moved across different data stores
- As the volume of data increases, they need distributed computing and storage skills (also known as big data)
- Tools: Data storage, Message brokers, Pipeline management tools, Data warehouse …
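
To make the ETL process above concrete, here is a small sketch with the Python standard library only. The CSV content, the column names, and the `users` table are all hypothetical; a production pipeline would typically read from files, APIs, or message brokers and write to a data warehouse instead of an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export used purely for illustration.
RAW_CSV = """user_id,country,signup_date
1, fr ,2021-03-01
2,US,2021-03-02
3,,2021-03-05
"""

def extract(raw_text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: normalise country codes and drop rows without one."""
    cleaned = []
    for row in rows:
        country = row["country"].strip().upper()
        if not country:
            continue
        cleaned.append((int(row["user_id"]), country, row["signup_date"]))
    return cleaned

def load(records, connection):
    """Load: write the cleaned records into a table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, country TEXT, signup_date TEXT)"
    )
    connection.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT user_id, country FROM users ORDER BY user_id"
).fetchall()
print(loaded)  # [(1, 'FR'), (2, 'US')]
```

Keeping the three steps as separate functions makes each one easy to test and to replace, for example swapping the load step for a warehouse client.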
Data Scientists are responsible for the following:
- Analyse, process, and interpret data
- Find features and insights in data with statistical methods (feature engineering)
- Communicate findings to the business and product teams and other stakeholders
- Build machine learning models that serve as prototypes or are deployed in production
- Tools: statistics, databases, Machine learning, machine learning frameworks …
The Research Scientist role involves developing new algorithms for product-related fields. This can lead to breakthroughs and a competitive edge over competitors.
ML Engineers are responsible for the following:
- Build and maintain tools and infrastructure to deploy, serve, monitor, and update models
- Develop prediction interfaces (client side or cloud service endpoints)
- Handle scalability with containers and orchestration platforms like Kubernetes
Developers are responsible for the following:
- Integrate the machine learning product with the main application
- Abstract the machine learning predictions behind user-friendly features
Finding staff for each of these specific roles is sometimes challenging and not cost-effective in a small company. This requires one or a few data scientists to handle several roles at the same time; they are Full Stack Data Scientists. They are thus required to have a wider range of knowledge and skills.
- Evaluate the model when it is deployed based on the acceptance criteria
- Perform regression tests to ensure the model matches the real use cases
- Work closely with the ML Engineers to define automatic testing scenarios
- Schedule and run performance tests with the model deployed on servers
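
Evaluating a deployed model against acceptance criteria, as described in the list above, could look like the following sketch. The metric names, the thresholds in `ACCEPTANCE_CRITERIA`, and the labels are hypothetical; the actual criteria would be the ones agreed on for the project.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical acceptance criteria: minimum value required for each metric.
ACCEPTANCE_CRITERIA = {"accuracy": 0.80, "recall": 0.75}

def check_acceptance(metrics, criteria):
    """Return the names of the metrics that fall below their minimum threshold."""
    return [name for name, minimum in criteria.items() if metrics[name] < minimum]

# Hypothetical ground truth and predictions collected from the deployed model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]

metrics = evaluate(y_true, y_pred)
failures = check_acceptance(metrics, ACCEPTANCE_CRITERIA)
print(metrics, "accepted" if not failures else f"rejected: {failures}")
```

A check like this can run as an automatic testing scenario, so that a model update that drops below a threshold is flagged before release.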
Annotators collect labeled data according to the data requirements, using third-party tools or tools designed by the in-house data engineers. They can be:
- Qualified in-house annotators
- Contract annotators
- Outsourced annotation services
Data quality and quantity vary depending on the group of annotators. In-house annotators produce the best data quality, but the quantity may not be enough for the application. Outsourced labeled data can come in large amounts but needs quality control.
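
One possible form of quality control, offered here as an assumption rather than a prescribed method, is to have an in-house annotator and an outsourced annotator label the same sample and measure their agreement with Cohen's kappa. The labels below are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Fraction of items on which both annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Agreement expected by chance, given each annotator's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)

# Hypothetical labels for the same eight items from two annotator groups.
in_house = ["cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog"]
outsourced = ["cat", "dog", "dog", "cat", "dog", "dog", "cat", "cat"]

kappa = cohen_kappa(in_house, outsourced)
print(f"kappa = {kappa:.2f}")  # kappa = 0.50
```

A kappa close to 1 indicates strong agreement, while a low value suggests the outsourced labels or the annotation guidelines need review.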