Launch device management solutions to support operations for 30k+ devices
Deliver E2E on-demand OTA experience on mobile/web according to prod roadmap
Longxiang He
Yang Song
Zhongjie Chen
Deliver software to enable scheduled batch push OTA update.
Longxiang He
Yang Song
Collect historical logs from edge devices.
80% bridges Felix Li
100% superboxes in CN Qing Wang
Collect real-time metrics from edge devices.
80% bridges Felix Li
100% superboxes in CN Qing Wang
Build a real-time dashboard to monitor device health and operations.
Design data collection and processing pipeline. Felix Li
Define metrics that meet business requirements and create the dashboard. Tao Ren
Scale remote device access to support 100k devices.
Research, validate and choose deployment solution. Qing Wang
Deploy the solution to production in the US. Felix Li
Launch a proof of concept for a per-device credential management system
Zhongjie Chen
Consider Vault
Increase cloud operation efficiency
Convert 100% of AWS resource operations into software
Longxiang He
Milestone 1 - Dev
Milestone 2 - Test
Milestone 3 - Demo / POC / Canada
Milestone 4 - Prod
Redesign and deploy VPC/IAM/SecurityGroup to secure production/test/dev environment
Create a process to review production operations before commitment
Longxiang He
Deploy a service to support real-time p2p communication for 30k+ bridges and 100+ robots.
Deploy ws2.0 for 100+ robots, with 99.9% SLA. Zhongjie Chen
Deploy ws2.0 for US with 99.9% SLA. Felix Li
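To make the 99.9% SLA target concrete, the following minimal sketch converts an availability percentage into an allowed-downtime budget. The function name and the 30-day window are illustrative assumptions, not part of the plan; the actual measurement window would be whatever the SLA defines.

```python
def downtime_budget_minutes(sla_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) permitted by an availability target.

    sla_percent: availability target, e.g. 99.9
    window_days: measurement window in days (assumed 30 here for illustration)
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    # The budget is the fraction of the window that may be unavailable.
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime per deployment.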
Automate audit and security scanning process
Secure customer data and production resources.
Create a process for developers to request and obtain resources.
Longxiang He
Implement a human/machine identity tracking system
Yang Song
Implement tracing around production resources operations.
Yang Song
Segregate production / test / dev to meet compliance requirements
Develop the capability to defend against DDoS attacks.
Reduce production incidents and reduce troubleshooting costs.
Centralize logs for critical cloud services
All US services Felix Li
Replicate the ELK launch in CN Qing Wang
Integrate real-time metrics for critical cloud services
Felix Li
Define the standards for cloud production service
Longxiang He
Bring 50% of services into compliance with the standards
Longxiang He
Reduce on-call reaction time to 1hr
incident happened (node CPU > 99%) |——————| incident resolved (node healthy)
4-6hrs => 1hr
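The target above is measured from the moment the incident starts (for example, node CPU > 99%) to the moment the node is healthy again. A minimal sketch of that measurement follows; the function names and timestamps are hypothetical and only illustrate the comparison against the 1hr target.

```python
from datetime import datetime, timedelta

# Target taken from the key result above: resolve within 1 hour.
TARGET = timedelta(hours=1)


def time_to_resolve(started_at: datetime, resolved_at: datetime) -> timedelta:
    """Elapsed time between incident start and the node being healthy again."""
    if resolved_at < started_at:
        raise ValueError("resolved_at must not precede started_at")
    return resolved_at - started_at


def meets_target(started_at: datetime, resolved_at: datetime,
                 target: timedelta = TARGET) -> bool:
    """True when the incident was resolved within the target window."""
    return time_to_resolve(started_at, resolved_at) <= target


if __name__ == "__main__":
    # Hypothetical incident: CPU alert at 09:00, node healthy at 13:30.
    start = datetime(2020, 1, 1, 9, 0)
    end = datetime(2020, 1, 1, 13, 30)
    print(time_to_resolve(start, end), meets_target(start, end))
```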
Develop a reproducible dev environment
Felix Li
API documentation generation
Longxiang He
Prepare the EKS migration plan for the US. Qing Wang
Migrate Prometheus to EKS (including deprecating the current Prometheus deployment)