@yangchenyun
Last active October 14, 2021 03:19
Turing_OKR

OKRs

Launch device management solutions to support operations for 30k+ devices

Deliver E2E on-demand OTA experience on mobile/web according to prod roadmap

Longxiang He Yang Song Zhongjie Chen

Deliver software to enable scheduled batch push OTA update.

Longxiang He Yang Song

Collect historic logs from edge devices.

  • 80% bridges Felix Li
  • 100% superboxes in CN Qing Wang

Collect real-time metrics from edge devices.

  • 80% bridges Felix Li
  • 100% superboxes in CN Qing Wang

Build real-time dashboard to monitor device healthiness and operation.

  • Design data collection and processing pipeline. Felix Li
  • Define metrics to meet business requirement and create dashboard. Tao Ren

Scale remote device access to support 100k devices.

  • Research, validate and choose deployment solution. Qing Wang
  • Deploy the solution to production in the US. Felix Li

Launch proof-of-concept to have per-device credentials management system

Zhongjie Chen (consider Vault)

Increase cloud operation efficiency

Convert 100% of AWS resource operations into software

Longxiang He

  • Milestone 1 - Dev
  • Milestone 2 - Test
  • Milestone 3 - Demo / POC / Canada
  • Milestone 4 - Prod

Redesign and deploy VPC/IAM/SecurityGroup to secure production/test/dev environment

Create a process to review production operations before commitment

Longxiang He

Deployment service to support real-time p2p communication for 30k+ bridges and 100+ robots.

  • Deploy ws2.0 for 100+ robots, with 99.9% SLA. Zhongjie Chen
  • Deploy ws2.0 for US with 99.9% SLA. Felix Li
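As a rough illustration of what a 99.9% SLA allows, the sketch below converts an availability target into a downtime budget. The function name and the 30-day and 365-day windows are assumptions for illustration; the OKR itself does not state a measurement window.

```python
def downtime_budget_minutes(sla_percent: float, period_days: float) -> float:
    """Return the maximum downtime (in minutes) permitted by an SLA over a period."""
    total_minutes = period_days * 24 * 60
    # The allowed downtime is the fraction of time NOT covered by the SLA.
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # Assumed windows: a 30-day month and a 365-day year.
    print(f"30 days:  {downtime_budget_minutes(99.9, 30):.1f} min")
    print(f"365 days: {downtime_budget_minutes(99.9, 365) / 60:.2f} h")
```

Under these assumptions, 99.9% leaves roughly 43 minutes of downtime per 30 days, or under 9 hours per year.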

Automate audit and security scanning process

Secure customer data and production resources.

Create the process to support devs to get resources.

Longxiang He

Implement a human/machine identity tracking system

Yang Song

Implement tracing around production resources operations.

Yang Song

Segregate production / test / dev to meet compliance requirement

Develop the capability to defend against DDoS attacks.

Reduce production incidents and reduce troubleshooting costs.

Centralize log for critical cloud services

  • all US Services Felix Li
  • Replicate the same ELK launch in CN Qing Wang

Integrate real-time metrics for critical cloud services

Felix Li

Define the standards for cloud production service

Longxiang He

Make 50% of services meet the standards

Longxiang He

Reduce on-call reaction time to 1hr

Time from incident start (node CPU > 99%) to incident resolution (node healthy): 4-6hrs => 1hr
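One way to track this target is to compute the elapsed time between the incident start and its resolution and compare it to the 1-hour goal. This is a minimal sketch: the function names and the sample timestamps are hypothetical, and how the two events are actually detected is not specified here.

```python
from datetime import datetime, timedelta

# Target taken from the key result above: resolve within 1 hour.
TARGET = timedelta(hours=1)


def time_to_resolve(started: datetime, resolved: datetime) -> timedelta:
    """Elapsed time from incident start to the node being healthy again."""
    return resolved - started


def meets_target(started: datetime, resolved: datetime,
                 target: timedelta = TARGET) -> bool:
    """True when the incident was resolved within the target window."""
    return time_to_resolve(started, resolved) <= target


if __name__ == "__main__":
    # Hypothetical incident: CPU alert at 12:00, node healthy at 17:00.
    start = datetime(2021, 10, 1, 12, 0)
    end = datetime(2021, 10, 1, 17, 0)
    print(time_to_resolve(start, end), meets_target(start, end))
```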

Long-term investment

Productivity / Cost

Develop a reproducible dev environment

Felix Li

API documentation generation

Longxiang He

Look into CI/CD pipeline

Longxiang He

Audit cost and expenses

Prepare for EKS migration plan in the US. Qing Wang

Migrate Prometheus to EKS (including deprecating the current Prometheus)

Migrate broadway to EKS
