@cabecada
Created April 2, 2024 06:23
Accomplished System Architect with over 13 years of hands-on experience managing large-scale databases, including databases of tens of terabytes, up to 60TB. Proficient in designing and implementing robust monitoring and logging solutions to ensure optimal performance and reliability. Demonstrated expertise in building highly scalable architectures capable of handling substantial data volumes while maintaining seamless operations. Solid experience in facilitating migrations, upgrades, and incident management processes with minimal disruption and downtime. Adept at generating insightful reports that give stakeholders comprehensive visibility into system health and performance metrics. Proven track record of optimizing database environments for maximum efficiency and resilience in dynamic business landscapes.
Skills:
Systems Architecture (general architecture meetings and discussions, building POCs in test labs)
Reliability Engineering (uptime, fault injection and tolerance, upgrades, disaster recovery)
Observability Tools (e.g., Graphite, Prometheus, Grafana)
Logging Tools (e.g., ELK, pgBadger)
Scalability Planning (loose coupling, scatter gather in parallel to avoid cross node latency)
Automation (CI/CD pipelines)
Containerization (Docker, Mesos)
Configuration Management (Puppet, Rex, Ansible)
Databases (PostgreSQL, MongoDB)
Virtualisation (VMware vSphere)
File Systems (ext4, ZFS)
Operating Systems (Ubuntu, Red Hat, Gentoo)
Backup and Recovery Strategies (filesystem snapshots, backups of 28TB databases using pgBackRest)
Troubleshooting and Debugging (perf, flamegraph, gdb)
Quick Learner (open-source PostgreSQL support, POCs, asking questions on Slack channels)
PostgreSQL DBRE at Adjust
Key Responsibilities and Achievements:
Provided expert support for PostgreSQL databases ranging from 10TB to 60TB in size, ensuring high availability and optimal performance on bare metal servers.
Implemented robust failover strategies and managed seamless upgrades for PostgreSQL clusters, minimizing downtime and ensuring continuous service delivery.
Implemented Point-in-Time Recovery (PITR) and backup solutions using pgBackRest, ensuring data integrity and facilitating quick recovery in case of failures.
Proactively tracked and remediated PostgreSQL page corruption issues by leveraging in-depth knowledge of page geometry.
Optimized storage configurations on MDRAID (ext4) and ZPool (ZFS), tuning and monitoring for maximum performance and reliability.
Configured, tuned, and monitored pgBouncer for efficient connection pooling, both on the client and server sides.
Implemented table partitioning using pg_partman, enhancing query performance and manageability.
Designed and implemented sharding solutions using consistent hashing and range-based sharding techniques, leveraging foreign data wrappers (FDW).
Installed and configured various PostgreSQL extensions, including pg_repack, pg_partman, and parquet_fdw, to enhance functionality and performance.
Utilized parquet_fdw to query Parquet files using SQL, optimizing PostgreSQL servers for large OLAP queries in complex setups.
Investigated the use of DuckDB for querying parquet_fdw as part of ongoing performance optimization efforts.
Managed PostgreSQL replication setups, including streaming, cascading, and logical replication, ensuring data consistency and availability across distributed environments.
Collaborated remotely with teams in Japan, Germany, and India.
Infrastructure SRE at OpenTable
Key Responsibilities and Achievements:
Orchestrated the deployment and management of containerized applications using Apache Mesos and Docker, ensuring efficient resource utilization and scalability.
Leveraged Singularity as the orchestrator to automate the scheduling and execution of containerized workloads, optimizing resource allocation and workload distribution.
Implemented service discovery mechanisms to facilitate communication between applications running in containerized environments, ensuring seamless interaction and scalability.
Engineered fault tolerance mechanisms to maintain application availability, automatically spawning additional instances on other servers in case of server failures.
Spearheaded canary deployments and feature flag implementations to test breaking changes and new features in production environments, minimizing risks and ensuring smooth rollouts.
Collaborated with development teams to annotate dashboards with relevant information about changes, enabling effective monitoring and analysis of performance impacts.
Led the management and optimization of PostgreSQL and Redis clusters across CI/UAT/Prod environments, ensuring high availability, scalability, and performance.
Architected and implemented Redis support for high availability (HA) and sharding, utilizing industry best practices and Redis Sentinel for failover automation.
Integrated Kafka into the ELK (Elasticsearch, Logstash, Kibana) architecture to handle high-volume log ingestion, providing real-time analytics and visualization capabilities.
Developed standardized logging templates for applications, ensuring consistency and compatibility across various services, and used Filebeat for log shipping to Logstash via Kafka.
Designed and implemented a standardized metrics template for developers to monitor system health and performance, leveraging tools like Graphite, Prometheus, and Grafana for visualization and analysis.
Established CI/UAT/Prod environments with uniform configurations and deployment processes, facilitating early detection of deployment issues and streamlining the release pipeline.
Collaborated with development teams to integrate logging and metrics templates into applications, enabling developers to create standardized dashboards for monitoring and troubleshooting.
Provided mentorship and guidance to junior team members on best practices for system architecture, deployment automation, and monitoring strategies.