@jchen42703
Created February 23, 2023 20:32

Monitoring Infrastructure

Monitor thousands of endpoints seamlessly.

What does this package do?

  • Health check thousands of endpoints
  • Automatically send notifications when any endpoint is failing.
  • Customizable: the user defines what counts as a "failing" endpoint, based on:
    • Request timeout
    • Request response code
    • Request response body

Design

High-Level

We keep a hash map of the websites to monitor. For each website, we send an HTTP request every X seconds to check that it is up.

MVP

Monitor websites with a hash map configuration:

// Support HTTP only for now.
// Can support other protocols like gRPC if needed.
type SuccessConditions struct {
    responseCode int    // expected HTTP status code
    responseBody string // expected body; may be a regex
    timeoutMs int       // timeout in milliseconds
}

// Records a failure incident in the history.
type HealthHistoryEntry struct {
    url string
    timestampMs int64 // UTC timestamp of the incident in ms (int64 so it also fits on 32-bit platforms)
    responseCode int
    responseBody string
    responseTimeMs int
    err string
}

type WebsiteConfig struct {
    url string     // includes http/https
    users []string // user UUIDs, used to look up each user's notification webhook in the internal db
    successConditions *SuccessConditions
    healthHistory []*HealthHistoryEntry // should be flushed once it exceeds capacity
}

var websites map[string]*WebsiteConfig

Set up Grafana/Prometheus for logging and monitoring.

Job Scheduling:

  • Need to make sure there is adequate memory.
  • Should retry inside the job.
  • Should be able to clean up the job after execution.
  • Need to ensure that there are sufficient connections.
  • Need to avoid getting rate limited by Discord/Slack/Twilio.
    • Add to the queue when rate limited.

type NotificationJob struct {
    webhookUrl string
    entry *HealthHistoryEntry
    // Specify phone number / email down the line
}

// Queues up webhooks
type NotificationScheduler struct {
    q []*NotificationJob
}

// Coordinates request sending and webhook sending
// Maintains a job queue.
type HealthChecker struct {
    websites map[string]*WebsiteConfig
}

High-Level Summary:

X = 120 -> every 2 minutes

HealthChecker --> Schedules a job/task for each website in `HealthChecker.websites` as a goroutine
                  Each task sends an HTTP request to its assigned endpoint every X seconds.
                                    |                |
                                    |                |
                                    |                --> When a task encounters a failure, the task should
                                    |                     schedule a NotificationJob with NotificationScheduler
                                    --> When a task succeeds, the task should just continue to poll

NotificationScheduler
        |
        |
        --> Receives a job?
                |
                --> Rate limited?
                        |
                        -> If yes, add to the queue while waiting for the rate limit to reset.
                           Make sure to send a notification to devs because this should be rare.
                        |
                        -> If not, execute the webhook job (goroutine). Make sure to grab the rate limit counter
                           from the webhook response.
        |
        |
        -> Make sure to set a capacity for the queue to not run out of memory
            -> Send notification to developers when capacity is exceeded.