jchen42703/MONITORING_APP_DESIGN.md

## MONITORING_APP_DESIGN.md

      
### Monitoring Infrastructure

Monitor thousands of endpoints seamlessly.

#### What does this package do?


- Health check thousands of endpoints.
- Automatically send notifications when any of the endpoints is failing.
- Customizable! Lets the user define what a "failing" endpoint is:
  - Request timeout
  - Request response code
  - Request response body


#### Design

##### High-Level

We keep a hash map of the websites to monitor. For each website, we send an HTTP request every X seconds to make sure that it's up.
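
The "every X seconds" loop could be sketched with a `time.Ticker`, as below. This is only an illustration under assumptions: `startPolling` and its `check` callback are hypothetical names that do not appear in the design, and the real task would send the HTTP request inside `check`.

```go
package main

import (
	"context"
	"time"
)

// startPolling (hypothetical helper) runs check once immediately and then
// once per interval in a background goroutine. The returned stop function
// cancels the loop and blocks until the goroutine has exited.
func startPolling(interval time.Duration, check func()) (stop func()) {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			check()
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// Interval elapsed: loop around and check again.
			}
		}
	}()
	return func() {
		cancel()
		<-done
	}
}
```

With X = 120, a caller would write something like `stop := startPolling(120*time.Second, check)` and call `stop()` on shutdown. Returning a stop function is one way to meet the "clean up job after execution" requirement.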
##### MVP

Monitor websites with a hash map configuration:
```go
// Support HTTP only for now.
// Can support other protocols like gRPC if needed.
type SuccessConditions struct {
	responseCode int    // expected HTTP status code
	responseBody string // expected body; allow regex
	timeoutMs    int    // timeout in milliseconds
}

// HealthHistoryEntry records a failure incident in the history.
type HealthHistoryEntry struct {
	url            string
	timestampMs    int64 // UTC timestamp of the incident in ms (int64 so it cannot overflow on 32-bit platforms)
	responseCode   int
	responseBody   string
	responseTimeMs int
	err            string
}

type WebsiteConfig struct {
	url               string   // includes http/https
	users             []string // user UUIDs, used to look up each user's notification webhook in the internal DB
	successConditions *SuccessConditions
	healthHistory     []*HealthHistoryEntry // should be flushed once it exceeds capacity
}

var websites map[string]*WebsiteConfig
```
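
The success conditions above could be evaluated roughly as follows. This is a sketch, not the package's implementation: `evaluate` and `checkOnce` are hypothetical names, treating an empty `responseBody` as "match any body" is an assumption, and the struct is repeated only to keep the block self-contained.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

// SuccessConditions mirrors the struct from the design.
type SuccessConditions struct {
	responseCode int
	responseBody string // regular expression; empty means "any body" (assumption)
	timeoutMs    int
}

// evaluate returns nil when the observed response satisfies cond, or an
// error describing the first condition that failed.
func evaluate(cond SuccessConditions, code int, body string, elapsedMs int) error {
	if elapsedMs > cond.timeoutMs {
		return fmt.Errorf("took %d ms, limit is %d ms", elapsedMs, cond.timeoutMs)
	}
	if code != cond.responseCode {
		return fmt.Errorf("got status %d, want %d", code, cond.responseCode)
	}
	if cond.responseBody != "" {
		matched, err := regexp.MatchString(cond.responseBody, body)
		if err != nil {
			return fmt.Errorf("invalid body pattern: %w", err)
		}
		if !matched {
			return fmt.Errorf("body does not match %q", cond.responseBody)
		}
	}
	return nil
}

// checkOnce sends a single GET request to url and evaluates the response.
func checkOnce(url string, cond SuccessConditions) error {
	client := &http.Client{Timeout: time.Duration(cond.timeoutMs) * time.Millisecond}
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return err // covers timeouts and connection failures
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	elapsedMs := int(time.Since(start).Milliseconds())
	return evaluate(cond, resp.StatusCode, string(body), elapsedMs)
}
```

Keeping `evaluate` free of network calls means the failure rules can be unit tested without a live endpoint. A non-nil error from `checkOnce` is what would be turned into a `HealthHistoryEntry`.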
Set up Grafana/Prometheus for logging and monitoring.

Job Scheduling:

- Need to make sure there is adequate memory.
- Should retry inside the job.
- Should be able to clean up the job after execution.
- Need to ensure that there are sufficient connections.
- Need to avoid getting rate limited by Discord/Slack/Twilio.
  - Add to the queue when rate limited.
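
"Retry inside the job" might look like the helper below. The design does not specify a retry policy, so the doubling backoff, the `retry` helper, and the `errCheckFailed` placeholder are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCheckFailed is a placeholder for whatever error a failed health check
// or webhook call would return.
var errCheckFailed = errors.New("check failed")

// retry (hypothetical helper) calls fn up to attempts times. It sleeps
// between failed tries, doubling the wait each time, and returns nil as
// soon as fn succeeds.
func retry(attempts int, backoff time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}
```

Because the retries happen inside the job, a failure is reported only after the last attempt, which should cut down on notifications caused by a single dropped request.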


```go
type NotificationJob struct {
	webhookUrl string
	entry      *HealthHistoryEntry
	// Specify phone number / email down the line
}

// Queues up webhooks
type NotificationScheduler struct {
	q []*NotificationJob
}

// Coordinates request sending and webhook sending
// Maintains a job queue.
type HealthChecker struct {
	websites map[string]*WebsiteConfig
}
```
High-Level Summary:

```
X = 120 -> every 2 minutes

HealthChecker --> Schedules a job/task for each website in `HealthChecker.websites` as a goroutine.
                  Each task sends an HTTP request to its assigned endpoint every X seconds.
                                    |                |
                                    |                |
                                    |                --> When a task encounters a failure, the task should
                                    |                     schedule a NotificationJob with NotificationScheduler
                                    --> When a task succeeds, the task should just continue to poll

NotificationScheduler
        |
        |
        --> Receives a job?
                |
                --> Rate limited?
                        |
                        -> If yes, add to queue while waiting for the rate limit to reset.
                           Make sure to send a notification to devs because this should be rare.
                        |
                        -> If not, execute the webhook job (goroutine). Make sure to grab the rate limit counter
                           from the webhook response.
        |
        |
        -> Make sure to set a capacity for the queue to not run out of memory
            -> Send notification to developers when capacity is exceeded.
```
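
The NotificationScheduler flow above could be sketched like this. Everything beyond the `q` field is an assumption: `capacity`, `remaining`, the injected `send` and `alertDevs` hooks, and the `submit` and `onRateLimitReset` methods are hypothetical. For brevity the sketch is synchronous and not goroutine-safe, whereas the design runs each webhook job in its own goroutine.

```go
package main

// NotificationJob is trimmed to the one field this sketch needs.
type NotificationJob struct {
	webhookUrl string
}

// NotificationScheduler decides whether to send a job now or hold it.
type NotificationScheduler struct {
	q         []*NotificationJob
	capacity  int // maximum number of queued jobs
	remaining int // requests left in the current rate-limit window
	// send delivers the webhook and reports the provider's remaining counter.
	send func(job *NotificationJob) (remaining int, err error)
	// alertDevs notifies the developers about rare conditions.
	alertDevs func(msg string)
}

// submit queues the job when rate limited; otherwise it sends the job and
// records the rate-limit counter returned by the provider.
func (s *NotificationScheduler) submit(job *NotificationJob) {
	if s.remaining <= 0 {
		s.alertDevs("rate limited: queueing notification")
		s.enqueue(job)
		return
	}
	remaining, err := s.send(job)
	if err != nil {
		s.alertDevs("webhook failed: " + err.Error())
		return
	}
	s.remaining = remaining
}

// enqueue appends the job unless the queue is already at capacity.
func (s *NotificationScheduler) enqueue(job *NotificationJob) {
	if len(s.q) >= s.capacity {
		s.alertDevs("notification queue capacity exceeded: job dropped")
		return
	}
	s.q = append(s.q, job)
}

// onRateLimitReset restores the budget and drains queued jobs while it lasts.
func (s *NotificationScheduler) onRateLimitReset(budget int) {
	s.remaining = budget
	for len(s.q) > 0 && s.remaining > 0 {
		job := s.q[0]
		s.q = s.q[1:]
		s.submit(job)
	}
}
```

Injecting `send` keeps provider details (for example, which response header carries the counter) out of the scheduling logic. Dropping jobs once the queue is full is only one possible policy; the design requires just that developers are notified when capacity is exceeded.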