Monitor thousands of endpoints seamlessly.
- Health check thousands of endpoints
- Automatically send notifications when any endpoint is failing.
- Customizable: the user defines what counts as a "failing" endpoint, based on:
  - Request timeout
  - Response code
  - Response body
We keep a hash map of the websites to monitor. For each website, we send an HTTP request every X seconds to make sure that it's up.
Monitor websites with a hash map configuration:
// Support HTTP only for now.
// Can support other protocols like gRPC if needed.
type SuccessConditions struct {
	responseCode int    // expected HTTP status code
	responseBody string // expected body; a regex is allowed
	timeoutMs    int    // timeout in milliseconds
}
// HealthHistoryEntry records a failure incident in the history.
type HealthHistoryEntry struct {
	url            string
	timestampMs    int64 // UTC timestamp of the incident, in ms
	responseCode   int
	responseBody   string
	responseTimeMs int
	err            string
}
type WebsiteConfig struct {
	url               string   // includes http/https
	users             []string // user UUIDs, used to look up each user's notification webhook in the internal DB
	successConditions *SuccessConditions
	healthHistory     []*HealthHistoryEntry // flush once it exceeds capacity
}
var websites map[string]*WebsiteConfig
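The three success conditions can be checked with one small function. The following is a minimal sketch, not the project's implementation: `isHealthy` is a hypothetical helper, and it assumes that an empty `responseBody` pattern accepts any body and that the caller has already measured the elapsed time.

```go
package main

import (
	"fmt"
	"regexp"
)

// SuccessConditions mirrors the struct defined above.
type SuccessConditions struct {
	responseCode int
	responseBody string // regular expression; empty means "any body" (assumption)
	timeoutMs    int
}

// isHealthy (hypothetical helper) reports whether an observed response
// satisfies all three conditions. It returns an error only when the
// configured body pattern is not a valid regular expression.
func isHealthy(c *SuccessConditions, code int, body string, elapsedMs int) (bool, error) {
	if elapsedMs > c.timeoutMs {
		return false, nil // too slow
	}
	if code != c.responseCode {
		return false, nil // unexpected status code
	}
	if c.responseBody == "" {
		return true, nil
	}
	re, err := regexp.Compile(c.responseBody)
	if err != nil {
		return false, fmt.Errorf("invalid responseBody pattern: %w", err)
	}
	return re.MatchString(body), nil
}

func main() {
	c := &SuccessConditions{responseCode: 200, responseBody: `"status":\s*"ok"`, timeoutMs: 500}
	ok, err := isHealthy(c, 200, `{"status": "ok"}`, 120)
	fmt.Println(ok, err) // prints "true <nil>"
}
```

In a long-running checker it would likely be worth compiling the pattern once when the config is loaded rather than on every check.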
Set up Grafana/Prometheus for metrics and monitoring.
Job Scheduling:
- Make sure there is adequate memory.
- Retry inside the job.
- Clean up the job after execution.
- Make sure there are sufficient connections.
- Avoid getting rate limited by Discord/Slack/Twilio.
  - Add the job to a queue when rate limited.
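Several of these requirements (retrying inside the job, cleaning up afterwards, and bounding the number of connections) can be sketched together. This is one possible approach under stated assumptions, not a settled design: `retry`, `jobRunner`, and `errTransient` are hypothetical names, the backoff simply doubles after each failure, and `maxConcurrent` is assumed to be at least 1.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used by the demo job below.
var errTransient = errors.New("transient failure")

// retry runs fn up to attempts times, doubling the delay after each failure.
// It returns the number of attempts made and the last error, if any.
func retry(attempts, baseDelayMs int, fn func() error) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	delay := time.Duration(baseDelayMs) * time.Millisecond
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return i, nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return attempts, fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// jobRunner bounds how many jobs run at once, so the process never holds
// more than maxConcurrent outbound connections for jobs.
type jobRunner struct {
	slots chan struct{}
}

func newJobRunner(maxConcurrent int) *jobRunner {
	return &jobRunner{slots: make(chan struct{}, maxConcurrent)}
}

// run blocks until a slot is free, executes fn with retries, and always
// releases the slot afterwards (the "clean up after execution" step).
func (r *jobRunner) run(attempts, baseDelayMs int, fn func() error) (int, error) {
	r.slots <- struct{}{}        // acquire a slot
	defer func() { <-r.slots }() // release it, even if fn keeps failing
	return retry(attempts, baseDelayMs, fn)
}

func main() {
	runner := newJobRunner(2)
	calls := 0
	flaky := func() error { // fails twice, then succeeds
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	n, err := runner.run(5, 10, flaky)
	fmt.Println(n, err) // prints "3 <nil>"
}
```

A buffered channel is used as a counting semaphore here because it needs no extra dependencies; the right value for `maxConcurrent` depends on the memory and connection budget of the host.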
type NotificationJob struct {
	webhookUrl string
	entry      *HealthHistoryEntry
	// Specify phone number / email down the line
}
// NotificationScheduler queues up webhook jobs.
type NotificationScheduler struct {
	q []*NotificationJob
}
// HealthChecker coordinates request sending and webhook sending.
// Maintains a job queue.
type HealthChecker struct {
	websites map[string]*WebsiteConfig
}
High-Level Summary:
X = 120 -> every 2 minutes
HealthChecker --> Schedules a job/task (goroutine) for each website in `HealthChecker.websites`.
                  Each task sends an HTTP request to its assigned endpoint every X seconds.
                    |   |
                    |   |
                    |   --> When a task encounters a failure, it schedules
                    |       a NotificationJob with NotificationScheduler.
                    --> When a task succeeds, it just continues to poll.
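The polling flow above could look roughly like the sketch below: one goroutine per URL, driven by a `time.Ticker`, reporting failures on a channel. All names here (`failure`, `fakeCheck`, `pollOne`, `collectFailures`) are hypothetical, `fakeCheck` stands in for the real HTTP request, and the channel stands in for handing a NotificationJob to NotificationScheduler.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// failure is a simplified stand-in for HealthHistoryEntry.
type failure struct {
	url string
	err error
}

// errDown and fakeCheck are hypothetical; they replace the real HTTP request.
var errDown = errors.New("connection refused")

func fakeCheck(url string) error {
	if url == "https://down.example" {
		return errDown
	}
	return nil
}

// pollOne checks one URL on every tick until ctx is cancelled and reports
// each failed check on the failures channel. interval must be positive.
func pollOne(ctx context.Context, url string, interval time.Duration,
	check func(string) error, failures chan<- failure) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := check(url); err != nil {
				select {
				case failures <- failure{url: url, err: err}:
				case <-ctx.Done():
					return
				}
			}
		}
	}
}

// collectFailures starts one goroutine per URL, lets them run for runForMs
// milliseconds, and returns the number of failures seen per URL.
func collectFailures(urls []string, intervalMs, runForMs int, check func(string) error) map[string]int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(runForMs)*time.Millisecond)
	defer cancel()
	failures := make(chan failure)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			pollOne(ctx, u, time.Duration(intervalMs)*time.Millisecond, check, failures)
		}(u)
	}
	go func() { // close the channel once every poller has stopped
		wg.Wait()
		close(failures)
	}()
	counts := make(map[string]int)
	for f := range failures {
		counts[f.url]++
	}
	return counts
}

func main() {
	urls := []string{"https://up.example", "https://down.example"}
	counts := collectFailures(urls, 10, 100, fakeCheck)
	fmt.Println("down.example failed at least once:", counts["https://down.example"] > 0)
	fmt.Println("up.example failures:", counts["https://up.example"])
}
```

In the real service the interval would be X seconds (120 in the example above) and the context would live as long as the HealthChecker rather than timing out.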
NotificationScheduler
  |
  |
  --> Receives a job?
        |
        --> Rate limited?
              |
              -> If yes, add the job to the queue while waiting for the rate limit to reset.
              |  Make sure to send a notification to devs because this should be rare.
              |
              -> If not, execute the webhook job (goroutine). Make sure to grab the rate limit counter
                 from the response.
  |
  |
  -> Make sure to set a capacity for the queue so it does not run out of memory.
  -> Send a notification to developers when capacity is exceeded.
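The scheduler flow above can be sketched as a bounded queue plus a remaining-request counter. This is an illustration under assumptions, not the final design: the `capacity`, `remaining`, and `send` fields, the `Submit` and `Drain` methods, `errQueueFull`, and `fakeWebhook` are all hypothetical additions, and `send` is assumed to return the provider's remaining-request count for the current window. For brevity the webhook is sent synchronously while holding the lock; a real version would probably send from a goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type NotificationJob struct {
	webhookUrl string
	message    string // stand-in for *HealthHistoryEntry
}

// errQueueFull signals that developers should be alerted (hypothetical).
var errQueueFull = errors.New("notification queue is full")

type NotificationScheduler struct {
	mu        sync.Mutex
	q         []*NotificationJob
	capacity  int // maximum queued jobs, to bound memory
	remaining int // requests left in the current rate-limit window
	send      func(*NotificationJob) (remaining int, err error)
}

// deliver sends one job and records the provider's remaining counter.
// The caller must hold s.mu.
func (s *NotificationScheduler) deliver(job *NotificationJob) error {
	remaining, err := s.send(job)
	if err != nil {
		return err
	}
	s.remaining = remaining
	return nil
}

// Submit sends the job right away when there is rate-limit budget and no
// backlog; otherwise it queues the job, or fails once the queue is full.
func (s *NotificationScheduler) Submit(job *NotificationJob) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.remaining > 0 && len(s.q) == 0 {
		return s.deliver(job)
	}
	if len(s.q) >= s.capacity {
		return errQueueFull
	}
	s.q = append(s.q, job)
	return nil
}

// Drain is meant to be called when the rate-limit window resets. It sends
// queued jobs in order until the queue or the new budget runs out.
func (s *NotificationScheduler) Drain(budget int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.remaining = budget
	for len(s.q) > 0 && s.remaining > 0 {
		if err := s.deliver(s.q[0]); err != nil {
			return err
		}
		s.q = s.q[1:]
	}
	return nil
}

// fakeWebhook simulates a provider that allows budget more requests.
type fakeWebhook struct{ budget, sent int }

func (f *fakeWebhook) send(*NotificationJob) (int, error) {
	f.budget--
	f.sent++
	return f.budget, nil
}

func main() {
	hook := &fakeWebhook{budget: 1}
	s := &NotificationScheduler{capacity: 2, remaining: 1, send: hook.send}
	for _, m := range []string{"a", "b", "c", "d"} {
		err := s.Submit(&NotificationJob{webhookUrl: "https://hooks.example/x", message: m})
		fmt.Println(m, err)
	}
	hook.budget = 2 // the provider's window resets
	fmt.Println(s.Drain(2), hook.sent, len(s.q))
}
```

With a capacity of 2 and a budget of 1, job "a" is sent immediately, "b" and "c" are queued, and "d" is rejected with `errQueueFull`, which is the point where a developer notification would be raised.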