matthiasr/event_reporting.md

## event_reporting.md

      
That's a whole system design interview question right there 😉
Fundamentally, if you need that level of detail and fidelity, you compromise on timeliness. You (somehow) record an event for every message sent, then tally those up on a daily or monthly basis.
It depends on the system and the details of your requirements. I would start by looking at how messages are being sent to begin with. For example, if there's a job in a database, you might be able to count the jobs marked as completed right there.
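As a rough sketch of counting completed jobs straight from the database: the `jobs` table, its columns, and the customer names below are all hypothetical, and SQLite stands in for whatever database actually holds the jobs.

```python
import sqlite3

# Hypothetical schema: one row per send job, with a status and a completion date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, customer_id TEXT, status TEXT, completed_at TEXT)"
)
conn.executemany(
    "INSERT INTO jobs (customer_id, status, completed_at) VALUES (?, ?, ?)",
    [
        ("acme", "completed", "2024-01-03"),
        ("acme", "completed", "2024-01-17"),
        ("acme", "failed", "2024-01-18"),       # not billable
        ("globex", "completed", "2024-01-09"),
        ("globex", "completed", "2024-02-01"),  # belongs to the next billing month
    ],
)


def monthly_counts(conn, start, end):
    """Count completed jobs per customer with start <= completed_at < end."""
    rows = conn.execute(
        "SELECT customer_id, COUNT(*) FROM jobs "
        "WHERE status = 'completed' AND completed_at >= ? AND completed_at < ? "
        "GROUP BY customer_id ORDER BY customer_id",
        (start, end),
    )
    return dict(rows.fetchall())


january = monthly_counts(conn, "2024-01-01", "2024-02-01")
print(january)
```

This only works if the job status is itself trustworthy, that is, if a job is marked completed exactly when the message went out.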
If the message system is fed by a Kafka topic, and you can get away with billing for attempted sends, you can count those from the source topic.
You can also explicitly record events, say into another Kafka topic or a database table. You will need to make some CAP-style choices in that case: if recording the event fails, do you want to send anyway? What if you're not sure recording it worked?
I believe you would get the strongest guarantee of billing every message sent if you:

1. take the message to be sent from the queue. It must have some unique ID; a Kafka offset would do, but something generated at the source is better
2. send the billing event and wait for acknowledgment
3. send the message
4. acknowledge the message

If this fails at any point, rewind and retry. When generating the billing report, deduplicate on message ID. If the reporting topic is down you won't be sending messages because you couldn't bill for them.
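The bill-first flow can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: `FlakyStep` is a hypothetical stand-in for a Kafka produce or a message send that fails on a fixed schedule, and leaving the inner loop plays the role of acknowledging the message.

```python
class FlakyStep:
    """Stand-in for a remote call that fails on every `fail_every`-th invocation."""

    def __init__(self, fail_every):
        self.fail_every = fail_every
        self.calls = 0
        self.log = []  # message IDs for which the call succeeded

    def __call__(self, message_id):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise RuntimeError("simulated failure")
        self.log.append(message_id)


def process_bill_first(queue, record_billing, send, max_attempts=10):
    for message_id in queue:
        for _ in range(max_attempts):
            try:
                record_billing(message_id)  # wait for the billing acknowledgment
                send(message_id)            # only send once billing is recorded
                break                       # acknowledge the message, move on
            except RuntimeError:
                continue                    # rewind and retry this message
        else:
            raise RuntimeError(f"giving up on {message_id}")


queue = [f"msg-{i}" for i in range(10)]
record_billing = FlakyStep(fail_every=3)
send = FlakyStep(fail_every=4)
process_bill_first(queue, record_billing, send)

# A failed send after a successful billing write leaves a duplicate billing
# event behind, so the report deduplicates on message ID.
billable = set(record_billing.log)
print(len(record_billing.log), "billing events,", len(billable), "billable messages")
```

In this simulation a step only fails before it takes effect. A real system can also fail after the effect (for example, a timeout on a write that actually succeeded), which is another source of duplicates that the same deduplication absorbs.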
You would get the strongest guarantee of sending every message billed, and of continuing to send messages even if you can't record them, if you:

1. take the message from the queue
2. send it
   - if this fails, retry
3. acknowledge the message
4. send the billing event
   - if this fails, move on; the customer gets a freebie

You still probably want to deduplicate just in case there's a retry.
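A matching sketch of the send-first flow, under the same assumptions: `FlakyStep` is a hypothetical stand-in for a remote call that fails on a fixed schedule, and nothing here talks to a real queue.

```python
class FlakyStep:
    """Stand-in for a remote call that fails on every `fail_every`-th invocation."""

    def __init__(self, fail_every):
        self.fail_every = fail_every
        self.calls = 0
        self.log = []  # message IDs for which the call succeeded

    def __call__(self, message_id):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise RuntimeError("simulated failure")
        self.log.append(message_id)


def process_send_first(queue, send, record_billing, max_send_attempts=10):
    for message_id in queue:
        for _ in range(max_send_attempts):
            try:
                send(message_id)
                break               # sent; this is where the message is acknowledged
            except RuntimeError:
                continue            # sending failed: retry
        else:
            raise RuntimeError(f"giving up on {message_id}")
        try:
            record_billing(message_id)
        except RuntimeError:
            pass                    # billing failed: move on, customer gets a freebie


queue = [f"msg-{i}" for i in range(10)]
send = FlakyStep(fail_every=4)
record_billing = FlakyStep(fail_every=3)
process_send_first(queue, send, record_billing)

billed = set(record_billing.log)  # deduplicate anyway, in case of retries upstream
freebies = [m for m in queue if m not in billed]
print(len(billed), "billed,", len(freebies), "freebies")
```

Every message goes out, but billing is best effort: each failed billing write becomes unbilled traffic rather than a blocked send.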
Writing an acknowledged report for every single message sent is going to be relatively slow. You can do this in batches, but then you need to accept either sending some messages multiple times (mitigation: deduplicate on message ID again further down the chain) or not billing every message after all (mitigation: eat the loss).
Architecting for 100% is really hard; find out what the real tolerance is. The business may not care about 0.1% of revenue if it helps avoid spamming customers with retries.
The problem with using Prometheus is that it has made a bunch of choices for you, so you don't get to even think about them in the context of your needs.
Prometheus is designed to let you see quickly and reliably whether messages are being sent, and give you a close approximation of the rate. It compromises on the guarantee that no message goes uncounted.
How often miscounting happens depends on a lot of factors, too.
For example, whenever the service that exposes the metrics restarts, there's a time window when increments are never seen by Prometheus (between the last scrape and the instance shutting down, and the new instance starting up and the first scrape).
Prometheus tries to approximately account for this but it fundamentally cannot know. This gets worse the more irregular a counter is; a per-customer counter is likely to be problematic in this way and in cardinality.
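To make the restart window concrete, here is a deliberately simplified model. It is not Prometheus code: it only mimics the rule that a drop in a counter's value is treated as a reset to zero, and it ignores extrapolation at the window edges. The timings (one message per second, a scrape every 15 seconds, a restart at t=40) are made up for illustration.

```python
def scrape_values(event_times, restart_at, scrape_times):
    """Counter value seen at each scrape; the counter restarts from 0 at `restart_at`."""
    values = []
    for t in scrape_times:
        process_start = restart_at if t >= restart_at else 0
        values.append(sum(1 for e in event_times if process_start <= e < t))
    return values


def observed_increase(samples):
    """Sum the increases between scrapes, treating any drop as a reset to zero."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


event_times = list(range(60))      # one message per second for a minute
scrape_times = [0, 15, 30, 45, 60]
samples = scrape_values(event_times, restart_at=40, scrape_times=scrape_times)

true_count = len(event_times)
seen = observed_increase(samples)
print(samples, "observed:", seen, "actual:", true_count)
```

The ten messages sent between the last scrape before the restart (t=30) and the restart itself (t=40) never show up in any sample, so no amount of query-side correction can recover them.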