afmsavage/Keepmon.md

## Keepmon.md

      
Monitoring for ECDSA and Random Beacon

New Relic

Set up email and text message alerts for when any of these monitors fails.
Also, download the New Relic app on your phone so you can see everything on the go.
Synthetics


Simple Browser synthetic looking for eth_connectivity 1 on the metrics page, which proves the node is online and connected to an Ethereum endpoint
Simple Browser synthetic looking for my operator ETH address on the diagnostics page
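
A rough local equivalent of these two synthetic checks can be sketched in shell. This is an illustration, not the New Relic configuration itself: the function names `node_is_connected` and `diagnostics_has_operator` are hypothetical, and the sketch assumes the metrics page prints a line of the form `eth_connectivity 1` and that the diagnostics page contains the operator address as plain text.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical names, not part of the keep client or New Relic.

# Reads metrics text on stdin; succeeds only if a line says "eth_connectivity 1".
node_is_connected() {
    # Anchored so that values such as "eth_connectivity 10" or "eth_connectivity 0" do not match
    grep -Eq '^eth_connectivity[[:space:]]+1[[:space:]]*$'
}

# Reads diagnostics text on stdin; succeeds if it mentions the given address.
diagnostics_has_operator() {
    local address="$1"
    # -F: fixed string, -i: ignore case (addresses may be printed in mixed case)
    grep -qiF -- "$address"
}
```

Piping the output of something like `curl -fsS http://localhost:8081/metrics` into `node_is_connected` would reproduce the first check by hand. The `/metrics` path and the host port are assumptions based on the port mappings used in the run command in this document, so adjust them to your own setup.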

Infrastructure Agent

I have installed the New Relic infrastructure agent on both of my Linux nodes. When I get some time, I plan to ship logs back to New Relic via the agent. Current alert conditions:

Node CPU above 90% for 5 minutes
Node Memory above 90% for 5 minutes
Node Disk Used above 80% for 20 minutes
Node Not Responding

Grafana

I have @mutedtommy's Grafana dashboard and monitoring setup in place. See his Medium post, https://medium.com/@hr12rtk/keep-random-beacon-node-monitoring-grafana-prometheus-and-loki-4a4b669b31ea, for how to set this up. He also recently published a script that sets it up for you automatically. Make sure your firewall rules are correct!
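
One quick way to sanity-check firewall rules is to test whether a TCP port answers from the machine that needs to reach it. The sketch below uses bash's `/dev/tcp` redirection and the coreutils `timeout` command. `port_open` is a hypothetical helper name, and which ports you test (3100 for Loki in the run command in this document, for example) depends on your own setup.

```shell
#!/usr/bin/env bash
# Sketch only: port_open is a hypothetical helper, not part of any monitoring tool.
# Returns 0 if a TCP connection to host:port succeeds within the timeout (default 3 seconds).
port_open() {
    local host="$1" port="$2" seconds="${3:-3}"
    # bash treats /dev/tcp/HOST/PORT in a redirection as "open a TCP connection"
    timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (replace IP with the address of your Loki host):
#   port_open IP 3100 && echo "Loki port reachable" || echo "Loki port blocked or down"
```

A failed check does not say whether the firewall or the service is at fault, so it is only a first step when debugging.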
Run commands

Random Beacon Run CMD

The port mappings below expose the metrics and diagnostics pages to the New Relic endpoints. I also use security groups so that only certain endpoints are allowed to talk to my node.
# Published ports (host:container): 3919:3919 node port, 8081:8080 metrics, 8083:8082 diagnostics
sudo docker run -dit \
--restart always \
--log-driver loki \
--log-opt loki-url="http://IP:3100/loki/api/v1/push" \
--volume "$HOME/keep-client:/mnt" \
--env KEEP_ETHEREUM_PASSWORD="$KEEP_CLIENT_ETHEREUM_PASSWORD" \
--env LOG_LEVEL=info \
--name kc \
-p 3919:3919 \
-p 8081:8080 \
-p 8083:8082 \
keepnetwork/keep-client:v1.3.0 --config /mnt/config/config.toml start
config.toml example

[Metrics]
    Port = 8080
    NetworkMetricsTick = 60
    EthereumMetricsTick = 600

[Diagnostics]
    Port = 8082
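
Because the `-p` flags in the run command have to point at the container ports set in config.toml, it can help to read those ports back from the file. The following is a minimal sketch, assuming the file uses the `[Section]` and `Port = N` layout shown in the example; `toml_port` is a hypothetical helper, not a keep client command, and it is not a general TOML parser.

```shell
#!/usr/bin/env bash
# Sketch only: toml_port is a hypothetical helper, not part of the keep client.
# Prints the Port value of one [Section] in a config.toml laid out like the example.
toml_port() {
    local section="$1" file="$2"
    awk -v want="[$section]" '
        # On every section header, note whether it is the one we want
        /^[[:space:]]*\[/ { name = $0; gsub(/[[:space:]]/, "", name); inside = (name == want); next }
        # Inside that section, print the value from a "Port = N" line and stop
        inside && $1 == "Port" && $2 == "=" { print $3; exit }
    ' "$file"
}

# Example, using the host path that the run command mounts at /mnt:
#   toml_port Metrics "$HOME/keep-client/config/config.toml"
#   toml_port Diagnostics "$HOME/keep-client/config/config.toml"
```

The printed values should match the right-hand side of the metrics and diagnostics `-p` mappings.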
Wallet Monitoring

I am using https://buidlhub.com/ to monitor my operator address and make sure it holds enough ETH to cover operating costs. It emails me when the balance drops below 1 ETH.
I also have my operator wallet set up in Etherscan to email me on transactions. If your node is involved in any work, you will get an email saying that 0 wei has been sent from your wallet, because it is calling the smart contract functions.
Backups

I currently take a daily snapshot of my EC2 instance and keep only the latest one to save money, so I can easily spin the instance back up if something catastrophic happens. At the very least, back up your ~/keep-ecdsa/persistence directory: with a backup of this directory, you can recreate your node and be good to go. I use a cron job on my machine to sync my persistence directory to S3 storage every 2 hours.
This is the cron job, which I set up by running crontab -e:

0 */2 * * * /home/ubuntu/s3backup.sh
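
The s3backup.sh script that the cron job runs:

#!/bin/bash
# Note: cron does not load your shell profile, so BUCKETNAME must be set here or in the crontab.
status="FAILED"
aws s3 sync ~/keep-ecdsa/persistence/ "s3://$BUCKETNAME" --delete && status="ran successfully"
echo "$(date) s3 backup job $status" >> /home/ubuntu/persistence_s3_copy.log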
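
Restoring is the same sync in the other direction, with the bucket as the source. The sketch below is an assumption about how a restore could look, not a tested recovery procedure: `restore_persistence` is a hypothetical name that wraps the standard `aws s3 sync` command, and it neither stops nor starts the node.

```shell
#!/usr/bin/env bash
# Sketch only: restore_persistence is a hypothetical helper name.
# Copies the backed-up persistence directory from the bucket back onto the machine.
restore_persistence() {
    local bucket="$1"
    # Destination defaults to the directory that the backup job syncs from
    local dest="${2:-$HOME/keep-ecdsa/persistence}"
    mkdir -p "$dest"
    # Without --delete, files already present in dest are left in place
    aws s3 sync "s3://$bucket" "$dest"
}

# Example:
#   restore_persistence "$BUCKETNAME"
```

Trying this once on a scratch machine is a cheap way to confirm that the backup really contains what the node needs.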