- Python 3.7
- Pipenv (or plain pip)
- Docker
Create a CSV file in the same directory as the gist. The file should follow a format like this one:
```csv
key_location,host,user
~/.keys/jresendiz27-aws_exercises.pem,ec2-18-234-125-130.compute-1.amazonaws.com,ubuntu
~/.keys/jresendiz27-aws_exercises.pem,ec2-100-26-188-98.compute-1.amazonaws.com,ubuntu
~/.keys/jresendiz27-aws_exercises.pem,ec2-54-175-170-91.compute-1.amazonaws.com,ubuntu
```
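The script presumably parses this file before opening any SSH session. A minimal sketch of reading the format with the standard library (the inline sample stands in for the real file, whose name depends on how you save it):

```python
import csv
import io

# Inline sample matching the CSV format above; the real script
# would open the CSV file from disk instead of a string.
sample = """key_location,host,user
~/.keys/jresendiz27-aws_exercises.pem,ec2-18-234-125-130.compute-1.amazonaws.com,ubuntu
"""

# DictReader uses the header row, so each row becomes a dict with
# the keys key_location, host and user.
servers = list(csv.DictReader(io.StringIO(sample)))

for server in servers:
    print(f"{server['user']}@{server['host']} (key: {server['key_location']})")
```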
Run the Prometheus Pushgateway so the script has somewhere to push metrics and Prometheus can scrape them:

```shell
docker run -p 9091:9091 prom/pushgateway
```
Configure the `static_configs` section of Prometheus to scrape the Pushgateway, then run it:

```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'JOB_NAME'
    scrape_interval: 1m
    honor_labels: true
    static_configs:
      - targets: ['PUSH_GATEWAY_CONTAINER_IP:PORT']
```
For example, using Docker:

```shell
docker run -p 9090:9090 -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
```
Install the following requirements for Python 3.7 from the Pipfile:

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
requests = "==2.23.0"
prometheus_client = "==0.7.1"
aiohttp = "==3.6.2"

[requires]
python_version = "3.7"
```
Or via pip:

```shell
pip install requests==2.23.0
pip install prometheus_client==0.7.1
pip install aiohttp==3.6.2
```
Run the Python script:

```shell
python collect_metrics_from_several_servers.py
```
You can now see the metrics in Prometheus.
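The pushing side of the script boils down to registering gauges and sending the registry to the Pushgateway with `prometheus_client`. A minimal sketch, where the metric name, label, and value are illustrative placeholders (the real script defines its own from the data gathered over SSH):

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest
from prometheus_client import push_to_gateway

registry = CollectorRegistry()

# Hypothetical gauge; the actual script picks its own metric names.
cpu_usage = Gauge(
    "server_cpu_usage_percent",
    "CPU usage collected over SSH",
    ["host"],
    registry=registry,
)

# The value would normally come from the per-host SSH session.
cpu_usage.labels(host="ec2-18-234-125-130.compute-1.amazonaws.com").set(12.5)

# Inspect the exposition format locally before pushing.
print(generate_latest(registry).decode())

# Uncomment once the Pushgateway from the previous step is running:
# push_to_gateway("localhost:9091", job="JOB_NAME", registry=registry)
```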
A naive way of checking that the script works as expected is to retrieve as much information as possible and cross-check it with another agent, or with `top`, to validate the values the script reports.
Considering the current script, there are many improvements we could make. First of all, avoid using SSH connections to extract the information per process. A better approach would be to create a custom agent (or use an existing one, such as a Prometheus exporter) and install it on each instance as a daemon, so the configured client just sends the required information instead of opening and closing an SSH connection every time we want to retrieve it.
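For instance, the stock `node_exporter` already covers most per-host metrics, and Prometheus would then scrape the instances directly instead of relying on pushes. A sketch of the extra scrape job, assuming `node_exporter` runs on its default port 9100 on each host from the CSV:

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'ec2-18-234-125-130.compute-1.amazonaws.com:9100'
          - 'ec2-100-26-188-98.compute-1.amazonaws.com:9100'
          - 'ec2-54-175-170-91.compute-1.amazonaws.com:9100'
```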
Some additional metrics I'd add would be the number of SSH connections (in case the server is able to report them), disk usage, pending security patches, and extra information about the cloud provider (instance id, security groups, etc.).
I'd also consider adding alerts for memory usage, CPU usage, and disk usage.
To handle this case, I would add a memory-monitoring alert on each instance that triggers based on a threshold; this gives a first view of how memory is being managed (and also implies installing a monitoring agent). The quickest way of handling it, if the involved service is stateless and can handle service recovery (e.g., via queues, a circuit breaker, a load balancer, etc.), is restarting the service based on the configured alert.
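As a sketch, such a threshold-based memory alert could be expressed as a Prometheus alerting rule. The metric names assume `node_exporter` is installed; the 90% threshold and the 5-minute window are placeholders to be tuned:

```yaml
groups:
  - name: memory
    rules:
      - alert: HighMemoryUsage
        # Fires when less than 10% of memory has been available for 5 minutes.
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
```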
Configuring swap memory is another option, although it will degrade performance.
There are also many tools, specific to the service's language, for analyzing the process's memory; these can be useful too.