sudo systemctl status -l -n 20 prometheus.service
[CA2 yzhong@ca2-p1v01-mon4-0001 ~]$ sudo systemctl status -l -n 20 prometheus.service
● prometheus.service - Prometheus
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-08-09 17:04:21 UTC; 4 days ago
  Process: 16654 ExecReload=/bin/kill -s HUP $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 15929 (prometheus)
   CGroup: /system.slice/prometheus.service
           └─15929 /opt/prometheus/prometheus/prometheus --config.file /opt/prometheus/prometheus/prometheus.yml --storage.tsdb.path=/srv/prometheus/prometheus --storage.tsdb.retention=120d --web.listen-address=0.0.0.0:9090
Aug 13 21:19:42 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:42.797Z caller=manager.go:513 component="rule manager" group=HoustonDashboard msg="Evaluating rule failed" rule="alert: HoustonDashboardBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"houston-dashboard-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"houston-dashboard-production_[0-9]+\"}[1m]))))\n > 0.05\nfor: 10m\nlabels:\n severity: high\nannotations:\n description: The HAProxy backend `houston-dashboard` is responding with errors at\n a rate at or above 5.0%. An HAProxy backend is a set of HTTP servers tasked with\n answering requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"houston-dashboard-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: HoustonDashboard - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:42 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:42.809Z caller=manager.go:513 component="rule manager" group=Springfield msg="Evaluating rule failed" rule="alert: SpringfieldBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"springfield-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"springfield-production_[0-9]+\"}[1m]))))\n > 0.02\nfor: 5m\nlabels:\n severity: medium\nannotations:\n description: The HAProxy backend `springfield` is responding with errors at a rate\n at or above 2.0%. An HAProxy backend is a set of HTTP servers tasked with answering\n requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"springfield-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: Springfield - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:42 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:42.824Z caller=manager.go:513 component="rule manager" group=ParisProviderHotelbeds msg="Evaluating rule failed" rule="alert: ParisProviderHotelbedsBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"paris-provider-hotelbeds-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"paris-provider-hotelbeds-production_[0-9]+\"}[1m]))))\n > 0.02\nfor: 5m\nlabels:\n severity: medium\nannotations:\n description: The HAProxy backend `paris-provider-hotelbeds` is responding with errors\n at a rate at or above 2.0%. An HAProxy backend is a set of HTTP servers tasked\n with answering requests to a given service. A high error rate can mean that the\n corresponding servers are unresponsive, or that one or more is responding with\n a high rate of error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"paris-provider-hotelbeds-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: ParisProviderHotelbeds - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:42 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:42.355Z caller=manager.go:513 component="rule manager" group=CasablancaTrackingStore msg="Evaluating rule failed" rule="alert: CasablancaTrackingStoreBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"tracking-store-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"tracking-store-production_[0-9]+\"}[1m]))))\n > 0.05\nfor: 10m\nlabels:\n severity: high\nannotations:\n description: The HAProxy backend `tracking-store` is responding with errors at a\n rate at or above 5.0%. An HAProxy backend is a set of HTTP servers tasked with\n answering requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"tracking-store-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: CasablancaTrackingStore - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:43 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:42.455Z caller=manager.go:513 component="rule manager" group=BudapestDashboard msg="Evaluating rule failed" rule="alert: BudapestDashboardBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"budapest-dashboard-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"budapest-dashboard-production_[0-9]+\"}[1m]))))\n > 0.02\nfor: 5m\nlabels:\n severity: medium\nannotations:\n description: The HAProxy backend `budapest-dashboard` is responding with errors\n at a rate at or above 2.0%. An HAProxy backend is a set of HTTP servers tasked\n with answering requests to a given service. A high error rate can mean that the\n corresponding servers are unresponsive, or that one or more is responding with\n a high rate of error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"budapest-dashboard-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: BudapestDashboard - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:43 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:43.209Z caller=manager.go:513 component="rule manager" group=Calgary msg="Evaluating rule failed" rule="alert: CalgaryBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"calgary-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"calgary-production_[0-9]+\"}[1m]))))\n > 0.01\nfor: 5m\nlabels:\n severity: low\nannotations:\n description: The HAProxy backend `calgary` is responding with errors at a rate at\n or above 1.0%. An HAProxy backend is a set of HTTP servers tasked with answering\n requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"calgary-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: Calgary - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:43 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:43.341Z caller=manager.go:513 component="rule manager" group=Hayakawa msg="Evaluating rule failed" rule="alert: HayakawaBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"hayakawa-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"hayakawa-production_[0-9]+\"}[1m]))))\n > 0.05\nfor: 10m\nlabels:\n severity: high\nannotations:\n description: The HAProxy backend `hayakawa` is responding with errors at a rate\n at or above 5.0%. An HAProxy backend is a set of HTTP servers tasked with answering\n requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"hayakawa-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: Hayakawa - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:19:43 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:19:43.779Z caller=manager.go:513 component="rule manager" group=Atlanta msg="Evaluating rule failed" rule="alert: AtlantaBackendErrorRate\nexpr: ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"atlanta-production_[0-9]+\",code=\"5xx\"}[1m])))\n / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~\"atlanta-production_[0-9]+\"}[1m]))))\n > 0.01\nfor: 5m\nlabels:\n severity: low\nannotations:\n description: The HAProxy backend `atlanta` is responding with errors at a rate at\n or above 1.0%. An HAProxy backend is a set of HTTP servers tasked with answering\n requests to a given service. A high error rate can mean that the corresponding\n servers are unresponsive, or that one or more is responding with a high rate of\n error. Consider reviewing the metric `rate(haproxy_server_response_errors_total{backend=~\"atlanta-production_[0-9]+\"}[1m])`;\n It shows a disaggregated view of the HTTP error rates for the relevant application\n servers.\n summary: Atlanta - HAProxy back-end error rate\n" err="query timed out in expression evaluation"
Aug 13 21:20:17 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=error ts=2019-08-13T21:20:17.093Z caller=notifier.go:528 component=notifier alertmanager=http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts count=1 msg="Error sending alert" err="Post http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts: context deadline exceeded"
Aug 13 21:20:21 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:20:20.874Z caller=manager.go:513 component="rule manager" group=Lexington msg="Evaluating rule failed" rule="alert: LexingtonShoppingFailureRate\nexpr: ((sum(increase(app_lexington_shopping_failure{env=\"production\",server=\"lexington\"}[30m])))\n / (sum(increase(app_lexington_shopping_success{env=\"production\",server=\"lexington\"}[30m]))))\n > 5\nfor: 2m\nlabels:\n severity: high\nannotations:\n description: |-\n There are several possible causes of this:\n\n Providers:\n\n If a deployment of provider integration coincides with a spike in errors, it may be worth rolling back that deployment.\n If several providers experienced a spike in errors at the same time, which coincided with a deployment of Lexington, there may be a helpers incompatibility between servers. It may be worth rolling back Lexington in that case.\n\n POS:\n\n If the POS requests seem to be failing for multiple providers, and none have recently been deployed, there may be an issue with relevant configurations in the Dashboard.\n If the POS requests appear clustered in a single system, there may be a problem with that service specifically. If it has not been deployed recently, it may indicate an issue with the third-party integration, and require escalation.\n summary: Ratio of failed over successful requests is more than 5% over 30m\n" err="query timed out in expression evaluation"
Aug 13 21:20:24 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:20:24.338Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:20:25 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:20:25.141Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:20:27 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:20:27.546Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:20:28 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=error ts=2019-08-13T21:20:28.380Z caller=notifier.go:528 component=notifier alertmanager=http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts count=2 msg="Error sending alert" err="Post http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts: context deadline exceeded"
Aug 13 21:20:39 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=error ts=2019-08-13T21:20:39.709Z caller=notifier.go:528 component=notifier alertmanager=http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts count=4 msg="Error sending alert" err="Post http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts: context deadline exceeded"
Aug 13 21:21:17 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:21:16.080Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:21:21 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:21:21.743Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:21:22 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=warn ts=2019-08-13T21:21:22.815Z caller=manager.go:554 component="rule manager" group=Stockholm msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Aug 13 21:21:28 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=error ts=2019-08-13T21:21:28.474Z caller=notifier.go:528 component=notifier alertmanager=http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts count=1 msg="Error sending alert" err="Post http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts: context deadline exceeded"
Aug 13 21:21:39 ca2-p1v01-mon4-0001.ca2.internal.zone prometheus[15929]: level=error ts=2019-08-13T21:21:39.491Z caller=notifier.go:528 component=notifier alertmanager=http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts count=1 msg="Error sending alert" err="Post http://ca2-p1v01-mon4-0002.ca2.internal.zone:9093/api/v1/alerts: context deadline exceeded"
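For readability, the escaped rule text in the first "Evaluating rule failed" warning above corresponds to an alerting rule along these lines. This is a reconstruction from the log line only; the surrounding `groups:`/`rules:` structure and the exact line wrapping in the rule file are assumptions (the log shows the group name `HoustonDashboard`, not the file layout), and the description is abbreviated:

```yaml
# Reconstructed from the "Evaluating rule failed" log entry; file layout is assumed.
groups:
  - name: HoustonDashboard
    rules:
      - alert: HoustonDashboardBackendErrorRate
        expr: |
          ((sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~"houston-dashboard-production_[0-9]+",code="5xx"}[1m])))
            / (sum by(instance) (rate(haproxy_backend_http_responses_total{backend=~"houston-dashboard-production_[0-9]+"}[1m]))))
            > 0.05
        for: 10m
        labels:
          severity: high
        annotations:
          summary: HoustonDashboard - HAProxy back-end error rate
          description: >-
            The HAProxy backend `houston-dashboard` is responding with errors at
            a rate at or above 5.0%. (Full text in the log entry above.)
```

All of the failing rules follow this same ratio-of-rates pattern, and each fails with `err="query timed out in expression evaluation"`, which points at the queries being too slow to evaluate within the rule-evaluation timeout rather than at the rules themselves being malformed.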