Testing Prometheus alerts - Wed, Feb 16, 2022
Testing Prometheus alerts in a restricted environment
The problem
The Kubernetes clusters in my last project were hosted in a very restricted environment. That is: only pull access to certain GitLab and Artifactory repositories was allowed, and there was virtually no connection to the outside world.
The cluster was equipped with a Prometheus Operator including Alertmanager. That setup allowed us to deploy our alerting rules and Alertmanager configuration as Custom Resources like this:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  labels:
    alertmanagerConfig: example
spec:
  route:
    ...
  receivers:
  - name: 'monitoring-system'
    webhookConfigs:
    - url: http://monitoring-system:8080
      sendResolved: true
As you can see, we used Webhooks as the receiver for our alerts. Since the environments were restricted, the question arose: how could we test whether our alerts were firing correctly in the dev and staging environments?
The solution: A webhook receiver that logs to stdout
The solution we found easiest was to deploy a very simple Python-based HTTP server that receives the alerts and writes them to stdout as JSON. This approach had the following benefits:
- Deployment could be done with GitOps and Helm like the rest of the components
- Alerts could easily be seen in the cluster by using kubectl logs
- Since Vector was used to send container logs to ELK, we could run queries to find out whether all expected alerts had been fired
- By using Helm's condition fields we could easily enable / disable the receiver for certain environments (see the sketch after this list)
- Using a Service and the appropriate Network Policy, we could use a DNS name in the AlertManager configuration
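To give an idea of the per-environment toggle, here is a minimal sketch; the chart name test-webhook-receiver and the value testWebhookReceiver.enabled are assumptions, not the project's actual names:
# Chart.yaml of the umbrella chart (sketch, names are assumptions)
dependencies:
- name: test-webhook-receiver
  version: 0.1.0
  repository: "file://charts/test-webhook-receiver"
  # the sub-chart is only rendered when this value is true
  condition: testWebhookReceiver.enabled

# values-dev.yaml / values-staging.yaml (sketch)
testWebhookReceiver:
  enabled: true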
Receiver implementation and deployment
The implementation of the receiver is actually quite simple and was coded as part of my weekly Coding Katas. Here are the most interesting lines:
...
logging.basicConfig(
    level=logging.DEBUG,
    format='{"timestamp": "%(asctime)s", "alert": %(message)s}',
    handlers=[logging.StreamHandler(sys.stdout)]
)
...
def do_POST(self):
    self.send_response(200)
    self.end_headers()
    data = self.rfile.read(int(self.headers['Content-Length']))
    logger.debug(data.decode("utf-8"))
The logger configuration makes the alert payload (which is JSON) part of a log message, and in the do_POST implementation the alert's payload is simply logged.
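Put together, a minimal, self-contained sketch of such a receiver could look like this (this is not the exact kata code; the main block and the logger name are assumptions):
import logging
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Emit each alert payload as a JSON log line on stdout
logging.basicConfig(
    level=logging.DEBUG,
    format='{"timestamp": "%(asctime)s", "alert": %(message)s}',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)  # logger name is an assumption


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Acknowledge the webhook call ...
        self.send_response(200)
        self.end_headers()
        # ... and log the raw alert payload sent by Alertmanager
        data = self.rfile.read(int(self.headers['Content-Length']))
        logger.debug(data.decode("utf-8"))


if __name__ == "__main__":
    HTTPServer(('', 8080), AlarmHandler).serve_forever()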
A basic test can be done using curl with an example alert:
curl -X POST -H "Content-Type: application/json" -d @tests/sample_request.json http://localhost:8080
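The contents of sample_request.json are not shown here; for illustration only, a minimal request body following Alertmanager's webhook payload format (version 4) could look roughly like this (all values are made up):
{
  "version": "4",
  "status": "firing",
  "receiver": "monitoring-system",
  "groupLabels": {},
  "commonLabels": { "alertname": "ExampleAlert" },
  "commonAnnotations": {},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "ExampleAlert" },
      "annotations": {},
      "startsAt": "2022-02-16T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}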
After exposing the Webhook receiver using a service and adding a Network Policy that allowed Ingress traffic from Prometheus, we could adjust our AlertManager configuration to send alerts to a DNS name:
webhookConfigs:
- url: http://test-webhook-receiver.myspace.svc.cluster.local:8080
  sendResolved: true
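For illustration, a sketch of what the Service and NetworkPolicy could look like; the pod label app: test-webhook-receiver and the name of the monitoring namespace are assumptions and need to be adapted to the actual setup:
apiVersion: v1
kind: Service
metadata:
  name: test-webhook-receiver
  namespace: myspace
spec:
  selector:
    app: test-webhook-receiver   # assumed pod label
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alertmanager-to-receiver
  namespace: myspace
spec:
  podSelector:
    matchLabels:
      app: test-webhook-receiver   # assumed pod label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring   # assumed namespace of Prometheus/Alertmanager
    ports:
    - protocol: TCP
      port: 8080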
As already explained, we could then check whether the alerts had been fired as part of the post-conditions of our tests, either by using kubectl logs or by querying the ELK stack (the latter could also be automated, so that the tests could run fully unattended).
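For example, assuming the receiver runs as a Deployment named test-webhook-receiver in the namespace myspace:
kubectl logs -n myspace deployment/test-webhook-receiver | grep ExampleAlert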
Problem solved.
Spin-off: Using sub processes in unit tests
An interesting question that arose while defining the tests for the receiver was how to start the receiver as part of my unit test fixtures. I opted to use the multiprocessing module in the setUpClass and tearDownClass hooks to start and stop the receiver:
...
def start_server(httpd):
    httpd.serve_forever()
...
@classmethod
def setUpClass(cls):
    httpd = HTTPServer(('', 8080), AlarmHandler)
    cls._process = Process(target=start_server, args=(httpd,))
    cls._process.start()
...
@classmethod
def tearDownClass(cls):
    cls._process.kill()
During setup I create a server using the AlarmHandler, which contains the do_POST method, and simply start the process. After all tests have run, I just kill the process.
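For completeness, here is a rough sketch of how a test against the receiver started this way could look; the module name receiver, the test name and the payload are illustrative assumptions:
import json
import unittest
import urllib.request
from multiprocessing import Process
from http.server import HTTPServer

# AlarmHandler is the request handler shown above; the module name is an assumption
from receiver import AlarmHandler


def start_server(httpd):
    httpd.serve_forever()


class AlarmHandlerTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Start the receiver in a separate process, as described above
        httpd = HTTPServer(('', 8080), AlarmHandler)
        cls._process = Process(target=start_server, args=(httpd,))
        cls._process.start()

    @classmethod
    def tearDownClass(cls):
        cls._process.kill()

    def test_receiver_accepts_alert(self):
        # Post a minimal, made-up alert payload and expect HTTP 200
        body = json.dumps({"status": "firing", "alerts": []}).encode("utf-8")
        request = urllib.request.Request(
            "http://localhost:8080",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            self.assertEqual(response.status, 200)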
Although it may not be the most sophisticated method, it is quite simple and easy to understand. In one of my next posts I would like to explore whether that test can be made even simpler. Another thing that might come in handy with this kind of test setup would be to capture the logs in the test and do some assertions on their content.
Conclusion
Testing Prometheus alerting in a Kubernetes cluster using webhooks might be tricky, but it can be simplified by simulating a receiver which logs the alerts. The alerts can then be traced either by looking at the logs directly or by querying them in the logging system (e.g. ELK).