Testing Prometheus alerts - Wed, Feb 16, 2022
Testing Prometheus alerts in a restricted environment
The problem
The Kubernetes clusters in my last project were hosted in a very restricted environment. That is: only pull access to certain GitLab and Artifactory repositories was allowed, and there was virtually no connection to the outside world.
The cluster was equipped with a Prometheus Operator including Alertmanager. That setup allowed us to deploy our alerting rules and Alertmanager configuration as Custom Resources like this:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  labels:
    alertmanagerConfig: example
spec:
  route:
    ...
  receivers:
  - name: 'monitoring-system'
    webhookConfigs:
    - url: http://monitoring-system:8080
      sendResolved: true
As you can see, we used Webhooks as the receiver for our alerts. Since the environments were restricted, the question arose: how could we test whether our alerts were firing correctly in the dev and staging environments?
The solution: A webhook receiver that logs to stdout
The solution we found easiest was to deploy a very simple Python-based HTTP server that receives the alerts and writes them to stdout as JSON. This approach had the following benefits:
- Deployment could be done with GitOps and Helm like the rest of the components
- Alerts could easily be seen in the cluster by using kubectl logs
- Since Vector was used to send container logs to ELK, we could run queries to find out whether all expected alerts had been fired
- By using Helm's condition fields we could easily enable / disable the receiver for certain environments (see the sketch after this list)
- Using a Service and the appropriate Network Policy, we could use a DNS name in the AlertManager configuration
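To give an idea of the per-environment toggle, here is a minimal sketch; the chart name test-webhook-receiver and the value testWebhookReceiver.enabled are assumptions, not the project's actual names:
# Chart.yaml of the umbrella chart (sketch, names are assumptions)
dependencies:
- name: test-webhook-receiver
  version: 0.1.0
  repository: "file://charts/test-webhook-receiver"
  # the sub-chart is only rendered when this value is true
  condition: testWebhookReceiver.enabled

# values-dev.yaml / values-staging.yaml (sketch)
testWebhookReceiver:
  enabled: true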
Receiver implementation and deployment
The implementation of the receiver is actually quite simple and was coded as part of my weekly Coding Katas. Here are the most interesting lines:
...
logging.basicConfig(
    level=logging.DEBUG,
    format='{"timestamp": "%(asctime)s", "alert": %(message)s}',
    handlers=[logging.StreamHandler(sys.stdout)]
)
...
def do_POST(self):
    self.send_response(200)
    self.end_headers()
    data = self.rfile.read(int(self.headers['Content-Length']))
    logger.debug(data.decode("utf-8"))
The logger configuration makes the alert payload (which is JSON) part of a log message, and in the do_POST implementation the alert's payload is simply logged.
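Put together, a minimal, self-contained sketch of such a receiver could look like this (this is not the exact kata code; the main block and the logger name are assumptions):
import logging
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Emit each alert payload as a JSON log line on stdout
logging.basicConfig(
    level=logging.DEBUG,
    format='{"timestamp": "%(asctime)s", "alert": %(message)s}',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)  # logger name is an assumption


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Acknowledge the webhook call ...
        self.send_response(200)
        self.end_headers()
        # ... and log the raw alert payload sent by Alertmanager
        data = self.rfile.read(int(self.headers['Content-Length']))
        logger.debug(data.decode("utf-8"))


if __name__ == "__main__":
    HTTPServer(('', 8080), AlarmHandler).serve_forever()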
A basic test can be done using curl with an example alert:
curl -X POST -H "Content-Type: application/json" -d @tests/sample_request.json http://localhost:8080
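The contents of sample_request.json are not shown here; for illustration only, a minimal request body following Alertmanager's webhook payload format (version 4) could look roughly like this (all values are made up):
{
  "version": "4",
  "status": "firing",
  "receiver": "monitoring-system",
  "groupLabels": {},
  "commonLabels": { "alertname": "ExampleAlert" },
  "commonAnnotations": {},
  "externalURL": "http://alertmanager:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "ExampleAlert" },
      "annotations": {},
      "startsAt": "2022-02-16T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}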
After exposing the Webhook receiver using a service and adding a Network Policy that allowed Ingress traffic from Prometheus, we could adjust our AlertManager configuration to send alerts to a DNS name:
webhookConfigs:
- url: http://test-webhook-receiver.myspace.svc.cluster.local:8080
  sendResolved: true
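For illustration, a sketch of what the Service and NetworkPolicy could look like; the pod label app: test-webhook-receiver and the name of the monitoring namespace are assumptions and need to be adapted to the actual setup:
apiVersion: v1
kind: Service
metadata:
  name: test-webhook-receiver
  namespace: myspace
spec:
  selector:
    app: test-webhook-receiver   # assumed pod label
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alertmanager-to-receiver
  namespace: myspace
spec:
  podSelector:
    matchLabels:
      app: test-webhook-receiver   # assumed pod label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring   # assumed namespace of Prometheus/Alertmanager
    ports:
    - protocol: TCP
      port: 8080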
As already explained, we could then check whether the alerts had been fired as part of the post-conditions of our tests, either by using kubectl logs or by querying the ELK stack (the latter could also be automated, so that the tests could run fully unattended).
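For example, assuming the receiver runs as a Deployment named test-webhook-receiver in the namespace myspace:
kubectl logs -n myspace deployment/test-webhook-receiver | grep ExampleAlert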
Problem solved.
Spin-off: Using sub processes in unit tests
An interesting question that arose while defining the tests for the receiver was how to start the receiver as part of my unit test fixtures. I opted to use the multiprocessing module in the setUpClass and tearDownClass hooks to start and stop the receiver:
...
def start_server(httpd):
    httpd.serve_forever()
...
@classmethod
def setUpClass(cls):
    httpd = HTTPServer(('', 8080), AlarmHandler)
    cls._process = Process(target=start_server, args=(httpd,))
    cls._process.start()
...
@classmethod
def tearDownClass(cls):
    cls._process.kill()
During setup I create a server using the AlarmHandler, which contains the do_POST method, and simply start the process. After all tests have run, I just kill the process.
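For completeness, here is a rough sketch of how a test against the receiver started this way could look; the module name receiver, the test name and the payload are illustrative assumptions:
import json
import unittest
import urllib.request
from multiprocessing import Process
from http.server import HTTPServer

# AlarmHandler is the request handler shown above; the module name is an assumption
from receiver import AlarmHandler


def start_server(httpd):
    httpd.serve_forever()


class AlarmHandlerTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Start the receiver in a separate process, as described above
        httpd = HTTPServer(('', 8080), AlarmHandler)
        cls._process = Process(target=start_server, args=(httpd,))
        cls._process.start()

    @classmethod
    def tearDownClass(cls):
        cls._process.kill()

    def test_receiver_accepts_alert(self):
        # Post a minimal, made-up alert payload and expect HTTP 200
        body = json.dumps({"status": "firing", "alerts": []}).encode("utf-8")
        request = urllib.request.Request(
            "http://localhost:8080",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            self.assertEqual(response.status, 200)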
Although it may not be the most sophisticated method, it is quite simple and easy to understand. In one of my next posts I would like to explore whether that test can be made even simpler. Another thing that might come in handy with this kind of test setup would be to capture the logs in the test and do some assertions on their content.
Conclusion
Testing Prometheus alerting in a Kubernetes cluster using webhooks might be tricky, but it can be simplified by simulating a receiver which logs the alerts. The alerts can then be traced either by looking at the logs directly or by querying them in the logging system (e.g. ELK).