Analyse data of CI/CD pipelines - Fri, Oct 21, 2022
Data generated by CI/CD processes can be valuable for improving CI/CD itself. In a recent post I looked at how to easily collect data from CI/CD pipeline runs in GitLab by simply appending a special job to the pipeline. In this post I make a first attempt at analyzing the data from these pipelines. I recommend reading the aforementioned post for some background information.
The use case
The first basic use case is to “find long-running jobs”. To implement it we need a query that returns the average and maximum execution time of each job. By comparing the average against the maximum execution time we can easily find jobs that occasionally run longer than expected. That insight could be used to spot potential problems with those jobs.
For this first example I use a local MongoDB database on my machine. All the steps could also be put into a scheduled CI/CD job, so that the analysis runs regularly.
The basic steps for this use case to work are:
- Download the collected pipeline data
- Import the data into a MongoDB database
- Run a query that selects the average execution time for jobs
- Run a query that selects the maximum execution time for jobs
- Compare the max execution time with the average execution time
Setting up the database
First of all we need a MongoDB database. I opted to use the official Docker image, which can be spun up like this:
docker run --name mongo -p 27017:27017 -d mongo:latest
Note that I expose the default port 27017 so that the database can be accessed from the host.
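If mongosh is installed on the host, a quick connectivity check could look like this (just a sanity check, assuming the port mapping above):
mongosh "mongodb://localhost:27017" --quiet --eval 'db.runCommand({ ping: 1 })'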
Next we need to import the pipeline data into the database. Since this data is stored in GitLab’s internal package registry, it can easily be downloaded using GitLab’s packages API:
curl -o pipelines.json --header "PRIVATE-TOKEN: <PRIVATE_ACCESS_TOKEN>" "https://gitlab.com/api/v4/projects/<PROJECT_ID>/packages/generic/ci-data/1.0.0/pipelines.json"
That file now needs to be imported into the MongoDB database. For that I use mongoimport:
mongoimport --db=gitlab --collection=pipelines --file=${PWD}/pipelines.json
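Depending on how the collection job wrote the file (newline-delimited JSON documents vs. a single JSON array), mongoimport may additionally need the --jsonArray flag; assuming the array variant:
mongoimport --db=gitlab --collection=pipelines --jsonArray --file=${PWD}/pipelines.json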
After that the data should be accessible in the pipelines collection.
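A quick way to verify the import is to count the documents, for example (again assuming mongosh on the host):
mongosh "mongodb://localhost:27017/gitlab" --quiet --eval 'db.pipelines.countDocuments()'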
Developing the query
The document structure of a pipeline, for which we develop the query, basically looks like this:
+--------------+
|              |
| - project_id |
|              |
+------+-------+
       | 1
       |          +----------------+
       |          |                |
       |   jobs   | - name         |
       +--------->| - started_at   |
              n   |                |
                  | - finished_at  |
                  |                |
                  +----------------+
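To make the structure more concrete, a single (heavily trimmed) document in the collection might look roughly like this; the values are purely illustrative:
{
  "project_id": 38561833,
  "jobs": [
    {
      "name": "build-job",
      "started_at": "2022-10-21T10:15:03.123Z",
      "finished_at": "2022-10-21T10:15:45.456Z"
    }
  ]
}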
The root document is the pipeline and contains all the attributes of the corresponding API object. The jobs that constitute the pipeline are nested under jobs and contain the dates started_at and finished_at that we need for calculating the execution time of a job. Note that the pipelines collection could contain pipelines from different GitLab projects.
Therefore, in order to retrieve the average execution time for a specific job per project, we need to group the documents on the project_id field and the job name. We can then use the $avg operator to calculate the average execution time. Note that in order for that to work we first need to $unwind the jobs. Let’s look at the most important parts (the whole query can be found here):
db.pipelines.aggregate([
  { $unwind: "$jobs" },
  { $group: {
      _id: {
        "project": "$project_id",
        "jobName": "$jobs.name"
      },
      averageExecutionTime: {
        $avg: {
          $dateDiff: {
            startDate: {
              "$dateFromString": {
                "dateString": "$jobs.started_at",
                "format": "%Y-%m-%dT%H:%M:%S.%LZ"
              }
            },
            endDate: {
              "$dateFromString": {
                "dateString": "$jobs.finished_at",
                "format": "%Y-%m-%dT%H:%M:%S.%LZ"
              }
            },
            unit: "second"
            ...
The date parsing and conversion is necessary because the dates are stored as strings. The result looks something like this:
[
  {
    "_id": { "project": 38561833, "jobName": "build-job" },
    "averageExecutionTime": 43
  }
]
Getting the max value is easy; we just replace the aggregation operator $avg with $max:
...
maxExecutionTime: {
  $max: {
    $dateDiff: {
      startDate: {
        "$dateFromString": {
          ...
[
  {
    "_id": { "project": 38561833, "jobName": "build-job" },
    "maxExecutionTime": 72
  }
]
We could now dig further into the data by also displaying the job id or pipeline id in order to find out what the actual reasons for the long execution times were.
Note that this query is just a basic proof that this kind of data analysis works for detecting anomalies.
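To sketch the comparison step from the list at the beginning, both aggregations could also be combined into a single pipeline and the result filtered on the ratio of max to average. This is only a sketch building on the query above, not taken from the full query linked earlier; the factor of 2 is an arbitrary example threshold:
db.pipelines.aggregate([
  { $unwind: "$jobs" },
  { $group: {
      _id: { "project": "$project_id", "jobName": "$jobs.name" },
      averageExecutionTime: { $avg: { $dateDiff: {
        startDate: { $dateFromString: { dateString: "$jobs.started_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        endDate: { $dateFromString: { dateString: "$jobs.finished_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        unit: "second"
      } } },
      maxExecutionTime: { $max: { $dateDiff: {
        startDate: { $dateFromString: { dateString: "$jobs.started_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        endDate: { $dateFromString: { dateString: "$jobs.finished_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        unit: "second"
      } } }
  } },
  // keep only jobs whose slowest run took more than twice the average execution time
  { $match: { $expr: { $gt: [ "$maxExecutionTime", { $multiply: [ "$averageExecutionTime", 2 ] } ] } } }
])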
Conclusion
Building on my last posts on this topic, I showed how to do some basic data mining on the collected pipeline data in a lightweight fashion. Although the example may seem trivial and lacks some final touches, it nevertheless demonstrates that useful insights, such as finding long-running jobs, can be extracted with very little effort.