Collect data of a CI/CD pipeline run - Fri, Sep 23, 2022
In one of my recent posts I wrote about the importance of asserting or evaluating the results of CI/CD artifacts (traces, reports, etc.). In this post I take that thought a step further and consider the data generated by CI/CD processes a valuable asset for gaining insight into the overall software delivery process. In addition, the post contains the first steps in collecting this data.
Please note that I mostly use the GitLab CI/CD vocabulary here, since it is the main tool I currently use.
Why CI/CD data is important
The key idea here is that the data generated by CI/CD processes is just as valuable for improving the overall SDLC as production data is for improving products and services. While tools like GitLab analytics already offer insights into some aspects of an SDLC, such as value streams, this approach focuses more on the technical level. Let me illustrate that by giving some examples:
- Detect anomalies in reports such as unit test results, for example a sudden decrease in the number of test cases
- Find build jobs that run ever longer over time, which might indicate a problem with the build itself
- Search for new warnings in job traces that might indicate a forthcoming problem or deprecation
- Correlate results and traces from different jobs, pipelines or projects to find duplications or related problems
When thinking about all the traces and artifacts, a lot more use cases come to mind. These days the application of ML is an obvious one as well, since these technologies are commonly used to analyse data like the one described here.
Starting with a small use case
To start, I pick a small use case from the list above:
Find build jobs that run ever longer over time
In this post I will show an example of how to collect pipeline data for the same project over multiple pipeline runs. In the next post I will take a look at how to analyse the data. Some non-functional requirements for the implementation are:
- It should be easy to use
- It should only need the CI/CD tooling to run (GitLab in this case)
How it works
The major influence on the implementation is that it should be simple and easy to use for this first use case. In particular, not relying on external systems was important. The following is a description of how it works:
- At the end of a pipeline, a special stage with a collect job is added. This job collects all jobs that constitute the pipeline as well as their traces
- The collected data is stored in a MongoDB database which runs as a service of the collect job
- After the job completes, the content of the database is exported and put into the package registry of the corresponding project.
- On subsequent runs of the collect job, the data from previous pipeline runs is loaded from the registry before the data of the current pipeline run is added. Thus a timeline of pipeline runs is created
+-------+     +-------+     +-----------+
| job a +---->+ job b +---->+collect job|
+-------+     +-------+     +-----+-----+
                                  |
   pipeline run                   | store data
                             +----v---+
                             |package |
                             |registry|
                             +--------+
Here are the key points achieved with that design:
- Statelessness: Since the lifetime of the database is limited to the run of the collect job and the data is stored in the internal package registry, the job is completely stateless.
- No reliance on external services: As a result of its statelessness, no connectivity to an external service is necessary
- Easy to add and remove: Data collection can be added by simply adding the job definition to the pipeline, as sketched right below
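For illustration, pulling the collect job into an existing pipeline could look roughly like this. This is only a sketch: the stage name analysis matches the job definition shown further below, while the file name ci/collect-ci-data-job.yml is a hypothetical location for that job definition:

# Hypothetical .gitlab-ci.yml excerpt
stages:
  - build
  - test
  - analysis                            # extra stage that only contains the collect job

include:
  - local: ci/collect-ci-data-job.yml   # hypothetical file holding the collect job definition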
Let’s take a look into the implementation details:
The collect job
The collect job is written in Python and uses the python-gitlab package to retrieve the data. The script first tries to access the project and pipeline it belongs to by using the predefined CI/CD variables:
import os
import gitlab
# Assumption: the client authenticates with a token provided via a CI/CD variable (GITLAB_TOKEN)
gl = gitlab.Gitlab(os.getenv("CI_SERVER_URL"), private_token=os.getenv("GITLAB_TOKEN"))
project_id = os.getenv("CI_PROJECT_ID")
pipeline_id = os.getenv("CI_PIPELINE_ID")
project = gl.projects.get(id=project_id)
pipeline = project.pipelines.get(id=pipeline_id)
In the following loop all jobs of the pipeline and their traces are collected:
collected_jobs = []
for pipeline_job in pipeline.jobs.list(get_all=True):
    # The jobs returned when listing a pipeline are limited objects, so fetch the complete job
    complete_job = project.jobs.get(pipeline_job.id)
    job_as_dict = complete_job.asdict()
    # Only jobs that produced a trace artifact get their trace attached
    for artifact in complete_job.attributes["artifacts"]:
        if artifact["file_type"] == "trace":
            job_as_dict["trace"] = complete_job.trace().decode("utf-8")
    collected_jobs.append(job_as_dict)
The jobs and traces are then added to a pipeline document which is stored in the MongoDB database:
# "mongo_db" is assumed to be a pymongo database handle for the MongoDB service of the job
pipeline_as_dict = pipeline.asdict()
pipeline_as_dict["jobs"] = collected_jobs
mongo_db["pipelines"].insert_one(pipeline_as_dict)
At the end we have a new pipeline document in the pipelines collection, with jobs and traces as nested documents.
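To give an idea of the resulting structure, here is a shortened, hypothetical example of such a pipeline document (the field names follow the GitLab API, the values are made up):

{
  "id": 123456,
  "status": "success",
  "duration": 512,
  "created_at": "2022-09-23T10:00:00.000Z",
  "jobs": [
    {
      "name": "job a",
      "stage": "build",
      "status": "success",
      "duration": 348.3,
      "trace": "Running with gitlab-runner ..."
    }
  ]
}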
In later stages the script could be extended so that artifacts (like test reports) could be added to the pipeline collection as well.
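A minimal sketch of such an extension, placed inside the loop from above, could look like this. It assumes the producing job exposes a report file (here hypothetically named report.xml) via artifacts:paths, so that it can be fetched from the job's artifact archive:

    # Possible extension (sketch): also store a test report with each job.
    # "report.xml" is a hypothetical file name exposed via artifacts:paths.
    try:
        job_as_dict["test_report"] = complete_job.artifact("report.xml").decode("utf-8")
    except gitlab.exceptions.GitlabGetError:
        pass  # this job does not provide such an artifact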
The GitLab CI job definition
The job definition in the pipeline looks like this:
collect-ci-data-job:
  services:
    - name: mongo:latest
      alias: mongodb
  stage: analysis
  image: ghcr.io/toms-code-katas/gitlab-helpers/gitlab-trace-to-mongo:latest
  before_script:
    - 'curl --fail-with-body --silent --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" --output ${CI_PROJECT_DIR}/pipelines.json "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/${CI_PROJECT_NAME}-ci-data/1.0.0/pipelines.json"'
    - mongoimport --host="mongodb:27017" --db=gitlab --collection=pipelines --file=${CI_PROJECT_DIR}/pipelines.json
  script: |-
    echo "Starting pipeline analysis"
    python3 /app/gitlab_trace_to_mongo.py
  after_script:
    - mongoexport --host="mongodb:27017" --db gitlab --collection pipelines --out ${CI_PROJECT_DIR}/pipelines.json
    - 'curl --fail-with-body --silent --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" --upload-file ${CI_PROJECT_DIR}/pipelines.json "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/${CI_PROJECT_NAME}-ci-data/1.0.0/pipelines.json"'
  when: always
The image used is a small Alpine-based Python image containing the script as well as the tools mongoexport and mongoimport.
Note the services part, which creates an empty MongoDB instance at the start of the job.
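For illustration, here is a minimal sketch of how the script can reach that service from inside the job. It assumes the pymongo driver and uses the gitlab database name from the mongoimport call above; it is not a verbatim excerpt of the script:

import pymongo

# The MongoDB service is reachable via the "mongodb" alias from the job definition
client = pymongo.MongoClient("mongodb://mongodb:27017/")
db = client["gitlab"]
print(db.list_collection_names())  # e.g. ["pipelines"] once previous data has been imported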
The before_script and after_script definitions contain the import and export of the database from and to the package registry.
Since the script makes use of the predefined GitLab CI/CD variables, no additional parameters are needed.
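As a sketch (and an assumption, the actual script may handle this differently), a fail-fast check for the variables the script relies on could look like this:

import os

# Predefined CI/CD variables the collect script depends on
required = ["CI_SERVER_URL", "CI_PROJECT_ID", "CI_PIPELINE_ID"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Not running inside a GitLab CI job, missing variables: {missing}")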
An important aspect is that the job runs always (when: always), regardless of any of the prior jobs failing. That way it is ensured that the data of failing pipelines is captured as well.
Conclusion
In this post I looked at how data from CI/CD pipelines and processes can help improve the overall software delivery process. I also pointed out an easy and lightweight way of collecting this data for GitLab pipelines. In the next part I will take a look at how to do some basic analysis of the data.