Analyse data of CI/CD pipelines - Fri, Oct 21, 2022
Data generated by CI/CD processes can be valuable for improving CI/CD itself. In a recent post I looked at how to easily collect data from CI/CD pipeline runs in GitLab by simply appending a special job to the pipeline. In this post I make a first attempt at analyzing the data from these pipelines. I recommend reading the aforementioned post for some background information.
The use case
The first basic use case is to “find long-running jobs”. To implement it we need a query that returns the average and maximum execution time of each job. By comparing the average against the maximum execution time we can easily find jobs that occasionally run longer than expected. That insight could be used to spot potential problems with those jobs.
For this first example I use a local MongoDB database on my machine. All the steps could also be put into a scheduled CI/CD job, so that the analysis runs regularly.
The basic steps for this use case to work are:
- Download the collected pipeline data
- Import the data into a MongoDB database
- Run a query that selects the average execution time for jobs
- Run a query that selects the maximum execution time for jobs
- Compare the max execution time with the average execution time
Setting up the database
First of all we need a MongoDB database. I opted to use the official Docker image, which can be spun up like this:
docker run --name mongo -p 27017:27017 -d mongo:latest
Note that I expose the default port 27017 so that the database can be accessed from the host.
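If mongosh is installed on the host, a quick connectivity check could look like this (just a sanity check, assuming the port mapping above):
mongosh "mongodb://localhost:27017" --quiet --eval 'db.runCommand({ ping: 1 })'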
Next we need to import the pipeline data into the database. Since this data is stored in GitLab’s internal package registry, it can easily be downloaded using GitLab’s packages API:
curl -o pipelines.json --header "PRIVATE-TOKEN: <PRIVATE_ACCESS_TOKEN>" "https://gitlab.com/api/v4/projects/<PROJECT_ID>/packages/generic/ci-data/1.0.0/pipelines.json"
That file now needs to be imported into the MongoDB database. For that I use mongoimport:
mongoimport --db=gitlab --collection=pipelines --file=${PWD}/pipelines.json
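Depending on how the collection job wrote the file (newline-delimited JSON documents vs. a single JSON array), mongoimport may additionally need the --jsonArray flag; assuming the array variant:
mongoimport --db=gitlab --collection=pipelines --jsonArray --file=${PWD}/pipelines.json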
After that the data should be accessible in the pipelines collection.
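A quick way to verify the import is to count the documents, for example (again assuming mongosh on the host):
mongosh "mongodb://localhost:27017/gitlab" --quiet --eval 'db.pipelines.countDocuments()'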
Developing the query
The document structure of a pipeline, for which we develop the query, basically looks like this:
+--------------+
|              |
| - project_id |
|              |
+------+-------+
       | 1
       |          +----------------+
       |          |                |
       |   jobs   | - name         |
       +--------->| - started_at   |
              n   |                |
                  | - finished_at  |
                  |                |
                  +----------------+
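To make the structure more concrete, a single (heavily trimmed) document in the collection might look roughly like this; the values are purely illustrative:
{
  "project_id": 38561833,
  "jobs": [
    {
      "name": "build-job",
      "started_at": "2022-10-21T10:15:03.123Z",
      "finished_at": "2022-10-21T10:15:45.456Z"
    }
  ]
}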
The root document is the pipeline and contains all the attributes of the corresponding API object. The jobs that constitute the pipeline are nested under jobs and contain the dates started_at and finished_at that we need for calculating the execution time of a job. Note that the pipelines collection could contain pipelines from different GitLab projects.
Therefore, in order to retrieve the average execution time for a specific job per project, we need to group the documents on the project_id field and the job name. We can then use the $avg operator to calculate the average execution time. Note that in order for that to work we first need to $unwind the jobs. Let’s look at the most important parts (the whole query can be found here):
db.pipelines.aggregate([
  { $unwind: "$jobs" },
  { $group: {
      _id: {
        "project": "$project_id",
        "jobName": "$jobs.name"
      },
      averageExecutionTime: {
        $avg: {
          $dateDiff: {
            startDate: {
              "$dateFromString": {
                "dateString": "$jobs.started_at",
                "format": "%Y-%m-%dT%H:%M:%S.%LZ"
              }
            },
            endDate: {
              "$dateFromString": {
                "dateString": "$jobs.finished_at",
                "format": "%Y-%m-%dT%H:%M:%S.%LZ"
              }
            },
            unit: "second"
            ...
The date parsing and conversion is necessary because the dates are stored as strings. The result looks something like this:
[
  {
    "_id": { "project": 38561833, "jobName": "build-job" },
    "averageExecutionTime": 43
  }
]
Getting the max value is easy; we just replace the aggregation operator $avg with $max:
...
maxExecutionTime: {
  $max: {
    $dateDiff: {
      startDate: {
        "$dateFromString": {
          ...
[
  {
    "_id": { "project": 38561833, "jobName": "build-job" },
    "maxExecutionTime": 72
  }
]
We could now dig further into the data by also displaying the job id or pipeline id in order to find out what the actual reasons for the long execution times were.
Note that this query is just a basic proof that this kind of data analysis works for detecting anomalies.
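To sketch the comparison step from the list at the beginning, both aggregations could also be combined into a single pipeline and the result filtered on the ratio of max to average. This is only a sketch building on the query above, not taken from the full query linked earlier; the factor of 2 is an arbitrary example threshold:
db.pipelines.aggregate([
  { $unwind: "$jobs" },
  { $group: {
      _id: { "project": "$project_id", "jobName": "$jobs.name" },
      averageExecutionTime: { $avg: { $dateDiff: {
        startDate: { $dateFromString: { dateString: "$jobs.started_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        endDate: { $dateFromString: { dateString: "$jobs.finished_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        unit: "second"
      } } },
      maxExecutionTime: { $max: { $dateDiff: {
        startDate: { $dateFromString: { dateString: "$jobs.started_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        endDate: { $dateFromString: { dateString: "$jobs.finished_at", format: "%Y-%m-%dT%H:%M:%S.%LZ" } },
        unit: "second"
      } } }
  } },
  // keep only jobs whose slowest run took more than twice the average execution time
  { $match: { $expr: { $gt: [ "$maxExecutionTime", { $multiply: [ "$averageExecutionTime", 2 ] } ] } } }
])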
Conclusion
Building on my last posts on this topic, I showed how to do some basic data mining on the collected pipeline data in a lightweight fashion. Although the example may seem trivial and lacks some final touches, it nevertheless demonstrates that useful insights, such as finding long-running jobs, can be extracted with very little effort.