Improve software development with machine learning - Sun, Sep 18, 2022
Use machine learning to get better at software development
Getting better at software development with ML sounds a little strange, so what exactly does that mean? AI/ML is already widely used to analyze data so that software better meets the needs of customers. So what's new here? The key idea is to get better at the actual development and delivery of software with the help of ML, rather than to select the next feature based on customer clicks. In this blog I try to explain how that could work and outline the roadmap for a first small application of ML.
Invaluable data generated every day
I work for a big company with hundreds of teams and thousands of Git repositories. The GitLab CI/CD pipelines
of these repositories likely produce tens of thousands of build logs, traces, reports and other artifacts every day. Automatically analyzing this data can obviously yield valuable information, even to the point where the failure of a pipeline run can be predicted from knowledge gained from a similar project in the past.
But even on a smaller scale, for example within a single complex project, automatic analysis can be beneficial for the same reasons. GitLab analytics
already provides some insights into overall project performance, but is rather centered on visualizing value streams (e.g. how long it took to deploy feature x).
A first very basic use case
As I wrote in one of my recent posts
my interest in this topic stems from a very simple yet unpleasant experience: a build job failed without that being recognized, and so the pipeline continued. This first use case of ML-based CI/CD data analysis therefore has the goal of determining whether a job failed based solely on its trace. A function like this could have detected the failure, or could even have predicted it.
The setup I use for this use case is as follows:
- A Python script that collects CI/CD pipelines, their jobs and traces
- A MongoDB database for storing these pipeline documents (see the sketch after this list)
- TensorFlow for machine learning
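To make this concrete, a pipeline document could look roughly like the following. This is only a sketch; the field names are my assumption, not a fixed schema:

```python
# Sketch of a single pipeline document as it could be stored in MongoDB.
# Field names and values are illustrative assumptions, not a fixed schema.
pipeline_doc = {
    "project_id": 4711,              # hypothetical GitLab project id
    "pipeline_id": 1234,
    "ref": "main",
    "status": "failed",              # overall pipeline status from GitLab
    "jobs": [
        {
            "job_id": 5678,
            "name": "build",
            "status": "failed",      # the label we later train on
            "trace": "...full job log as text...",
        },
    ],
}
```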
Some smaller scripts and glue code are also part of this proof of concept. Here are the steps of the process:
1: Gather data
The first step is to get the data from GitLab into the MongoDB database. GitLab's pipelines API contains all the necessary endpoints for doing that. A Python script traverses a group structure, identified by its root group, collects all project pipelines, jobs and traces, and inserts a pipeline document into the database.
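A minimal sketch of such a collector, using the python-gitlab and pymongo libraries. The URL, token, root group id and database names are placeholder assumptions:

```python
# Sketch of the data-gathering script, using python-gitlab and pymongo.
# GITLAB_URL, TOKEN, ROOT_GROUP_ID and the database/collection names
# are placeholder assumptions.
import gitlab
from pymongo import MongoClient

GITLAB_URL = "https://gitlab.example.com"
TOKEN = "glpat-..."        # a personal access token with read_api scope
ROOT_GROUP_ID = 42         # hypothetical root group

gl = gitlab.Gitlab(GITLAB_URL, private_token=TOKEN)
collection = MongoClient("mongodb://localhost:27017")["cicd"]["pipelines"]

group = gl.groups.get(ROOT_GROUP_ID)
# include_subgroups=True traverses the whole group structure below the root
for group_project in group.projects.list(include_subgroups=True, iterator=True):
    project = gl.projects.get(group_project.id)  # fetch the full project
    for pipeline in project.pipelines.list(iterator=True):
        jobs = []
        for job in pipeline.jobs.list(iterator=True):
            full_job = project.jobs.get(job.id)  # pipeline jobs are read-only stubs
            jobs.append({
                "job_id": full_job.id,
                "name": full_job.name,
                "status": full_job.status,
                "trace": full_job.trace().decode("utf-8", errors="replace"),
            })
        collection.insert_one({
            "project_id": project.id,
            "pipeline_id": pipeline.id,
            "ref": pipeline.ref,
            "status": pipeline.status,
            "jobs": jobs,
        })
```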
2: Classify and export the data
Having the data in MongoDB already yields some value, since we can start querying the data for failed jobs, even over time. But for TensorFlow to use the data, it must be classified and exported to the file system. A simple Node-based data exporter writes the data in a format appropriate for TensorFlow.
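The exporter itself is Node-based, but as an illustration here is an equivalent sketch in Python. It writes each trace into a directory named after its label, a layout that TensorFlow's text_dataset_from_directory utility can read directly. Database and path names are assumptions:

```python
# Python sketch of the exporter's logic: write each job trace into
# data/<status>/ so TensorFlow can infer labels from the directory names.
# Database and path names are assumptions.
from pathlib import Path
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["cicd"]["pipelines"]

for pipeline in collection.find():
    for job in pipeline["jobs"]:
        if job["status"] not in ("success", "failed"):
            continue  # keep only the two classes we want to learn
        label_dir = Path("data") / job["status"]
        label_dir.mkdir(parents=True, exist_ok=True)
        (label_dir / f"{job['job_id']}.txt").write_text(job["trace"])
```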
3: Set up and train the model
Having the data available in the correct format is the starting point for letting TensorFlow learn about the traces. At the end we should have a model that can detect similar failed jobs from their traces.
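A minimal sketch of what this could look like with Keras: a TextVectorization layer turns traces into token ids, and a small embedding-based classifier learns to distinguish failed from successful traces. All hyperparameters here are placeholder choices, not tuned values:

```python
# Sketch of model setup and training. Vocabulary size, sequence length,
# embedding size and epochs are placeholder choices, not tuned values.
import tensorflow as tf

# Reads data/failed/*.txt and data/success/*.txt written by the exporter
train_ds = tf.keras.utils.text_dataset_from_directory(
    "data", validation_split=0.2, subset="training", seed=42, batch_size=32)
val_ds = tf.keras.utils.text_dataset_from_directory(
    "data", validation_split=0.2, subset="validation", seed=42, batch_size=32)

vectorize = tf.keras.layers.TextVectorization(
    max_tokens=20000, output_mode="int", output_sequence_length=500)
vectorize.adapt(train_ds.map(lambda text, label: text))

model = tf.keras.Sequential([
    vectorize,
    tf.keras.layers.Embedding(20000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # failed vs. success
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```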
4: Test the model against test data
And last but not least, the model needs to be tested: a new trace is evaluated and should be classified as failed.
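Continuing the sketch above, evaluation and a single prediction could look like this. The test directory and the trace string are made-up examples:

```python
# Continuing the training sketch: evaluate on a held-out directory and
# classify one new trace. With alphabetical directory ordering,
# class 0 = "failed" and class 1 = "success".
test_ds = tf.keras.utils.text_dataset_from_directory("test_data", batch_size=32)
loss, accuracy = model.evaluate(test_ds)
print(f"test accuracy: {accuracy:.2f}")

new_trace = "make: *** [Makefile:12: build] Error 2"  # made-up example log line
score = model.predict(tf.constant([new_trace]))[0][0]
print("failed" if score < 0.5 else "success")
```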
Conclusion
Building on some recent posts, I looked at how CI/CD data, automatically gathered and analyzed with machine learning, can reveal valuable information for improving the overall software development process. I outlined a proof of concept that I will build as the next step. In my next post I will take a deeper look into gathering the data and storing it in a MongoDB database.