A (very) basic text analysis with TensorFlow - Sun, Nov 13, 2022
In one of my recent posts I developed a basic mechanism for collecting the GitLab job logs/traces of CI/CD pipelines, with the aim of using this data as training data for a machine learning model.
In this post I describe the challenges I faced when making a first (and admittedly naive) attempt at creating an appropriate model.
First try: Pimp the tutorial
Inspired by the slogan that
TensorFlow is aimed at everyone from hobbyists to professional developers, to researchers pushing the boundaries of artificial intelligence.
I took a look at this tutorial, which uses TensorFlow (the most prominent ML platform, driven by Google) to categorize movie reviews as positive or negative. The idea was to simply change the code of that tutorial so that the model would learn from the trace training data to categorize traces as belonging to a failed or successful build.
At that point in time I did not really know about layers, optimizers and losses.
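For readers equally new to these concepts, here is a minimal sketch of how layers, an optimizer and a loss come together in such a binary text classifier. This is not the tutorial's exact code; the layer sizes and sequence length are assumptions for illustration (TensorFlow 2.x):

```python
import tensorflow as tf

# A minimal binary classifier: the layers define the architecture,
# the optimizer drives the weight updates, and the loss measures
# how wrong the predictions are during training.
model = tf.keras.Sequential([
    # Turns raw strings into integer token sequences (needs adapt() on real data).
    tf.keras.layers.TextVectorization(max_tokens=10000, output_sequence_length=250),
    tf.keras.layers.Embedding(10000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1)  # one logit: failed vs. success
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['binary_accuracy'])
```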
The first quick and dirty pimped version of the tutorial can be found here. Compared to the tutorial, I simply feed in the categorized traces which I had previously exported to a MongoDB using a simple Node-based data exporter.
The test data folder structure looked like this:
├── train
│ ├── failed
│ │ ├── 10001.txt
│ │ ├── 10003.txt
│ │ ├── 10005.txt
│ │ └── ...
│ ├── success
│ │ ├── 12001.txt
│ │ ├── 12003.txt
│ │ ├── 12005.txt
│ │ └── ...
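A directory laid out like this can be loaded directly with tf.keras.utils.text_dataset_from_directory, which derives the class labels from the subdirectory names. A small self-contained sketch (the tiny example tree and file contents are assumptions, standing in for the real exported data):

```python
import pathlib
import tensorflow as tf

# Build a tiny example tree (in the real setup this is the exported trace data).
root = pathlib.Path('train_example')
for label, sample in [('failed', 'job failed'), ('success', 'job success')]:
    d = root / label
    d.mkdir(parents=True, exist_ok=True)
    (d / '1.txt').write_text(sample)

# Each subdirectory name ('failed', 'success') becomes a class label,
# assigned in alphabetical order.
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    str(root), batch_size=32)
print(raw_train_ds.class_names)
```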
After I trained the model on that data I tried to predict the category for some obviously failing and succeeding builds. But the results, even for those obvious cases, were inconclusive.
It soon became obvious to me that the major problem was that the training data set was far too small and not diverse enough for a model to be trained. And since AI models are only as good as their data, this naive approach was doomed to fail right from the start.
So I needed more data. But in order to gain some further knowledge and see how things work, I decided to simplify my goal.
Next: Make it simpler
Since I did not have enough data, I created a training data generator that would create 25,000 simplified traces, each randomly containing only the term failed or success, categorized accordingly. That should give me a very precise model. And indeed, after training the model for 10 epochs things looked much better:
Epoch 8/10
625/625 [==============================] - 3s 4ms/step - loss: 0.1743 - binary_accuracy: 1.0000 - val_loss: 0.1291 - val_binary_accuracy: 1.0000
Epoch 9/10
625/625 [==============================] - 3s 4ms/step - loss: 0.1304 - binary_accuracy: 1.0000 - val_loss: 0.0978 - val_binary_accuracy: 1.0000
Epoch 10/10
625/625 [==============================] - 3s 4ms/step - loss: 0.0967 - binary_accuracy: 1.0000 - val_loss: 0.0715 - val_binary_accuracy: 1.0000
782/782 [==============================] - 1s 2ms/step - loss: 0.0743 - binary_accuracy: 1.0000
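The generator itself needs nothing but plain Python. Here is a minimal sketch of the idea; the file layout mirrors the tree shown above, but the exact wording of the generated traces is an assumption, not my actual generator:

```python
import random
from pathlib import Path

def generate_traces(root: str, count: int = 25000) -> None:
    """Write `count` simplified traces, each containing either the term
    'failed' or 'success', into the matching category folder."""
    random.seed(42)  # reproducible for this sketch
    for i in range(count):
        label = random.choice(['failed', 'success'])
        folder = Path(root) / 'train' / label
        folder.mkdir(parents=True, exist_ok=True)
        # A trace is just some filler text around the key term.
        (folder / f"{i}.txt").write_text(f"Running step {i} ... job {label}")

generate_traces('generated_data', count=100)  # small run for demonstration
```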
The key point here is that the loss, which indicates erroneous predictions, was down to 0.0743, which in turn indicates that the model should be pretty decent at predicting categories. And indeed the predictions were accurate for some makeshift traces:
import numpy as np

samples_to_predict = np.array(
    ["success", "failed", "I Failed my vocabulary test", "Success is not an option"])
predictions = trace_model.predict(samples_to_predict, verbose=2)
test = (predictions > 0.5).astype('int32')
print(test)
1/1 - 0s - 91ms/epoch - 91ms/step
[[1]
[0]
[0]
[1]]
Above is the Python code that runs the prediction, followed by its output. The output can be interpreted as:
"success" = category 1
"failed" = category 0
"I Failed my vocabulary test" = category 0
"Success is not an option" = category 1
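The mapping from raw model outputs to these categories is just a threshold at 0.5, as in the snippet above. A small numpy illustration with made-up prediction values (the numbers are assumptions, not actual model outputs):

```python
import numpy as np

# Hypothetical sigmoid outputs from the model, one per sample.
predictions = np.array([[0.93], [0.08], [0.21], [0.77]])

# Everything above 0.5 is treated as category 1 ("success"),
# everything at or below as category 0 ("failed").
categories = (predictions > 0.5).astype('int32')
print(categories.ravel().tolist())  # [1, 0, 0, 1]
```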
If you take a look at the complete source code of this example you will learn that 1 is the category for success and 0 for failed. So at last I had proof that a model can be trained to reliably predict very basic outcomes of traces.
Conclusion
This first try at creating a model that can categorize traces as failed or success mainly failed due to the lack of data and my lack of knowledge about creating machine learning models. So I decided to start acquiring some more knowledge about machine learning with TensorFlow, as well as getting some more data.