WEBVTT 00:00.000 --> 00:13.440 Okay, so my name is Lucas Tavarek. I'm a technical leader at Intel, and today's talk is about AI-based failure 00:13.440 --> 00:23.560 aggregation. So in our CI pipelines we often see hundreds of test failures every day, and most 00:23.560 --> 00:31.720 of them are very similar but not exactly the same. Because of these minor differences, our 00:31.720 --> 00:39.800 engineers had to spend time manually analyzing these failures, repetitively, over and 00:39.800 --> 00:47.160 over again. Today I will show you how we reduced the overall noise in the CI system with 00:47.160 --> 00:55.400 text embeddings. Let's start with a quick and high-level overview of the original workflow that we 00:55.400 --> 01:03.720 used. So on the left-hand side we had regular CI/CD pipelines that build components and 01:03.720 --> 01:12.200 run tests. Then we have a monitoring system that automatically gets the test results from the CI/CD 01:12.200 --> 01:20.280 pipeline, performs some initial analysis and then reports the test failures to an engineer. 01:21.160 --> 01:27.800 The engineer has to make a decision whether to update an existing bug report with a new 01:27.800 --> 01:35.720 instance of a given failure or treat it as a completely new bug, create a new bug report and, if 01:35.800 --> 01:43.800 needed, also start regression isolation, where we try to find the faulty commit and 01:43.800 --> 01:50.760 revert it. Now let's talk about the scale, which is quite important in this solution. 01:51.880 --> 01:59.160 So on the x-axis you have the number of tests, and on the y-axis the percentage of failures. So from the 01:59.160 --> 02:07.000 scale perspective, if you have a low level of failures, then you are in a happy place; you 02:07.000 --> 02:13.400 often don't need some complicated system or process to handle these failures. 
02:14.360 --> 02:20.120 If you have a relatively low number of tests and a high percentage of failures, this is still 02:20.120 --> 02:27.080 manageable in some semi-automatic way; you can handle all of the failures. However, in our case 02:27.080 --> 02:39.800 we had hundreds of thousands of test cases executed each day and hundreds of test failures 02:40.760 --> 02:47.080 that needed to be analyzed every day by engineers. So this did not scale, and we needed to find something 02:47.080 --> 02:56.680 better. So this is an overview of the desired workflow that we tried to implement. As you can see, 02:56.840 --> 03:03.960 most of the elements of the diagram are exactly the same, with one major change: the engineer, 03:03.960 --> 03:11.240 or the human in the loop, was removed from the critical path. So we tried to create a system that 03:11.240 --> 03:22.200 will automatically get the test results and by itself make the decision whether to update a bug or 03:22.200 --> 03:29.000 create a new one and start the regression isolation. The only interaction with the engineer was to 03:29.000 --> 03:38.760 send a report for verification. So in short, our goal was to create a fully 03:38.760 --> 03:47.480 automated agent, and the main problem that we faced was how to determine if a failure is a new issue 03:47.480 --> 03:55.080 or an already reported one. So here are a few potential solutions that we considered. 03:55.800 --> 04:02.280 The first one, the most basic one, is to simply have no aggregation at all, so each test failure 04:02.280 --> 04:09.320 is a new bug report. But then we simply move the issue from the test reporting side to the 04:09.320 --> 04:15.400 bug management side. So we have a lot of duplicates and also a risk of wasted resources, as 04:16.360 --> 04:25.000 we unnecessarily start the regression isolation process. Then we could go one step further 04:25.960 --> 04:33.000 and try to aggregate the failures per test case. 
So if we have exactly the same test case 04:33.000 --> 04:39.400 that is failing over multiple builds, then we can simply try to treat them as one bug report. 04:40.120 --> 04:47.720 However, we can still have duplicates, because the same bug can affect multiple test cases, and we 04:47.720 --> 04:53.960 still have noise in the system. Another issue with this approach is so-called nested 04:53.960 --> 05:00.840 regressions. So we have exactly the same test case that on one build is failing due to some 05:00.840 --> 05:06.680 numerical errors and on another build started to fail due to a segmentation fault. With this 05:06.760 --> 05:12.920 naive approach we will not detect this, and we will still treat these two issues as one bug. 05:14.200 --> 05:22.360 Next, we could also try to do a direct log comparison. So simply get the error messages from 05:22.360 --> 05:31.640 test failures and try to normalize them: remove timestamps, memory addresses, or anything like 05:31.640 --> 05:37.720 that which changes from build to build but doesn't change the error signature. However, these 05:39.080 --> 05:46.280 text operations are highly complex and rule-based, so we would need to maintain this 05:46.280 --> 05:55.960 over and over again. Another option is a new machine learning model. However, this 05:56.040 --> 06:03.560 basic approach doesn't scale, and it's really time-consuming and error-prone, mostly due to the 06:04.520 --> 06:11.560 need to prepare a dataset. We can't go with a few hundred examples; we need a few 06:11.560 --> 06:17.960 thousand, and we need to label this data, which is quite expensive. Of course, there's another way. 06:18.840 --> 06:25.560 Now let me introduce three main concepts that are essential to our solution. 06:28.040 --> 06:35.000 The first one is text embedding. So what even is a text embedding? It's simply a numerical vector 06:35.400 --> 06:44.680 in a multidimensional space that represents the meaning of a sentence. 
So if we convert 06:44.760 --> 06:52.840 sentences into vectors, then vectors that are close to each other should represent a 06:52.840 --> 06:59.480 similar meaning. So the words in a sentence can be different, but the meaning will be treated as 06:59.480 --> 07:08.840 the same. Then we have vector similarity search. So if we have these vectors, then we need a way 07:09.560 --> 07:19.240 to decide how similar they are. To do this, we use the standard way, which is the cosine similarity 07:19.240 --> 07:25.720 metric, which is simply the cosine of the angle between two vectors. So if the cosine similarity is 07:25.720 --> 07:35.480 near one, then we treat such vectors as highly related, and if the cosine similarity is near zero, 07:35.480 --> 07:48.040 then we treat them as unrelated. Finally, the bi-encoder architecture. Here we have an example 07:48.040 --> 07:56.440 of the training stage of such an architecture and model. So we have input sentence A that we pass to the 07:56.440 --> 08:02.840 encoder model. In most cases this is some model like BERT and things like that. 08:03.400 --> 08:09.960 Then we create an embedding, and then we repeat this process with another sentence that we want to 08:10.760 --> 08:18.840 compare with. We go through the same encoder module, generate another embedding, and then, based on the 08:18.840 --> 08:28.040 cosine similarity, we decide whether these embeddings are similar or not. With the value of cosine similarity 08:28.120 --> 08:39.560 we can then compare them with the reference data in our validation dataset and update the 08:39.560 --> 08:46.920 weights of the encoder; in other terms, send positive feedback to the encoder if the embedding 08:46.920 --> 08:54.920 was good and we got the expected value of cosine similarity, or send negative feedback 08:55.000 --> 09:05.640 if the embeddings are far off from the reference value. That's the theory part. 
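The cosine metric just described can be sketched in a few lines of plain Python (a minimal illustration, not part of the production system):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1 (highly related) ...
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# ... and orthogonal vectors score 0 (unrelated).
different = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In practice you would call an optimized implementation (NumPy, PyTorch, or the database itself) rather than this loop, but the formula is the same.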
Now let's go to 09:05.640 --> 09:15.400 the implementation. So first we use the sentence-transformers package that is available on 09:15.400 --> 09:23.720 PyPI; it is written in Python. As you can see, with one line of code we can load an already 09:23.720 --> 09:31.960 pre-trained model that is available online. Then we can create some example sentences that we will 09:31.960 --> 09:41.560 work on, and finally we can simply call model.encode with the input sentences. In this 09:41.560 --> 09:49.000 case we will get three embeddings, because we had three input sentences, each embedding with 09:49.000 --> 09:56.600 384 dimensions, which is determined by the model that we chose at the beginning. Then we have some 09:56.600 --> 10:02.760 utility functions from the sentence-transformers package; in this case we have similarity, 10:02.760 --> 10:08.680 which we can call with the embeddings, and we will get the cosine similarity between these vectors 10:08.680 --> 10:25.400 as a tensor object. If you are not using the Python ecosystem, then I really recommend 10:25.400 --> 10:35.480 Hugging Face's text-embeddings-inference project, which has pre-built Docker images 10:35.800 --> 10:42.200 with CPU support and GPU support, and it can also be easily deployed on Kubernetes 10:42.200 --> 10:52.920 clusters. In this case we can easily scale the solution, so the server can serve a given model 10:52.920 --> 11:01.800 and accept multiple requests at the same time. Here we have a web interface, so we simply 11:01.800 --> 11:11.000 send a web request to the server with an input sentence that we want to embed, and as an output 11:11.000 --> 11:22.280 we will get the embedding vector. Then the question is: which model should I use? Here 11:22.280 --> 11:28.280 comes the MTEB leaderboard that is available online. Here are some examples from the 11:28.280 --> 11:34.280 leaderboard and a few essential parameters that you need to take care of. 
11:36.280 --> 11:44.200 The first one is memory usage. As you can see, the first model uses around 44 gigabytes 11:44.200 --> 11:52.040 of memory, so to run it you need a really powerful GPU or even multiple GPUs. On the other hand, 11:52.040 --> 11:59.240 the last example is the one from the sentence-transformers example, and it uses only around 100 11:59.240 --> 12:07.160 megabytes of memory, and you can easily run it on a CPU. Then we have embedding dimensions: 12:08.360 --> 12:16.840 more dimensions in theory leads to better accuracy. However, it comes at the cost 12:16.920 --> 12:26.760 of the storage that is required to store these vectors. Finally, we have the max tokens parameter, which represents 12:26.760 --> 12:36.840 the length of text that we can embed in a single go when we call the model. So if we have a 12:36.840 --> 12:45.160 large text input but the model can't handle it, then in most cases it will truncate it 12:45.160 --> 12:54.120 to the number of tokens that it supports. We have text embeddings, we have a way to 12:54.120 --> 13:02.520 compare them, so then we need a way to store them. In this case we use the pgvector extension for Postgre 13:02.520 --> 13:11.080 SQL. With this extension we have a new column type for vectors, so of course we can insert 13:11.080 --> 13:19.320 vectors into the database alongside any other standard data that is available in Postgres, and then we 13:19.320 --> 13:29.880 have a few dedicated operators that allow us to query the database to retrieve vectors that are 13:30.120 --> 13:44.360 related to each other. Here we have an overall overview of the workflow that we prepared. 13:44.360 --> 13:52.120 We have a Python orchestrator that automatically gets the logs from Jenkins, which runs the tests; 13:53.080 --> 14:03.560 then we extract the error signatures of the failures from the logs from 14:03.560 --> 14:10.360 Jenkins. 
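A minimal pgvector setup of the kind just described might look as follows; the table name, columns, and query vector here are illustrative assumptions, not the actual schema, and `<=>` is pgvector's cosine-distance operator:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical table storing failure signatures next to their embeddings.
CREATE TABLE failures (
    id        bigserial PRIMARY KEY,
    bug_id    bigint,
    signature text,
    embedding vector(384)  -- matches the 384-dimensional model mentioned earlier
);

-- Nearest stored failure by cosine distance; "1 - distance" gives the
-- cosine similarity discussed earlier (the '...' stands for the full vector).
SELECT id, bug_id, 1 - (embedding <=> '[0.12, -0.03, ...]') AS similarity
FROM failures
ORDER BY embedding <=> '[0.12, -0.03, ...]'
LIMIT 1;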
We pass them to the Hugging Face text-embeddings-inference instance to get the embeddings. 14:12.040 --> 14:18.680 With the embedding, we go to PostgreSQL to check whether we have already 14:18.680 --> 14:26.520 seen a similar error in the database. If we haven't, then we automatically create 14:26.520 --> 14:36.280 a bug report and also automatically start the git bisect and git revert workflow to 14:36.280 --> 14:45.240 revert the faulty commit from the system. Of course, if we have an embedding that already exists 14:45.320 --> 14:56.280 in the database, then we only update the bug report in the database. As always, 14:56.280 --> 15:05.560 there is room for improvement, so here are a few recommendations from our side. The first is to analyze 15:05.560 --> 15:14.040 more logs. There are two cases. The first one is simply the case when you have really long error 15:14.120 --> 15:20.040 messages that you want to embed. As I said previously, the default option is to 15:20.040 --> 15:27.320 truncate the log, and by default most of the services will truncate keeping the beginning. 15:28.120 --> 15:34.440 From our experience, in most cases the error messages and the real error signature 15:35.160 --> 15:41.880 are available at the end of the log, so the basic fix is to simply truncate keeping the 15:41.960 --> 15:51.000 bottom of the log file, and then you will get better output. Another option, if you have the resources, 15:51.000 --> 15:57.960 is to simply use a bigger model with a larger max tokens parameter that can handle your input. 15:59.640 --> 16:08.440 Then there is the option of multiple log files. For example, you have error messages from your 16:08.440 --> 16:15.320 testing framework, and then you have some log files from, let's say, a web server that is being tested, 16:15.880 --> 16:21.720 and you would like to aggregate the failures based on these two input streams. 
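The truncate-keeping-the-bottom idea above can be as simple as the following sketch; the character budget is an arbitrary stand-in for whatever token limit your model actually has:

```python
def tail_for_embedding(log_text: str, max_chars: int = 2000) -> str:
    """Keep the end of the log, where the real error signature usually is,
    rather than the default head-first truncation."""
    return log_text[-max_chars:]

# A long log whose useful signal is on the last line:
log = "\n".join(f"INFO step {i} ok" for i in range(10_000))
log += "\nERROR: segmentation fault in foo()"
snippet = tail_for_embedding(log)
# The error signature survives truncation; the early INFO noise does not.
```

A real token budget would be measured with the model's tokenizer rather than in characters, but the direction of the cut is the point.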
16:22.440 --> 16:30.040 The basic option is to simply get these two logs, merge them into one file, and embed 16:30.040 --> 16:41.720 everything as one vector. Another option, if the logs are too big, is to try to use an 16:41.720 --> 16:50.600 LLM or any other method that will try to select the more important log. So if, for example, you have 16:50.600 --> 16:58.280 some segmentation faults, then maybe we don't need to analyze some basic log files and can 16:58.280 --> 17:05.880 go straight to some system core files or anything like that. Another option is to embed only the 17:05.880 --> 17:13.960 error signatures. So if you have a few thousand lines in the log file and the error 17:13.960 --> 17:20.680 signatures only take, I don't know, 20 lines or so, then, also with the 17:20.680 --> 17:27.720 help of an LLM or any other such solution, you could try to extract the error signature 17:28.280 --> 17:35.400 and then embed only this signature instead of the whole log file. Then we have fine-tuning. 17:36.520 --> 17:44.680 By default, the pre-trained models are trained on a lot of big datasets 17:44.680 --> 17:53.080 that are generic and general-purpose. So we found out that there are some 17:53.800 --> 18:00.360 issues with domain-specific patterns. For example, we had a test that compared two vectors 18:00.360 --> 18:09.480 and reported the percentage of incorrect values, and, for example, if in the same test case we had 18:09.480 --> 18:17.640 2% of difference, and then we had another run on another build that resulted in 80% of 18:18.600 --> 18:28.040 difference, by default it was treated as the same error message and it was aggregated, because 18:28.040 --> 18:35.160 the model didn't know that such issues should be treated as separate ones. 
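Such a small domain-specific dataset could look like this; the sentences are made up, and the (sentence_a, sentence_b, score) layout follows the training convention used by sentence-transformers' CosineSimilarityLoss:

```python
# Illustrative labeled pairs (not our actual data). A score near 1.0 means
# "these should embed close together"; a score near 0.0 forces the encoder
# to keep them apart, which fixes the 2% vs 80% case described above.
train_pairs = [
    ("result mismatch: 2% of values incorrect",
     "result mismatch: 2% of values incorrect on build 1042", 1.0),
    ("result mismatch: 2% of values incorrect",
     "result mismatch: 80% of values incorrect", 0.0),
]

# The scores of the pairs, in order: same severity vs different severity.
labels = [score for _, _, score in train_pairs]
```

A few dozen such pairs, fed through the bi-encoder training loop shown earlier, are usually enough to teach the model that the numeric field matters.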
In such 18:35.160 --> 18:42.120 a case we can create a small dataset that will optimize the behavior of the encoder module 18:42.760 --> 18:52.680 and force it to separate these domain-specific patterns. Finally, in this talk I 18:54.280 --> 19:02.520 presented the failure aggregation part, which focuses on linking the failures that 19:02.520 --> 19:09.080 look the same, so we have the same error message. There is also an option for failure correlation, 19:09.080 --> 19:16.280 and by that I mean linking failures that look different but have the same root cause. 19:16.280 --> 19:22.840 So if we have some unit or integration tests that we know from statistical data 19:22.840 --> 19:29.800 will in most cases fail together with some end-to-end test, then we can automatically try to 19:29.800 --> 19:37.800 also correlate these failures and present them in one view for further processing. 19:39.560 --> 19:48.280 So to sum up, with text embeddings for failure aggregation you can improve the efficiency of your 19:48.280 --> 19:54.920 process by minimizing noise, and I really encourage you to give it a try, as the entry barrier 19:54.920 --> 20:02.200 is really low: all of the packages are open source, there are pre-trained models available online, 20:02.840 --> 20:10.680 and you can start running them even on a CPU. So the key takeaway is that text embeddings 20:10.680 --> 20:15.000 are a low-effort way to turn CI noise into signal. Thank you. 20:22.280 --> 20:24.200 Now I'm open for questions. 20:32.600 --> 20:52.200 We didn't benchmark that, but the main issue is the scale, so with hundreds of failures 20:52.200 --> 20:58.520 or thousands of failures, we didn't want to go with that option, 20:59.160 --> 21:03.720 given the infrastructure that we already have in the end. 
21:14.520 --> 21:19.000 Yeah, so the question is about the false positives and negatives. We measured the 21:19.640 --> 21:28.680 recall, as I remember, because we wanted to have no false positives. We preferred to have multiple 21:28.680 --> 21:37.000 bug reports instead of one that correlated or aggregated the wrong issues together, and on 21:37.000 --> 21:46.600 our testing dataset we had around 90%, so it was really good from the get-go; we didn't 21:46.600 --> 21:48.600 have to optimize it too much. 21:48.600 --> 22:18.520 So the question is about the stability of the 22:18.520 --> 22:28.520 tests and how they are aggregated. In our case we are not aggregating over a single 22:28.520 --> 22:35.160 build run; we are aggregating across multiple builds, so if you have sporadic failures 22:35.160 --> 22:40.680 that occur once a week or something like that, this solution will automatically 22:40.680 --> 22:48.680 detect them, because we already have them in the database. 22:58.680 --> 23:09.480 Yeah, so the question is about git bisect and whether we try to automatically detect 23:09.480 --> 23:17.480 whether we should update the test or fix the solution. For now we haven't done it. There 23:17.480 --> 23:26.440 was some work where we get the commits that could have introduced the regression from multiple 23:26.440 --> 23:33.960 components and then try to use an LLM, asking it which of these commits is most 23:33.960 --> 23:40.920 likely related to the failure that we see in the system. So there was some work, but it was not 23:40.920 --> 23:45.960 part of this solution.