WEBVTT

00:00.000 --> 02:04.080
I'll summarize what I've said. Our goal is to have an always-on verbose tracing system, in such a way that your performance won't be hurt too much. Zero overhead is impossible, I think. But that's the goal, because sometimes you get those issues that happen somewhere, sometimes, and are impossible to reproduce and deal with. So that's the motivation. Technically it's somewhat similar to ftrace. I hope you're familiar with it: it's tracing, but it's usually used in development or in debugging, when you control the system, and it's a single, system-wide instance. Here we're talking about production, per-kernel-module tracing, where you control what is logged, and how, etc. So the problem, as I just mentioned: full string formatting for logging is slow. Even if you don't dump it into dmesg, you still pay the overhead. So this is going to be a technical deep dive into what we did and how we did it. The problem is memory, memory, memory; that's the problem. If you want good performance, you have to stay close to your memory and do as little as possible. And logging means that you may be reaching outside of your working set, because you're writing somewhere else.
02:04.080 --> 03:41.760
You have your working memory, and you have your logging memory. Going outside of your working memory means you get cache misses, and you will hurt performance. That's the main problem of any logging or tracing. So: try to minimize the size of the trace, and the number of times you access remote memory, remote meaning outside of L1 or L2. It's not really remote, but it's still something we try to minimize to reduce the performance hit. The goal is to make that as small as possible. Why am I saying this? Because you will see some serialization decisions, and the idea is to make everything as compact as possible. That's also why ftrace fell a bit flat in my mind, at least for this use case. So why binary tracing? Binary tracing is the fastest way to do it, and many of the things you do, you can do at compile time. We keep in memory only the parameters that you save: some kernel pointers, or some ID, some token. You don't need to save the whole string, because the string is somewhere else; we save it just once, at compile time. So for each logging instance we have this data structure. The timestamp is a delta, which means that we don't keep the whole u64 for your timestamp; we keep only a delta from a baseline.
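The delta-timestamp idea described above can be sketched as follows. This is a minimal, illustrative userspace version, not the actual patch-set code: each buffer records one full u64 baseline, and every entry stores only a small offset from it. The u16 width is an assumption for illustration.

```c
#include <stdint.h>

/* Encode a timestamp as a small delta from the buffer's baseline
 * (e.g. the jiffies value recorded when the buffer was activated).
 * Assumes the buffer is retired before the delta overflows 16 bits. */
static uint16_t ts_encode(uint64_t base, uint64_t now)
{
    return (uint16_t)(now - base);
}

/* The deserializer recovers the full timestamp from baseline + delta. */
static uint64_t ts_decode(uint64_t base, uint16_t delta)
{
    return base + delta;
}
```

Storing 2 bytes instead of 8 per entry is what lets several log entries share a single cache line, which is the point made below.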
03:41.760 --> 05:43.080
We keep the baseline, which is actually on the next slide. We don't keep the whole timestamp, only the delta from a baseline recorded when you allocate a page-frag buffer, which is something I'm going to talk about, so each entry is more compact. The source ID is where your line was. The first time you hit a logging line, you save the location, the file name and the line number, and it gets an ID. The next time you log there, you only save the ID. So when you are deserializing this log, you have the exact site; you don't need the string, you just have a small integer, and the deserializer resolves it back to the file and line. And you have a user context: for example, in Ceph we have the FSID globally, but you can also save something client-specific, something specific to your logging, so you can get at it fast. And the length. So it's basically a TLV structure, and that's the idea: be as compact as possible in how we do the serialization. With TLV, as much as you can do at compile time, you save. That's kind of the magic. And then it's always on, and you read it on demand. So what's the magic? The magic is that I'm not using actual circular buffers; it's a cyclical memory allocation scheme.
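The source-location interning described above can be sketched like this. This is a hypothetical userspace illustration, not the real implementation: the table, its size, and the linear lookup are all placeholders for whatever the patch set actually does on first hit.

```c
#include <stdint.h>
#include <string.h>

#define MAX_SITES 256   /* illustrative capacity */

struct source_site {
    const char *file;
    int line;
};

static struct source_site source_map[MAX_SITES];
static uint16_t nr_sites;

/* Return the ID for (file, line), registering the site on first use.
 * A log statement would record source_id(__FILE__, __LINE__) once and
 * store only the small ID from then on; the deserializer maps the ID
 * back to the strings via source_map[]. */
static uint16_t source_id(const char *file, int line)
{
    uint16_t i;

    for (i = 0; i < nr_sites; i++)
        if (source_map[i].line == line && !strcmp(source_map[i].file, file))
            return i;
    if (nr_sites == MAX_SITES)
        return 0;   /* table full; a real version would handle this */
    source_map[nr_sites].file = file;
    source_map[nr_sites].line = line;
    return nr_sites++;
}
```

The payoff is that every log entry carries two bytes instead of a file-name string, and the strings are emitted exactly once.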
05:43.080 --> 07:31.480
I should have invested in a slide for this, but basically we have a magazine allocator, the way kmem caches work. With a magazine allocator you get magazines of buffers, an allocator per core; you can have two of those. There's a paper on it, unfortunately not mine, and it's been around for a long time. We're using this approach so that the allocation is very fast for each frag. But we are allocating very, very small segments: as I mentioned, only your parameters and the TLV header. In our use case, from what I've seen, this way we could fit two logs per cache line. That's an improvement on whatever it was before, and it basically means that when you write one log and the cache line is already in your L1, the next log is essentially free. So how do you allocate fast? First, you have the magazine allocator, so you have per-core allocations, and you go to the global context only once in a blue moon. Again, magazines deserve a talk of their own, but if you're not familiar with the pattern, my advice is to use it, because it's very powerful. The next thing we do is take each one of those buffers and, what's the name, I'm forgetting the name, we slice it up. It's borrowed from networking.
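The magazine pattern mentioned above can be sketched in a few lines. This is a minimal single-threaded illustration of the general idea (a per-core stack of free buffers, refilled from a shared depot only rarely), with malloc standing in for the depot; it is not the kernel code from the talk.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAG_SIZE 8   /* illustrative magazine capacity */

struct magazine {
    void *rounds[MAG_SIZE];
    int nr;                 /* number of free buffers held */
};

/* Slow path: in the kernel this would swap in a full magazine from a
 * global list; here we just fabricate fresh buffers with malloc(). */
static void depot_refill(struct magazine *mag, size_t bufsz)
{
    while (mag->nr < MAG_SIZE)
        mag->rounds[mag->nr++] = malloc(bufsz);
}

/* Common case: a pop from the per-core magazine, no shared state. */
static void *mag_alloc(struct magazine *mag, size_t bufsz)
{
    if (mag->nr == 0)
        depot_refill(mag, bufsz);
    return mag->rounds[--mag->nr];
}

/* Common case: a push back into the per-core magazine. */
static void mag_free(struct magazine *mag, void *buf)
{
    if (mag->nr == MAG_SIZE) {
        free(buf);          /* stand-in for returning to the depot */
        return;
    }
    mag->rounds[mag->nr++] = buf;
}
```

The design point is that the fast path is one push or pop with no locks, which is why per-frag allocation stays cheap.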
07:31.480 --> 09:18.920
In networking applications, you usually have a 32K buffer allocated, and then each time you have an incoming network packet you just slice a piece off, so your allocation is just advancing an offset by the size you need. That's the same mechanism we're using here. From my experience, having this pairing of magazines with page frags, where you fragment into each page, is very, very fast. If you want to see the performance: I've used it with 100-gigabit NICs, with per-packet memory operations. The allocation is not what costs you performance; what costs you performance is the actual access. That's the first slide I showed: memory ops. So we have this magazine allocation on one side, we have your page-frag allocation while you're using it, and then you have another magazine allocator that puts it into an existing-logs cache, basically. All magazines go into a global list eventually, and are then recycled. So you have full control, and this cyclical allocation scheme provides two things: you control the amount of memory you're willing to provide for your logging, and you make sure that each log lives as long as possible. Only when it goes through the whole cycle will your logs be recycled, globally.
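The slicing trick borrowed from networking amounts to a bump allocator over one large buffer. A sketch, under the assumption that refcounting and refill (which the kernel's page-frag machinery also handles) are out of scope:

```c
#include <stddef.h>
#include <stdint.h>

struct frag_buf {
    uint8_t *base;   /* the one big buffer taken from a magazine */
    size_t   size;
    size_t   off;    /* next free byte */
};

/* Allocation is a single add-and-check; returns NULL when the buffer
 * is exhausted and the caller must grab a fresh one from the magazines. */
static void *frag_alloc(struct frag_buf *fb, size_t len)
{
    void *p;

    if (fb->off + len > fb->size)
        return NULL;
    p = fb->base + fb->off;
    fb->off += len;
    return p;
}
```

Because each log entry is tiny (header plus parameters), consecutive entries land adjacently, which is how two logs end up sharing a cache line.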
09:18.920 --> 10:38.120
So: performance and lifetime. Additionally, we want to have process context, but not everything happens in process context; something may happen in IRQ context, where that's inappropriate. For this, and that's the one big change I need to get in, we are adding a pointer in the task structure. When you have a process that is logging, and we're talking about a file system here, so when it does a read or write operation, you want to be able, when you get the logs back out, to see which process did what. You want to be able to take your Python script and organize the logs in whichever way you like, because you won't get them properly serialized: you'll get each page frag, which will be properly ordered by time internally, but the frags themselves will be out of order. Then anyone can organize them by process, or by whatever context you have. And the second thing, you have the dynamic context: you have packets coming in, and there the process context is irrelevant or wrong.
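The post-processing step implied above (frags internally time-ordered, but out of order with respect to each other) can be sketched as a sort over frag headers. The header layout and the grouping key are assumptions for illustration; the talk only says a script can reorder by process or time.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative header for one dumped page frag. */
struct frag_hdr {
    uint32_t pid;            /* owning process (illustrative key)     */
    uint64_t base_jiffies;   /* baseline all entry deltas are against */
};

/* Order frags by process, then by their baseline timestamp. */
static int frag_cmp(const void *a, const void *b)
{
    const struct frag_hdr *x = a, *y = b;

    if (x->pid != y->pid)
        return x->pid < y->pid ? -1 : 1;
    if (x->base_jiffies != y->base_jiffies)
        return x->base_jiffies < y->base_jiffies ? -1 : 1;
    return 0;
}

static void order_frags(struct frag_hdr *frags, size_t n)
{
    qsort(frags, n, sizeof(*frags), frag_cmp);
}
```

Within each frag, entries are already in order, so sorting only the headers is enough to reconstruct a coherent timeline per process.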
10:38.120 --> 12:21.000
So that's why we separate the two: we have per-process logging and per-CPU logging, and they're separate, but they all belong to your module context, and you'll get them both. With per-process you will see something that happened when you actually did something, when you made the syscall. With per-CPU you will see that something happened with your module, but you don't really know who paid the price, because it was whatever was running on the CPU when it happened. Something came from the wire, something came in an IRQ, it isn't tied to a syscall, and you still make sure you have the proper logging. Memory layout: let's get to the slide. This is the explanation of what I was describing. We have the empty magazines, they go to a global list, that's the blue, and there is a full-magazines list. We allocate a page-frag context: from the magazines we allocate an entry, but it belongs to a page-frag context, which has a list head, because, as I mentioned, when we allocate one it goes into this global list, and the page head remains alive and usable throughout, no matter where in the life cycle each page frag is. That's so that when we dump your information, when you actually want to deserialize the logging, you can get all the logs.
12:21.000 --> 14:09.480
So we have the logs list, and we have the identifier, like the PID or the CPU, whichever, and the base jiffies: when this context becomes active, you record the base jiffies, and then each log only needs the delta. That's how we make it small. So that's the description: you have the buffer, the metadata, and then you have the logs. And when we retire a page frag, instead of trying to do cyclical allocation in place, we just move it aside into the existing-logs cache and get a new one, again because we want to keep the logs alive as long as possible. The global context is where it all starts. We have the allocation, and this is the live log batch (I think there's a typo on the slide), and you have the source map and the client map. The client map, as I said, holds the identifiers that are specific to your module, whatever you wanted to save, and the source map is the locations, so you have the files and lines. So that was a deep dive; the original talk was longer, and I hope it wasn't too dense. Thank you. Questions?
[Audience] How do you read the trace once you've written it? Is there integration with any tracing GUI, or do you use some custom scripting to then reason about it?
[Speaker] In the current solution, we have a debugfs file for your module, and you just cat and read it.
14:09.480 --> 15:38.840
I'm not sure that will be the final form. I sent the patch set back in October; it's longish. And Steven, who wrote ftrace, asked: why wouldn't you use ftrace? So the current implementation is debugfs, and we'll see how it goes; it might go into ftrace.
[Audience] Are you tracing only kernel space, only user space, or both?
[Speaker] It's only kernel space. It's not a system-wide solution; it's just inside your kernel module, at compile time.
[Audience] You started your talk with comments about memory latency. Does this really play such a big role? Because as a tracing provider you're just writing; you never have to wait on memory, you just store into write buffers and eventually the system flushes them out. Are you seeing stalls?
[Speaker] It's not about the stalls, it's about the memory footprint. Whenever you're going outside of your working set, you're evicting your L1, your L2, etc., whenever you have a working context on the CPU. It's a longer discussion.
[Audience] I'm trying to remember: Intel has had, what do we call them, the non-temporal stores. Are you using them? That would bypass all these caches.
15:38.840 --> 17:06.760
[Speaker] We actually tried that; from my memory, the results were not great. We made some experiments, like a decade ago, in a different context, not this one. It's actually an interesting idea; I would need to redo it. My memory of it was that it wasn't very useful, but that may be out of date, and this specific context may have changed enough to make it worthwhile, so it's worth looking at.
[Audience] Is this specific to kernel modules? Do you need to modify a kernel module, or can we use it in the core kernel?
[Speaker] It's a subsystem; it's a library you can use. You just change your logging calls to the binary trace calls and you get it. All right, okay folks, any more questions? If you want to see the details, my email is right there. Thank you for listening.
[Applause]