WEBVTT 00:00.000 --> 00:11.880 Right, okay, thank you, everyone, for staying till the end. As I said before, I know it's 00:11.880 --> 00:17.520 tempting to go get some beers on a Sunday afternoon, but you stuck it out, stuck with us. 00:17.520 --> 00:24.440 Ali is going to talk about writing an application kernel in Rust, so let's give 00:25.360 --> 00:28.560 the last of your Sunday energy to welcome him. 00:37.200 --> 00:39.600 Okay, this works, yeah? 00:39.600 --> 00:40.880 All right, hello, everyone. 00:40.880 --> 00:44.040 This is Syd: writing an application kernel in Rust. 00:44.040 --> 00:47.080 Again, thank you for staying until this late hour. 00:47.080 --> 00:51.080 I hope I'll make it as entertaining for you as possible, but it's a bit of a 00:51.080 --> 00:53.480 difficult subject, so bear with me. 00:53.640 --> 00:58.480 So, a little bit of an introduction about me, so maybe you trust me. 00:58.480 --> 01:04.200 I've been an Exherbo developer for the past 15 years, I used to be [unclear], and 01:04.200 --> 01:07.240 the main author of Sydbox, our topic today. 01:07.240 --> 01:12.200 I'm also an international chess trainer, and the co-founder of Chess Without Boundaries, 01:12.200 --> 01:16.520 where we try to provide accessible materials to chess players with disabilities. 01:16.520 --> 01:21.240 And you can see my email here; if you want to contact me, feel free to. 01:22.200 --> 01:26.200 Here is the basic outline of what we are going to do today. 01:26.200 --> 01:28.840 First, we'll define what an application kernel is. 01:28.840 --> 01:33.560 It's a bit of a vague term, so making a definition first is logical. 
01:33.560 --> 01:40.200 Then I'm going to explain to you how syscall interception actually works, so you'll get a 01:40.200 --> 01:45.080 wider understanding of how application kernels work, and then the main meat of the 01:45.080 --> 01:50.520 matter is of course Rust: why did I pick Rust, what are our memory safety patterns, 01:50.520 --> 01:55.720 and the trade-offs between safety and performance, and then I'll finish by talking a bit about 01:55.720 --> 01:57.880 our testing infrastructure and Q&A. 01:59.240 --> 02:01.720 So what's an application kernel? 02:01.720 --> 02:05.960 Application kernel is a really vague term, it's both an application and a kernel, 02:05.960 --> 02:09.560 but neither an application nor a kernel, so it's somewhere in between. 02:09.560 --> 02:12.600 And here is a nice description I found from an article. 02:12.600 --> 02:17.560 It's a library operating system variant that intercepts, emulates and transforms 02:17.560 --> 02:20.760 syscalls in user space for sandboxed processes. 02:20.760 --> 02:23.800 These three are important, I'll explain a bit more later: 02:23.800 --> 02:26.920 interception, emulation and transformation, right? 
02:26.920 --> 02:33.640 And the way it does it is it intercepts system calls via seccomp unotify; unotify stands for 02:33.640 --> 02:40.520 user notification, it's a new API added to seccomp in recent Linux versions, and its main 02:40.520 --> 02:45.320 difference from ptrace is you can handle system calls simultaneously from different 02:45.320 --> 02:51.000 threads, unlike ptrace, where you have to serialize, and we also use ptrace and 02:51.000 --> 02:56.440 Landlock optionally. And what it does is it emulates file system, network and process 02:56.440 --> 03:03.080 operations, it transforms paths, flags and credentials at runtime, and you can configure it 03:03.080 --> 03:10.440 dynamically via the /dev/syd virtual path, I'll delve into it a bit deeper later. And similar projects 03:10.440 --> 03:17.640 are Google's gVisor, which is written in Go, very similar; gVisor uses the seccomp trap API, 03:17.640 --> 03:24.440 and we use seccomp unotify, so in Syd the system calls are handled in a different 03:24.440 --> 03:29.240 process, but in gVisor it's all in a single process, so depending on your use case, 03:29.240 --> 03:34.920 either one may be more secure; and rump kernels of NetBSD fame, where you can develop your 03:34.920 --> 03:40.680 kernels as applications, so you're not scared that a crash can crash the whole system, 03:40.680 --> 03:46.200 and other examples are not the containers or spheres, and there are many other examples, 03:46.200 --> 03:53.960 but these are the outstanding ones. And so, what Syd actually does: here I try to make it as 03:53.960 --> 04:00.360 simple as possible, but not simpler. So on the left you see the path that an open call 04:00.360 --> 04:08.360 takes until it's implemented in Syd, and open, if you don't know, is a very basic Unix system call, 04:08.360 --> 04:15.560 which you use to open a file, and you get a file descriptor object from the kernel with which 04:15.560 --> 04:21.960 you can use read, write, and so 
on, and when the sandbox process opens a file, the first thing 04:21.960 --> 04:28.600 that the Linux kernel does is send this seccomp notification to the Syd emulator thread that's 04:28.600 --> 04:33.480 handling it; there are many threads, as you will see, and one of them will pick it up. And this 04:33.480 --> 04:40.520 notification has the system call number and the arguments, but we have to first 04:40.520 --> 04:46.360 process the arguments to make them useful, right? We get a pointer, not a string, right? The pointer 04:46.360 --> 04:54.360 tells us where the string is located in the sandbox process's memory, so we need another step: with process_vm_readv 04:54.360 --> 05:00.840 we actually read this string into our process space, and process_vm_readv is a bit more 05:00.840 --> 05:08.280 secure than the old way of reading /proc/pid/mem, because it respects the address space permissions 05:08.280 --> 05:14.360 of the sandbox process. And now we have a string; this string is a path name from the perspective 05:14.360 --> 05:20.520 of the sandbox process, it can be a relative path, it can be an absolute path, so we need 05:20.520 --> 05:27.160 another step to actually turn it into a real path that we can make a sandbox check on, right? 05:27.160 --> 05:35.320 And this step is the canonicalization, and canonicalization gives you two values back as a return value: 05:35.320 --> 05:40.920 one is an O_PATH file descriptor, the other one is a canonical path, and the idea is both of 05:40.920 --> 05:47.000 them point to the same thing. O_PATH, if you don't know, is a type of file descriptor that you can 05:47.000 --> 05:53.160 only use for path operations, not for reading or writing, so the actual open hasn't happened yet, 05:53.160 --> 05:59.480 right? 
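As a rough sketch of the notification step described above: the kernel hands the supervisor a syscall number plus six raw argument words, and for a path argument that word is only an address inside the sandboxed process. The two structs below mirror the field layout of Linux's `seccomp_notif` and `seccomp_data`. The helper `path_pointer` and the use of x86-64 syscall numbers are assumptions made for illustration; this is not Syd's actual code.

```rust
/// Mirrors Linux's `struct seccomp_data`: the raw syscall as the kernel saw it.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct SeccompData {
    pub nr: i32,                  // system call number
    pub arch: u32,                // AUDIT_ARCH_* value of the calling process
    pub instruction_pointer: u64, // where the syscall was made
    pub args: [u64; 6],           // raw argument words (may be pointers)
}

/// Mirrors Linux's `struct seccomp_notif`: one pending user notification.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct SeccompNotif {
    pub id: u64,    // cookie identifying this notification
    pub pid: u32,   // thread id of the sandboxed caller
    pub flags: u32,
    pub data: SeccompData,
}

// x86-64 syscall numbers (an assumption: other architectures differ).
pub const SYS_OPEN: i32 = 2;
pub const SYS_OPENAT: i32 = 257;

/// Hypothetical helper: returns the address of the path string in the
/// sandboxed process's memory. The supervisor still has to copy the bytes
/// out of that process before it can look at the path.
pub fn path_pointer(notif: &SeccompNotif) -> Option<u64> {
    match notif.data.nr {
        SYS_OPEN => Some(notif.data.args[0]),   // open(path, flags, mode)
        SYS_OPENAT => Some(notif.data.args[1]), // openat(dirfd, path, flags, mode)
        _ => None,
    }
}
```

The value returned is only a number. It becomes a usable path after the read step and the canonicalization step that the talk describes.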
And then the sandbox check happens. The sandbox check can have three outcomes: it can say 05:59.480 --> 06:05.320 the path must be hidden, in which case the sandbox process will get a "No such file or directory", 06:05.320 --> 06:10.520 it may be denied, in which case you'll get an "Operation not permitted", or it may be allowed, right? 06:10.520 --> 06:16.120 When it is allowed, now the final stage happens where we actually do the open system call via 06:16.120 --> 06:21.320 this [unclear] you see; we do [unclear] to prevent time-of-check to time-of-use 06:21.320 --> 06:29.480 vectors here, and finally we have a file descriptor, and we use the seccomp addfd ioctl to add this file 06:29.480 --> 06:35.240 descriptor to the process space, and this is the big picture. And on the right, 06:35.240 --> 06:41.000 you can see many transformations can happen over this path: you can mask a path to change the path, 06:41.000 --> 06:46.920 it may be encrypted, in which case the encryption thread will take over, it may be append-only, 06:46.920 --> 06:51.640 in which case we will force the O_APPEND flag. As you can see, the transformations 06:51.640 --> 06:57.560 can happen safely, because it all happens in Syd's process and it's safe, and finally we can 06:57.560 --> 07:03.480 randomize the file descriptor to prevent file descriptor reuse attacks. So this is the basic idea, 07:03.480 --> 07:10.280 don't worry if you don't get it, we will get to it. So why Rust? This is the main 07:10.280 --> 07:15.960 meat of the matter and why we are all here, right? 
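Looking back at the sandbox check step, the three outcomes map naturally onto a small enum. The sketch below is a minimal model under stated assumptions: the names `Verdict`, `check` and `to_result` are hypothetical, the prefix lists stand in for a real policy, and the errno values are the Linux numbers for the two error messages named in the talk.

```rust
use std::path::Path;

// Linux errno values for the two error messages mentioned in the talk.
pub const EPERM: i32 = 1; // "Operation not permitted"
pub const ENOENT: i32 = 2; // "No such file or directory"

/// The three possible outcomes of a sandbox check on a canonical path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Deny,
    Hide,
}

/// Hypothetical policy lookup: hidden prefixes win over allowed prefixes,
/// and anything not listed is denied. `Path::starts_with` compares whole
/// path components, so "/home/u" does not match "/home/user2".
pub fn check(canonical: &Path, hidden: &[&str], allowed: &[&str]) -> Verdict {
    if hidden.iter().any(|&prefix| canonical.starts_with(prefix)) {
        Verdict::Hide
    } else if allowed.iter().any(|&prefix| canonical.starts_with(prefix)) {
        Verdict::Allow
    } else {
        Verdict::Deny
    }
}

/// Translate a verdict into what the sandboxed process observes:
/// success, or the errno to report back.
pub fn to_result(verdict: Verdict) -> Result<(), i32> {
    match verdict {
        Verdict::Allow => Ok(()),
        Verdict::Deny => Err(EPERM),
        Verdict::Hide => Err(ENOENT),
    }
}
```

The check takes the canonical path, never the raw string from the sandboxed process. This is why canonicalization has to come first in the pipeline.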
I started writing Sydbox 3, 07:15.960 --> 07:22.840 which became the Rust version, around three years ago, and my idea was to redesign it from scratch 07:22.840 --> 07:28.280 to make it a security boundary. Sydbox 1 was written in C, and it wasn't meant to be a 07:28.280 --> 07:35.480 security boundary, so instead of doing the conventional RIIR, rewrite it in Rust, I actually 07:35.480 --> 07:42.440 took the time to redesign it from scratch in Rust, and this way we could take advantage of many 07:42.440 --> 07:47.960 Rust goodies, right? And this is what I can recommend: don't just blindly rewrite things 07:47.960 --> 07:54.040 in Rust, redesign them from scratch with the powers of Rust, that makes it much better. And here are 07:54.040 --> 08:00.200 a few examples which I will delve deeper into in a bit. Of course memory safety is one of the 08:00.200 --> 08:05.480 prime features of Rust, right? And in many modules, we have this forbid unsafe_code clause, 08:05.480 --> 08:11.640 so unsafe code is outright forbidden, and there are other goodies when you are working with 08:11.640 --> 08:17.320 untrusted data, like in the ELF parser and the glob matcher, I'm going to delve into it in a bit. 08:17.320 --> 08:25.960 ELF is the executable file format of Unixes, and Syd parses ELF to do some restrictions, 08:25.960 --> 08:31.240 and it's completely untrusted data, right? So things like forbidding arithmetic side effects, 08:31.240 --> 08:39.400 truncations or wrapping help ensure that a malicious ELF cannot crash the sandbox, right? 
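To make the point about arithmetic on untrusted data concrete, here is a minimal sketch of how a parser might compute the byte range of a header table when every input comes from the file itself. The function `table_range` and the `ParseError` type are hypothetical; the lint names are existing Clippy lints, and the sketch assumes nothing about how Syd's ELF parser is actually written.

```rust
use std::ops::Range;

/// Errors a parser can return instead of wrapping or panicking.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    Overflow,
    OutOfBounds,
}

/// Computes the byte range of a table with `entry_count` entries of
/// `entry_size` bytes starting at `offset`. All three values are read from
/// the file, so none of them can be trusted.
#[forbid(unsafe_code)]
#[deny(clippy::arithmetic_side_effects, clippy::cast_possible_truncation)]
pub fn table_range(
    offset: u64,
    entry_size: u16,
    entry_count: u16,
    file_len: u64,
) -> Result<Range<u64>, ParseError> {
    // Widen losslessly with `u64::from` rather than `as`, then use checked
    // operations so an overflow becomes an error value.
    let size = u64::from(entry_size)
        .checked_mul(u64::from(entry_count))
        .ok_or(ParseError::Overflow)?;
    let end = offset.checked_add(size).ok_or(ParseError::Overflow)?;
    if end > file_len {
        return Err(ParseError::OutOfBounds);
    }
    Ok(offset..end)
}
```

With the lints set to deny, writing `offset + size` in this function fails the Clippy run, so the checked form is enforced by tooling and does not depend on review.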
And 08:39.400 --> 08:47.480 moreover, we use the type system of Rust to our advantage, and the main use case is, we have a safe 08:47.480 --> 08:54.360 interface for Linux's mseal system call to seal a memory region so it's immutable, and I'll 08:54.440 --> 08:59.640 delve into it a bit later, and we have this generic SealBox type that turns into Sealed 08:59.640 --> 09:05.480 when it's sealed. And ownership is, of course, another prime feature of Rust, and in Sydbox, 09:06.280 --> 09:12.760 we use this mostly for file descriptors, and this is fantastic, because file descriptor leaks are a 09:12.760 --> 09:18.680 huge problem in container security, and in Rust you can actually make your compiler work for 09:18.680 --> 09:27.160 you, and prevent these file descriptor leaks, right? And of course, the other two are things we all know, 09:27.160 --> 09:33.240 zero-cost abstractions and fearless concurrency. As I said, Syd can handle system calls simultaneously, 09:33.240 --> 09:38.680 so Syd is a multi-threaded process that can have many emulator threads to handle system calls. 09:39.560 --> 09:45.640 So, let's dive a bit deeper into memory safety patterns. The typestate pattern: 09:45.640 --> 09:53.960 this SealBox, as I said, is an interface to Linux's mseal system call, and what we do is 09:53.960 --> 10:00.920 we mark the sandbox policy as immutable when it's locked, such that a compromised Syd cannot 10:00.920 --> 10:07.000 edit the sandbox anymore. And as you can see here, there is an enum Sealable, 10:07.000 --> 10:14.120 which is generic over the type T, and it has two variants, Unsealed and Sealed. In the default 10:14.120 --> 10:20.040 Unsealed state, you can edit it as you wish, and then the one-way, idempotent seal function 10:20.040 --> 10:26.040 is called, and then it turns into Sealed; after that you can only read it, you can no longer edit it. 
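The enum described above can be sketched in a few lines of safe Rust. This is only a model of the state machine: the method names are assumptions, and the real type would additionally place the value in memory protected with `mseal`, which this sketch does not attempt.

```rust
/// A value that can be edited until it is sealed, and only read afterwards.
#[derive(Debug)]
pub enum Sealable<T> {
    Unsealed(T),
    Sealed(T),
}

impl<T> Sealable<T> {
    /// Values start out in the editable state.
    pub fn new(value: T) -> Self {
        Sealable::Unsealed(value)
    }

    /// One-way and idempotent: sealing an already sealed value is a no-op,
    /// and no method goes back to `Unsealed`.
    pub fn seal(self) -> Self {
        match self {
            Sealable::Unsealed(value) | Sealable::Sealed(value) => Sealable::Sealed(value),
        }
    }

    /// Reading is always allowed.
    pub fn get(&self) -> &T {
        match self {
            Sealable::Unsealed(value) | Sealable::Sealed(value) => value,
        }
    }

    /// Mutable access exists only while unsealed.
    pub fn get_mut(&mut self) -> Option<&mut T> {
        match self {
            Sealable::Unsealed(value) => Some(value),
            Sealable::Sealed(_) => None,
        }
    }

    pub fn is_sealed(&self) -> bool {
        matches!(self, Sealable::Sealed(_))
    }
}
```

Because `seal` takes `self` by value, the unsealed handle is consumed by the call and the caller keeps only the sealed one.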
10:26.040 --> 10:32.040 So, as you can see in the seal function, this is how we do it safely in Rust. 10:32.920 --> 10:40.200 The ELF parser is another example; as I said, it works on completely untrusted data, 10:40.200 --> 10:48.200 so we have a handful of forbid lints to forbid anything that can be a DoS attack, 10:49.240 --> 10:59.480 and I actually ran the ELF parser over a set of 68,000 malware samples from VirusShare, and it didn't crash. 10:59.480 --> 11:05.880 So I'm fairly confident it does the right thing. And let's talk a bit about safety and performance 11:06.040 --> 11:13.240 and what the trade-offs are. In my experience, performance is no excuse to use unsafe code, 11:13.240 --> 11:21.160 which is not really a common view among Rust people, as far as I can see. And here is a very good example: 11:21.160 --> 11:27.560 globbing, if you don't know, means file name matching; if you have ever written in a shell and used the 11:27.560 --> 11:32.840 characters star or question mark, this is what you're using, it's a bit similar to regular 11:32.840 --> 11:39.480 expressions, but not quite. And the original glob matcher of Syd was inherited from rsync, 11:39.480 --> 11:46.200 and it was written exactly 40 years ago in 1986. And as a 40th birthday present to rsync, 11:46.200 --> 11:51.720 I rewrote this algorithm using Kirk Krauss's fast wildcard compare algorithm, which is known to be 11:51.720 --> 11:58.600 the fastest out there. And here is a nice example of the benchmarks on two million test cases I generated. 11:58.600 --> 12:05.240 And the wildmatch code has no unsafe at all. And it performs almost two and a half times faster 12:05.240 --> 12:11.640 than libc's fnmatch, which is C, and has to be fast, right? But this is not the case. 12:11.640 --> 12:20.520 And another thing we use to reduce small allocations is custom path types. Syd mostly works 12:20.520 --> 12:29.400 on small strings, right? 
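Going back to the glob matcher for a moment, a wildcard matcher needs neither recursion nor unsafe code. The sketch below is a plain iterative matcher for `*` and `?` over bytes that remembers the most recent star and retries from there on a mismatch. It is an illustration of the general approach only: it is not Syd's wildmatch code and not a faithful port of the algorithm named in the talk, and it ignores character classes, escapes and any special treatment of `/`.

```rust
/// Returns true if `text` matches `pattern`, where `*` matches any run of
/// bytes (including none) and `?` matches exactly one byte.
#[forbid(unsafe_code)]
pub fn wild_match(pattern: &[u8], text: &[u8]) -> bool {
    let mut p = 0usize; // index into pattern
    let mut t = 0usize; // index into text
    // Most recent star: (pattern index after it, text index it has consumed up to).
    let mut backtrack: Option<(usize, usize)> = None;

    while t < text.len() {
        match pattern.get(p) {
            // Star: first try matching zero bytes, remember where to resume.
            Some(b'*') => {
                backtrack = Some((p + 1, t));
                p += 1;
            }
            // Literal byte or `?`: consume one byte from both sides.
            Some(&c) if c == b'?' || c == text[t] => {
                p += 1;
                t += 1;
            }
            // Mismatch or pattern exhausted: let the last star take one more byte.
            _ => {
                let Some((star_p, star_t)) = backtrack else {
                    return false;
                };
                p = star_p;
                t = star_t + 1;
                backtrack = Some((star_p, star_t + 1));
            }
        }
    }
    // Text is consumed; only trailing stars may remain in the pattern.
    pattern[p..].iter().all(|&c| c == b'*')
}
```

All indexing goes through bounds-checked slice access, so a malformed pattern can at worst produce `false`. It cannot cause an out-of-bounds read.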
And this tiny vector module allows us to store the small strings on the stack. 12:29.400 --> 12:36.040 And only when it overflows will it be allocated on the heap. So this cuts a lot of small allocations 12:36.040 --> 12:42.600 that Syd does. And this path type is a corresponding dynamically sized type, which is pretty much similar 12:42.600 --> 12:49.240 to the standard library's Path, but it has comparisons with SIMD, so [unclear] is much faster. 12:49.240 --> 12:58.840 So, our testing infrastructure. Sydbox is a portable sandbox. It only runs on Linux, but it runs on 12:58.840 --> 13:04.920 most architectures that libseccomp supports. Some of them are here. And we have a multi-architecture 13:04.920 --> 13:11.640 pipeline that tries to test all of them. And Syd is also a multi-personality sandbox, 13:11.640 --> 13:17.560 which means you can trace a 32-bit process from a 64-bit Syd just fine. And again, we have 13:17.720 --> 13:23.160 cross-compile tests for that. And another nice benefit is when you're writing a kernel, everyone's 13:23.160 --> 13:29.400 test is your test, right? So on Exherbo Linux, we have package testing on by default, and everything runs 13:29.400 --> 13:36.360 under the sandbox. And if a test fails under the sandbox, but passes without, it's a sandbox bug. 13:36.360 --> 13:44.440 And we also run the Linux Test Project's syscall test suite, which has over 4,000 tests, 13:44.440 --> 13:50.920 and gnulib's POSIX compatibility tests. So we are fairly certain Syd does what Linux would do. 13:50.920 --> 13:57.880 Right? And the idea is to just be a thin layer. And another thing is, of course, security, 13:57.880 --> 14:05.080 this is a security boundary. And for every sandbox escape we found in the past, we have an 14:05.080 --> 14:14.360 integration test that makes sure it doesn't reappear. Yeah. So this is pretty much all 14:14.360 --> 14:21.320 I have. Here is our GitLab. The code is GPL 3 and forever free. 
So feel free to do whatever you 14:21.320 --> 14:27.160 want with it. We have extensive documentation in the form of manual pages. And if you have any questions 14:27.160 --> 14:31.880 you could not ask here, come over to IRC or Matrix and ask. And finally, thanks to Fender, 14:31.880 --> 14:36.920 one more data for sponsoring my attendance. That's all I have. I can take questions now. 14:44.360 --> 15:04.680 Thank you very much. Sorry if I missed the point, but I wanted to ask, is there already some 15:04.680 --> 15:12.280 kind of tooling to define rules to run an application under these things, like what the application 15:12.280 --> 15:17.960 can access, what should be forbidden or hidden from it? Yes, yes, yes, exactly. 15:17.960 --> 15:24.040 Sydbox works with text-based policies. And there are over 30 categories of access, right? 15:24.040 --> 15:29.560 So you can say allow read for this path, or allow write; read, write, exec and all that 15:29.560 --> 15:34.520 are all categories. And you can configure them in a text-based policy. And you can load this 15:34.520 --> 15:40.680 into the runtime. You can also configure it dynamically on the run, so both of them are possible. 15:42.280 --> 15:50.200 Thank you for the question. I have a small one. Okay. You are showing two Clippy lines. 15:50.200 --> 15:55.880 I'm sorry? You are showing two Clippy lines that I'm not very familiar with. What are those? 15:55.880 --> 16:01.640 And what do they prevent? This one, yeah? This one. Yes, that one. 16:02.680 --> 16:06.760 Forbid unsafe_code, that we all know, right? 16:06.840 --> 16:13.960 Arithmetic side effects is for when you multiply two numbers, for example, and the type 16:13.960 --> 16:21.080 overflows; what's going to happen, right? Or you are trying to multiply two numbers that won't fit 
All of these arithmetic that can have side effects this way, right? Rapping or 16:27.480 --> 16:34.200 overflowing, you know, all that. And in the Elf parser, like imagine the Elf is completely 16:34.200 --> 16:39.640 untrusted. The size can be wrong. Everything can be wrong. You have to work with this untrusted 16:39.640 --> 16:46.280 data and this helps with it all that, right? To not overcome the boundaries that I can explain 16:46.280 --> 16:52.760 like that. And there are many more forbidden clauses. You can feel free to take a look at the code. 16:52.760 --> 16:59.560 I have comments there. Thank you for the question. Have one there? 17:00.440 --> 17:07.880 Just one small one. Why you had those protection for arithmetical operation? 17:07.880 --> 17:12.440 I'm sorry. Why you had those protection for arithmetical operation? The library did 17:12.440 --> 17:19.320 improve by out of the box or? I don't understand the question. Yes, in a slide before. 17:19.320 --> 17:23.560 The boss protects in your mind, right? Yeah. Yeah, that's fine. 17:23.960 --> 17:29.880 Brad, I'm sorry. Yeah, we need to specify the. This basically means 17:29.880 --> 17:36.040 during parsing the Elf is untrusted, right? And anytime and overflow happens, it will return 17:36.040 --> 17:40.600 an error. It will not overflow or do something undefined with it or things like this. 17:40.600 --> 17:45.000 So you are preventing all the undefined behavior. And instead, you are returning an error. 17:45.000 --> 17:50.280 It's simple as the case. Yeah. And why you should be employed? I'm sorry. 17:50.280 --> 17:56.120 And why you should be employed? Pink Floyd, yeah? I mean, it's said box. It's a better 17:56.120 --> 18:01.960 trite, so pink Floyd. Pink Floyd is a master. Thank you. Thank you for the question. 18:06.440 --> 18:09.800 Any other questions? You're good, yeah. 18:13.320 --> 18:15.800 All right, so thank you very much. Thank you, everyone.