Thank you. So, everyone, I'm Mosè Giordano from UCL. And I'm Jules from Ghent University. We'll be talking about how to accelerate scientific code using AI hardware and the Julia package Reactant.jl.

You know that nowadays AI hardware dominates the modern computing world: if you wait a few weeks you will see news about some huge spending on AI from different vendors and different AI providers, not just GPUs but also TPUs and other hardware, and these announcements even move the world markets. So there is a lot of this compute available, but it's mainly used for AI, and we want to also use it for regular, old-school scientific computing.

So, how do you use these accelerators? If you use Python, you will probably rewrite your existing code in JAX or PyTorch. But we are Julia users, and Julia is a compiled language based on LLVM, so that approach won't work for us; we need to do something else.

Taking a step back, an important thing in modern scientific computing is automatic differentiation. Why is this important? Derivatives are everywhere in physical and natural laws, so if you can compute derivatives very fast you can accelerate your code a lot. It's also useful for machine learning applications, because you can do backpropagation, or Bayesian inference. You can see some applications here, like protein folding, reconstructing a Hello Kitty image, or accelerating black hole imaging from the Event Horizon Telescope. In that last case, they were able to go from an analysis taking one week on a cluster to, when they used automatic differentiation, a compute time of one hour on a single CPU core. That's quite a speedup.

How does automatic differentiation work in most pipelines? You take your source code, which could be written in C++, Julia, Rust, or whatever language you like.
You apply automatic differentiation to the source code, then that's lowered into some intermediate representation of the compiler that can be optimized, and then you generate the native code. But this can be inefficient. A more efficient way to compute derivatives is the Enzyme approach. Enzyme is an automatic differentiation engine based on LLVM which works after optimization: you take your code, you lower it to the compiler's intermediate representation, you optimize it, and only then do you apply automatic differentiation; afterwards you can possibly apply even further optimizations before eventually generating the native code.

Why is this important? As a case study of how the Enzyme approach speeds up derivatives, take the normalization of a vector. You have a function which takes an input and an output array, loops over all the elements, and sets each element of the output array to the corresponding element of the input array divided by the norm of the input array. The problem is that you're computing the normalization factor in each iteration of the loop. The easiest fix is to just hoist the normalization out of the loop, but you need to remember to do this; you would rather let the compiler do it for you.

If you do automatic differentiation before optimization, and you forgot to hoist the normalization out of the loop, you differentiate the function with the norm still inside the loop, so the cost of the derivative is O(n²), like the original. And after that, optimization can no longer hoist the factor out of the loop, because in the derivative it depends on the iteration. Instead, if you swap the order of automatic differentiation and optimization, the optimization runs first, so your friendly compiler can hoist the normalization out of the loop, and the derivative in this case becomes an O(n) derivative. So you get a much faster automatic derivative by moving the auto-diff stage after optimization.
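To make the normalization example above concrete, here is a minimal Julia sketch of the two versions being discussed; the function names are just illustrative, not the actual code from the slides.

    # Naive version: the norm of x is recomputed in every iteration,
    # so the cost is O(n^2), and the derivative generated from this
    # form stays O(n^2) as well.
    function normalize_naive!(out, x)
        for i in eachindex(x, out)
            out[i] = x[i] / sqrt(sum(abs2, x))
        end
        return out
    end

    # Hoisted version: the loop-invariant norm is moved out of the loop
    # (by the programmer or by the optimizer); differentiating this form
    # gives an O(n) derivative.
    function normalize_hoisted!(out, x)
        nrm = sqrt(sum(abs2, x))
        for i in eachindex(x, out)
            out[i] = x[i] / nrm
        end
        return out
    end

Differentiating normalize_naive! keeps the norm computation (and its adjoint) inside the loop, while differentiating normalize_hoisted! touches it only once per call, which is exactly the argument for running auto-diff after the optimizer.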
The scientific motivating problem for applying all of these concepts together was Oceananigans. Oceananigans is an ocean model written in Julia, and it's very fast. It has been run on up to 768 GPUs, demonstrating very high memory and energy efficiency compared to other competitors. You can also get some cool pictures like this: a very high resolution simulation of the ocean evolving on the globe.

About GPU programming via LLVM, which is again one of the most common compiler frameworks: the problem of doing GPU programming at the LLVM level is that LLVM IR is very low level. It doesn't have a good high-level representation of parallelism, which makes some optimizations very hard or just impossible to do. Also, specifically for GPU programming, the device code and the host code live in two separate modules, so you have the host code calling what is basically a black box running on the GPU, and there is no way to optimize across the device and the host code. You cannot, for example, hoist a function out of a loop inside the device code, because there is no communication between the two.

If we look more closely at scientific codes like Oceananigans, there are about 270 kernels which look like this, where you're basically writing stencil code. This is the most natural way to write the code if you start from the formulas in a paper. But if you squint very hard, you realize that these stencil codes are actually convolutions, and machine learning accelerators are very good at convolutions; TPUs in particular are designed to do convolutions very efficiently. So one way would be to rewrite your code to express these stencils as convolutions, or you hope that your friendly compiler will do that for you. Again, it can be difficult to recognize this pattern, and if there are optimizations which require communication between the host code and the device code, like in the previous slide, this can be very hard to do.

MLIR, instead, is a compiler framework which is based on LLVM but is more high level. It has a better understanding of the structure in your code and of parallelism, and crucially, in MLIR the host code and the device code live in the same module. So if you take the LLVM IR and raise it into MLIR, you can do the kind of optimizations we were talking about before, like hoisting a function out of a loop when you know it's constant.
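As an illustration of the kind of stencil kernel being described, here is a simplified KernelAbstractions.jl-style sketch (an assumed example, not an actual Oceananigans kernel). The same computation is a 1-D convolution with weights (1, -2, 1)/dx², which is the pattern an accelerator like a TPU can run as a convolution.

    using KernelAbstractions

    # Second-derivative stencil: out[i] = (u[i-1] - 2u[i] + u[i+1]) / dx^2.
    # Written as a kernel it is plain stencil code, but it is equivalent to
    # convolving u with the kernel [1, -2, 1] / dx^2.
    @kernel function laplacian!(out, @Const(u), dx)
        i = @index(Global)
        if 1 < i < length(u)
            out[i] = (u[i-1] - 2u[i] + u[i+1]) / dx^2
        end
    end

    u   = rand(Float32, 1024)
    out = zero(u)
    laplacian!(CPU(), 256)(out, u, 0.1f0; ndrange = length(u))
    KernelAbstractions.synchronize(CPU())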
Also, MLIR is aware of parallelism and so can optimize with that in mind. And if you apply automatic differentiation at this MLIR level, you also get much faster derivatives, which, as you have seen before, can give a big speedup to your application. And now Jules will talk about the Julia package which brings all of this together.

Yes, so MLIR is very cool, but you need a way to actually generate it. Reactant.jl is a Julia package that compiles Julia code to MLIR; the MLIR code can then be compiled using the XLA compiler. We also have these nice optimizations on top, and we can do automatic differentiation.

There are actually two ways to get MLIR from Julia code. The first is to take existing kernels and automatically raise them to MLIR. In this example we have a stencil kernel. It's lowered to LLVM using the normal compilation flow, and then we have a pass that raises this to vendor-agnostic MLIR operations. At that point we try to recognize scalar operations and group them together into tensor-level operations, and then we do additional optimizations. You can see that at the end of this compilation flow we end up with a convolution operation, and this can be executed very efficiently on, for example, Google TPUs.

The other way to generate MLIR is to use Reactant's tracing engine. You can write a regular Julia function, and then at the call site you pass it special array types. Reactant.to_rarray creates a Reactant array, and this makes it so that every function invocation with a Reactant array operand will actually record an MLIR operation in the compiled MLIR.

This approach doesn't only work for array operands; you can also have arbitrary structures. For example, here there's a function that takes two points, and each point has two fields. When you trace this to MLIR, the MLIR code will have flattened each point into its two constituent tensors. It's also possible to return structs, and in this case the flattened representation will be returned by the MLIR code, and there's a bridge to Julia that will reconstruct the structure.
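As a minimal sketch of the tracing workflow just described (the function and array sizes are just an assumed example, not code from the talk):

    using Reactant

    # A regular Julia function; nothing Reactant-specific in its definition.
    f(x, y) = sum(x .* y)

    # Wrap plain Julia arrays into Reactant arrays.
    x = Reactant.to_rarray(rand(Float32, 100))
    y = Reactant.to_rarray(rand(Float32, 100))

    # Tracing f with Reactant-array operands records MLIR (StableHLO)
    # operations, which are then optimized and compiled with XLA.
    f_compiled = Reactant.@compile f(x, y)
    f_compiled(x, y)

Once compiled, f_compiled can be called again with arrays of the same shape and element type without re-tracing.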
It's also possible to trace control flow. In this case we've added this @trace macro in front of a for loop, and this makes it so that there's a loop in the generated MLIR code as well. If you don't include @trace, it will instead effectively unroll the loop, tracing through each iteration and adding its operations. There's an asterisk there because this is still a bit of a flaky system; we're working on doing this automatically, that is, generating these loops without the annotation.

Lastly, some people might know that StableHLO is a pure format: operations don't have side effects, so it's not possible for an operation to mutate its operands, while in Julia code this is possible. This function sets the first element of its operand to 20 and doesn't return anything, but in the traced MLIR code that can't be expressed directly. So there's actually an operation that creates a new array in which the first element is updated, and returns it; and then, in the binding from Julia to the compiler backend, that result value is written back into the original operand.

This was just a very quick overview of what Reactant is capable of, but there's more. There are a lot of optimizations on top of the traced code; it's also possible to automatically distribute the execution; you can automatically differentiate using Enzyme, as Mosè already talked about; and I focused on the Julia front end, but it's actually possible to do this with C++ code as well.

So this is what we're doing with Oceananigans: we take existing Julia code from Oceananigans, we compile it using Reactant, and then we can automatically distribute it on a very large number of Google TPUs. We're still working on performance, especially for the multi-node case, we want to get this line down, but we're getting there. I don't want to go into too much detail in the conclusion, to leave time for questions, but Reactant is very cool and it's open source, so you can try it. Thank you.
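Going back to the control-flow tracing described above, here is a minimal sketch using Reactant's @trace macro (an assumed example, not code from the talk): with @trace, the loop below stays a single loop operation in the generated MLIR; without it, tracing would unroll it into ten copies of the body.

    using Reactant

    function scale_ten_times!(x)
        # The in-place broadcast is traced as a functional update, as
        # described above, and the loop is kept as one MLIR loop op.
        Reactant.@trace for i in 1:10
            x .= 2f0 .* x
        end
        return x
    end

    x = Reactant.to_rarray(ones(Float32, 8))
    scale! = Reactant.@compile scale_ten_times!(x)
    scale!(x)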
I think it's reliable. Oceananigans has hundreds of kernels, and these are successfully raised. Other people are using it and it works. And if there's a bug you can report it, or if there's something missing in the raising pipeline, you can report it and we'll add it.

So, does Reactant lower down into an MLIR dialect, and can you compose it with other dialects?

Yeah. So Reactant lowers into StableHLO, and the question is whether you can combine this with other dialects. The answer is not really: StableHLO is an MLIR dialect, but it contains everything you need. You can also target different dialects, like for example Triton; there's some experimentation going on, but StableHLO is the main thing.

Yeah, so the question was whether we can also target the MPI MLIR dialect: yes. We haven't tested it at a large scale, but the support is there to take MPI operations in the Julia code and raise them to the MPI MLIR dialect.

No more questions? Yeah. So the question is whether it's recommended to do mutation on arrays or not. I think it's actually fine style, especially since Reactant sidesteps the Julia garbage collector: it does all the memory allocations at the MLIR level, so it doesn't use the Julia runtime. In normal Julia code you would want to reuse as much memory as possible, so you use mutation a lot; here you can still do it, and it's still fine if you want to. Yeah, and maybe also: if you do mutation within a function, the backend compiler will in many cases realize that it can reuse the memory, so it won't actually materialize too many copies. Yeah, it does automatic donation: XLA has this concept of donating an argument to a function so that its memory can be reused, and Reactant is able to do this automatically.

Yeah? So, you mentioned that structs get unwrapped at the boundary where the code goes into MLIR and then wrapped back for Julia. If you have multiple functions that all use the same struct, does that happen every time?
So, yeah, the question is whether, if you have multiple functions that take a point, it needs to generate that code multiple times. The code to unwrap and wrap will actually be cached, I believe, when you run it. But if you compile a Reactant function, you typically compile the whole Reactant program, so you only unwrap at the beginning and wrap at the end; during execution you can call different functions without any unwrapping happening.

So the question is whether it's all ahead of time, or whether you can use Julia's JIT. You can use Julia's JIT: if you redefine the function and you call it again with the @jit macro that Reactant has, it will actually recompile the function, which is kind of what Julia does as well; when you run a function, it compiles it just in time.

Thank you very much.
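As a minimal sketch of the recompilation behaviour described in that last answer (an assumed example, not code from the talk):

    using Reactant

    f(x) = sum(abs2, x)
    x = Reactant.to_rarray(rand(Float32, 16))

    Reactant.@jit f(x)   # traces f, compiles it with XLA, and runs it

    f(x) = sum(abs, x)   # redefine f
    Reactant.@jit f(x)   # the new definition is traced and compiled again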