Thank you. So, everyone, I'm Mosè Giordano from UCL. And I'm Jules from Ghent University. We'll be talking about how to accelerate scientific code using AI hardware and the Julia package Reactant.jl.

You know that nowadays AI hardware dominates the modern computing world: if you wait a few weeks you will see news about some huge spending on AI from different vendors and different AI providers, not just GPUs but also TPUs and other hardware, and these announcements even move the world markets. So there is a lot of this compute available, but it's mainly used for AI, and we want to also use it for regular, old-school scientific computing.

So, how do you use these accelerators? If you use Python, you will probably rewrite your existing code in JAX or PyTorch. But we are Julia users, and Julia is a compiled language based on LLVM, so that approach won't work for us; we need to do something else.

Taking a step back, an important thing in modern scientific computing is automatic differentiation. Why is this important? Derivatives are everywhere in physical and natural laws, so if you can compute derivatives very fast you can accelerate your code a lot. It's also useful for machine learning applications, because you can do backpropagation, or Bayesian inference. You can see some applications here, like protein folding, reconstructing a Hello Kitty image, or accelerating black hole imaging from the Event Horizon Telescope. In that last case, they were able to go from an analysis taking one week on a cluster to, when they used automatic differentiation, a compute time of one hour on a single CPU core. That's quite a speedup.

How does automatic differentiation work in most pipelines? You take your source code, which could be written in C++, Julia, Rust, or whatever language you like.
You apply automatic differentiation to the source code, then that's lowered into some intermediate representation of the compiler that can be optimized, and then you generate the native code. But this can be inefficient. A more efficient way to compute derivatives is the Enzyme approach. Enzyme is an automatic differentiation engine based on LLVM which works after optimization: you take your code, you lower it to the compiler's intermediate representation, you optimize it, and only then do you apply automatic differentiation; afterwards you can possibly apply even further optimizations before eventually generating the native code.

Why is this important? As a case study of how the Enzyme approach speeds up derivatives, take the normalization of a vector. You have a function which takes an input and an output array, loops over all the elements, and sets each element of the output array to the corresponding element of the input array divided by the norm of the input array. The problem is that you're computing the normalization factor in each iteration of the loop. The easiest fix is to just hoist the normalization out of the loop, but you need to remember to do this; you would rather let the compiler do it for you.

If you do automatic differentiation before optimization, and you forgot to hoist the normalization out of the loop, you differentiate the function with the norm still inside the loop, so the cost of the derivative is O(n²), like the original. And after that, optimization can no longer hoist the factor out of the loop, because in the derivative it depends on the iteration. Instead, if you swap the order of automatic differentiation and optimization, the optimization runs first, so your friendly compiler can hoist the normalization out of the loop, and the derivative in this case becomes an O(n) derivative. So you get a much faster automatic derivative by moving the auto-diff stage after optimization.
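To make the normalization example above concrete, here is a minimal Julia sketch of the two versions being discussed; the function names are just illustrative, not the actual code from the slides.

    # Naive version: the norm of x is recomputed in every iteration,
    # so the cost is O(n^2), and the derivative generated from this
    # form stays O(n^2) as well.
    function normalize_naive!(out, x)
        for i in eachindex(x, out)
            out[i] = x[i] / sqrt(sum(abs2, x))
        end
        return out
    end

    # Hoisted version: the loop-invariant norm is moved out of the loop
    # (by the programmer or by the optimizer); differentiating this form
    # gives an O(n) derivative.
    function normalize_hoisted!(out, x)
        nrm = sqrt(sum(abs2, x))
        for i in eachindex(x, out)
            out[i] = x[i] / nrm
        end
        return out
    end

Differentiating normalize_naive! keeps the norm computation (and its adjoint) inside the loop, while differentiating normalize_hoisted! touches it only once per call, which is exactly the argument for running auto-diff after the optimizer.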
The scientific motivating problem for applying all of these concepts together was Oceananigans. Oceananigans is an ocean model written in Julia, and it's very fast. It has been run on up to 768 GPUs, demonstrating very high memory and energy efficiency compared to other competitors. You can also get some cool pictures like this: a very high resolution simulation of the ocean evolving on the globe.

About GPU programming via LLVM, which is again one of the most common compiler frameworks: the problem of doing GPU programming at the LLVM level is that LLVM IR is very low level. It doesn't have a good high-level representation of parallelism, which makes some optimizations very hard or just impossible to do. Also, specifically for GPU programming, the device code and the host code live in two separate modules, so you have the host code calling what is basically a black box running on the GPU, and there is no way to optimize across the device and the host code. You cannot, for example, hoist a function out of a loop inside the device code, because there is no communication between the two.

If we look more closely at scientific codes like Oceananigans, there are about 270 kernels which look like this, where you're basically writing stencil code. This is the most natural way to write the code if you start from the formulas in a paper. But if you squint very hard, you realize that these stencil codes are actually convolutions, and machine learning accelerators are very good at convolutions; TPUs in particular are designed to do convolutions very efficiently. So one way would be to rewrite your code to express these stencils as convolutions, or you hope that your friendly compiler will do that for you. Again, it can be difficult to recognize this pattern, and if there are optimizations which require communication between the host code and the device code, like in the previous slide, this can be very hard to do.

MLIR, instead, is a compiler framework which is based on LLVM but is more high level. It has a better understanding of the structure in your code and of parallelism, and crucially, in MLIR the host code and the device code live in the same module. So if you take the LLVM IR and raise it into MLIR, you can do the kind of optimizations we were talking about before, like hoisting a function out of a loop when you know it's constant.
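As an illustration of the kind of stencil kernel being described, here is a simplified KernelAbstractions.jl-style sketch (an assumed example, not an actual Oceananigans kernel). The same computation is a 1-D convolution with weights (1, -2, 1)/dx², which is the pattern an accelerator like a TPU can run as a convolution.

    using KernelAbstractions

    # Second-derivative stencil: out[i] = (u[i-1] - 2u[i] + u[i+1]) / dx^2.
    # Written as a kernel it is plain stencil code, but it is equivalent to
    # convolving u with the kernel [1, -2, 1] / dx^2.
    @kernel function laplacian!(out, @Const(u), dx)
        i = @index(Global)
        if 1 < i < length(u)
            out[i] = (u[i-1] - 2u[i] + u[i+1]) / dx^2
        end
    end

    u   = rand(Float32, 1024)
    out = zero(u)
    laplacian!(CPU(), 256)(out, u, 0.1f0; ndrange = length(u))
    KernelAbstractions.synchronize(CPU())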
Also, MLIR is aware of parallelism and so can optimize with that in mind. And if you apply automatic differentiation at this MLIR level, you also get much faster derivatives, which, as you have seen before, can give a big speedup to your application. And now Jules will talk about the Julia package which brings all of this together.

Yes, so MLIR is very cool, but you need a way to actually generate it. Reactant.jl is a Julia package that compiles Julia code to MLIR; the MLIR code can then be compiled using the XLA compiler. We also have these nice optimizations on top, and we can do automatic differentiation.

There are actually two ways to get MLIR from Julia code. The first is to take existing kernels and automatically raise them to MLIR. In this example we have a stencil kernel. It's lowered to LLVM using the normal compilation flow, and then we have a pass that raises this to vendor-agnostic MLIR operations. At that point we try to recognize scalar operations and group them together into tensor-level operations, and then we do additional optimizations. You can see that at the end of this compilation flow we end up with a convolution operation, and this can be executed very efficiently on, for example, Google TPUs.

The other way to generate MLIR is to use Reactant's tracing engine. You can write a regular Julia function, and then at the call site you pass it special array types. Reactant.to_rarray creates a Reactant array, and this makes it so that every function invocation with a Reactant array operand will actually record an MLIR operation in the compiled MLIR.

This approach doesn't only work for array operands; you can also have arbitrary structures. For example, here there's a function that takes two points, and each point has two fields. When you trace this to MLIR, the MLIR code will have flattened each point into its two constituent tensors. It's also possible to return structs, and in this case the flattened representation will be returned by the MLIR code, and there's a bridge to Julia that will reconstruct the structure.
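As a minimal sketch of the tracing workflow just described (the function and array sizes are just an assumed example, not code from the talk):

    using Reactant

    # A regular Julia function; nothing Reactant-specific in its definition.
    f(x, y) = sum(x .* y)

    # Wrap plain Julia arrays into Reactant arrays.
    x = Reactant.to_rarray(rand(Float32, 100))
    y = Reactant.to_rarray(rand(Float32, 100))

    # Tracing f with Reactant-array operands records MLIR (StableHLO)
    # operations, which are then optimized and compiled with XLA.
    f_compiled = Reactant.@compile f(x, y)
    f_compiled(x, y)

Once compiled, f_compiled can be called again with arrays of the same shape and element type without re-tracing.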
It's also possible to trace control flow. In this case we've added this @trace macro in front of a for loop, and this makes it so that there's a loop in the generated MLIR code as well. If you don't include @trace, it will instead effectively unroll the loop, tracing through each iteration and adding its operations. There's an asterisk there because this is still a bit of a flaky system; we're working on doing this automatically, that is, generating these loops without the annotation.

Lastly, some people might know that StableHLO is a pure format: operations don't have side effects, so it's not possible for an operation to mutate its operands, while in Julia code this is possible. This function sets the first element of its operand to 20 and doesn't return anything, but in the traced MLIR code that can't be expressed directly. So there's actually an operation that creates a new array in which the first element is updated, and returns it; and then, in the binding from Julia to the compiler backend, that result value is written back into the original operand.

This was just a very quick overview of what Reactant is capable of, but there's more. There are a lot of optimizations on top of the traced code; it's also possible to automatically distribute the execution; you can automatically differentiate using Enzyme, as Mosè already talked about; and I focused on the Julia front end, but it's actually possible to do this with C++ code as well.

So this is what we're doing with Oceananigans: we take existing Julia code from Oceananigans, we compile it using Reactant, and then we can automatically distribute it on a very large number of Google TPUs. We're still working on performance, especially for the multi-node case, we want to get this line down, but we're getting there. I don't want to go into too much detail in the conclusion, to leave time for questions, but Reactant is very cool and it's open source, so you can try it. Thank you.
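Going back to the control-flow tracing described above, here is a minimal sketch using Reactant's @trace macro (an assumed example, not code from the talk): with @trace, the loop below stays a single loop operation in the generated MLIR; without it, tracing would unroll it into ten copies of the body.

    using Reactant

    function scale_ten_times!(x)
        # The in-place broadcast is traced as a functional update, as
        # described above, and the loop is kept as one MLIR loop op.
        Reactant.@trace for i in 1:10
            x .= 2f0 .* x
        end
        return x
    end

    x = Reactant.to_rarray(ones(Float32, 8))
    scale! = Reactant.@compile scale_ten_times!(x)
    scale!(x)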
I think it's reliable. Oceananigans has hundreds of kernels, and these are successfully raised. Other people are using it and it works. And if there's a bug you can report it, or if there's something missing in the raising pipeline, you can report it and we'll add it.

So, does Reactant lower down into an MLIR dialect, and can you compose it with other dialects?

Yeah. So Reactant lowers into StableHLO, and the question is whether you can combine this with other dialects. The answer is not really: StableHLO is an MLIR dialect, but it contains everything you need. You can also target different dialects, like for example Triton; there's some experimentation going on, but StableHLO is the main thing.

Yeah, so the question was whether we can also target the MPI MLIR dialect: yes. We haven't tested it at a large scale, but the support is there to take MPI operations in the Julia code and raise them to the MPI MLIR dialect.

No more questions? Yeah. So the question is whether it's recommended to do mutation on arrays or not. I think it's actually fine style, especially since Reactant sidesteps the Julia garbage collector: it does all the memory allocations at the MLIR level, so it doesn't use the Julia runtime. In normal Julia code you would want to reuse as much memory as possible, so you use mutation a lot; here you can still do it, and it's still fine if you want to. Yeah, and maybe also: if you do mutation within a function, the backend compiler will in many cases realize that it can reuse the memory, so it won't actually materialize too many copies. Yeah, it does automatic donation: XLA has this concept of donating an argument to a function so that its memory can be reused, and Reactant is able to do this automatically.

Yeah? So, you mentioned that structs get unwrapped at the boundary where the code goes into MLIR and then wrapped back for Julia. If you have multiple functions that all use the same struct, does that happen every time?
So, yeah, the question is whether, if you have multiple functions that take a point, it needs to generate that code multiple times. The code to unwrap and wrap will actually be cached, I believe, when you run it. But if you compile a Reactant function, you typically compile the whole Reactant program, so you only unwrap at the beginning and wrap at the end; during execution you can call different functions without any unwrapping happening.

So the question is whether it's all ahead of time, or whether you can use Julia's JIT. You can use Julia's JIT: if you redefine the function and you call it again with the @jit macro that Reactant has, it will actually recompile the function, which is kind of what Julia does as well; when you run a function, it compiles it just in time.

Thank you very much.
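As a minimal sketch of the recompilation behaviour described in that last answer (an assumed example, not code from the talk):

    using Reactant

    f(x) = sum(abs2, x)
    x = Reactant.to_rarray(rand(Float32, 16))

    Reactant.@jit f(x)   # traces f, compiles it with XLA, and runs it

    f(x) = sum(abs, x)   # redefine f
    Reactant.@jit f(x)   # the new definition is traced and compiled again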