WEBVTT 00:00.000 --> 00:09.280 Now we've got Shagan, who's going to tell us about ty. 00:09.280 --> 00:14.720 I think if people are in the Python community, you've probably heard of this really cool 00:14.720 --> 00:23.560 type checker, yeah, but I'll let you explain it all, so take it away, so thank 00:23.560 --> 00:24.560 you. 00:24.560 --> 00:36.600 So, welcome, hello everyone, my name is Shagan, and I want to talk about ty: type checking 00:36.600 --> 00:38.720 Python in Rust. 00:38.720 --> 00:45.640 This is about a project called ty, it is a type checker for Python from Astral, the 00:45.640 --> 00:52.520 same people behind Ruff and uv, both very popular projects in the Python ecosystem. It 00:52.520 --> 00:57.000 is available as a CLI tool, it connects to your editor via the Language Server Protocol, 00:57.000 --> 01:02.200 so you can use it in VS Code and Vim, and it's also available on play.ty.dev, where you can 01:02.200 --> 01:08.080 run it in the browser. It focuses on speed and rich diagnostics, and as you can see 01:08.080 --> 01:13.920 there, it's pretty fast; that's for running ty on the CLI to check your whole project, 01:13.920 --> 01:18.420 but ty is also pretty fast when you run it in the editor, because when you run it in the 01:18.420 --> 01:21.660 editor, it actually needs to constantly type check what you are typing, and what 01:21.660 --> 01:28.100 the project has, to give you accurate autocompletion and diagnostics, and for that, 01:28.100 --> 01:31.860 ty made a couple of decisions that really helped to achieve that speed, which I'm going to 01:31.860 --> 01:38.180 be talking about in the next slides. So this is going to be a more technical talk, and yeah, 01:38.180 --> 01:42.980 I'm a contributor to ty, and I'm going to be talking about some of the interesting aspects 01:42.980 --> 01:50.860 of the internals. So before looking into it, I'm just going to give you a very small example 01:50.860 --> 01:55.900 of what a type checker does for you. This is a Python piece of code that I have: 01:55.900 --> 02:01.340 I have a function, it should be taking an integer as input, and then I'm calling it with a string. 02:01.340 --> 02:07.260 Obviously if you run this, you would get an error at runtime; with a type checker, it would 02:07.260 --> 02:11.940 tell you that, hey, you're passing a string here instead of an integer, so it's 02:11.940 --> 02:18.500 obviously going to error at runtime. But ty is also inspired by the Rust compiler, which 02:18.500 --> 02:23.180 you might be familiar with, in that it gives you more information about other things that can 02:23.180 --> 02:27.780 be helpful to fix this type error for you; so here, for example, we do show the function signature 02:27.780 --> 02:34.060 and the parameter that is declared, so you can get some help from the type checker here 02:34.060 --> 02:43.420 to fix the problem. Now let's look into how a type checker would work, in a very simple way, 02:43.420 --> 02:47.740 just an overview; you might have a type checker that has more components than this, but 02:47.740 --> 02:54.180 these are the standard things that you would have in a type checker. So the type checker consists 02:54.180 --> 02:59.780 of a couple of passes that run over the source code, and each one produces some data structures 02:59.780 --> 03:05.260 that the next pass can use to do more stuff on the source code, and then finally, we have 03:05.260 --> 03:08.500 a lot of information and we can do the type inference, because we know about all the 03:08.500 --> 03:16.780 variables, all the functions, and everything.
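To make that pipeline concrete, here is a rough Rust sketch of the shape described above. The names are illustrative placeholders, not ty's actual API; the point is just that each pass produces a data structure the next pass consumes.

    // Illustrative pass pipeline (hypothetical names, not ty's real types).
    struct Ast;            // what the parser produces
    struct SemanticIndex;  // scopes, definitions, constraints, ...
    struct TypeMap;        // an inferred type for every expression

    fn parse(_source: &str) -> Ast {
        // text -> structs (the AST)
        Ast
    }

    fn build_semantic_index(_ast: &Ast) -> SemanticIndex {
        // walk the AST and record scopes, definitions, constraints
        SemanticIndex
    }

    fn infer_types(_ast: &Ast, _index: &SemanticIndex) -> TypeMap {
        // use the semantic index to infer a type for each expression
        TypeMap
    }

    fn check(source: &str) -> TypeMap {
        let ast = parse(source);
        let index = build_semantic_index(&ast);
        infer_types(&ast, &index)
    }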
So what 03:16.780 --> 03:22.020 does the parser do? Well, it takes your source code as text, and it will transform 03:22.020 --> 03:27.980 it into a set of structs that we have, so it's a big set of statements, you know, with a lot 03:27.980 --> 03:32.060 of structs in it, and because every Python program is just a set of statements, you 03:32.060 --> 03:37.500 can convert any line of code that you have into a statement, or a couple of lines 03:37.500 --> 03:41.540 into one statement, and then in the later parts of the program, you can work with 03:41.540 --> 03:47.020 those statements instead, so we can do matches on them and cover all the cases. And 03:47.020 --> 03:53.260 you can see here, I'm defining x with the annotation int and assigning 1 to it, 03:53.260 --> 04:00.460 and I capture all that information in this struct: here the value is an expression, the 04:00.460 --> 04:07.540 annotation is an expression, the target is just a name. Then we have the semantic 04:07.580 --> 04:14.380 analyzer, which goes over the AST that the parser produced for us, goes over all of the 04:14.380 --> 04:19.260 structures, and then we try to derive more information from that: things like what are 04:19.260 --> 04:24.740 the definitions in my program, functions, classes, scopes, all these things, what are the 04:24.740 --> 04:29.700 variables defined. So I have all of this information, and then later in the type inference, 04:29.700 --> 04:33.500 when you're assigning one variable to another variable, the type checker can 04:33.500 --> 04:37.980 ask the semantic index, where is this variable defined, and I have all these definitions 04:37.980 --> 04:45.820 indexed in the semantic index, so I can look it up and do the inference. Then finally we 04:45.820 --> 04:51.620 have the type inference, and it just goes over all the statements and expressions I have 04:51.620 --> 04:58.020 in the program and infers a type for them. So here I had int: it was an annotation, it's 04:58.020 --> 05:03.180 an expression, so we infer its type as the class literal int, referring to the int class 05:03.300 --> 05:14.300 inside Python; then I have 1, and the same type for x, because 1 was assigned to x. Now let's 05:14.300 --> 05:19.020 look at the semantic index more closely, with some examples of stuff that it records for 05:19.020 --> 05:26.900 us. Here I have a class, and a function inside of it, so a method, and these two 05:26.940 --> 05:34.420 create two scopes: I have the class scope, and then I have the function scope. We need to record 05:34.420 --> 05:39.420 these scopes so that later, for example, when looking up variables in the method foo, you can 05:39.420 --> 05:45.140 know what the parent scopes are, so you can walk up the scopes and see what are all the variables 05:45.140 --> 05:53.020 that are visible to us there. And we do this by recording every scope into a vector, we just 05:53.020 --> 05:58.180 put scopes in the vector, every scope gets an ID, so scopes can refer to each other by 05:58.180 --> 06:04.780 an ID, that's where the file scope ID comes in, and scopes themselves have their parent 06:04.780 --> 06:09.580 attribute, which just points to another scope with that ID. This is a pattern that you 06:09.580 --> 06:15.060 would see in ty a lot: having IDs that refer to stuff, so you can have the IDs inside 06:15.060 --> 06:19.740 different structs to point to each other, and then you look up the vector with that 06:19.740 --> 06:27.980 ID to get the actual information.
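Here is a small Rust sketch of that ID-and-vector pattern. The names are simplified stand-ins rather than ty's exact types, but the idea is the same: scopes live in a vector, and everything else refers to them by a cheap, copyable ID.

    // Simplified sketch of the ID pattern (illustrative, not ty's exact types).
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    struct FileScopeId(u32);

    struct Scope {
        parent: Option<FileScopeId>, // None for the module scope
        // ... kind, place table, etc.
    }

    struct SemanticIndex {
        scopes: Vec<Scope>, // a FileScopeId is an index into this vector
    }

    impl SemanticIndex {
        fn scope(&self, id: FileScopeId) -> &Scope {
            &self.scopes[id.0 as usize]
        }

        // Walk from a scope up through its parents, e.g. to resolve a name
        // used inside a method.
        fn ancestors(&self, mut id: FileScopeId) -> Vec<FileScopeId> {
            let mut out = vec![id];
            while let Some(parent) = self.scope(id).parent {
                out.push(parent);
                id = parent;
            }
            out
        }
    }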
Another example of stuff we record is definitions: here 06:27.980 --> 06:33.780 I have a function that is defined and a parameter that is defined for that function, and all of these are 06:33.780 --> 06:40.180 recorded in a hash map with the node as the key, with all the definitions; since you can have multiple 06:40.180 --> 06:45.420 definitions in Python, where you can just overwrite stuff, you need to keep all the definitions 06:45.420 --> 06:51.780 for that. And we also record the constraints; constraints come in two kinds. You have a narrowing 06:51.780 --> 06:57.580 constraint, like the if statement I have over there, saying x is either of type integer 06:57.580 --> 07:02.700 or None, and then I'm checking if x is not None, so this would narrow the type of x in 07:02.700 --> 07:09.740 that if branch for me, and it does that by recording all the constraints within the scopes, 07:09.740 --> 07:17.900 so I can know what the type of x is inside that if statement, and there it would be integer. 07:17.900 --> 07:24.940 Then I have reachability constraints, and these would be things like an if statement 07:24.940 --> 07:30.060 that says: if the Python version is bigger than, for example, 3.11, then define 07:30.060 --> 07:35.180 this variable; the type checker records that reachability constraint, and then if you check 07:35.260 --> 07:39.660 a project with a smaller Python version, it would note that that part of the code is not 07:39.660 --> 07:46.140 reachable and the variable should not be defined, and you get an error if you use it. Then let's 07:46.140 --> 07:51.180 look at the type inference with more examples. So here I have, again, a small piece of code, 07:51.980 --> 07:57.980 and the type inference is a set of calls that recursively call each other until all the expressions 07:57.980 --> 08:04.220 are inferred. So in this example, if you focus on the second one, a + 3, we would call 08:04.220 --> 08:09.020 infer expression, which is just a big if statement, and then it would call the appropriate function, 08:09.020 --> 08:13.820 which should be infer binary expression here, because that's a binary expression, and then that 08:13.820 --> 08:19.100 again needs to call infer expression, so it recursively calls itself with a and 3 to get the 08:19.100 --> 08:23.100 types for those things, and then finally, based on the types of those things, it knows what happens 08:23.100 --> 08:31.500 when you add those things up, and gives us the final type. And this is the big if statement 08:31.500 --> 08:35.820 I talked about: it's a match statement on the expression enum, and then we have cases for 08:35.820 --> 08:42.940 all the different expressions that you can have in a Python program. Here 1 is just a 08:42.940 --> 08:48.780 number literal, so it matches that case and we infer the type for it; for a + 3, 08:48.780 --> 08:55.260 it's a bit more complicated, I need to call infer expression again on a and then on 3, and then it 08:55.260 --> 08:59.820 would resolve where a was defined: it was defined in the previous line and was assigned 1, so its type would 08:59.820 --> 09:06.540 be the integer literal 1. And then finally, with the two types I have, again there is another big match that 09:06.540 --> 09:12.380 has the result for what happens when you apply add to two integer literals, and it has 09:12.380 --> 09:19.660 different cases for all the other types that you could have.
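A heavily simplified Rust sketch of that recursive inference, using a toy expression and type enum rather than ty's real ones:

    // Toy AST and types (illustrative only).
    enum Expr {
        IntLiteral(i64),
        Name(String),
        BinaryAdd(Box<Expr>, Box<Expr>),
    }

    #[derive(Clone, Debug, PartialEq)]
    enum Type {
        IntLiteral(i64), // like Literal[1]
        Int,
        Unknown,
    }

    fn infer_expression(expr: &Expr, lookup: &dyn Fn(&str) -> Type) -> Type {
        match expr {
            // `1` infers to the literal type Literal[1].
            Expr::IntLiteral(value) => Type::IntLiteral(*value),
            // A name is resolved through the semantic index (here a closure).
            Expr::Name(name) => lookup(name),
            // `a + 3`: infer both sides first, then decide what `+` produces.
            Expr::BinaryAdd(lhs, rhs) => {
                let left = infer_expression(lhs, lookup);
                let right = infer_expression(rhs, lookup);
                match (left, right) {
                    (Type::IntLiteral(a), Type::IntLiteral(b)) => Type::IntLiteral(a + b),
                    (Type::Int | Type::IntLiteral(_), Type::Int | Type::IntLiteral(_)) => Type::Int,
                    _ => Type::Unknown,
                }
            }
        }
    }

    fn main() {
        // a = 1; then a + 3 infers to Literal[4].
        let expr = Expr::BinaryAdd(
            Box::new(Expr::Name("a".to_string())),
            Box::new(Expr::IntLiteral(3)),
        );
        let lookup = |name: &str| if name == "a" { Type::IntLiteral(1) } else { Type::Unknown };
        println!("{:?}", infer_expression(&expr, &lookup));
    }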
So this was about the usual 09:19.660 --> 09:25.500 type checking stuff. One of the things that ty does that makes it fast in editors is 09:25.580 --> 09:33.020 incremental type checking. This is about only type checking what you need, and when you need it. 09:33.020 --> 09:38.780 So imagine this program: I have a main.py and a lib.py, and main.py is only 09:38.780 --> 09:44.780 importing a part of lib.py, it's not depending on all of it. So when I open 09:44.780 --> 09:51.100 main.py in my editor, ty only infers the types in lib.py 09:51.100 --> 09:55.660 that I actually need, and in this case it would be the square function, and it would just 09:55.660 --> 10:04.540 skip over the other function. And as you work with more and more files, 10:04.540 --> 10:10.460 it will type check those too, and it also caches this stuff for you: if I don't change 10:10.460 --> 10:15.980 lib.py at all, because it's a library, and I just focus on main.py, it will reuse the 10:15.980 --> 10:23.180 cache from the type inference, so you never need to redo the compute again and again. And the way it 10:23.180 --> 10:28.860 does that is with a library called salsa; if you're familiar with rust-analyzer, it also 10:28.860 --> 10:35.820 uses this, and salsa provides us a couple of components with which we can build our compilers and type 10:35.820 --> 10:42.940 checkers in this way, with the incremental compute model. And here's an example of one of the things 10:42.940 --> 10:49.020 that salsa provides: this is what we call a salsa query, because of that macro at the beginning, 10:49.020 --> 10:54.380 and this macro is telling salsa to track this function, so let's see what it tracks. 10:56.780 --> 11:02.140 I pass the function a database; the database is used to track things like what this function does 11:02.140 --> 11:07.260 and the memoization of the function, it is a central database where salsa just stores everything, 11:07.260 --> 11:12.700 so every query gets that database argument. And then I'm also passing a definition 11:12.700 --> 11:19.180 to it, so I want to infer the type of the square definition, and inside of the function, I am calling 11:19.180 --> 11:25.340 two other queries, parse module and semantic index, and salsa records that this query depends 11:25.340 --> 11:30.140 on these two other queries; so if the results of those other queries change, it means that this query 11:30.140 --> 11:35.180 is also invalidated and I need to recompute it, otherwise, if the input is the same, this query 11:35.180 --> 11:41.260 is the same and I can reuse the compute. So it would have a graph that looks kind of like this, 11:41.260 --> 11:47.020 the input and what it calls, and then as long as those things are staying the same, 11:47.020 --> 11:50.140 I don't need to recompute the value and I can just use the cache. 11:51.980 --> 11:56.540 This is the database that you saw in the previous slide, it's just a salsa db macro, 11:56.540 --> 12:02.620 and salsa does all the stuff for us.
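Roughly, a salsa query has the shape below. The query names and toy return types are illustrative, not ty's real queries, and the exact macro syntax and generated accessors can differ between salsa versions; treat it as a sketch of the dependency-tracking idea.

    // Sketch of salsa queries (simplified; details vary by salsa version).
    #[salsa::input]
    struct SourceFile {
        text: String,
    }

    #[salsa::tracked]
    fn parse_module(db: &dyn salsa::Database, file: SourceFile) -> usize {
        // Pretend "parsing" just counts lines.
        file.text(db).lines().count()
    }

    #[salsa::tracked]
    fn infer_file_types(db: &dyn salsa::Database, file: SourceFile) -> usize {
        // salsa records that this query called parse_module, so if
        // parse_module's result for this file is unchanged, this query's
        // cached result can be reused instead of re-running the body.
        parse_module(db, file) + 1
    }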
salsa also provides a couple of other macros for our 12:02.620 --> 12:09.180 structs. We have the salsa input macro; this one tells salsa that this is an input to our 12:09.180 --> 12:12.700 program, and you can think of, for example, the source code as an input to our program, 12:13.580 --> 12:18.540 and we can mutate the input because that's what the user changes. The other things that we have 12:18.540 --> 12:25.100 are intermediate values that we are computing with our query functions, while the input is the thing 12:25.100 --> 12:32.780 that users mutate. And we have the interned structs; interning is a technique where, if you have 12:32.780 --> 12:38.380 something repeatedly occurring in your program, you can just store that thing 12:38.380 --> 12:43.580 once, and then refer to that thing. For example, if you have a function name, and you 12:43.580 --> 12:48.620 encounter that function name over and over again, once it's defined, once it's called, and so on, 12:48.620 --> 12:55.980 you can store it once and get back an ID from salsa, and with that ID, you can refer to it. 12:55.980 --> 13:00.940 So you just allocate the memory for it once, and then you just have the ID, and if you 13:00.940 --> 13:06.620 intern something over and over again, salsa just gives you back the same ID. Also here, if you 13:06.620 --> 13:12.380 intern a function type and you have a lot of functions with the same type, it's just going to be 13:12.380 --> 13:19.020 one object in our database, and the rest are going to be using that. And then we have the queries 13:19.020 --> 13:24.620 that we just looked at, so it takes a database, something that implements the database trait, 13:25.580 --> 13:31.580 with the 'db lifetime. As a result of this, all of our structs are Copy, we can pass 13:31.580 --> 13:36.620 stuff around, you don't have the problems that people usually face in Rust projects, 13:36.620 --> 13:41.420 you can move stuff around, you don't need to clone, you don't worry about it, it's just an ID. 13:41.420 --> 13:45.260 Everything, all the field values, all the stuff is stored in the salsa database; you need to 13:45.260 --> 13:49.420 look them up in the database when you actually use them, but if you just want to pass stuff around, 13:49.420 --> 13:54.780 it's an ID, and most of our structs get the 'db lifetime. This helps us, for 13:54.780 --> 13:59.100 example, if you want to pass references around or return references: everything has this 13:59.180 --> 14:03.580 'db lifetime, so you also don't need to really worry about what the lifetime should be, 14:03.580 --> 14:08.620 because it's all the 'db lifetime, and it also prevents some things like, for example, 14:08.620 --> 14:14.220 mutating something in the middle of using a cached value; it stops you from mutating 14:14.220 --> 14:21.340 the input, for example, at that point.
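Here is a small sketch of the interning idea; the struct name is made up for illustration, and the exact salsa macro syntax and generated methods can vary between versions.

    // Interning the same data twice gives back the same ID, so the struct is a
    // cheap Copy handle into the salsa database (illustrative sketch).
    #[salsa::interned]
    struct FunctionName<'db> {
        name: String,
    }

    fn demo(db: &dyn salsa::Database) {
        let a = FunctionName::new(db, "square".to_string());
        let b = FunctionName::new(db, "square".to_string());
        // Same data interned twice: same ID, allocated only once in the database.
        assert_eq!(a, b);
        // The field value lives in the database and is looked up when used.
        assert_eq!(a.name(db), "square");
    }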
In order to take advantage of these functionalities that salsa 14:21.420 --> 14:30.860 provides, we need to think about the problem a bit differently: if I just infer the whole 14:30.860 --> 14:36.380 Python program all at once, then if something in the file changes, I need to 14:36.380 --> 14:43.740 redo all of that again, so the caching is not as good as it can be. One technique here that ty uses is 14:44.380 --> 14:48.780 splitting the inference into multiple regions, so into scopes and definitions, and then 14:49.420 --> 14:55.900 if the file changes, some of those inference regions can survive, because if you touch something that 14:55.900 --> 15:01.340 is not affecting other scopes, then the other scopes can still be in the cache and can reuse the caching. 15:02.220 --> 15:08.860 Similarly with the inference, we also have smaller queries, so for example, here we can 15:08.860 --> 15:13.180 depend on the place table instead of depending on the whole semantic index, if something is only 15:13.260 --> 15:18.300 using that, or we only infer the definition types instead of inferring the whole program 15:18.300 --> 15:24.060 all at once with one giant query. So salsa does a lot of nice things, but we're not 15:24.060 --> 15:28.780 done yet; here's another big problem that you would face in type checking, and that is cycles. 15:30.380 --> 15:38.700 Cycles usually happen when you have a thing that refers to itself, and although it might seem, 15:38.780 --> 15:43.740 I don't know, confusing, this can actually run as a Python program, because someone can 15:43.740 --> 15:49.740 make an object of D, and then assign x to it outside of this class, and then create another object, 15:49.740 --> 15:55.020 and just call D.copy with the other object that they assigned the x to. And this can work, 15:55.020 --> 15:59.260 but from the perspective of a type checker, if we try to check something like this, it would go 16:01.580 --> 16:06.540 exactly like: I'm trying to find out what is the type of self.x. To do that, I need to know 16:06.700 --> 16:12.700 what is the type of other.x. The type of other is D, so I need to go to the class D 16:12.700 --> 16:18.620 and find the attribute x, so I know what is the type of self.x, and that is exactly what I was already 16:18.620 --> 16:24.380 doing, and those infer expression calls keep calling each other repeatedly until you crash, 16:24.380 --> 16:31.820 and you never actually finish, so this kind of loop happens. And salsa has a solution for this: 16:31.900 --> 16:40.700 it is called fixed-point iteration. With fixed-point iteration, let's think 16:40.700 --> 16:45.260 about the problem again: we have one function calling another function, and then finally calling itself 16:45.260 --> 16:52.780 again. If we have some kind of if statement in the function, so that it knows when it is calling itself again, 16:52.780 --> 16:58.940 then it can stop that infinite loop from happening, but then we need to know what we should 16:58.940 --> 17:06.940 return at that point, and here we are calling on salsa again: you can set the cycle_fn 17:06.940 --> 17:14.940 and cycle_initial attributes in the salsa tracked macro, and it will do exactly those things 17:14.940 --> 17:21.660 for you.
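A sketch of what opting a query into that cycle handling can look like: the cycle_fn and cycle_initial attribute names come from the talk, but the query, the helper functions, and the exact signatures (including salsa's CycleRecoveryAction type) are assumptions that depend on the salsa version, so treat this as illustrative only.

    // Illustrative sketch of fixed-point cycle attributes on a tracked query.
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum Ty {
        Never,
        Int,
        Unknown,
    }

    #[salsa::interned]
    struct ClassAttr<'db> {
        name: String,
    }

    #[salsa::tracked(cycle_fn = attr_cycle_recover, cycle_initial = attr_cycle_initial)]
    fn infer_attr_type<'db>(db: &'db dyn salsa::Database, attr: ClassAttr<'db>) -> Ty {
        // In the real checker this may end up (indirectly) calling itself again,
        // like self.x -> other.x -> self.x; salsa detects that cycle for us.
        let _ = attr.name(db);
        Ty::Int
    }

    // What to assume the first time the query hits itself: the bottom type Never.
    fn attr_cycle_initial<'db>(_db: &'db dyn salsa::Database, _attr: ClassAttr<'db>) -> Ty {
        Ty::Never
    }

    // Called with each newly computed value: keep iterating until it stabilizes.
    fn attr_cycle_recover<'db>(
        _db: &'db dyn salsa::Database,
        _value: &Ty,
        _count: u32,
        _attr: ClassAttr<'db>,
    ) -> salsa::CycleRecoveryAction<Ty> {
        salsa::CycleRecoveryAction::Iterate
    }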
So, cycle_initial tells salsa what type I should use when I'm calling myself again 17:21.660 --> 17:26.700 and I have no way out, and it would return that type, so you can have all the functions that 17:26.700 --> 17:31.340 recursively call each other do their compute and then finally resolve, and then you get back to the 17:31.340 --> 17:36.860 first function that called itself, and then you have two things: the value I assumed, and the value I actually 17:36.860 --> 17:41.900 computed. Are these two things the same? If they are the same, I assumed something and I ended up 17:41.900 --> 17:47.500 with the same thing, so if I do that again, it will also be the same. But sometimes you might 17:47.500 --> 17:53.820 assume the type Never, which is the bottom type in Python; it represents something that never happens, 17:54.780 --> 17:59.100 and if I assume something like type Never, and then I do my real computation and end up with type 17:59.100 --> 18:07.100 Unknown, well, this is not valid, so I can do this cycle again. I would iterate again: I now have 18:07.100 --> 18:12.540 type Unknown, and I can say, okay, assume that I have type Unknown, and do the compute 18:12.540 --> 18:17.740 all over again, and if I end up with the Unknown type again, it means that if I just do that 18:17.740 --> 18:22.940 over and over, it's just going to be the same result, so it's at least stable, the cycle has 18:23.100 --> 18:27.980 converged. Otherwise, if it never converges, you might have something that just toggles 18:27.980 --> 18:34.860 between two values or things like that; then the cycle handling has two options: you can either iterate until 18:34.860 --> 18:39.100 some point at which you would panic, or you can fall back to an unknown value and just say, 18:39.100 --> 18:44.940 I couldn't solve the value for this. And yeah, this is really helpful, because with all the 18:44.940 --> 18:50.780 recursive calls that you have, manually instrumenting your code would be a big hassle, but salsa 18:50.860 --> 18:58.860 does this for you with this solution.
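The same idea in a tiny, self-contained form, completely separate from salsa (not ty's code): assume a starting value, recompute, and repeat until the answer stops changing, falling back to an unknown value if it never settles.

    // Minimal fixed-point iteration sketch.
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    enum Ty {
        Never,   // the bottom type: the initial assumption for a cycle
        Int,
        Unknown, // the fallback if the cycle never converges
    }

    fn fixed_point(mut recompute: impl FnMut(Ty) -> Ty) -> Ty {
        let mut assumed = Ty::Never;
        for _ in 0..16 {
            let result = recompute(assumed);
            if result == assumed {
                // Converged: running it again would give the same answer.
                return result;
            }
            // Not stable yet: iterate again with the new assumption.
            assumed = result;
        }
        // Never converged (e.g. it keeps toggling between two values): give up.
        Ty::Unknown
    }

    fn main() {
        // This toy "cycle" stabilizes after one iteration: Never -> Int -> Int.
        let ty = fixed_point(|_assumed| Ty::Int);
        assert_eq!(ty, Ty::Int);
        println!("converged to {:?}", ty);
    }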
Then we have the testing framework in ty; it is built 18:58.860 --> 19:06.380 in house, and it is a very nice thing: a huge chunk of the test suite is written 19:06.380 --> 19:12.140 in Markdown, and we have big Markdown files with Python code inside of them, and then 19:12.140 --> 19:17.980 we explain, for example, this test case is testing this behavior, and then inside the Python code 19:17.980 --> 19:23.020 there is a magic function, reveal_type, and you would write your own Python program, 19:23.020 --> 19:29.420 and you can use reveal_type to assert, in a 19:29.420 --> 19:34.700 comment, that the type of that thing is what you want, so it's a valid Python program and it's also a test case. And here I have, for example, 19:34.700 --> 19:39.580 revealed int, but it is not int, it is Literal[1]; ty tries to be very specific with the types 19:39.580 --> 19:43.660 when it knows about them, so it knows that you assigned 1 and it is definitely 1, 19:44.620 --> 19:49.980 and when you run cargo test, you would get a pointer to that annotations markdown file, so that's 19:49.980 --> 19:55.340 your test case, and then the exact line where the problem happened, and then it would say 19:55.900 --> 20:03.660 it revealed Literal[1], not int. And whenever you submit a PR to ty, there is a huge test suite 20:03.660 --> 20:08.860 that runs: you have these md tests, the same Markdown tests I just showed you, running, 20:08.860 --> 20:12.780 then there is the conformance test suite for Python, the thing that the Python typing 20:12.780 --> 20:17.180 council works on, and they have sets of rules for what a type checker should show for different 20:17.180 --> 20:23.420 annotations, and it also runs on a lot of open-source projects to know what the impact of your 20:23.420 --> 20:28.620 change is on these projects: how many more diagnostics are you showing, how are your memory and timing 20:28.620 --> 20:34.300 metrics changing. And this is an example of one of those reports that you get, and then you can 20:34.300 --> 20:41.100 go to the projects to see, for example, what were the new diagnostics that you added, and this is 20:41.180 --> 20:45.180 very helpful, because if you don't really know what the right answer is for some decision, 20:45.180 --> 20:48.940 you can just try it out and see what happens on the open-source projects, because people 20:48.940 --> 20:56.220 write Python in a lot of different ways. That's it, here's my blog, there are more related posts there.