WEBVTT 00:00.000 --> 00:09.280 Now we've got Shagan, who's going to tell us about ty. 00:09.280 --> 00:14.720 I think if people are in the Python community, you've probably heard of this really cool 00:14.720 --> 00:23.560 type checker, yeah, but I'll let you explain it all, so take it away, so thank 00:23.560 --> 00:24.560 you. 00:24.560 --> 00:36.600 So, welcome, hello everyone, my name is Shagan, and I want to talk about ty: type checking 00:36.600 --> 00:38.720 Python in Rust. 00:38.720 --> 00:45.640 This is about a project called ty, it is a type checker for Python from Astral, the 00:45.640 --> 00:52.520 same people behind Ruff and uv, both very popular projects in the Python ecosystem. It 00:52.520 --> 00:57.000 is available as a CLI tool, it connects to your editor via the Language Server Protocol, 00:57.000 --> 01:02.200 so you can use it in VS Code and Vim, and it's also available on play.ty.dev, where you can 01:02.200 --> 01:08.080 run it in the browser. It focuses on speed and rich diagnostics, and as you can see 01:08.080 --> 01:13.920 there, it's pretty fast; that's for running ty on the CLI to check your whole project, 01:13.920 --> 01:18.420 but ty is also pretty fast when you run it in the editor, because when you run it in the 01:18.420 --> 01:21.660 editor, it actually needs to constantly type check what you are typing, and what 01:21.660 --> 01:28.100 the project has, to give you accurate autocompletion and diagnostics, and for that, 01:28.100 --> 01:31.860 ty made a couple of decisions that really helped to achieve that speed, which I'm going to 01:31.860 --> 01:38.180 be talking about in the next slides. So this is going to be a more technical talk, and yeah, 01:38.180 --> 01:42.980 I'm a contributor to ty, and I'm going to be talking about some of the interesting aspects 01:42.980 --> 01:50.860 of the internals. So before looking into it, I'm just going to give you a very small example 01:50.860 --> 01:55.900 of what a type checker does for you. This is a Python piece of code that I have: 01:55.900 --> 02:01.340 I have a function, it should be taking an integer as input, and then I'm calling it with a string. 02:01.340 --> 02:07.260 Obviously if you run this, you would get an error at runtime; with a type checker, it would 02:07.260 --> 02:11.940 tell you that, hey, you're passing a string here instead of an integer, so it's 02:11.940 --> 02:18.500 obviously going to error at runtime. But ty is also inspired by the Rust compiler, which 02:18.500 --> 02:23.180 you might be familiar with, in that it gives you more information about other things that can 02:23.180 --> 02:27.780 be helpful to fix this type error for you; so here, for example, we do show the function signature 02:27.780 --> 02:34.060 and the parameter that is declared, so you can get some help from the type checker here 02:34.060 --> 02:43.420 to fix the problem. Now let's look into how a type checker would work, in a very simple way, 02:43.420 --> 02:47.740 just an overview; you might have a type checker that has more components than this, but 02:47.740 --> 02:54.180 these are the standard things that you would have in a type checker. So the type checker consists 02:54.180 --> 02:59.780 of a couple of passes that run over the source code, and each one produces some data structures 02:59.780 --> 03:05.260 that the next pass can use to do more stuff on the source code, and then finally, we have 03:05.260 --> 03:08.500 a lot of information and we can do the type inference, because we know about all the 03:08.500 --> 03:16.780 variables, all the functions, and everything.
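To make that pipeline concrete, here is a rough Rust sketch of the shape described above. The names are illustrative placeholders, not ty's actual API; the point is just that each pass produces a data structure the next pass consumes.

    // Illustrative pass pipeline (hypothetical names, not ty's real types).
    struct Ast;            // what the parser produces
    struct SemanticIndex;  // scopes, definitions, constraints, ...
    struct TypeMap;        // an inferred type for every expression

    fn parse(_source: &str) -> Ast {
        // text -> structs (the AST)
        Ast
    }

    fn build_semantic_index(_ast: &Ast) -> SemanticIndex {
        // walk the AST and record scopes, definitions, constraints
        SemanticIndex
    }

    fn infer_types(_ast: &Ast, _index: &SemanticIndex) -> TypeMap {
        // use the semantic index to infer a type for each expression
        TypeMap
    }

    fn check(source: &str) -> TypeMap {
        let ast = parse(source);
        let index = build_semantic_index(&ast);
        infer_types(&ast, &index)
    }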
So what 03:16.780 --> 03:22.020 does the parser do? Well, it takes your source code as text, and it will transform 03:22.020 --> 03:27.980 it into a set of structs that we have, so it's a big set of statements, you know, with a lot 03:27.980 --> 03:32.060 of structs in it, and because every Python program is just a set of statements, you 03:32.060 --> 03:37.500 can convert any line of code that you have into a statement, or a couple of lines 03:37.500 --> 03:41.540 into one statement, and then in the later parts of the program, you can work with 03:41.540 --> 03:47.020 those statements instead, so we can do matches on them and cover all the cases. And 03:47.020 --> 03:53.260 you can see here, I'm defining x with the annotation int and assigning 1 to it, 03:53.260 --> 04:00.460 and I capture all that information in this struct: here the value is an expression, the 04:00.460 --> 04:07.540 annotation is an expression, the target is just a name. Then we have the semantic 04:07.580 --> 04:14.380 analyzer, which goes over the AST that the parser produced for us, goes over all of the 04:14.380 --> 04:19.260 structures, and then we try to derive more information from that: things like what are 04:19.260 --> 04:24.740 the definitions in my program, functions, classes, scopes, all these things, what are the 04:24.740 --> 04:29.700 variables defined. So I have all of this information, and then later in the type inference, 04:29.700 --> 04:33.500 when you're assigning one variable to another variable, the type checker can 04:33.500 --> 04:37.980 ask the semantic index, where is this variable defined, and I have all these definitions 04:37.980 --> 04:45.820 indexed in the semantic index, so I can look it up and do the inference. Then finally we 04:45.820 --> 04:51.620 have the type inference, and it just goes over all the statements and expressions I have 04:51.620 --> 04:58.020 in the program and infers a type for them. So here I had int: it was an annotation, it's 04:58.020 --> 05:03.180 an expression, so we infer its type as the class literal int, referring to the int class 05:03.300 --> 05:14.300 inside Python; then I have 1, and the same type for x, because 1 was assigned to x. Now let's 05:14.300 --> 05:19.020 look at the semantic index more closely, with some examples of stuff that it records for 05:19.020 --> 05:26.900 us. Here I have a class, and a function inside of it, so a method, and these two 05:26.940 --> 05:34.420 create two scopes: I have the class scope, and then I have the function scope. We need to record 05:34.420 --> 05:39.420 these scopes so that later, for example, when looking up variables in the method foo, you can 05:39.420 --> 05:45.140 know what the parent scopes are, so you can walk up the scopes and see what are all the variables 05:45.140 --> 05:53.020 that are visible to us there. And we do this by recording every scope into a vector, we just 05:53.020 --> 05:58.180 put scopes in the vector, every scope gets an ID, so scopes can refer to each other by 05:58.180 --> 06:04.780 an ID, that's where the file scope ID comes in, and scopes themselves have their parent 06:04.780 --> 06:09.580 attribute, which just points to another scope with that ID. This is a pattern that you 06:09.580 --> 06:15.060 would see in ty a lot: having IDs that refer to stuff, so you can have the IDs inside 06:15.060 --> 06:19.740 different structs to point to each other, and then you look up the vector with that 06:19.740 --> 06:27.980 ID to get the actual information.
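Here is a small Rust sketch of that ID-and-vector pattern. The names are simplified stand-ins rather than ty's exact types, but the idea is the same: scopes live in a vector, and everything else refers to them by a cheap, copyable ID.

    // Simplified sketch of the ID pattern (illustrative, not ty's exact types).
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    struct FileScopeId(u32);

    struct Scope {
        parent: Option<FileScopeId>, // None for the module scope
        // ... kind, place table, etc.
    }

    struct SemanticIndex {
        scopes: Vec<Scope>, // a FileScopeId is an index into this vector
    }

    impl SemanticIndex {
        fn scope(&self, id: FileScopeId) -> &Scope {
            &self.scopes[id.0 as usize]
        }

        // Walk from a scope up through its parents, e.g. to resolve a name
        // used inside a method.
        fn ancestors(&self, mut id: FileScopeId) -> Vec<FileScopeId> {
            let mut out = vec![id];
            while let Some(parent) = self.scope(id).parent {
                out.push(parent);
                id = parent;
            }
            out
        }
    }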
Another example of stuff we record is definitions: here 06:27.980 --> 06:33.780 I have a function that is defined and a parameter that is defined for that function, and all of these are 06:33.780 --> 06:40.180 recorded in a hash map with the node as the key, with all the definitions; since you can have multiple 06:40.180 --> 06:45.420 definitions in Python, where you can just overwrite stuff, you need to keep all the definitions 06:45.420 --> 06:51.780 for that. And we also record the constraints; constraints come in two kinds. You have a narrowing 06:51.780 --> 06:57.580 constraint, like the if statement I have over there, saying x is either of type integer 06:57.580 --> 07:02.700 or None, and then I'm checking if x is not None, so this would narrow the type of x in 07:02.700 --> 07:09.740 that if branch for me, and it does that by recording all the constraints within the scopes, 07:09.740 --> 07:17.900 so I can know what the type of x is inside that if statement, and there it would be integer. 07:17.900 --> 07:24.940 Then I have reachability constraints, and these would be things like an if statement 07:24.940 --> 07:30.060 that says: if the Python version is bigger than, for example, 3.11, then define 07:30.060 --> 07:35.180 this variable; the type checker records that reachability constraint, and then if you check 07:35.260 --> 07:39.660 a project with a smaller Python version, it would note that that part of the code is not 07:39.660 --> 07:46.140 reachable and the variable should not be defined, and you get an error if you use it. Then let's 07:46.140 --> 07:51.180 look at the type inference with more examples. So here I have, again, a small piece of code, 07:51.980 --> 07:57.980 and the type inference is a set of calls that recursively call each other until all the expressions 07:57.980 --> 08:04.220 are inferred. So in this example, if you focus on the second one, a + 3, we would call 08:04.220 --> 08:09.020 infer expression, which is just a big if statement, and then it would call the appropriate function, 08:09.020 --> 08:13.820 which should be infer binary expression here, because that's a binary expression, and then that 08:13.820 --> 08:19.100 again needs to call infer expression, so it recursively calls itself with a and 3 to get the 08:19.100 --> 08:23.100 types for those things, and then finally, based on the types of those things, it knows what happens 08:23.100 --> 08:31.500 when you add those things up, and gives us the final type. And this is the big if statement 08:31.500 --> 08:35.820 I talked about: it's a match statement on the expression enum, and then we have cases for 08:35.820 --> 08:42.940 all the different expressions that you can have in a Python program. Here 1 is just a 08:42.940 --> 08:48.780 number literal, so it matches that case and we infer the type for it; for a + 3, 08:48.780 --> 08:55.260 it's a bit more complicated, I need to call infer expression again on a and then on 3, and then it 08:55.260 --> 08:59.820 would resolve where a was defined: it was defined in the previous line and was assigned 1, so its type would 08:59.820 --> 09:06.540 be the integer literal 1. And then finally, with the two types I have, again there is another big match that 09:06.540 --> 09:12.380 has the result for what happens when you apply add to two integer literals, and it has 09:12.380 --> 09:19.660 different cases for all the other types that you could have.
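A heavily simplified Rust sketch of that recursive inference, using a toy expression and type enum rather than ty's real ones:

    // Toy AST and types (illustrative only).
    enum Expr {
        IntLiteral(i64),
        Name(String),
        BinaryAdd(Box<Expr>, Box<Expr>),
    }

    #[derive(Clone, Debug, PartialEq)]
    enum Type {
        IntLiteral(i64), // like Literal[1]
        Int,
        Unknown,
    }

    fn infer_expression(expr: &Expr, lookup: &dyn Fn(&str) -> Type) -> Type {
        match expr {
            // `1` infers to the literal type Literal[1].
            Expr::IntLiteral(value) => Type::IntLiteral(*value),
            // A name is resolved through the semantic index (here a closure).
            Expr::Name(name) => lookup(name),
            // `a + 3`: infer both sides first, then decide what `+` produces.
            Expr::BinaryAdd(lhs, rhs) => {
                let left = infer_expression(lhs, lookup);
                let right = infer_expression(rhs, lookup);
                match (left, right) {
                    (Type::IntLiteral(a), Type::IntLiteral(b)) => Type::IntLiteral(a + b),
                    (Type::Int | Type::IntLiteral(_), Type::Int | Type::IntLiteral(_)) => Type::Int,
                    _ => Type::Unknown,
                }
            }
        }
    }

    fn main() {
        // a = 1; then a + 3 infers to Literal[4].
        let expr = Expr::BinaryAdd(
            Box::new(Expr::Name("a".to_string())),
            Box::new(Expr::IntLiteral(3)),
        );
        let lookup = |name: &str| if name == "a" { Type::IntLiteral(1) } else { Type::Unknown };
        println!("{:?}", infer_expression(&expr, &lookup));
    }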
So this was about the usual 09:19.660 --> 09:25.500 type checking stuff. One of the things that ty does that makes it fast in editors is 09:25.580 --> 09:33.020 incremental type checking. This is about only type checking what you need, and when you need it. 09:33.020 --> 09:38.780 So imagine this program: I have a main.py and a lib.py, and main.py is only 09:38.780 --> 09:44.780 importing a part of lib.py, it's not depending on all of it. So when I open 09:44.780 --> 09:51.100 main.py in my editor, ty only infers the types in lib.py 09:51.100 --> 09:55.660 that I actually need, and in this case it would be the square function, and it would just 09:55.660 --> 10:04.540 skip over the other function. And as you work with more and more files, 10:04.540 --> 10:10.460 it will type check those too, and it also caches this stuff for you: if I don't change 10:10.460 --> 10:15.980 lib.py at all, because it's a library, and I just focus on main.py, it will reuse the 10:15.980 --> 10:23.180 cache from the type inference, so you never need to redo the compute again and again. And the way it 10:23.180 --> 10:28.860 does that is with a library called salsa; if you're familiar with rust-analyzer, it also 10:28.860 --> 10:35.820 uses this, and salsa provides us a couple of components with which we can build our compilers and type 10:35.820 --> 10:42.940 checkers in this way, with the incremental compute model. And here's an example of one of the things 10:42.940 --> 10:49.020 that salsa provides: this is what we call a salsa query, because of that macro at the beginning, 10:49.020 --> 10:54.380 and this macro is telling salsa to track this function, so let's see what it tracks. 10:56.780 --> 11:02.140 I pass the function a database; the database is used to track things like what this function does 11:02.140 --> 11:07.260 and the memoization of the function, it is a central database where salsa just stores everything, 11:07.260 --> 11:12.700 so every query gets that database argument. And then I'm also passing a definition 11:12.700 --> 11:19.180 to it, so I want to infer the type of the square definition, and inside of the function, I am calling 11:19.180 --> 11:25.340 two other queries, parse module and semantic index, and salsa records that this query depends 11:25.340 --> 11:30.140 on these two other queries; so if the results of those other queries change, it means that this query 11:30.140 --> 11:35.180 is also invalidated and I need to recompute it, otherwise, if the input is the same, this query 11:35.180 --> 11:41.260 is the same and I can reuse the compute. So it would have a graph that looks kind of like this, 11:41.260 --> 11:47.020 the input and what it calls, and then as long as those things are staying the same, 11:47.020 --> 11:50.140 I don't need to recompute the value and I can just use the cache. 11:51.980 --> 11:56.540 This is the database that you saw in the previous slide, it's just a salsa db macro, 11:56.540 --> 12:02.620 and salsa does all the stuff for us.
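Roughly, a salsa query has the shape below. The query names and toy return types are illustrative, not ty's real queries, and the exact macro syntax and generated accessors can differ between salsa versions; treat it as a sketch of the dependency-tracking idea.

    // Sketch of salsa queries (simplified; details vary by salsa version).
    #[salsa::input]
    struct SourceFile {
        text: String,
    }

    #[salsa::tracked]
    fn parse_module(db: &dyn salsa::Database, file: SourceFile) -> usize {
        // Pretend "parsing" just counts lines.
        file.text(db).lines().count()
    }

    #[salsa::tracked]
    fn infer_file_types(db: &dyn salsa::Database, file: SourceFile) -> usize {
        // salsa records that this query called parse_module, so if
        // parse_module's result for this file is unchanged, this query's
        // cached result can be reused instead of re-running the body.
        parse_module(db, file) + 1
    }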
salsa also provides a couple of other macros for our 12:02.620 --> 12:09.180 structs. We have the salsa input macro; this one tells salsa that this is an input to our 12:09.180 --> 12:12.700 program, and you can think of, for example, the source code as an input to our program, 12:13.580 --> 12:18.540 and we can mutate the input because that's what the user changes. The other things that we have 12:18.540 --> 12:25.100 are intermediate values that we are computing with our query functions, while the input is the thing 12:25.100 --> 12:32.780 that users mutate. And we have the interned structs; interning is a technique where, if you have 12:32.780 --> 12:38.380 something repeatedly occurring in your program, you can just store that thing 12:38.380 --> 12:43.580 once, and then refer to that thing. For example, if you have a function name, and you 12:43.580 --> 12:48.620 encounter that function name over and over again, once it's defined, once it's called, and so on, 12:48.620 --> 12:55.980 you can store it once and get back an ID from salsa, and with that ID, you can refer to it. 12:55.980 --> 13:00.940 So you just allocate the memory for it once, and then you just have the ID, and if you 13:00.940 --> 13:06.620 intern something over and over again, salsa just gives you back the same ID. Also here, if you 13:06.620 --> 13:12.380 intern a function type and you have a lot of functions with the same type, it's just going to be 13:12.380 --> 13:19.020 one object in our database, and the rest are going to be using that. And then we have the queries 13:19.020 --> 13:24.620 that we just looked at, so it takes a database, something that implements the database trait, 13:25.580 --> 13:31.580 with the 'db lifetime. As a result of this, all of our structs are Copy, we can pass 13:31.580 --> 13:36.620 stuff around, you don't have the problems that people usually face in Rust projects, 13:36.620 --> 13:41.420 you can move stuff around, you don't need to clone, you don't worry about it, it's just an ID. 13:41.420 --> 13:45.260 Everything, all the field values, all the stuff is stored in the salsa database; you need to 13:45.260 --> 13:49.420 look them up in the database when you actually use them, but if you just want to pass stuff around, 13:49.420 --> 13:54.780 it's an ID, and most of our structs get the 'db lifetime. This helps us, for 13:54.780 --> 13:59.100 example, if you want to pass references around or return references: everything has this 13:59.180 --> 14:03.580 'db lifetime, so you also don't need to really worry about what the lifetime should be, 14:03.580 --> 14:08.620 because it's all the 'db lifetime, and it also prevents some things like, for example, 14:08.620 --> 14:14.220 mutating something in the middle of using a cached value; it stops you from mutating 14:14.220 --> 14:21.340 the input, for example, at that point.
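Here is a small sketch of the interning idea; the struct name is made up for illustration, and the exact salsa macro syntax and generated methods can vary between versions.

    // Interning the same data twice gives back the same ID, so the struct is a
    // cheap Copy handle into the salsa database (illustrative sketch).
    #[salsa::interned]
    struct FunctionName<'db> {
        name: String,
    }

    fn demo(db: &dyn salsa::Database) {
        let a = FunctionName::new(db, "square".to_string());
        let b = FunctionName::new(db, "square".to_string());
        // Same data interned twice: same ID, allocated only once in the database.
        assert_eq!(a, b);
        // The field value lives in the database and is looked up when used.
        assert_eq!(a.name(db), "square");
    }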
In order to take advantage of these functionalities that salsa 14:21.420 --> 14:30.860 provides, we need to think about the problem a bit differently: if I just infer the whole 14:30.860 --> 14:36.380 Python program all at once, then if something in the file changes, I need to 14:36.380 --> 14:43.740 redo all of that again, so the caching is not as good as it can be. One technique here that ty uses is 14:44.380 --> 14:48.780 splitting the inference into multiple regions, so into scopes and definitions, and then 14:49.420 --> 14:55.900 if the file changes, some of those inference regions can survive, because if you touch something that 14:55.900 --> 15:01.340 is not affecting other scopes, then the other scopes can still be in the cache and can reuse the caching. 15:02.220 --> 15:08.860 Similarly with the inference, we also have smaller queries, so for example, here we can 15:08.860 --> 15:13.180 depend on the place table instead of depending on the whole semantic index, if something is only 15:13.260 --> 15:18.300 using that, or we only infer the definition types instead of inferring the whole program 15:18.300 --> 15:24.060 all at once with one giant query. So salsa does a lot of nice things, but we're not 15:24.060 --> 15:28.780 done yet; here's another big problem that you would face in type checking, and that is cycles. 15:30.380 --> 15:38.700 Cycles usually happen when you have a thing that refers to itself, and although it might seem, 15:38.780 --> 15:43.740 I don't know, confusing, this can actually run as a Python program, because someone can 15:43.740 --> 15:49.740 make an object of D, and then assign x to it outside of this class, and then create another object, 15:49.740 --> 15:55.020 and just call D.copy with the other object that they assigned the x to. And this can work, 15:55.020 --> 15:59.260 but from the perspective of a type checker, if we try to check something like this, it would go 16:01.580 --> 16:06.540 exactly like: I'm trying to find out what is the type of self.x. To do that, I need to know 16:06.700 --> 16:12.700 what is the type of other.x. The type of other is D, so I need to go to the class D 16:12.700 --> 16:18.620 and find the attribute x, so I know what is the type of self.x, and that is exactly what I was already 16:18.620 --> 16:24.380 doing, and those infer expression calls keep calling each other repeatedly until you crash, 16:24.380 --> 16:31.820 and you never actually finish, so this kind of loop happens. And salsa has a solution for this: 16:31.900 --> 16:40.700 it is called fixed-point iteration. With fixed-point iteration, let's think 16:40.700 --> 16:45.260 about the problem again: we have one function calling another function, and then finally calling itself 16:45.260 --> 16:52.780 again. If we have some kind of if statement in the function, so that it knows when it is calling itself again, 16:52.780 --> 16:58.940 then it can stop that infinite loop from happening, but then we need to know what we should 16:58.940 --> 17:06.940 return at that point, and here we are calling on salsa again: you can set the cycle_fn 17:06.940 --> 17:14.940 and cycle_initial attributes in the salsa tracked macro, and it will do exactly those things 17:14.940 --> 17:21.660 for you.
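A sketch of what opting a query into that cycle handling can look like: the cycle_fn and cycle_initial attribute names come from the talk, but the query, the helper functions, and the exact signatures (including salsa's CycleRecoveryAction type) are assumptions that depend on the salsa version, so treat this as illustrative only.

    // Illustrative sketch of fixed-point cycle attributes on a tracked query.
    #[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
    enum Ty {
        Never,
        Int,
        Unknown,
    }

    #[salsa::interned]
    struct ClassAttr<'db> {
        name: String,
    }

    #[salsa::tracked(cycle_fn = attr_cycle_recover, cycle_initial = attr_cycle_initial)]
    fn infer_attr_type<'db>(db: &'db dyn salsa::Database, attr: ClassAttr<'db>) -> Ty {
        // In the real checker this may end up (indirectly) calling itself again,
        // like self.x -> other.x -> self.x; salsa detects that cycle for us.
        let _ = attr.name(db);
        Ty::Int
    }

    // What to assume the first time the query hits itself: the bottom type Never.
    fn attr_cycle_initial<'db>(_db: &'db dyn salsa::Database, _attr: ClassAttr<'db>) -> Ty {
        Ty::Never
    }

    // Called with each newly computed value: keep iterating until it stabilizes.
    fn attr_cycle_recover<'db>(
        _db: &'db dyn salsa::Database,
        _value: &Ty,
        _count: u32,
        _attr: ClassAttr<'db>,
    ) -> salsa::CycleRecoveryAction<Ty> {
        salsa::CycleRecoveryAction::Iterate
    }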
So, cycle_initial tells salsa what type I should use when I'm calling myself again 17:21.660 --> 17:26.700 and I have no way out, and it would return that type, so you can have all the functions that 17:26.700 --> 17:31.340 recursively call each other do their compute and then finally resolve, and then you get back to the 17:31.340 --> 17:36.860 first function that called itself, and then you have two things: the value I assumed, and the value I actually 17:36.860 --> 17:41.900 computed. Are these two things the same? If they are the same, I assumed something and I ended up 17:41.900 --> 17:47.500 with the same thing, so if I do that again, it will also be the same. But sometimes you might 17:47.500 --> 17:53.820 assume the type Never, which is the bottom type in Python; it represents something that never happens, 17:54.780 --> 17:59.100 and if I assume something like type Never, and then I do my real computation and end up with type 17:59.100 --> 18:07.100 Unknown, well, this is not valid, so I can do this cycle again. I would iterate again: I now have 18:07.100 --> 18:12.540 type Unknown, and I can say, okay, assume that I have type Unknown, and do the compute 18:12.540 --> 18:17.740 all over again, and if I end up with the Unknown type again, it means that if I just do that 18:17.740 --> 18:22.940 over and over, it's just going to be the same result, so it's at least stable, the cycle has 18:23.100 --> 18:27.980 converged. Otherwise, if it never converges, you might have something that just toggles 18:27.980 --> 18:34.860 between two values or things like that; then the cycle handling has two options: you can either iterate until 18:34.860 --> 18:39.100 some point at which you would panic, or you can fall back to an unknown value and just say, 18:39.100 --> 18:44.940 I couldn't solve the value for this. And yeah, this is really helpful, because with all the 18:44.940 --> 18:50.780 recursive calls that you have, manually instrumenting your code would be a big hassle, but salsa 18:50.860 --> 18:58.860 does this for you with this solution.
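The same idea in a tiny, self-contained form, completely separate from salsa (not ty's code): assume a starting value, recompute, and repeat until the answer stops changing, falling back to an unknown value if it never settles.

    // Minimal fixed-point iteration sketch.
    #[derive(Clone, Copy, PartialEq, Eq, Debug)]
    enum Ty {
        Never,   // the bottom type: the initial assumption for a cycle
        Int,
        Unknown, // the fallback if the cycle never converges
    }

    fn fixed_point(mut recompute: impl FnMut(Ty) -> Ty) -> Ty {
        let mut assumed = Ty::Never;
        for _ in 0..16 {
            let result = recompute(assumed);
            if result == assumed {
                // Converged: running it again would give the same answer.
                return result;
            }
            // Not stable yet: iterate again with the new assumption.
            assumed = result;
        }
        // Never converged (e.g. it keeps toggling between two values): give up.
        Ty::Unknown
    }

    fn main() {
        // This toy "cycle" stabilizes after one iteration: Never -> Int -> Int.
        let ty = fixed_point(|_assumed| Ty::Int);
        assert_eq!(ty, Ty::Int);
        println!("converged to {:?}", ty);
    }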
Then we have the testing framework in ty; it is built 18:58.860 --> 19:06.380 in house, and it is a very nice thing: a huge chunk of the test suite is written 19:06.380 --> 19:12.140 in Markdown, and we have big Markdown files with Python code inside of them, and then 19:12.140 --> 19:17.980 we explain, for example, this test case is testing this behavior, and then inside the Python code 19:17.980 --> 19:23.020 there is a magic function, reveal_type, and you would write your own Python program, 19:23.020 --> 19:29.420 and you can use reveal_type to assert, in a 19:29.420 --> 19:34.700 comment, that the type of that thing is what you want, so it's a valid Python program and it's also a test case. And here I have, for example, 19:34.700 --> 19:39.580 revealed int, but it is not int, it is Literal[1]; ty tries to be very specific with the types 19:39.580 --> 19:43.660 when it knows about them, so it knows that you assigned 1 and it is definitely 1, 19:44.620 --> 19:49.980 and when you run cargo test, you would get a pointer to that annotations markdown file, so that's 19:49.980 --> 19:55.340 your test case, and then the exact line where the problem happened, and then it would say 19:55.900 --> 20:03.660 it revealed Literal[1], not int. And whenever you submit a PR to ty, there is a huge test suite 20:03.660 --> 20:08.860 that runs: you have these md tests, the same Markdown tests I just showed you, running, 20:08.860 --> 20:12.780 then there is the conformance test suite for Python, the thing that the Python typing 20:12.780 --> 20:17.180 council works on, and they have sets of rules for what a type checker should show for different 20:17.180 --> 20:23.420 annotations, and it also runs on a lot of open-source projects to know what the impact of your 20:23.420 --> 20:28.620 change is on these projects: how many more diagnostics are you showing, how are your memory and timing 20:28.620 --> 20:34.300 metrics changing. And this is an example of one of those reports that you get, and then you can 20:34.300 --> 20:41.100 go to the projects to see, for example, what were the new diagnostics that you added, and this is 20:41.180 --> 20:45.180 very helpful, because if you don't really know what the right answer is for some decision, 20:45.180 --> 20:48.940 you can just try it out and see what happens on the open-source projects, because people 20:48.940 --> 20:56.220 write Python in a lot of different ways. That's it, here's my blog, there are more related posts there.