WEBVTT 00:00.000 --> 00:12.000 Hello everyone, I would like to introduce our next speaker, Patrick Steinhardt. 00:12.000 --> 00:23.400 Thank you. So good morning everyone, and welcome to my talk, Evolving Git for the Next 00:23.400 --> 00:25.200 Decade. 00:25.200 --> 00:29.400 Let me first introduce myself. My name is Patrick Steinhardt. 00:29.400 --> 00:36.080 My interest in open-source software started around 2002, when I was 11 years old. 00:36.080 --> 00:40.560 My elementary school teacher back then was very big into computers, so we spent almost 00:40.560 --> 00:43.680 every week in a well-equipped computer room. 00:43.680 --> 00:47.960 One of the things that my teacher introduced me to back then was this magical thing called 00:47.960 --> 00:48.960 Linux. 00:48.960 --> 00:51.440 Linux really got me hooked. 00:51.440 --> 00:56.240 You could play around with it, make it do weird stuff, and break the whole computer while 00:56.240 --> 01:01.480 trying to make your windows burn down when you close them, or just to get a fancy 3D cube 01:01.480 --> 01:05.080 if you, for example, want to have virtual workspaces. 01:05.080 --> 01:10.240 My father, on the other hand, wasn't that happy, because we had frequent arguments around 01:10.240 --> 01:15.360 why I wiped the computer once again, or why the internet doesn't work. 01:15.360 --> 01:20.320 That being said, this eventually kickstarted my interest in software engineering. 01:20.320 --> 01:26.000 I bought my first book about programming when I was 12 years old, and eventually started 01:26.000 --> 01:32.640 to do some small contributions to open-source software in 2011. 01:32.640 --> 01:37.440 In 2015, my involvement with open-source software development changed significantly when I found 01:37.440 --> 01:42.800 a job posting that was about contributing to an open-source version control system. 01:42.800 --> 01:44.680 The deal was rather simple. 
01:44.680 --> 01:49.120 I had to do something related to version control systems, and the knowledge that I gained 01:49.120 --> 01:53.560 was then sold to customers by doing trainings and consulting. 01:53.560 --> 01:58.360 I had a free choice of which software project I wanted to contribute to, and could do 01:58.360 --> 02:02.800 exactly what I wanted to do there, which was awesome. 02:02.800 --> 02:07.600 For me, the choice was basically between Subversion and Git, and I would have been welcomed 02:07.600 --> 02:10.280 by both ecosystems. 02:10.280 --> 02:14.400 To be honest, I just came from a job where I had to use Subversion, so you might understand 02:14.400 --> 02:19.320 why I didn't want to have anything to do with Subversion in the first place. 02:19.320 --> 02:23.520 I instead chose Git, and that's how I eventually became one of the core contributors 02:23.520 --> 02:27.920 to both Git and libgit2. 02:27.920 --> 02:33.960 In 2020, I then switched to GitLab as a backend engineer, where I worked on Gitaly. 02:33.960 --> 02:40.320 Gitaly is the RPC service that sits between GitLab and your Git repositories. 02:40.320 --> 02:44.640 Eventually, my responsibilities gradually shifted towards contributing to Git upstream 02:44.640 --> 02:45.640 only. 02:45.640 --> 02:49.960 We faced multiple performance bottlenecks that we wanted fixed, and fixing those 02:49.960 --> 02:53.800 required significant long-term investments into Git. 02:53.800 --> 02:57.840 One of these efforts was, for example, the reftable backend, which I've been talking about 02:57.840 --> 03:02.000 way too much over the last couple of years, and this somehow also gave me the nickname 03:02.000 --> 03:07.520 "Steini, the reftable guy" in some contexts. 03:07.520 --> 03:08.520 Git is done. 03:08.520 --> 03:10.520 What is there to change? 03:10.520 --> 03:15.400 This is something that I heard quite often over the last year, and I get it. 
03:15.400 --> 03:18.600 Just last year, Git turned 20 years old. 03:18.600 --> 03:19.600 Everyone uses it. 03:19.600 --> 03:24.920 It works, so why mess with success? 03:24.920 --> 03:28.560 The success of Git is indeed quite staggering. 03:28.560 --> 03:32.920 94% of all the developers out there are using it day to day, and there are hundreds of 03:32.920 --> 03:38.200 millions of Git repositories out there, and many more scripts depending on it. 03:38.200 --> 03:45.080 It is safe to say that Git is everywhere nowadays in the world of software development. 03:45.080 --> 03:48.520 So that begs the question, is Git really done? 03:48.520 --> 03:53.440 Is it the perfect version control system that doesn't require any changes anymore? 03:53.440 --> 03:59.720 Well, it might not surprise you, but for me the answer is a definite no. 03:59.720 --> 04:05.240 The world has changed quite significantly since 2005. 04:05.240 --> 04:08.520 Git was designed for a different era. 04:08.520 --> 04:13.720 In 2005, SHA-1 was considered to be a secure hash function, but that has changed with 04:13.800 --> 04:17.400 the SHAttered attack and other attacks on SHA-1. 04:17.400 --> 04:21.800 Back then, the Linux kernel was considered to be a rather large repository. 04:21.800 --> 04:25.720 Nowadays, it's dwarfed by repos like the Chromium repository, which is almost a hundred 04:25.720 --> 04:28.480 gigabytes in size. 04:28.480 --> 04:32.400 CI systems were kind of the exception, and you could count yourself lucky if you 04:32.400 --> 04:36.120 had, for example, a Jenkins instance available to you. 04:36.120 --> 04:40.720 Nowadays, we frequently see projects with crazy huge CI pipelines, where every single 04:40.800 --> 04:44.440 commit kicks off thousands of jobs. 04:44.440 --> 04:49.240 Monorepos are a term that nobody had heard about back then, but nowadays, everyone 04:49.240 --> 04:51.600 is using them. 
04:51.600 --> 04:57.840 And also, Git was very hard to use back then, but to be quite honest, it's still hard to 04:57.840 --> 05:06.040 use nowadays. 05:06.040 --> 05:11.440 So the world has changed, and that makes it clear that Git needs to change as well. 05:11.440 --> 05:15.280 But the unique position of Git means that we cannot have a revolution. 05:15.280 --> 05:19.640 There are millions of developers out there, and many projects that rely on Git. 05:19.640 --> 05:24.680 So we must make sure not to break the world. We simply can't break the world. 05:24.680 --> 05:30.840 Too many people depend on Git. But we can make it better, one step at a time. 05:30.840 --> 05:33.560 This evolution is what I want to talk about today. 05:33.560 --> 05:37.080 I want to highlight a couple of important transitions that Git is currently going through 05:37.080 --> 05:43.960 to ensure that the project stays relevant as the world keeps on changing. 05:43.960 --> 05:46.680 So let's dive right into our first topic. 05:46.680 --> 05:51.520 This is one of the most user-visible changes that is currently happening in the 05:51.520 --> 05:56.560 Git world, the SHA-256 transition. 05:56.560 --> 05:59.120 SHA-1 is a central part of Git's design. 05:59.120 --> 06:03.320 Every single object has an identity, and that identity is computed by hashing the 06:03.360 --> 06:05.200 contents of the object. 06:05.200 --> 06:10.080 So for blobs, we hash the file contents, for trees, we hash the directory structure, 06:10.080 --> 06:15.880 and for commits we hash authorship information, the commit message, and the root tree. 06:15.880 --> 06:18.560 Git objects are said to be content-addressable. 06:18.560 --> 06:22.040 Given the contents, you know the name of the object. 06:22.040 --> 06:26.200 The result is that you have implicit integrity verification for your objects. 06:26.200 --> 06:31.400 You can easily deduplicate them, and the history becomes immutable. 
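To make this concrete, here is a minimal sketch of that content-addressing scheme, assuming the standard `<type> <size>\0` header that Git prepends before hashing:

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# The same contents always yield the same ID, in every repository:
print(git_object_id("blob", b""))
# → e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known empty-blob ID
```

Because the name is derived purely from the contents, two identical files stored anywhere in the history deduplicate to a single object.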
06:31.400 --> 06:34.440 This object name is computed by using SHA-1. 06:34.440 --> 06:38.040 If you have, for example, a blob that contains the string "hello world", you only need 06:38.040 --> 06:42.640 to prefix the object header, and then compute the SHA-1 to get the object name. 06:42.640 --> 06:44.800 There's one problem, though. 06:44.800 --> 06:48.920 SHA-1 is not a secure hash function anymore. 06:48.920 --> 06:55.440 In 2017, Google and the CWI Amsterdam research institute proved that known theoretical attacks 06:55.440 --> 06:59.520 on SHA-1 are viable in practice with the SHAttered attack. 06:59.520 --> 07:04.200 The result of this partnership are two syntactically valid PDF files that both result 07:04.200 --> 07:06.560 in the same SHA-1 hash. 07:06.560 --> 07:15.040 The attack requires around nine quintillion computations, which corresponds to 110 years of single-GPU computation. 07:15.040 --> 07:19.200 This may seem like a lot, but if you, for example, want to brute-force SHA-1 instead, 07:19.200 --> 07:23.360 a hash collision would require 12 million GPU-years to compute. 07:23.360 --> 07:26.640 So: quite broken. 07:26.640 --> 07:31.440 Also, you can imagine that with all the recent hype that we have around artificial intelligence, 07:31.440 --> 07:36.920 data centers have massively expanded their GPU capacity. 07:36.920 --> 07:41.720 So nowadays, it is very much in reach for a large player to compute hash collisions at 07:41.720 --> 07:45.080 will, if they want to. 07:45.080 --> 07:50.200 Of course, as Git heavily relies on SHA-1, the SHAttered attack has kicked off a huge 07:50.200 --> 07:53.200 and intense discussion on the Git mailing list. 07:53.280 --> 07:57.400 It has been asserted since the beginning that the use of SHA-1 is not primarily for security, 07:57.400 --> 07:58.400 though. 07:58.400 --> 08:02.920 There are a couple of arguments that are made in this context. 
08:02.920 --> 08:07.880 First, the object hash is mostly used as an integrity check to detect corruption, bit 08:07.880 --> 08:10.640 flips, and transmission errors. 08:10.640 --> 08:13.200 Also, source code is transparent. 08:13.200 --> 08:17.320 If you see a merge request, for example, where somebody inserts random collision data into 08:17.320 --> 08:22.320 your source code, then you would probably ask some questions. 08:22.320 --> 08:28.320 Also, the object format that Git uses adds some protection, because we prepend the object 08:28.320 --> 08:29.880 length to the object. 08:29.880 --> 08:35.680 This means that you cannot just append collision data to that object. 08:35.680 --> 08:40.280 And last but not least, there are also other security measures, like GPG signatures and 08:40.280 --> 08:45.840 cryptographic transports, that add a web of trust between developers. 08:45.840 --> 08:50.360 But the reality is that things are a little bit more complicated. 08:50.360 --> 08:54.920 Yes, you can use GPG signatures to sign your commits, but unfortunately, that signature 08:54.920 --> 08:57.480 is computed over the SHA-1 commit hash. 08:57.480 --> 09:04.920 Subsequently, if you can create a collision, you cannot trust GPG signatures at all anymore. 09:04.920 --> 09:06.960 Also, not everything is source code. 09:06.960 --> 09:11.400 Whether you like it or not, some repositories out there contain binary blobs like firmware 09:11.400 --> 09:13.800 or compiled assets, for example. 09:13.800 --> 09:19.640 It's almost impossible to verify whether those might contain crafted collision data, 09:19.640 --> 09:23.760 because they are not human-readable. 09:23.760 --> 09:27.720 Also, a lot of our modern tooling builds its trust on top of Git commit hashes. 09:27.720 --> 09:34.640 You see this in CI/CD, in scripts that interact with Git, in programming languages that perform 09:34.640 --> 09:37.080 dependency pinning, and so on. 
09:37.080 --> 09:42.360 In many of those use cases, we implicitly trust the Git commit hash. 09:42.360 --> 09:48.800 And finally, government and enterprise policies also mandate the removal of SHA-1 by 2030 09:48.800 --> 09:52.400 in favor of more secure hash functions. 09:52.400 --> 09:57.320 So overall, it's safe to say that even if Git itself does not rely on SHA-1 for security, 09:57.320 --> 10:01.840 the ecosystem very much does. 10:01.840 --> 10:04.080 The fix for this is of course quite obvious. 10:04.080 --> 10:07.840 Let's just swap out SHA-1 and replace it with a different hash function. 10:07.840 --> 10:09.920 And that's exactly what happened. 10:09.920 --> 10:16.000 In Git 2.29, which was published in October 2020, we added support for using 10:16.000 --> 10:18.000 SHA-256 instead. 10:18.000 --> 10:23.720 You can simply create a new repository by saying git init --object-format= 10:23.720 --> 10:27.880 sha256, and then you get to use a different hash function. 10:27.880 --> 10:30.480 The code is there, it works. 10:30.480 --> 10:33.440 You can use it today for your new repositories. 10:33.440 --> 10:36.640 But somehow, nobody out there is using SHA-256. 10:36.640 --> 10:40.160 So what is taking us so long? 10:40.160 --> 10:43.440 The problem is that Git has very strong network effects. 10:43.440 --> 10:48.560 You cannot just implement SHA-256 in the Git command line and then call it a day, 10:48.560 --> 10:51.720 because we also need to consider the ecosystem. 10:51.720 --> 10:55.320 But unfortunately, the situation looks somewhat grim here. 10:55.320 --> 11:00.960 Next to Git, there's only a single forge and a single library that fully support SHA-256. 11:00.960 --> 11:05.520 GitLab and a few other libraries also have experimental SHA-256 support. 
11:05.520 --> 11:10.880 But some, well, rather insignificant players like, for example, GitHub don't support SHA-256 11:10.880 --> 11:13.280 at all. 11:13.280 --> 11:15.680 This creates a chicken-and-egg problem. 11:15.680 --> 11:20.840 Nobody's moving to SHA-256 because it is not supported by large forges, but large forges 11:20.840 --> 11:25.280 don't implement support because there's no demand. 11:25.280 --> 11:27.600 The problem is that we cannot wait forever. 11:27.600 --> 11:31.160 It will become more and more feasible over time to break SHA-1. 11:31.160 --> 11:35.720 And the next cryptographic weakness may be just around the corner. 11:35.720 --> 11:40.040 We need to consider that even if we had full support for SHA-256, projects still need 11:40.040 --> 11:42.160 time to migrate. 11:42.160 --> 11:45.800 So that's why we need to break the cycle. 11:45.800 --> 11:50.520 The Git project has decided to make SHA-256 the default hash for newly created repositories 11:50.520 --> 11:52.400 in Git 3.0. 11:52.400 --> 11:57.680 Our hope is that by making SHA-256 the default hash function, we are forcing both forges 11:57.680 --> 12:00.720 and third-party implementations to adapt. 12:00.720 --> 12:07.320 The message is clear: SHA-256 is the future, get ready for it. 12:07.320 --> 12:10.960 This transition will likely not be an easy one, and it may result in a couple 12:11.000 --> 12:13.000 of hiccups along the road. 12:13.000 --> 12:15.200 But if you're interested, you can also help. 12:15.200 --> 12:19.040 You can start playing around with the SHA-256 backend and create repos with it. 12:19.040 --> 12:24.280 You can show your favorite code forges that you care about SHA-256 so that they 12:24.280 --> 12:30.200 bump the priority, and you can even try and help third-party tools that depend on Git 12:30.200 --> 12:33.600 by adding SHA-256 support. 
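One helpful detail for anyone adding SHA-256 support: the object format stays the same, only the hash function is swapped out. A small sketch under that assumption, contrasting the two object IDs for identical content:

```python
import hashlib

def object_id(content: bytes, algo: str) -> str:
    # Both object formats hash the same "blob <size>\0" header plus
    # content; only the hash function itself differs between the
    # SHA-1 and SHA-256 repository formats.
    header = b"blob %d\x00" % len(content)
    return hashlib.new(algo, header + content).hexdigest()

data = b"hello world"
print(object_id(data, "sha1"))    # 40 hex digits
print(object_id(data, "sha256"))  # 64 hex digits: same object, incompatible name
```

This is also why the transition is hard: the same blob gets a completely different, longer name, so every stored hash in tools and lockfiles refers to only one of the two worlds.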
12:33.600 --> 12:37.600 Together, we can hopefully get the ecosystem to move before the next vulnerability in 12:37.600 --> 12:38.600 SHA-1. 12:42.600 --> 12:46.560 Now I need to live up to my nickname and talk a little bit about my favorite 12:46.560 --> 12:49.080 topic, which is reftables. 12:49.080 --> 12:53.200 Reftables are another significant shift that is currently happening in Git repositories, 12:53.200 --> 12:57.160 as a new backend to store your references. 12:57.160 --> 13:00.400 Before we talk about reftables, though, I first want to give you a little bit of context 13:00.400 --> 13:06.040 about how references are stored in Git right now, and why that is a problem. 13:06.040 --> 13:10.640 When creating or updating references, they are by default stored in the loose format. 13:10.640 --> 13:13.520 Every reference is stored as a separate file. 13:13.520 --> 13:16.560 The reference format is really easy to understand. 13:16.560 --> 13:22.360 To demonstrate, let's create a simple repository with a commit and a branch. 13:22.360 --> 13:24.760 The first file we'll examine is HEAD. 13:24.760 --> 13:27.480 This file indicates what your currently checked-out branch is. 13:27.480 --> 13:33.560 As you can see, it has a "ref:" prefix, which means that this is a symbolic reference, 13:33.560 --> 13:35.800 where the target is refs/heads/main. 13:35.800 --> 13:40.400 So we know we have the main branch checked out in that repo. 13:40.400 --> 13:44.200 All the other references are stored in the refs directory hierarchy. 13:44.200 --> 13:46.560 As we can see, we got two files in there: 13:46.560 --> 13:49.000 refs/heads/feature and refs/heads/main. 13:49.000 --> 13:56.760 These are our branches, and they contain the object ID they're pointing to as their contents. 13:56.760 --> 14:00.720 Now storing every single reference as a separate file works well when your repository only 14:00.720 --> 14:02.760 has a handful of them. 
14:02.760 --> 14:05.720 But when you have hundreds or even thousands of refs, then it becomes 14:05.720 --> 14:07.400 really inefficient. 14:07.400 --> 14:11.520 Every reference needs a separate inode, which typically also means that it needs a full 14:11.520 --> 14:13.560 disk sector. 14:13.560 --> 14:17.200 Listing all your references also becomes increasingly expensive, as you have to read many, 14:17.200 --> 14:19.320 many files. 14:19.320 --> 14:22.960 So Git regularly packs your references to counteract this. 14:22.960 --> 14:26.720 Instead of storing each reference in a separate file, they get packed into a packed-refs 14:26.720 --> 14:28.840 file. 14:28.840 --> 14:32.640 You can manually pack references by saying git pack-refs --all. 14:32.640 --> 14:37.080 You typically don't have to do that yourself, because Git does it automatically 14:37.080 --> 14:39.280 for you. 14:39.280 --> 14:44.040 As you can see, after executing this command, our loose references are gone. 14:44.040 --> 14:48.560 Instead, we now have a packed-refs file that is a simple sorted list of all the 14:48.560 --> 14:52.320 references that have just been packed. 14:52.320 --> 14:55.840 Now you know how the files backend works for storing references. 14:55.840 --> 14:58.760 Why does it need to change? 14:58.760 --> 15:02.760 The first problem is that file systems are simply weird. 15:02.760 --> 15:08.160 One special system close to my heart is Windows, which reserves all kinds of file names. 15:08.160 --> 15:12.240 And as we encode reference names via the file system path, it means that you cannot 15:12.240 --> 15:20.360 create a branch that is named CON, PRN, AUX, NUL, COM1 to 9, LPT1 to 9, and more. 15:20.360 --> 15:25.720 There are also many file systems out there, like NTFS, FAT, or HFS+, that are case-insensitive 15:25.720 --> 15:26.720 by default. 
15:26.960 --> 15:31.120 And again, the consequence is that you cannot create two branches that only differ in 15:31.120 --> 15:33.320 casing. 15:33.320 --> 15:37.760 macOS also does somewhat weird stuff, where it may change the way that your file name 15:37.760 --> 15:42.240 is represented and encoded in case it contains certain Unicode characters. 15:42.240 --> 15:46.760 So what you want to store on disk and what macOS actually decides to store on disk might 15:46.760 --> 15:48.760 be different. 15:48.760 --> 15:52.520 In the best case, you know that these restrictions apply and won't ever try to create 15:52.520 --> 15:53.520 such branches. 15:53.920 --> 16:01.200 In the worst case, you're stuck on Windows. 16:01.200 --> 16:05.920 If you want to write 20 references, you have to create 20 separate files. 16:05.920 --> 16:10.040 This does not only take long when you consider performance, but for typical file systems 16:10.040 --> 16:15.840 it also means that each of these references may require four kilobytes of storage. 16:15.840 --> 16:18.000 And that adds up rather quickly. 16:18.000 --> 16:23.160 Packing references is expensive, but mandatory when a repository has many branches, 16:23.200 --> 16:25.240 to retain good performance. 16:25.240 --> 16:30.040 You have to rewrite the complete packed-refs file on every repack though, which is typically 16:30.040 --> 16:33.720 fine for repos that only have a handful of references. 16:33.720 --> 16:36.480 But Git users are not always reasonable. 16:36.480 --> 16:41.640 One of the worst repositories that we, for example, host at GitLab contains around 16:41.640 --> 16:48.280 20 million references, which adds up to a packed-refs file that is two gigabytes in size. 16:48.280 --> 16:49.960 Which brings me to the next point. 16:49.960 --> 16:55.040 Deleting a reference also requires us to rewrite the complete packed-refs file. 
16:55.040 --> 16:59.600 So every time someone deletes a reference in that repository, we have to rewrite 16:59.600 --> 17:01.720 two gigabytes of data. 17:01.720 --> 17:06.080 And to add insult to injury, this repository typically deletes references every couple 17:06.080 --> 17:11.960 of seconds. Not exactly efficient. 17:11.960 --> 17:13.480 Concurrency is an afterthought. 17:13.480 --> 17:17.680 There are multiple issues with representing references as single files when you have multiple 17:17.680 --> 17:22.160 readers and writers in your repository at the same point in time. 17:22.160 --> 17:27.000 One of the problems is that it is impossible to get a consistent view of all your references. 17:27.000 --> 17:32.200 To get that, you would have to open multiple files, and each of them could change concurrently. 17:32.200 --> 17:36.480 So when somebody writes to the repository while you read references, you cannot tell whether 17:36.480 --> 17:43.080 the result you got is consistent or whether it is a mixture of the old and the new state. 17:43.080 --> 17:46.840 Similarly, it is impossible to atomically write more than one reference. 17:46.840 --> 17:51.120 Each reference you want to write is a separate file, and thus you cannot commit 17:51.120 --> 17:53.640 them all at once. 17:53.640 --> 17:56.560 There is also no way to lock the reference database. 17:56.560 --> 18:01.680 Well, there could be a central lock file, but introducing that retroactively would likely break 18:01.680 --> 18:04.400 all kinds of use cases out there. 18:04.400 --> 18:07.720 These problems have all been known for a very long time already, and that is where the 18:07.720 --> 18:11.240 reftable backend comes into play. 18:11.240 --> 18:16.000 You can create a new repository with the reftable backend by passing --ref-format= 18:16.000 --> 18:18.000 reftable 18:18.000 --> 18:21.320 to git init. That is basically all you need to know. 
18:21.320 --> 18:25.600 Afterwards, the repository is expected to behave the exact same as with the files 18:25.600 --> 18:27.240 backend. 18:27.240 --> 18:31.680 But still, let's have a deeper look at what the repository is structured like. 18:31.680 --> 18:36.360 Curiously enough, we still see that we have the refs directory and the HEAD file, which 18:36.360 --> 18:38.880 were also present in the files backend. 18:38.880 --> 18:41.440 But these files are mere compatibility stubs. 18:41.440 --> 18:45.960 They don't contain any actual data, but they have to exist, because Git does not consider 18:45.960 --> 18:53.000 a repository to be a repository unless those files exist. 18:53.000 --> 18:56.080 What's new is that we now also have a reftable directory. 18:56.080 --> 19:00.480 The first data structure that is specific to reftables is the tables.list file. 19:00.480 --> 19:05.440 This file tracks the currently active list of tables in the repo. 19:05.440 --> 19:11.000 So whenever you update a reference, Git will write a new table and append it to this list. 19:11.000 --> 19:15.080 This mechanism is really important because it allows for atomic updates. 19:15.080 --> 19:18.760 You can get a consistent snapshot by reading the tables.list file and then loading 19:18.760 --> 19:20.880 all of the referenced tables. 19:20.880 --> 19:25.440 Then you can perform an atomic write of multiple tables by writing a table, writing a 19:25.440 --> 19:32.360 temporary tables.list file, and then atomically renaming that into place. 19:32.360 --> 19:35.760 The tables themselves are stored in a binary format. 19:35.760 --> 19:39.400 While the binary format is more complex than a text-based format, it allows us to store 19:39.400 --> 19:41.520 data more efficiently. 19:41.520 --> 19:45.040 Also, as reference names are not encoded via the file system path anymore, you 19:45.040 --> 19:50.400 are not subject to file system limitations here. 
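That write-temporary-then-rename protocol can be sketched in a few lines. This is a simplified model, not Git's implementation: real reftable files are binary, the table file names here are made up for illustration, and Git additionally takes a lock while updating.

```python
import os
import tempfile

def publish_tables(reftable_dir: str, tables: list[str]) -> None:
    # Write the new list of active tables to a temporary file first...
    fd, tmp_path = tempfile.mkstemp(dir=reftable_dir)
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(tables) + "\n")
    # ...then atomically rename it into place. A concurrent reader sees
    # either the complete old list or the complete new list, never a mix.
    os.replace(tmp_path, os.path.join(reftable_dir, "tables.list"))
```

A reader then snapshots the stack by reading tables.list once and opening exactly the tables it names, which is what makes consistent reads possible.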
19:50.400 --> 19:52.320 Reftables use a block-based structure. 19:52.320 --> 19:56.880 Every block is exactly four kilobytes of data, so that it fits exactly into a disk 19:56.880 --> 19:58.040 sector. 19:58.040 --> 20:01.280 This allows us to efficiently read a single block. 20:01.280 --> 20:03.280 Each block also has a specific type. 20:03.280 --> 20:07.480 Ref blocks, for example, store our references, but there are also other types that we 20:07.480 --> 20:10.560 will not go into detail on today. 20:10.560 --> 20:14.080 Furthermore, every section of blocks may have an optional index. 20:14.080 --> 20:18.640 Each index entry stores the last record name that a given block contains, which allows us 20:18.640 --> 20:21.240 to quickly find a specific record. 20:21.240 --> 20:26.080 So let's say we want to look up the branch called d. We would then first read the 20:26.080 --> 20:32.040 index block. It tells us that the first block contains references up to refs/heads/c, and 20:32.040 --> 20:35.760 we know that this block will not contain the reference that we are searching for. 20:35.760 --> 20:41.360 Second, we see that the second block contains references up to refs/heads/g. 20:41.360 --> 20:45.000 So if the reference we are searching for exists, it must be in that block. 20:45.000 --> 20:50.720 So we read the target block and search for our reference there. 20:50.720 --> 20:54.640 Now if we zoom into one of those ref blocks, we can see that the blocks contain a lexicographically 20:54.640 --> 20:58.560 sorted list of refs with their respective object IDs. 20:58.560 --> 21:01.880 One important bit here is the grayed-out part of the ref names. 21:01.880 --> 21:05.280 These are prefixes that are common with the preceding reference. 21:05.280 --> 21:09.280 The reftable format uses prefix compression to save a little bit of space. 
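A sketch of that prefix-compression idea in Python; the record layout is simplified here (real reftable records also carry value types and update indices), so treat the encoding as illustrative only:

```python
def compress(refs):
    # Input must be sorted. Each entry stores how many leading bytes to
    # reuse from the previous name, plus only the differing suffix.
    entries, prev = [], ""
    for name in refs:
        common = 0
        while common < min(len(prev), len(name)) and prev[common] == name[common]:
            common += 1
        entries.append((common, name[common:]))
        prev = name
    return entries

def decompress(entries):
    # Rebuild each name from the previous one plus the stored suffix.
    refs, prev = [], ""
    for reuse, suffix in entries:
        prev = prev[:reuse] + suffix
        refs.append(prev)
    return refs

refs = ["refs/heads/feature", "refs/heads/main", "refs/tags/v1.0"]
print(compress(refs))
# → [(0, 'refs/heads/feature'), (11, 'main'), (5, 'tags/v1.0')]
```

Because branches overwhelmingly share long prefixes like refs/heads/, this shrinks the stored names considerably while keeping the block scannable in order.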
21:09.280 --> 21:13.120 Instead of storing the full ref name, we only state how many bytes to reuse from the 21:13.120 --> 21:19.560 preceding reference, and then only store the differing bytes. 21:19.560 --> 21:23.280 A major difference is also how we pack references. 21:23.280 --> 21:28.160 With the files backend, we write lots of loose references all the time and eventually 21:28.160 --> 21:30.840 do an all-into-one repack. 21:30.840 --> 21:34.000 With the reftable backend, things work a little bit differently. 21:34.000 --> 21:38.040 Every time we write references, we append a new table to the stack of tables. 21:38.040 --> 21:42.000 Afterwards, we verify whether the tables form a geometric sequence: 21:42.000 --> 21:45.880 the next table must be at most half the size of the current one. 21:45.880 --> 21:49.320 This check happens every single time we update the stack. 21:49.320 --> 21:54.440 So we basically ensure that the stack is well optimized as we go. 21:54.440 --> 21:58.960 To demonstrate, we start out with two tables, one that contains eight references and one that 21:58.960 --> 22:01.760 only contains a single ref. 22:01.760 --> 22:04.800 So let's write a new table with a single reference. 22:04.800 --> 22:09.960 We see the geometric sequence property is not maintained anymore, as one is not smaller 22:09.960 --> 22:14.440 than or equal to half of one, so we merge them together. 22:14.440 --> 22:19.360 There is no need to compact further though, as two is less than half of eight, and 22:19.360 --> 22:22.640 so the remaining tables form a geometric sequence. 22:22.640 --> 22:24.600 We create another table with a single ref. 22:24.600 --> 22:29.160 We see the geometric sequence is maintained, no need to merge. 22:29.160 --> 22:33.200 If we now create another table, then we will have to merge one time, but the geometric 22:33.200 --> 22:38.040 sequence is still not maintained, so we have to merge a second time. 
22:38.040 --> 22:42.200 The result is that we always have at most logarithmically many tables, which ensures 22:42.200 --> 22:46.680 that reads continue to be fast. 22:46.680 --> 22:51.200 We now have a very rough understanding of how reftables work, but why do we do this whole 22:51.200 --> 22:55.760 exercise to swap out the storage layer in the first place? 22:55.760 --> 23:00.960 With the files backend, we're subject to a lot of specifics of the file system, as reference 23:00.960 --> 23:03.760 names are derived from path names. 23:03.760 --> 23:08.160 With reftables, all of these issues go away, as the ref names are encoded in the individual 23:08.160 --> 23:11.160 tables. 23:11.160 --> 23:14.600 Also, while I very much hope that you don't have to work in repos that contain millions 23:14.600 --> 23:19.280 of references, such repos exist out there, and if you work in them, then you might 23:19.280 --> 23:24.320 very well appreciate the improved performance. 23:24.320 --> 23:28.080 Also, the files backend does not allow for atomic updates, as references are written 23:28.080 --> 23:30.640 one by one, as separate files. 23:30.640 --> 23:34.800 This issue goes away with the reftable backend, where reads are consistent and writes 23:34.800 --> 23:35.800 are atomic. 23:35.800 --> 23:40.240 You probably don't care too much about this on the client side, but on the server side, 23:40.240 --> 23:43.440 this is a huge improvement. 23:43.440 --> 23:47.440 And last but not least, reftables will also become the default in Git 3.0. 23:47.440 --> 23:51.560 So if you, for example, use Git in scripts, or if you use it on the server side, then you 23:51.560 --> 23:54.920 should verify that you don't play weird games by accessing references directly 23:54.920 --> 23:56.640 via the file system. 23:56.640 --> 24:01.120 You should always access references via Git commands, and if you do so, then you shouldn't 24:01.120 --> 24:03.440 observe any differences. 
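The auto-compaction rule from the walkthrough earlier can be modeled in a few lines. This simulates only the table sizes, not actual reftable I/O, and uses reference counts as a stand-in for table size:

```python
def append_table(sizes, new_size):
    # Push the new table onto the stack, then merge from the top while a
    # table is less than twice the size of the one stacked on top of it,
    # restoring the geometric sequence.
    sizes = sizes + [new_size]
    while len(sizes) > 1 and sizes[-2] < 2 * sizes[-1]:
        sizes[-2:] = [sizes[-2] + sizes[-1]]
    return sizes

stack = [8, 1]
stack = append_table(stack, 1)  # 1 and 1 merge         → [8, 2]
stack = append_table(stack, 1)  # sequence holds        → [8, 2, 1]
stack = append_table(stack, 1)  # merges twice          → [8, 4]
print(stack)
# → [8, 4]
```

Since every table is at least twice the size of the one above it, a stack of N references can never hold more than roughly log2(N) tables, which is what keeps lookups fast.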
24:03.440 --> 24:11.840 So, we talked a little bit about how we are improving the storage of references in Git. 24:11.840 --> 24:16.160 But while this addresses some theoretical issues, I would claim that most of you in this room 24:16.160 --> 24:20.880 probably don't encounter those problems in practice. 24:20.880 --> 24:24.640 When it comes to scalability bottlenecks, the more important problem tends to be large 24:24.640 --> 24:25.640 files. 24:25.640 --> 24:30.680 Storing large binary files in Git is unfortunately not a use case that is well supported 24:30.680 --> 24:31.680 nowadays. 24:31.680 --> 24:36.600 As a workaround, developers tend to resort to third-party solutions like, for example, Git 24:36.600 --> 24:38.160 LFS. 24:38.160 --> 24:41.760 This is something that we want to change. 24:41.760 --> 24:47.440 But first, why are large files a hard problem for Git in the first place? 24:47.440 --> 24:49.720 There are two important issues here. 24:49.720 --> 24:53.000 The first problem is how Git compresses objects. 24:53.000 --> 24:59.240 Git works extremely well for repositories with text files, like for example source code. 24:59.240 --> 25:03.880 First, it uses zlib compression to reduce the general size of objects. 25:03.880 --> 25:09.160 And second, Git can store incremental changes to objects as deltas. 25:09.160 --> 25:13.040 Together, this achieves great compression ratios for text files. 25:13.040 --> 25:16.840 After all, this is what Git was designed for. 25:16.840 --> 25:21.200 But unfortunately, zlib compression tends to not work well for binary files. 25:21.200 --> 25:26.440 And computing deltas becomes increasingly expensive the larger your files get. 25:26.440 --> 25:31.840 The consequence is that even small edits to such files end up creating entirely new objects 25:31.840 --> 25:36.120 without using any deltas at all. 25:36.120 --> 25:39.040 The second problem occurs on the networking layer. 
25:39.040 --> 25:44.840 Whenever you clone a Git repository, you get a full copy of all of the history by default. 25:44.840 --> 25:47.160 This is what you want for normal repos. 25:47.160 --> 25:51.360 But once we're talking about large monorepos with binary files in them, then you probably 25:51.360 --> 25:56.000 don't want to download hundreds of gigabytes of data. 25:56.000 --> 26:00.240 This is further stressed by the fact that there is no support for resumable clones. 26:00.240 --> 26:05.640 So if you have downloaded 400 gigabytes out of a 500 gigabyte repository and your network 26:05.640 --> 26:10.040 disconnects, then you will have to re-download everything. 26:10.040 --> 26:13.920 And because deltification does not work for large binary files, you have to re-download 26:13.920 --> 26:18.160 the full blob contents every single time a large binary file changes, instead of only 26:18.160 --> 26:21.600 fetching the incremental changes. 26:21.600 --> 26:27.080 The result is that many teams that work with large files simply avoid using Git altogether, 26:27.080 --> 26:30.560 which is unfortunate. 26:30.560 --> 26:34.480 Of course, large monorepos don't only cause issues on the client side. 26:34.480 --> 26:37.200 Code forges are also struggling with them. 26:37.200 --> 26:41.880 First and foremost, forges don't have the luxury of partial clones, for example. 26:41.880 --> 26:45.520 A forge needs to have all objects available, as it would otherwise not be able to serve those 26:45.520 --> 26:47.320 to the client. 26:47.320 --> 26:50.040 The consequence is a significant storage cost. 26:50.040 --> 26:55.800 Our analysis on GitLab.com has shown that 75% of our storage space for Git repositories 26:55.800 --> 27:00.840 can be attributed to binary files larger than one megabyte. 27:00.840 --> 27:05.640 The huge repo sizes also cause repository maintenance to become very expensive.
27:05.640 --> 27:10.440 We have to rewrite objects every once in a while, for example, to delete some of them. 27:10.440 --> 27:14.960 If your repository contains large binaries, then this becomes computationally very 27:14.960 --> 27:16.960 expensive. 27:16.960 --> 27:22.480 Also, it is not possible to offload any of those objects to a content delivery network. 27:22.480 --> 27:26.200 All data needs to be served by the Git server, and that makes it become a significant 27:26.200 --> 27:28.200 bottleneck. 27:28.200 --> 27:33.080 So in summary, large objects are a significant cost factor for any large Git provider 27:33.080 --> 27:35.800 out there. 27:35.800 --> 27:40.120 Git users have adapted to work around those shortcomings with band-aids. 27:40.120 --> 27:42.760 Git LFS is one such solution. 27:42.760 --> 27:46.960 Instead of storing actual file contents in Git, you end up storing only a pointer to the 27:46.960 --> 27:48.480 object contents. 27:48.480 --> 27:52.080 The actual content is then stored on a separate server that is better suited for storing 27:52.080 --> 27:54.920 binary data. 27:54.920 --> 27:59.400 This solution keeps the repository small, and is well supported by hosting providers. 27:59.400 --> 28:01.160 But it's not part of Git. 28:01.160 --> 28:05.000 It doesn't know to transfer deltas, and once you have accidentally committed any large 28:05.000 --> 28:10.080 blob into your history, then it's stuck there forever. 28:10.080 --> 28:14.440 Partial clones allow you to clone a repository while filtering out certain objects, like 28:14.440 --> 28:16.280 for example blobs. 28:16.280 --> 28:20.600 This mechanism is a native part of Git, and transparent to the user, as Git will automatically 28:20.600 --> 28:24.160 fetch those missing objects on demand. 28:24.160 --> 28:26.800 But users have to know how to use it.
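To make the pointer idea concrete: Git LFS commits a tiny text pointer in place of the real file. The format below follows the published Git LFS pointer specification; the sketch itself is my own illustration of what ends up in the repository:

```python
import hashlib

def lfs_pointer(content: bytes) -> str:
    """Build the small text pointer that Git LFS stores in the repository
    instead of the real file contents (format per the Git LFS spec)."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )

# A 1 MB binary blob turns into a pointer of ~130 bytes; the real
# content lives on a separate LFS server, addressed by its hash.
pointer = lfs_pointer(b"\x00" * 1_000_000)
print(pointer)
```

The repository only ever sees the pointer, which is why LFS keeps clones small, and also why Git itself cannot delta or prune the real contents.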
28:26.800 --> 28:30.960 There is no automatic pruning of large old files that have not been accessed for a long 28:30.960 --> 28:36.520 time, and servers still cannot offload the traffic. 28:36.520 --> 28:40.960 Partial clones have already existed for quite a while, but I bet many of you have never 28:40.960 --> 28:42.840 used them before. 28:42.840 --> 28:44.480 Overall, it's quite simple. 28:44.480 --> 28:49.840 You simply specify an object filter, and from there on Git will fetch those filtered 28:49.840 --> 28:52.600 blobs whenever required. 28:52.600 --> 28:57.960 If we, for example, clone with --filter=blob:none, then Git would first fetch all 28:57.960 --> 29:02.320 the non-blob objects, which basically leaves us with the repository shape only, but we 29:02.320 --> 29:05.080 don't have any contents at all yet. 29:05.080 --> 29:09.480 Then the checkout begins, and Git realizes that it actually needs to fetch some file 29:09.480 --> 29:10.480 contents. 29:10.480 --> 29:18.040 So it does a batched fetch for only the blobs needed to satisfy checking out the HEAD commit. 29:18.040 --> 29:22.240 If we eventually need other blobs that we don't have yet, then Git knows to fetch those 29:22.240 --> 29:24.760 on demand. 29:24.760 --> 29:29.000 For this specific repository, the result is that we have to only download 1.2 gigabytes 29:29.000 --> 29:32.760 of data instead of 2.8. 29:32.760 --> 29:37.320 As mentioned, the problem is that Git users need to know how to use them, and on the 29:37.320 --> 29:41.760 server side, the traffic still cannot be offloaded to secondary servers. 29:41.760 --> 29:46.240 That's the problem that large object promisors aim to solve. 29:46.240 --> 29:47.960 The idea is quite simple. 29:47.960 --> 29:52.720 When a client clones, the server announces a set of promisor remotes to the client, 29:52.720 --> 29:55.920 where each promisor remote has an object filter attached to it.
29:55.920 --> 30:00.000 The client then automatically selects a subset of those promisors and uses the attached 30:00.000 --> 30:02.640 filters to perform the initial clone. 30:02.640 --> 30:06.920 The used promisors get stored in the configuration, and are from there on used whenever the 30:06.920 --> 30:09.600 client needs to backfill some data. 30:09.600 --> 30:12.520 This tries to solve multiple problems. 30:12.520 --> 30:17.720 First, we can now offload traffic to a secondary Git server, and this reduces load on the 30:17.720 --> 30:19.720 primary. 30:19.720 --> 30:22.600 Also, the functionality is built right into Git. 30:22.600 --> 30:26.000 There is no need for external tools anymore. 30:26.000 --> 30:29.000 This solution is also fully transparent to the client. 30:29.000 --> 30:34.840 The server can announce an optimal filter, and the client can automatically use it. 30:34.840 --> 30:36.960 But there is one key feature here though. 30:36.960 --> 30:42.100 Git doesn't only support HTTPS or SSH clones, but in theory it can also support fetching 30:42.100 --> 30:44.560 via arbitrary transports. 30:44.560 --> 30:48.560 This support is extensible via so-called remote helpers. 30:48.560 --> 30:52.520 A remote helper is a binary that is simply called git-remote-something. 30:52.520 --> 30:57.760 So when Git for example sees the S3 protocol, it knows to look for a binary called 30:57.760 --> 31:04.000 git-remote-s3, and if it exists, it uses that one to talk to the remote. 31:04.000 --> 31:07.320 The key realization now is that announced promisors are remotes. 31:07.320 --> 31:13.040 One might for example use a protocol that stores large objects in an S3-compatible store. 31:13.040 --> 31:17.840 This allows us to offload objects to a content delivery network, and it allows us to store 31:17.840 --> 31:24.800 large blobs in a format that is much better suited for them.
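The helper lookup described here follows a simple naming convention: for an unknown transport scheme, Git searches PATH for an executable named git-remote-&lt;scheme&gt;. A minimal sketch of that resolution rule (note that git-remote-s3 is a hypothetical helper used as an example in the talk, not something that ships with Git):

```python
from urllib.parse import urlsplit

def remote_helper_for(url: str) -> str:
    """Map a remote URL to the external helper binary Git would look
    for on PATH: git-remote-<scheme>."""
    scheme = urlsplit(url).scheme
    return f"git-remote-{scheme}"

print(remote_helper_for("s3://my-bucket/large-objects"))  # git-remote-s3
```

Because promisor remotes are just remotes, any transport with such a helper on PATH can serve the offloaded large objects.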
31:24.800 --> 31:29.040 With large object promisors we will have the infrastructure in place to let servers offload 31:29.040 --> 31:33.960 binary files, and clients will know to automatically use them if desired. 31:33.960 --> 31:36.320 But we still have another issue. 31:36.320 --> 31:40.800 Even with promisors, Git's object format still doesn't handle binary files efficiently on 31:40.800 --> 31:42.560 the client side. 31:42.560 --> 31:46.320 This is where pluggable object databases come into play, which will allow us to introduce 31:46.320 --> 31:51.720 a new storage format for large binary files specifically. 31:51.720 --> 31:56.840 As mentioned before, Git's packfile format uses delta compression to store incremental changes 31:56.840 --> 31:59.040 to objects efficiently. 31:59.040 --> 32:04.000 This works amazingly for text files, but for large binaries, computing deltas this way 32:04.000 --> 32:07.800 is way too expensive, so Git doesn't even try. 32:07.800 --> 32:12.480 Instead, even small edits create entirely new objects once the objects reach a certain 32:12.480 --> 32:14.200 size. 32:14.200 --> 32:19.120 We need a format designed for binaries, where incremental changes to a binary file only lead 32:19.120 --> 32:21.480 to a small storage increase. 32:21.520 --> 32:25.200 This new storage format also needs to be efficient for any file size. 32:25.200 --> 32:31.120 The computational complexity should at most grow linearly with the file size. 32:31.120 --> 32:34.480 And last but not least, the format also needs to be compatible with the existing format 32:34.480 --> 32:39.360 somehow, so that you can mix and match the old storage format for text files and the new 32:39.360 --> 32:43.400 storage format for large binaries. 32:43.400 --> 32:47.560 The storage format is deeply baked into Git, but alternative implementations like 32:47.640 --> 32:52.240 libgit2, go-git and JGit already have pluggable backends.
32:52.240 --> 32:56.280 So there is no fundamental reason why Git can't do this too. 32:56.280 --> 33:02.120 It requires a lot of plumbing and refactoring, but it's certainly a feasible thing. 33:02.120 --> 33:06.360 Assuming that we had pluggable object databases and that we could swap out the backend, the 33:06.360 --> 33:10.200 idea would be to introduce chunking into Git. 33:10.200 --> 33:14.080 With our current deltification logic, we have to do expensive calculations to find 33:14.080 --> 33:18.040 ideal deltas, which is simply too costly for binaries. 33:18.040 --> 33:23.400 With chunking, though, we can deduplicate common parts by cutting a large binary file into 33:23.400 --> 33:29.360 smaller chunks, and each of these chunks can then be deduplicated individually. 33:29.360 --> 33:33.800 There's two significantly different ways of doing chunking. 33:33.800 --> 33:38.120 The first and obvious way is to simply split a file into fixed-size chunks. 33:38.120 --> 33:44.120 In this example, we, for example, cut the file after every fourth character. 33:44.120 --> 33:47.960 The problem, though, is that if you insert new data at any point in the file, then all 33:47.960 --> 33:51.160 the chunks that follow afterwards will now change. 33:51.160 --> 33:55.480 The result is that we cannot deduplicate those chunks. 33:55.480 --> 33:57.960 The alternative is content-defined chunking. 33:57.960 --> 34:02.680 The key insight of content-defined chunking is that boundaries are determined not by length, 34:02.680 --> 34:05.280 but by the content itself. 34:05.280 --> 34:09.200 Every time a specific property is triggered, we cut a new chunk. 34:09.200 --> 34:16.360 The result is that the file will be cut into chunks of variable length. 34:16.360 --> 34:21.120 So if you insert data at the beginning or anywhere in the file now, then the first chunk 34:21.120 --> 34:23.040 will of course change.
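The boundary-shift problem with fixed-size chunks can be demonstrated in a couple of lines. This is a toy example with four-byte chunks, matching the slide:

```python
def fixed_chunks(data: bytes, size: int = 4):
    """Split data into fixed-size chunks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

before = b"ABCDEFGHIJKLMNOP"
after = b"X" + before  # insert a single byte at the front

a, b = fixed_chunks(before), fixed_chunks(after)
shared = set(a) & set(b)
print(a)
print(b)
print(shared)  # empty: no chunk survives the one-byte insertion
```

A single inserted byte shifts every subsequent boundary, so none of the old chunks can be deduplicated against the new ones.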
34:23.040 --> 34:27.000 But because the boundary is defined by the content, we know that we are still going to cut 34:27.000 --> 34:34.480 subsequent chunks at the exact same boundaries, and the remaining chunks will remain identical. 34:34.480 --> 34:39.200 The mechanism used for this is to compute a rolling hash function over a sliding window. 34:39.200 --> 34:44.520 When the hash matches a condition, like for example being divisible by n, then we cut. 34:44.520 --> 34:47.600 This is how tools like, for example, restic or borg 34:47.600 --> 34:53.480 can handle large file backups efficiently. 34:53.480 --> 34:56.640 We don't really need to replace Git's entire storage format. 34:56.640 --> 35:01.760 It works quite well for text files, and content-defined chunking would likely make compression 35:01.760 --> 35:04.440 ratios worse for them. 35:04.440 --> 35:09.880 Git already supports having multiple object sources attached to it, so you can use the 35:09.880 --> 35:13.320 alternates mechanism to have two different storage types. 35:13.320 --> 35:18.760 The idea is to connect two object sources, and based on whether or not a file is 35:18.760 --> 35:24.280 a binary file, you would either store it in the chunked format, or you would store it using 35:24.280 --> 35:28.520 pack files. 35:28.520 --> 35:31.960 The two efforts to introduce large object promisors and pluggable object databases 35:31.960 --> 35:33.960 progress in parallel. 35:33.960 --> 35:39.000 The initial protocol implementation for large object promisors has landed in Git 2.50, 35:39.000 --> 35:42.040 and has been extended in Git 2.52. 35:42.040 --> 35:47.200 The next steps are to automatically use filters and promisors on the client side. 35:47.200 --> 35:50.440 Overall, this is quite close to being usable in production. 35:50.440 --> 35:55.360 I would assume that over the next couple of releases, we will have all the required parts.
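The same insertion experiment behaves very differently with content-defined chunking. The sketch below uses a deliberately simple additive rolling hash (real tools use stronger Rabin-style hashes) and cuts whenever the hash over the sliding window is divisible by a modulus:

```python
def cdc_chunks(data: bytes, window: int = 4, modulus: int = 8):
    """Content-defined chunking with a toy additive rolling hash:
    cut whenever the sum of the last `window` bytes is 0 mod `modulus`."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h += byte
        if i >= window:
            h -= data[i - window]  # roll the window forward
        if i + 1 >= window and h % modulus == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

orig = b"AABBCCDDEEFFGGHH"
a = cdc_chunks(orig)
b = cdc_chunks(b"XX" + orig)  # insert two bytes at the front
print(a)  # [b'AABBC', b'CDDE', b'EFFG', b'GHH']
print(b)  # [b'XXAABBC', b'CDDE', b'EFFG', b'GHH']
```

Only the first chunk changes; because cut points depend solely on local window contents, every later boundary lands in the same place and those chunks deduplicate.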
35:55.360 --> 36:00.480 What is of course still missing is also support in Git forges. 36:00.480 --> 36:04.440 The effort around pluggable object databases is not that far yet. 36:04.440 --> 36:07.840 Over the last couple of Git releases, we have spent some significant time refactoring the 36:07.840 --> 36:11.560 code base and how Git accesses objects. 36:11.560 --> 36:16.680 Starting with Git 2.53, which will be released tomorrow, no, in two days actually, 36:16.680 --> 36:21.240 we will have a unified object database interface that makes it easy for us to change 36:21.240 --> 36:23.600 the format going forward. 36:23.600 --> 36:28.240 In Git 2.54, I then expect that we will have an initial proof of concept, but implementing 36:28.240 --> 36:32.920 the chunked format will probably take a little bit longer. 36:32.920 --> 36:37.040 Once those parts have landed though, Git will become a lot more viable for large binary 36:37.040 --> 36:42.840 files without workarounds. 36:42.840 --> 36:46.040 The last couple of sections have been about technical details. 36:46.040 --> 36:50.920 One core area though that Git does get a lot of complaints about is its UI. 36:50.920 --> 36:55.920 Many commands are extremely confusing, and some workflows are significantly harder than 36:55.920 --> 36:58.200 they have any right to be. 36:58.200 --> 37:02.680 And recently there's been a competitor that makes us have a hard look at ourselves and 37:02.680 --> 37:05.320 what we're doing. 37:05.320 --> 37:10.400 Jujutsu is a modern version control system that's fully compatible with Git repositories. 37:10.400 --> 37:14.600 It was started a couple of years ago by Martin von Zweigbergk back at Google, but by now it has 37:14.600 --> 37:17.320 a growing open source community.
37:17.320 --> 37:21.840 You can use it in existing Git repositories, push to large 37:21.840 --> 37:26.240 forges like GitLab and GitHub, and your collaborators won't even have an idea that you're 37:26.240 --> 37:28.800 using JJ. 37:28.800 --> 37:33.520 Everyone knows that Git's user experience is not exactly the most loved one, and indeed, 37:33.520 --> 37:38.440 many people seem to prefer JJ's experience way more. 37:38.440 --> 37:40.880 It's of course not much of a surprise. 37:40.880 --> 37:45.720 The Git user interface has grown somewhat organically over the last two decades, which 37:45.720 --> 37:50.440 leads to inconsistencies and commands that simply don't feel modern. 37:50.440 --> 37:54.200 JJ started from scratch, and it took all of the lessons that Git learned the hard 37:54.200 --> 37:58.040 way directly to heart. 37:58.040 --> 38:03.440 As a Git developer, I was naturally quite curious, so I had a look at JJ quite early. 38:03.440 --> 38:09.680 I looked at it, found it confusing, and called it stupid. 38:09.680 --> 38:14.520 It just didn't make any sense to me at all, so I simply discarded it. 38:14.520 --> 38:18.560 But there was a steady influx of people who had seen the light, as they say. 38:18.560 --> 38:24.200 So I decided to eventually have another look, and that's when it finally clicked. 38:24.200 --> 38:28.880 That moment when you realize that a tool simply fixes all the UI issues of the tool 38:28.880 --> 38:33.520 that you have been developing for the last 20 years was not exactly great. 38:33.520 --> 38:38.840 But I had two options: either I could despair, or I could learn from the competition, and 38:38.840 --> 38:42.120 I chose to learn from it. 38:42.120 --> 38:45.960 There's a couple of significant departures from what Git does. 38:45.960 --> 38:52.280 First, history is malleable by default, and you can basically shape your commits as you go.
38:52.280 --> 38:56.520 It's almost as if you were permanently in an interactive rebase mode, but without all of the 38:56.520 --> 38:59.480 confusing parts. 38:59.480 --> 39:04.600 Also, when you rewrite history, dependents update automatically, so if you amend a commit, 39:04.600 --> 39:08.600 all children are rebased automatically. 39:08.600 --> 39:10.920 There is no special detached HEAD mode. 39:10.920 --> 39:16.200 In fact, you often don't even have local named branches, so you're constantly working 39:16.200 --> 39:18.960 with detached HEADs, so to say. 39:18.960 --> 39:22.480 And also, conflicts are data, not emergencies. 39:22.480 --> 39:27.120 You can commit them, and resolve them at any later point in time. 39:27.120 --> 39:28.920 These are not just nice-to-haves. 39:28.920 --> 39:32.560 They fundamentally change how you think about your commits. 39:32.560 --> 39:36.680 You stop treating them as precious artifacts, and rather start treating them as drafts 39:36.680 --> 39:39.360 that you can freely edit. 39:39.360 --> 39:45.160 As said in the intro, Git is old, so we cannot just completely revamp our UI and thus 39:45.160 --> 39:47.160 break all the workflows out there. 39:47.160 --> 39:53.000 But there are some things that we can definitely steal from JJ. 39:53.000 --> 39:57.960 The primary way to rewrite history is by using git rebase, and specifically interactive 39:57.960 --> 39:59.760 rebases. 39:59.760 --> 40:03.560 But interactive rebases make some tasks a lot harder than they have any right 40:03.560 --> 40:05.560 to be. 40:05.560 --> 40:07.840 One example is splitting up a commit. 40:07.840 --> 40:12.160 First, you need to figure out which commit you want to split up, and let's pretend 40:12.160 --> 40:16.960 we want to, for example, split up the commit "Introduce A and B". 40:16.960 --> 40:20.160 You would now start an interactive rebase. 40:20.160 --> 40:22.440 This already causes the first confusing moment.
40:22.440 --> 40:28.920 In order to edit that commit, you have to rebase on top of its parent, not the commit itself. 40:28.920 --> 40:32.960 You're now presented with an instruction sheet in your editor, where you have to manually 40:32.960 --> 40:38.040 search for your commit, edit the instruction from pick to edit, and then save. 40:38.040 --> 40:42.280 You get used to it, but it's somewhat weird. 40:42.280 --> 40:45.760 You're now put on top of the commit that you want to edit. 40:45.760 --> 40:51.160 You have to undo it, because you want to split it up and create two new commits. 40:51.160 --> 40:54.920 You now stage the first file that you want to put into the first commit, and commit 40:54.920 --> 40:55.920 it. 40:55.920 --> 41:00.320 The original commit message is kind of gone at this point in time, except if you know that 41:00.320 --> 41:08.040 you can pass --reedit-message=HEAD@{1}. 41:08.040 --> 41:11.040 Not exactly ergonomic either. 41:11.040 --> 41:15.160 We can stage the other file now, and then we create the second commit. 41:15.160 --> 41:20.440 And finally, you conclude the action by saying git rebase --continue. 41:20.440 --> 41:25.800 All of this requires 7 commands, editing an arcane instruction sheet, and some scary operations 41:25.800 --> 41:28.600 like discarding commits. 41:28.600 --> 41:32.480 And if you had other branches depending on this commit, well, they're now pointing at the 41:32.480 --> 41:33.920 old objects. 41:33.920 --> 41:39.160 As you can see, the old "Introduce A and B" commit still exists in this history, 41:39.160 --> 41:43.920 and is referenced by both of the other branches, feature A and feature B. 41:43.920 --> 41:48.560 In Jujutsu, all you have to say is jj split, and then it asks you which changes should 41:48.560 --> 41:52.960 be part of what commit, and what the commit messages should 41:52.960 --> 41:53.960 be.
41:53.960 --> 41:59.920 I kind of get why people actually prefer this workflow. 41:59.920 --> 42:04.120 As mentioned, dependent branches would have to be rebased manually after an interactive 42:04.120 --> 42:05.320 rebase. 42:05.320 --> 42:09.160 This is becoming a problem though if you want to work with stacked branches, 42:09.160 --> 42:14.040 a style of working that is becoming increasingly more popular. 42:14.040 --> 42:19.920 Let's assume you want to build a new feature that consists of a couple of logical steps. 42:19.920 --> 42:24.360 The traditional workflow typically creates a single branch that contains each of these steps 42:24.360 --> 42:26.480 as individual commits. 42:26.480 --> 42:29.320 The end result is one big merge request. 42:29.320 --> 42:33.640 This is easy for the developer, but painful for the reviewer, because they now have to 42:33.640 --> 42:38.520 read through hundreds of lines of changes. 42:38.520 --> 42:42.880 The alternative that gains more and more traction though is to have stacked branches. 42:42.880 --> 42:47.960 Instead of putting every commit into the same branch, you create a set of dependent branches. 42:47.960 --> 42:52.960 Each of these branches builds one small part of the bigger feature, and each of them uses 42:52.960 --> 42:55.640 a separate merge request. 42:55.640 --> 43:01.080 This overall of course requires more steps, but the review will now go a lot faster, because 43:01.080 --> 43:07.400 the changes that another person needs to review are much smaller overall. 43:07.400 --> 43:11.480 The problem is that it makes maintaining these stacked branches quite painful. 43:11.480 --> 43:15.360 Let's say you for example work on top of feat-auth to address review feedback. 43:15.360 --> 43:19.960 You simply fix a typo in the commit message, and then suddenly all the dependent branches 43:19.960 --> 43:24.120 will be orphaned, as they point to the old version of feat-auth.
43:24.120 --> 43:28.400 You have to manually rebase feat-api and feat-ui. 43:28.400 --> 43:36.160 Do this a few times a day, and you will quickly abandon the stacked branch workflow entirely. 43:36.160 --> 43:40.120 We aim to solve these issues and make stacked branch workflows easier with a new Git command, 43:40.120 --> 43:42.120 git history. 43:42.120 --> 43:45.440 The goal is to have a couple of opinionated subcommands. 43:45.440 --> 43:48.800 These subcommands do one thing, and they do it well. 43:48.800 --> 43:53.520 git history reword, for example, rewords the commit message of one of your commits. 43:53.520 --> 43:59.520 It's just like git commit --amend, except for an arbitrary commit in your history. 43:59.520 --> 44:00.520 git history split 44:00.520 --> 44:02.520 will work just like jj split. 44:02.520 --> 44:07.880 You give it a commit, Git asks which parts should be part of what commit, and then you type 44:07.880 --> 44:11.280 in two commit messages, nice and easy. 44:11.280 --> 44:15.640 git history absorb takes all of your staged changes, figures out automatically which 44:15.640 --> 44:20.080 commits to apply them to, and squashes them into those commits. 44:20.080 --> 44:24.080 All of these commands are heavily inspired by what JJ provides. 44:24.080 --> 44:28.200 In fact, JJ itself was also inspired by other version control systems. 44:28.200 --> 44:32.600 git history absorb, for example, and jj absorb have originally been implemented 44:32.600 --> 44:36.400 in Mercurial a long time ago already. 44:36.400 --> 44:41.040 This is, of course, only a start, and we plan to add more subcommands to it that make editing 44:41.040 --> 44:44.240 your commit history way easier going forward. 44:44.240 --> 44:49.720 Some examples include squashing a range of commits, dropping a specific commit, reordering 44:49.720 --> 44:52.400 them, and so on.
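The absorb idea, which goes back to Mercurial's hg absorb, can be modeled in a few lines. This is purely an illustrative model of the routing logic and not how any of the real implementations work: each changed line is routed to the commit that last touched it (as per blame), instead of into a hand-made fixup commit.

```python
# Toy model: each line of a file is "owned" by the commit that last
# touched it, as blame would report.
blame = {1: "c1", 2: "c1", 3: "c2", 4: "c3", 5: "c3"}  # line -> commit

def absorb(edited_lines, blame):
    """Route each edited line to the commit that owns it, yielding a
    plan of which commits the staged changes get squashed into."""
    fixups = {}
    for line in edited_lines:
        fixups.setdefault(blame[line], []).append(line)
    return fixups

# Edits on lines 2, 4 and 5 get absorbed into commits c1 and c3.
print(absorb([2, 4, 5], blame))
```

The user never has to name a target commit; the tool derives the targets from the history itself.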
44:52.400 --> 44:57.320 The more important part, though, is that we don't only aim to make recurring tasks easier. 44:57.320 --> 45:02.560 This command also knows to automatically rebase dependent branches. 45:02.560 --> 45:07.120 So let's revisit our example from earlier on by using git history instead. 45:07.120 --> 45:11.640 We again have the same state as before with git rebase. 45:11.640 --> 45:16.080 To split up the commit, we simply execute git history split with the commit ID. 45:16.080 --> 45:19.960 It now goes through all the changes one by one, and asks: should that be part of the 45:19.960 --> 45:21.960 first commit or not? 45:21.960 --> 45:28.040 So we answer the first question with yes, and the second question with no, which means 45:28.040 --> 45:32.760 that file A will be part of commit one, and file B will be part of commit two. 45:32.760 --> 45:37.040 If you confirm, Git will now ask you for two commit messages. 45:37.040 --> 45:41.760 In both cases, it also knows to retain the original commit message so that you have it 45:41.760 --> 45:43.760 as context. 45:43.760 --> 45:48.240 If we do another log, you can see that the commit has been split up into two commits, 45:48.240 --> 45:52.880 but more importantly, you can also see that all of the dependent branches, main, feature 45:52.880 --> 46:00.240 A and feature B, have been updated automatically to point to the new rewritten commits. 46:00.240 --> 46:03.800 git history has undergone a very long discussion on the Git mailing list. 46:03.800 --> 46:08.680 The first version was posted in August last year. 46:08.680 --> 46:13.920 Since then, it has been significantly reworked with a lot of bikeshedding, and is now able to 46:13.920 --> 46:17.680 handle stacked branch workflows way better.
46:17.680 --> 46:22.120 The initial architecture for git history has been merged upstream, and will likely be part of 46:22.120 --> 46:25.600 the next Git release in about three months probably. 46:25.600 --> 46:31.200 For now, it only supports rewording, but splitting up a commit will move into review next. 46:31.200 --> 46:35.760 More subcommands will follow, and in that context we will have a look at first-class conflicts 46:35.760 --> 46:37.720 as well, 46:37.720 --> 46:42.360 which are a central part of how JJ works with stacked branches. 46:42.360 --> 46:46.760 I'm also very certain that the Git project will have a deeper look at what else JJ has to 46:46.760 --> 46:48.080 offer. 46:48.080 --> 46:52.240 Not all of these changes will land in Git, because the design is simply different, 46:52.240 --> 46:56.400 and as a consequence not everything that makes sense in the JJ world also makes sense in 46:56.400 --> 46:58.080 the Git world. 46:58.080 --> 47:02.640 But I'm very certain that there will be more to come. 47:02.640 --> 47:06.640 So this has been a little bit of a whirlwind tour through what's happening and 47:06.680 --> 47:08.800 what's cooking in Git right now. 47:08.800 --> 47:12.960 I hope you have learned a bunch of new things, and have a little bit of a clearer picture 47:12.960 --> 47:16.320 of where the Git project is headed. 47:16.320 --> 47:20.760 We have been looking at the SHA-256 transition, the reftable backend to store 47:20.760 --> 47:25.600 references, upcoming changes to improve large object support, and some of the upcoming 47:25.600 --> 47:27.600 history rewriting features. 47:27.600 --> 47:32.240 If you have any feedback or questions, I'm very happy to discuss them, so please just 47:32.280 --> 47:34.880 approach me in the hallway and have a chat with me. 47:34.880 --> 47:35.880 Thanks a lot for your attention.