1 00:00:00,000 --> 00:00:06,000 2 00:00:06,000 --> 00:00:10,000 3 00:00:10,000 --> 00:00:11,000 4 00:00:11,000 --> 00:00:14,000 This presentation is delivered by the Stanford Center for Professional 5 00:00:14,000 --> 00:00:24,000 Development. 6 00:00:24,000 --> 00:00:26,000 Hey. Welcome to 7 00:00:26,000 --> 00:00:28,000 Wednesday. Things that have happened and we're talking about sorting. We've got lots of sorting 8 00:00:28,000 --> 00:00:31,000 to talk about today. Some more sorting we'll talk about on Friday. 9 00:00:31,000 --> 00:00:33,000 So I'm going to pick up on where I left off on Monday. 10 00:00:33,000 --> 00:00:37,000 And this was the code for the selection sort algorithm. 11 00:00:37,000 --> 00:00:39,000 12 00:00:39,000 --> 00:00:42,000 And, in fact, I'm actually not gonna - I'm gonna be showing the code for each of the algorithms that I'm using here, 13 00:00:42,000 --> 00:00:44,000 but I'm gonna encourage you to not 14 00:00:44,000 --> 00:00:47,000 actually get really focused on the details of the code. This is actually one 15 00:00:47,000 --> 00:00:51,000 of those places where the code is an embodiment of the idea and the idea is 16 00:00:51,000 --> 00:00:54,000 the one that actually we're interested in studying, which is the approach it takes to sorting, 17 00:00:54,000 --> 00:00:56,000 how it decides to get that data in sorted order, 18 00:00:56,000 --> 00:00:59,000 and then how that algorithm performs and the kind of details 19 00:00:59,000 --> 00:01:02,000 of where It's a plus one and minus one less than - It's actually a little bit 20 00:01:02,000 --> 00:01:04,000 less important for this particular topic. 21 00:01:04,000 --> 00:01:07,000 So what I'm gonna show you actually is a little 22 00:01:07,000 --> 00:01:08,000 demo 23 00:01:08,000 --> 00:01:13,000 of 24 00:01:13,000 --> 00:01:15,000 this code in action 25 00:01:15,000 --> 00:01:18,000 that's done using our graphics library just to show you what's going on. 26 00:01:18,000 --> 00:01:20,000 And so this is the selection sort code 27 00:01:20,000 --> 00:01:23,000 that the outer loop of which 28 00:01:23,000 --> 00:01:24,000 selects 29 00:01:24,000 --> 00:01:27,000 the smallest of what remains 30 00:01:27,000 --> 00:01:28,000 from the position 31 00:01:28,000 --> 00:01:29,000 I 32 00:01:29,000 --> 00:01:30,000 to the end of the array 33 00:01:30,000 --> 00:01:34,000 and then swaps it into position after It's done that finding. So this inner loop is 34 00:01:34,000 --> 00:01:37,000 basically just a min finder finding the minimum index element 35 00:01:37,000 --> 00:01:41,000 from I to the end. So I'm gonna watch this code step through. So you can 36 00:01:41,000 --> 00:01:42,000 see 37 00:01:42,000 --> 00:01:45,000 as it moves J down, which goes bopping by in green It's actually updating 38 00:01:45,000 --> 00:01:47,000 this kind of min, 39 00:01:47,000 --> 00:01:50,000 the blue pointer there, that says, well, what's the min I'm seeing so far? 40 00:01:50,000 --> 00:01:54,000 Early on, it was 92, then it bumped down to 45, then it bumped down to 41, then 28, 41 00:01:54,000 --> 00:01:56,000 then 8 and finally got to the end 42 00:01:56,000 --> 00:01:58,000 and found nothing smaller so it says, okay, that must be where the 43 00:01:58,000 --> 00:01:59,000 min is. 44 00:01:59,000 --> 00:02:03,000 Let me move that guy to the very front. So 45 00:02:03,000 --> 00:02:05,000 exchanges those two guys, flopping another one back in there. 46 00:02:05,000 --> 00:02:09,000 Now, we know we have the very smallest one out of the way. It's in the right spot. We have 47 00:02:09,000 --> 00:02:12,000 nothing more to do with that. There's no changes that'll need to be made to anything 48 00:02:12,000 --> 00:02:13,000 to the left of I 49 00:02:13,000 --> 00:02:15,000 and we do the same operation again. Select 50 00:02:15,000 --> 00:02:19,000 from the remaining n minus one elements, the minimum element. So watch 51 00:02:19,000 --> 00:02:20,000 the 52 00:02:20,000 --> 00:02:21,000 green, 53 00:02:21,000 --> 00:02:25,000 sort of, finger move down while the blue finger kind of stays behind 54 00:02:25,000 --> 00:02:28,000 on the smallest It's seen so far. So It's okay, so, well, then 15 is it. 55 00:02:28,000 --> 00:02:29,000 Swap 56 00:02:29,000 --> 00:02:31,000 that guy up into the second slot. 57 00:02:31,000 --> 00:02:32,000 Do it again. 58 00:02:32,000 --> 00:02:35,000 Keep doing this, right? 59 00:02:35,000 --> 00:02:36,000 Working It's way down 60 00:02:36,000 --> 00:02:40,000 and then, as you see, the array's kind of going from the left to the right, 61 00:02:40,000 --> 00:02:44,000 the smallest most elements, right? Being picked off and selected 62 00:02:44,000 --> 00:02:47,000 and moved out and then leaving the larger ones kind of over here 63 00:02:47,000 --> 00:02:50,000 in a little bit of a jumbled mess to get sorted out in the later iterations of 64 00:02:50,000 --> 00:02:53,000 those loops. 65 00:02:53,000 --> 00:02:54,000 So there is select 66 00:02:54,000 --> 00:02:55,000 and 67 00:02:55,000 --> 00:03:00,000 sort in action, right? Doing its thing. Okay. 68 00:03:00,000 --> 00:03:02,000 Let me do another little demo with you because I have - there's lots of 69 00:03:02,000 --> 00:03:03,000 ways to look 70 00:03:03,000 --> 00:03:07,000 at these things. I'm gonna do one that actually can do slightly bigger ones and 71 00:03:07,000 --> 00:03:09,000 so I'm gonna set this guy up and It's 72 00:03:09,000 --> 00:03:10,000 gonna show, in this case, 73 00:03:10,000 --> 00:03:14,000 the red moving, right? And it holding onto the smallest It's seen and updating as it 74 00:03:14,000 --> 00:03:17,000 finds smaller ones. Move it to the front 75 00:03:17,000 --> 00:03:20,000 and then I'm actually gonna speed it up a little bit which is, let's see, go a little faster. 76 00:03:20,000 --> 00:03:21,000 So you'll note that 77 00:03:21,000 --> 00:03:23,000 it 78 00:03:23,000 --> 00:03:26,000 took in this case and it does a little bit of counting. I'm gonna look at the analysis in a minute 79 00:03:26,000 --> 00:03:26,000 about 80 00:03:26,000 --> 00:03:30,000 a lot of comparisons and not so many moves. that's actually kind of the model behind 81 00:03:30,000 --> 00:03:32,000 selection sort is to do a lot of testing. 82 00:03:32,000 --> 00:03:34,000 That first iteration looks at 83 00:03:34,000 --> 00:03:39,000 every one of the n elements to find the smallest and then it does one move to pull it 84 00:03:39,000 --> 00:03:40,000 into position. 85 00:03:40,000 --> 00:03:41,000 Then it does the same thing again. 86 00:03:41,000 --> 00:03:45,000 Does another task looking at this time only n minus one elements remain, but 87 00:03:45,000 --> 00:03:49,000 doing another full set of comparisons to decide who is the smallest of what 88 00:03:49,000 --> 00:03:51,000 remains and then swapping that to the front. 89 00:03:51,000 --> 00:03:52,000 So It's taking a strategy that seems to 90 00:03:52,000 --> 00:03:56,000 indicate that comparison, right? Which It's gonna do a lot of, a very 91 00:03:56,000 --> 00:03:59,000 few number of moves as a result, because it identifies where something goes. 92 00:03:59,000 --> 00:04:01,000 It pulls it out, puts it to the front, 93 00:04:01,000 --> 00:04:03,000 and doesn't do a lot of shuffling around. 94 00:04:03,000 --> 00:04:08,000 If I've put kind of a higher number of things in here 95 00:04:08,000 --> 00:04:10,000 and let it go 96 00:04:10,000 --> 00:04:13,000 it appears to kind of move very slowly at first, right? Because It's doing a lot of work 97 00:04:13,000 --> 00:04:17,000 to find those small ones and then move them down to the front, 98 00:04:17,000 --> 00:04:20,000 but as it gets further along it will 99 00:04:20,000 --> 00:04:23,000 tend to kind of speed up toward the end and that's because in subsequent iterations, 100 00:04:23,000 --> 00:04:23,000 right? There's 101 00:04:23,000 --> 00:04:25,000 fewer of the 102 00:04:25,000 --> 00:04:28,000 elements remaining to look at and so kind of a tiny little portion of It's been sort of 103 00:04:28,000 --> 00:04:35,000 so far the first four. If I - I think if I let this go it will go way too fast. Yeah. Way too fast to 104 00:04:35,000 --> 00:04:41,000 see it. There's a lot of things I want to show you because I'm gonna 105 00:04:41,000 --> 00:04:42,000 add in the 106 00:04:42,000 --> 00:04:44,000 option for sound. 107 00:04:44,000 --> 00:04:45,000 So one 108 00:04:45,000 --> 00:04:48,000 thing to learn a little bit about how to visualize sorts, right? I think that this 109 00:04:48,000 --> 00:04:51,000 kind of technique sometimes It's easier to look at than the code is to see how It's 110 00:04:51,000 --> 00:04:54,000 putting things in order. 111 00:04:54,000 --> 00:04:57,000 Several years ago I had a student who actually was blind and these 112 00:04:57,000 --> 00:05:00,000 visualizations do squat for someone who's blind it turns out. So 113 00:05:00,000 --> 00:05:04,000 she helped me work out some ideas about how to do a 114 00:05:04,000 --> 00:05:07,000 different, sort of, interface. One that visualizes it using it when you're different senses 115 00:05:07,000 --> 00:05:11,000 would just do sound and so the change I've made to the code here 116 00:05:11,000 --> 00:05:14,000 will play a tone each time It's moving an element 117 00:05:14,000 --> 00:05:17,000 and the frequency of that tone is related to the height of the bar. So 118 00:05:17,000 --> 00:05:21,000 the smaller height bars are lower frequency, so lower tones. And then 119 00:05:21,000 --> 00:05:24,000 the taller bars have a higher frequency, higher pitch to them. 120 00:05:24,000 --> 00:05:27,000 And so you'll hear the movement of the bars 121 00:05:27,000 --> 00:05:39,000 being expressed with this tone loop so - and I'm not hearing any sound. Is my - are we 122 00:05:39,000 --> 00:05:52,000 getting sound? There we go. 123 00:05:52,000 --> 00:05:58,000 I'm not getting sound now, how about you? No, we're not 124 00:05:58,000 --> 00:06:01,000 getting any sound at all. Here we go. I'm getting sound now. How about you? Well, and that's true. It's only when they move. that's a very good point. What about - I'm running a little bit 125 00:06:01,000 --> 00:06:09,000 slower than I'm thinking. All right. I can 126 00:06:09,000 --> 00:06:12,000 get it 127 00:06:12,000 --> 00:06:13,000 128 00:06:13,000 --> 00:06:14,000 to - all right. 129 00:06:14,000 --> 00:06:16,000 Let's try. Now, we're - I feel like we're hearing something. 130 00:06:16,000 --> 00:06:19,000 we're still. 131 00:06:19,000 --> 00:06:20,000 Definitely getting sound now. Okay. 132 00:06:20,000 --> 00:06:27,000 Now It's gonna be a little loud. There we go. The 133 00:06:27,000 --> 00:06:29,000 march of the low order tones can be moved into 134 00:06:29,000 --> 00:06:34,000 place doing a lot of comparison and then you'll hear kind of the speed up I talked 135 00:06:34,000 --> 00:06:46,000 about. 136 00:06:46,000 --> 00:06:49,000 Toward the end is the later iterations. 137 00:06:49,000 --> 00:06:50,000 Finishing it up. 138 00:06:50,000 --> 00:06:54,000 So giving you a kind of a little memory for, sort of, what's the action selection sort. 139 00:06:54,000 --> 00:06:54,000 The 140 00:06:54,000 --> 00:06:57,000 small amount of tones and the kind of sparse 141 00:06:57,000 --> 00:07:00,000 distance meeting shows that It's doing a lot of work in between each of those moves, 142 00:07:00,000 --> 00:07:01,000 right? 143 00:07:01,000 --> 00:07:04,000 And then the fact that the tones kind of advance from low to high tells you that It's 144 00:07:04,000 --> 00:07:05,000 working on the 145 00:07:05,000 --> 00:07:09,000 smaller values up to the larger values. Kind of working It's way to the end and that 146 00:07:09,000 --> 00:07:14,000 speeding up it closing in on the remaining ones as it gets to the end. 147 00:07:14,000 --> 00:07:17,000 Left that around because even if you are 148 00:07:17,000 --> 00:07:20,000 can see it you can also hear it and that may help something. 149 00:07:20,000 --> 00:07:24,000 What we're gonna look at is how much work does it do? So if I go back and look at the code just 150 00:07:24,000 --> 00:07:27,000 to refresh what's going on, right? This 151 00:07:27,000 --> 00:07:28,000 inner loop is doing 152 00:07:28,000 --> 00:07:31,000 the comparisons all the way to find the min elements and so this is gonna iterate 153 00:07:31,000 --> 00:07:35,000 n minus one times on the first iteration, then n minus two, then n minus three, and all the 154 00:07:35,000 --> 00:07:38,000 way down those final iterations, three to look at, two to look at, one to 155 00:07:38,000 --> 00:07:39,000 look at 156 00:07:39,000 --> 00:07:41,000 and then one swap outside that. 157 00:07:41,000 --> 00:07:45,000 So what I actually have here is the sum of the values 158 00:07:45,000 --> 00:07:47,000 n minus one compares, 159 00:07:47,000 --> 00:07:51,000 plus one swap, but as two comparisons one swap, so one just swaps but actually 160 00:07:51,000 --> 00:07:52,000 the 161 00:07:52,000 --> 00:07:53,000 162 00:07:53,000 --> 00:07:56,000 terms I'm looking at here are the number of comparisons the sum of the numbers 163 00:07:56,000 --> 00:07:59,000 one to n minus one 164 00:07:59,000 --> 00:08:00,000 is what we're trying to compute here. 165 00:08:00,000 --> 00:08:03,000 Then if you've seen this some are the [inaudible], some are the Gaussian, and some you may all ready know how to 166 00:08:03,000 --> 00:08:07,000 solve it, but I just kind of showed you how to do the math to work it out, which 167 00:08:07,000 --> 00:08:07,000 is 168 00:08:07,000 --> 00:08:09,000 the term you're looking for is this sum. 169 00:08:09,000 --> 00:08:13,000 If you add it to itself, but rearrange the sequence of the terms so that they 170 00:08:13,000 --> 00:08:16,000 cancel each other out you'll see that the n minus one plus one gives you an 171 00:08:16,000 --> 00:08:17,000 n 172 00:08:17,000 --> 00:08:20,000 and that what you end up with is n minus one n's being added together 173 00:08:20,000 --> 00:08:22,000 is what the sum of this 174 00:08:22,000 --> 00:08:24,000 sequence against itself is 175 00:08:24,000 --> 00:08:29,000 and so we can divide by two to get the answer we're looking for, so we have a one-half 176 00:08:29,000 --> 00:08:31,000 n squared minus n term. 177 00:08:31,000 --> 00:08:34,000 Which in the big O world, right? Just comes down to n squared. 178 00:08:34,000 --> 00:08:37,000 So tell this - It's a quadratic sort, right? That we would expect that if it took a 179 00:08:37,000 --> 00:08:40,000 certain amount of time to do a hundred, 180 00:08:40,000 --> 00:08:45,000 three seconds, then if we double that input we expect it to take four times as long. So if it 181 00:08:45,000 --> 00:08:46,000 was 182 00:08:46,000 --> 00:08:50,000 three seconds before It's 12 seconds now. So growing as a parabola is a 183 00:08:50,000 --> 00:08:51,000 fairly sharp 184 00:08:51,000 --> 00:08:58,000 steep curve to get through things. All 185 00:08:58,000 --> 00:09:00,000 right. So let me give you an alternative 186 00:09:00,000 --> 00:09:03,000 algorithm to this. 187 00:09:03,000 --> 00:09:06,000 Just to kind of think, that's gonna be the theme is, well, that's one way to do it and that 188 00:09:06,000 --> 00:09:08,000 has certain properties. Let's look at another way and then maybe we'll 189 00:09:08,000 --> 00:09:12,000 have some opportunity to compare and contrast these two different approaches to 190 00:09:12,000 --> 00:09:15,000 see what kind of tradeoffs they make in terms 191 00:09:15,000 --> 00:09:16,000 of 192 00:09:16,000 --> 00:09:18,000 algorithm choices. 193 00:09:18,000 --> 00:09:20,000 So the way you might handle a deck of a cards - 194 00:09:20,000 --> 00:09:23,000 sorting a deck of cards. So if I'm handing out cards to people you're getting 195 00:09:23,000 --> 00:09:27,000 each card in turn that one way people do that is they pick up the first card and they 196 00:09:27,000 --> 00:09:31,000 kind of - you assume It's trivially sorted, right? It's in the right place. You pick up the next 197 00:09:31,000 --> 00:09:34,000 card and you decide, well, where does it go relative to the one you just had. Maybe you're 198 00:09:34,000 --> 00:09:38,000 just sorting by number order and you say, well, okay, It's greater than it and goes on this side. 199 00:09:38,000 --> 00:09:40,000 You pick up your next one and It's like it goes in between them. 200 00:09:40,000 --> 00:09:43,000 And so you're inserting each new element 201 00:09:43,000 --> 00:09:46,000 into the position of the ones that are all ready sorted. 202 00:09:46,000 --> 00:09:49,000 So if you imagine applying that same thing in terms of 203 00:09:49,000 --> 00:09:53,000 computer talk in a vector sense is you could imagine kind of taking the vector, 204 00:09:53,000 --> 00:09:56,000 assuming that the first element is sorted, then looking at the second one and 205 00:09:56,000 --> 00:10:00,000 deciding, well, where does it go relative to the ones all ready sorted? The ones to my 206 00:10:00,000 --> 00:10:01,000 left 207 00:10:01,000 --> 00:10:02,000 and 208 00:10:02,000 --> 00:10:05,000 kind of moving it into position, shuffling over to open up that space that's gonna 209 00:10:05,000 --> 00:10:06,000 go into 210 00:10:06,000 --> 00:10:08,000 and then just extending that 211 00:10:08,000 --> 00:10:11,000 as you further and further down the vector. Taking each subsequent element and 212 00:10:11,000 --> 00:10:15,000 inserting it into the ones to its left to make a sorted array kind of grow from 213 00:10:15,000 --> 00:10:17,000 the left side. 214 00:10:17,000 --> 00:10:20,000 So it grows from the left somewhat similar to selection sort, but actually it will look 215 00:10:20,000 --> 00:10:24,000 a little bit different based on how It's doing its strategy here. So 216 00:10:24,000 --> 00:10:26,000 let me 217 00:10:26,000 --> 00:10:28,000 give you the insertion sort code. 218 00:10:28,000 --> 00:10:30,000 So my outer loop 219 00:10:30,000 --> 00:10:34,000 is looking at the element at index one to the element of the final index of the raise 220 00:10:34,000 --> 00:10:36,000 side minus one 221 00:10:36,000 --> 00:10:37,000 and it 222 00:10:37,000 --> 00:10:41,000 copies that into this variable current to work with and then this inner loop is 223 00:10:41,000 --> 00:10:44,000 doing a down to loop, so It's actually backing up 224 00:10:44,000 --> 00:10:47,000 from the position J over 225 00:10:47,000 --> 00:10:48,000 starting from where you are to say, 226 00:10:48,000 --> 00:10:51,000 well, where does this - it keeps sliding this one down until It's fallen into the 227 00:10:51,000 --> 00:10:52,000 right place. 228 00:10:52,000 --> 00:10:55,000 So we'll move over the 92, 229 00:10:55,000 --> 00:10:57,000 in this case, to make 230 00:10:57,000 --> 00:10:58,000 space for the 45 231 00:10:58,000 --> 00:11:01,000 and then that's kind of the whole iteration on that first time. The next time we have to 232 00:11:01,000 --> 00:11:05,000 look at 67. Well, 67's definitely gonna go past 92, but 233 00:11:05,000 --> 00:11:07,000 not past 234 00:11:07,000 --> 00:11:10,000 45. 41 is gonna need to go three full iterations to slide all the way 235 00:11:10,000 --> 00:11:13,000 down to the front. 236 00:11:13,000 --> 00:11:14,000 74 237 00:11:14,000 --> 00:11:17,000 is just gonna go over one number. 238 00:11:17,000 --> 00:11:18,000 Just needs to slide past one. 239 00:11:18,000 --> 00:11:21,000 So on different iterations, right? A little different amount of work is being 240 00:11:21,000 --> 00:11:22,000 done, right? This loop terminates 241 00:11:22,000 --> 00:11:26,000 when the number has been slotted into position. We won't know in advance how 242 00:11:26,000 --> 00:11:29,000 far it needs to go, but we go all the way to the end 243 00:11:29,000 --> 00:11:30,000 if necessary, 244 00:11:30,000 --> 00:11:32,000 but then kind of sliding each one up 245 00:11:32,000 --> 00:11:34,000 by one as we go down. So 246 00:11:34,000 --> 00:11:38,000 that one moved all the way to the front, 87 just has one to go, eights got 247 00:11:38,000 --> 00:11:39,000 a long way to go. 248 00:11:39,000 --> 00:11:43,000 All the way down to the very front there. Sixty-seven 249 00:11:43,000 --> 00:11:45,000 goes and stops right there, 250 00:11:45,000 --> 00:11:47,000 15 almost to the bottom, 251 00:11:47,000 --> 00:11:50,000 and then 59 moving down this 252 00:11:50,000 --> 00:11:52,000 way. 253 00:11:52,000 --> 00:11:55,000 So kind of immediately you get the sense It's actually doing something different 254 00:11:55,000 --> 00:11:58,000 than selection sort, just visually, right? You're seeing a lot more movement for a 255 00:11:58,000 --> 00:11:59,000 start 256 00:11:59,000 --> 00:12:02,000 that elements are getting kind of shuffled over and making that space so It's 257 00:12:02,000 --> 00:12:06,000 definitely making a different algorithmic choice in terms of comparison versus 258 00:12:06,000 --> 00:12:06,000 moving 259 00:12:06,000 --> 00:12:09,000 than selection sort did, which is kind of a lot of looking and then a little bit of 260 00:12:09,000 --> 00:12:10,000 moving. It's doing kind of 261 00:12:10,000 --> 00:12:14,000 the moving and the looking in tandem here. 262 00:12:14,000 --> 00:12:16,000 If I want to hear how it sounds, because 263 00:12:16,000 --> 00:12:20,000 nothing's complete without knowing how it sounds, I can go 264 00:12:20,000 --> 00:12:22,000 back over here 265 00:12:22,000 --> 00:12:27,000 and let me turn it down a 266 00:12:27,000 --> 00:12:33,000 little 267 00:12:33,000 --> 00:13:00,000 bit. 268 00:13:00,000 --> 00:13:07,000 So you hear 269 00:13:07,000 --> 00:13:11,000 a lot more work in terms of move, right? Because of the signal I'll speed it up just a little bit. So, as you can see, a kind of big [inaudible] a lot of work [inaudible] as it slides that thing down into position and then finding It's home. Some will have come a long distance to travel like that last one. Other ones are very quick where they don't have so far to go in turn. I think that's it. I'll crank it up to like a, sort of, a bigger number, let's say, 270 00:13:11,000 --> 00:13:12,000 and 271 00:13:12,000 --> 00:13:13,000 turn off the sound 272 00:13:13,000 --> 00:13:15,000 and just let it go and 273 00:13:15,000 --> 00:13:18,000 you can kind of get a sense of what it looks like. It seems to move very, very quickly at 274 00:13:18,000 --> 00:13:21,000 the beginning and then kind of starts to slow down towards the end. 275 00:13:21,000 --> 00:13:27,000 If I compare that to my friend the selection sort, 276 00:13:27,000 --> 00:13:30,000 but it looks like insertion sort is way out in front and its gonna just win this 277 00:13:30,000 --> 00:13:33,000 race hands down, but, in fact, selection sort kind of makes a really 278 00:13:33,000 --> 00:13:35,000 quick 279 00:13:35,000 --> 00:13:39,000 end runaround and at the end it actually came in just a few fractions of a second 280 00:13:39,000 --> 00:13:39,000 faster by kind of 281 00:13:39,000 --> 00:13:42,000 speeding up towards the end. So It's like the tortoise and the hare. It looks 282 00:13:42,000 --> 00:13:45,000 like insertion sort's way out in front, 283 00:13:45,000 --> 00:13:48,000 but then actually selection sort manages to catch up. I'm gonna come back and look at 284 00:13:48,000 --> 00:13:53,000 these numbers in a second, but I want to go back and do a little analysis on this first. 285 00:13:53,000 --> 00:13:54,000 If you 286 00:13:54,000 --> 00:13:56,000 take a look at what the code is doing and talk about kind of what's happening, right? 287 00:13:56,000 --> 00:14:01,000 We've got a number of iterations of the outside, which is sliding me to this position. This inner 288 00:14:01,000 --> 00:14:05,000 loop is potentially looking at all of the elements to the left. 289 00:14:05,000 --> 00:14:06,000 So on the first iteration it 290 00:14:06,000 --> 00:14:11,000 looks at one. At the second iteration it looks at two. The third one three and so on until the 291 00:14:11,000 --> 00:14:14,000 final one could potentially have n minus one things to examine and move 292 00:14:14,000 --> 00:14:17,000 down. If that element was the smallest of the ones remaining it would have a long way 293 00:14:17,000 --> 00:14:18,000 to travel. 294 00:14:18,000 --> 00:14:20,000 But this inner loop, right? 295 00:14:20,000 --> 00:14:23,000 Unlike selection sort that has kind of a known factor It's actually a little bit variable 296 00:14:23,000 --> 00:14:24,000 because we don't know for sure 297 00:14:24,000 --> 00:14:25,000 how It's gonna work. 298 00:14:25,000 --> 00:14:30,000 Potentially in the worst case, right? It will do one comparison move, two 299 00:14:30,000 --> 00:14:33,000 and three and four all the way up through those 300 00:14:33,000 --> 00:14:34,000 301 00:14:34,000 --> 00:14:37,000 iterations, which gives us exactly the same Gaussian sum that selection sort 302 00:14:37,000 --> 00:14:41,000 did. One plus two plus three all the way up to n minus one tells us its gonna be n 303 00:14:41,000 --> 00:14:42,000 squared 304 00:14:42,000 --> 00:14:46,000 minus n over two, which comes down to O of n squared. 305 00:14:46,000 --> 00:14:50,000 That would be in the absolute worst case, right? Situation where it did the maximum 306 00:14:50,000 --> 00:14:52,000 amount of work. What 307 00:14:52,000 --> 00:14:53,000 input 308 00:14:53,000 --> 00:14:57,000 is the embodiment of the worst case in selection sort? 309 00:14:57,000 --> 00:14:58,000 What's it gonna 310 00:14:58,000 --> 00:15:02,000 be? If It's flipped, right? If It's totally inverted, right? So if you have the maximum element in the 311 00:15:02,000 --> 00:15:05,000 front and the smallest element in the back, right? Every one of them has to travel the maximum 312 00:15:05,000 --> 00:15:06,000 distance 313 00:15:06,000 --> 00:15:08,000 in this form to get there. What is 314 00:15:08,000 --> 00:15:09,000 the 315 00:15:09,000 --> 00:15:10,000 best case 316 00:15:10,000 --> 00:15:12,000 to do the least amount of work? 317 00:15:12,000 --> 00:15:15,000 It's all ready sorted. If It's all ready sorted then all it needs to do is verify that, oh, I don't 318 00:15:15,000 --> 00:15:20,000 need to go at all. So it actually - the inner loop only does one test to say do 319 00:15:20,000 --> 00:15:23,000 you need to move over at least one? No. Okay. So then it turns out 320 00:15:23,000 --> 00:15:26,000 it would run completely linear time. 321 00:15:26,000 --> 00:15:29,000 It will just check each element with its neighbor, realize It's all ready in the 322 00:15:29,000 --> 00:15:31,000 right place relative to the left, 323 00:15:31,000 --> 00:15:32,000 and move one. So, in fact, it will run 324 00:15:32,000 --> 00:15:35,000 totally 325 00:15:35,000 --> 00:15:38,000 quickly, right? In that case. And in the average case you might say, well, you know, 326 00:15:38,000 --> 00:15:41,000 given any sort of random permutation of it, each element probably has to go 327 00:15:41,000 --> 00:15:42,000 about 328 00:15:42,000 --> 00:15:45,000 half the distance. So I don't have to go all the way, some don't have to go very far, some will go 329 00:15:45,000 --> 00:15:46,000 in the middle. You 330 00:15:46,000 --> 00:15:50,000 would add another factor of a half onto the n squared, which ends up still 331 00:15:50,000 --> 00:15:54,000 in the big O just being lost in the noise. You say, well, It's still n squared. But it 332 00:15:54,000 --> 00:15:54,000 probably does. 333 00:15:54,000 --> 00:15:58,000 It is a little bit less of an n squared than selection sort. So if I go 334 00:15:58,000 --> 00:16:01,000 back here to my - 335 00:16:01,000 --> 00:16:08,000 well, it was counting for me. That the 336 00:16:08,000 --> 00:16:10,000 mix of operations between 337 00:16:10,000 --> 00:16:13,000 insertion sort and selection sort, so that's selection sort on the top and insertion 338 00:16:13,000 --> 00:16:14,000 sort that's beneath it, 339 00:16:14,000 --> 00:16:18,000 shows that selection sort is doing a lot of compares. In this case, I had about 340 00:16:18,000 --> 00:16:19,000 500 elements 341 00:16:19,000 --> 00:16:23,000 and It's doing about n squared over two of them, 250,000 divided 342 00:16:23,000 --> 00:16:24,000 by two there, 343 00:16:24,000 --> 00:16:27,000 and so it really is doing a full set of compares. A whole n squared kind of compares 344 00:16:27,000 --> 00:16:30,000 is doing a small number of moves. In this case, 345 00:16:30,000 --> 00:16:33,000 each element is swapped so there's actually two moves. One in, one out. 346 00:16:33,000 --> 00:16:37,000 So it does basically a number of moves that's linear relative to 347 00:16:37,000 --> 00:16:39,000 the number of elements. 348 00:16:39,000 --> 00:16:43,000 The insertion sort though is making it, sort of, a different tradeoff. It's doing a 349 00:16:43,000 --> 00:16:44,000 move and a compare 350 00:16:44,000 --> 00:16:47,000 for most elements, right? 351 00:16:47,000 --> 00:16:50,000 In tandem and then that last compare doesn't do a move with it. So there should be roughly 352 00:16:50,000 --> 00:16:51,000 tracking in the same 353 00:16:51,000 --> 00:16:52,000 thing. 354 00:16:52,000 --> 00:16:56,000 But they look closer to n squared over four. Showing that kind of one-half expected 355 00:16:56,000 --> 00:16:58,000 being thrown in there. 356 00:16:58,000 --> 00:17:00,000 But in the total, the idea is that 357 00:17:00,000 --> 00:17:03,000 it does about 100,000 compares a move. This one does 358 00:17:03,000 --> 00:17:06,000 about 100,000 compares that when you look at them in real time they tend to be very, 359 00:17:06,000 --> 00:17:10,000 very close neck in neck on most things. So if I 360 00:17:10,000 --> 00:17:14,000 do this again giving it another big chunk 361 00:17:14,000 --> 00:17:17,000 and just let it go, again, it looks like insertion sorts off to that early lead, but 362 00:17:17,000 --> 00:17:21,000 if you were a betting person you might be putting your money on insertion sort. 363 00:17:21,000 --> 00:17:24,000 But selection sort is the ultimate comeback kid 364 00:17:24,000 --> 00:17:28,000 and toward the end it just really gets a second wind. 365 00:17:28,000 --> 00:17:29,000 In this case, beat it by a 366 00:17:29,000 --> 00:17:32,000 little bit bigger fraction this time. 367 00:17:32,000 --> 00:17:35,000 If I put the data 368 00:17:35,000 --> 00:17:38,000 into partially sorted order. 369 00:17:38,000 --> 00:17:41,000 So now, in this case, I've got about half of the data all ready where It's supposed to be and 370 00:17:41,000 --> 00:17:43,000 another half that's 371 00:17:43,000 --> 00:17:45,000 been randomly permuted 372 00:17:45,000 --> 00:17:47,000 and now let it go. 373 00:17:47,000 --> 00:17:50,000 Insertion sort takes an even faster looking early lead 374 00:17:50,000 --> 00:17:53,000 and selective sort, in this case, never 375 00:17:53,000 --> 00:17:56,000 notices that the data was actually in sorted order. 376 00:17:56,000 --> 00:17:59,000 But time actually is a little bit 377 00:17:59,000 --> 00:18:02,000 artificial here and, in fact, the number of comparisons is maybe a 378 00:18:02,000 --> 00:18:03,000 better number to use here, but 379 00:18:03,000 --> 00:18:06,000 it really did a lot more work relative to what insertion sort did because insertion sort was 380 00:18:06,000 --> 00:18:09,000 more able to recognize the sortedness of the data and take advantage of it in a 381 00:18:09,000 --> 00:18:12,000 way that selection sort 382 00:18:12,000 --> 00:18:12,000 totally just 383 00:18:12,000 --> 00:18:16,000 continued doing the same amount of work always. that's kind of 384 00:18:16,000 --> 00:18:18,000 interesting to note, right? Is that 385 00:18:18,000 --> 00:18:19,000 386 00:18:19,000 --> 00:18:20,000 when we're talking 387 00:18:20,000 --> 00:18:23,000 about all these different sorting algorithms, right? That 388 00:18:23,000 --> 00:18:28,000 we have multiple evidence, because actually they really do make different tradeoffs in terms of where 389 00:18:28,000 --> 00:18:32,000 the work is done and what operation it prefers and what inputs it actually 390 00:18:32,000 --> 00:18:34,000 performs well or poorly on. 391 00:18:34,000 --> 00:18:37,000 That selection sort is good for 392 00:18:37,000 --> 00:18:40,000 it does exactly the same amount of work no matter what. If It's in sorted order, or 393 00:18:40,000 --> 00:18:44,000 reversal order, or in random order, right? It always is guaranteed to do the kind of 394 00:18:44,000 --> 00:18:46,000 same amount of work 395 00:18:46,000 --> 00:18:48,000 and you don't have any unpredictability in it. 396 00:18:48,000 --> 00:18:52,000 that's both an advantage and a disadvantage, right? On the one hand it says, well, if you knew that 397 00:18:52,000 --> 00:18:55,000 you needed this sort to take exactly this time and no better, no worse would be 398 00:18:55,000 --> 00:18:58,000 fine then actually having that reliable performer may be useful to 399 00:18:58,000 --> 00:19:00,000 know. On the other hand, 400 00:19:00,000 --> 00:19:04,000 It's interesting to know that insertion sort if you gave it to data that was almost sorted then 401 00:19:04,000 --> 00:19:07,000 it would do less work, right? Is an appealing characteristic of that. And so having 402 00:19:07,000 --> 00:19:13,000 there be some opportunity for it to perform more efficiently is nice. The other thing, also, 403 00:19:13,000 --> 00:19:16,000 is about this mix of operations. Whether it considers comparisons or moves 404 00:19:16,000 --> 00:19:19,000 are more expensive operation. For certain types of data 405 00:19:19,000 --> 00:19:21,000 those really aren't one-to-one. 406 00:19:21,000 --> 00:19:23,000 That a comparison may be a very 407 00:19:23,000 --> 00:19:26,000 cheap operation and a move may be expensive or vice versa depending on kind of the 408 00:19:26,000 --> 00:19:28,000 data that's being looked at. 409 00:19:28,000 --> 00:19:31,000 For example, comparing strings is often a bit more expensive than comparing numbers 410 00:19:31,000 --> 00:19:33,000 because comparing strings has to look at letters 411 00:19:33,000 --> 00:19:37,000 to determine when they distinguish. If you had a lot of letters in the front that were 412 00:19:37,000 --> 00:19:37,000 overlapping 413 00:19:37,000 --> 00:19:41,000 it takes, sort of, more work to distinguish at what point they divide and 414 00:19:41,000 --> 00:19:43,000 which one goes forward. 415 00:19:43,000 --> 00:19:47,000 On the other hand, moving a large data structure if it were a big structure of student 416 00:19:47,000 --> 00:19:50,000 information, right? Takes more time than moving an integer around. So depending on what the 417 00:19:50,000 --> 00:19:52,000 data that's being looked at, 418 00:19:52,000 --> 00:19:55,000 there may actually be a real reason to prefer 419 00:19:55,000 --> 00:19:57,000 using more comparisons versus moves. 420 00:19:57,000 --> 00:20:00,000 And so examples I often give for this are like if you think about in the real world, if 421 00:20:00,000 --> 00:20:04,000 you were in charge of sorting something very heavy and awkward like 422 00:20:04,000 --> 00:20:07,000 refrigerators you kind of line up the refrigerators by price 423 00:20:07,000 --> 00:20:10,000 in the warehouse or something. 424 00:20:10,000 --> 00:20:13,000 You would probably want to do something that did fewer moves, right? Moves are 425 00:20:13,000 --> 00:20:16,000 expensive. I'm picking up this refrigerator and I'm moving, I don't want to move it 426 00:20:16,000 --> 00:20:18,000 more than once, right? And so you might want to go with the selection sort where 427 00:20:18,000 --> 00:20:20,000 you go and you figure out who's the cheapest fridge, let me pull that 428 00:20:20,000 --> 00:20:22,000 one and get it over here, 429 00:20:22,000 --> 00:20:25,000 and now I'm not gonna touch it again 430 00:20:25,000 --> 00:20:28,000 rather than kind of sitting there with your insertion sort and moving the fridges 431 00:20:28,000 --> 00:20:31,000 one by one until you got it to the right spot. 432 00:20:31,000 --> 00:20:35,000 But if I were in charge of finding out who was, let's say, the fastest 433 00:20:35,000 --> 00:20:37,000 runner in the mile in this class 434 00:20:37,000 --> 00:20:41,000 you probably would not enjoy it if my strategy were be to take two of you in a 435 00:20:41,000 --> 00:20:43,000 dead heat and say run a mile and see who won. 436 00:20:43,000 --> 00:20:47,000 And now whoever won, okay, well, you get to run against the next guy. And if you win again 437 00:20:47,000 --> 00:20:49,000 you get to run against the next guy. 438 00:20:49,000 --> 00:20:50,000 Like you might just say 439 00:20:50,000 --> 00:20:53,000 hey, how about you let me get ahead of that person if I beat them once. I don't want to 440 00:20:53,000 --> 00:20:56,000 have to go through this again. Like fewer comparisons 441 00:20:56,000 --> 00:21:00,000 would certainly be preferred. So both of 442 00:21:00,000 --> 00:21:03,000 these, I would say, are pretty easy algorithms to write. 443 00:21:03,000 --> 00:21:06,000 That is certainly the strength of selection and insertion sort. 444 00:21:06,000 --> 00:21:10,000 They are quadratic algorithms which we're gonna see is 445 00:21:10,000 --> 00:21:13,000 not very tractable for large inputs, but the fact that you can write the code in 446 00:21:13,000 --> 00:21:17,000 eight lines and debug it quickly and get it working is actually a real advantage to the - so 447 00:21:17,000 --> 00:21:18,000 if you were 448 00:21:18,000 --> 00:21:20,000 in a situation where you just needed a quick and dirty sort 449 00:21:20,000 --> 00:21:24,000 these are probably the ones you're gonna turn to. 450 00:21:24,000 --> 00:21:26,000 So just some numbers on them, right? Is 451 00:21:26,000 --> 00:21:28,000 10,000 elements 452 00:21:28,000 --> 00:21:30,000 on my machine kind 453 00:21:30,000 --> 00:21:32,000 of an 454 00:21:32,000 --> 00:21:34,000 unoptimized contact was taking three seconds. 455 00:21:34,000 --> 00:21:36,000 You go up by a factor of two. 456 00:21:36,000 --> 00:21:37,000 We expected to go up by about a 457 00:21:37,000 --> 00:21:41,000 corresponding factor of four and the time it roughly did. 458 00:21:41,000 --> 00:21:43,000 Going up by another factor of two and a half again going up. 459 00:21:43,000 --> 00:21:47,000 By the time you get to 100,000 though, a selection sort is slow enough to really be 460 00:21:47,000 --> 00:21:48,000 noticeable. 461 00:21:48,000 --> 00:21:52,000 It's taking several minutes to do that kind of data 462 00:21:52,000 --> 00:21:55,000 and that means if you're really trying to sort something of 463 00:21:55,000 --> 00:21:56,000 sufficiently large magnitude 464 00:21:56,000 --> 00:22:02,000 a quadratic sort, like insertion sort or selection sort, probably won't cut it. 465 00:22:02,000 --> 00:22:07,000 So here's an insight we're gonna kind of turn on its head. 466 00:22:07,000 --> 00:22:12,000 If you double the size of the input it takes four times as long. Okay. 467 00:22:12,000 --> 00:22:14,000 I can buy that, right? I went from ten to 20 468 00:22:14,000 --> 00:22:16,000 and went up by a factor of four. 469 00:22:16,000 --> 00:22:19,000 So going in that direction it feels like this growth is really working against you, 470 00:22:19,000 --> 00:22:24,000 right? It is very quickly taking more and more time. 471 00:22:24,000 --> 00:22:26,000 Let's try to take this idea though and kind of turn it around 472 00:22:26,000 --> 00:22:29,000 and see if we can actually capitalize on the fact that if I have 473 00:22:29,000 --> 00:22:34,000 the size of the input it should take one quarter the amount of time. 474 00:22:34,000 --> 00:22:37,000 So if I had a data set of 100,000 elements 475 00:22:37,000 --> 00:22:40,000 and it was gonna take me five minutes if I try to sort it in one batch. 476 00:22:40,000 --> 00:22:43,000 If I divided it into two-50,000 element batches it 477 00:22:43,000 --> 00:22:47,000 will take just a little over a minute to do each of them. 478 00:22:47,000 --> 00:22:48,000 Well, if I had 479 00:22:48,000 --> 00:22:53,000 a way of taking a 50,000 sorted element and a 50,000 sorted input 480 00:22:53,000 --> 00:22:53,000 and 481 00:22:53,000 --> 00:22:58,000 putting them back together into 100,000 combined sorted 482 00:22:58,000 --> 00:22:58,000 collection 483 00:22:58,000 --> 00:23:00,000 and I could do it in less than three minutes, 484 00:23:00,000 --> 00:23:03,000 then I'd be ahead of the game. 485 00:23:03,000 --> 00:23:05,000 So if I divided it up, 486 00:23:05,000 --> 00:23:08,000 sorted those guys, 487 00:23:08,000 --> 00:23:09,000 and then 488 00:23:09,000 --> 00:23:11,000 worked them back together and if 489 00:23:11,000 --> 00:23:14,000 it didn't take too much time to do that step 490 00:23:14,000 --> 00:23:17,000 I could really get somewhere with this. This kind of turning it on its head is really 491 00:23:17,000 --> 00:23:20,000 useful. So 492 00:23:20,000 --> 00:23:21,000 let's talk about 493 00:23:21,000 --> 00:23:25,000 an algorithm for doing exactly that. 494 00:23:25,000 --> 00:23:29,000 I take my stack. I've got a stack of exam papers. Maybe 495 00:23:29,000 --> 00:23:31,000 It's got a couple hundred students in it. I need to 496 00:23:31,000 --> 00:23:35,000 just divide it in half and I'm actually gonna make 497 00:23:35,000 --> 00:23:38,000 a very - the easy, lazy decision about dividing it in half is basically just to 498 00:23:38,000 --> 00:23:42,000 take the top half of the stack, kind of look at it, figure out about where the middle is, 499 00:23:42,000 --> 00:23:44,000 take the top half and I hand it 500 00:23:44,000 --> 00:23:46,000 to Ed. I say, Ed, would you please sort this for me? And I take the other half and 501 00:23:46,000 --> 00:23:49,000 I hand it to Michelle and I say would you please sort this for me? 502 00:23:49,000 --> 00:23:51,000 Now, I get Ed's stack back. I get Michelle's stack back. 503 00:23:51,000 --> 00:23:53,000 They're sitting right here. All 504 00:23:53,000 --> 00:23:55,000 right. So that was good because they actually take a quarter of the time it 505 00:23:55,000 --> 00:23:57,000 would have taken me to do the whole stack anyway. 506 00:23:57,000 --> 00:23:58,000 So I'm all ready 507 00:23:58,000 --> 00:23:59,000 at half the time. 508 00:23:59,000 --> 00:24:01,000 What can I do to put them back together 509 00:24:01,000 --> 00:24:05,000 that realizes that because they're in sorted order 510 00:24:05,000 --> 00:24:10,000 there's actually some advantage to reproducing the full result 511 00:24:10,000 --> 00:24:13,000 depending on the fact that they were all ready sorted themselves? 512 00:24:13,000 --> 00:24:14,000 So if I look at the two stacks 513 00:24:14,000 --> 00:24:18,000 I can tell you this, that someone over here starts with Adams and this one over 514 00:24:18,000 --> 00:24:19,000 here starts with Abbott. 515 00:24:19,000 --> 00:24:23,000 The very first one in the output has to be one of the two top stacks, right? 516 00:24:23,000 --> 00:24:26,000 They can't be any further down, right? This is the sorted left half. This is the sorted 517 00:24:26,000 --> 00:24:27,000 right half, right? 518 00:24:27,000 --> 00:24:30,000 That the very first one of the full combined result must be one of those two and 519 00:24:30,000 --> 00:24:33,000 It's actually just the smaller of the two, right? So I look at the top two things. I 520 00:24:33,000 --> 00:24:36,000 say, Abbott, Adams, oh Abbott proceeds Adams. Okay. Well, 521 00:24:36,000 --> 00:24:39,000 I take Abbott off and I stick it over there. 522 00:24:39,000 --> 00:24:43,000 And so now that exposes Baker over here. So I've got Adams versus Baker. And I 523 00:24:43,000 --> 00:24:44,000 say, oh, well, 524 00:24:44,000 --> 00:24:47,000 which of those goes first? Well, Adams does. I pick Adams up and I stick it over 525 00:24:47,000 --> 00:24:47,000 there. 526 00:24:47,000 --> 00:24:50,000 So now that exposes, let's say, 527 00:24:50,000 --> 00:24:51,000 528 00:24:51,000 --> 00:24:52,000 Ameliglue, my old 529 00:24:52,000 --> 00:24:56,000 TA. Ameliglue and Baker. And I say, oh, well, which of those? Ameliglue. And at any given 530 00:24:56,000 --> 00:25:00,000 point where there's only two I need consider for what could be the next one 531 00:25:00,000 --> 00:25:03,000 and so as I'm doing this, what's called the merge step, I'm just taking two sorted 532 00:25:03,000 --> 00:25:06,000 lists and I'm merging. So I'm kind of keeping track of where I am 533 00:25:06,000 --> 00:25:08,000 in those two vectors and then taking 534 00:25:08,000 --> 00:25:10,000 the top of either 535 00:25:10,000 --> 00:25:11,000 to 536 00:25:11,000 --> 00:25:15,000 push onto this collection I'm building over here and I just work my way down to the 537 00:25:15,000 --> 00:25:16,000 bottom. 538 00:25:16,000 --> 00:25:20,000 So that merge step, right? Is preserving the ordering. 539 00:25:20,000 --> 00:25:23,000 Just kind of merging them together into one sorted result. 540 00:25:23,000 --> 00:25:25,000 If I could do that faster 541 00:25:25,000 --> 00:25:28,000 then I could have actually of sorted them and then I'm actually ahead of the game and there's a very good 542 00:25:28,000 --> 00:25:30,000 chance for that because that, in fact, is just the linear operation, right? I'm 543 00:25:30,000 --> 00:25:32,000 looking at each element, deciding, 544 00:25:32,000 --> 00:25:36,000 and so I will do that comparison and time to decide who goes next, right? I look at 545 00:25:36,000 --> 00:25:39,000 this versus that and I put it over there. I look this versus that and I put it over 546 00:25:39,000 --> 00:25:42,000 there. Well, I'm gonna do that until this stack has n things in it. 547 00:25:42,000 --> 00:25:45,000 All of them have been moved. So, in fact, it will take me n comparisons to 548 00:25:45,000 --> 00:25:47,000 have gotten them all out of their two separate stacks and into the 549 00:25:47,000 --> 00:25:49,000 one together. 550 00:25:49,000 --> 00:25:53,000 So it is a linear operation to do that last step and we know that linear much 551 00:25:53,000 --> 00:25:57,000 quicker to get the job done than something that's quadratic. Yeah? When you 552 00:25:57,000 --> 00:25:59,000 merge the halves are you taking the 553 00:25:59,000 --> 00:26:01,000 higher one in the alphabet or the lower one? Well, it 554 00:26:01,000 --> 00:26:05,000 typically I'm taking it in the order I'm trying to sort them into, right? Which is increasing. 555 00:26:05,000 --> 00:26:08,000 So I'll typically start at A's and work my way to Z's, right? So 556 00:26:08,000 --> 00:26:09,000 it 557 00:26:09,000 --> 00:26:10,000 turns 558 00:26:10,000 --> 00:26:11,000 out It's completely ? It doesn't really matter. 559 00:26:11,000 --> 00:26:14,000 It doesn't really matter is the truth, but the - whatever order they're sorted 560 00:26:14,000 --> 00:26:17,000 in is the order I'm trying to output them in. So, in fact, these are the 561 00:26:17,000 --> 00:26:21,000 - if they're first on the sorted piles then they'll be first on the alphabet pile. 562 00:26:21,000 --> 00:26:23,000 So whatever that first is. 563 00:26:23,000 --> 00:26:27,000 So I'm gonna actually go - look at the code for a second before we come back and do the 564 00:26:27,000 --> 00:26:29,000 diagramming. We'll 565 00:26:29,000 --> 00:26:32,000 go over here and look at the stepping part of it. 566 00:26:32,000 --> 00:26:36,000 So this is the code that's doing merge sort 567 00:26:36,000 --> 00:26:39,000 and so it has a very recursive flavor to it. 568 00:26:39,000 --> 00:26:43,000 It does a little calculation here at the beginning to decide how 569 00:26:43,000 --> 00:26:47,000 many elements are going to the left and going to the right. As I said, it does no smart 570 00:26:47,000 --> 00:26:51,000 division here. So this is called an easy split hard join. So the split process is 571 00:26:51,000 --> 00:26:52,000 very 572 00:26:52,000 --> 00:26:54,000 dumb. It says 573 00:26:54,000 --> 00:26:57,000 figure out how many elements there are, divide that in half, take the one from 574 00:26:57,000 --> 00:27:00,000 zero to the n over two and put them in one separate subvector and take the ones from 575 00:27:00,000 --> 00:27:04,000 n over two to the n and put them in a second subvector 576 00:27:04,000 --> 00:27:07,000 and then recursively sort those. So let's watch what happens here. Computes that 577 00:27:07,000 --> 00:27:09,000 and says, okay, copy a 578 00:27:09,000 --> 00:27:12,000 subvector of the first, in this case four elements, 579 00:27:12,000 --> 00:27:15,000 copy a subvector that has the second four elements, the second half. So I've got my 580 00:27:15,000 --> 00:27:17,000 left and my right half 581 00:27:17,000 --> 00:27:21,000 and then it goes ahead and makes a call under merge sort, which says 582 00:27:21,000 --> 00:27:22,000 merge that left half 583 00:27:22,000 --> 00:27:26,000 and merge that right half and then we'll see the merge together. So I'm gonna 584 00:27:26,000 --> 00:27:30,000 watch it go down because the thing that's gonna happen is when I do a merge of 585 00:27:30,000 --> 00:27:33,000 this half It's kind of like it postponed these other calls 586 00:27:33,000 --> 00:27:35,000 and it comes back in and it makes another call, 587 00:27:35,000 --> 00:27:39,000 which does another division into two halves. So taking the four that are on 588 00:27:39,000 --> 00:27:42,000 the left and dividing them into two and two 589 00:27:42,000 --> 00:27:46,000 and then it says, okay, now I merge the left half of that, 590 00:27:46,000 --> 00:27:49,000 which gets us down to this case 591 00:27:49,000 --> 00:27:52,000 where it divides them into one on the left and one on the right. 592 00:27:52,000 --> 00:27:55,000 And then these are the ones that hit my base case. 593 00:27:55,000 --> 00:27:58,000 That an array of size one is trivially sorted. 594 00:27:58,000 --> 00:28:01,000 So, in fact, the merge sort never even goes into doing any work unless there's 595 00:28:01,000 --> 00:28:03,000 at least two elements to look at. 596 00:28:03,000 --> 00:28:06,000 So when it makes this call to the 597 00:28:06,000 --> 00:28:07,000 merge the 45 it will 598 00:28:07,000 --> 00:28:09,000 say, okay, 599 00:28:09,000 --> 00:28:10,000 there's nothing to do. 600 00:28:10,000 --> 00:28:13,000 Then it'll merge the 92 and works for the 92, which also does 601 00:28:13,000 --> 00:28:14,000 nothing. 602 00:28:14,000 --> 00:28:16,000 And now the code up here's gonna flip. 603 00:28:16,000 --> 00:28:19,000 So be warned about what's gonna happen here is I'm gonna show you what the process 604 00:28:19,000 --> 00:28:21,000 of the merge looks like, 605 00:28:21,000 --> 00:28:27,000 which is gonna do the left-right copy to the output. 606 00:28:27,000 --> 00:28:29,000 So this code looks a little bit dense 607 00:28:29,000 --> 00:28:32,000 and, again, this is not the time to get really worried about what the 608 00:28:32,000 --> 00:28:36,000 details of the intricacies of the code are. I think you really want to kind of step back 609 00:28:36,000 --> 00:28:40,000 and say conceptually the process I described of taking them off 610 00:28:40,000 --> 00:28:45,000 the two piles and then merging them is very much the take home point for this. 611 00:28:45,000 --> 00:28:49,000 And so what this upper loop is doing is It's basically saying well, while the two 612 00:28:49,000 --> 00:28:51,000 stacks, the two piles, the two halves, whatever, 613 00:28:51,000 --> 00:28:54,000 each have something left in them. 614 00:28:54,000 --> 00:28:58,000 Then compare them and take the smaller one off the top. 615 00:28:58,000 --> 00:29:01,000 So It's keeping track of all these indices, right? 616 00:29:01,000 --> 00:29:05,000 The indices of the left subarray, the indices of the right subarray and the indices of 617 00:29:05,000 --> 00:29:07,000 the right output array and It's actually kind of - 618 00:29:07,000 --> 00:29:09,000 and each step is putting a new one, 619 00:29:09,000 --> 00:29:12,000 copying one from one of the left or the right 620 00:29:12,000 --> 00:29:13,000 onto the output. 621 00:29:13,000 --> 00:29:17,000 And so that upper loop there said is 45 less than 92? It is, right? 622 00:29:17,000 --> 00:29:19,000 So it copies 45 623 00:29:19,000 --> 00:29:20,000 and then 624 00:29:20,000 --> 00:29:21,000 625 00:29:21,000 --> 00:29:24,000 at this point, right? There's nothing left on the left, so it actually drops down to 626 00:29:24,000 --> 00:29:28,000 those last two pieces of code, which, actually, do to copy the remainder from 627 00:29:28,000 --> 00:29:31,000 if there's only one stack left then just dump them all on the end. 628 00:29:31,000 --> 00:29:35,000 So we'll do the dump of the end of the 92. And so 629 00:29:35,000 --> 00:29:36,000 I've reassembled it. 630 00:29:36,000 --> 00:29:39,000 And so when I get back to here then 631 00:29:39,000 --> 00:29:41,000 I have merge sorted the left side 632 00:29:41,000 --> 00:29:44,000 and then I go through the processes of merge sorting the right side. 633 00:29:44,000 --> 00:29:47,000 Which copies the 62 and the 41 down to the things and then does a 634 00:29:47,000 --> 00:29:48,000 merge back. Let 635 00:29:48,000 --> 00:29:51,000 me get to the stage where I'm starting to do a little bit bigger merges. 636 00:29:51,000 --> 00:29:54,000 So here I am with 637 00:29:54,000 --> 00:29:57,000 indices on my left and my right side and then my 638 00:29:57,000 --> 00:30:00,000 output index and It's like, okay, is 45 less than 41? 639 00:30:00,000 --> 00:30:01,000 If so 640 00:30:01,000 --> 00:30:05,000 then take 41 here and kind of advance P2 over and still see the 641 00:30:05,000 --> 00:30:07,000 41 go across 642 00:30:07,000 --> 00:30:10,000 and it moved both the P and the P2 643 00:30:10,000 --> 00:30:13,000 up an index to say, okay, now we're ready to pick the next one. So it 644 00:30:13,000 --> 00:30:15,000 looks at P1 versus P2's, 645 00:30:15,000 --> 00:30:20,000 decides It's pulling from the left side this time, and then 646 00:30:20,000 --> 00:30:24,000 now it still has the last most members of the two piles to look at, takes from the 647 00:30:24,000 --> 00:30:27,000 right side, and then we'll just dump the end of the 648 00:30:27,000 --> 00:30:29,000 other side. So then in 649 00:30:29,000 --> 00:30:32,000 going through this process to do it all again, kind of break it all the way down. So It's just 650 00:30:32,000 --> 00:30:35,000 using recursion at every stage. So at the small stages It's a little bit hard to figure out 651 00:30:35,000 --> 00:30:38,000 what's going on, but once it gets back to here, right? It's just doing the big merge, let it go 652 00:30:38,000 --> 00:30:40,000 a little too fast there, 653 00:30:40,000 --> 00:30:43,000 to build it back up. 654 00:30:43,000 --> 00:30:46,000 So there's one thing about merge sort I should mention, 655 00:30:46,000 --> 00:30:48,000 which is merge 656 00:30:48,000 --> 00:30:52,000 sort is using extra storage. 657 00:30:52,000 --> 00:30:53,000 And I'm gonna 658 00:30:53,000 --> 00:30:58,000 get to a stage where I can explain why that's necessary. But selection 659 00:30:58,000 --> 00:31:01,000 sort and insertion sort both work with what's called in place 660 00:31:01,000 --> 00:31:04,000 and that means that they are 661 00:31:04,000 --> 00:31:07,000 using the one copy of the vector and just rearranging things within it so it doesn't 662 00:31:07,000 --> 00:31:12,000 require any auxiliary storage to copy things over and back and whatnot. 663 00:31:12,000 --> 00:31:16,000 That it can do all the operations on one vector with a little bit of a few extra 664 00:31:16,000 --> 00:31:17,000 variables around. 665 00:31:17,000 --> 00:31:20,000 That merge sort is really making copies. 666 00:31:20,000 --> 00:31:21,000 That it 667 00:31:21,000 --> 00:31:24,000 divides it into these two subarrays that are distinct from the original array and copies them 668 00:31:24,000 --> 00:31:28,000 back. So it copies the data away and then it copies it back in. 669 00:31:28,000 --> 00:31:31,000 And the reason for that actually is totally related to what's happening 670 00:31:31,000 --> 00:31:35,000 in this merge step. That 671 00:31:35,000 --> 00:31:37,000 if I had 672 00:31:37,000 --> 00:31:38,000 - 673 00:31:38,000 --> 00:31:39,000 well, actually this is not the 674 00:31:39,000 --> 00:31:43,000 step that's gonna show it. I'm gonna show it on this side. That 675 00:31:43,000 --> 00:31:48,000 if it were trying to write over the same array - oh, now I 676 00:31:48,000 --> 00:31:51,000 made it go too fast and that was really very annoying. All right. Let me see if I 677 00:31:51,000 --> 00:31:53,000 can get 678 00:31:53,000 --> 00:31:57,000 it one more time. Do what 679 00:31:57,000 --> 00:32:01,000 I wanted it. Is that when It's doing the copying, if it were copying on top of 680 00:32:01,000 --> 00:32:10,000 itself it would end up kind of destroying parts of what It's working on. Oh, It's - I see what. 681 00:32:10,000 --> 00:32:14,000 we're gonna get this mistake the whole time. Okay. 682 00:32:14,000 --> 00:32:18,000 Is that as it was doing the copy, right? If It's - if this really were 683 00:32:18,000 --> 00:32:20,000 sitting up here and this really were sitting up her and if we were pulling, for 684 00:32:20,000 --> 00:32:22,000 example, from the left side we 685 00:32:22,000 --> 00:32:25,000 would be overriding something that was all ready - pulling from the right side, I'm sorry. 686 00:32:25,000 --> 00:32:28,000 We'd be overriding something that was on the left side. 687 00:32:28,000 --> 00:32:30,000 And so if it 688 00:32:30,000 --> 00:32:32,000 did that it would have to have some other strategy for then, well, where did it 689 00:32:32,000 --> 00:32:36,000 put this one that it was overriding? Like if It's pulling from the left side It's actually 690 00:32:36,000 --> 00:32:40,000 fine for it to override in the output array, but otherwise, right? It would be writing on something 691 00:32:40,000 --> 00:32:41,000 it was gonna need later. 692 00:32:41,000 --> 00:32:44,000 So the easiest thing for merge sort to do is just to move them aside, work on 693 00:32:44,000 --> 00:32:48,000 them in a temporary scratch base, and then copy them back. 694 00:32:48,000 --> 00:32:51,000 There are ways to make merge sort in place, but the code gets quite a bit more 695 00:32:51,000 --> 00:32:54,000 complicated to do that. So your standard merge sort algorithm does use some 696 00:32:54,000 --> 00:32:57,000 auxiliary storage that's proportional to the size of the array. 697 00:32:57,000 --> 00:33:00,000 So that ends up being 698 00:33:00,000 --> 00:33:02,000 a factor in situations where you're 699 00:33:02,000 --> 00:33:06,000 managing a very large data set where, in fact, maybe it requires, 700 00:33:06,000 --> 00:33:07,000 right? 701 00:33:07,000 --> 00:33:11,000 A large amount of memory all ready for the ones that making a duplicate copy of it to 702 00:33:11,000 --> 00:33:12,000 work on 703 00:33:12,000 --> 00:33:16,000 may actually cost you more than you can afford. So merge sort sometimes is ruled 704 00:33:16,000 --> 00:33:18,000 out just because of its memory requirements 705 00:33:18,000 --> 00:33:26,000 despite the fact that it has a performance advantage over insertion sort and selection sort. What 706 00:33:26,000 --> 00:33:32,000 does it sound like though? that's what you really want to know. So 707 00:33:32,000 --> 00:33:36,000 let's 708 00:33:36,000 --> 00:33:39,000 first just watch it do It's thing 709 00:33:39,000 --> 00:33:42,000 and so you'll see it kind of building up these sorted subarrays, kind of 710 00:33:42,000 --> 00:33:44,000 working on the left half, and then 711 00:33:44,000 --> 00:33:47,000 postponing the right and coming back to it and so you'll see little pieces of it 712 00:33:47,000 --> 00:33:50,000 and the little merge steps as it goes along. 713 00:33:50,000 --> 00:33:53,000 And then you can get a little glimpse 714 00:33:53,000 --> 00:33:54,000 of 715 00:33:54,000 --> 00:33:58,000 kind of the divisions down there. Let me actually run it again in a little bit bigger thing. And 716 00:33:58,000 --> 00:34:00,000 I'll turn this on in just a second. Just like 717 00:34:00,000 --> 00:34:03,000 as it goes faster and faster let me make my sound go 718 00:34:03,000 --> 00:34:05,000 and we'll see 719 00:34:05,000 --> 00:34:07,000 how much we can stand of it. 720 00:34:07,000 --> 00:34:24,000 It's pretty noisy you'll discover. Volume, good volume? Oh, yeah. Yeah. A lot of noise. Oh, yeah. 721 00:34:24,000 --> 00:34:41,000 Okay. 722 00:34:41,000 --> 00:34:46,000 So you definitely get the kind of 1960's sound tone generation 723 00:34:46,000 --> 00:34:49,000 there, but you get this, I mean, you can hear the merge step, right? Is very 724 00:34:49,000 --> 00:34:51,000 distinguishable from the other activity. It's doing the kind of smaller and smaller 725 00:34:51,000 --> 00:34:54,000 subarrays it sounds just a little bit like noise, but as you see the 726 00:34:54,000 --> 00:34:59,000 larger subarrays being joined this very clear merging 727 00:34:59,000 --> 00:35:02,000 sound emerges from that that you can hear and see that in the very end, sort of, 728 00:35:02,000 --> 00:35:06,000 one big long merge of taking the two piles and joining them into one. 729 00:35:06,000 --> 00:35:10,000 A lot of noise, right? So that should tell you that there is a 730 00:35:10,000 --> 00:35:14,000 pretty good amount of movement going on. 731 00:35:14,000 --> 00:35:16,000 Question? [Inaudible] It's merging them after they've been sorted. 732 00:35:16,000 --> 00:35:20,000 So you'll - when you see a merge you'll always see two sorted subarrays being joined into one 733 00:35:20,000 --> 00:35:23,000 larger sorted subarray. So at the very end, for example, you'll have these two sorted halves that 734 00:35:23,000 --> 00:35:27,000 are being merged down. And how much is the sorting [inaudible]? 735 00:35:27,000 --> 00:35:31,000 Well, It's recursion, right? It's like it sorts the 736 00:35:31,000 --> 00:35:34,000 half, which sorts the quarters, which sorts the A's. And so at any given point what 737 00:35:34,000 --> 00:35:37,000 it really looks like It's sorting is these little one element arrays, which are being merged into a two 738 00:35:37,000 --> 00:35:40,000 element array. Like all the work is being done in merging really. 739 00:35:40,000 --> 00:35:44,000 That sorting is kind of funny. When does it actually sort anything? Like never actually. 740 00:35:44,000 --> 00:35:48,000 It only merges things and by dividing them all the way down into all these one 741 00:35:48,000 --> 00:35:51,000 element arrays the merging of them it says, well, here's these trivially sorted things. 742 00:35:51,000 --> 00:35:53,000 Make a two element sorted thing out of them 743 00:35:53,000 --> 00:35:55,000 and that you have these two 744 00:35:55,000 --> 00:35:58,000 element things, merge them into one-four element thing, which you merge into an eight 745 00:35:58,000 --> 00:36:03,000 element thing and all the way back. So all of the work is really done in the merge step. Kind of 746 00:36:03,000 --> 00:36:06,000 on the sorting angle it actually is deferring it 747 00:36:06,000 --> 00:36:09,000 down to the very bottom. So that's a kind of odd thing to do, right? It really just 748 00:36:09,000 --> 00:36:11,000 divides it all up into one 749 00:36:11,000 --> 00:36:14,000 - a hundred piles, each of size one and then kind of 750 00:36:14,000 --> 00:36:18,000 joins those into one pile, these into one pile, these into one pile, and so now I have 50 piles, 751 00:36:18,000 --> 00:36:22,000 right? That are each a size two. Now I join them into 25 piles of size four 752 00:36:22,000 --> 00:36:23,000 and then 753 00:36:23,000 --> 00:36:26,000 12 piles of size eight and so on. 754 00:36:26,000 --> 00:36:30,000 So this is the code for the outer point of the merge algorithm. 755 00:36:30,000 --> 00:36:34,000 I'm actually not gonna look at the merge algorithm very in depth. I'm really more interested in the kind of 756 00:36:34,000 --> 00:36:37,000 outer point of this algorithm and so the steps that are going on here is being a 757 00:36:37,000 --> 00:36:41,000 little bit of constant work. Let me 758 00:36:41,000 --> 00:36:42,000 actually - 759 00:36:42,000 --> 00:36:46,000 we do a copy operation that copies out the left half and the right half. Both of 760 00:36:46,000 --> 00:36:49,000 those operations together require linear time. 761 00:36:49,000 --> 00:36:51,000 So I've looked at every element and I've 762 00:36:51,000 --> 00:36:53,000 copied the first half onto something, the second 763 00:36:53,000 --> 00:36:57,000 half onto something. So that took - I looked at every element in that process. I 764 00:36:57,000 --> 00:36:59,000 make two calls on the left and the right side 765 00:36:59,000 --> 00:37:03,000 and then I do a merge on the end. We talked earlier about how that was 766 00:37:03,000 --> 00:37:06,000 linear and the number of elements because as I build the two piles into one 767 00:37:06,000 --> 00:37:07,000 sorted pile 768 00:37:07,000 --> 00:37:10,000 every element is touched in that process 769 00:37:10,000 --> 00:37:13,000 as it gets chosen to be pulled out. So 770 00:37:13,000 --> 00:37:15,000 linear divide 771 00:37:15,000 --> 00:37:18,000 and a linear join and then there's this cost in the middle that I'd have to kind of 772 00:37:18,000 --> 00:37:21,000 sort out what's happening, which is 773 00:37:21,000 --> 00:37:24,000 there's sort of n work being done at this level, 774 00:37:24,000 --> 00:37:26,000 but then I make two recursive calls 775 00:37:26,000 --> 00:37:29,000 that should take time n over two. 776 00:37:29,000 --> 00:37:30,000 So the input they're working on is 777 00:37:30,000 --> 00:37:32,000 half, again, as big as the one I have. 778 00:37:32,000 --> 00:37:35,000 How much time do they take? Well, there's my recurrence relation that allows me 779 00:37:35,000 --> 00:37:37,000 to express it. T of n is n, 780 00:37:37,000 --> 00:37:39,000 so at this level this kind of represents the 781 00:37:39,000 --> 00:37:41,000 copy step and the merge step 782 00:37:41,000 --> 00:37:46,000 at this level plus the work it took to get the two halves in sorted order. So I'm gonna 783 00:37:46,000 --> 00:37:49,000 show you a slightly different way of looking at 784 00:37:49,000 --> 00:37:51,000 recursive analysis just to kind of 785 00:37:51,000 --> 00:37:53,000 give 786 00:37:53,000 --> 00:37:56,000 you different tools for thinking about how to do this. I showed you last time how to 787 00:37:56,000 --> 00:37:58,000 do the repeated substitution 788 00:37:58,000 --> 00:38:02,000 and generalization. The pattern - I'm gonna show you this way to do it with a little bit of tree that kind of draws 789 00:38:02,000 --> 00:38:05,000 out the recursive calls and what's happening in them. 790 00:38:05,000 --> 00:38:08,000 That the merge sort of an input of size n, 791 00:38:08,000 --> 00:38:11,000 does n work at that level? The copy in the merge step, 792 00:38:11,000 --> 00:38:12,000 right? For there 793 00:38:12,000 --> 00:38:14,000 plus it does 794 00:38:14,000 --> 00:38:15,000 two calls 795 00:38:15,000 --> 00:38:17,000 to merge sort of n over two. 796 00:38:17,000 --> 00:38:21,000 Well, if I look at each of those calls I can say, well, they contribute 797 00:38:21,000 --> 00:38:24,000 an n squared and an n squared on each side. So this one has n squared copy and merge. 798 00:38:24,000 --> 00:38:27,000 This one has n squared copy and merge 799 00:38:27,000 --> 00:38:29,000 and so, in effect, I have 800 00:38:29,000 --> 00:38:31,000 another n squared plus itself there 801 00:38:31,000 --> 00:38:34,000 and then this level, right? Looks at the n over four, which has four n over 802 00:38:34,000 --> 00:38:38,000 four components. So that actually at each level in the tree 803 00:38:38,000 --> 00:38:42,000 that every element, right? Is being processed in one of those subcalls 804 00:38:42,000 --> 00:38:46,000 across it. So every element is up here and they're in two subgroups and then four 805 00:38:46,000 --> 00:38:47,000 subgroups and then eight subgroups. 806 00:38:47,000 --> 00:38:49,000 And that each element is 807 00:38:49,000 --> 00:38:51,000 copied and merged in its own subgroup 808 00:38:51,000 --> 00:38:54,000 at each level of the recursion. So that kind of gives me this intuition that 809 00:38:54,000 --> 00:38:57,000 there's n work being done 810 00:38:57,000 --> 00:38:59,000 on every level, 811 00:38:59,000 --> 00:39:02,000 every element copied and merged as part of that there. 812 00:39:02,000 --> 00:39:05,000 And so then what we need to complete this analysis is just to know how deep this 813 00:39:05,000 --> 00:39:08,000 tree grows. How far we get down 814 00:39:08,000 --> 00:39:12,000 in the recursion before we hit the base case. So 815 00:39:12,000 --> 00:39:15,000 we're dividing by two each time. 816 00:39:15,000 --> 00:39:18,000 N over the two, over two to the second, to the third, to the fourth, and so on. 817 00:39:18,000 --> 00:39:21,000 So at any given level K down, right? 818 00:39:21,000 --> 00:39:24,000 We have n divided by two to the K. What we want 819 00:39:24,000 --> 00:39:25,000 to solve for 820 00:39:25,000 --> 00:39:28,000 is where n over two to the K equals one. 821 00:39:28,000 --> 00:39:33,000 So we've gotten to this smallest case where we have those trivially sorted one element 822 00:39:33,000 --> 00:39:34,000 inputs to look at 823 00:39:34,000 --> 00:39:37,000 and so we just do a little math here, rearrange that, right? Divide by two to the K 824 00:39:37,000 --> 00:39:41,000 or multiply by two to the K both sides when n equals two to the K. 825 00:39:41,000 --> 00:39:43,000 Take the log base two of both sides and it 826 00:39:43,000 --> 00:39:45,000 will tell me that K is log base two of n. 827 00:39:45,000 --> 00:39:49,000 That I can divide n by K by two K times, 828 00:39:49,000 --> 00:39:51,000 where K is the log base two of n, before 829 00:39:51,000 --> 00:39:55,000 it bottoms out with those one element 830 00:39:55,000 --> 00:39:57,000 vectors. 831 00:39:57,000 --> 00:40:03,000 So log n levels, n for level tells me the whole thing is n log n. So n log 832 00:40:03,000 --> 00:40:07,000 n is a function you may not have a lot of intuition with. You may not have seen 833 00:40:07,000 --> 00:40:07,000 it 834 00:40:07,000 --> 00:40:10,000 enough to kind of know what It's curve looks like, 835 00:40:10,000 --> 00:40:13,000 but if you remember how the logarithmic curve looks, right? Which is a very 836 00:40:13,000 --> 00:40:15,000 slow growing 837 00:40:15,000 --> 00:40:17,000 almost flat line and 838 00:40:17,000 --> 00:40:19,000 then n being linear 839 00:40:19,000 --> 00:40:23,000 it - the kind of combination of the two It's called the linear rhythmic 840 00:40:23,000 --> 00:40:24,000 term here 841 00:40:24,000 --> 00:40:26,000 is just a little bit more than linear. 842 00:40:26,000 --> 00:40:30,000 Not a lot more, but it grows a little bit more steeply than the state of 843 00:40:30,000 --> 00:40:30,000 linear was, 844 00:40:30,000 --> 00:40:34,000 but not nearly, right? As sharply as something that's quadratic. 845 00:40:34,000 --> 00:40:37,000 So if we look at some times and 846 00:40:37,000 --> 00:40:40,000 let selection sort compared to merge sort, right? On an 847 00:40:40,000 --> 00:40:42,000 input of 10,000, right? 848 00:40:42,000 --> 00:40:46,000 Took a fraction, right? To do a merge sort than it did to do a selection sort 849 00:40:46,000 --> 00:40:48,000 and as it grows, 850 00:40:48,000 --> 00:40:49,000 right? So if we go from 851 00:40:49,000 --> 00:40:53,000 20,000 to 50,000 we've a little bit more than doubled it 852 00:40:53,000 --> 00:40:55,000 that the merge sort times, right? Went up a little bit 853 00:40:55,000 --> 00:40:58,000 more than a factor of two 854 00:40:58,000 --> 00:41:00,000 in growing in these things. Not quite doubling. 855 00:41:00,000 --> 00:41:03,000 A little bit more than doubling, right? Because of the logarithmic term that's 856 00:41:03,000 --> 00:41:06,000 being added there, but 857 00:41:06,000 --> 00:41:09,000 growing slowly enough that you can start imaging using a sort like 858 00:41:09,000 --> 00:41:13,000 merge sort on an input of a million elements in a way that you cannot on 859 00:41:13,000 --> 00:41:17,000 selection sort, right? Selection sort - my estimate I did not run it to find this out, 860 00:41:17,000 --> 00:41:20,000 right? Based on the early times I can predict it'll be about eight hours for it 861 00:41:20,000 --> 00:41:21,000 to sort a million elements, 862 00:41:21,000 --> 00:41:26,000 taking just a few seconds, right? For merge sort to do that same input. So 863 00:41:26,000 --> 00:41:29,000 a very big difference, right? 864 00:41:29,000 --> 00:41:30,000 865 00:41:30,000 --> 00:41:32,000 From the n square to n log n that makes it so that 866 00:41:32,000 --> 00:41:35,000 if you have a sufficiently large data set, right? You're gonna have to look 867 00:41:35,000 --> 00:41:37,000 to an n log n sort 868 00:41:37,000 --> 00:41:40,000 where an n squared sort would just not work out for you. 869 00:41:40,000 --> 00:41:44,000 I'll mention here that actually n log n is the theoretical boundary 870 00:41:44,000 --> 00:41:48,000 for what a general purpose sort algorithm can do. So 871 00:41:48,000 --> 00:41:49,000 that 872 00:41:49,000 --> 00:41:52,000 our search from here will have to be a quest for 873 00:41:52,000 --> 00:41:55,000 perhaps an n log n that competes with merge sort and maybe exceeds it in some 874 00:41:55,000 --> 00:41:59,000 ways by having lower constant factors to it, but we won't get a better big O for 875 00:41:59,000 --> 00:42:03,000 a general purpose sorting algorithm. If we have to sort kind of any amount of data 876 00:42:03,000 --> 00:42:04,000 and any permutation 877 00:42:04,000 --> 00:42:08,000 we just - and it has to work for all cases, then n log n is the best we can 878 00:42:08,000 --> 00:42:10,000 do in terms of a big O. 879 00:42:10,000 --> 00:42:11,000 Let me show you these guys kind 880 00:42:11,000 --> 00:42:14,000 of run a little race because 881 00:42:14,000 --> 00:42:15,000 there's nothing more important than having a 882 00:42:15,000 --> 00:42:19,000 reason to bet in class. So if I put 883 00:42:19,000 --> 00:42:22,000 insertion and selection and merge sort all up against one another. 884 00:42:22,000 --> 00:42:24,000 Let's turn off the sound. I 885 00:42:24,000 --> 00:42:33,000 really don't want to hear it all go. Okay. 886 00:42:33,000 --> 00:42:37,000 So merge sort just really smoking and 887 00:42:37,000 --> 00:42:39,000 insertion sort and selection sort 888 00:42:39,000 --> 00:42:42,000 not giving up early, but definitely doing a lot more work. 889 00:42:42,000 --> 00:42:45,000 So if you look at the numbers here in terms of comparisons and moves that are 890 00:42:45,000 --> 00:42:47,000 being done 891 00:42:47,000 --> 00:42:52,000 in the merge sort case, right? So I have 500 elements here 892 00:42:52,000 --> 00:42:55,000 is substantially less, right? Than the quadratic terms that we're seeing on 893 00:42:55,000 --> 00:42:59,000 the comparison and move counts, right? For us in selection sort and 894 00:42:59,000 --> 00:43:02,000 insertion sort. So this is still on something fairly small, right? Five hundred elements, right? That you get 895 00:43:02,000 --> 00:43:06,000 into the thousands and millions, right? The 896 00:43:06,000 --> 00:43:10,000 gap between them just widens immensely. 897 00:43:10,000 --> 00:43:12,000 Does merge sort 898 00:43:12,000 --> 00:43:18,000 make any good sense out of things being all ready sorted? So 899 00:43:18,000 --> 00:43:20,000 thinking about what you know about the algorithm, 900 00:43:20,000 --> 00:43:23,000 does the fact that It's mostly sorted or all ready sorted 901 00:43:23,000 --> 00:43:29,000 provide an advantage to the way merge sort works? Nope, 902 00:43:29,000 --> 00:43:30,000 nope. Just not at 903 00:43:30,000 --> 00:43:34,000 all, right? In fact, I can make the data actually totally sorted for that matter, 904 00:43:34,000 --> 00:43:35,000 right? 905 00:43:35,000 --> 00:43:37,000 And say go ahead and sort and see what happens, 906 00:43:37,000 --> 00:43:38,000 right? And It's like 907 00:43:38,000 --> 00:43:38,000 908 00:43:38,000 --> 00:43:39,000 insertion sort 909 00:43:39,000 --> 00:43:43,000 did 511 comparisons, zero moves, realized everything was in sorted 910 00:43:43,000 --> 00:43:45,000 order and finished very quickly. 911 00:43:45,000 --> 00:43:48,000 Merge sort still did a lot of work, thousands of comparison moves. 912 00:43:48,000 --> 00:43:49,000 It did a 913 00:43:49,000 --> 00:43:52,000 slightly fewer number of compares than it would typically do. that's because when 914 00:43:52,000 --> 00:43:55,000 it, for example, divided into the left half and the right half 915 00:43:55,000 --> 00:43:58,000 it had all the smaller elements on the left, all the larger elements on the 916 00:43:58,000 --> 00:43:59,000 right, 917 00:43:59,000 --> 00:44:02,000 and it will sit thee and compare to realize that all of the left elements go first. Then it'll kind of just 918 00:44:02,000 --> 00:44:06,000 dump the remaining ones on. So it actually shaves off a few of the comparisons in the 919 00:44:06,000 --> 00:44:07,000 merge step, 920 00:44:07,000 --> 00:44:08,000 but it doesn't really 921 00:44:08,000 --> 00:44:11,000 provide any real advantage. It still kind of moves them away and moves them back and does all 922 00:44:11,000 --> 00:44:15,000 the work. And then selection sort just still taking 923 00:44:15,000 --> 00:44:17,000 its favorite amount of time, 924 00:44:17,000 --> 00:44:19,000 which is, yeah, I'll look at everything. 925 00:44:19,000 --> 00:44:23,000 That might be the smallest, but I'm not taking any chances. I'm gonna look through 926 00:44:23,000 --> 00:44:25,000 all of them to make sure before I 927 00:44:25,000 --> 00:44:28,000 decide to keep it where it was. If 928 00:44:28,000 --> 00:44:33,000 I put them in reverse order, just because I can, 929 00:44:33,000 --> 00:44:35,000 watch insertion sort 930 00:44:35,000 --> 00:44:38,000 bog down into It's worst case and that gives selection sort a chance to, like, show It's 931 00:44:38,000 --> 00:44:40,000 metal and there selection 932 00:44:40,000 --> 00:44:41,000 sort showing okay, well, 933 00:44:41,000 --> 00:44:44,000 you give me my absolute worst input I definitely do have a little bit of a hard time 934 00:44:44,000 --> 00:44:46,000 with it. But 935 00:44:46,000 --> 00:44:51,000 merge sort still just doing its thing. What is gonna make all the sound go on? Just because we can. 936 00:44:51,000 --> 00:44:57,000 Because we have two minutes and I'm not gonna sort 937 00:44:57,000 --> 00:45:08,000 quick sort. Ah, you're going. They're duking it out. Oh, no. That sounds like 938 00:45:08,000 --> 00:45:11,000 a cartoon, sort of, like a race. All right. I could watch this thing all day. In fact, I often spend all 939 00:45:11,000 --> 00:45:13,000 day. 940 00:45:13,000 --> 00:45:16,000 I show this to my kids. I'm still waiting for them to come up with the next 941 00:45:16,000 --> 00:45:18,000 great sort algorithm, but so far 942 00:45:18,000 --> 00:45:20,000 really It's not their thing. 943 00:45:20,000 --> 00:45:22,000 So 944 00:45:22,000 --> 00:45:25,000 I will give you a little clue about what we're gonna do 945 00:45:25,000 --> 00:45:27,000 next. we're gonna talk about 946 00:45:27,000 --> 00:45:29,000 a different recursive algorithm. Same 947 00:45:29,000 --> 00:45:31,000 divide and conquer 948 00:45:31,000 --> 00:45:34,000 strategy kind of overall, but kind of 949 00:45:34,000 --> 00:45:37,000 taking a different tactic about which part to make easy and which part to make 950 00:45:37,000 --> 00:45:38,000 hard. 951 00:45:38,000 --> 00:45:41,000 So quick sort, right? As I said, is this easy split hard join. We divided them into 952 00:45:41,000 --> 00:45:42,000 953 00:45:42,000 --> 00:45:45,000 half using sort of no smart information whatsoever and then all of the 954 00:45:45,000 --> 00:45:48,000 work was done in that join. I've got these two sorted piles. I've gotta get them back into 955 00:45:48,000 --> 00:45:50,000 order. How do I work it out? 956 00:45:50,000 --> 00:45:55,000 Well, the quick sort algorithm is also recursive, also kind of a split join 957 00:45:55,000 --> 00:45:55,000 strategy, 958 00:45:55,000 --> 00:45:59,000 but it does more work up front that's not even in the split phase. If I kind 959 00:45:59,000 --> 00:46:02,000 of did a more intelligent division into two 960 00:46:02,000 --> 00:46:06,000 halves then I could make it easier on me in the join phase. 961 00:46:06,000 --> 00:46:10,000 And It's strategy for the split is to decide what's the lower half and what's 962 00:46:10,000 --> 00:46:14,000 the upper half. So if I were looking at a set of test papers I might put the 963 00:46:14,000 --> 00:46:17,000 names A through M over here. So go through the entire pile 964 00:46:17,000 --> 00:46:21,000 and do a quick assessment of are you in the upper half or the lower half? Oh, you're 965 00:46:21,000 --> 00:46:24,000 in the lower, you're in the upper, these two go in the lower, these two go in the 966 00:46:24,000 --> 00:46:25,000 upper. I 967 00:46:25,000 --> 00:46:30,000 examine all of them and I get all of the A through M's over here and all of the N through Z's over there and 968 00:46:30,000 --> 00:46:32,000 then I recursively sort those. 969 00:46:32,000 --> 00:46:35,000 So I get the A to M's all worked out and I get the N through the Z's all worked out. 970 00:46:35,000 --> 00:46:37,000 Then the task of joining them 971 00:46:37,000 --> 00:46:38,000 is totally trivial, 972 00:46:38,000 --> 00:46:42,000 right? I've got A through M. I've got N through Z. Well, 973 00:46:42,000 --> 00:46:45,000 you just push them together and there's actually no 974 00:46:45,000 --> 00:46:49,000 comparing and looking and merging and whatnot that's needed. That join step is 975 00:46:49,000 --> 00:46:52,000 where we get the benefit of all of the work we did in the front end. 976 00:46:52,000 --> 00:46:54,000 That split step though is a little hard. 977 00:46:54,000 --> 00:46:56,000 So we'll come back in on Friday and we'll talk about 978 00:46:56,000 --> 00:47:00,000 how to do that split step and then what is some of the consequences of our strategy for that 979 00:47:00,000 --> 00:47:04,000 split step and how they come back to get at us, but that will be Friday. 980 00:47:04,000 --> 00:47:05,000 There will be music. There 981 00:47:05,000 --> 00:47:09,000 will be dancing girls.