This presentation is delivered by the Stanford Center for Professional Development.

Good afternoon. Apparently, Winter Quarter is over and it's now Spring. I don't know what happened to the weather, but I'm pretty happy about it. I hope you are too. Let's see. Where are we at? We are picking back up with the text, looking at Chapter 7. Chapter 7 is gonna be the theme for pretty much the whole week: talking about algorithms and different ways of analyzing them, formalizing their performance, and then talking about sorting as an example problem to kind of discuss in terms of algorithms and options for solving that problem.

Okay. So let's talk about algorithms. Algorithms are one of the most interesting things to think about from a CS perspective, one of the most intellectually interesting areas, because often we're trying to solve a particular problem. We want to put this data into sorted order. We want to count these scores and place them into a histogram. We want to search a maze to find a path from the start to the goal. For any of those tasks, right? There may be multiple ways we could go about solving that problem that approach it from different angles, using different kinds of data structures, solving it forward versus solving it in reverse. Is it easier to get from the goal back to the start? In the end the path is invertible, so maybe it doesn't matter which end we start from. Would it be better to use an array for this or a vector? Would a set help us out? Could we use a map? Could I use iteration? Could I use recursion? Lots of different ways we could do this. Often the decision-making will be about, well, do I need to get the perfect best answer? Am I willing to take some approximation, some estimation of a good answer that gives me a rough count of something, that might be faster to do than doing a real precise count?

So what we're gonna be looking at, right? Is taking a particular problem, sorting is gonna be the one that we spend the most time on, and thinking about, well, some of the different ways we can go about putting data in sorted order, and then talking about what the strengths and weaknesses of those algorithms are compared to one another.
Do they differ in how much time it will take to sort a particular data set? Does it matter if the data is almost sorted? Does it actually affect the performance of the algorithm? What happens if the data set is very large? How much memory is gonna be used by it?

And as we do this, we'll be thinking about what's probably the most important thing: how does it run? How much time does it take? How much memory does it use? How accurate is its answer? But also, given that brainpower is often in short supply, it's worth thinking about, well, is the code easy to develop, to write, to get correct, and kind of debug? It may be that a simpler algorithm that doesn't run as fast, but that's really easy to get working, might actually be the one you need to solve this particular problem, and save the fancy thing for some day when you really needed that extra speed boost.

So I'm gonna do a little brainstorming exercise with you, just to kind of get you thinking about how there's lots of different ways to solve what seems like the same problem. All right. I'd like to know how many students are sitting in this room right now. All right. I look around. Michelle, what's the easiest way and most reliable way to just get a count, an accurate count, of everybody in this room? Start by identifying what kind of [inaudible]. Yeah. Just have a strategy. Some people will use the front row, and I see one, two, three, four, five, and then I go back and I just work my way through the room all the way to the end, right? What I've done, if I actually remember my grade school counting, right? Then I should have come up with a number that matched the number of people in the room and tells me, okay, there's 100 people, let's say, that are here. So that will definitely work. That's the easiest, most likely approach you'd think of. You need to count something, well, then count them, right?

Now, I'm gonna talk about ways that we could do it that maybe aren't so obvious, but that might have different properties in terms of trade-offs. Let's say that the goal was to count all of the people in Stanford Stadium, right? And so I've got a whole bunch of people sitting there.
Counting them in turn, walking around the stadium, doesn't scale that well. However long it took me to work my way through the rows in this room, right? When you multiply it by the size of Stanford Stadium, it's gonna take a long time, and as people get up and go to the bathroom and move around and whatnot, I might lose track of where I'm at, and all sorts of complications. That doesn't really work well in the large.

So here's another option. Anyone else want to give me just another way of thinking about this? Yeah? Multiply yourself into more Julies. Because one Julie's not really enough, right? So recursion. Can recursion help us out here, right? Can I do some delegating, right? Some subcounting to other people? So in the case of this room, it might be that I pick somebody to count the left side, somebody to count the right side, and somebody to count the middle. And maybe even they themselves decide to use a little delegation in their work and say, well, how about I get some volunteers on each row? That sort of idea would work extremely well in Stanford Stadium as well, because you just divide it up even further. You say, well, here's this row, this row, that row, and kind of have all of these delegates working in parallel to count the stadium, right? Getting recursion to kind of solve that same problem.

What if I was willing to tolerate a little bit of inaccuracy? What if I wanted an estimate of the people that were in this room and I was willing to accept a little bit of sampling or estimation error? Maybe you could take, like, a little area of seats and count how many people are in those seats and then multiply it by how many of those areas are in the room? Yeah. So it's like this capacity. Somewhere there's a sign, right? On one of the doors here that will say, oh, this room seats 180 people. So I know that there are 180 seats without doing any counting. So if I took a section. So I take ten seats, let's say, or 18 seats for that matter. I take 18 seats and I count how many of those 18 seats are occupied, and let's say half of the 18 that I look at were occupied, then I can say, well, the room's about half full, right? And then I say, well, okay, 180 seats, half of them, right? Means 90 people are here.
So in that case it's not guaranteed to be accurate, right? If I happen to pick a particularly dense or sparsely populated section, right? I'd be getting some sampling error based on that, but that would also work in Stanford Stadium, right? We know how many seats are in Stanford Stadium, right? You pick a section and you count it. You count 100 seats' worth to get a percentage of how full it is and then just multiply it out to see what you get.

There's a variation on that that's maybe not the first thing that would occur to you, right? The idea of counting some subsections is kind of interesting as a way to kind of divide the problem and then say, well, in the small can I do some counting that will then scale up? So, for example, if I had everybody in this room and I said, okay, think of a number between one and ten. Everybody just think of a number independently on your own. Okay. If you were thinking of the number four, raise your hand. I got one, two, three, four, five, six, seven. Seven people. And I'd say, okay, well, if I figure that everybody was just as likely to pick any number between one and ten as four, then those seven people, right? Represent 10 percent of the population. So maybe there were 70 of you. Again, totally based on randomness and sampling error, right? If I happen to pick a number that's very unpopular with my crowd, right? Or very popular, I get kind of an artificial estimate. But it does tell you a little bit about the nature of just solving this problem. It's like, how much error am I willing to tolerate, how much time am I willing to invest in it, how big is the problem I'm trying to solve? What kind of things can I do that might, in the end, give me some estimate or accurate count, right? Depending on what I'm willing to invest in the time spent.

So random thought for you. Here's another one. I can take the Axess class list, right? And I can take the first ten people and call out their names and find out if they're here, and based on that, right? I would get this estimate of, well, what percentage of my students actually come? So if my class list says that there's 220 students, which it does, because where are they? And I take the first ten students, right?
And I call them out and I discover three of them are here, I can say, oh, about 30 percent of my 220 students come, 66 of you. And then I would also have the advantage of knowing who was a slacker and I could write it down in my book. No, no, no. I'm sure you guys are all at home watching in your bunny slippers.

All right. So let's talk about this in terms of computer things. The situation often is you have a problem to solve, right? You need to do this thing. You need to solve this maze or compute this histogram or find the mode score in your class or something like that, and you have these different options about how you might proceed: which data structure you use and how you're gonna approach it and stuff like that. Certainly one way to do it, and not a very practical way to do it, would be to say, well, I'll just go implement the five different ways I could do this. For example, when I was running your maze assignment that's what I did. I wrote, like, ten different maze solvers until I decided which one I was gonna give you. That's not very practical, right? If your boss said to you, you know, you need to solve task A, and your response was, I'll solve it ten different ways and then I'll come back to you with the best one. That might take more time than you've got. So you would have to write it, you'd debug it, you'd test it, and then you could kind of time it using your stopwatch to find out how well it did on these data sets.

Yeah. Well, there's some issues with that. First of all, you have to go do it, right? Which is kind of a hassle. And then also it's subject to these variations, like, well, what computer were you running on? What OS were you running? What other tasks were running? It may not be completely reliable, whatever results you saw. What we're gonna be more interested in is doing it in a more formal, abstract way, which is just analyzing the algorithm from a pseudocode standpoint. Knowing what the code does and the steps that it takes, right? What can we predict about how that algorithm will behave? What situations will it do well in? What situations will it do poorly in? When will it get the accurate answer or an estimate? How much time and space can we predict it will use based on the size of its input?
That would tell us, based on just some descriptions of the algorithm, which one might be the best one that's gonna work out in practice. Then we can go off and really implement the one that we've chosen. So this is actually working more in the mathematical sense, less in the stopwatch sense, but it helps us to analyze things before we've gone through all the process of writing the code. So what I'm gonna do with you today is a little bit of just analysis of some code we didn't write, to talk about how it behaves, and then we're gonna go on and talk about the sorting problem and some of the options for that.

So this one is a function that converts a temperature. You give it a temperature in Celsius and it converts it to the Fahrenheit equivalent by doing the multiplication and addition. So the basic, sort of, underpinning of looking at it formally is to basically realize that we're gonna count the statements that are executed, assuming, right? In this very gross overgeneralization, that each action that you take costs the same amount. That doing the multiply and the divide and the addition and a return, that all those things are, like, one unit. In this case, I'm gonna call it a penny, right? It took a penny to do each of those operations. That's not really accurate, to be fair. There's some operations that are more expensive at the low level than others, but we're gonna be pretty rough about our estimate here in this first step.

So there's a multiply, there's a divide, there's an add, there's a return, right? And so I have, like, four statements' worth of stuff that goes into asking a temperature to be converted. The other thing you'll note is, does it matter if the temperature is high or low? If we ask it to convert Celsius of zero, Celsius of 100, Celsius of 50, it does the same amount of work no matter what. So actually, for whatever inputs you give it, right? This function pretty much will take the same amount of time. That's good to know, right? There's certain functions that will be like that. So we would call this a constant time function. It's, like, no matter what you ask it to convert, it takes about the same amount of time. That doesn't mean that it takes no time, but it is a reliable performer regardless of its inputs.
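The function being described is roughly this shape (a minimal sketch; the name and types here are assumptions, not the slide's exact code):

```cpp
// Celsius-to-Fahrenheit conversion: a multiply, a divide, an add, and a return,
// about four "pennies" of work no matter what the input is -- constant time, O(1).
double CelsiusToFahrenheit(double celsius) {
    return celsius * 9 / 5 + 32;
}
```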
So let's look at this guy. This is one that takes in a vector, and then it sums all the values in the vector and then divides that sum by the total number of elements to compute its average. Okay. So, again, kind of using this idea that everything you do costs you a penny, how much is it gonna cost you to make a call to the average function? Well, this is a little tricky because, in fact, there's a loop in here, right? That's executed a variable number of times depending on the size of the input.

So first let's look at the things that are outside the loop, right? Outside the loop we do things like initialize the sum, we initialize i before we enter the loop, right? We do this division here at the bottom and this return. There are about four things we do outside of getting into and then iterating in the loop. So just four things we pay no matter what. Then we get into this loop body, and each iteration through the loop body does a test of i against the size to make sure that we're in range, right? Does this addition into the sum, and then increments i. So every iteration has kind of three statements that happen for every element in the vector: zero, one, two, three, and whatnot. So what we get here is a little bit of constant work that's done no matter what, and then another term that varies with the size of the input. If the vector has ten elements, right? We'll do 30 steps' worth of stuff inside the loop. If it has 100 elements, we do 300. In this case, right? For different inputs, right? We would expect the average to take a different amount of time.

Yeah? If n, the vector size, is empty, though, do we still initialize i? We still initialize i. So i was actually counted in here, right? The init of i, right? That was in there. We actually have one more test, though. We do a test and then we don't enter the loop body. So there's actually one little piece; maybe we could say that there's actually, like, three n plus five. There's one additional test that doesn't enter into the loop body. We're gonna see that actually we're gonna be a little bit fast and loose about some of the things that are in the noise in the end, and look more at the big leading term, but you are right. There is one more test than there are iterations of the loop, that last test that has to fail.
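As a concrete sketch of the kind of code being counted (an assumed shape; the lecture's slide uses the course's own Vector type, which may differ), assuming a non-empty vector:

```cpp
#include <vector>
using namespace std;

double Average(const vector<double> &values) {
    double sum = 0;                               // constant work outside the loop
    for (size_t i = 0; i < values.size(); i++) {  // init i once; the test runs each iteration
        sum += values[i];                         // add + increment: ~3 statements per element
    }
    return sum / values.size();                   // divide and return: constant work
}
// Roughly 3n plus a handful of constant statements: the loop term dominates, so O(n).
```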
The idea being that, right? We have three statements per element and a little bit of extra stuff tacked onto it. So if we double the size of the vector: if we compute the average of a vector that has 100 elements and it took us two seconds, and we put in a vector that's 200 elements long, we expect that it should about double, right? If it took two seconds to do this half-sized vector, then if we have to do a vector that's twice as long, we expect it's gonna take about four seconds. And that prediction is actually at the heart of what we're looking at today. It's trying to get this idea of, well, if we know something about how it performs relative to its input, can we describe, if we were to change the size of that input, to make the maze much bigger or to make the vector much smaller, what can we predict or estimate about how much time and memory will be used to solve the problem using this algorithm?

So here is one that, given a vector, computes the min and the max of the vector, and it does it with two loops, right? One loop that goes through and checks each element to see if it's greater than the max it's seen so far, and if so it updates the max. And a similar, almost identical loop here at the bottom that then checks to see if any successive element is smaller than the min it's seen so far, and it updates the min to that one. Okay. So a little bit of stuff that happens outside the loops, right? We init i two times, we init the min and the max, so we've got, like, four statements that happen outside. And then inside the loop, right? We've got a test and an assignment, a comparison, an increment, and we have two of those loops, right? And so we have a little bit of stuff outside the loops, and about eight instructions' worth that happens per every element. So we look at it once to see if it's greater than the max, and we actually look at everything again, right? To see if it's the min. I'm gonna use this actually as written, right?
You might think, well, I could make this better, right? I could make this better by doing these two things together, right? Inside that loop, right? So that while I'm looking at each element I can say, well, if it's the max or if it's the min, right? If I look at it just once I can actually kind of do both those comparisons inside the loop. And what we're gonna see is that in terms of this kind of analysis, that's really not gonna make any difference. Whether we do a loop over the vector once and then we go back and loop over it again, or whether we do twice as much stuff to each element inside one loop, for the purposes of the analysis we're looking at, it ends up coming down to the same thing. Which is, yeah, there is something that depends on the size of the input directly, which is to say that it will increase linearly with the change in size.

So I have these numbers, like, four statements for the Celsius to Fahrenheit, three n plus four, eight n plus four. These tell us a little bit about, well, will get extremes always take more time than average or C to F? Given the way those numbers look, it looks like, well, if you plugged in the same value of n, the same size vector, it should take more time to compute get extremes versus computing the average. We're gonna discover that, actually, we're not gonna try to make that guarantee. We're not really so much interested in comparing two algorithms and their constant factors to decide which of these two that kind of look about the same might be a little bit faster. What we're gonna try and look at is giving you kind of a rough idea of the class an algorithm fits in, what its growth term, the kind of prediction of its growth, is. In this case, both average and get extremes, in terms of growth, both have a leading term that depends on n, with some other noise and a little bit of a constant thrown in the front. That means that both of them, right? We'd expect to grow linearly, in a straight line, based on the change in n. So if you doubled the size of the thing, then average would take twice as long as it previously did. So should get extremes.
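For reference, the single-loop variant being alluded to looks something like this (a sketch with assumed names; the two-loop version on the slide does the same work to within a constant factor, so both are O(n)):

```cpp
#include <vector>
using namespace std;

// Find both extremes in one pass; assumes a non-empty vector.
void GetExtremes(const vector<int> &values, int &min, int &max) {
    min = max = values[0];
    for (size_t i = 1; i < values.size(); i++) {
        if (values[i] > max) max = values[i];   // compare against max seen so far
        if (values[i] < min) min = values[i];   // compare against min seen so far
    }
}
```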
But the data point of whether average on a vector of 100 elements takes the same amount of time as get extremes on 100 is not what we're looking at predicting, right? We're trying to just tell you things about average against itself over different ranges of inputs. So if we know that average takes two milliseconds for 1,000 elements, if we double the size of that, we expect it to take twice as long, four milliseconds. If we increase it by a factor of ten, we expect to increase the time by a factor of ten, right? So it should go up to 20 milliseconds. Get extremes might take more or less for a particular input. We expect it probably will take more because it does a little bit more work per element, but we really want to time it to be confident about that, rather than making any estimation here.

So in terms of what's called big O notation, we're gonna see that we're going to kind of take those statement counts and we're gonna summarize them. We can do a little bit of niggling to figure out what those numbers are, but what we're gonna then do is take the leading term, the largest term in the input number, in this case n, ignore any smaller terms, and then drop all the constants, all the coefficients, right? If you know it takes n over 2, it's just gonna be n. If you know that it takes ten statements, then it's gonna just be constant if there's no term of n in there. If it takes 10n, then it's still n. 10n and 2n and 25n and 1/15 n all end up just being n. So we might have this idea that we've estimated the time using n and having those constants in, but when it comes time to describe it in terms of big O, we focus on just that leading term. Oh, that, the subscript got lost on that, and that just makes no sense as it is right there. I will fix it in a second.

So the 3n plus 5, right? The leading term is n; the coefficient 3 gets dropped. It's just linear. We expect that as we change the input, the time will change linearly with that change.
The 10n minus 2: same class, O of n. Even though it didn't have kind of the same constants coming into it, right? It has the same growth pattern; they both are lines. The slope of them might be slightly different, but in terms of big O we're being pretty loose about what fits into this class.

One-half n squared minus n: the leading term here is the n squared; the n gets subtracted off. When n is small, n squared and n are actually kind of in range of each other, but what we're looking at is in the limit. As n gets very, very large, n gets to be 1,000, n gets to be a million, n squared is a much bigger number than n itself, and so as we get to these larger points, kind of out into the limit, this term is the only one that matters, and this is just a little bit of the noise attached to it. So that's why we're gonna summarize down to that.

What this one is supposed to say, and it actually is incorrect here, is two to the n and n to the third power, which summarizes to two to the n. Two to the n grows much, much more rapidly than n cubed does as n gets to be a very large number. Even at a fairly small number, you know, two to the tenth, right? Is already about 1,000. Two to the 20th is a million, which is much bigger than these guys over here. So as we get to the long run it must be much bigger. I'm gonna put a note to myself, though, to fix that before I put that on the web. It should be O of two to the n? Yeah. It should be O of two to the n. So that should be two to the n, n to the third, and my two to the n there. My subscripts got blasted out of that.

And so we're trying to describe this growth curve, like, in the limit, right? We're avoiding the details when they don't matter, and they don't matter when n gets big enough, right? Only the leading term is predicting; this other stuff we're kind of just ignoring. So this is very sloppy math, right? So those of you who are more trained in the, sort of, mathematical sciences may find this a little bit alarming, just how kind of fast and loose we're gonna be, throwing all these terms away left and right, just kind of summarizing down to, okay, here's this leading term and how it relates to n.
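Written out, the summaries just described come to:

```latex
3n + 5 = O(n), \qquad 10n - 2 = O(n), \qquad \tfrac{1}{2}n^2 - n = O(n^2), \qquad 2^n + n^3 = O(2^n)
```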
Everything else, right? Totally uninteresting. We could have a function, for example, that did 1,000 conversions of Celsius to Fahrenheit, but if every time you call it, it does 1,000 conversions, that means it doesn't change no matter what input you give to it. That's considered O of one, or constant time. Similarly, right? For doing average: if we did an average that somehow operated over the vector one time, and another one that did it ten times or 20 times, looked at each element 20 times, right? They would still both be O of n. Whatever time they took, right? On a vector of a certain length, if we double that length, we'd expect to double the time.

More formally, right? In terms of what the math really means, we say that big O of some function f of n describes an upper bound on the time required. Meaning that for sufficiently large values of n, and some constant c that we get to pick, that would bound the curve. So if you imagine what the real curve looks like, it grows, and maybe it flattens out or maybe it goes up very sharply. What c times f of n gives you is kind of some upper bound on that, a curve that stays above it from some point of n on, in the limit, right? It will dominate it from there to infinity. So describing where it's gonna sit under that upper bound gives us some measure of what's going on. Okay.
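In the standard notation the lecture is paraphrasing, that definition reads:

```latex
T(n) = O(f(n)) \iff \exists\, c > 0,\ n_0 \ \text{such that}\ T(n) \le c \cdot f(n) \ \text{for all } n \ge n_0.
```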
So we can use this to predict times. That's probably what's most valuable to us about it: knowing that I have a linear algorithm. So an O of n algorithm we'll call linear, an n squared algorithm I'll call quadratic, n to the third I'll call cubic, right? That tells us what those curves look like, right? You know what a line looks like. Well, you know what a parabola looks like coming out of an n squared, where it's growing much more sharply and early. So it might be, right? That if I know that it takes three seconds to do something for 5,000 elements and I have a linear algorithm, then 10,000 elements, right? Should take twice as long, and 20,000 should take another factor of two on that, so 12 seconds' worth. Or I may have an n squared algorithm, one that I expect to actually perform more poorly in the long run, right? This n squared versus n tells you that it's gonna grow more sharply. For some small values of n, and maybe 5,000 is the case here, the n squared algorithm actually outperforms the linear one; that's not uncommon, actually. But, right? As it grows, right? As I get to larger and larger values of n, the fact that it is going up by a factor of four, right? The square of the doubling, as opposed to linear, means it's gonna quickly take off. And so I take my 5,000 elements that took two and a half seconds, I put in an input that's twice as large, and it's gonna take a factor of four, right? From there. And if I go from 10,000 to 20,000, another factor of four is gonna bring that up to 40 seconds or so, compared to my more modestly growing linear algorithm there. So if I was facing some task, right? Where I had an option between a linear algorithm and a quadratic algorithm, it's telling me that in the long run the quadratic algorithm, for sufficiently large inputs, right? Is going to bog down in a way that the linear one will be our kind of strong performer in the larger cases.

So some algorithms actually have a variable run time expectation, where you cannot say with all assuredness how much time it will take, because depending on the input and the characteristics of the input it may do more or less work. Average looks at every element in the vector and sums them all up and does the division; it doesn't matter what elements are in there. It's very reliable in that sense. But take something like search. So this is a linear search function that, given a vector of names and a particular key, just walks the vector looking for a match. If it finds it, it returns true. If it exhaustively searched the whole thing and didn't find it, it returns false. Okay.
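The search being described is the classic linear scan; as a sketch (assumed signature, in the spirit of the slide):

```cpp
#include <string>
#include <vector>
using namespace std;

// Walk the vector looking for a match: true if found, false if we exhaust the list.
bool Contains(const vector<string> &names, const string &key) {
    for (size_t i = 0; i < names.size(); i++) {
        if (names[i] == key) return true;   // best case: match in the first slot, O(1)
    }
    return false;                           // worst case: looked at all n elements, O(n)
}
```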
So we've got what looks like an O of n loop that will look at the elements, but there are some cases, right? Where it just doesn't do very much work at all. For example, if it finds the key in the very first spot, right? If I'm looking for Bob and Bob is the first thing in there, then I immediately return true and I do no more work. And so it doesn't matter whether Bob was followed by a million more names or ten more names, right? In fact, it is a constant time operation to just access that first element and return. Similarly for other things that are close to the front: it looks at a few of them and it's done. And so we would call those the best case of the algorithm. So we can divide this into things. It's like, well, for the best case input, right? What can we expect out of the performance? Well, the best-case input would be it's found in the first few members, in which case it's an O of one algorithm for those situations. That's not often very interesting. To say, well, here's a particular input that causes it to immediately be able to calculate something and return: yeah, not so interesting. So it's worth knowing that such a thing exists, but it turns out it's not likely to tell us a lot about the general performance of this algorithm.

If it's in the middle or it's in the last slot, right? We're gonna be looking at a lot of the elements to decide that it's there or not. In the absolute worst case, right? The input that causes it to do the most amount of work is the one where it doesn't find it at all, where it looks at every single element, never finding the match, and then comes out and returns false. And the worst case, in this case, is a fairly likely thing to happen, right? We're searching because we happen to believe it might not be there, and that gives us this upper bound on how bad it gets. So in the worst case it looks at everything and it is definitely O of n.

So we have this kind of constant best case, an O of n worst case, and then our average case is gonna be somewhere in the middle there. This actually is a little bit harder to predict or to compute precisely in most situations, because you have to know things about, well, what is the likely range of inputs? So typically the definition is that it's averaged over all the possible inputs. Well, given the search, it's pretty hard to say what all the possible inputs for it are. You can make some assumptions, like, well,
all permutations of the list are equally likely, and the name is in the list about half the time, let's say. You have to make up some parameters that describe what you believe to be the likely inputs, and then you can say, well, if it were in the list, if it's equally likely to be in the front as in the back, then on average it's gonna look at n over two. It's just as likely to be in any of the slots. So, let's say, if you were gonna call it n times, and the name you were looking for was going to be in each individual slot exactly one time, the totally random, perfect permutation case, then it would have looked at n over two of them. Sometimes it looked at one, sometimes it looked at n minus one, sometimes it looked at n over two, sometimes n over two plus one, and whatnot. Over all of them it looked at about half. And then if there was another set of calls to search for it, right? Where it never found it, it would be looking at n, right? And so you can say, well, sometimes we look at n over two, sometimes we look at n. On average we're looking at about three-quarters of them, right? In big O, since we get to be a bit sloppy here, we can say, well, this all just boils down to being linear. O of n still describes the growth in the average case, which is the same number we got for the worst case. So this is a little bit tricky, but this is actually the one that's probably the most interesting, right? Which is just, in normal practice, how is this thing gonna behave?

So the last little thing I need to complete before we can go on and start applying this to do something interesting is to talk a little bit about how we do recursive algorithms, because they're a little bit trickier than the standard iteration and counting. Counting statements, kind of saying, oh, I've got this loop followed by this loop and this thing happening, right? Will get you through the simple cases. Well, what do we do when we have a recursive algorithm? Well, we're trying to compute the time spent in solving something that, in effect, is making a call to the same algorithm, and so we're gonna likely have some recursive definition we need to work through.

So this is the factorial function. If n equals zero, return one; otherwise return n times the factorial of n minus one.
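As code, that's the familiar recursive factorial (a sketch of the function being analyzed):

```cpp
int Factorial(int n) {
    if (n == 0) return 1;            // base case: a comparison and a return, constant work
    return n * Factorial(n - 1);     // a multiply and a return, plus the time for T(n - 1)
}
```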
We're interested in doing kind of the statement counts, or kind of summarizing the time, for an input whose value is n. And so what we write down is what's called a recurrence relation that largely matches the code, in terms of saying how much work is done in the base case and in the different cases that are being handled. Typically there is a base case or two where you say, well, if it's in these cases, right? We do these easy things. So if n is exactly zero, then we do O of one worth of work, right? We do a little bit of constant work here: we do a comparison and a return. In the otherwise case, when it is not zero, then we do a little bit of work, in this case a return, a multiply, a little bit of constant work, plus whatever time it takes to compute the factorial of n minus one. So we have T of n defined in terms of T of something else, right? Which is exactly what we'd expect in a recursive function. This is called a recurrence relation, so that solving for T of n means working with a side that refers back to that same T, but on a smaller version of the problem: n over two, or n minus one, or some other variation of that recursive call.

So let me show you how we make that go into closed form. And actually I'm just gonna do this on the board, because I think it's easier if I just write what's going on and you can watch me develop it. So I have this recurrence relation: T of n equals one if n equals zero, and then it's one plus T of n minus one otherwise. So I'm trying to solve T of n into a closed form. I'm gonna start with its non-base-case form, the recursive step. And then what I'm gonna do is I'm actually just gonna apply repeated substitution to take this term and expand it out. So I know it's one plus whatever it takes to do T of n minus one. So if I go back over here and I say, well, T of n minus one must be one if n minus one equals zero, or it's one plus T of n minus two.
So kind of plugging it into the original formula and getting the expansion one layer in, I can say, well, this is one plus one plus T of n minus two. And if I apply that again, right? I should get another term of one. So after I have done this i times, then I will have a bunch of ones added up together and I will have subtracted an i from the term inside the T. Each of these represents a little bit of work: the n, which calls the n minus one, which calls the n minus two. So each of the calls, kind of each stack frame, has a little bit of a contribution to add to the whole thing, and then what we're looking at is how many times do we have to do that expansion and substitution, right? Before we hit this base case, right? Where T of n exactly equals one. So we're looking for the case where n equals zero, so I want to do this until n minus i equals zero. Okay. So I need to have done this i times, where i is n.

So if I set i equal to n, then I'll have one plus one plus one, n times, and then I have plus T of n minus n over here, which is my T sub zero. T sub zero, right? Immediately plugs into that base case and says, well, there's just one more thing to do when you get to that. And so what I basically have here is n plus one. So n multiplications plus one little tack-on for the base case, which in terms of big O is just linear; a little bit of extra work in the noise there. That means, kind of as seems predictable, right? That factorial on a particular input is linear in the number you ask to compute the factorial of. The factorial of ten takes ten multiplications, right? The factorial of 20 takes 20 multiplications. So if you change the size of your input, right? You double it, it should take twice as long. However much time it cost you to compute the factorial of ten, it's gonna take twice as much time to compute the factorial of 20. Okay.
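Written out, the board work is the usual telescoping of the recurrence, with T(0) = 1 as the base case:

```latex
\begin{align*}
T(n) &= 1 + T(n-1) \\
     &= 1 + 1 + T(n-2) \\
     &= \underbrace{1 + 1 + \cdots + 1}_{i \text{ ones}} + T(n-i) \\
     &= n + T(0) = n + 1 = O(n) \qquad \text{(setting } i = n\text{)}
\end{align*}
```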
That kind of 662 00:33:32,000 --> 00:33:36,000 makes sense, but it's good to know kind of how I can do this math, 663 00:33:36,000 --> 00:33:38,000 right? To work this out, right? So this idea of 664 00:33:38,000 --> 00:33:45,000 taking the term, repeatedly substituting and expanding, generalizing my pattern, and saying, well, after I substitutions' 665 00:33:45,000 --> 00:33:46,000 worth, where am I at? 666 00:33:46,000 --> 00:33:48,000 And these correspond to kind of thinking about 667 00:33:48,000 --> 00:33:53,000 the recursion tree. What calls are made and then how deep does the recursion 668 00:33:53,000 --> 00:33:55,000 go before I hit the base case 669 00:33:55,000 --> 00:33:58,000 and that tells me how to 670 00:33:58,000 --> 00:34:08,000 stop that expanding and then substitute back in for the base case to compute my total result. I'm gonna do another one, so 671 00:34:08,000 --> 00:34:10,000 if you want to just 672 00:34:10,000 --> 00:34:12,000 watch and we'll do 673 00:34:12,000 --> 00:34:13,000 the math for this together, too. 674 00:34:13,000 --> 00:34:17,000 This is the Towers of Hanoi example 675 00:34:17,000 --> 00:34:18,000 that moves the 676 00:34:18,000 --> 00:34:20,000 tower of n minus one off to the side, 677 00:34:20,000 --> 00:34:25,000 moves the single disk on the bottom and then moves that tower back on. 678 00:34:25,000 --> 00:34:37,000 And so the recurrence that we're working with 679 00:34:37,000 --> 00:34:42,000 is one when n equals zero. So when n equals zero, right? We have a zero-height tower to move and we 680 00:34:42,000 --> 00:34:46,000 actually do no work in the function, right? We just do that test and return, so if n equals zero 681 00:34:46,000 --> 00:34:47,000 there's no work to be done. 682 00:34:47,000 --> 00:34:50,000 So that's the easy case for us, right? Is 683 00:34:50,000 --> 00:34:53,000 when it's zero 684 00:34:53,000 --> 00:34:54,000 do no work. 685 00:34:54,000 --> 00:34:55,000 Otherwise, 686 00:34:55,000 --> 00:34:57,000 right? We move the bottommost disk. 687 00:34:57,000 --> 00:35:01,000 So we do a little testing and moving of that disk. We'll call that the constant amount of work that's in the function 688 00:35:01,000 --> 00:35:02,000 itself. 689 00:35:02,000 --> 00:35:04,000 And it makes two recursive calls, 690 00:35:04,000 --> 00:35:05,000 each of a tower 691 00:35:05,000 --> 00:35:10,000 of height n minus one. 692 00:35:10,000 --> 00:35:12,000 So otherwise, right? 693 00:35:12,000 --> 00:35:15,000 Two calls, which gives a little clue to what the tree looks like. It'll branch twice 694 00:35:15,000 --> 00:35:20,000 at each step and then it's one closer to that base case, which gives us a sense that the recursion 695 00:35:20,000 --> 00:35:21,000 depth 696 00:35:21,000 --> 00:35:24,000 is likely to be linear here. 697 00:35:24,000 --> 00:35:26,000 So let me go through the process of making this work. 698 00:35:26,000 --> 00:35:30,000 I've got T of 699 00:35:30,000 --> 00:35:33,000 n equals one plus two T of n minus one. So then 700 00:35:33,000 --> 00:35:35,000 I 701 00:35:35,000 --> 00:35:37,000 take T of n minus one 702 00:35:37,000 --> 00:35:39,000 and I plug it in over here 703 00:35:39,000 --> 00:35:42,000 to get one plus two 704 00:35:42,000 --> 00:35:44,000 T of n minus two. 705 00:35:44,000 --> 00:35:47,000 This whole thing gets 706 00:35:47,000 --> 00:35:51,000 multiplied by two though because I have two of those, right? From the original call 707 00:35:51,000 --> 00:35:53,000 which then itself made two.
So, in fact, if I multiply through, 708 00:35:53,000 --> 00:35:55,000 right? I've got one plus 709 00:35:55,000 --> 00:35:59,000 two plus four 710 00:35:59,000 --> 00:36:01,000 T of n minus two. If 711 00:36:01,000 --> 00:36:03,000 I apply my substitution again, one plus two plus 712 00:36:03,000 --> 00:36:06,000 four times T of 713 00:36:06,000 --> 00:36:09,000 n minus two, and that is one plus two T 714 00:36:09,000 --> 00:36:12,000 of n minus three, 715 00:36:12,000 --> 00:36:13,000 and then let the 716 00:36:13,000 --> 00:36:21,000 multiplication go through again. One plus two plus four plus eight T of n 717 00:36:21,000 --> 00:36:22,000 minus three. 718 00:36:22,000 --> 00:36:26,000 And so each expansion of this, right? Is causing the number of towers 719 00:36:26,000 --> 00:36:31,000 that are being moved around to go up by a factor of two. So each time I do this, 720 00:36:31,000 --> 00:36:32,000 right? I went from 721 00:36:32,000 --> 00:36:36,000 two towers to move to four towers to eight towers, but those towers each got one shorter 722 00:36:36,000 --> 00:36:41,000 in the process. So kind of telling us a little bit about what the recursion tree looks like, right? 723 00:36:41,000 --> 00:36:42,000 There is a 724 00:36:42,000 --> 00:36:48,000 branching factor of two all the way down and 725 00:36:48,000 --> 00:36:50,000 the depth of this thing 726 00:36:50,000 --> 00:36:51,000 is gonna bottom out linearly 727 00:36:51,000 --> 00:36:57,000 because this goes a tower of ten, nine, eight, seven, and so on down to the bottom. 728 00:36:57,000 --> 00:36:59,000 So if I imagined this happened I times, 729 00:36:59,000 --> 00:37:03,000 so to generalize my pattern, I've got one plus two 730 00:37:03,000 --> 00:37:06,000 plus four plus up to two 731 00:37:06,000 --> 00:37:09,000 to the I. So after I've done this 732 00:37:09,000 --> 00:37:12,000 that number of times, right? Actually it's two to the I minus one, plus two to 733 00:37:12,000 --> 00:37:17,000 the I, T of n minus I. 734 00:37:17,000 --> 00:37:22,000 So I have subtracted I off the height of the tower. 735 00:37:22,000 --> 00:37:26,000 I have gone up by a factor of two each time I did that and then I have these sums in the front, 736 00:37:26,000 --> 00:37:30,000 which represent kind of the single disk that got moved in there. One plus two plus four 737 00:37:30,000 --> 00:37:32,000 plus eight plus so on 738 00:37:32,000 --> 00:37:33,000 down to there. 739 00:37:33,000 --> 00:37:37,000 And so the place I want to get to, right? Is where n equals zero. 740 00:37:37,000 --> 00:37:39,000 So I actually want to set 741 00:37:39,000 --> 00:37:44,000 I equal to n here, so let's say I equals n. I plug that in and I've 742 00:37:44,000 --> 00:37:49,000 got one plus two plus four plus all the powers of two 743 00:37:49,000 --> 00:37:52,000 up to two to the n minus one, plus 744 00:37:52,000 --> 00:37:54,000 two to the n times T of n minus 745 00:37:54,000 --> 00:37:55,000 n, which is T of zero, 746 00:37:55,000 --> 00:37:58,000 which is just one. 747 00:37:58,000 --> 00:38:05,000 So what I've got here, 748 00:38:05,000 --> 00:38:11,000 following along, 749 00:38:11,000 --> 00:38:13,000 is T of n is one plus two plus four 750 00:38:13,000 --> 00:38:14,000 plus so on up the powers of two, right? 751 00:38:14,000 --> 00:38:18,000 Up to two 752 00:38:18,000 --> 00:38:19,000 to the n 753 00:38:19,000 --> 00:38:20,000 minus 754 00:38:20,000 --> 00:38:23,000 one, plus two to the n times one.
755 00:38:23,000 --> 00:38:25,000 So I've got the geometric sum here. 756 00:38:25,000 --> 00:38:26,000 You may or may not 757 00:38:26,000 --> 00:38:30,000 already know how to solve this one, but I'm just gonna go ahead and 758 00:38:30,000 --> 00:38:34,000 solve it in front of you to remind you of how the process works. I've got the powers of two. 759 00:38:34,000 --> 00:38:36,000 One plus two plus up to two to the n. 760 00:38:36,000 --> 00:38:39,000 So that's the term I'm looking for. I want to call this A 761 00:38:39,000 --> 00:38:40,000 just to mark it. 762 00:38:40,000 --> 00:38:45,000 And then what I'm gonna compute for you is what two times A is 763 00:38:45,000 --> 00:38:50,000 and I'm gonna write it underneath. So if I multiply this by two I have one times two, which is two. 764 00:38:50,000 --> 00:38:52,000 Two times two, which is four, 765 00:38:52,000 --> 00:38:53,000 right? Four times 766 00:38:53,000 --> 00:38:55,000 two, which is eight. 767 00:38:55,000 --> 00:38:56,000 And so on. 768 00:38:56,000 --> 00:38:59,000 And so I should get 769 00:38:59,000 --> 00:39:02,000 basically the same sequence of terms, but shifted over by one. 770 00:39:02,000 --> 00:39:04,000 I don't have the factor of one 771 00:39:04,000 --> 00:39:06,000 and I do have an additional factor of two to the 772 00:39:06,000 --> 00:39:11,000 n times another power of two at the end. So I've got the whole term kind of shifted 773 00:39:11,000 --> 00:39:15,000 over by one. That's my two A so that's the same thing I'm looking for, but doubled. 774 00:39:15,000 --> 00:39:17,000 And then I'm gonna basically take this thing 775 00:39:17,000 --> 00:39:19,000 and I'm gonna 776 00:39:19,000 --> 00:39:23,000 add negative A to two A. So subtracting A off of two A 777 00:39:23,000 --> 00:39:24,000 so that all these terms 778 00:39:24,000 --> 00:39:28,000 will go away. If I have this minus this, this minus this all the way across, right? 779 00:39:28,000 --> 00:39:30,000 The term I'll be left with 780 00:39:30,000 --> 00:39:32,000 is two to the n plus one 781 00:39:32,000 --> 00:39:35,000 minus one. That's 782 00:39:35,000 --> 00:39:37,000 A subtracted from two A, so the 783 00:39:37,000 --> 00:39:38,000 T of 784 00:39:38,000 --> 00:39:41,000 n that I'm looking for 785 00:39:41,000 --> 00:39:43,000 is two to the n plus one minus one. 786 00:39:43,000 --> 00:39:45,000 That's my term there. And 787 00:39:45,000 --> 00:39:51,000 if I summarize in my big O way, 788 00:39:51,000 --> 00:39:56,000 ignoring the constants, throwing away these little parts of the terms, right? That are in the noise. 789 00:39:56,000 --> 00:39:58,000 What we really have here is something that grows, 790 00:39:58,000 --> 00:40:00,000 right? Exponentially, 791 00:40:00,000 --> 00:40:04,000 by about a factor of two for each additional 792 00:40:04,000 --> 00:40:05,000 793 00:40:05,000 --> 00:40:07,000 disk added to the tower. 794 00:40:07,000 --> 00:40:10,000 So however much time it took to move a tower of height ten, 795 00:40:10,000 --> 00:40:15,000 right? A tower of height 11 we expect to take twice as long. So not growing linearly at all, right? 796 00:40:15,000 --> 00:40:18,000 Much, much more sharply growing, right? That's what exponential 797 00:40:18,000 --> 00:40:19,000 growth looks like. 798 00:40:19,000 --> 00:40:22,000 Two to the n, right? Even for small values of n, 799 00:40:22,000 --> 00:40:27,000 ten, 20, right? Is up into the millions already in terms of disks that need to be 800 00:40:27,000 --> 00:40:30,000 moved.
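Here is a short C++ sketch of the Towers of Hanoi routine being analyzed, with the recurrence and its closed form noted in comments. It follows the verbal description above (move the n minus one tower aside, move the bottom disk, move the tower back on); the peg labels, the printing, and the function name are illustrative choices, not the exact code from the text.

#include <iostream>
using namespace std;

// Move a tower of height n from one peg to another, as described above:
// move the n-1 tower out of the way, move the bottom disk, move the n-1 tower back on.
// Recurrence: T(0) = 1,  T(n) = 1 + 2 T(n - 1),
// which solves to T(n) = 2^(n+1) - 1, i.e. O(2^n) growth overall.
void MoveTower(int n, char from, char to, char spare) {
    if (n == 0) return;                        // zero-height tower: just the test, no moves
    MoveTower(n - 1, from, spare, to);         // first recursive call: T(n - 1)
    cout << "move disk " << n << " from " << from << " to " << to << endl;   // constant work
    MoveTower(n - 1, spare, to, from);         // second recursive call: T(n - 1)
}

int main() {
    MoveTower(3, 'A', 'C', 'B');               // prints the 2^3 - 1 = 7 disk moves
    return 0;
}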
So 801 00:40:30,000 --> 00:40:33,000 this sort of strategy, right? Is good to know. When you're looking at these recursive things, right? 802 00:40:33,000 --> 00:40:37,000 Kind of having this analysis. It says, okay, well, where did I start from? 803 00:40:37,000 --> 00:40:41,000 What's the time? And then just being pretty mechanical. Sometimes you can get to something here that 804 00:40:41,000 --> 00:40:45,000 requires a little bit of algebra to work out, but the most common patterns, right? 805 00:40:45,000 --> 00:40:48,000 You'll start to recognize over and again, right? 806 00:40:48,000 --> 00:40:51,000 And this one, right? The fact that it's branching by a factor of two 807 00:40:51,000 --> 00:41:03,000 telling you that by the time you get down to the base case, right? You're gonna expect to see a two to the n expansion. 808 00:41:03,000 --> 00:41:05,000 So just to give you 809 00:41:05,000 --> 00:41:09,000 a little bit of an idea that you can use big O to predict things 810 00:41:09,000 --> 00:41:13,000 based on getting some run time performance for some small values of n 811 00:41:13,000 --> 00:41:16,000 and then what you'd expect to see for larger values. 812 00:41:16,000 --> 00:41:17,000 So I have 813 00:41:17,000 --> 00:41:19,000 a few different algorithms here. 814 00:41:19,000 --> 00:41:20,000 One that was 815 00:41:20,000 --> 00:41:24,000 just logarithmic in terms of n, so it should divide the input in half and kind of 816 00:41:24,000 --> 00:41:27,000 work on it. Something like binary search actually fits that profile. 817 00:41:27,000 --> 00:41:31,000 Something that's linear. Something that's n log n and something that's n squared. 818 00:41:31,000 --> 00:41:35,000 Then for different values of n I've plugged in actually the first ones I ran, and then 819 00:41:35,000 --> 00:41:39,000 some of the other ones I had to estimate because they were taking way too long to finish. 820 00:41:39,000 --> 00:41:43,000 So that for an input of size ten, all of them are imperceptible. These are in terms of seconds 821 00:41:43,000 --> 00:41:44,000 here. 822 00:41:44,000 --> 00:41:48,000 Taking fractions of seconds. This is on my machine that is running at a couple gigahertz, so it's 823 00:41:48,000 --> 00:41:50,000 got about a billion instructions per second. 824 00:41:50,000 --> 00:41:53,000 You up that by a factor of ten, right? 825 00:41:53,000 --> 00:41:56,000 And they're still kind of in the subsecond range, but you can see that the 826 00:41:56,000 --> 00:41:59,000 n squared algorithm 827 00:41:59,000 --> 00:42:00,000 took a bigger jump, 828 00:42:00,000 --> 00:42:04,000 let's say, than the linear algorithm did, which went up by a factor of ten almost 829 00:42:04,000 --> 00:42:05,000 exactly. 830 00:42:05,000 --> 00:42:07,000 Going up by another factor of ten 831 00:42:07,000 --> 00:42:08,000 and, 832 00:42:08,000 --> 00:42:13,000 again, sort of climbing, right? 10,000, 100,000, a million, a trillion. You 833 00:42:13,000 --> 00:42:15,000 get to something that's like 100,000, right? 834 00:42:15,000 --> 00:42:20,000 And an algorithm that's linear, right? Is still taking a fraction of a second. But something 835 00:42:20,000 --> 00:42:24,000 that was quadratic now has climbed up to take several hours. 836 00:42:24,000 --> 00:42:28,000 So even for inputs that don't seem particularly huge, right? The difference between having an 837 00:42:28,000 --> 00:42:33,000 algorithm that's gonna operate in linear time versus quadratic is gonna be felt very profoundly.
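As a rough illustration of extrapolating like this, here is a small C++ sketch; the function names and the half-second baseline are invented for the example and are not the instructor's measurements. It scales a time measured at a small n up to a larger n under a few assumed growth rates.

#include <cmath>
#include <iostream>
using namespace std;

// Scale a measured running time t0 at input size n0 up to size n1,
// assuming a particular growth rate. The numbers are purely illustrative.
double PredictLinear(double t0, double n0, double n1)    { return t0 * (n1 / n0); }
double PredictNLogN(double t0, double n0, double n1)     { return t0 * (n1 * log2(n1)) / (n0 * log2(n0)); }
double PredictQuadratic(double t0, double n0, double n1) { return t0 * (n1 / n0) * (n1 / n0); }

int main() {
    double t0 = 0.5;                    // pretend the run at n = 10,000 took half a second
    double n0 = 10000, n1 = 1000000;    // extrapolate to a million
    cout << "if linear:    " << PredictLinear(t0, n0, n1)    << " sec" << endl;   // ~50 sec
    cout << "if n log n:   " << PredictNLogN(t0, n0, n1)     << " sec" << endl;   // ~75 sec
    cout << "if quadratic: " << PredictQuadratic(t0, n0, n1) << " sec" << endl;   // ~5000 sec
    return 0;
}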
And 838 00:42:33,000 --> 00:42:37,000 as you move to these larger numbers you have things that you just can't do, 839 00:42:37,000 --> 00:42:40,000 right? You cannot run an n squared algorithm on a trillion pieces of data 840 00:42:40,000 --> 00:42:42,000 in your lifetime. 841 00:42:42,000 --> 00:42:43,000 842 00:42:43,000 --> 00:42:48,000 It's just too much work for what the input is. Clinton? Yeah. How did it go down for a 843 00:42:48,000 --> 00:42:51,000 billion from 844 00:42:51,000 --> 00:42:53,000 like a million? 845 00:42:53,000 --> 00:42:55,000 From - Oh, you know what, 846 00:42:55,000 --> 00:42:59,000 that just must be wrong. You're totally right. That should definitely be going up some. Yeah. 847 00:42:59,000 --> 00:43:02,000 Well, log n should really be - I think that 848 00:43:02,000 --> 00:43:05,000 when I copied and pasted from the terminal window I must have just made a mistake when I 849 00:43:05,000 --> 00:43:08,000 was moving something across, but you're totally right. 850 00:43:08,000 --> 00:43:09,000 This should be going up, 851 00:43:09,000 --> 00:43:13,000 right? Ever so slightly, right? This algorithm's a very, very slowly growing logarithmic function, 852 00:43:13,000 --> 00:43:16,000 right? Almost a flat line, but that definitely should be up a little bit, right? From 853 00:43:16,000 --> 00:43:18,000 where it is. 854 00:43:18,000 --> 00:43:19,000 When you get to a trillion, right? 855 00:43:19,000 --> 00:43:24,000 Even the linear algorithm is starting to actually take some noticeable time, right? 856 00:43:24,000 --> 00:43:28,000 But things that are logarithmic are still growing very, very slowly. So this is an example of 857 00:43:28,000 --> 00:43:32,000 binary search. Binary search operating on a million or a trillion items is still just 858 00:43:32,000 --> 00:43:33,000 in a 859 00:43:33,000 --> 00:43:35,000 flash, right? Telling you whether it found it or not. 860 00:43:35,000 --> 00:43:37,000 Doing a linear search 861 00:43:37,000 --> 00:43:38,000 on a trillion or a billion 862 00:43:38,000 --> 00:43:40,000 is 863 00:43:40,000 --> 00:43:41,000 several minutes' worth of 864 00:43:41,000 --> 00:43:45,000 my old computer's 865 00:43:45,000 --> 00:43:48,000 time. And so just another way to look at them, right? Is to think about what those things look 866 00:43:48,000 --> 00:43:51,000 like graphed on the Cartesian plane, 867 00:43:51,000 --> 00:43:55,000 that a constant algorithm is just a flat line. It takes the same amount of time no matter how big the input is. 868 00:43:55,000 --> 00:43:56,000 These are 869 00:43:56,000 --> 00:44:00,000 pretty small values of n down here, so it just shows the early stages, right? But logarithmic, right? 870 00:44:00,000 --> 00:44:03,000 Almost a flat line itself. A little bit above 871 00:44:03,000 --> 00:44:06,000 constant, but very, very slowly growing. 872 00:44:06,000 --> 00:44:07,000 The 873 00:44:07,000 --> 00:44:09,000 linear one, right? Showing a straight line, and then the 874 00:44:09,000 --> 00:44:13,000 n squared and two to the n showing you that they're kind of heading to the hills very 875 00:44:13,000 --> 00:44:15,000 quickly here. So even for small values of n 876 00:44:15,000 --> 00:44:16,000 877 00:44:16,000 --> 00:44:18,000 they are reaching into the 878 00:44:18,000 --> 00:44:23,000 high regions and will continue to do so, and 879 00:44:23,000 --> 00:44:31,000 that will be even more pronounced, right? As you move into the higher and higher values of n. All right.
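Since binary search versus linear search is the comparison being drawn on here, a quick C++ sketch of both shows where the O(log n) versus O(n) difference comes from. It uses std::vector so the snippet is self-contained; the course's own Vector type would look much the same, and the function names are just illustrative.

#include <vector>
using namespace std;

// Linear search: O(n) -- in the worst case it looks at every element.
int LinearSearch(const vector<int>& v, int key) {
    for (int i = 0; i < (int)v.size(); i++) {
        if (v[i] == key) return i;
    }
    return -1;    // not found
}

// Binary search on sorted data: O(log n) -- each comparison halves the range left to consider.
int BinarySearch(const vector<int>& v, int key) {
    int lo = 0, hi = (int)v.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (v[mid] == key) return mid;
        else if (v[mid] < key) lo = mid + 1;   // discard the lower half
        else hi = mid - 1;                     // discard the upper half
    }
    return -1;    // not found
}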
I get to tell 880 00:44:31,000 --> 00:44:34,000 you a little bit about sorting before 881 00:44:34,000 --> 00:44:36,000 we go away. Because some of this I'm just setting the stage for, like, 882 00:44:36,000 --> 00:44:40,000 okay, how can we use this to do something interesting? Knowing how to compare and contrast, 883 00:44:40,000 --> 00:44:41,000 that tells you something. 884 00:44:41,000 --> 00:44:44,000 Let's try to solve a problem, right? That 885 00:44:44,000 --> 00:44:47,000 lends itself to different approaches that have different algorithmic 886 00:44:47,000 --> 00:44:50,000 results to them that are worth thinking about. 887 00:44:50,000 --> 00:44:54,000 So sorting turns out to be one of the best problems to study for this because it turns out sorting 888 00:44:54,000 --> 00:44:55,000 is 889 00:44:55,000 --> 00:44:58,000 one of the things that computers do a lot of. 890 00:44:58,000 --> 00:45:02,000 It's very, very common that you need to keep data in sorted order. It makes it easier to 891 00:45:02,000 --> 00:45:06,000 search. It makes it easier to find the minimum, the maximum, and find duplicates. All these sorts of things, right? 892 00:45:06,000 --> 00:45:09,000 By keeping it in sorted order, it tends to be more convenient to access that way. 893 00:45:09,000 --> 00:45:11,000 It also makes a bunch of other things, 894 00:45:11,000 --> 00:45:14,000 other properties about the data, easy to get at. 895 00:45:14,000 --> 00:45:18,000 If you want to find the top ten percent, right? You can just pick them off, right? If they've already 896 00:45:18,000 --> 00:45:19,000 897 00:45:19,000 --> 00:45:20,000 been sorted. You want to 898 00:45:20,000 --> 00:45:23,000 group them into buckets for histogramming, right? All these things actually are 899 00:45:23,000 --> 00:45:24,000 enabled by having them 900 00:45:24,000 --> 00:45:28,000 already be in sorted order. So it turns out that a lot of times the data that you get 901 00:45:28,000 --> 00:45:31,000 from other sources tends to first need to be put in sorted order before you start doing 902 00:45:31,000 --> 00:45:33,000 stuff on it. Okay. 903 00:45:33,000 --> 00:45:36,000 Well, there's lots of different ways you can sort it. 904 00:45:36,000 --> 00:45:40,000 There are simple ways. There are complicated ways. There are fancy ways. There are ways that 905 00:45:40,000 --> 00:45:42,000 are dumb, 906 00:45:42,000 --> 00:45:44,000 907 00:45:44,000 --> 00:45:46,000 but easy to write. Ways that are smart, but hard 908 00:45:46,000 --> 00:45:48,000 to write. And everything in between. 909 00:45:48,000 --> 00:45:52,000 So we're gonna be looking at it in terms of sorting vectors because that's probably the most common data structure 910 00:45:52,000 --> 00:45:53,000 that needs the sorting. 911 00:45:53,000 --> 00:45:57,000 But you can also imagine taking the same kinds of algorithms and applying them to different kinds of data 912 00:45:57,000 --> 00:46:00,000 structures you have. Like sorting a linked list. 913 00:46:00,000 --> 00:46:04,000 Okay. So I'm gonna show you a sorting algorithm. 914 00:46:04,000 --> 00:46:06,000 It's probably the simplest and the easiest to write. 915 00:46:06,000 --> 00:46:10,000 If you were stuck on a desert island without a textbook, but you happened to have 916 00:46:10,000 --> 00:46:11,000 a compiler 917 00:46:11,000 --> 00:46:15,000 and you needed to write a sorting algorithm to get your way off the island. It 918 00:46:15,000 --> 00:46:18,000 comes up all the time.
919 00:46:18,000 --> 00:46:20,000 This is probably the algorithm you're gonna use. 920 00:46:20,000 --> 00:46:23,000 It's called selection sort. 921 00:46:23,000 --> 00:46:24,000 And the idea behind selection sort 922 00:46:24,000 --> 00:46:26,000 is that it's going to select 923 00:46:26,000 --> 00:46:28,000 the smallest element 924 00:46:28,000 --> 00:46:30,000 and put it in the front. 925 00:46:30,000 --> 00:46:33,000 So if I have a big stack of test papers 926 00:46:33,000 --> 00:46:36,000 and I want to sort them in order of score, 927 00:46:36,000 --> 00:46:40,000 then I'm gonna go through there and find somebody who got the absolute lowest score. It's sad, but 928 00:46:40,000 --> 00:46:42,000 true. Somebody had to be there. 929 00:46:42,000 --> 00:46:45,000 So I kind of work my way through it and maybe I hold my finger on the one that I've seen that looks 930 00:46:45,000 --> 00:46:49,000 smallest so far. So this one's a 40, oh, 931 00:46:49,000 --> 00:46:50,000 well, this one's a 38. Okay. Oh, look, there's a 35. Oh, 932 00:46:50,000 --> 00:46:54,000 look, there's a 22. Nobody gets these scores in my exams, I'm just kidding. 933 00:46:54,000 --> 00:46:56,000 And then finally get down to the bottom. You say, okay, 934 00:46:56,000 --> 00:47:00,000 that 22, that was the smallest. I have a hold of it, right? And 935 00:47:00,000 --> 00:47:01,000 I'm gonna basically take that 936 00:47:01,000 --> 00:47:03,000 and bring it to the front. 937 00:47:03,000 --> 00:47:05,000 938 00:47:05,000 --> 00:47:07,000 And in terms of managing a vector, 939 00:47:07,000 --> 00:47:11,000 there's actually a slight efficiency to be had by, instead of kind of pulling it out of the 940 00:47:11,000 --> 00:47:15,000 stack and sliding everything down, I'm actually just gonna swap it with the one that's in the 941 00:47:15,000 --> 00:47:18,000 very front. So whoever was the current first one is gonna get booted 942 00:47:18,000 --> 00:47:22,000 out to make room for this one and I'm gonna put them back where this other one came from. 943 00:47:22,000 --> 00:47:26,000 So I have moved the very smallest to the top of the vector. 944 00:47:26,000 --> 00:47:28,000 Then I just do the same thing again, 945 00:47:28,000 --> 00:47:32,000 but now excluding that smallest one. So I've already seen that one and kind of set it aside. I start looking 946 00:47:32,000 --> 00:47:35,000 at the remaining n minus one and I do the same thing again. 947 00:47:35,000 --> 00:47:38,000 Find the smallest of what remains, kind of just walking through, 948 00:47:38,000 --> 00:47:39,000 going to find it, 949 00:47:39,000 --> 00:47:44,000 and swap it down into the second slot, and so on. 950 00:47:44,000 --> 00:47:50,000 A little bit of code. 951 00:47:50,000 --> 00:47:55,000 Doing it kind of the way I described it, right? Of tracking the minimum index, right? Looking from 952 00:47:55,000 --> 00:47:59,000 here to the end, so this for loop in the middle here starts at the current position and only considers 953 00:47:59,000 --> 00:48:01,000 from here 954 00:48:01,000 --> 00:48:02,000 to the very end of the 955 00:48:02,000 --> 00:48:05,000 vector and then finds 956 00:48:05,000 --> 00:48:09,000 any element which is smaller than the one I'm kind of currently holding my finger on 957 00:48:09,000 --> 00:48:13,000 and then when I'm done, after that inner loop has executed, I will know what the min index is 958 00:48:13,000 --> 00:48:15,000 and then there's a swap function here 959 00:48:15,000 --> 00:48:17,000 that just exchanges those two things.
960 00:48:17,000 --> 00:48:19,000 The I slot now gets replaced with the one 961 00:48:19,000 --> 00:48:22,000 at min index and, 962 00:48:22,000 --> 00:48:25,000 again, exchanging the contents in the array. I just do that 963 00:48:25,000 --> 00:48:28,000 n minus one times and I will have done it all. I'd like to show 964 00:48:28,000 --> 00:48:32,000 you the animation, but I guess I'll just wait for that until Wednesday. We will 965 00:48:32,000 --> 00:48:36,000 watch it doing its thing in progress and you can kind of be thinking about for Wednesday what other 966 00:48:36,000 --> 00:48:37,000 ways you might be able to sort 967 00:48:37,000 --> 00:48:42,000 data than just this strategy.
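A minimal C++ version of the selection sort just described, sketched with std::vector and std::swap standing in for the course's own Vector type and Swap helper. The structure follows the description above: an outer loop over positions, an inner loop that tracks the minimum index from the current position to the end, and one exchange per pass.

#include <vector>
#include <utility>   // std::swap
using namespace std;

// Selection sort: on pass i, find the smallest element in v[i..n-1]
// and swap it into slot i. The outer loop runs n - 1 times,
// giving roughly n^2 / 2 comparisons overall, i.e. O(n^2).
void SelectionSort(vector<int>& v) {
    int n = (int)v.size();
    for (int i = 0; i < n - 1; i++) {
        int minIndex = i;                          // finger on the smallest seen so far
        for (int j = i + 1; j < n; j++) {          // only consider from position i to the end
            if (v[j] < v[minIndex]) minIndex = j;
        }
        swap(v[i], v[minIndex]);                   // exchange it into the front of the unsorted part
    }
}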