1 00:00:10,029 --> 00:00:12,788 2 00:00:12,788 --> 00:00:16,059 This presentation is delivered by the Stanford Center for Professional 3 00:00:16,059 --> 00:00:23,059 Development. 4 00:00:25,879 --> 00:00:28,068 So welcome to the 5 00:00:28,068 --> 00:00:30,178 106A lecture just before Thanksgiving. 6 00:00:30,178 --> 00:00:33,518 I thought attendance might not be spectacular today. People are still 7 00:00:33,518 --> 00:00:34,958 filtering in, but it's good to see 8 00:00:34,959 --> 00:00:38,950 so many of you here, though. I'm sure some of you have already headed home. 9 00:00:38,950 --> 00:00:39,980 You may be wondering 10 00:00:39,979 --> 00:00:44,679 who I am. I would have thought before we got back mid-quarter evaluations that 11 00:00:44,679 --> 00:00:46,789 you stood a chance of 12 00:00:46,789 --> 00:00:49,679 recognizing me as the TA of your class, but 13 00:00:49,679 --> 00:00:52,679 the comment of more than half of the people who responded 14 00:00:52,679 --> 00:00:57,210 to the question of how is Ben doing was, gI don't know Ben. I've never interacted 15 00:00:57,210 --> 00:00:57,829 with 16 00:00:57,829 --> 00:00:58,820 17 00:00:58,820 --> 00:01:01,780 Ben. So I assume he's doing a great job.h Most of you 18 00:01:01,780 --> 00:01:04,260 jumped to that conclusion. But - 19 00:01:04,260 --> 00:01:05,150 so 20 00:01:05,150 --> 00:01:07,420 I sort of chased at this, but not much. 21 00:01:07,420 --> 00:01:12,759 I realize, though, that I would enjoy the opportunity to develop a little 22 00:01:12,759 --> 00:01:13,748 more exposure. 23 00:01:13,748 --> 00:01:17,019 And so Maron and I decided to switch roles for today. 24 00:01:17,019 --> 00:01:21,158 In fact, he's making copies as we speak, and there's going to be one handout that's 25 00:01:21,159 --> 00:01:22,180 sort of pertinent 26 00:01:22,180 --> 00:01:24,118 to what you'll be doing 27 00:01:24,118 --> 00:01:26,370 for [inaudible] the next assignment. 28 00:01:26,370 --> 00:01:28,950 And he should be back any time, but I certainly can't blame him for 29 00:01:28,950 --> 00:01:29,789 being - 30 00:01:29,789 --> 00:01:34,650 running a little late since I was supposed to make the copies before class myself. 31 00:01:34,650 --> 00:01:39,079 So anyway, in my capacity as Maron, I don't hope to be as spontaneous 32 00:01:39,078 --> 00:01:43,199 or as funny as he is, but I do hope to talk about something that is a 33 00:01:43,200 --> 00:01:45,790 little bit off of the sort of charted path 34 00:01:45,790 --> 00:01:48,939 of CS106A's curriculum. 35 00:01:48,938 --> 00:01:51,298 So one way of describing 36 00:01:51,299 --> 00:01:54,000 the things that you're learning in 106A 37 00:01:54,000 --> 00:01:55,819 is that you're learning how to take 38 00:01:55,819 --> 00:01:59,299 the ways that you already know of doing things that may 39 00:01:59,299 --> 00:02:00,739 take you some time 40 00:02:00,739 --> 00:02:03,349 and maybe error-prone in the doing of them 41 00:02:03,349 --> 00:02:07,298 and tell a computer precisely how to do them so that you reap the benefits 42 00:02:07,299 --> 00:02:07,950 43 00:02:07,950 --> 00:02:11,009 of a machine being able to execute those instructions. And 44 00:02:11,008 --> 00:02:13,818 they'll be faster. There'll be fewer errors as long as you 45 00:02:13,818 --> 00:02:15,318 get the instructions right 46 00:02:15,318 --> 00:02:18,969 in the first place. And that's got to be an exhilarating feeling. It's got to be 47 00:02:18,969 --> 00:02:23,128 empowering, or at least it was when I was sitting in your place as a freshman, 48 00:02:23,128 --> 00:02:24,048 to know that, 49 00:02:24,049 --> 00:02:27,739 all of a sudden, if you have a good idea - if you can come up with a 50 00:02:27,739 --> 00:02:28,700 procedure for solving 51 00:02:28,699 --> 00:02:32,549 a problem that you've always had and tell a computer how to do it, then 52 00:02:32,550 --> 00:02:35,419 the computer will be able to do that over and over and over again. All right? 53 00:02:35,419 --> 00:02:36,140 So 54 00:02:36,139 --> 00:02:39,148 that's 106A in a nutshell. So it's sort of - 55 00:02:39,149 --> 00:02:42,669 it's not a perfect metaphor. What does it mean to 56 00:02:42,669 --> 00:02:46,709 run breakout by hands? I don't know. 57 00:02:46,709 --> 00:02:49,968 So what I - the distinction I'm trying to make between 58 00:02:49,968 --> 00:02:53,378 what you've been learning so far and what you will learn, hopefully, in this 59 00:02:53,378 --> 00:02:59,278 lecture - and certainly if you go on to CS106B - is that 60 00:02:59,278 --> 00:03:02,588 there are things that computers can do that you couldn't possibly have 61 00:03:02,588 --> 00:03:04,368 simulated by hand. 62 00:03:04,368 --> 00:03:06,318 Computers can handle so much more data 63 00:03:06,318 --> 00:03:12,708 than you could ever manage to sort through on your own manually 64 00:03:12,709 --> 00:03:14,330 that it's worth you're learning 65 00:03:14,330 --> 00:03:17,690 a bit about how to make them do that - how to think about 66 00:03:17,689 --> 00:03:19,318 instructing a computer to do something that 67 00:03:19,318 --> 00:03:22,858 you couldn't possibly have done. So today, we're going to talk about 68 00:03:22,859 --> 00:03:26,079 two of the most important 69 00:03:26,079 --> 00:03:28,510 kinds of those problems in computer science, and 70 00:03:28,509 --> 00:03:32,528 I'm going to stop just talking at you and show you some examples 71 00:03:32,528 --> 00:03:35,109 in a second. But those two topics are searching 72 00:03:35,110 --> 00:03:36,039 and sorting. 73 00:03:36,038 --> 00:03:40,319 How do you find something in a set of things that you have? And how, 74 00:03:40,319 --> 00:03:42,878 if you could impose some structure on that set, 75 00:03:42,878 --> 00:03:46,498 would you both find it more quickly, and - I 76 00:03:46,498 --> 00:03:49,739 guess I got ahead of myself - how would you go about imposing 77 00:03:49,739 --> 00:03:51,020 the sort of structure that 78 00:03:51,020 --> 00:03:54,028 would help you find it more 79 00:03:54,028 --> 00:03:56,509 quickly? All right. So if I'm seeming to ramble, 80 00:03:56,509 --> 00:04:01,038 that stops now. 81 00:04:01,038 --> 00:04:03,218 So searching and sorting - 82 00:04:03,218 --> 00:04:07,038 I say they are important, and I hope you believe me, and you think that for that 83 00:04:07,038 --> 00:04:10,808 reason, it's worth talking about for an entire lecture. But they really are 84 00:04:10,808 --> 00:04:11,669 just 85 00:04:11,669 --> 00:04:14,839 a way of getting to talk about something a little deeper, which is this 86 00:04:14,840 --> 00:04:17,329 concept of algorithmic efficiency 87 00:04:17,329 --> 00:04:18,298 that we haven't said 88 00:04:18,298 --> 00:04:21,959 much about in the class so far. So that's the third part of this two-part 89 00:04:21,959 --> 00:04:24,168 lecture. 90 00:04:24,168 --> 00:04:27,339 What's the deal with searching? Well, searching 91 00:04:27,339 --> 00:04:30,909 doesn't quite fit the mold as something you couldn't do by hand. You 92 00:04:30,910 --> 00:04:33,840 find things all the time. 93 00:04:33,839 --> 00:04:35,329 Chapter 12 of the book 94 00:04:35,329 --> 00:04:39,709 looks at two operations of the searching and sorting. This 95 00:04:39,709 --> 00:04:42,468 is pretty generic. All 96 00:04:42,468 --> 00:04:44,639 right. So searching is simpler. 97 00:04:44,639 --> 00:04:47,019 You can define a search problem - 98 00:04:47,019 --> 00:04:49,118 say you have an array 99 00:04:49,119 --> 00:04:53,149 or some other collection of things, and you have something you want to find. 100 00:04:53,149 --> 00:04:55,278 The typical way this is done is that 101 00:04:55,278 --> 00:04:59,269 you want to find the index into the array - where that element was - 102 00:04:59,269 --> 00:04:59,909 or, 103 00:04:59,910 --> 00:05:01,439 104 00:05:01,439 --> 00:05:05,528 in a more generic case, you want to find out if that element is even in the array. 105 00:05:05,528 --> 00:05:07,649 So you want to have some way of determining 106 00:05:07,649 --> 00:05:11,318 that it wasn't actually found. 107 00:05:11,319 --> 00:05:13,090 And that may be all you care about. 108 00:05:13,089 --> 00:05:16,508 It may be the case that, 109 00:05:16,509 --> 00:05:18,419 if the search 110 00:05:18,418 --> 00:05:20,250 routine returns anything other than 111 00:05:20,250 --> 00:05:23,738 negative one, you don't actually care what its value was. But we 112 00:05:23,738 --> 00:05:27,388 adopt - I will adopt the convention for this lecture that if you don't find what 113 00:05:27,389 --> 00:05:30,490 you're searching for, then the method that you wrote to do the search 114 00:05:30,490 --> 00:05:32,908 should return a negative value since that's not 115 00:05:32,908 --> 00:05:39,178 a valid index into the array. All right. So 116 00:05:39,178 --> 00:05:40,818 if you have a set of things that 117 00:05:40,819 --> 00:05:42,968 you're trying to search through, 118 00:05:42,968 --> 00:05:45,969 the easiest way of doing that is just to look at all of them 119 00:05:45,970 --> 00:05:47,920 and see if each one, in turn, 120 00:05:47,920 --> 00:05:49,009 is 121 00:05:49,009 --> 00:05:50,009 what you're looking for. 122 00:05:50,009 --> 00:05:52,479 That's called a linear search 123 00:05:52,478 --> 00:05:53,818 because, to 124 00:05:53,819 --> 00:05:57,620 sort of pre-figure what we're going to talk about at the end of the lecture, 125 00:05:57,620 --> 00:06:01,340 the number of operations you have to do to find the element that you're looking 126 00:06:01,339 --> 00:06:03,019 for is 127 00:06:03,019 --> 00:06:06,338 linearly proportional to the number of elements that you had to begin with. Now it 128 00:06:06,338 --> 00:06:08,600 may not be obvious right now that there's 129 00:06:08,600 --> 00:06:12,129 something better that you can do, and with an arbitrary set of data, there 130 00:06:12,129 --> 00:06:15,179 is not anything better that you can do. 131 00:06:15,178 --> 00:06:16,769 132 00:06:16,769 --> 00:06:21,870 But this procedure that I've written up here is an implementation of what 133 00:06:21,870 --> 00:06:22,750 you 134 00:06:22,750 --> 00:06:25,379 already could have written - a 135 00:06:25,379 --> 00:06:30,279 procedure for finding an element in an array of integers. 136 00:06:30,279 --> 00:06:32,799 So there's nothing very tricky about this. 137 00:06:32,800 --> 00:06:36,180 If it had been a string, you might have even avoided having to write the 138 00:06:36,180 --> 00:06:39,709 function and called it something like, gIndex ofh with a character. And 139 00:06:39,709 --> 00:06:40,939 inside of gIndex ofh, 140 00:06:40,939 --> 00:06:44,759 though you don't have to write the code, it would have had a four loop that 141 00:06:44,759 --> 00:06:47,229 iterated over each of the characters of the string, 142 00:06:47,228 --> 00:06:50,618 tested each one for quality with the key that you were looking for and then, 143 00:06:50,619 --> 00:06:52,000 presuming it found 144 00:06:52,000 --> 00:06:52,968 one that equaled 145 00:06:52,968 --> 00:06:54,209 that character - or 146 00:06:54,209 --> 00:06:57,649 in this case, that integer - it would return the index of it. 147 00:06:57,649 --> 00:07:01,060 And if the four loop completes, then you must not have examined any elements 148 00:07:01,060 --> 00:07:02,658 that matched, 149 00:07:02,658 --> 00:07:06,579 and so you return negative one. Okay? 150 00:07:06,579 --> 00:07:07,859 So 151 00:07:07,860 --> 00:07:09,550 here's a simulation 152 00:07:09,550 --> 00:07:14,408 of how that would work, although the - 153 00:07:14,408 --> 00:07:19,250 leaving nothing to the imagination here, this is pretty easy to sort of envision. So 154 00:07:19,250 --> 00:07:22,160 we have this array of primes, 155 00:07:22,160 --> 00:07:24,550 and it has 156 00:07:24,550 --> 00:07:28,579 17 in it. So we are going to expect to find that this linear search 157 00:07:28,579 --> 00:07:30,079 shouldn't return 158 00:07:30,079 --> 00:07:31,978 negative one. But the second one, 159 00:07:31,978 --> 00:07:35,839 we're looking for 27 - which sort of looks prime, but 160 00:07:35,839 --> 00:07:37,929 of course, it's not because it's 161 00:07:37,930 --> 00:07:40,259 nine times three. 162 00:07:40,259 --> 00:07:47,259 All right? Okay. So 163 00:07:47,529 --> 00:07:51,269 we called linear search, and here's our new stat frame, which has the 164 00:07:51,269 --> 00:07:55,028 local variables I and the parameters P and array 165 00:07:55,028 --> 00:07:57,548 in it. This is sort of Eric Robert's 166 00:07:57,548 --> 00:08:01,138 - who wrote the textbooks - way of 167 00:08:01,139 --> 00:08:02,970 displaying the execution model of 168 00:08:02,970 --> 00:08:03,830 JAVA. All 169 00:08:03,829 --> 00:08:07,188 right? So we're just looping over it, and we're testing 170 00:08:07,189 --> 00:08:10,528 as I equals zero and then one and 171 00:08:10,528 --> 00:08:13,478 then two and then three and four, five, six - 172 00:08:13,478 --> 00:08:14,848 where should we find it? 173 00:08:14,848 --> 00:08:16,498 Well, we were looking for seven - 174 00:08:16,499 --> 00:08:20,000 no. Right now, we're looking for 27. So 175 00:08:20,000 --> 00:08:24,129 we're just gonna go to the - I think 176 00:08:24,129 --> 00:08:25,679 there was a problem with the way that - 177 00:08:25,678 --> 00:08:26,728 well, anyway. 178 00:08:26,728 --> 00:08:27,620 179 00:08:27,620 --> 00:08:29,978 The console here prints out that, 180 00:08:29,978 --> 00:08:33,029 indeed, we found 17 at position six. 181 00:08:33,029 --> 00:08:34,360 There may have been 182 00:08:34,360 --> 00:08:36,810 a problem converting 183 00:08:36,809 --> 00:08:39,929 PowerPoint slides to Keynote slides, so I apologize for the 184 00:08:39,929 --> 00:08:41,269 sort of shakiness of 185 00:08:41,269 --> 00:08:48,269 these animations. But the content of the slides should be adequate. Cool. 186 00:08:48,879 --> 00:08:51,710 All right. So now we're in our second called linear search, and the only 187 00:08:51,710 --> 00:08:54,710 thing that's changed at this point from the beginning of the first called was that 188 00:08:54,710 --> 00:08:55,989 the key is different. 189 00:08:55,989 --> 00:09:00,329 So I is still gonna start at zero, and it's still gonna iterate over all of the 190 00:09:00,328 --> 00:09:01,370 positions of the array. 191 00:09:01,370 --> 00:09:03,580 But we're gonna go through the entire array, 192 00:09:03,580 --> 00:09:05,869 which means going all the way up to ten where I 193 00:09:05,869 --> 00:09:08,139 is no longer less than ten. 194 00:09:08,139 --> 00:09:09,928 And we didn't find 27 anywhere. 195 00:09:09,928 --> 00:09:12,949 So I hope this is all 196 00:09:12,950 --> 00:09:13,759 fairly clear. 197 00:09:13,759 --> 00:09:16,240 But these are the results that you would have expected. 198 00:09:16,240 --> 00:09:18,740 So we found the position of 17, and then we 199 00:09:18,740 --> 00:09:20,919 looked for 27 but didn't find it 200 00:09:20,919 --> 00:09:26,299 and returned negative one. Cool. 201 00:09:26,299 --> 00:09:27,379 How many people think 202 00:09:27,379 --> 00:09:28,639 they could have written that 203 00:09:28,639 --> 00:09:31,330 in their sleep? That that 204 00:09:31,330 --> 00:09:36,150 was too much time to spend on such a simple idea? All right. Well, 205 00:09:36,149 --> 00:09:38,449 it gets more interesting from here on out, 206 00:09:38,450 --> 00:09:40,230 but 207 00:09:40,230 --> 00:09:41,529 - 208 00:09:41,529 --> 00:09:47,279 so talk to me. What is the problem with linear search? It takes a lot of time. 209 00:09:47,279 --> 00:09:51,059 It takes a lot of time, right? It takes as much time as you have elements. 210 00:09:51,059 --> 00:09:52,079 So 211 00:09:52,080 --> 00:09:53,330 212 00:09:53,330 --> 00:09:55,720 if I asked a question, 213 00:09:55,720 --> 00:09:57,480 which was 214 00:09:57,480 --> 00:09:59,340 - if I knew the 215 00:09:59,340 --> 00:10:01,389 SUID number of 216 00:10:01,389 --> 00:10:02,830 one of you in the room, 217 00:10:02,830 --> 00:10:05,100 and I asked, 218 00:10:05,100 --> 00:10:07,230 gIs this your SUID number? 219 00:10:07,230 --> 00:10:09,740 And you say, 220 00:10:09,740 --> 00:10:10,870 221 00:10:10,870 --> 00:10:14,779 gNo.h And I ask you, gIs 05179164 your SUID number?h 222 00:10:14,779 --> 00:10:16,399 Is it yours? 223 00:10:16,399 --> 00:10:19,720 Be kind. Tell me it's yours. [Inaudible]. It's not. 224 00:10:19,720 --> 00:10:22,690 So it actually turns out that it's mine, so I would have had to search 225 00:10:22,690 --> 00:10:27,460 the entire room before I finally got back to myself if I didn't make that 226 00:10:27,460 --> 00:10:29,460 shortcut. 227 00:10:29,460 --> 00:10:33,189 Okay. So another illustration of what is really a very simple idea - 228 00:10:33,188 --> 00:10:38,099 that if you're forced to look at all of the elements that you are searching 229 00:10:38,100 --> 00:10:39,069 among, 230 00:10:39,068 --> 00:10:40,659 then 231 00:10:40,659 --> 00:10:42,490 232 00:10:42,490 --> 00:10:45,840 that's bad in and of itself if you could do better. So 233 00:10:45,840 --> 00:10:49,070 we'll see if we can do better, but 234 00:10:49,070 --> 00:10:50,690 here is 235 00:10:50,690 --> 00:10:54,490 a rather larger data set - 286 236 00:10:54,490 --> 00:10:57,889 area codes - the three-digit numbers that come before the 237 00:10:57,889 --> 00:11:01,659 rest of the seven digits in your phone numbers. But 238 00:11:01,659 --> 00:11:05,089 we're trying to find 650, which is 239 00:11:05,090 --> 00:11:08,269 this county's area code. So here's a larger array. 240 00:11:08,269 --> 00:11:10,299 What would happen if we 241 00:11:10,299 --> 00:11:14,179 were searching in this? Well, it depends. If we have a fast computer, it may not 242 00:11:14,179 --> 00:11:17,519 matter. These are - there's only about 300 numbers here, and 243 00:11:17,519 --> 00:11:21,700 doing something 300 times is something that computers are 244 00:11:21,700 --> 00:11:23,020 reasonably good at. 245 00:11:23,019 --> 00:11:24,179 246 00:11:24,179 --> 00:11:25,649 But 247 00:11:25,649 --> 00:11:30,939 I thought I'd put together this sort of 248 00:11:30,940 --> 00:11:32,560 compelling example of why 249 00:11:32,559 --> 00:11:37,159 this can become a bad idea. All 250 00:11:37,159 --> 00:11:38,969 right. So I wrote 251 00:11:38,970 --> 00:11:40,320 a program 252 00:11:40,320 --> 00:11:41,780 in JAVA, 253 00:11:41,779 --> 00:11:44,569 254 00:11:44,570 --> 00:11:48,510 which has this implementation of linear search that we were just looking at. 255 00:11:48,509 --> 00:11:51,179 And the difference that you may notice is that 256 00:11:51,179 --> 00:11:55,158 I have an instance variable, which is sort of behaving as though it were 257 00:11:55,158 --> 00:11:56,448 an array, and 258 00:11:56,448 --> 00:12:00,808 you'll see why that is in a second. But all that I need it for is to get values 259 00:12:00,808 --> 00:12:04,948 from it at certain positions, and I wanna know what its length is so that 260 00:12:04,948 --> 00:12:06,449 I can iterate over it. 261 00:12:06,450 --> 00:12:10,439 And I'm also - so why is it called DIS? It's short for display, 262 00:12:10,438 --> 00:12:13,929 and it's this - an instance of this number display class that I've written. 263 00:12:13,929 --> 00:12:17,149 And if there's time and interest, I'll talk about how that 264 00:12:17,149 --> 00:12:18,230 was implemented. 265 00:12:18,230 --> 00:12:19,740 But it's intended to sort of 266 00:12:19,740 --> 00:12:22,220 mimic the behavior of an array and, 267 00:12:22,220 --> 00:12:25,730 at the same time, show you 268 00:12:25,730 --> 00:12:27,920 a linear search in action. 269 00:12:27,919 --> 00:12:32,479 All right. So 270 00:12:32,480 --> 00:12:35,250 this is a dialogue program that's just 271 00:12:35,250 --> 00:12:39,980 reading line to ask me for some input. 272 00:12:39,980 --> 00:12:43,200 And here are all of those numbers. 273 00:12:43,200 --> 00:12:47,060 So I slowed it down - even if this looks like it's going relatively fast - 274 00:12:47,059 --> 00:12:49,949 so that you could see it sequentially considering each of these 275 00:12:49,950 --> 00:12:54,300 numbers. All right? 276 00:12:54,299 --> 00:12:55,689 You may notice that 277 00:12:55,690 --> 00:12:56,760 you can - 278 00:12:56,759 --> 00:13:00,799 you could probably beat this. Right? 279 00:13:00,799 --> 00:13:03,569 It's going slowly enough that, 280 00:13:03,570 --> 00:13:04,770 since 281 00:13:04,769 --> 00:13:09,009 these numbers are already in sorted order, 282 00:13:09,009 --> 00:13:11,049 you can jump ahead. So what is it - 283 00:13:11,049 --> 00:13:13,929 I want you to think about - while you're waiting on this to finish, I want 284 00:13:13,929 --> 00:13:16,500 you to think about what it is that you're doing 285 00:13:16,500 --> 00:13:17,918 when you search for the number 286 00:13:17,918 --> 00:13:22,329 in this list of numbers that happens to be sorted, which allows you to 287 00:13:22,330 --> 00:13:24,080 go so much more quickly 288 00:13:24,080 --> 00:13:28,530 than the computer. All right. 289 00:13:28,529 --> 00:13:31,990 So why didn't we find 650 here? [Inaudible]. 290 00:13:31,990 --> 00:13:38,990 It wasn't even there? Yeah. Tricky. Let's fix that. 291 00:13:40,269 --> 00:13:41,750 Where do we wanna stick it? 292 00:13:41,750 --> 00:13:43,178 Probably between 630 - they 293 00:13:43,178 --> 00:13:45,899 skipped whole swats of - 294 00:13:45,899 --> 00:13:48,740 I just pulled this off of the Internet, which is 295 00:13:48,740 --> 00:13:49,690 notoriously 296 00:13:49,690 --> 00:13:52,820 unreliable. All right. 297 00:13:52,820 --> 00:13:56,100 They probably aren't gonna let this run all the way through the next time. But 298 00:13:56,100 --> 00:13:58,769 650, as you will notice, 299 00:13:58,769 --> 00:14:02,299 is now right here. So it would eventually be found. 300 00:14:02,299 --> 00:14:04,669 I'll let that tell me when it's finished. 301 00:14:04,669 --> 00:14:06,849 So now I'm going to 302 00:14:06,850 --> 00:14:08,290 303 00:14:08,289 --> 00:14:12,049 switch over to a different kind of search, which I hope has something to do 304 00:14:12,049 --> 00:14:12,599 with 305 00:14:12,600 --> 00:14:14,389 the way a 306 00:14:14,389 --> 00:14:17,470 human looks at a sorted array and finds the element that 307 00:14:17,470 --> 00:14:21,820 they're looking for quickly. But - no. 308 00:14:21,820 --> 00:14:25,270 Help me out before I just show you that implementation. 309 00:14:25,269 --> 00:14:29,429 What is the insight that lets you find elements more quickly if there's some 310 00:14:29,429 --> 00:14:32,829 structure - if there's a sorted order to the elements that you're 311 00:14:32,830 --> 00:14:39,830 looking for? Go ahead. [Inaudible]. 312 00:14:43,759 --> 00:14:46,689 Okay. So if I start with the first number in the array, 313 00:14:46,690 --> 00:14:48,160 and I ask myself, 314 00:14:48,159 --> 00:14:52,610 gIs the number I'm looking for higher or lower than the first number in the 315 00:14:52,610 --> 00:14:53,720 array?h 316 00:14:53,720 --> 00:14:57,399 What's the answer probably going to be? [Inaudible]. 317 00:14:57,399 --> 00:15:00,689 It's probably higher if it's a sorted array, right? 318 00:15:00,690 --> 00:15:03,760 So I know that's not quite what you were getting at, but I'm pushing on 319 00:15:03,759 --> 00:15:04,330 320 00:15:04,330 --> 00:15:05,600 the definition 321 00:15:05,600 --> 00:15:07,079 of this procedure 322 00:15:07,078 --> 00:15:09,849 so that you - 323 00:15:09,850 --> 00:15:12,700 would you like to add a clarification to 324 00:15:12,700 --> 00:15:16,970 what you were intending? [Inaudible]. 325 00:15:16,970 --> 00:15:19,330 Yeah. The number in the middle. Okay? So 326 00:15:19,330 --> 00:15:21,290 if you look at the number in the middle, 327 00:15:21,289 --> 00:15:24,659 and you see that that number is less than the number you're looking for, 328 00:15:24,659 --> 00:15:29,339 there's a whole set of numbers that you now don't have to worry about. Right? It's 329 00:15:29,340 --> 00:15:30,939 all of those numbers 330 00:15:30,938 --> 00:15:32,319 up to and including 331 00:15:32,320 --> 00:15:35,240 the number in the middle of the array. Right? 332 00:15:35,240 --> 00:15:37,320 So now you're looking at half of the array, 333 00:15:37,320 --> 00:15:41,510 and in some sense, your problem is simpler. In fact, it's a lot simpler 334 00:15:41,509 --> 00:15:42,700 because if you apply 335 00:15:42,700 --> 00:15:45,560 the same procedure again, 336 00:15:45,559 --> 00:15:48,389 then you can cut that remaining half in half, 337 00:15:48,389 --> 00:15:51,289 and you can cut the remaining half in half. Right? 338 00:15:51,289 --> 00:15:53,208 So this kind of searching 339 00:15:53,208 --> 00:15:56,599 is called binary searching because you make a binary choice at each decision 340 00:15:56,600 --> 00:15:58,560 point, 341 00:15:58,559 --> 00:16:01,879 and you can then cut the space that you're searching in half. 342 00:16:01,879 --> 00:16:08,070 So what would that look like if we wrote it out in code? Well, 343 00:16:08,070 --> 00:16:10,569 if you trust me, then it would look something like this. 344 00:16:10,568 --> 00:16:17,568 And the insight behind these admittedly too-short variable names 345 00:16:18,330 --> 00:16:20,120 is that we're keeping track 346 00:16:20,120 --> 00:16:22,769 of the LH or left-hand side 347 00:16:22,769 --> 00:16:25,750 of the portion of the array that we're considering 348 00:16:25,750 --> 00:16:28,019 and also the right-hand side. 349 00:16:28,019 --> 00:16:30,710 So if you look at these two initializations, it should make sense 350 00:16:30,710 --> 00:16:31,410 that 351 00:16:31,409 --> 00:16:34,969 you start with the left-hand side as far left as it can be at zero and the right-hand 352 00:16:34,970 --> 00:16:35,750 side 353 00:16:35,750 --> 00:16:39,568 at the index of the very last element in the array, which is one less than the 354 00:16:39,568 --> 00:16:41,399 length of the array. Right? 355 00:16:41,399 --> 00:16:42,309 And so 356 00:16:42,309 --> 00:16:45,539 you may already be able to anticipate that we're going to sort of move these in 357 00:16:45,539 --> 00:16:47,259 closer and closer to each other 358 00:16:47,259 --> 00:16:50,139 until they identify an array that has only one element 359 00:16:50,139 --> 00:16:53,500 or, potentially, zero elements, in which case we would decide that we were not 360 00:16:53,500 --> 00:16:57,889 able to find the element that we were looking for. All right? So 361 00:16:57,889 --> 00:16:59,250 we're going to do this 362 00:16:59,250 --> 00:17:00,610 halving procedure 363 00:17:00,610 --> 00:17:02,600 until it is the case that 364 00:17:02,600 --> 00:17:07,370 LH and RH have collided with each other in some sense. And 365 00:17:07,369 --> 00:17:07,989 366 00:17:07,990 --> 00:17:10,798 the idea of getting the middle element is captured 367 00:17:10,798 --> 00:17:12,199 by this next line here 368 00:17:12,199 --> 00:17:17,610 where we take the average of left hand and right hand. Now 369 00:17:17,609 --> 00:17:21,149 searching is such an important operation, and it's so often 370 00:17:21,150 --> 00:17:24,920 needed for very large data 371 00:17:24,920 --> 00:17:27,480 problems that 372 00:17:27,480 --> 00:17:31,410 talking to people in industry who have to do a lot of 373 00:17:31,410 --> 00:17:33,620 searching - you may hear someone say that 374 00:17:33,619 --> 00:17:34,418 there's a bug 375 00:17:34,419 --> 00:17:36,330 on this line. 376 00:17:36,329 --> 00:17:38,189 And it's the sort of bug that 377 00:17:38,190 --> 00:17:41,289 has nothing to do with computer science in the abstract, but it's just one of those 378 00:17:41,289 --> 00:17:42,859 details that 379 00:17:42,859 --> 00:17:43,990 is unfortunate - 380 00:17:43,990 --> 00:17:45,329 the bug being that 381 00:17:45,329 --> 00:17:48,710 you don't really have to add LH to 382 00:17:48,710 --> 00:17:51,019 RH before dividing them. You can divide 383 00:17:51,019 --> 00:17:52,858 each one by two and then add 384 00:17:52,858 --> 00:17:55,369 their halves to get the average. 385 00:17:55,369 --> 00:17:57,908 This is just sort of a side comment, 386 00:17:57,909 --> 00:18:01,140 but why might it be problematic for you to 387 00:18:01,140 --> 00:18:05,570 add them together? All right. If 388 00:18:05,569 --> 00:18:11,099 this rings no bells - why would you say? [Inaudible]. 389 00:18:11,099 --> 00:18:12,000 Integer division? 390 00:18:12,000 --> 00:18:12,549 That's - 391 00:18:12,549 --> 00:18:15,940 I think if you work with some small representative cases, you'll 392 00:18:15,940 --> 00:18:17,000 see that that 393 00:18:17,000 --> 00:18:18,980 doesn't prove to be a problem, although that's 394 00:18:18,980 --> 00:18:19,929 a good insight. 395 00:18:19,929 --> 00:18:24,550 But there is a maximum value - yeah? What if they're too large? 396 00:18:24,549 --> 00:18:25,799 What if they're too large? 397 00:18:25,799 --> 00:18:29,500 So an integer can be as large as - an [inaudible] integer can be as large as, 398 00:18:29,500 --> 00:18:30,589 like, four billion. 399 00:18:30,589 --> 00:18:32,158 But what if each one of LH 400 00:18:32,159 --> 00:18:33,970 and RH is, at some point, 401 00:18:33,970 --> 00:18:35,850 three billion or greater? Right? And 402 00:18:35,849 --> 00:18:39,119 there's going to be a problem adding them together. 403 00:18:39,119 --> 00:18:40,519 But we've already spent too long 404 00:18:40,519 --> 00:18:42,960 talking about that. I would just throw it out as sort of 405 00:18:42,960 --> 00:18:44,159 an interesting detail. 406 00:18:44,159 --> 00:18:46,010 But after 407 00:18:46,009 --> 00:18:49,660 many, many years of using a binary search that looks exactly like 408 00:18:49,660 --> 00:18:52,540 this, only recently have people begun to realize that 409 00:18:52,539 --> 00:18:54,579 little problems like that can sort of 410 00:18:54,579 --> 00:18:58,099 spell disaster with really, really big problems - 411 00:18:58,099 --> 00:19:01,500 really, really big sets of data. Okay. So anyway. 412 00:19:01,500 --> 00:19:05,369 That is somewhat an immaterial point. We now have the index of the middle of the 413 00:19:05,369 --> 00:19:05,918 array - 414 00:19:05,919 --> 00:19:08,940 or at least the middle of the portion of the array that we're considering right 415 00:19:08,940 --> 00:19:10,170 now - 416 00:19:10,170 --> 00:19:13,630 and we ask our sort of array-like object 417 00:19:13,630 --> 00:19:15,980 what number is at that index. 418 00:19:15,980 --> 00:19:19,730 And if it's equal to our key, then we're done. We can just return that we found 419 00:19:19,730 --> 00:19:22,990 the index that we - of 420 00:19:22,990 --> 00:19:24,660 something that matched the key. 421 00:19:24,660 --> 00:19:28,440 So in the linear search case, if there were multiple copies of the key that we 422 00:19:28,440 --> 00:19:32,640 were looking for, which one would be returned? 423 00:19:32,640 --> 00:19:33,910 The first one, 424 00:19:33,910 --> 00:19:40,910 right? Do you have any insight yet about which one might be returned in this case? It's 425 00:19:42,230 --> 00:19:43,680 not at all clear to me. 426 00:19:43,680 --> 00:19:45,009 And in fact, 427 00:19:45,009 --> 00:19:50,480 binary searches generally don't guarantee anything about 428 00:19:50,480 --> 00:19:55,849 which element they return if there are multiple copies. Anyway. 429 00:19:55,849 --> 00:19:57,819 Assuming that we don't get lucky, though, 430 00:19:57,819 --> 00:20:02,139 and that we don't find the element that we were looking for on our first try, 431 00:20:02,140 --> 00:20:05,850 then we have to modify the size - the part of the array that we're looking 432 00:20:05,849 --> 00:20:06,338 at. 433 00:20:06,338 --> 00:20:07,539 So how do we do that? 434 00:20:07,539 --> 00:20:09,069 Well, we need to determine, 435 00:20:09,069 --> 00:20:10,799 as was suggested, 436 00:20:10,799 --> 00:20:13,720 whether the element that we're looking for is less than the middle 437 00:20:13,720 --> 00:20:16,160 or greater than or equal to the middle. 438 00:20:16,160 --> 00:20:19,790 And in the case that it's less than the element of the middle of the array, 439 00:20:19,789 --> 00:20:24,629 then we want to adjust which side. 440 00:20:24,630 --> 00:20:27,290 So we know that it must be in the first portion of the array, and 441 00:20:27,289 --> 00:20:31,659 we're going to ignore the second portion of the array. 442 00:20:31,660 --> 00:20:34,100 So the right-hand marker that we're - 443 00:20:34,099 --> 00:20:38,059 is what we need to update this time. So that's what this does. It updates 444 00:20:38,059 --> 00:20:40,678 the right-hand marker to mid minus one. 445 00:20:40,679 --> 00:20:43,080 And why is it minus one? 446 00:20:43,079 --> 00:20:44,819 These sorts of issues 447 00:20:44,819 --> 00:20:47,730 can be really frustrating to be off by one 448 00:20:47,730 --> 00:20:53,680 errors in otherwise already complicated algorithms. [Inaudible]. Okay. 449 00:20:53,680 --> 00:20:59,360 Arrays start at zero. Well, 450 00:20:59,359 --> 00:21:02,599 could it be the case after we 451 00:21:02,599 --> 00:21:03,109 examine 452 00:21:03,109 --> 00:21:04,729 this test 453 00:21:04,730 --> 00:21:05,309 that 454 00:21:05,309 --> 00:21:09,609 the element is actually at the mid position? 455 00:21:09,609 --> 00:21:12,179 Okay, why is that? [Inaudible]. 456 00:21:12,180 --> 00:21:14,350 Yeah. We already - oops. 457 00:21:14,349 --> 00:21:16,919 We already 458 00:21:16,920 --> 00:21:20,690 checked, and the way that we checked is because it's just a less than sign. 459 00:21:20,690 --> 00:21:23,009 It's strictly less than, so that implies that 460 00:21:23,009 --> 00:21:26,730 the two are not equal. So 461 00:21:26,730 --> 00:21:29,120 there may be different ways of interpreting why there's a minus one 462 00:21:29,119 --> 00:21:30,968 here, but my interpretation is that 463 00:21:30,969 --> 00:21:34,029 we never want to consider that index again. 464 00:21:34,029 --> 00:21:37,779 So we want to move one past the middle index that we now know is not 465 00:21:37,779 --> 00:21:38,149 equal 466 00:21:38,150 --> 00:21:41,750 to the element that we were looking for. Okay? So 467 00:21:41,750 --> 00:21:43,169 in the other case, 468 00:21:43,169 --> 00:21:46,869 we want to adjust the other end of this portion of the array that we're 469 00:21:46,868 --> 00:21:48,559 considering, 470 00:21:48,559 --> 00:21:50,858 and so we update the left-hand side to 471 00:21:50,858 --> 00:21:54,158 mid plus one. All right? 472 00:21:54,159 --> 00:21:58,590 Why plus one? [Inaudible]. It's 473 00:21:58,589 --> 00:22:01,439 a similar reason. Right. We don't want to consider the midpoint 474 00:22:01,440 --> 00:22:03,039 ever again. Right? 475 00:22:03,039 --> 00:22:05,180 So at the very least, we've 476 00:22:05,180 --> 00:22:08,049 gotten rid of the midpoint. But more importantly, we've gotten rid of 477 00:22:08,049 --> 00:22:13,460 everything on one or the other sides of it. Okay? So 478 00:22:13,460 --> 00:22:17,990 you can now see that we are - we've got an instance of the original problem, 479 00:22:17,990 --> 00:22:21,599 and when we go back to the top of the Y 480 00:22:21,599 --> 00:22:22,449 loop, 481 00:22:22,450 --> 00:22:24,420 we calculate a new midpoint, 482 00:22:24,420 --> 00:22:26,870 and we do the comparison again. 483 00:22:26,869 --> 00:22:28,869 But now we're in a much smaller array, 484 00:22:28,869 --> 00:22:30,009 and eventually, 485 00:22:30,009 --> 00:22:34,230 the only directions in which RH and LH can move 486 00:22:34,230 --> 00:22:36,289 are directions toward each other. 487 00:22:36,289 --> 00:22:36,720 And 488 00:22:36,720 --> 00:22:38,539 no matter what happens, unless we 489 00:22:38,539 --> 00:22:41,028 go ahead and return mid, we're changing 490 00:22:41,028 --> 00:22:41,778 one of them 491 00:22:41,778 --> 00:22:42,650 each time. 492 00:22:42,650 --> 00:22:46,710 So you should be able to see that 493 00:22:46,710 --> 00:22:51,000 the array that we're considering is definitely getting smaller. Okay? 494 00:22:51,000 --> 00:22:52,200 So that's important. 495 00:22:52,200 --> 00:22:54,190 This is a Y loop with 496 00:22:54,190 --> 00:22:55,660 an interesting 497 00:22:55,660 --> 00:22:58,110 test. So it's always worth 498 00:22:58,109 --> 00:23:01,579 considering whether the Y loop is really going to exit. But in this case, 499 00:23:01,579 --> 00:23:06,720 I hope that I've persuaded you that, in fact, it should exit. Okay? 500 00:23:06,720 --> 00:23:11,048 So what will the search look like if we 501 00:23:11,048 --> 00:23:12,319 use binary 502 00:23:12,319 --> 00:23:13,700 search now. I'll 503 00:23:13,700 --> 00:23:20,700 just change this name. All right. 504 00:23:31,278 --> 00:23:32,119 That was quick. 505 00:23:32,119 --> 00:23:33,439 We could have 506 00:23:33,440 --> 00:23:36,149 fiddled with the delay that I added. But 507 00:23:36,148 --> 00:23:39,269 holding the delays the same, it should be clear that 508 00:23:39,269 --> 00:23:42,388 the difference in running times between these two algorithms 509 00:23:42,388 --> 00:23:49,388 is pretty substantial. All right? So 510 00:23:50,150 --> 00:23:52,519 that's searching. Any questions about what I've 511 00:23:52,519 --> 00:23:59,519 explained? Yeah. [Inaudible]. 512 00:24:03,019 --> 00:24:05,150 That's an excellent question. So 513 00:24:05,150 --> 00:24:07,880 it depends on 514 00:24:07,880 --> 00:24:09,949 - I could put the question back to you. 515 00:24:09,949 --> 00:24:11,759 I haven't explained 516 00:24:11,759 --> 00:24:14,289 a good way of thinking about 517 00:24:14,289 --> 00:24:16,500 exactly how fast these two operations are. 518 00:24:16,500 --> 00:24:18,799 And I haven't even shown you 519 00:24:18,799 --> 00:24:22,470 how one would go about sorting the array, 520 00:24:22,470 --> 00:24:25,089 much less how fast that would be. 521 00:24:25,089 --> 00:24:27,259 But 522 00:24:27,259 --> 00:24:28,359 523 00:24:28,359 --> 00:24:31,119 if you were going to do a bunch of look-ups, 524 00:24:31,119 --> 00:24:31,779 then 525 00:24:31,779 --> 00:24:37,148 the savings that you would reap by being able to search in a sorted array would 526 00:24:37,148 --> 00:24:41,729 potentially, for some number of look-ups, dwarf whatever cost it took to sort the 527 00:24:41,730 --> 00:24:44,089 array the very first time. Right? 528 00:24:44,089 --> 00:24:47,499 But if you're in a situation where you've got all sorts of arrays, and 529 00:24:47,499 --> 00:24:50,769 for this array you want to find this element, and next to it's 530 00:24:50,769 --> 00:24:53,509 going to be some completely different array - some completely different element 531 00:24:53,509 --> 00:24:56,670 that you're trying to find, then 532 00:24:56,670 --> 00:25:00,360 your intuition - or the intuition that I imagine is behind that question 533 00:25:00,359 --> 00:25:02,009 that it's 534 00:25:02,009 --> 00:25:04,599 more costly to sort the array and then search it 535 00:25:04,599 --> 00:25:08,199 is perfectly valid. It would be better in those situations just to do a linear 536 00:25:08,200 --> 00:25:09,249 search. And sometimes, 537 00:25:09,249 --> 00:25:12,298 that's all you can do. There are situations where 538 00:25:12,298 --> 00:25:16,129 there's no easy way to put the elements into sorted order. 539 00:25:16,130 --> 00:25:18,470 It's not clear what that would mean, 540 00:25:18,470 --> 00:25:22,990 so in situations like that, sometimes linear search, which is sort of our 541 00:25:22,990 --> 00:25:29,990 null hypothesis - our default algorithm - is the only thing that's available. All 542 00:25:37,409 --> 00:25:40,810 right. So we've already gone through the idea of 543 00:25:40,809 --> 00:25:42,230 binary search. 544 00:25:42,230 --> 00:25:45,039 Here is a prose version 545 00:25:45,039 --> 00:25:46,990 of the algorithm 546 00:25:46,990 --> 00:25:49,069 that we just walked 547 00:25:49,069 --> 00:25:50,369 548 00:25:50,369 --> 00:25:52,869 through. And here is - I inherited these slides from 549 00:25:52,869 --> 00:25:55,869 someone else's, as you may be able to tell. But here is - 550 00:25:55,869 --> 00:25:56,879 you can see that 551 00:25:56,880 --> 00:26:00,110 part of the array is no longer being considered. 552 00:26:00,109 --> 00:26:04,379 Successively, that portion gets smaller and smaller and smaller 553 00:26:04,380 --> 00:26:05,200 until finally, 554 00:26:05,200 --> 00:26:09,750 there's only one element. Okay? 555 00:26:09,750 --> 00:26:12,450 Okay. So we've already talked about how linear search 556 00:26:12,450 --> 00:26:19,450 depends on its running time on the number of elements. Do you have a question? Yeah. I was wondering what type of search mechanisms does [inaudible] use [inaudible]? Yeah. 557 00:26:22,819 --> 00:26:25,649 Yeah, so 558 00:26:25,650 --> 00:26:29,720 - we'll talk about exactly how efficient binary search is, but 559 00:26:29,720 --> 00:26:31,308 the basic insight is that it - 560 00:26:31,308 --> 00:26:33,210 the number of steps that you take 561 00:26:33,210 --> 00:26:36,910 is the number of times you would have to divide the original number of elements 562 00:26:36,910 --> 00:26:37,930 by two 563 00:26:37,930 --> 00:26:41,430 in order to get it down to just one element. 564 00:26:41,430 --> 00:26:44,640 Right? So 565 00:26:44,640 --> 00:26:49,830 that's a logarithmic proportion. Right? 566 00:26:49,829 --> 00:26:51,139 That's just sort of 567 00:26:51,140 --> 00:26:52,540 jargon, and actually, I 568 00:26:52,539 --> 00:26:54,740 didn't think much of logarithms until 569 00:26:54,740 --> 00:26:55,620 I started 570 00:26:55,619 --> 00:26:57,319 being a computer scientist, and 571 00:26:57,319 --> 00:27:00,789 now they make a lot of sense. They pop up everywhere. But I'll 572 00:27:00,789 --> 00:27:02,009 talk about that in a second. 573 00:27:02,009 --> 00:27:06,049 But yes. Okay, so back to your question about the hashmap. 574 00:27:06,049 --> 00:27:07,789 The thing about a hashmap is that, 575 00:27:07,789 --> 00:27:10,809 for all you can tell, it takes the same amount of time 576 00:27:10,809 --> 00:27:14,210 to find the element or to insert the element that you're trying to find or 577 00:27:14,210 --> 00:27:14,900 insert, 578 00:27:14,900 --> 00:27:17,180 no matter how many elements you put into 579 00:27:17,180 --> 00:27:21,670 the hashmap so far. So how is that even possible? So the book talks 580 00:27:21,670 --> 00:27:23,729 about this in some detail, 581 00:27:23,729 --> 00:27:25,069 but the basic idea 582 00:27:25,069 --> 00:27:26,460 is that 583 00:27:26,460 --> 00:27:29,910 the hashmap has a bunch of buckets behind the scenes, 584 00:27:29,910 --> 00:27:31,230 and it can figure out 585 00:27:31,230 --> 00:27:33,130 in constant time - 586 00:27:33,130 --> 00:27:36,000 that is in an amount of time that doesn't depend on the number of buckets 587 00:27:36,000 --> 00:27:38,169 or the number of elements - 588 00:27:38,169 --> 00:27:39,370 which bucket 589 00:27:39,369 --> 00:27:42,159 the element that you're inserting or looking for 590 00:27:42,160 --> 00:27:43,170 should go 591 00:27:43,170 --> 00:27:45,340 or should be found. 592 00:27:45,339 --> 00:27:47,509 And then 593 00:27:47,509 --> 00:27:50,339 hoping that the number of things that end up in that bucket - 594 00:27:50,339 --> 00:27:52,859 presuming you have enough buckets - 595 00:27:52,859 --> 00:27:55,939 is relatively small, you can just do a linear search 596 00:27:55,940 --> 00:27:59,009 on the list of things for each bucket. So 597 00:27:59,009 --> 00:28:00,109 the idea is that 598 00:28:00,109 --> 00:28:04,329 very quickly, you pair it down to a single small list 599 00:28:04,329 --> 00:28:05,269 to search in, 600 00:28:05,269 --> 00:28:12,210 and then you just search in that list. So for very short lists - or 601 00:28:12,210 --> 00:28:13,519 I guess I should say 602 00:28:13,519 --> 00:28:15,638 for very small numbers of elements - 603 00:28:15,638 --> 00:28:19,788 I have said actually has a lot of overhead that may outweigh the cost of 604 00:28:19,788 --> 00:28:21,920 just doing a linear search on 605 00:28:21,920 --> 00:28:24,390 the elements. But for - 606 00:28:24,390 --> 00:28:26,380 again, the way to think about 607 00:28:26,380 --> 00:28:30,179 these algorithms in relationship to each other is to consider 608 00:28:30,179 --> 00:28:31,600 really big examples 609 00:28:31,599 --> 00:28:34,469 and think about how the running time would sort of 610 00:28:34,470 --> 00:28:41,470 grow as a function of the input. All right. 611 00:28:42,339 --> 00:28:45,309 So let's say a little bit about 612 00:28:45,309 --> 00:28:46,058 binary search 613 00:28:46,058 --> 00:28:47,930 and it's efficiency. I've 614 00:28:47,930 --> 00:28:52,100 already suggested that we divide the number of elements that we're considering 615 00:28:52,099 --> 00:28:57,480 by two each time. So if you're mathematically inclined, you may notice 616 00:28:57,480 --> 00:28:57,870 that 617 00:28:57,869 --> 00:29:00,288 we want to solve this equation 618 00:29:00,288 --> 00:29:03,940 at the bottom where we find N such that N divided by 619 00:29:03,940 --> 00:29:05,360 two to the 620 00:29:05,359 --> 00:29:07,799 K times - no, 621 00:29:07,799 --> 00:29:10,470 N we already know. N is the size of our 622 00:29:10,470 --> 00:29:17,329 data set. We want to find K, which is the number of times we divide N by two. Okay? 623 00:29:17,329 --> 00:29:18,009 So 624 00:29:18,009 --> 00:29:21,200 we can rearrange that by multiplying each side, and then the only way of 625 00:29:21,200 --> 00:29:27,600 solving that is to find K by taking the base two logarithm of N. That's - 626 00:29:27,599 --> 00:29:30,859 you could find it by inspection, but you could also type this into your 627 00:29:30,859 --> 00:29:33,278 calculator. There's an equivalence between these two things that, 628 00:29:33,278 --> 00:29:35,400 if you've forgotten, may be worth 629 00:29:35,400 --> 00:29:36,290 remembering now, but that 630 00:29:36,289 --> 00:29:37,839 is certainly not something we'll 631 00:29:37,839 --> 00:29:42,519 ever test you about. I would also say that the content of this lecture 632 00:29:42,519 --> 00:29:44,370 is not something that you are going to 633 00:29:44,369 --> 00:29:45,759 be responsible for 634 00:29:45,759 --> 00:29:47,089 in your assignments 635 00:29:47,089 --> 00:29:48,459 or the exams. 636 00:29:48,460 --> 00:29:49,240 So if 637 00:29:49,240 --> 00:29:50,690 you find it interesting, 638 00:29:50,690 --> 00:29:56,610 then let that be it's own reward. And if not, I'm terribly, terribly sorry. Okay. 639 00:29:56,609 --> 00:30:00,399 So what - how do we compare these two numbers? So you look at an expression 640 00:30:00,400 --> 00:30:01,130 like 641 00:30:01,130 --> 00:30:04,409 log base two of N, and that may not mean anything to you. It certainly didn't mean 642 00:30:04,409 --> 00:30:05,769 much to me. But 643 00:30:05,769 --> 00:30:08,339 intuitively, if you remember that what we're talking about is dividing this 644 00:30:08,339 --> 00:30:10,029 array in half 645 00:30:10,029 --> 00:30:11,210 repeatedly, 646 00:30:11,210 --> 00:30:14,090 then it should seem like that's a lot faster 647 00:30:14,089 --> 00:30:15,499 than searching every element. 648 00:30:15,499 --> 00:30:19,019 And indeed, it is considerably, considerably faster. 649 00:30:19,019 --> 00:30:20,829 If it had been log base 650 00:30:20,829 --> 00:30:23,599 ten, the number would be approximately the number of 651 00:30:23,599 --> 00:30:28,679 digits in - NN, if you wrote it out as a decimal number. 652 00:30:28,680 --> 00:30:32,110 In log base two, it's the number of binary digits if you were to write it out as a 653 00:30:32,109 --> 00:30:32,969 binary number. 654 00:30:32,970 --> 00:30:36,929 So you know that it's easy to write very large numbers with a string of 655 00:30:36,929 --> 00:30:38,870 digits. 656 00:30:38,869 --> 00:30:40,158 And in this case, 657 00:30:40,159 --> 00:30:42,230 658 00:30:42,230 --> 00:30:45,370 if we scale in all the way up to a billion here, 659 00:30:45,369 --> 00:30:48,109 then a billion has approximately 660 00:30:48,109 --> 00:30:50,529 30 binary digits - or 661 00:30:50,529 --> 00:30:53,139 another way of saying this is that the logarithm 662 00:30:53,140 --> 00:30:53,920 of N 663 00:30:53,920 --> 00:30:56,890 base two when N is a billion is 30. So 664 00:30:56,890 --> 00:30:57,809 notice how 665 00:30:57,808 --> 00:31:00,129 log base two of N is growing relative to 666 00:31:00,130 --> 00:31:03,360 N. The difference is just enormous. So 667 00:31:03,359 --> 00:31:05,508 by imposing a little bit of structure 668 00:31:05,509 --> 00:31:09,509 on the data that we wanted to search through, we made it much, much easier to 669 00:31:09,509 --> 00:31:16,009 find elements. Okay. All right. 670 00:31:16,009 --> 00:31:17,960 So that's searching, 671 00:31:17,960 --> 00:31:19,640 and 672 00:31:19,640 --> 00:31:23,249 I hope the points there are well taken because 673 00:31:23,249 --> 00:31:26,679 sorting is now sort of like the dual concept 674 00:31:26,679 --> 00:31:28,610 of the searching that we've just been doing. 675 00:31:28,609 --> 00:31:31,709 If it's easier to search when we have imposed 676 00:31:31,710 --> 00:31:32,669 some order 677 00:31:32,669 --> 00:31:36,810 on the elements that we're trying to search among, then 678 00:31:36,809 --> 00:31:39,129 how do we go about imposing that order? 679 00:31:39,130 --> 00:31:40,759 So 106B 680 00:31:40,759 --> 00:31:43,140 spends quite a bit of time talking about 681 00:31:43,140 --> 00:31:47,070 different sorting algorithms. There's a great assignment where 682 00:31:47,069 --> 00:31:50,439 they give you a program that implements these sorting algorithms, but not 683 00:31:50,440 --> 00:31:51,479 the source code. 684 00:31:51,479 --> 00:31:52,690 And you just have to 685 00:31:52,690 --> 00:31:56,028 run it on a bunch of different kinds of input. And given what you know about how 686 00:31:56,028 --> 00:31:58,140 those sorting algorithms behave, 687 00:31:58,140 --> 00:32:00,400 figure out which one of them is which. 688 00:32:00,400 --> 00:32:02,889 So it's sort of like a practical lab from 689 00:32:02,888 --> 00:32:06,278 a natural science. It's the only time 690 00:32:06,278 --> 00:32:08,509 I've done anything like that 691 00:32:08,509 --> 00:32:11,839 in the computer science world, but 692 00:32:11,839 --> 00:32:13,399 if that sounds exciting, 693 00:32:13,400 --> 00:32:20,400 think about taking 106B. All right. 694 00:32:20,569 --> 00:32:22,019 So 695 00:32:22,019 --> 00:32:22,798 here's 696 00:32:22,798 --> 00:32:26,058 the sorting algorithm that corresponds, perhaps 697 00:32:26,058 --> 00:32:29,569 the best, to the way that the humans sort things. So 698 00:32:29,569 --> 00:32:33,819 if you had a deck of cards, and you wanted to put those cards in - 699 00:32:33,819 --> 00:32:36,480 back in order after sort of - after shuffling them, 700 00:32:36,480 --> 00:32:39,319 you would look at the first card, 701 00:32:39,319 --> 00:32:40,179 and 702 00:32:40,180 --> 00:32:41,570 you'd think to yourself, 703 00:32:41,569 --> 00:32:43,958 gGosh. This is great. I've already got another 704 00:32:43,959 --> 00:32:47,690 deck of cards that is sorted. It happens to have only one card in it.h 705 00:32:47,690 --> 00:32:49,460 So it's sort of sorted vacuously. 706 00:32:49,460 --> 00:32:51,288 But starting with this sorted deck, 707 00:32:51,288 --> 00:32:53,099 can I add more cards to it 708 00:32:53,099 --> 00:32:55,028 and keep them in the same 709 00:32:55,028 --> 00:32:56,349 sorted order? Right? 710 00:32:56,349 --> 00:32:57,839 So you would take the next card 711 00:32:57,839 --> 00:32:59,918 off of the larger deck, and you would 712 00:32:59,919 --> 00:33:02,159 either put it in front of or 713 00:33:02,159 --> 00:33:03,360 behind the card 714 00:33:03,359 --> 00:33:04,479 that you had. And now 715 00:33:04,480 --> 00:33:08,048 you'll have a deck of two cards that are sorted with respect to each other. And 716 00:33:08,048 --> 00:33:10,039 you would continue on in this way 717 00:33:10,039 --> 00:33:13,059 until you had exhausted the original deck, 718 00:33:13,059 --> 00:33:13,829 and you had 719 00:33:13,829 --> 00:33:17,049 a fully sorted secondary deck. 720 00:33:17,049 --> 00:33:18,329 Would anybody 721 00:33:18,329 --> 00:33:21,369 go about this differently? 722 00:33:21,369 --> 00:33:25,229 Is there a better way of sorting? Okay. Well, 723 00:33:25,230 --> 00:33:28,430 that's certainly what I would have done, and that's what is implemented by 724 00:33:28,430 --> 00:33:29,410 this 725 00:33:29,410 --> 00:33:32,779 as long as you trust the defined smallest function. 726 00:33:32,779 --> 00:33:34,519 So we just step through the array - 727 00:33:34,519 --> 00:33:35,138 and 728 00:33:35,138 --> 00:33:38,519 this is a bit different. We're not - we don't have two separate arrays that 729 00:33:38,519 --> 00:33:40,359 we're building up. 730 00:33:40,359 --> 00:33:41,298 But for 731 00:33:41,298 --> 00:33:45,299 each element, we look for the smallest element that's after that. 732 00:33:45,299 --> 00:33:48,019 And on the assumption that there shouldn't be any smaller elements after 733 00:33:48,019 --> 00:33:50,430 that, if we find such a smaller element, 734 00:33:50,430 --> 00:33:54,390 then we swap it with the element that we're currently looking at. Right? 735 00:33:54,390 --> 00:33:58,059 So by the time we get to the end of the array, and we swapped out each position 736 00:33:58,058 --> 00:34:00,549 with anything that was smaller than 737 00:34:00,549 --> 00:34:02,538 the element at that position, 738 00:34:02,538 --> 00:34:04,190 then our array 739 00:34:04,190 --> 00:34:08,050 should be sorted. 740 00:34:08,050 --> 00:34:10,289 Right? As a bit of review, 741 00:34:10,289 --> 00:34:11,169 742 00:34:11,168 --> 00:34:13,068 why do we call swap elements? 743 00:34:13,068 --> 00:34:15,940 You'll have to imagine how swap elements is implemented. 744 00:34:15,940 --> 00:34:19,418 But based on what you can assume, why do we have to call swap elements with three 745 00:34:19,418 --> 00:34:21,558 arguments? Why couldn't we just pass 746 00:34:21,559 --> 00:34:23,339 an array sub LH 747 00:34:23,338 --> 00:34:29,078 and array sub RH? [Inaudible]. 748 00:34:29,079 --> 00:34:31,919 Yeah. It only passes N. So those are just the values from the array, 749 00:34:31,918 --> 00:34:35,048 and our goal is to modify the array in place. 750 00:34:35,048 --> 00:34:36,889 Right? So 751 00:34:36,889 --> 00:34:43,889 the swap elements function needs to know about that array, too. Okay? 752 00:34:45,329 --> 00:34:47,269 So I don't know if I can 753 00:34:47,268 --> 00:34:48,838 trust 754 00:34:48,838 --> 00:34:50,398 the simulation, 755 00:34:50,398 --> 00:34:51,348 but 756 00:34:51,349 --> 00:34:53,539 I actually coded up a 757 00:34:53,539 --> 00:34:56,509 version of select and sort myself 758 00:34:56,509 --> 00:35:00,478 and an eclipse, so 759 00:35:00,478 --> 00:35:04,608 let us see a somewhat more interesting 760 00:35:04,608 --> 00:35:08,489 simulation of how that would work. Okay. 761 00:35:08,489 --> 00:35:10,159 So I'm gonna use a much smaller 762 00:35:10,159 --> 00:35:13,000 file because as I have - as I've already suggested, sorting 763 00:35:13,000 --> 00:35:13,940 takes 764 00:35:13,940 --> 00:35:20,940 more time than even linear search. All right. 765 00:35:23,309 --> 00:35:27,939 So just to give you some bearings as to what's going on in this program, 766 00:35:27,938 --> 00:35:31,389 I have this number display, which is just a G canvas that 767 00:35:31,389 --> 00:35:34,358 has these operations that let us highlight numbers and 768 00:35:34,358 --> 00:35:36,278 swap them around. And 769 00:35:36,278 --> 00:35:37,679 I'm 770 00:35:37,679 --> 00:35:41,440 creating one of those down here as an instance variable 771 00:35:41,440 --> 00:35:44,159 and adding it and run and then 772 00:35:44,159 --> 00:35:49,909 initializing it with the contents of the file that's typed in. Right? So 773 00:35:49,909 --> 00:35:50,898 if 774 00:35:50,898 --> 00:35:52,279 I were to just run 775 00:35:52,280 --> 00:35:52,899 776 00:35:52,898 --> 00:35:55,328 the program as it is right now - 777 00:35:55,329 --> 00:36:00,929 I use a smaller text file. It 778 00:36:00,929 --> 00:36:03,929 should slow this down, but it's going to take enough time anyway that you can 779 00:36:03,929 --> 00:36:04,650 see 780 00:36:04,650 --> 00:36:06,469 sort of what's happening. 781 00:36:06,469 --> 00:36:10,548 Each number is turning red when we examine it, so 782 00:36:10,548 --> 00:36:13,498 in some sense, you can think of the running time of the program as depending 783 00:36:13,498 --> 00:36:14,198 on 784 00:36:14,199 --> 00:36:16,869 the number of times, given the input that 785 00:36:16,869 --> 00:36:20,659 you have to examine elements of the array. So 786 00:36:20,659 --> 00:36:22,849 the way I structured this was just that 787 00:36:22,849 --> 00:36:25,699 examining any element of this pseudo-array would have the side effect of 788 00:36:25,699 --> 00:36:28,269 making it a little bigger and turning it red - 789 00:36:28,268 --> 00:36:31,568 and also pausing for 790 00:36:31,568 --> 00:36:34,400 about a quarter of a second. Okay. 791 00:36:34,400 --> 00:36:35,818 So actually, 792 00:36:35,818 --> 00:36:38,208 I think it would be worthwhile to 793 00:36:38,208 --> 00:36:39,219 slow that down 794 00:36:39,219 --> 00:36:46,219 and look at what's happening. 795 00:36:46,599 --> 00:36:48,359 The value of constants - 796 00:36:48,358 --> 00:36:54,278 that was easy. Okay. 797 00:36:54,278 --> 00:36:57,548 So we're considering four right now. One of the things that this doesn't keep 798 00:36:57,548 --> 00:36:58,909 track of very well is 799 00:36:58,909 --> 00:37:01,129 what number we're considering. Right? 800 00:37:01,130 --> 00:37:06,088 But after four, the smallest number that we found was two, and so we swapped them out. 801 00:37:06,088 --> 00:37:11,318 Two happened to occur immediately after four. 802 00:37:11,318 --> 00:37:15,219 And so we considered four. Again, this is just sort of a special case where four happened 803 00:37:15,219 --> 00:37:22,219 to be the - happened to be swapped into the position that we next considered. 804 00:37:25,608 --> 00:37:27,338 So now is anything 805 00:37:27,338 --> 00:37:31,159 less than eight? Well, seven is. Oh, four is even less than seven. 806 00:37:31,159 --> 00:37:32,588 Now we swap the 807 00:37:32,588 --> 00:37:34,268 eight with the four. 808 00:37:34,268 --> 00:37:36,558 It's going to turn out that nothing is less than - 809 00:37:36,559 --> 00:37:38,660 or everything is less than nine - 810 00:37:38,659 --> 00:37:45,659 five in particular. Okay. 811 00:37:47,130 --> 00:37:49,829 So how long did that take? Was anybody 812 00:37:49,829 --> 00:37:51,559 timing that? 813 00:37:51,559 --> 00:37:55,330 What would it mean? I mean, if we could just change the amount of delay 814 00:37:55,329 --> 00:37:58,489 arbitrarily, right, we could have made this take any amount of time. And in 815 00:37:58,489 --> 00:38:01,108 some sense, when computers get faster and faster because 816 00:38:01,108 --> 00:38:04,119 the hardware people are doing their jobs, that's what's happening. 817 00:38:04,119 --> 00:38:06,989 The delay that someone artificially 818 00:38:06,989 --> 00:38:09,458 - or in that case, not artificially - put into the system 819 00:38:09,458 --> 00:38:11,669 is just being reduced. Right? So 820 00:38:11,670 --> 00:38:13,210 everything seems to run faster. 821 00:38:13,210 --> 00:38:15,309 The rising tide 822 00:38:15,309 --> 00:38:16,290 raises all boats. 823 00:38:16,289 --> 00:38:17,829 So 824 00:38:17,829 --> 00:38:20,689 what would - I mean, would it even be useful to 825 00:38:20,688 --> 00:38:24,348 have timed that - to have a number of seconds representing how long it took to 826 00:38:24,349 --> 00:38:26,359 sort these 827 00:38:26,358 --> 00:38:27,398 ten elements? It's 828 00:38:27,398 --> 00:38:30,778 a lot harder to count when you're standing in front of a bunch of people. 829 00:38:30,778 --> 00:38:32,588 830 00:38:32,588 --> 00:38:34,259 But it - yeah, I think that's ten. Okay. Right? So 831 00:38:34,259 --> 00:38:36,139 clearly, it's sort of 832 00:38:36,139 --> 00:38:38,510 a result of a number of factors 833 00:38:38,510 --> 00:38:39,809 exactly how much time 834 00:38:39,809 --> 00:38:41,569 these operations take. 835 00:38:41,568 --> 00:38:43,009 So I - with 836 00:38:43,009 --> 00:38:45,119 what remains of the lecture, I'd like to give you 837 00:38:45,119 --> 00:38:49,989 sort of a better way of talking about how fast a program 838 00:38:49,989 --> 00:38:52,088 is. So if you buy my contention 839 00:38:52,088 --> 00:38:55,268 that you can't just time your program without 840 00:38:55,268 --> 00:38:58,558 making sure that all of the other factors stay the same, 841 00:38:58,559 --> 00:38:59,890 then 842 00:38:59,889 --> 00:39:03,039 what ideally we would want to be able to say is that, 843 00:39:03,039 --> 00:39:04,820 given some kind of input, 844 00:39:04,820 --> 00:39:06,989 about how long is it going to take? 845 00:39:06,989 --> 00:39:07,749 And 846 00:39:07,748 --> 00:39:11,548 in particular, how long would it take if, for instance, I 847 00:39:11,548 --> 00:39:15,139 doubled the size of the input - if there were twice as many numbers here. 848 00:39:15,139 --> 00:39:16,479 For linear search, 849 00:39:16,478 --> 00:39:21,458 if we were searching for the five in this array - 850 00:39:21,458 --> 00:39:24,998 or let's just take the pathological case. If we were searching for the nine, 851 00:39:24,998 --> 00:39:28,228 we would have to examine all of the elements before we found it. 852 00:39:28,228 --> 00:39:31,178 If we double the length of the array 853 00:39:31,179 --> 00:39:36,358 and made sure that nine didn't appear anywhere except the end, again 854 00:39:36,358 --> 00:39:37,579 we would have doubled 855 00:39:37,579 --> 00:39:38,219 856 00:39:38,219 --> 00:39:40,708 the time that it took to find the nine. 857 00:39:40,708 --> 00:39:44,778 But would that be true in the binary search case? 858 00:39:44,778 --> 00:39:49,518 Well, no. We had only - we doubled N. And remember that the running time is, in some sense, 859 00:39:49,518 --> 00:39:51,008 proportional to 860 00:39:51,009 --> 00:39:52,969 the logarithm of N 861 00:39:52,969 --> 00:39:54,648 base two. So 862 00:39:54,648 --> 00:39:57,759 we wouldn't have doubled the running time. We would have only increased it by 863 00:39:57,759 --> 00:39:59,019 a factor of 864 00:39:59,018 --> 00:40:00,209 logarithm of N 865 00:40:00,210 --> 00:40:02,449 times two minus - or 866 00:40:02,449 --> 00:40:05,619 divided by the logarithm of N. 867 00:40:05,619 --> 00:40:07,088 I 868 00:40:07,088 --> 00:40:09,659 didn't say the base there because it actually doesn't matter. If you divide 869 00:40:09,659 --> 00:40:13,039 two logarithms by each other, you get the same answer no matter what base 870 00:40:13,039 --> 00:40:14,310 you took them at originally. It's 871 00:40:14,309 --> 00:40:21,309 sort of a nice property of logarithms. Okay. 872 00:40:23,588 --> 00:40:27,699 So I think my simulation was better. Feel free to disagree. But I'm trying to 873 00:40:27,699 --> 00:40:28,599 skip - 874 00:40:28,599 --> 00:40:32,499 oh. This actually has a little finger pointing to the elements. 875 00:40:32,498 --> 00:40:36,328 Maybe there's something to be said for this. Anyway. 876 00:40:36,329 --> 00:40:39,619 I'll post that program online incase anyone is curious. 877 00:40:39,619 --> 00:40:43,818 This is 878 00:40:43,818 --> 00:40:45,628 taking a bit longer to 879 00:40:45,628 --> 00:40:52,628 blow through than I was hoping. All right. 880 00:40:56,429 --> 00:41:00,239 So how efficient is selection sort? Well, 881 00:41:00,239 --> 00:41:03,708 it's not okay to just give a number seconds. We want to give some kind of 882 00:41:03,708 --> 00:41:05,089 proportion, as we did for 883 00:41:05,090 --> 00:41:08,059 the searches. Right? 884 00:41:08,059 --> 00:41:09,460 And if you noticed, 885 00:41:09,460 --> 00:41:12,900 for every element of the array, we had to consider 886 00:41:12,900 --> 00:41:15,809 every remaining element of the array. 887 00:41:15,809 --> 00:41:17,960 If we wanted to be a little less 888 00:41:17,960 --> 00:41:22,369 efficient in some sense, we could have just - well, 889 00:41:22,369 --> 00:41:25,108 I was going to say we could have considered all of the elements and just 890 00:41:25,108 --> 00:41:29,848 ignored the ones that did not actually come afterwards. But 891 00:41:29,849 --> 00:41:31,259 we can calculate 892 00:41:31,259 --> 00:41:34,119 sort of in a rough sense 893 00:41:34,119 --> 00:41:35,719 what that proportion 894 00:41:35,719 --> 00:41:39,009 should be in this case. 895 00:41:39,009 --> 00:41:42,659 But before I do that, here's a slide where I 896 00:41:42,659 --> 00:41:45,309 attempted 897 00:41:45,309 --> 00:41:47,329 to do the timing test. 898 00:41:47,329 --> 00:41:50,720 So I held all the factors constant that I could, 899 00:41:50,719 --> 00:41:54,858 except that I changed the size of the input. 900 00:41:54,858 --> 00:41:55,938 So 901 00:41:55,938 --> 00:41:59,938 I went from having just ten elements, which took me a 902 00:41:59,938 --> 00:42:01,748 very, very short amount of time, 903 00:42:01,748 --> 00:42:06,649 to having 1,000 elements, which took me a matter 904 00:42:06,650 --> 00:42:09,048 of minutes 905 00:42:09,048 --> 00:42:14,418 - 10,000 elements. Excuse me. Right? So 906 00:42:14,418 --> 00:42:16,690 this sort of illustrates the point, and 907 00:42:16,690 --> 00:42:17,858 it shows 908 00:42:17,858 --> 00:42:20,009 in a heuristic sense 909 00:42:20,009 --> 00:42:24,139 how the running time grows with the size of the input. But 910 00:42:24,139 --> 00:42:27,989 what's going on here? The - in the far column, we have 911 00:42:27,989 --> 00:42:32,108 the standard deviation, which is a measure of how different the timings 912 00:42:32,108 --> 00:42:32,699 were. 913 00:42:32,699 --> 00:42:34,818 And if you look on a particular row 914 00:42:34,818 --> 00:42:37,699 for any of these trials, all of those numbers are really supposed to be the 915 00:42:37,699 --> 00:42:38,608 same because I'm 916 00:42:38,608 --> 00:42:40,029 sorting the same 917 00:42:40,030 --> 00:42:41,979 number of elements. 918 00:42:41,978 --> 00:42:43,960 Right? But why are they different? 919 00:42:43,960 --> 00:42:45,588 Why such 920 00:42:45,588 --> 00:42:50,130 variation? 921 00:42:50,130 --> 00:42:54,900 So I'm sorting 100 elements, and sometimes it takes 922 00:42:54,900 --> 00:42:58,369 .146 seconds, and sometimes it takes as much as 923 00:42:58,369 --> 00:42:59,919 924 00:42:59,918 --> 00:43:03,518 .176. Just chalk it up to the randomness of the universe? I guess, but it's even 925 00:43:03,518 --> 00:43:07,128 easier to explain that problem away because your computer is doing so many 926 00:43:07,128 --> 00:43:09,688 more things than just running your 927 00:43:09,688 --> 00:43:11,288 eclipse program. It's written 928 00:43:11,289 --> 00:43:15,049 in JAVA. It may have to pay attention to activity on the 929 00:43:15,048 --> 00:43:15,909 network. You may be 930 00:43:15,909 --> 00:43:21,368 running iTunes at the same time, and it has to 931 00:43:21,369 --> 00:43:24,880 load a file in from the disc, 932 00:43:24,880 --> 00:43:26,510 which takes away time. 933 00:43:26,510 --> 00:43:30,380 So these time measurements are just in terms of absolute time, 934 00:43:30,380 --> 00:43:33,699 and for that reason, they don't give a very accurate 935 00:43:33,699 --> 00:43:35,259 indication of 936 00:43:35,259 --> 00:43:38,989 the job that your program was doing all by itself. All 937 00:43:38,989 --> 00:43:40,849 right. So [inaudible] is sort of 938 00:43:40,849 --> 00:43:44,749 crazy values. For 1,000 for the most part, it took just 939 00:43:44,748 --> 00:43:45,709 13 940 00:43:45,710 --> 00:43:50,170 seconds almost as much as 20 and almost as little 941 00:43:50,170 --> 00:43:52,710 as ten. The average was 21, though, 942 00:43:52,710 --> 00:43:55,219 because one of those time trials took 943 00:43:55,219 --> 00:43:57,528 81 seconds. Right? Just because 944 00:43:57,528 --> 00:44:04,528 something else on the computer was happening that was taking time away. Right? So 945 00:44:05,099 --> 00:44:05,999 clearly, 946 00:44:05,998 --> 00:44:10,478 we can begin to supply a better intuition for the question that was asked earlier 947 00:44:10,478 --> 00:44:11,399 about 948 00:44:11,400 --> 00:44:15,849 whether it would be okay to sort an array and then search it if all we wanted was 949 00:44:15,849 --> 00:44:22,849 to do a single search. All right. So 950 00:44:24,699 --> 00:44:27,929 on the very first time through, we're looking at all of the elements in the array 951 00:44:27,929 --> 00:44:29,209 after the first one. 952 00:44:29,208 --> 00:44:34,308 How many are we looking at on the second time through? [Inaudible]. 953 00:44:34,309 --> 00:44:36,099 One less than that? Okay. 954 00:44:36,099 --> 00:44:38,499 And on the third time through, one less than that. 955 00:44:38,498 --> 00:44:40,058 So 956 00:44:40,059 --> 00:44:42,109 the total number that 957 00:44:42,108 --> 00:44:45,088 we're looking for is - 958 00:44:45,088 --> 00:44:46,619 or the total number of 959 00:44:46,619 --> 00:44:51,869 times we have to examine some element of the array is N plus 960 00:44:51,869 --> 00:44:53,709 N minus one 961 00:44:53,708 --> 00:44:56,018 plus N 962 00:44:56,018 --> 00:44:58,778 minus two 963 00:44:58,778 --> 00:45:00,418 all the way down 964 00:45:00,418 --> 00:45:05,038 to one. Right? 965 00:45:05,039 --> 00:45:06,869 Does anyone know how 966 00:45:06,869 --> 00:45:08,539 - what this sums up to? 967 00:45:08,539 --> 00:45:15,539 There's actually a closed form for it. Yeah? [Inaudible]. Okay. 968 00:45:16,539 --> 00:45:19,910 Yeah. So if you work it out with small examples, you'll find that 969 00:45:19,909 --> 00:45:21,118 this is true, and you can 970 00:45:21,119 --> 00:45:23,309 probably then convince yourself of it. 971 00:45:23,309 --> 00:45:24,730 This sums to 972 00:45:24,730 --> 00:45:27,088 N times N 973 00:45:27,088 --> 00:45:28,679 plus one 974 00:45:28,679 --> 00:45:30,539 over two. Okay? 975 00:45:30,539 --> 00:45:33,939 So how do we make sense of that? How do you look at that and tell intuitively what 976 00:45:33,938 --> 00:45:35,498 kind of algorithm this is - 977 00:45:35,498 --> 00:45:38,808 how efficient this selection sort is? Well, 978 00:45:38,809 --> 00:45:41,979 in computer science - this is one of the things I love about computer science - when we're 979 00:45:41,978 --> 00:45:45,078 talking about algorithmic efficiency because we really don't care about any 980 00:45:45,079 --> 00:45:47,839 of the sort of details. We only care about 981 00:45:47,838 --> 00:45:50,838 the aggregate behavior of these programs. 982 00:45:50,838 --> 00:45:52,338 We can drop 983 00:45:52,338 --> 00:45:55,958 pieces of information like this, too. Right? Who cares if the algorithm is 984 00:45:55,958 --> 00:46:00,948 twice as fast or half as fast as another algorithm? If 985 00:46:00,949 --> 00:46:06,119 it's - you just need a computer that's twice as fast or half as fast. 986 00:46:06,119 --> 00:46:07,858 But if, say, 987 00:46:07,858 --> 00:46:11,338 an algorithm takes the square of the amount of time that another algorithm 988 00:46:11,338 --> 00:46:12,219 took, then 989 00:46:12,219 --> 00:46:16,899 really, getting a faster computer is not the answer for sufficiently large 990 00:46:16,900 --> 00:46:18,119 inputs. So - 991 00:46:18,119 --> 00:46:22,528 whereas we can drop constant factors like two and just pretend 992 00:46:22,528 --> 00:46:25,289 that this is - well, 993 00:46:25,289 --> 00:46:27,528 you can see that this 994 00:46:27,528 --> 00:46:33,268 simplifies to N squared plus N. Right? 995 00:46:33,268 --> 00:46:37,468 Not only can we drop constant factors, but we can drop anything that doesn't really 996 00:46:37,469 --> 00:46:40,700 contribute in a big way to the running time. So it doesn't even matter that we have 997 00:46:40,699 --> 00:46:42,038 998 00:46:42,039 --> 00:46:43,170 this plus in here. 999 00:46:43,170 --> 00:46:46,479 So we can look at this and sort of think of it intuitively as 1000 00:46:46,478 --> 00:46:47,759 a quadratic 1001 00:46:47,759 --> 00:46:50,989 that is squared function of the size of the input. 1002 00:46:50,989 --> 00:46:52,079 So what does this mean? 1003 00:46:52,079 --> 00:46:54,528 It means that if we double the size of our input, 1004 00:46:54,528 --> 00:46:56,130 then the selection sort 1005 00:46:56,130 --> 00:47:00,148 that we just walked through is going to take the square of the 1006 00:47:00,148 --> 00:47:06,228 amount of time that it took originally. All right? 1007 00:47:06,228 --> 00:47:08,818 So this is a 1008 00:47:08,818 --> 00:47:11,248 computation of that. You can also think about it 1009 00:47:11,248 --> 00:47:12,238 geometrically 1010 00:47:12,239 --> 00:47:14,599 if you 1011 00:47:14,599 --> 00:47:21,599 treat each of the accesses of array elements as these circles. All right. 1012 00:47:23,268 --> 00:47:27,198 So this isn't quite as bad as the difference between - no. 1013 00:47:27,199 --> 00:47:27,659 Actually, 1014 00:47:27,659 --> 00:47:30,928 I should say that we - 1015 00:47:30,929 --> 00:47:33,079 to demonstrate the difference between 1016 00:47:33,079 --> 00:47:35,929 linear search and binary search, we 1017 00:47:35,929 --> 00:47:38,188 had to make N 1018 00:47:38,188 --> 00:47:41,728 really large, 1019 00:47:41,728 --> 00:47:42,728 while 1020 00:47:42,728 --> 00:47:46,728 log base two of N was still relatively small. 1021 00:47:46,728 --> 00:47:53,728 Here, it's sort of the other way around. 1022 00:47:56,869 --> 00:47:58,709 I guess explaining slides 1023 00:47:58,708 --> 00:48:01,038 that I didn't write is a bit of 1024 00:48:01,039 --> 00:48:03,209 a 1025 00:48:03,208 --> 00:48:04,368 waste of time 1026 00:48:04,369 --> 00:48:06,079 for you guys. So 1027 00:48:06,079 --> 00:48:06,829 do you have 1028 00:48:06,829 --> 00:48:09,668 questions about 1029 00:48:09,668 --> 00:48:11,918 how we did that calculation? 1030 00:48:11,918 --> 00:48:15,568 Why I modified that to make it simpler? 1031 00:48:15,568 --> 00:48:17,099 If you do afterwards, 1032 00:48:17,099 --> 00:48:18,798 then feel free 1033 00:48:18,798 --> 00:48:22,079 to 1034 00:48:22,079 --> 00:48:25,639 ask. So I sort of wanted to leave you with some insight about how we could possibly do a 1035 00:48:25,639 --> 00:48:27,509 little bit better than 1036 00:48:27,509 --> 00:48:28,769 selection sort. 1037 00:48:28,768 --> 00:48:31,929 And it turns out there are algorithms that do substantially better 1038 00:48:31,929 --> 00:48:33,348 than selection sort. 1039 00:48:33,349 --> 00:48:37,729 And one of the easiest of them to describe is an algorithm called radix - 1040 00:48:37,728 --> 00:48:38,108 or 1041 00:48:38,108 --> 00:48:40,098 radix sort. 1042 00:48:40,099 --> 00:48:44,400 And it comes from back in the days when we had these punch cards 1043 00:48:44,400 --> 00:48:47,499 with holes in them to represent 1044 00:48:47,498 --> 00:48:48,498 numbers. 1045 00:48:48,498 --> 00:48:49,480 And - 1046 00:48:49,480 --> 00:48:52,199 so say you had a bunch of three digit numbers that 1047 00:48:52,199 --> 00:48:53,639 you wanted to sort. 1048 00:48:53,639 --> 00:48:55,528 Well, the basic idea 1049 00:48:55,528 --> 00:48:57,329 with radix sort 1050 00:48:57,329 --> 00:48:59,589 is that you would 1051 00:48:59,588 --> 00:49:01,389 set the - you 1052 00:49:01,389 --> 00:49:05,108 would have this machine that could sort of read the holes and 1053 00:49:05,108 --> 00:49:09,199 sort the elements into buckets such that every - 1054 00:49:09,199 --> 00:49:11,938 the elements in this case being the cards. 1055 00:49:11,938 --> 00:49:15,338 So each one of the cards 1056 00:49:15,338 --> 00:49:17,788 is sorted into a bucket according to its - 1057 00:49:17,789 --> 00:49:20,759 the first digit of the number that it represents. Right? 1058 00:49:20,759 --> 00:49:22,878 So you've got these buckets that have 1059 00:49:22,878 --> 00:49:25,568 cards that aren't in any order within the bucket, 1060 00:49:25,568 --> 00:49:26,599 but then 1061 00:49:26,599 --> 00:49:29,240 you at least know that each bucket 1062 00:49:29,239 --> 00:49:31,498 contains all the cards that are 1063 00:49:31,498 --> 00:49:33,438 less than all the cards in the next bucket. 1064 00:49:33,438 --> 00:49:36,199 So you can take them all out and stick them back together 1065 00:49:36,199 --> 00:49:37,459 as a single thing, 1066 00:49:37,458 --> 00:49:39,288 and then set the machine - 1067 00:49:39,289 --> 00:49:42,729 the machine has this thing called the hopper, which is just a place to put the 1068 00:49:42,728 --> 00:49:45,489 cards - all 1069 00:49:45,489 --> 00:49:47,429 right. So run the machine 1070 00:49:47,429 --> 00:49:47,980 again, 1071 00:49:47,980 --> 00:49:49,469 but the second time, you run it 1072 00:49:49,469 --> 00:49:53,798 - have it sort them in the order of the second digit 1073 00:49:53,798 --> 00:49:56,378 in the numbers, 1074 00:49:56,378 --> 00:49:57,639 and 1075 00:49:57,639 --> 00:50:02,038 assuming that we have three digit numbers, then - 1076 00:50:02,039 --> 00:50:05,309 now the cards are all going to be sorted 1077 00:50:05,309 --> 00:50:07,449 within their bins with respect 1078 00:50:07,449 --> 00:50:09,079 to 1079 00:50:09,079 --> 00:50:12,059 both the last digit and the second to last digit. 1080 00:50:12,059 --> 00:50:16,239 And so if we do this just on more time and sort the resulting 1081 00:50:16,239 --> 00:50:22,499 set of cards on the one remaining digit, then we end up with 1082 00:50:22,498 --> 00:50:25,718 bins that actually represent a complete ordering. If we took out the dividers 1083 00:50:25,719 --> 00:50:27,489 between the bins, we could 1084 00:50:27,489 --> 00:50:29,619 stick a stack of cards together, 1085 00:50:29,619 --> 00:50:36,619 and that would be ordered. So 1086 00:50:39,358 --> 00:50:41,078 this again is an animation that 1087 00:50:41,079 --> 00:50:46,259 didn't come through so well. I think 1088 00:50:46,259 --> 00:50:49,599 we're about out of time. So 1089 00:50:49,599 --> 00:50:52,679 I wish that I could have given a better intuition about how much better radix 1090 00:50:52,679 --> 00:50:55,430 sort is than selection sort, but 1091 00:50:55,429 --> 00:50:57,509 thank you very much for paying attention to me. 1092 00:50:57,509 --> 00:50:58,318 And 1093 00:50:58,318 --> 00:50:59,259 have a great Thanksgiving.