This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're gonna talk about hashing. Hashing is one of the coolest things you're ever gonna learn in 106B, so it's a good day to be here and learn something really neat, a clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, two lectures back to Monday, where we talked about ways we might implement the map abstraction, the key-value pairing that allows for quick search and retrieval. We had looked at the vector and done a little thought experiment about what sorting the vector would buy us: it would let us search faster because we could use binary search. But in the case of add, we'd still be slow, because the elements would need to be shuffled or rearranged within the contiguous memory. That's what motivated our introduction to trees, the binary search tree, which is designed exactly to enable that kind of searching, that logarithmic divide-and-conquer where you divide in half and work your way down. And because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, since we're no longer constrained by the contiguous memory that array-backed data structures force on us.

So at that point we'd seen logarithmic time for both get-value and add. And on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you'd run into in ordinary use, logarithmic is un-measurable. Those of you who got your heap up and running, correctly working in logarithmic time, discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and just never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.
So that said, that's actually a fine place to stop: we have a great map implementation, we don't have to look further. But in fact I'm gonna push it a little bit today, and we're gonna get to where we can do both of those operations in constant time, which is to say that no matter how large the table gets, we expect to be able to insert and retrieve in roughly the same time for a 10,000-entry table as for a million-entry table, which is pretty neat.

As we start to get to these more complicated data structures, one thing to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something simple, a slower performer that's easy to write, might actually solve the problem we have at hand; maybe we don't really wanna go to one of these fancier things because we don't need it to be that much faster. The BST adds a bunch of memory, left and right pointers for every cell, which is an 8-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation; we've got a lot of recursion going on in there, and with it opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of (we just mentioned the need for it), that actually adds quite a lot of complication to the code, the rotations or the rebalancing, beyond a simple binary search tree. So the whole package of building a guaranteed log-n-performing binary search tree is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. The strategy I'm gonna show you today is actually kind of simple to write, it handles our operations well, and it doesn't have the degenerate cases looming that the binary search tree did.

So let me give you a conceptual way to think about how you do this kind of lookup. You have a big dictionary, a many-thousand-page dictionary, and you want to look up a word.
You certainly don't do a linear search. If you're about to look up the word xylophone, you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even quite do binary search: you wouldn't start at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are. Maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. From there you can do a simple search, maybe a little bit of linear or binary search, to zero in on the page with the word you wanna find.

That's the idea behind the hash table, based on the algorithmic technique of hashing. Rather than try to figure out how to keep one very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in (we'll call those pieces buckets), then I can put things in different buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that one, and then when I look for a student I'd say, "Which class are you?" You say, "Freshman." I'd look in the freshmen bucket.

The idea in a simple case is that I have some criteria by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. And if I can make up enough of those categorizations, have enough buckets around to divide you up into, then I can hopefully keep the set that lands in each bucket pretty small, so that I don't have a whole classful of students to look through.
So maybe I do it by the first letter of your last name: everybody whose last name begins with W goes here, everybody who's a B goes here, and so on. Then when somebody comes to me, I don't have to look through everybody, just the bucket for their letter. If you have a class of 200 students and 26 letters of the alphabet, then if students were evenly divided across those letters (it's not going to be, it turns out in practice, but as a thought experiment) there'd be about 4 people in a bucket, and it'd be pretty fast to look through the bucket and find the one I was looking for.

Okay, so let's take this idea. There's going to be some strategy for how we divide into buckets, and we call that strategy the hash function. We'll label the buckets with numbers, zero to the number of buckets minus one; so if I have 99 buckets, I've got zero to 98. I take the key, which in this case is the string key we're using in the map, and I have a hash function which, given a key as input, produces a number as output in the range zero to B minus one. That number is what's called the hash value for that key, and it tells me which bucket to look in, either to find a match if I'm trying to do a getValue, or where to place a new entry if I'm trying to add into the table.

So that's the basic idea. Now I'm gonna talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're gonna think about it from a big-picture perspective and consider what qualities a hash function has to have. I'm not gonna prove to you what mathematically constitutes a good hash function; I'm just gonna give you some intuitive sense of what kinds of things matter when designing one.

So: it's given a key, and it maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range.
So it's gonna take your number and say, okay, if you have five buckets to divide things into, I will give you a number zero to four.

It's very important that it be stable: you put the key in, you get a number out; if you put that same key in later, you should get the same number out. It has to be a reliable, guaranteed mapping; it can't be random. That said, you want it to almost look as though it's random in terms of how it maps things to buckets. You don't want it to have a real systematic plan where it puts a whole bunch of things in bucket one and very few things in bucket two, or somehow always uses the even buckets and not the odd ones. You wanna have a strategy that takes those keys and uses your whole bucket range in an equal manner, so that if you put 100 keys into 100 buckets, you're hoping that, on average, each bucket will have one thing. You won't have these clusters where 25 things are in one bucket, ten here, and a whole bunch of buckets are empty. You want it divided across the range.

Given that we have string input, likely what we're gonna use is the characters in the string. How can I take these characters and jumble them up in a way that moves them around these buckets reliably, so that the same string always comes back to the same place, but that doesn't produce a lot of artifacts, say similar strings all mapping to the same place, and hopefully spreads them around?

For example, some really simple things you could use: you could just take the first letter. I talked about this with last names; if I was trying to divide up my students, I could divide them by the first letter of their last name. That would use exactly 26 buckets, no more, no less, and if I happen to have 26 buckets then at least there's some chance it would get some coverage across the range. There is likely to be clustering, though: some names are a lot more common, and some letters are a lot more commonly used than others.
So it would tend to produce a little bit of an artifact, but it would be reliable and it would use 26 buckets. If I had 100 buckets, then this would be kind of a crummy function, because it would use the first 26 buckets and none of the other ones, if it's just taking that one letter and deciding. So in some situations this would get me a little bit of what I wanted.

Something like using the length of the word would be a very, very bad clustering function, because there'd be a bunch of strings that are six characters, seven characters, ten characters, but very few that are one or two, and very few that are more than ten or so. You'd have this very tight little cluster in the middle and would not use a large bucket range effectively at all.

More likely, what you're gonna try to use is more of the information about the key that you have. Instead of just a single letter or the count of letters, try to use all the data you have at hand, which is to say all the letters. So one way to do it would be to take the ASCII values of the letters (lowercase 'a' is 97, 'b' is 98, and so on): you'd take a word like "atoms", add the codes for each of its letters together, and you'd get some sum, 500, 600, whatever it is, depending on how many letters there are. Then you use mod to bring it back into range. Let's say you had 100 buckets, so you want a number zero to 99: you'd take that sum and mod it by 100, in effect taking its last two digits. So if the sum came to 524, it goes into bucket 24.

So if "act" is the word you have, you'd add its codes together, 'a' is 97, 'c' is 99, 't' is a hundred-something, and we end up with, I'm making this number up, 283; we mod down to the last two digits and this one goes into bucket 83. And if we have "dog", maybe that adds to 267, and we mod that down to 67.
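To make that concrete, here's a minimal sketch of the "sum the character codes" scheme in C++. The function name and the 100-bucket choice are just for illustration, not code from the lecture:

```cpp
#include <string>

// Sum-of-codes hash: adds every character's code, then folds the total back
// into the bucket range with mod. Stable (same key, same answer), but it
// ignores letter order, so anagrams collide.
int sumHash(const std::string &key, int numBuckets) {
    int sum = 0;
    for (size_t i = 0; i < key.length(); i++) {
        sum += key[i];          // add each character's ASCII value
    }
    return sum % numBuckets;    // e.g. 524 with 100 buckets lands in bucket 24
}

// sumHash("act", 100) == sumHash("cat", 100): same letters, same sum.
```

Note that because addition ignores order, any two anagrams produce the same sum, which is exactly the clustering discussed next.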
So this actually tries to use all the information at hand, all the letters that we have, and bring it down into range. And if you figure that, across the space of all the sums being produced, you're just as likely to get 210 as 299 or any of the numbers in between, then hopefully you get a spread across your buckets.

There is one thing this clusters on. Does anybody see what words cluster in this system? Words that you'd think of as not being related, but that would land in the same bucket? [Student: rearrangements of the same letters.] Yes, any anagram. You take "cat", which is just an anagram of "act": it's gonna add to the same sum, so there would be a little bit of clustering in that respect. Anagrams come down to the same place. So words like "act" and "cat", which seem like different words to you, are gonna collide in this system, and that might be something we worry about.

One way to avoid that, which the hash function we're going to eventually use will do, is to not only add the letters together but also multiply each letter by a power of a very large prime number: the prime times the first letter, plus the prime squared times the second letter, plus the prime cubed times the third, and so on. Each letter's position is then also encoded in the number that comes out, so something that's positionally different won't land in the same bucket without some other factors coming into play. It's a little bit of taking the information and jumbling it up some more.

So let me show you a little bit about how this looks. Actually, I'm gonna tell you about collisions first and then I'll show you the data structure. Certainly, as I said, there's gonna be this problem where things collide: in this simple system of adding the codes, two different words can both come out to 83. So it can't be that each of these buckets holds exactly one thing.
With hashing, although we try to keep those buckets as small as possible, we have to account for the possibility that two things will collide, that their hash values, even though they were different keys, will land them in the same bucket based on how we've jumbled them up. We're trying to avoid that, but when it does happen we also have to have a strategy for it: a way of saying there can be more than one thing in a bucket, or some other way of deciding, when there's a collision, where to put the one that was placed in the table second.

The strategy we're gonna use is called chaining. Chaining basically says: once we have something in a bucket, like "act", if we hash something else and it comes to the same bucket, we'll just tack it onto that bucket, in this case using a linked list. So each bucket is actually gonna be a linked list of the entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6. They all start empty, so these are all empty linked lists, illustrated by the NULL, and the hash function is shown here like a black box. You don't actually know the inner workings of the hash function, but you know what its job is: you give it a key and it gives you a number in the range 0 to 6, and that same key, when passed in again, gives you the same number.

So if I say 106B equals Julie, it pushes "106B" through the hash function and says, okay, 106B comes out as one; it jumbled up the characters and got one. So it went into the table and said, okay, that's bucket number one, make a new cell for 106B and Julie and stick it in the table. So now there's a non-empty chain in one place and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value; 107 and 106B are different strings, but it just happens to come out that way.
They both hash to one, so in this case it just tacked 107 onto the front of that list. If I do 106A being taught by Patrick, 106A happens to come out to zero in the hash function; that's all we know about the black box, and it ends up loading it into the zero slot. If I add another course, this one taught by Nick, it also comes out to be zero, and so it gets pushed onto the front of that chain. I add 103 being taught by Maron and it lands in bucket five. So as I type things in, I'm starting to see that things are going to different places in the table: a little bit of clustering at the top, a little bit of empty at the bottom. But as I continued to type in strings for other course numbers, I'd hopefully fill out the table to where there was a roughly even spread across it.

When I want to look something up, because I wanna see who's teaching 106B, it uses the same process in reverse. It asks the hash function: which bucket should I look in? And it says, oh, you should look in bucket one, because 106B's hash code is one. So all the other buckets are eliminated from consideration, a very quick focusing down to the right place, and then it just does a linked list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, 10, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries, so if I wanna put a thousand things in my table, I make a thousand buckets to hold them in, and if my hash function can divide stuff pretty equally across those thousand buckets, then I expect each of those linked lists to be about length one.
Some will be two, some will be zero, but if it's doing its job right, then I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set, in this case one or two entries, and it's very easy to search through them. If I know I'm gonna put a million things in, then I make a million buckets, and to find one out of a million things, I run it through the hash function, it gives me a number that says, oh, it's 862,444; I go to that bucket, and the item is either right there or I have to look a little bit within the bucket to find it, but there's no other place in the table I have to look. The fact that there are a million other entries nearby in these other buckets is completely unimportant to us. The searching and updating, the adding, replacing, and removing, all happen at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of things, I have constant-time access to the tiny subset of the set that matters. That's pretty cool.

Let's write some code; it's good to see the whole thing doing what it needs to do. So I'm gonna go ahead and go back over here. All right, I've got my map, which has add and getValue and not much else, just set up and ready for us to go. What I'm gonna build here is a linked list cell that holds a string key, a ValType value, and a pointer to the next cell. I plan on having one of these cells for each entry that goes in the table, and then I'm going to have an array of these. I'm gonna make a constant for the size, and then I'll talk a little bit about the consequences of that. I make it 99, so I have 99 buckets to start with. That's not gonna solve all my problems; it means the table is kinda tuned for maps that expect to hold about a hundred things.
So I have an array. Deciphering this declaration is a little bit complicated: it says that buckets is the name of the variable, and it's an array of NumBuckets length, a fixed-size, 99-entry array, where each entry is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm gonna go over here and work on the constructor first. The first thing I'm gonna do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness, so I make sure each of them starts off as an empty list. Correspondingly, I would need to do a delete of all the list cells in the destructor, but I've been lazy about that and I'm gonna continue to be lazy, just knowing that I'd have to iterate through and delete all the cells in each of those lists.

Then for add and getValue, what we need to do in both cases is figure out which is the appropriate list to operate on: run the hash function on the key, see where it tells us to look, and then either search that list to find the match, for getValue to return the matching value, or in the add case, search for the matching entry to overwrite if it exists, and if not, create a new cell and tack it on the front of the list.

So let's write helper functions, because both operations are gonna need the same things. I'm gonna put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and then I'm gonna write a findCell function that, given a key and the pointer to a list, will find the cell that matches that key, or return NULL if it didn't find it.

Okay, let me put my hash function in. I have a good hash function down below that I'm gonna pull out in a minute, but first:
What I'm gonna do is make an interesting choice: I'm gonna have my hash function just return zero, which seems pretty surprising. But let me talk about why. It turns out, when I'm first building this, that's actually an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says: we'll just put everything in the zero bucket. That's not a good strategy for my long-term efficiency, but it is a good way to test that my handling of the bucket, and the searching, and the finding, and the inserting is correct. The performance I should expect to be abysmal: it should be totally linear in the number of elements to add or get them, because they'll all be in one linked list with no easy access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's kind of an interesting way to think about designing the code.

The other helper I have down here is gonna be my findCell, which, given the head pointer to a list and the key, goes and finds it. It's basically just: if the current cell's key equals the key, we return it, and if we've gone through the whole list and haven't found it, we return NULL.

So this guy returns a cellT*, the matching cell within the list, and this is gonna provoke that little C++ compilation snafu we encountered on the binary search tree as well: cellT is a type that's declared in the private section, within the scope of MyMap. Outside of that scope, which is to say before I've crossed the MyMap<ValType>:: scope specifier, the compiler doesn't yet believe we're in MyMap's scope; it hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it. And because this use of a template's nested name outside of its class is a special place for the C++ compiler, the typename keyword also needs to go there.
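Putting those pieces together, here's a sketch of how the class might be declared, with the constructor and findCell as described. Names like MyMap, cellT, ValType, and NumBuckets follow the lecture's narration; the exact declarations are an approximation, not the official assignment code:

```cpp
#include <cstddef>
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(const std::string &key, const ValType &value);
    ValType getValue(const std::string &key) const;

private:
    static const int NumBuckets = 99;   // tuned for maps holding ~100 entries
    struct cellT {                      // one cell per entry in the table
        std::string key;
        ValType val;
        cellT *next;
    };
    cellT *buckets[NumBuckets];         // each slot heads one linked-list chain

    int hash(const std::string &key, int numBuckets) const;
    cellT *findCell(const std::string &key, cellT *list) const;
};

template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = NULL;              // every chain starts out empty
    }
}

// Outside the class, the nested return type needs the full scoped name plus
// the typename keyword: exactly the compilation wrinkle described above.
template <typename ValType>
typename MyMap<ValType>::cellT *
MyMap<ValType>::findCell(const std::string &key, cellT *list) const {
    for (cellT *cur = list; cur != NULL; cur = cur->next) {
        if (cur->key == key) return cur;   // found the matching cell
    }
    return NULL;                           // no match in this chain
}
```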
Okay, so that's the heavy lifting for these two things: getting it to hash to the right place, which I kinda short-changed, and searching to find a cell. If I go back up here to getValue, I call the hash function, given my key and the number of buckets, to find out which bucket to work on. I go looking for a match by calling findCell with the key on buckets at that index, and then if match does not equal NULL, I can return match->val; otherwise we raise the error "No such key found." That's the error case for getValue: if it didn't find one, its behavior is to raise an error so that someone knows they've asked for something it can't deal with.

The key thing here is that the hash is usually some sort of constant-time operation that jumbles the key up (in this case it just returned zero, but it could actually look at the elements of the key), and then it does its traversal down that linked list, which, given a correctly working hash function, should have divided the entries into these little tiny chains. Then we can just find the one we want quickly by looking through that linked list. If we find it, we've got it; otherwise we're in the error case.
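As a sketch, getValue as just described might look like this, continuing the class above (a standard exception stands in for the lecture's Error function):

```cpp
#include <stdexcept>

// Hash to the one relevant chain, walk it, and raise an error on a missing key.
template <typename ValType>
ValType MyMap<ValType>::getValue(const std::string &key) const {
    int which = hash(key, NumBuckets);             // which bucket to look in
    cellT *match = findCell(key, buckets[which]);  // search just that chain
    if (match != NULL) return match->val;          // found it: hand back the value
    throw std::runtime_error("No such key found");
}
```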
Add looks almost exactly the same. I start by getting the bucket and doing the findCell; if it did find a match, then we overwrite the value. In the second case, where we didn't find it, we need to make ourselves a new cell, copy in the key, copy in the value, and attach it. The easiest place to add something to a linked list is the front; no need to make it hard on ourselves. We're not keeping the chains in any particular order whatsoever, they're just based on order of entry, so if somebody gives us a new one, we might as well stick it in the front. So the new cell's next field will point to the current front of that bucket's list, and then we update the bucket pointer to point to this new cell.

So I think we're in okay shape here. Let me take a look and make sure I like what I did: hashing, finding the match; if it found an existing cell, that's the overwrite case, and we just overwrite with the new value; otherwise, making a new cell and attaching it to be the front of the bucket. In the case where the bucket's totally empty, it's good to trace that, since it's the most common case when we start: if we hash to bucket zero and do a search of a NULL linked list, we should get NULL out, so we create a new cell and set its next field to be what the current bucket's head pointer was, which was NULL. Then we update the bucket to point to this guy, so we have a single list cell with NULL trailing it: the insert-into-the-empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things: car, car, cat, with some numbers, and then tries to retrieve the value of car, for which it wrote a three and then wrote a four on top of it, so hopefully the answer we get out of this is four. Okay. It would also be interesting to ask for something we know is not in the table, to see that our error case gets handled correctly, and there it is: "No such key found."
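And a sketch of add to match, with the prepend exactly as traced above (again continuing the class sketch, not the official code):

```cpp
// Overwrite the value if the key is already chained in this bucket; otherwise
// prepend a fresh cell, since chains keep no particular order and prepending
// is O(1).
template <typename ValType>
void MyMap<ValType>::add(const std::string &key, const ValType &value) {
    int which = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[which]);
    if (match != NULL) {
        match->val = value;              // existing key: overwrite in place
    } else {
        cellT *newCell = new cellT;      // new key: build a cell for it
        newCell->key = key;
        newCell->val = value;
        newCell->next = buckets[which];  // old front (NULL if bucket was empty)
        buckets[which] = newCell;        // new cell becomes the front
    }
}
```

With this in place, the car/car/cat test behaves as described: the second add finds the existing "car" cell and overwrites the 3 with a 4, so getValue("car") returns 4.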
If I were to continue, I could do a little bit of stress testing to see how that always-zero hash causes performance problems. If I sat there and just put tons and tons of things in, made-up strings stuffed in with numbers, I would be creating one long chain off the zero bucket, ignoring all the other ones. That kinda stress-tests my linked list handling, the adding and searching of that list, but it would also show that as we get to 100 entries, 200 entries, 300 entries, subsequent adds and getValues show a linear increase in time because of the cluster we've got going there.

So let me give you a real hash function. I'll talk you through what it's doing, and then I'll leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're gonna take on here. This is a hash function taken from Don Knuth's The Art of Computer Programming, some of the work that guides a lot of computer scientists to this day. It has the strategy I basically told you about: for all the characters in the string, it adds each one in, but it first multiplies the hash code built up so far by a large multiplier, which in effect takes advantage of the positional information. The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it by the multiplier, and adds the first character, so the first character gets added in without any modification. For the next character, it takes the previous result, multiplies it by one power of the multiplier, and adds in the next character; then that sum, on the subsequent iteration, gets multiplied by the multiplier again, raising earlier contributions to the squared, cubed, and fourth powers as it works down the string. One side effect is that it's gonna do a lot of overflow.
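Here's a sketch of that function, continuing the class above. The shape matches the lecture's description; the specific multiplier is the constant commonly seen in this course's handouts, but treat the exact value as an assumption. The final mod step is discussed next:

```cpp
#include <cstdlib>

// The L suffix marks this as a long constant (more on that below).
static const long MULTIPLIER = -1664117991L;

template <typename ValType>
int MyMap<ValType>::hash(const std::string &s, int numBuckets) const {
    long hashcode = 0;
    for (size_t i = 0; i < s.length(); i++) {
        // Multiply the running code before adding each character, so every
        // letter's position gets woven into the result.
        hashcode = hashcode * MULTIPLIER + s[i];
    }
    // The products silently wrap around as they overflow (strictly speaking,
    // unsigned arithmetic is the portable way to get that wrap), and the final
    // abs + mod folds the result into the range 0..numBuckets-1.
    return std::labs(hashcode) % numBuckets;
}
```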
The multiplier is a very large number, and we're adding and multiplying in this very large space, so in effect we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get so large that they can't be represented. It just truncates back down to what fits, so it wraps around using a modular strategy, like a ring. We'll end up making a very big number out of this: the longer the string, the more multiplies, the higher those powers, and we keep truncating back to what fits in a long. When we're done, we take whatever number fell out of this process and mod it by the number of buckets to give us a number from zero to buckets minus one. So whatever key gets stuffed into here, it's reliable in the sense that every time you put in the same key you'll get the same thing back. We know it will be in the range zero to numBuckets minus one because of that last mod call, and we use that to say which bucket to go look in when we're ready to do our work.

If I switch that in for my old hash function, I shouldn't see any change from the outside, other than the performance, which should suddenly be a lot faster, though that's hard to tell when I have three things in there. I'm gonna take out that error case. In this case, car and cat are now probably not going to the same bucket; in fact, I can guarantee they don't go in the same bucket, whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to get so worried about how the hash code does its work, but to understand the general idea of what a hashing function has to do, what its role is, and how it relates to getting this constant-time performance.

So I'll give you an example of hashing in the real world. I think it's really interesting to see it actually happen, with people not even realizing how useful hashing is.
I ordered something from REI, which is this camping store, and they didn't have it in stock, so they had to place an order for it. This was actually a while ago. When I called to see if it was in, I say, "Oh, is my order in?" And they don't ask me my name. I thought this was very funny. I'm like, "I'm Julie. I wanna know about my order." They're like, "Okay, what are the last two digits of your phone number?" And it's not one of those things you can answer right off the top of your head. Okay, 21; it turns out the last two digits of my phone number are 21. They go and look in the 21 basket, and then they say, "What's your name?" Now they want to know my name. "Okay, my name's Julie." They look it up and they're like, "Okay, it's here."

So then I go to get it a couple days later, and I'm standing in line at the store waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 over here and 0 through 9 over there. When I come in, they ask me the same question: "What's the last two digits of your phone number?" Now I'm all prepared, because I've already been primed on this question. I say, "21." They walk over to the 21 basket, one digit picking the row and the other the column, and there's only one piece of paper in there, and they pull out my order slip, and then they get me my tent and everybody's happy.

So while I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close. They're like, "Okay, yeah, you crazy crackpot. You can leave now." I tried to talk to them about the number of digits, because in some sense this is a very good example of what the investment in the number of buckets is versus what tradeoff it gives you.
Because there's this very physical 673 00:32:12,000 --> 00:32:13,000 sort of setup of this, it's like, 674 00:32:13,000 --> 00:32:17,000 by choosing a larger number of buckets, right, you can more fine grain your 675 00:32:17,000 --> 00:32:18,000 access, 676 00:32:18,000 --> 00:32:22,000 but that investment in buckets means you're kind of permanently installing that storage. So 677 00:32:22,000 --> 00:32:25,000 in this case, right, they didn't use just the last digit of my phone number. They could've done that and 678 00:32:25,000 --> 00:32:28,000 they would have only had ten buckets on the wall, but then they would expect, 679 00:32:28,000 --> 00:32:30,000 you know, ten times as many things in each bucket. 680 00:32:30,000 --> 00:32:33,000 And apparently their 681 00:32:33,000 --> 00:32:37,000 estimate was that the number of active orders in this backorder category was enough 682 00:32:37,000 --> 00:32:38,000 to 683 00:32:38,000 --> 00:32:41,000 warrant being more than ten, you know, significantly more than ten, 684 00:32:41,000 --> 00:32:45,000 but not so much more than 100, so in fact 100 buckets was the right 685 00:32:45,000 --> 00:32:46,000 investment to make in their bucket strategy. 686 00:32:46,000 --> 00:32:49,000 They didn't go to three digits because, you know, like, what are they 687 00:32:49,000 --> 00:32:52,000 gonna do, have this big 3D sort of thing? They didn't enjoy this 688 00:32:52,000 --> 00:32:56,000 discussion, I went on for hours with them and they were, like, really not amused. But if it 689 00:32:56,000 --> 00:32:59,000 had, you know, three digits, then you'd get this even higher likelihood that each 690 00:32:59,000 --> 00:33:03,000 bucket would be empty or have a very small number of things, but given 691 00:33:03,000 --> 00:33:07,000 their setup, the right strategy seemed to be 100 buckets. And now, they didn't ask me 692 00:33:07,000 --> 00:33:09,000 the 693 00:33:09,000 --> 00:33:11,000 first two digits of my phone number. 694 00:33:11,000 --> 00:33:14,000 They asked me the last two. Why does that 695 00:33:14,000 --> 00:33:18,000 matter? Because you're using all [inaudible]. Yeah. It varies a 696 00:33:18,000 --> 00:33:21,000 lot. If you ask, like, Stanford students, right, like, especially 697 00:33:21,000 --> 00:33:25,000 when, you know, campus numbers used to all be 49's or 72's or something, so like, if 698 00:33:25,000 --> 00:33:27,000 you used the first two digits, right, you'd have everybody in the 49 bucket or 699 00:33:27,000 --> 00:33:29,000 the 72 bucket or 700 00:33:29,000 --> 00:33:32,000 something, and a whole bunch of other ones not used. As an example, the first two digits are never 701 00:33:32,000 --> 00:33:35,000 00, like, there's a whole bunch of buckets that don't even get used as the first two digits of a 702 00:33:35,000 --> 00:33:38,000 phone number. Phone numbers never start with zeros. 703 00:33:38,000 --> 00:33:42,000 They rarely actually have zeros or ones in them at all in the leading digits. They try to use those for the 704 00:33:42,000 --> 00:33:44,000 area code and stuff. So I 705 00:33:44,000 --> 00:33:45,000 thought it was just a really 706 00:33:45,000 --> 00:33:48,000 interesting exercise, like, oh yeah, they had exactly kinda thought through hashing.
Of 707 00:33:48,000 --> 00:33:52,000 course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me 708 00:33:52,000 --> 00:33:56,000 my stuff, but I paid my money and they [inaudible], 709 00:33:56,000 --> 00:34:00,000 but it kinda shows you, okay, what you're trying to do here is make that number of 710 00:34:00,000 --> 00:34:00,000 buckets, 711 00:34:00,000 --> 00:34:03,000 right, roughly equal to the number of elements 712 00:34:03,000 --> 00:34:04,000 so that, 713 00:34:04,000 --> 00:34:09,000 if your hash function's dividing up across your whole world, 714 00:34:09,000 --> 00:34:11,000 you've got 715 00:34:11,000 --> 00:34:14,000 constant access to what 716 00:34:14,000 --> 00:34:17,000 you have. Why is there an L at the end of one item? You know, that's actually just C++ syntax for this. This is a long value 717 00:34:17,000 --> 00:34:21,000 that without it, it assumes it's an int, and that number is too big to fit in an int. A long is 718 00:34:21,000 --> 00:34:25,000 twice as big as an int in terms of space, and so the L needs to be there, otherwise it 719 00:34:25,000 --> 00:34:26,000 will 720 00:34:26,000 --> 00:34:30,000 try to read it as an int and throw away part of the information I was trying to give it. So it's the 721 00:34:30,000 --> 00:34:31,000 way 722 00:34:31,000 --> 00:34:35,000 of identifying it as a long constant. 723 00:34:35,000 --> 00:34:37,000 So let me talk about this just a little bit more 724 00:34:37,000 --> 00:34:39,000 in terms of 725 00:34:39,000 --> 00:34:41,000 726 00:34:41,000 --> 00:34:43,000 actual performance, right. We've got N 727 00:34:43,000 --> 00:34:46,000 over B, right, so if the division is even, right, the number of entries over 728 00:34:46,000 --> 00:34:48,000 the number of buckets, 729 00:34:48,000 --> 00:34:48,000 730 00:34:48,000 --> 00:34:51,000 if we make them kind of on the same order of 731 00:34:51,000 --> 00:34:52,000 magnitude, kind of in the same range, 732 00:34:52,000 --> 00:34:55,000 so doing this means we have to make a little bit of an estimated guess about 733 00:34:55,000 --> 00:34:58,000 where we're going, 734 00:34:58,000 --> 00:35:01,000 and then there are some choices here in how we store each bucket, right. We could use 735 00:35:01,000 --> 00:35:05,000 a little array, we could use a linked list, right, we could use a vector. 736 00:35:05,000 --> 00:35:07,000 We expect this to be very small, is the truth. 737 00:35:07,000 --> 00:35:09,000 We're hoping that it's gonna be one or two, 738 00:35:09,000 --> 00:35:13,000 and so there's not a lot of reason to buy any real expensive structure or even bother 739 00:35:13,000 --> 00:35:17,000 doing things like sorting. Like, you could sort them, but if you expect there to be two 740 00:35:17,000 --> 00:35:19,000 things, what's the point? You're gonna spend more time 741 00:35:19,000 --> 00:35:21,000 figuring out whether to put it in the front or the back 742 00:35:21,000 --> 00:35:25,000 than you would gain at all by being able to 743 00:35:25,000 --> 00:35:27,000 find it a little faster. 744 00:35:27,000 --> 00:35:28,000 So the linked list 745 00:35:28,000 --> 00:35:32,000 is just gonna be the easiest way to get things allocated without over capacity, 746 00:35:32,000 --> 00:35:35,000 excess capacity that's unused in the bucket. You know it's gonna be small, use the 747 00:35:35,000 --> 00:35:37,000 simple strategy.
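[Continuing that sketch, one plausible layout for the table being described: an array of bucket head pointers, each the start of a short singly linked chain. The names are assumptions for illustration, and findCell reuses the hashString sketch from above.]

    // Each entry lives in one cell of its bucket's chain.
    struct Cell {
        string key;
        string value;
        Cell *next;      // the per-entry chain pointer
    };

    Cell **buckets;      // array of numBuckets head pointers, all starting NULL
    int numBuckets;

    // Hash to the right bucket, then walk its (expected one- or two-cell) chain.
    Cell *findCell(const string &key)
    {
        int which = hashString(key, numBuckets);
        for (Cell *cur = buckets[which]; cur != NULL; cur = cur->next) {
            if (cur->key == key) return cur;
        }
        return NULL;     // not present in its bucket
    }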
748 00:35:37,000 --> 00:35:41,000 It's also, though, an important thing that the hash table is likely to do in kind of an 749 00:35:41,000 --> 00:35:44,000 industrial strength situation: it does have to deal with this idea, well, what if that number of 750 00:35:44,000 --> 00:35:47,000 buckets that you predicted was wrong? 751 00:35:47,000 --> 00:35:51,000 And so the map as we have given it to you, right, has to just sorta take its best guess, because 752 00:35:51,000 --> 00:35:54,000 it doesn't know in advance how many things you're gonna put in. 753 00:35:54,000 --> 00:35:57,000 The only way to know that is to wait and see as things get added, 754 00:35:57,000 --> 00:35:59,000 and then realize your buckets are getting full. 755 00:35:59,000 --> 00:36:02,000 So typically the industrial strength version of the hash table is gonna 756 00:36:02,000 --> 00:36:06,000 track what's called the load factor. And the load factor is just the number of total 757 00:36:06,000 --> 00:36:08,000 entries divided by the number of buckets. 758 00:36:08,000 --> 00:36:09,000 When that 759 00:36:09,000 --> 00:36:13,000 factor gets high, and high is actually quite small, in fact, 760 00:36:13,000 --> 00:36:15,000 two is often the trigger for 761 00:36:15,000 --> 00:36:19,000 causing a rehash situation, so 762 00:36:19,000 --> 00:36:22,000 not even letting it get to be three or four. Just saying, as soon as you realize that 763 00:36:22,000 --> 00:36:22,000 you have 764 00:36:22,000 --> 00:36:26,000 exceeded by double the capacity you intended, you planned for 100 things, 765 00:36:26,000 --> 00:36:27,000 you got 200 things, 766 00:36:27,000 --> 00:36:28,000 go ahead 767 00:36:28,000 --> 00:36:31,000 and redo your whole bucket array. 768 00:36:31,000 --> 00:36:39,000 So in the case, for example, of our simple 0 through 7 case, or maybe 769 00:36:39,000 --> 00:36:43,000 it's 0 through 6, one, two, three, 770 00:36:43,000 --> 00:36:45,000 four, say I have one that had six buckets, right, 771 00:36:45,000 --> 00:36:48,000 and then I've gotten to where 772 00:36:48,000 --> 00:36:50,000 each of them has 773 00:36:50,000 --> 00:36:54,000 two, or maybe three in some and one in another. 774 00:36:54,000 --> 00:36:57,000 The plan is to just take this whole bucket array and 775 00:36:57,000 --> 00:37:01,000 enlarge it by a factor of two, so grow it, in this case, from 6 to 12, 776 00:37:01,000 --> 00:37:02,000 and then rehash 777 00:37:02,000 --> 00:37:04,000 to move everybody. The 778 00:37:04,000 --> 00:37:08,000 idea is that the earlier decision about where to place them was based on knowing that 779 00:37:08,000 --> 00:37:11,000 there were exactly six buckets, and so where it was placed before 780 00:37:11,000 --> 00:37:15,000 is not likely to be the place it will be placed if you have, you know, 781 00:37:15,000 --> 00:37:18,000 12 choices to lay them down into. 782 00:37:18,000 --> 00:37:21,000 And ideally, in fact, you don't even want it to be the case
that, like, they all land into the 783 00:37:21,000 --> 00:37:23,000 same bucket, right; you would have 784 00:37:23,000 --> 00:37:25,000 the same clustering and then a big empty clump, 785 00:37:25,000 --> 00:37:28,000 and so you would rehash everything, run it through the new hash function having 786 00:37:28,000 --> 00:37:31,000 changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?'' 787 00:37:31,000 --> 00:37:34,000 And hopefully you'd end up getting a clean table, right, where you had 12 788 00:37:34,000 --> 00:37:37,000 items now with one in each bucket, 789 00:37:37,000 --> 00:37:40,000 ready to kinda give constant performance through 790 00:37:40,000 --> 00:37:42,000 that, and then potentially, 791 00:37:42,000 --> 00:37:43,000 again if you exceed 792 00:37:43,000 --> 00:37:47,000 your load factor, go back in and rearrange stuff again. 793 00:37:47,000 --> 00:37:50,000 So this strategy is a little bit like the one we talked about with the binary search tree, of just 794 00:37:50,000 --> 00:37:51,000 wait and see. Rather 795 00:37:51,000 --> 00:37:55,000 than reacting right away, you know, just because you get two items in a bucket, it doesn't immediately cause any 796 00:37:55,000 --> 00:37:57,000 real alarm or crisis; 797 00:37:57,000 --> 00:38:00,000 it waits until there's kinda sufficiently clear demand 798 00:38:00,000 --> 00:38:03,000 that exceeds the capacity planned for, 799 00:38:03,000 --> 00:38:05,000 to do a big rearrangement. 800 00:38:05,000 --> 00:38:08,000 So you'd end up saying that adds, adds, adds would be fast, fast, fast, fast, fast, 801 00:38:08,000 --> 00:38:12,000 starting to get just a hair slower, right, having to do a search through a few things as opposed 802 00:38:12,000 --> 00:38:13,000 to one, 803 00:38:13,000 --> 00:38:16,000 and then every now and then, right, you'd see an add which caused a big reshuffle of 804 00:38:16,000 --> 00:38:19,000 the table, which would be a totally linear operation in the number of 805 00:38:19,000 --> 00:38:20,000 elements, 806 00:38:20,000 --> 00:38:22,000 and then they'd be fast again 807 00:38:22,000 --> 00:38:24,000 for another big clump of inserts until that 808 00:38:24,000 --> 00:38:27,000 same expansion might be repeated, triggered by 809 00:38:27,000 --> 00:38:28,000 it growing even more. 810 00:38:28,000 --> 00:38:31,000 But dividing it sorta by the total number of inserts, so if you had 1,000 inserts 811 00:38:31,000 --> 00:38:35,000 before you overloaded, and then you had to copy all 1,000 things to get ready for the 812 00:38:35,000 --> 00:38:36,000 next thousand, 813 00:38:36,000 --> 00:38:39,000 if you average that out, it still ends up 814 00:38:39,000 --> 00:38:43,000 being called a constant operation. So 815 00:38:43,000 --> 00:38:46,000 that's pretty neat. 816 00:38:46,000 --> 00:38:47,000 It's 817 00:38:47,000 --> 00:38:50,000 really one of the, sort of, more, I think, 818 00:38:50,000 --> 00:38:53,000 easily understood and beautiful algorithmic techniques, right, for solving 819 00:38:53,000 --> 00:38:55,000 this hard problem of 820 00:38:55,000 --> 00:38:57,000 search and retrieval 821 00:38:57,000 --> 00:38:58,000 that 822 00:38:58,000 --> 00:39:01,000 the vector and the sorting and the BST are all trying to get at.
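[A sketch of that load-factor bookkeeping, continuing the illustrative names from the earlier sketches. The trigger of two matches the threshold described above; the structure of add and rehash is an assumption, not the actual map implementation.]

    int numEntries = 0;

    // Double the bucket array and re-place every cell under the new count.
    void rehash(int newNumBuckets)
    {
        Cell **newBuckets = new Cell *[newNumBuckets]();   // all heads start NULL
        for (int i = 0; i < numBuckets; i++) {
            for (Cell *cur = buckets[i]; cur != NULL; ) {
                Cell *next = cur->next;
                int which = hashString(cur->key, newNumBuckets);  // new count, new placement
                cur->next = newBuckets[which];
                newBuckets[which] = cur;
                cur = next;
            }
        }
        delete[] buckets;
        buckets = newBuckets;
        numBuckets = newNumBuckets;
    }

    void add(const string &key, const string &value)
    {
        Cell *match = findCell(key);
        if (match != NULL) { match->value = value; return; }  // repeated key: overwrite
        int which = hashString(key, numBuckets);
        Cell *cell = new Cell;
        cell->key = key;
        cell->value = value;
        cell->next = buckets[which];
        buckets[which] = cell;
        if (++numEntries > 2 * numBuckets) rehash(2 * numBuckets);  // load factor passed 2
    }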
But sometimes the 823 00:39:01,000 --> 00:39:04,000 sorted vector and the BST are solving actually a much harder problem, right, 824 00:39:04,000 --> 00:39:09,000 which is keeping a total ordering of all the data, and being able to 825 00:39:09,000 --> 00:39:12,000 traverse it in sorted order is one of the advantages that the sorted vector 826 00:39:12,000 --> 00:39:12,000 and 827 00:39:12,000 --> 00:39:14,000 the BST have that the hash table doesn't have at all. 828 00:39:14,000 --> 00:39:17,000 In fact, it's jumbled it all up, and so if you think about how the iterator for 829 00:39:17,000 --> 00:39:21,000 the map works (our map is actually implemented using a hash table), 830 00:39:21,000 --> 00:39:24,000 it actually just iterates over the linked lists, and that explains why it almost appears 831 00:39:24,000 --> 00:39:27,000 completely random. If you put a bunch of things in the table and you go to iterate 832 00:39:27,000 --> 00:39:28,000 and 833 00:39:28,000 --> 00:39:29,000 visit them again, 834 00:39:29,000 --> 00:39:33,000 the order you visit the keys seems to make no sense, and that's because it's based on 835 00:39:33,000 --> 00:39:35,000 their hash codes, right, which 836 00:39:35,000 --> 00:39:36,000 aren't designed 837 00:39:36,000 --> 00:39:37,000 to be 838 00:39:37,000 --> 00:39:39,000 any real sensible ordering 839 00:39:39,000 --> 00:39:42,000 to anybody not in the know of how the hash function works, 840 00:39:42,000 --> 00:39:46,000 whereas iterating over a vector that's sorted or iterating over a BST 841 00:39:46,000 --> 00:39:47,000 or, 842 00:39:47,000 --> 00:39:51,000 in this case, our set, for example, which is backed by a binary search tree, 843 00:39:51,000 --> 00:39:54,000 will give you that sorted order. So it solves this harder problem, right, which causes there to 844 00:39:54,000 --> 00:39:57,000 be kinda more investment in that problem, whereas hashing solves just exactly the 845 00:39:57,000 --> 00:39:58,000 one problem, which is, 846 00:39:58,000 --> 00:40:02,000 I want to be able to find exactly this value again 847 00:40:02,000 --> 00:40:03,000 and update this value, 848 00:40:03,000 --> 00:40:08,000 and nothing more is needed than just identifying this, 849 00:40:08,000 --> 00:40:10,000 not finding the things that are less than this. For example, if you needed to find all 850 00:40:10,000 --> 00:40:13,000 the values that were less than this key alphabetically, 851 00:40:13,000 --> 00:40:15,000 right, the hash table makes that 852 00:40:15,000 --> 00:40:16,000 no easier 853 00:40:16,000 --> 00:40:18,000 than an unsorted vector, 854 00:40:18,000 --> 00:40:21,000 whereas things like a sorted vector and a binary search tree actually enable that kinda search; 855 00:40:21,000 --> 00:40:24,000 you could find the place in the tree and kind of work your way down the left side 856 00:40:24,000 --> 00:40:27,000 to find the things that were less than that. 857 00:40:27,000 --> 00:40:31,000 That's not actually being solved by the hash at all. Its use of 858 00:40:31,000 --> 00:40:34,000 memory is actually comparable to a 859 00:40:34,000 --> 00:40:35,000 binary search tree, 860 00:40:35,000 --> 00:40:38,000 in that it has a four byte pointer per entry, which is the next field on the 861 00:40:38,000 --> 00:40:40,000 linked list chain, and 862 00:40:40,000 --> 00:40:44,000 then there's a four byte pointer for each bucket, pointing to the head 863 00:40:44,000 --> 00:40:47,000 cell of the linked list.
Given that we expect the buckets to be roughly equal to the entries, 864 00:40:47,000 --> 00:40:51,000 we can kinda just summarize that as well: each entry 865 00:40:51,000 --> 00:40:53,000 represents an eight byte overhead, 866 00:40:53,000 --> 00:40:56,000 which is the same eight byte overhead, right, that the left and right pointers of 867 00:40:56,000 --> 00:40:58,000 the binary search 868 00:40:58,000 --> 00:40:59,000 tree add. 869 00:40:59,000 --> 00:41:01,000 Does it have any degenerate cases? That's a good 870 00:41:01,000 --> 00:41:06,000 question. So one of the things that made the binary search tree a little bit of a 871 00:41:06,000 --> 00:41:08,000 heavy investment was that to get that 872 00:41:08,000 --> 00:41:11,000 balancing behavior, even when we're getting data coming in in a way that's 873 00:41:11,000 --> 00:41:16,000 causing a lopsided construction, we have to go out of our way to do something special. 874 00:41:16,000 --> 00:41:19,000 What is the degenerate case for hash? Is there something that causes it to behave really, 875 00:41:19,000 --> 00:41:24,000 really badly? Anyone wanna help me out with 876 00:41:24,000 --> 00:41:28,000 that? Is it always good? Does it always give a good spread? Can we 877 00:41:28,000 --> 00:41:32,000 end up with everything in the same bucket somehow? [Inaudible] repeated elements that way? 878 00:41:32,000 --> 00:41:35,000 So if you have repeated elements, the way that 879 00:41:35,000 --> 00:41:37,000 both versions of the map work is they overwrite. 880 00:41:37,000 --> 00:41:41,000 So actually the nice thing is, yeah, if you did it again you'd just take that same cell and 881 00:41:41,000 --> 00:41:43,000 overwrite it, so in fact we wouldn't get a clustering based on the same entry 882 00:41:43,000 --> 00:41:45,000 going in and out. 883 00:41:45,000 --> 00:41:51,000 But how would we end up with everything being in the same bucket? 884 00:41:51,000 --> 00:41:54,000 It mostly comes out, if you look at your hash function, of asking, when would a hash function collide? 885 00:41:54,000 --> 00:41:58,000 Right, and if you have a dumb hash function, right, you can definitely have some degenerate 886 00:41:58,000 --> 00:42:01,000 cases, my dumb hash function of return zero, right, being the 887 00:42:01,000 --> 00:42:03,000 worst example of that, 888 00:42:03,000 --> 00:42:07,000 but any hash function, right, that wasn't being particularly clever about using all of its 889 00:42:07,000 --> 00:42:09,000 information might actually have some clustering in 890 00:42:09,000 --> 00:42:12,000 it. It is possible, for example, even with a particularly good hash function, that 891 00:42:12,000 --> 00:42:15,000 there are strings that will hash to the same thing, so if you somehow got 892 00:42:15,000 --> 00:42:18,000 really, really unlucky, 893 00:42:18,000 --> 00:42:19,000 894 00:42:19,000 --> 00:42:23,000 and had a hash function that wasn't doing the best job possible, you could get some 895 00:42:23,000 --> 00:42:25,000 clustering. But in general, 896 00:42:25,000 --> 00:42:29,000 the sole responsibility, right, for the degenerate case comes down to your hash function.
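[The return-zero function mentioned above, spelled out as a sketch: it is a legal hash, deterministic and always in range, but it sends every key to bucket 0, so the table degenerates into one long chain and lookup becomes linear.]

    int dumbHash(const string &key, int numBuckets)
    {
        return 0;   // every key collides; bucket 0 holds everything
    }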
897 00:42:29,000 --> 00:42:31,000 So as long as your hash function 898 00:42:31,000 --> 00:42:34,000 and your inputs match your expectation, 899 00:42:34,000 --> 00:42:35,000 you don't have any surprises about 900 00:42:35,000 --> 00:42:38,000 how they got inserted, the way you did in a binary search tree, which is sometimes hard 901 00:42:38,000 --> 00:42:40,000 to control. 902 00:42:40,000 --> 00:42:43,000 Just as you could underestimate the number of buckets you need -- Yeah. -- could 903 00:42:43,000 --> 00:42:46,000 you overestimate the number of buckets you 904 00:42:46,000 --> 00:42:51,000 need by a large amount -- Um hm. -- and that's wasting a lot of -- Exactly. -- memory, and 905 00:42:51,000 --> 00:42:52,000 do you then go through and take 906 00:42:52,000 --> 00:42:54,000 that memory back? Yeah. 907 00:42:54,000 --> 00:42:57,000 Yeah, so that's a great question. So a lot of data structures 908 00:42:57,000 --> 00:42:58,000 don't shrink, 909 00:42:58,000 --> 00:43:02,000 right, and for example, the vector often will only grow on demand, and then even if you deleted a 910 00:43:02,000 --> 00:43:04,000 bunch of, or removed a bunch of, elements, it doesn't 911 00:43:04,000 --> 00:43:06,000 necessarily consider shrinking a big deal, and so 912 00:43:06,000 --> 00:43:10,000 that's often, maybe, a little bit lazy, but it turns out, right, that having a bunch of memory around 913 00:43:10,000 --> 00:43:12,000 you're not using 914 00:43:12,000 --> 00:43:15,000 doesn't have nearly the same penalty that not 915 00:43:15,000 --> 00:43:17,000 having the memory when you needed it, and having to do a lot of extra work to find 916 00:43:17,000 --> 00:43:18,000 things, does. 917 00:43:18,000 --> 00:43:21,000 So typically, right, you would try to pick a size that you are 918 00:43:21,000 --> 00:43:26,000 willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.'' 919 00:43:26,000 --> 00:43:28,000 920 00:43:28,000 --> 00:43:30,000 But then, as you grow, you might not shrink back to that size. You might just 921 00:43:30,000 --> 00:43:34,000 let that capacity kinda sit there unused. There are good 922 00:43:34,000 --> 00:43:37,000 reasons to actually be tidy about it, but again, some of this is hard to predict. 923 00:43:37,000 --> 00:43:40,000 It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows 924 00:43:40,000 --> 00:43:41,000 big again. Like, maybe you 925 00:43:41,000 --> 00:43:43,000 have a new freshman class come in, 926 00:43:43,000 --> 00:43:45,000 and then you graduate a whole bunch of people. Then 927 00:43:45,000 --> 00:43:47,000 at the point where you graduate them all, you're about ready to take in a new freshman 928 00:43:47,000 --> 00:43:51,000 class, and so it might be that, in fact, right, the capacity you just got rid of, you're gonna plan on 929 00:43:51,000 --> 00:43:53,000 reusing in a minute anyway, and so maybe 930 00:43:53,000 --> 00:43:56,000 clearing it out and totally releasing it 931 00:43:56,000 --> 00:43:57,000 may actually be an exercise 932 00:43:57,000 --> 00:44:01,000 that was unimportant. You're planning on getting back to that size eventually, anyway. 933 00:44:01,000 --> 00:44:04,000 Most hash tables tend to grow, is actually the truth. Right, you tend to 934 00:44:04,000 --> 00:44:07,000 be collecting data in there and 935 00:44:07,000 --> 00:44:11,000 they tend to only enlarge.
It's kind of unusual that you take out a bunch of 936 00:44:11,000 --> 00:44:15,000 entries you already put in. 937 00:44:15,000 --> 00:44:18,000 And so I just put a little last note, which is, like, okay, well, I had said earlier that when we 938 00:44:18,000 --> 00:44:19,000 have the map, 939 00:44:19,000 --> 00:44:22,000 the key had to be string type, and I was gonna at 940 00:44:22,000 --> 00:44:25,000 some point come back to this and talk about 941 00:44:25,000 --> 00:44:28,000 why that was the one restriction in all our template classes, right; you can put anything you 942 00:44:28,000 --> 00:44:31,000 want in a set or vector or stack or queue. 943 00:44:31,000 --> 00:44:34,000 The map lets you store any kind of value associated with that key, but that key was 944 00:44:34,000 --> 00:44:36,000 string type, and 945 00:44:36,000 --> 00:44:40,000 so given how our map is implemented as a hash table, that actually starts to make sense: if 946 00:44:40,000 --> 00:44:42,000 you're gonna take the key and kind of jumble it up and map it to 947 00:44:42,000 --> 00:44:44,000 a bucket, 948 00:44:44,000 --> 00:44:47,000 you need to know something about how to extract information from the key, and so in 949 00:44:47,000 --> 00:44:51,000 this case, knowing it's a string means you know it has characters, you know its length, and you can do 950 00:44:51,000 --> 00:44:52,000 those things. 951 00:44:52,000 --> 00:44:53,000 If you wanted to make a map 952 00:44:53,000 --> 00:44:57,000 that could take any kind of key, maybe integers as keys or student structures as 953 00:44:57,000 --> 00:44:59,000 keys or, you 954 00:44:59,000 --> 00:45:00,000 know, 955 00:45:00,000 --> 00:45:01,000 doubles as keys, 956 00:45:01,000 --> 00:45:05,000 any of these things, it's like, well, how can you write a generic form 957 00:45:05,000 --> 00:45:06,000 that could use 958 00:45:06,000 --> 00:45:08,000 a hashing operation on that 959 00:45:08,000 --> 00:45:11,000 kind of key and map it to bucket numbers? 960 00:45:11,000 --> 00:45:14,000 You would build a two type template to kinda get this working in C++, so 961 00:45:14,000 --> 00:45:18,000 you can actually have a template that has a type name for the key type, a type name for 962 00:45:18,000 --> 00:45:19,000 the value type, 963 00:45:19,000 --> 00:45:20,000 and then 964 00:45:20,000 --> 00:45:25,000 in the member functions you'd refer to both key type and value type as these two 965 00:45:25,000 --> 00:45:26,000 distinct placeholders. 966 00:45:26,000 --> 00:45:29,000 And then when the client set it up, right, they'd be saying, ''Okay, I want a map 967 00:45:29,000 --> 00:45:32,000 that has strings that map to integers.'' So maybe this is 968 00:45:32,000 --> 00:45:36,000 words and the pages they appear on in a document. Or I have a map of integers to vectors of 969 00:45:36,000 --> 00:45:40,000 strings. Maybe this is score on an exam and the names of the students who got that 970 00:45:40,000 --> 00:45:42,000 score, doing it that way, 971 00:45:42,000 --> 00:45:44,000 and to make this work, right, 972 00:45:44,000 --> 00:45:48,000 there has to be some kind of universal hashing that, given some generic type, 973 00:45:48,000 --> 00:45:53,000 can turn it into an integer in the range of buckets. 974 00:45:53,000 --> 00:45:53,000 975 00:45:53,000 --> 00:45:56,000 What it's gonna require here is that the client get involved.
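[A sketch of what such a two-type template with a client-supplied hash callback might look like; this interface is a guess for illustration, not the actual library's, and Student with its idNumber field is a hypothetical client type.]

    template <typename KeyType, typename ValueType>
    class Map {
    public:
        // The client supplies the function that maps any key to a
        // bucket number in the range 0 to numBuckets - 1.
        Map(int (*hashFn)(KeyType key, int numBuckets)) : hash(hashFn) {}
        void add(KeyType key, ValueType value);
        ValueType getValue(KeyType key);
    private:
        int (*hash)(KeyType, int);   // used internally to pick a bucket
    };

    // For example, a client keying on the hypothetical Student structure:
    struct Student { int idNumber; };
    int hashStudent(Student s, int numBuckets)
    {
        return s.idNumber % numBuckets;   // assumes a nonnegative idNumber
    }
    Map<Student, double> examScores(hashStudent);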
976 00:45:56,000 --> 00:45:58,000 You cannot write this 977 00:45:58,000 --> 00:46:00,000 generic 978 00:46:00,000 --> 00:46:01,000 hash function 979 00:46:01,000 --> 00:46:04,000 that will work for all types, right? If it's a student structure, what are you gonna look at? The 980 00:46:04,000 --> 00:46:08,000 ID field, the name field? Like, there's just no sensible way that you can 981 00:46:08,000 --> 00:46:10,000 talk about a generic type and talk about how to hash 982 00:46:10,000 --> 00:46:14,000 it, so what you would have to do is have some sort of client callback, so probably you would 983 00:46:14,000 --> 00:46:18,000 create this by passing in a callback that, given a key type thing and a number of 984 00:46:18,000 --> 00:46:20,000 buckets, would 985 00:46:20,000 --> 00:46:24,000 do the necessary machinations, right, to hash 986 00:46:24,000 --> 00:46:26,000 it into the right number. 987 00:46:26,000 --> 00:46:30,000 And so that would be kinda one of those coordination things between the client and the implementer, 988 00:46:30,000 --> 00:46:34,000 where the client knows what the data is it's trying to store and kinda how to 989 00:46:34,000 --> 00:46:35,000 990 00:46:35,000 --> 00:46:41,000 mash it up, and then the map would know, okay, given that number, where to stick it. So 991 00:46:41,000 --> 00:46:42,000 that's what it would take, and then, 992 00:46:42,000 --> 00:46:45,000 rather than sorta load that onto our 993 00:46:45,000 --> 00:46:48,000 novice heads, we went ahead and just said, ''Okay, it'll always be a string, so that we can 994 00:46:48,000 --> 00:46:53,000 do a hash internally without the client having to get involved.'' All right, 995 00:46:53,000 --> 00:46:55,000 any questions about hashing? I'm gonna 996 00:46:55,000 --> 00:46:59,000 give you, like, a two second talk about set, 997 00:46:59,000 --> 00:47:01,000 because it's the one 998 00:47:01,000 --> 00:47:04,000 ADT where I never said, ''Well, how does it work? What do we do?'' 999 00:47:04,000 --> 00:47:07,000 It doesn't add any new things that we haven't already done, so in fact I'm just 1000 00:47:07,000 --> 00:47:10,000 gonna tell you what it does, and 1001 00:47:10,000 --> 00:47:13,000 then your picture will be complete, right. 1002 00:47:13,000 --> 00:47:16,000 It wants to do fast searching, fast updating, add and remove, and it also has all 1003 00:47:16,000 --> 00:47:21,000 these high level operations, such as intersect, union and difference, 1004 00:47:21,000 --> 00:47:24,000 that are kind of its primary set of things to do. 1005 00:47:24,000 --> 00:47:25,000 1006 00:47:25,000 --> 00:47:27,000 A bunch of the things we've already seen, right, 1007 00:47:27,000 --> 00:47:30,000 would actually be a reasonable strategy for implementing set, right, using some kind 1008 00:47:30,000 --> 00:47:31,000 of vector or array. 1009 00:47:31,000 --> 00:47:34,000 Most likely you'd want to keep it in sorted order, because that would buy you 1010 00:47:34,000 --> 00:47:37,000 the fast lookup for contains, 1011 00:47:37,000 --> 00:47:41,000 but the add and remove, right, would necessarily be slow in those cases because of the 1012 00:47:41,000 --> 00:47:43,000 contiguous memory.
1013 00:47:43,000 --> 00:47:44,000 1014 00:47:44,000 --> 00:47:47,000 The linked list? Not really too good of an option here, 1015 00:47:47,000 --> 00:47:50,000 because contains then becomes 1016 00:47:50,000 --> 00:47:51,000 linear, and 1017 00:47:51,000 --> 00:47:54,000 add and remove similarly, right, linear, so it doesn't actually get you anywhere; in 1018 00:47:54,000 --> 00:47:57,000 fact it just adds the memory in pointers and gunk like that. 1019 00:47:57,000 --> 00:47:59,000 The binary search tree? Probably a 1020 00:47:59,000 --> 00:48:00,000 pretty good call, 1021 00:48:00,000 --> 00:48:03,000 right; that 1022 00:48:03,000 --> 00:48:06,000 kinda meshes the advantage of a dynamic structure with the sorted property, to 1023 00:48:06,000 --> 00:48:11,000 be able to give it fast updating; adding and removing and searching all can be done 1024 00:48:11,000 --> 00:48:14,000 in logarithmic time if you have balancing. 1025 00:48:14,000 --> 00:48:17,000 And then it actually also enables the high level ops, so the idea of being able to do union, 1026 00:48:17,000 --> 00:48:19,000 intersection and difference, and actually being able to 1027 00:48:19,000 --> 00:48:23,000 walk over both of the trees in sorted order, which is very easily done 1028 00:48:23,000 --> 00:48:24,000 with an [inaudible] traversal, 1029 00:48:24,000 --> 00:48:28,000 makes those operations actually more efficient to implement than kinda having to deal with 1030 00:48:28,000 --> 00:48:32,000 data coming at you all different ways. It 1031 00:48:32,000 --> 00:48:33,000 turns out hashing really 1032 00:48:33,000 --> 00:48:35,000 won't quite do it 1033 00:48:35,000 --> 00:48:36,000 1034 00:48:36,000 --> 00:48:37,000 very easily for us, 1035 00:48:37,000 --> 00:48:40,000 because of the same need about the template situation: like, well, you have to have a hash function that 1036 00:48:40,000 --> 00:48:43,000 can hash these things and do the right thing, and so the tree happens to be a 1037 00:48:43,000 --> 00:48:46,000 good fit; if the client can just give you ordering information, you can 1038 00:48:46,000 --> 00:48:48,000 do all the hard work without 1039 00:48:48,000 --> 00:48:50,000 pushing hashing onto them. So in fact 1040 00:48:50,000 --> 00:48:53,000 the way our set really does work is it does use a binary search tree. It happens to 1041 00:48:53,000 --> 00:48:56,000 use it through another layered abstraction, 1042 00:48:56,000 --> 00:48:59,000 in that there actually is a BST class that we've never really 1043 00:48:59,000 --> 00:49:02,000 encountered directly, but the BST class just takes the idea of a binary search 1044 00:49:02,000 --> 00:49:06,000 tree, adds balancing and 1045 00:49:06,000 --> 00:49:08,000 all the properties about finding and inserting and adding, 1046 00:49:08,000 --> 00:49:12,000 and packages it up, and then set actually just turns out to be one-liners 1047 00:49:12,000 --> 00:49:14,000 making calls into the binary search tree. 1048 00:49:14,000 --> 00:49:17,000 This is where all the heavy lifting 1049 00:49:17,000 --> 00:49:18,000 goes, down there. 1050 00:49:18,000 --> 00:49:23,000 That kinda completes the big picture. So, summing stuff up, right: our map uses our hash table, right; our 1051 00:49:23,000 --> 00:49:26,000 set uses the BST,
which is the binary search tree; 1052 00:49:26,000 --> 00:49:28,000 our vector, right, uses an array; 1053 00:49:28,000 --> 00:49:33,000 and then the stacks and queues have viable strategies both for array 1054 00:49:33,000 --> 00:49:37,000 or vector based backing as well as linked lists, both getting good time performance. And then 1055 00:49:37,000 --> 00:49:38,000 you saw with the PQ, right, 1056 00:49:38,000 --> 00:49:40,000 even 1057 00:49:40,000 --> 00:49:44,000 yet another structure, the heap, right, which is kinda a variant of the tree. And so those tend to be the 1058 00:49:44,000 --> 00:49:46,000 kind of really classic structures on which a lot of things get built, 1059 00:49:46,000 --> 00:49:51,000 and so having a chance to use them both as a client and kinda dig 1060 00:49:51,000 --> 00:49:53,000 around as an implementer kinda gives you this big picture of 1061 00:49:53,000 --> 00:49:55,000 some of the more common ways that 1062 00:49:55,000 --> 00:49:57,000 cool data structures get built. 1063 00:49:57,000 --> 00:50:00,000 So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman 1064 00:50:00,000 --> 00:50:03,000 cafe in a little bit; if you've got some time, please come, and have 1065 00:50:03,000 --> 00:50:04,000 a good weekend. 1066 00:50:04,000 --> 00:50:07,000 Work hard on Pathfinder.