This presentation is delivered by the Stanford Center for Professional Development.

What are we doing today? We're gonna talk about hashing. Hashing is one of the coolest things you're ever gonna learn in 106B, so it's a good day to be here and learn something really neat, a clever, inventive idea for how you do search and retrieval in a very efficient way.

Let me put us back to where we were a lecture ago, two lectures back to Monday, where we talked about ways we might implement the map abstraction, the key-value pairing that allows for quick search and retrieval. We had looked at the vector and done a little thought experiment about what sorting the vector would buy us: it would let us search faster because we could use binary search. But in the case of add, we'd still be slow, because the elements would need to be shuffled or rearranged within the contiguous memory. That's what motivated our introduction to trees, the binary search tree, which is designed exactly to enable that kind of searching, that logarithmic divide-and-conquer where you divide in half and work your way down. And because of the flexibility of the pointer-based structure, we can also do the insert in the same logarithmic time, since we're no longer constrained by the contiguous memory that array-backed data structures force on us.

So at that point we'd seen logarithmic time for both get-value and add. And on the PQ, which you were just finishing, you'll notice that logarithmic is pretty fast. In fact, for most values of n that you'd run into in ordinary use, logarithmic is un-measurable. Those of you who got your heap up and running, correctly working in logarithmic time, discovered that you can put 10,000, 20,000, 50,000, 100,000 things in it and just never be able to measure what it took to enqueue or dequeue from those structures, because logarithmic is very, very fast.
So that said, that's actually a fine place to stop: we have a great map implementation, we don't have to look further. But in fact I'm gonna push it a little bit today, and we're gonna get to where we can do both of those operations in constant time, which is to say that no matter how large the table gets, we expect to be able to insert and retrieve in roughly the same time for a 10,000-entry table as for a million-entry table, which is pretty neat.

As we start to get to these more complicated data structures, one thing to keep in mind is that there are trade-offs in adding that complexity to the code. Starting with something simple, a slower performer that's easy to write, might actually solve the problem we have at hand; maybe we don't really wanna go to one of these fancier things because we don't need it to be that much faster. The BST adds a bunch of memory, left and right pointers for every cell, which is an 8-byte overhead on top of the entry itself. Now we have to deal with pointers, dynamic memory, all this allocation and deallocation; we've got a lot of recursion going on in there, and with it opportunities for errors to creep in. And if we take the steps to provide tree balancing, which we only grazed the surface of (we just mentioned the need for it), that actually adds quite a lot of complication to the code, the rotations or the rebalancing, beyond a simple binary search tree. So the whole package of building a guaranteed log-n-performing binary search tree is actually a pretty big investment of human capital that you have to balance against the future payoff of it running faster. The strategy I'm gonna show you today is actually kind of simple to write, it handles our operations well, and it doesn't have the degenerate cases looming that the binary search tree did.

So let me give you a conceptual way to think about how you do this kind of lookup. You have a big dictionary, a many-thousand-page dictionary, and you want to look up a word.
You certainly don't do a linear search. If you're about to look up the word xylophone, you're not going to start from the A's and page your way through 2,000 pages to get to the X's. You don't even quite do binary search: you wouldn't start at the M's to get to something in the X's. You tend to know about where X is, and often dictionaries have little tabs along the pages that give you an idea of where the breaks for certain words are. Maybe it says Xa is here and Xe is there, and that lets you quickly eliminate all the surrounding noise in the dictionary and get straight to the region where the X words are. From there you can do a simple search, maybe a little bit of linear or binary search, to zero in on the page with the word you wanna find.

That's the idea behind the hash table, based on the algorithmic technique of hashing. Rather than try to figure out how to keep one very large collection sorted or organized in a way that lets you jump around within it, it says: how about we maintain a bunch of different collections? We break the data set up into little tiny pieces, and if we can tell, given a key, which piece of the data set it's supposed to be in (we'll call those pieces buckets), then I can put things in different buckets. If I had a whole bunch of students, I might put all the freshmen in this bucket and the sophomores in that one, and then when I look for a student I'd say, "Which class are you?" You say, "Freshman." I'd look in the freshmen bucket.

The idea in a simple case is that I have some criteria by which I can divide you into buckets that don't overlap, so everyone has a unique bucket to go in. And if I can make up enough of those categorizations, have enough buckets around to divide you up into, then I can hopefully keep the set that lands in each bucket pretty small, so that I don't have a whole classful of students to look through.
So maybe I do it by the first letter of your last name: everybody whose last name begins with W goes here, everybody who's a B goes here, and so on. Then when somebody comes to me, I don't have to look through everybody, just the bucket for their letter. If you have a class of 200 students and 26 letters of the alphabet, then if students were evenly divided across those letters (it's not going to be, it turns out in practice, but as a thought experiment) there'd be about 4 people in a bucket, and it'd be pretty fast to look through the bucket and find the one I was looking for.

Okay, so let's take this idea. There's going to be some strategy for how we divide into buckets, and we call that strategy the hash function. We'll label the buckets with numbers, zero to the number of buckets minus one; so if I have 99 buckets, I've got zero to 98. I take the key, which in this case is the string key we're using in the map, and I have a hash function which, given a key as input, produces a number as output in the range zero to B minus one. That number is what's called the hash value for that key, and it tells me which bucket to look in, either to find a match if I'm trying to do a getValue, or where to place a new entry if I'm trying to add into the table.

So that's the basic idea. Now I'm gonna talk a little bit about hash functions. Writing a really good hash function is actually pretty complicated, so we're gonna think about it from a big-picture perspective and consider what qualities a hash function has to have. I'm not gonna prove to you what mathematically constitutes a good hash function; I'm just gonna give you some intuitive sense of what kinds of things matter when designing one.

So: it's given a key, and it maps it to a number, typically zero to the number of buckets minus one. Often that means it computes something and then uses the mod operation to bring the result back into range.
So it's gonna take your number and say, okay, if you have five buckets to divide things into, I will give you a number zero to four.

It's very important that it be stable: you put the key in, you get a number out; if you put that same key in later, you should get the same number out. It has to be a reliable, guaranteed mapping; it can't be random. That said, you want it to almost look as though it's random in terms of how it maps things to buckets. You don't want it to have a real systematic plan where it puts a whole bunch of things in bucket one and very few things in bucket two, or somehow always uses the even buckets and not the odd ones. You wanna have a strategy that takes those keys and uses your whole bucket range in an equal manner, so that if you put 100 keys into 100 buckets, you're hoping that, on average, each bucket will have one thing. You won't have these clusters where 25 things are in one bucket, ten here, and a whole bunch of buckets are empty. You want it divided across the range.

Given that we have string input, likely what we're gonna use is the characters in the string. How can I take these characters and jumble them up in a way that moves them around these buckets reliably, so that the same string always comes back to the same place, but that doesn't produce a lot of artifacts, say similar strings all mapping to the same place, and hopefully spreads them around?

For example, some really simple things you could use: you could just take the first letter. I talked about this with last names; if I was trying to divide up my students, I could divide them by the first letter of their last name. That would use exactly 26 buckets, no more, no less, and if I happen to have 26 buckets then at least there's some chance it would get some coverage across the range. There is likely to be clustering, though: some names are a lot more common, and some letters are a lot more commonly used than others.
So it would tend to produce a little bit of an artifact, but it would be reliable and it would use 26 buckets. If I had 100 buckets, then this would be kind of a crummy function, because it would use the first 26 buckets and none of the other ones, if it's just taking that one letter and deciding. So in some situations this would get me a little bit of what I wanted.

Something like using the length of the word would be a very, very bad clustering function, because there'd be a bunch of strings that are six characters, seven characters, ten characters, but very few that are one or two, and very few that are more than ten or so. You'd have this very tight little cluster in the middle and would not use a large bucket range effectively at all.

More likely, what you're gonna try to use is more of the information about the key that you have. Instead of just a single letter or the count of letters, try to use all the data you have at hand, which is to say all the letters. So one way to do it would be to take the ASCII values of the letters (lowercase 'a' is 97, 'b' is 98, and so on): you'd take a word like "atoms", add the codes for each of its letters together, and you'd get some sum, 500, 600, whatever it is, depending on how many letters there are. Then you use mod to bring it back into range. Let's say you had 100 buckets, so you want a number zero to 99: you'd take that sum and mod it by 100, in effect taking its last two digits. So if the sum came to 524, it goes into bucket 24.

So if "act" is the word you have, you'd add its codes together, 'a' is 97, 'c' is 99, 't' is a hundred-something, and we end up with, I'm making this number up, 283; we mod down to the last two digits and this one goes into bucket 83. And if we have "dog", maybe that adds to 267, and we mod that down to 67.
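To make that concrete, here's a minimal sketch of the "sum the character codes" scheme in C++. The function name and the 100-bucket choice are just for illustration, not code from the lecture:

```cpp
#include <string>

// Sum-of-codes hash: adds every character's code, then folds the total back
// into the bucket range with mod. Stable (same key, same answer), but it
// ignores letter order, so anagrams collide.
int sumHash(const std::string &key, int numBuckets) {
    int sum = 0;
    for (size_t i = 0; i < key.length(); i++) {
        sum += key[i];          // add each character's ASCII value
    }
    return sum % numBuckets;    // e.g. 524 with 100 buckets lands in bucket 24
}

// sumHash("act", 100) == sumHash("cat", 100): same letters, same sum.
```

Note that because addition ignores order, any two anagrams produce the same sum, which is exactly the clustering discussed next.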
So this actually tries to use all the information at hand, all the letters that we have, and bring it down into range. And if you figure that, across the space of all the sums being produced, you're just as likely to get 210 as 299 or any of the numbers in between, then hopefully you get a spread across your buckets.

There is one thing this clusters on. Does anybody see what words cluster in this system? Words that you'd think of as not being related, but that would land in the same bucket? [Student: rearrangements of the same letters.] Yes, any anagram. You take "cat", which is just an anagram of "act": it's gonna add to the same sum, so there would be a little bit of clustering in that respect. Anagrams come down to the same place. So words like "act" and "cat", which seem like different words to you, are gonna collide in this system, and that might be something we worry about.

One way to avoid that, which the hash function we're going to eventually use will do, is to not only add the letters together but also multiply each letter by a power of a very large prime number: the prime times the first letter, plus the prime squared times the second letter, plus the prime cubed times the third, and so on. Each letter's position is then also encoded in the number that comes out, so something that's positionally different won't land in the same bucket without some other factors coming into play. It's a little bit of taking the information and jumbling it up some more.

So let me show you a little bit about how this looks. Actually, I'm gonna tell you about collisions first and then I'll show you the data structure. Certainly, as I said, there's gonna be this problem where things collide: in this simple system of adding the codes, two different words can both come out to 83. So it can't be that each of these buckets holds exactly one thing.
With hashing, although we try to keep those buckets as small as possible, we have to account for the possibility that two things will collide, that their hash values, even though they were different keys, will land them in the same bucket based on how we've jumbled them up. We're trying to avoid that, but when it does happen we also have to have a strategy for it: a way of saying there can be more than one thing in a bucket, or some other way of deciding, when there's a collision, where to put the one that was placed in the table second.

The strategy we're gonna use is called chaining. Chaining basically says: once we have something in a bucket, like "act", if we hash something else and it comes to the same bucket, we'll just tack it onto that bucket, in this case using a linked list. So each bucket is actually gonna be a linked list of the entries, and the entire table will be a vector or array of those.

So let me show you a little live action. This is a hash table that has seven buckets, numbered 0 through 6. They all start empty, so these are all empty linked lists, illustrated by the NULL, and the hash function is shown here like a black box. You don't actually know the inner workings of the hash function, but you know what its job is: you give it a key and it gives you a number in the range 0 to 6, and that same key, when passed in again, gives you the same number.

So if I say 106B equals Julie, it pushes "106B" through the hash function and says, okay, 106B comes out as one; it jumbled up the characters and got one. So it went into the table and said, okay, that's bucket number one, make a new cell for 106B and Julie and stick it in the table. So now there's a non-empty chain in one place and the rest are empty. If I say 107 is being taught by Jerry, 107 happens to get the same hash value; 107 and 106B are different strings, but it just happens to come out that way.
They both hash to one, so in this case it just tacked 107 onto the front of that list. If I do 106A being taught by Patrick, 106A happens to come out to zero in the hash function; that's all we know about the black box, and it ends up loading it into the zero slot. If I add another course, this one taught by Nick, it also comes out to be zero, and so it gets pushed onto the front of that chain. I add 103 being taught by Maron and it lands in bucket five. So as I type things in, I'm starting to see that things are going to different places in the table: a little bit of clustering at the top, a little bit of empty at the bottom. But as I continued to type in strings for other course numbers, I'd hopefully fill out the table to where there was a roughly even spread across it.

When I want to look something up, because I wanna see who's teaching 106B, it uses the same process in reverse. It asks the hash function: which bucket should I look in? And it says, oh, you should look in bucket one, because 106B's hash code is one. So all the other buckets are eliminated from consideration, a very quick focusing down to the right place, and then it just does a linked list traversal to find 106B in the bucket where it has to be.

The key to realizing what makes this so magical is that the choice of the number of buckets is up to me. I can pick the number of buckets to be whatever I want: five, 10, 1,000, a million. If I choose the number of buckets to be roughly the same order of magnitude as the number of entries, so if I wanna put a thousand things in my table, I make a thousand buckets to hold them in, and if my hash function can divide stuff pretty equally across those thousand buckets, then I expect each of those linked lists to be about length one.
Some will be two, some will be zero, but if it's doing its job right, then I have a bunch of little tiny buckets, each of which holds a tiny fraction of the data set, in this case one or two entries, and it's very easy to search through them. If I know I'm gonna put a million things in, then I make a million buckets, and to find one out of a million things, I run it through the hash function, it gives me a number that says, oh, it's 862,444; I go to that bucket, and the item is either right there or I have to look a little bit within the bucket to find it, but there's no other place in the table I have to look. The fact that there are a million other entries nearby in these other buckets is completely unimportant to us. The searching and updating, the adding, replacing, and removing, all happen at the bucket level, and by choosing the total number of buckets to be roughly equal to the number of things, I have constant-time access to the tiny subset of the set that matters. That's pretty cool.

Let's write some code; it's good to see the whole thing doing what it needs to do. So I'm gonna go ahead and go back over here. All right, I've got my map, which has add and getValue and not much else, just set up and ready for us to go. What I'm gonna build here is a linked list cell that holds a string key, a ValType value, and a pointer to the next cell. I plan on having one of these cells for each entry that goes in the table, and then I'm going to have an array of these. I'm gonna make a constant for the size, and then I'll talk a little bit about the consequences of that. I make it 99, so I have 99 buckets to start with. That's not gonna solve all my problems; it means the table is kinda tuned for maps that expect to hold about a hundred things.
So I have an array. Deciphering this declaration is a little bit complicated: it says that buckets is the name of the variable, and it's an array of NumBuckets length, a fixed-size, 99-entry array, where each entry is a pointer to a cell, ready to be the head pointer of a linked list for us.

I'm gonna go over here and work on the constructor first. The first thing I'm gonna do is make sure those pointers have good values. If I do not do this, they will be junk and there will be badness, so I make sure each of them starts off as an empty list. Correspondingly, I would need to do a delete of all the list cells in the destructor, but I've been lazy about that and I'm gonna continue to be lazy, just knowing that I'd have to iterate through and delete all the cells in each of those lists.

Then for add and getValue, what we need to do in both cases is figure out which is the appropriate list to operate on: run the hash function on the key, see where it tells us to look, and then either search that list to find the match, for getValue to return the matching value, or in the add case, search for the matching entry to overwrite if it exists, and if not, create a new cell and tack it on the front of the list.

So let's write helper functions, because both operations are gonna need the same things. I'm gonna put a hash function in here that, given a key and a number of buckets, returns the right bucket to look in, and then I'm gonna write a findCell function that, given a key and the pointer to a list, will find the cell that matches that key, or return NULL if it didn't find it.

Okay, let me put my hash function in. I have a good hash function down below that I'm gonna pull out in a minute, but first:
What I'm gonna do is make an interesting choice: I'm gonna have my hash function just return zero, which seems pretty surprising. But let me talk about why. It turns out, when I'm first building this, that's actually an easy way to test that my code is working without being dependent on the behavior of the hash function. At this point it says: we'll just put everything in the zero bucket. That's not a good strategy for my long-term efficiency, but it is a good way to test that my handling of the bucket, and the searching, and the finding, and the inserting is correct. The performance I should expect to be abysmal: it should be totally linear in the number of elements to add or get them, because they'll all be in one linked list with no easy access to them. Eventually I can change this to divide across the buckets and get a better spread, but the correctness of the hash table isn't affected by this choice. It's kind of an interesting way to think about designing the code.

The other helper I have down here is gonna be my findCell, which, given the head pointer to a list and the key, goes and finds it. It's basically just: if the current cell's key equals the key, we return it, and if we've gone through the whole list and haven't found it, we return NULL.

So this guy returns a cellT*, the matching cell within the list, and this is gonna provoke that little C++ compilation snafu we encountered on the binary search tree as well: cellT is a type that's declared in the private section, within the scope of MyMap. Outside of that scope, which is to say before I've crossed the MyMap<ValType>:: scope specifier, the compiler doesn't yet believe we're in MyMap's scope; it hasn't seen that and can't anticipate it, so I have to give the full name with the scope specifier on it. And because this use of a template's nested name outside of its class is a special place for the C++ compiler, the typename keyword also needs to go there.
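Putting those pieces together, here's a sketch of how the class might be declared, with the constructor and findCell as described. Names like MyMap, cellT, ValType, and NumBuckets follow the lecture's narration; the exact declarations are an approximation, not the official assignment code:

```cpp
#include <cstddef>
#include <string>

template <typename ValType>
class MyMap {
public:
    MyMap();
    void add(const std::string &key, const ValType &value);
    ValType getValue(const std::string &key) const;

private:
    static const int NumBuckets = 99;   // tuned for maps holding ~100 entries
    struct cellT {                      // one cell per entry in the table
        std::string key;
        ValType val;
        cellT *next;
    };
    cellT *buckets[NumBuckets];         // each slot heads one linked-list chain

    int hash(const std::string &key, int numBuckets) const;
    cellT *findCell(const std::string &key, cellT *list) const;
};

template <typename ValType>
MyMap<ValType>::MyMap() {
    for (int i = 0; i < NumBuckets; i++) {
        buckets[i] = NULL;              // every chain starts out empty
    }
}

// Outside the class, the nested return type needs the full scoped name plus
// the typename keyword: exactly the compilation wrinkle described above.
template <typename ValType>
typename MyMap<ValType>::cellT *
MyMap<ValType>::findCell(const std::string &key, cellT *list) const {
    for (cellT *cur = list; cur != NULL; cur = cur->next) {
        if (cur->key == key) return cur;   // found the matching cell
    }
    return NULL;                           // no match in this chain
}
```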
Okay, so that's the heavy lifting for these two things: getting it to hash to the right place, which I kinda short-changed, and searching to find a cell. If I go back up here to getValue, I call the hash function, given my key and the number of buckets, to find out which bucket to work on. I go looking for a match by calling findCell with the key on buckets at that index, and then if match does not equal NULL, I can return match->val; otherwise we raise the error "No such key found." That's the error case for getValue: if it didn't find one, its behavior is to raise an error so that someone knows they've asked for something it can't deal with.

The key thing here is that the hash is usually some sort of constant-time operation that jumbles the key up (in this case it just returned zero, but it could actually look at the elements of the key), and then it does its traversal down that linked list, which, given a correctly working hash function, should have divided the entries into these little tiny chains. Then we can just find the one we want quickly by looking through that linked list. If we find it, we've got it; otherwise we're in the error case.
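As a sketch, getValue as just described might look like this, continuing the class above (a standard exception stands in for the lecture's Error function):

```cpp
#include <stdexcept>

// Hash to the one relevant chain, walk it, and raise an error on a missing key.
template <typename ValType>
ValType MyMap<ValType>::getValue(const std::string &key) const {
    int which = hash(key, NumBuckets);             // which bucket to look in
    cellT *match = findCell(key, buckets[which]);  // search just that chain
    if (match != NULL) return match->val;          // found it: hand back the value
    throw std::runtime_error("No such key found");
}
```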
Add looks almost exactly the same. I start by getting the bucket and doing the findCell; if it did find a match, then we overwrite the value. In the second case, where we didn't find it, we need to make ourselves a new cell, copy in the key, copy in the value, and attach it. The easiest place to add something to a linked list is the front; no need to make it hard on ourselves. We're not keeping the chains in any particular order whatsoever, they're just based on order of entry, so if somebody gives us a new one, we might as well stick it in the front. So the new cell's next field will point to the current front of that bucket's list, and then we update the bucket pointer to point to this new cell.

So I think we're in okay shape here. Let me take a look and make sure I like what I did: hashing, finding the match; if it found an existing cell, that's the overwrite case, and we just overwrite with the new value; otherwise, making a new cell and attaching it to be the front of the bucket. In the case where the bucket's totally empty, it's good to trace that, since it's the most common case when we start: if we hash to bucket zero and do a search of a NULL linked list, we should get NULL out, so we create a new cell and set its next field to be what the current bucket's head pointer was, which was NULL. Then we update the bucket to point to this guy, so we have a single list cell with NULL trailing it: the insert-into-the-empty-bucket case. I like it.

Let's see, I have a little code over here that adds some things: car, car, cat, with some numbers, and then tries to retrieve the value of car, for which it wrote a three and then wrote a four on top of it, so hopefully the answer we get out of this is four. Okay. It would also be interesting to ask for something we know is not in the table, to see that our error case gets handled correctly, and there it is: "No such key found."
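And a sketch of add to match, with the prepend exactly as traced above (again continuing the class sketch, not the official code):

```cpp
// Overwrite the value if the key is already chained in this bucket; otherwise
// prepend a fresh cell, since chains keep no particular order and prepending
// is O(1).
template <typename ValType>
void MyMap<ValType>::add(const std::string &key, const ValType &value) {
    int which = hash(key, NumBuckets);
    cellT *match = findCell(key, buckets[which]);
    if (match != NULL) {
        match->val = value;              // existing key: overwrite in place
    } else {
        cellT *newCell = new cellT;      // new key: build a cell for it
        newCell->key = key;
        newCell->val = value;
        newCell->next = buckets[which];  // old front (NULL if bucket was empty)
        buckets[which] = newCell;        // new cell becomes the front
    }
}
```

With this in place, the car/car/cat test behaves as described: the second add finds the existing "car" cell and overwrites the 3 with a 4, so getValue("car") returns 4.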
If I were to continue, I could do a little bit of stress testing to see how that always-zero hash causes performance problems. If I sat there and just put tons and tons of things in, made-up strings stuffed in with numbers, I would be creating one long chain off the zero bucket, ignoring all the other ones. That kinda stress-tests my linked list handling, the adding and searching of that list, but it would also show that as we get to 100 entries, 200 entries, 300 entries, subsequent adds and getValues show a linear increase in time because of the cluster we've got going there.

So let me give you a real hash function. I'll talk you through what it's doing, and then I'll leave you with the thought that writing one of these correctly and reliably is a more advanced exercise than we're gonna take on here. This is a hash function taken from Don Knuth's The Art of Computer Programming, some of the work that guides a lot of computer scientists to this day. It has the strategy I basically told you about: for all the characters in the string, it adds each one in, but it first multiplies the hash code built up so far by a large multiplier, which in effect takes advantage of the positional information. The multiplier happens to be a very large negative number that's prime. The first time through, it takes the hash code of zero, multiplies it by the multiplier, and adds the first character, so the first character gets added in without any modification. For the next character, it takes the previous result, multiplies it by one power of the multiplier, and adds in the next character; then that sum, on the subsequent iteration, gets multiplied by the multiplier again, raising earlier contributions to the squared, cubed, and fourth powers as it works down the string. One side effect is that it's gonna do a lot of overflow.
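Here's a sketch of that function, continuing the class above. The shape matches the lecture's description; the specific multiplier is the constant commonly seen in this course's handouts, but treat the exact value as an assumption. The final mod step is discussed next:

```cpp
#include <cstdlib>

// The L suffix marks this as a long constant (more on that below).
static const long MULTIPLIER = -1664117991L;

template <typename ValType>
int MyMap<ValType>::hash(const std::string &s, int numBuckets) const {
    long hashcode = 0;
    for (size_t i = 0; i < s.length(); i++) {
        // Multiply the running code before adding each character, so every
        // letter's position gets woven into the result.
        hashcode = hashcode * MULTIPLIER + s[i];
    }
    // The products silently wrap around as they overflow (strictly speaking,
    // unsigned arithmetic is the portable way to get that wrap), and the final
    // abs + mod folds the result into the range 0..numBuckets-1.
    return std::labs(hashcode) % numBuckets;
}
```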
The multiplier is a very large number, and we're adding and multiplying in this very large space, so in effect we're counting on the fact that the addition and multiplication built into C++ don't raise any errors when the numbers get so large that they can't be represented. It just truncates back down to what fits, so it wraps around using a modular strategy, like a ring. We'll end up making a very big number out of this: the longer the string, the more multiplies, the higher those powers, and we keep truncating back to what fits in a long. When we're done, we take whatever number fell out of this process and mod it by the number of buckets to give us a number from zero to buckets minus one. So whatever key gets stuffed into here, it's reliable in the sense that every time you put in the same key you'll get the same thing back. We know it will be in the range zero to numBuckets minus one because of that last mod call, and we use that to say which bucket to go look in when we're ready to do our work.

If I switch that in for my old hash function, I shouldn't see any change from the outside, other than the performance, which should suddenly be a lot faster, though that's hard to tell when I have three things in there. I'm gonna take out that error case. In this case, car and cat are now probably not going to the same bucket; in fact, I can guarantee they don't go in the same bucket, whereas before they were overlapping.

I'll leave that code there for a second, although really I'd say the point of hashing is not to get so worried about how the hash code does its work, but to understand the general idea of what a hashing function has to do, what its role is, and how it relates to getting this constant-time performance.

So I'll give you an example of hashing in the real world. I think it's really interesting to see it actually happen, with people not even realizing how useful hashing is.
I ordered something from REI, which is this camping store, and they didn't have it in stock, so they had to place an order for it. This was actually a while ago. When I called to see if it was in, I say, "Oh, is my order in?" And they don't ask me my name. I thought this was very funny. I'm like, "I'm Julie. I wanna know about my order." They're like, "Okay, what are the last two digits of your phone number?" And it's not one of those things you can answer right off the top of your head. Okay, 21; it turns out the last two digits of my phone number are 21. They go and look in the 21 basket, and then they say, "What's your name?" Now they want to know my name. "Okay, my name's Julie." They look it up and they're like, "Okay, it's here."

So then I go to get it a couple days later, and I'm standing in line at the store waiting, and I look up on the wall, and they have 100 little wire baskets up there, labeled 0 through 9 over here and 0 through 9 over there. When I come in, they ask me the same question: "What's the last two digits of your phone number?" Now I'm all prepared, because I've already been primed on this question. I say, "21." They walk over to the 21 basket, one digit picking the row and the other the column, and there's only one piece of paper in there, and they pull out my order slip, and then they get me my tent and everybody's happy.

So while I'm there, I try to engage the cashier on how this is an example of a real-world hashing system. I still got my tent, I'll have you know, but it was close. They're like, "Okay, yeah, you crazy crackpot. You can leave now." I tried to talk to them about the number of digits, because in some sense this is a very good example of what the investment in the number of buckets is versus what tradeoff it gives you.
Because there's this very physical 673 00:32:12,000 --> 00:32:13,000 sort of setup of this, it's like, 674 00:32:13,000 --> 00:32:17,000 by choosing a larger number of buckets, right, you can more fine grain your 675 00:32:17,000 --> 00:32:18,000 access, 676 00:32:18,000 --> 00:32:22,000 but that investment in buckets means you're kind of permanently installing that storage. So 677 00:32:22,000 --> 00:32:25,000 in this case, right, they didn't use just the last digit of my phone number. They could've done that and 678 00:32:25,000 --> 00:32:28,000 they would have only had ten buckets on the wall, but then they would expect, 679 00:32:28,000 --> 00:32:30,000 you know, ten times as many things in each bucket. 680 00:32:30,000 --> 00:32:33,000 And apparently their 681 00:32:33,000 --> 00:32:37,000 estimate was that the number of active orders in this backorder category was enough 682 00:32:37,000 --> 00:32:38,000 to 683 00:32:38,000 --> 00:32:41,000 warrant being more than ten, you know, significantly more than ten, 684 00:32:41,000 --> 00:32:45,000 but not so much more than 100, so in fact 100 buckets was the right 685 00:32:45,000 --> 00:32:46,000 investment to make in their bucket strategy. 686 00:32:46,000 --> 00:32:49,000 They didn't go to three digits because, you know, like, what are they 687 00:32:49,000 --> 00:32:52,000 gonna do, have this big 3D sort of thing? They didn't enjoy this 688 00:32:52,000 --> 00:32:56,000 discussion, I went on for hours with them and they were, like, really not amused. But if it 689 00:32:56,000 --> 00:32:59,000 had, you know, three digits, then you'd get this even higher likelihood that each 690 00:32:59,000 --> 00:33:03,000 bucket would be empty or have a very small number of things, but given 691 00:33:03,000 --> 00:33:07,000 their setup, the right strategy seemed to be 100 buckets. And now, they didn't ask me 692 00:33:07,000 --> 00:33:09,000 the 693 00:33:09,000 --> 00:33:11,000 first two digits of my phone number. 694 00:33:11,000 --> 00:33:14,000 They asked me the last two. Why does that 695 00:33:14,000 --> 00:33:18,000 matter? Because you're using all [inaudible]. Yeah. It varies a 696 00:33:18,000 --> 00:33:21,000 lot. If you ask, like, Stanford students, right, like, especially 697 00:33:21,000 --> 00:33:25,000 when, you know, campus numbers used to all be 49's or 72's or something, so like, if 698 00:33:25,000 --> 00:33:27,000 you used the first two digits, right, you'd have everybody in the 49 bucket or 699 00:33:27,000 --> 00:33:29,000 the 72 bucket or 700 00:33:29,000 --> 00:33:32,000 something, and a whole bunch of other ones not used. As an example, the first two digits are never 701 00:33:32,000 --> 00:33:35,000 00, like, there's a whole bunch of buckets that don't even get used as the first two digits of a 702 00:33:35,000 --> 00:33:38,000 phone number. Phone numbers never start with zeros. 703 00:33:38,000 --> 00:33:42,000 They rarely actually have zeros or ones in them at all in the leading digits. They try to use those for the 704 00:33:42,000 --> 00:33:44,000 area code and stuff. So I 705 00:33:44,000 --> 00:33:45,000 thought it was just a really 706 00:33:45,000 --> 00:33:48,000 interesting exercise, like, oh yeah, they had exactly kinda thought through hashing.
Of 707 00:33:48,000 --> 00:33:52,000 course, they did not think it was hashing, and they thought I was a crackpot, and they didn't want to give me 708 00:33:52,000 --> 00:33:56,000 my stuff, but I paid my money and they [inaudible], 709 00:33:56,000 --> 00:34:00,000 but it kinda shows you, okay, what you're trying to do here is make that number of 710 00:34:00,000 --> 00:34:00,000 buckets, 711 00:34:00,000 --> 00:34:03,000 right, roughly equal to the number of elements 712 00:34:03,000 --> 00:34:04,000 so that, 713 00:34:04,000 --> 00:34:09,000 if your hash function's dividing up across your whole world, 714 00:34:09,000 --> 00:34:11,000 you've got 715 00:34:11,000 --> 00:34:14,000 constant access to what 716 00:34:14,000 --> 00:34:17,000 you have. Why is there an L at the end of one item? You know, that's actually just C++ syntax for this. This is a long value 717 00:34:17,000 --> 00:34:21,000 that without it, it assumes it's an int, and that number is too big to fit in an int. A long is 718 00:34:21,000 --> 00:34:25,000 twice as big as an int in terms of space, and so the L needs to be there, otherwise it 719 00:34:25,000 --> 00:34:26,000 will 720 00:34:26,000 --> 00:34:30,000 try to read it as an int and throw away part of the information I was trying to give it. So it's the 721 00:34:30,000 --> 00:34:31,000 way 722 00:34:31,000 --> 00:34:35,000 of identifying it as a long constant. 723 00:34:35,000 --> 00:34:37,000 So let me talk about this just a little bit more 724 00:34:37,000 --> 00:34:39,000 in terms of 725 00:34:39,000 --> 00:34:41,000 726 00:34:41,000 --> 00:34:43,000 actual performance, right. We've got N 727 00:34:43,000 --> 00:34:46,000 over B, right, so if the division is even, right, the number of entries over 728 00:34:46,000 --> 00:34:48,000 the number of buckets, 729 00:34:48,000 --> 00:34:48,000 730 00:34:48,000 --> 00:34:51,000 if we make them kind of on the same order of 731 00:34:51,000 --> 00:34:52,000 magnitude, kind of in the same range, 732 00:34:52,000 --> 00:34:55,000 so doing this means we have to make a little bit of an estimated guess about 733 00:34:55,000 --> 00:34:58,000 where we're going, 734 00:34:58,000 --> 00:35:01,000 and then there are some choices here in how we store each bucket, right. We could use 735 00:35:01,000 --> 00:35:05,000 a little array, we could use a linked list, right, we could use a vector. 736 00:35:05,000 --> 00:35:07,000 We expect this to be very small, is the truth. 737 00:35:07,000 --> 00:35:09,000 We're hoping that it's gonna be one or two, 738 00:35:09,000 --> 00:35:13,000 and so there's not a lot of reason to buy any real expensive structure or even bother 739 00:35:13,000 --> 00:35:17,000 doing things like sorting. Like, you could sort them, but if you expect there to be two 740 00:35:17,000 --> 00:35:19,000 things, what's the point? You're gonna spend more time 741 00:35:19,000 --> 00:35:21,000 figuring out whether to put it in the front or the back 742 00:35:21,000 --> 00:35:25,000 than you would gain at all by being able to 743 00:35:25,000 --> 00:35:27,000 find it a little faster. 744 00:35:27,000 --> 00:35:28,000 So the linked list 745 00:35:28,000 --> 00:35:32,000 is just gonna be the easiest way to get things allocated without over capacity, 746 00:35:32,000 --> 00:35:35,000 excess capacity that's unused in the bucket. You know it's gonna be small, use the 747 00:35:35,000 --> 00:35:37,000 simple strategy.
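[Continuing that sketch, one plausible layout for the table being described: an array of bucket head pointers, each the start of a short singly linked chain. The names are assumptions for illustration, and findCell reuses the hashString sketch from above.]

    // Each entry lives in one cell of its bucket's chain.
    struct Cell {
        string key;
        string value;
        Cell *next;      // the per-entry chain pointer
    };

    Cell **buckets;      // array of numBuckets head pointers, all starting NULL
    int numBuckets;

    // Hash to the right bucket, then walk its (expected one- or two-cell) chain.
    Cell *findCell(const string &key)
    {
        int which = hashString(key, numBuckets);
        for (Cell *cur = buckets[which]; cur != NULL; cur = cur->next) {
            if (cur->key == key) return cur;
        }
        return NULL;     // not present in its bucket
    }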
748 00:35:37,000 --> 00:35:41,000 It's also, though, an important thing that the hash table is likely to do in kind of an 749 00:35:41,000 --> 00:35:44,000 industrial strength situation: it does have to deal with this idea, well, what if that number of 750 00:35:44,000 --> 00:35:47,000 buckets that you predicted was wrong? 751 00:35:47,000 --> 00:35:51,000 And so the map as we have given it to you, right, has to just sorta take its best guess, because 752 00:35:51,000 --> 00:35:54,000 it doesn't know in advance how many things you're gonna put in. 753 00:35:54,000 --> 00:35:57,000 The only way to know that is to wait and see as things get added, 754 00:35:57,000 --> 00:35:59,000 and then realize your buckets are getting full. 755 00:35:59,000 --> 00:36:02,000 So typically the industrial strength version of the hash table is gonna 756 00:36:02,000 --> 00:36:06,000 track what's called the load factor. And the load factor is just the number of total 757 00:36:06,000 --> 00:36:08,000 entries divided by the number of buckets. 758 00:36:08,000 --> 00:36:09,000 When that 759 00:36:09,000 --> 00:36:13,000 factor gets high, and high is actually quite small, in fact, 760 00:36:13,000 --> 00:36:15,000 two is often the trigger for 761 00:36:15,000 --> 00:36:19,000 causing a rehash situation, so 762 00:36:19,000 --> 00:36:22,000 not even letting it get to be three or four. Just saying, as soon as you realize that 763 00:36:22,000 --> 00:36:22,000 you have 764 00:36:22,000 --> 00:36:26,000 exceeded by double the capacity you intended, you planned for 100 things, 765 00:36:26,000 --> 00:36:27,000 you got 200 things, 766 00:36:27,000 --> 00:36:28,000 go ahead 767 00:36:28,000 --> 00:36:31,000 and redo your whole bucket array. 768 00:36:31,000 --> 00:36:39,000 So in the case, for example, of our simple 0 through 7 case, or maybe 769 00:36:39,000 --> 00:36:43,000 it's 0 through 6, one, two, three, 770 00:36:43,000 --> 00:36:45,000 four, say I have one that had six buckets, right, 771 00:36:45,000 --> 00:36:48,000 and then I've gotten to where 772 00:36:48,000 --> 00:36:50,000 each of them has 773 00:36:50,000 --> 00:36:54,000 two, or maybe three in some and one in another. 774 00:36:54,000 --> 00:36:57,000 The plan is to just take this whole bucket array and 775 00:36:57,000 --> 00:37:01,000 enlarge it by a factor of two, so grow it, in this case, from 6 to 12, 776 00:37:01,000 --> 00:37:02,000 and then rehash 777 00:37:02,000 --> 00:37:04,000 to move everybody. The 778 00:37:04,000 --> 00:37:08,000 idea is that the earlier decision about where to place them was based on knowing that 779 00:37:08,000 --> 00:37:11,000 there were exactly six buckets, and so where it was placed before 780 00:37:11,000 --> 00:37:15,000 is not likely to be the place it will be placed if you have, you know, 781 00:37:15,000 --> 00:37:18,000 12 choices to lay them down into. 782 00:37:18,000 --> 00:37:21,000 And ideally, in fact, you don't even want it to be the case
that, like, they all land into the 783 00:37:21,000 --> 00:37:23,000 same bucket, right; you would have 784 00:37:23,000 --> 00:37:25,000 the same clustering and then a big empty clump, 785 00:37:25,000 --> 00:37:28,000 and so you would rehash everything, run it through the new hash function having 786 00:37:28,000 --> 00:37:31,000 changed the number of buckets, and say, ''Well, now where does it go in the hashing scheme?'' 787 00:37:31,000 --> 00:37:34,000 And hopefully you'd end up getting a clean table, right, where you had 12 788 00:37:34,000 --> 00:37:37,000 items now with one in each bucket, 789 00:37:37,000 --> 00:37:40,000 ready to kinda give constant performance through 790 00:37:40,000 --> 00:37:42,000 that, and then potentially, 791 00:37:42,000 --> 00:37:43,000 again if you exceed 792 00:37:43,000 --> 00:37:47,000 your load factor, go back in and rearrange stuff again. 793 00:37:47,000 --> 00:37:50,000 So this strategy is a little bit like the one we talked about with the binary search tree, of just 794 00:37:50,000 --> 00:37:51,000 wait and see. Rather 795 00:37:51,000 --> 00:37:55,000 than reacting right away, you know, just because you get two items in a bucket, it doesn't immediately cause any 796 00:37:55,000 --> 00:37:57,000 real alarm or crisis; 797 00:37:57,000 --> 00:38:00,000 it waits until there's kinda sufficiently clear demand 798 00:38:00,000 --> 00:38:03,000 that exceeds the capacity planned for, 799 00:38:03,000 --> 00:38:05,000 to do a big rearrangement. 800 00:38:05,000 --> 00:38:08,000 So you'd end up saying that adds, adds, adds would be fast, fast, fast, fast, fast, 801 00:38:08,000 --> 00:38:12,000 starting to get just a hair slower, right, having to do a search through a few things as opposed 802 00:38:12,000 --> 00:38:13,000 to one, 803 00:38:13,000 --> 00:38:16,000 and then every now and then, right, you'd see an add which caused a big reshuffle of 804 00:38:16,000 --> 00:38:19,000 the table, which would be a totally linear operation in the number of 805 00:38:19,000 --> 00:38:20,000 elements, 806 00:38:20,000 --> 00:38:22,000 and then they'd be fast again 807 00:38:22,000 --> 00:38:24,000 for another big clump of inserts until that 808 00:38:24,000 --> 00:38:27,000 same expansion might be repeated, triggered by 809 00:38:27,000 --> 00:38:28,000 it growing even more. 810 00:38:28,000 --> 00:38:31,000 But dividing it sorta by the total number of inserts, so if you had 1,000 inserts 811 00:38:31,000 --> 00:38:35,000 before you overloaded, and then you had to copy all 1,000 things to get ready for the 812 00:38:35,000 --> 00:38:36,000 next thousand, 813 00:38:36,000 --> 00:38:39,000 if you average that out, it still ends up 814 00:38:39,000 --> 00:38:43,000 being called a constant operation. So 815 00:38:43,000 --> 00:38:46,000 that's pretty neat. 816 00:38:46,000 --> 00:38:47,000 It's 817 00:38:47,000 --> 00:38:50,000 really one of the, sort of, more, I think, 818 00:38:50,000 --> 00:38:53,000 easily understood and beautiful algorithmic techniques, right, for solving 819 00:38:53,000 --> 00:38:55,000 this hard problem of 820 00:38:55,000 --> 00:38:57,000 search and retrieval 821 00:38:57,000 --> 00:38:58,000 that 822 00:38:58,000 --> 00:39:01,000 the vector and the sorting and the BST are all trying to get at.
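[A sketch of that load-factor bookkeeping, continuing the illustrative names from the earlier sketches. The trigger of two matches the threshold described above; the structure of add and rehash is an assumption, not the actual map implementation.]

    int numEntries = 0;

    // Double the bucket array and re-place every cell under the new count.
    void rehash(int newNumBuckets)
    {
        Cell **newBuckets = new Cell *[newNumBuckets]();   // all heads start NULL
        for (int i = 0; i < numBuckets; i++) {
            for (Cell *cur = buckets[i]; cur != NULL; ) {
                Cell *next = cur->next;
                int which = hashString(cur->key, newNumBuckets);  // new count, new placement
                cur->next = newBuckets[which];
                newBuckets[which] = cur;
                cur = next;
            }
        }
        delete[] buckets;
        buckets = newBuckets;
        numBuckets = newNumBuckets;
    }

    void add(const string &key, const string &value)
    {
        Cell *match = findCell(key);
        if (match != NULL) { match->value = value; return; }  // repeated key: overwrite
        int which = hashString(key, numBuckets);
        Cell *cell = new Cell;
        cell->key = key;
        cell->value = value;
        cell->next = buckets[which];
        buckets[which] = cell;
        if (++numEntries > 2 * numBuckets) rehash(2 * numBuckets);  // load factor passed 2
    }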
But sometimes the 823 00:39:01,000 --> 00:39:04,000 sorted vector and the BST are solving actually a much harder problem, right, 824 00:39:04,000 --> 00:39:09,000 which is keeping a total ordering of all the data, and being able to 825 00:39:09,000 --> 00:39:12,000 traverse it in sorted order is one of the advantages that the sorted vector 826 00:39:12,000 --> 00:39:12,000 and 827 00:39:12,000 --> 00:39:14,000 the BST have that the hash table doesn't have at all. 828 00:39:14,000 --> 00:39:17,000 In fact, it's jumbled it all up, and so if you think about how the iterator for 829 00:39:17,000 --> 00:39:21,000 the map works (our map is actually implemented using a hash table), 830 00:39:21,000 --> 00:39:24,000 it actually just iterates over the linked lists, and that explains why it almost appears 831 00:39:24,000 --> 00:39:27,000 completely random. If you put a bunch of things in the table and you go to iterate 832 00:39:27,000 --> 00:39:28,000 and 833 00:39:28,000 --> 00:39:29,000 visit them again, 834 00:39:29,000 --> 00:39:33,000 the order you visit the keys seems to make no sense, and that's because it's based on 835 00:39:33,000 --> 00:39:35,000 their hash codes, right, which 836 00:39:35,000 --> 00:39:36,000 aren't designed 837 00:39:36,000 --> 00:39:37,000 to be 838 00:39:37,000 --> 00:39:39,000 any real sensible ordering 839 00:39:39,000 --> 00:39:42,000 to anybody not in the know of how the hash function works, 840 00:39:42,000 --> 00:39:46,000 whereas iterating over a vector that's sorted or iterating over a BST 841 00:39:46,000 --> 00:39:47,000 or, 842 00:39:47,000 --> 00:39:51,000 in this case, our set, for example, which is backed by a binary search tree, 843 00:39:51,000 --> 00:39:54,000 will give you that sorted order. So it solves this harder problem, right, which causes there to 844 00:39:54,000 --> 00:39:57,000 be kinda more investment in that problem, whereas hashing solves just exactly the 845 00:39:57,000 --> 00:39:58,000 one problem, which is, 846 00:39:58,000 --> 00:40:02,000 I want to be able to find exactly this value again 847 00:40:02,000 --> 00:40:03,000 and update this value, 848 00:40:03,000 --> 00:40:08,000 and nothing more is needed than just identifying this, 849 00:40:08,000 --> 00:40:10,000 not finding the things that are less than this. For example, if you needed to find all 850 00:40:10,000 --> 00:40:13,000 the values that were less than this key alphabetically, 851 00:40:13,000 --> 00:40:15,000 right, the hash table makes that 852 00:40:15,000 --> 00:40:16,000 no easier 853 00:40:16,000 --> 00:40:18,000 than an unsorted vector, 854 00:40:18,000 --> 00:40:21,000 whereas things like a sorted vector and a binary search tree actually enable that kinda search; 855 00:40:21,000 --> 00:40:24,000 you could find the place in the tree and kind of work your way down the left side 856 00:40:24,000 --> 00:40:27,000 to find the things that were less than that. 857 00:40:27,000 --> 00:40:31,000 That's not actually being solved by the hash at all. Its use of 858 00:40:31,000 --> 00:40:34,000 memory is actually comparable to a 859 00:40:34,000 --> 00:40:35,000 binary search tree, 860 00:40:35,000 --> 00:40:38,000 in that it has a four byte pointer per entry, which is the next field on the 861 00:40:38,000 --> 00:40:40,000 linked list chain, and 862 00:40:40,000 --> 00:40:44,000 then there's a four byte pointer for each bucket, pointing to the head 863 00:40:44,000 --> 00:40:47,000 cell of the linked list.
Given that we expect the buckets to be roughly equal to the entries, 864 00:40:47,000 --> 00:40:51,000 we can kinda just summarize that as well: each entry 865 00:40:51,000 --> 00:40:53,000 represents an eight byte overhead, 866 00:40:53,000 --> 00:40:56,000 which is the same eight byte overhead, right, that the left and right pointers of 867 00:40:56,000 --> 00:40:58,000 the binary search 868 00:40:58,000 --> 00:40:59,000 tree add. 869 00:40:59,000 --> 00:41:01,000 Does it have any degenerate cases? That's a good 870 00:41:01,000 --> 00:41:06,000 question. So one of the things that made the binary search tree a little bit of a 871 00:41:06,000 --> 00:41:08,000 heavy investment was that to get that 872 00:41:08,000 --> 00:41:11,000 balancing behavior, even when we're getting data coming in in a way that's 873 00:41:11,000 --> 00:41:16,000 causing a lopsided construction, we have to go out of our way to do something special. 874 00:41:16,000 --> 00:41:19,000 What is the degenerate case for hash? Is there something that causes it to behave really, 875 00:41:19,000 --> 00:41:24,000 really badly? Anyone wanna help me out with 876 00:41:24,000 --> 00:41:28,000 that? Is it always good? Does it always give a good spread? Can we 877 00:41:28,000 --> 00:41:32,000 end up with everything in the same bucket somehow? [Inaudible] repeated elements that way? 878 00:41:32,000 --> 00:41:35,000 So if you have repeated elements, the way that 879 00:41:35,000 --> 00:41:37,000 both versions of the map work is they overwrite. 880 00:41:37,000 --> 00:41:41,000 So actually the nice thing is, yeah, if you did it again you'd just take that same cell and 881 00:41:41,000 --> 00:41:43,000 overwrite it, so in fact we wouldn't get a clustering based on the same entry 882 00:41:43,000 --> 00:41:45,000 going in and out. 883 00:41:45,000 --> 00:41:51,000 But how would we end up with everything being in the same bucket? 884 00:41:51,000 --> 00:41:54,000 It mostly comes out, if you look at your hash function, of asking, when would a hash function collide? 885 00:41:54,000 --> 00:41:58,000 Right, and if you have a dumb hash function, right, you can definitely have some degenerate 886 00:41:58,000 --> 00:42:01,000 cases, my dumb hash function of return zero, right, being the 887 00:42:01,000 --> 00:42:03,000 worst example of that, 888 00:42:03,000 --> 00:42:07,000 but any hash function, right, that wasn't being particularly clever about using all of its 889 00:42:07,000 --> 00:42:09,000 information might actually have some clustering in 890 00:42:09,000 --> 00:42:12,000 it. It is possible, for example, even with a particularly good hash function, that 891 00:42:12,000 --> 00:42:15,000 there are strings that will hash to the same thing, so if you somehow got 892 00:42:15,000 --> 00:42:18,000 really, really unlucky, 893 00:42:18,000 --> 00:42:19,000 894 00:42:19,000 --> 00:42:23,000 and had a hash function that wasn't doing the best job possible, you could get some 895 00:42:23,000 --> 00:42:25,000 clustering. But in general, 896 00:42:25,000 --> 00:42:29,000 the sole responsibility, right, for the degenerate case comes down to your hash function.
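[The return-zero function mentioned above, spelled out as a sketch: it is a legal hash, deterministic and always in range, but it sends every key to bucket 0, so the table degenerates into one long chain and lookup becomes linear.]

    int dumbHash(const string &key, int numBuckets)
    {
        return 0;   // every key collides; bucket 0 holds everything
    }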
897 00:42:29,000 --> 00:42:31,000 So as long as your hash function 898 00:42:31,000 --> 00:42:34,000 and your inputs match your expectation, 899 00:42:34,000 --> 00:42:35,000 you don't have any surprises about 900 00:42:35,000 --> 00:42:38,000 how they got inserted, the way you did in a binary search tree, which is sometimes hard 901 00:42:38,000 --> 00:42:40,000 to control. 902 00:42:40,000 --> 00:42:43,000 Just as you could underestimate the number of buckets you need -- Yeah. -- could 903 00:42:43,000 --> 00:42:46,000 you overestimate the number of buckets you 904 00:42:46,000 --> 00:42:51,000 need by a large amount -- Um hm. -- and that's wasting a lot of -- Exactly. -- memory, and 905 00:42:51,000 --> 00:42:52,000 do you then go through and take 906 00:42:52,000 --> 00:42:54,000 that memory back? Yeah. 907 00:42:54,000 --> 00:42:57,000 Yeah, so that's a great question. So a lot of data structures 908 00:42:57,000 --> 00:42:58,000 don't shrink, 909 00:42:58,000 --> 00:43:02,000 right, and for example, the vector often will only grow on demand, and then even if you deleted a 910 00:43:02,000 --> 00:43:04,000 bunch of, or removed a bunch of, elements, it doesn't 911 00:43:04,000 --> 00:43:06,000 necessarily consider shrinking a big deal, and so 912 00:43:06,000 --> 00:43:10,000 that's often, maybe, a little bit lazy, but it turns out, right, that having a bunch of memory around 913 00:43:10,000 --> 00:43:12,000 you're not using 914 00:43:12,000 --> 00:43:15,000 doesn't have nearly the same penalty that not 915 00:43:15,000 --> 00:43:17,000 having the memory when you needed it, and having to do a lot of extra work to find 916 00:43:17,000 --> 00:43:18,000 things, does. 917 00:43:18,000 --> 00:43:21,000 So typically, right, you would try to pick a size that you are 918 00:43:21,000 --> 00:43:26,000 willing to commit to, and say, ''Well, no matter what, I'll stay at this size. I won't get any smaller.'' 919 00:43:26,000 --> 00:43:28,000 920 00:43:28,000 --> 00:43:30,000 But then, as you grow, you might not shrink back to that size. You might just 921 00:43:30,000 --> 00:43:34,000 let that capacity kinda sit there unused. There are good 922 00:43:34,000 --> 00:43:37,000 reasons to actually be tidy about it, but again, some of this is hard to predict. 923 00:43:37,000 --> 00:43:40,000 It might be that your table kinda grows big and then a bunch of things get taken out, and then it grows 924 00:43:40,000 --> 00:43:41,000 big again. Like, maybe you 925 00:43:41,000 --> 00:43:43,000 have a new freshman class come in, 926 00:43:43,000 --> 00:43:45,000 and then you graduate a whole bunch of people. Then 927 00:43:45,000 --> 00:43:47,000 at the point where you graduate them all, you're about ready to take in a new freshman 928 00:43:47,000 --> 00:43:51,000 class, and so it might be that, in fact, right, the capacity you just got rid of, you're gonna plan on 929 00:43:51,000 --> 00:43:53,000 reusing in a minute anyway, and so maybe 930 00:43:53,000 --> 00:43:56,000 clearing it out and totally releasing it 931 00:43:56,000 --> 00:43:57,000 may actually be an exercise 932 00:43:57,000 --> 00:44:01,000 that was unimportant. You're planning on getting back to that size eventually, anyway. 933 00:44:01,000 --> 00:44:04,000 Most hash tables tend to grow, is actually the truth. Right, you tend to 934 00:44:04,000 --> 00:44:07,000 be collecting data in there and 935 00:44:07,000 --> 00:44:11,000 they tend to only enlarge.
It's kind of unusual that you take out a bunch of 936 00:44:11,000 --> 00:44:15,000 entries you already put in. 937 00:44:15,000 --> 00:44:18,000 And so I just put a little last note, which is, like, okay, well, I had said earlier that when we 938 00:44:18,000 --> 00:44:19,000 have the map, 939 00:44:19,000 --> 00:44:22,000 the key had to be string type, and I was gonna at 940 00:44:22,000 --> 00:44:25,000 some point come back to this and talk about 941 00:44:25,000 --> 00:44:28,000 why that was the one restriction in all our template classes, right; you can put anything you 942 00:44:28,000 --> 00:44:31,000 want in a set or vector or stack or queue. 943 00:44:31,000 --> 00:44:34,000 The map lets you store any kind of value associated with that key, but that key was 944 00:44:34,000 --> 00:44:36,000 string type, and 945 00:44:36,000 --> 00:44:40,000 so given how our map is implemented as a hash table, that actually starts to make sense: if 946 00:44:40,000 --> 00:44:42,000 you're gonna take the key and kind of jumble it up and map it to 947 00:44:42,000 --> 00:44:44,000 a bucket, 948 00:44:44,000 --> 00:44:47,000 you need to know something about how to extract information from the key, and so in 949 00:44:47,000 --> 00:44:51,000 this case, knowing it's a string means you know it has characters, you know its length, and you can do 950 00:44:51,000 --> 00:44:52,000 those things. 951 00:44:52,000 --> 00:44:53,000 If you wanted to make a map 952 00:44:53,000 --> 00:44:57,000 that could take any kind of key, maybe integers as keys or student structures as 953 00:44:57,000 --> 00:44:59,000 keys or, you 954 00:44:59,000 --> 00:45:00,000 know, 955 00:45:00,000 --> 00:45:01,000 doubles as keys, 956 00:45:01,000 --> 00:45:05,000 any of these things, it's like, well, how can you write a generic form 957 00:45:05,000 --> 00:45:06,000 that could use 958 00:45:06,000 --> 00:45:08,000 a hashing operation on that 959 00:45:08,000 --> 00:45:11,000 kind of key and map it to bucket numbers? 960 00:45:11,000 --> 00:45:14,000 You would build a two type template to kinda get this working in C++, so 961 00:45:14,000 --> 00:45:18,000 you can actually have a template that has a type name for the key type, a type name for 962 00:45:18,000 --> 00:45:19,000 the value type, 963 00:45:19,000 --> 00:45:20,000 and then 964 00:45:20,000 --> 00:45:25,000 in the member functions you'd refer to both key type and value type as these two 965 00:45:25,000 --> 00:45:26,000 distinct placeholders. 966 00:45:26,000 --> 00:45:29,000 And then when the client set it up, right, they'd be saying, ''Okay, I want a map 967 00:45:29,000 --> 00:45:32,000 that has strings that map to integers.'' So maybe this is 968 00:45:32,000 --> 00:45:36,000 words and the pages they appear on in a document. Or I have a map of integers to vectors of 969 00:45:36,000 --> 00:45:40,000 strings. Maybe this is score on an exam and the names of the students who got that 970 00:45:40,000 --> 00:45:42,000 score, doing it that way, 971 00:45:42,000 --> 00:45:44,000 and to make this work, right, 972 00:45:44,000 --> 00:45:48,000 there has to be some kind of universal hashing that, given some generic type, 973 00:45:48,000 --> 00:45:53,000 can turn it into an integer in the range of buckets. 974 00:45:53,000 --> 00:45:53,000 975 00:45:53,000 --> 00:45:56,000 What it's gonna require here is that the client get involved.
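[A sketch of what such a two-type template with a client-supplied hash callback might look like; this interface is a guess for illustration, not the actual library's, and Student with its idNumber field is a hypothetical client type.]

    template <typename KeyType, typename ValueType>
    class Map {
    public:
        // The client supplies the function that maps any key to a
        // bucket number in the range 0 to numBuckets - 1.
        Map(int (*hashFn)(KeyType key, int numBuckets)) : hash(hashFn) {}
        void add(KeyType key, ValueType value);
        ValueType getValue(KeyType key);
    private:
        int (*hash)(KeyType, int);   // used internally to pick a bucket
    };

    // For example, a client keying on the hypothetical Student structure:
    struct Student { int idNumber; };
    int hashStudent(Student s, int numBuckets)
    {
        return s.idNumber % numBuckets;   // assumes a nonnegative idNumber
    }
    Map<Student, double> examScores(hashStudent);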
976 00:45:56,000 --> 00:45:58,000 You cannot write this 977 00:45:58,000 --> 00:46:00,000 generic 978 00:46:00,000 --> 00:46:01,000 hash function 979 00:46:01,000 --> 00:46:04,000 that will work for all types, right? If it's a student structure, what are you gonna look at? The 980 00:46:04,000 --> 00:46:08,000 ID field, the name field? Like, there's just no sensible way that you can 981 00:46:08,000 --> 00:46:10,000 talk about a generic type and talk about how to hash 982 00:46:10,000 --> 00:46:14,000 it, so what you would have to do is have some sort of client callback, so probably you would 983 00:46:14,000 --> 00:46:18,000 create this by passing in a callback that, given a key type thing and a number of 984 00:46:18,000 --> 00:46:20,000 buckets, would 985 00:46:20,000 --> 00:46:24,000 do the necessary machinations, right, to hash 986 00:46:24,000 --> 00:46:26,000 it into the right number. 987 00:46:26,000 --> 00:46:30,000 And so that would be kinda one of those coordination things between the client and the implementer, 988 00:46:30,000 --> 00:46:34,000 where the client knows what the data is it's trying to store and kinda how to 989 00:46:34,000 --> 00:46:35,000 990 00:46:35,000 --> 00:46:41,000 mash it up, and then the map would know, okay, given that number, where to stick it. So 991 00:46:41,000 --> 00:46:42,000 that's what it would take, and then, 992 00:46:42,000 --> 00:46:45,000 rather than sorta load that onto our 993 00:46:45,000 --> 00:46:48,000 novice heads, we went ahead and just said, ''Okay, it'll always be a string, so that we can 994 00:46:48,000 --> 00:46:53,000 do a hash internally without the client having to get involved.'' All right, 995 00:46:53,000 --> 00:46:55,000 any questions about hashing? I'm gonna 996 00:46:55,000 --> 00:46:59,000 give you, like, a two second talk about set, 997 00:46:59,000 --> 00:47:01,000 because it's the one 998 00:47:01,000 --> 00:47:04,000 ADT where I never said, ''Well, how does it work? What do we do?'' 999 00:47:04,000 --> 00:47:07,000 It doesn't add any new things that we haven't already done, so in fact I'm just 1000 00:47:07,000 --> 00:47:10,000 gonna tell you what it does, and 1001 00:47:10,000 --> 00:47:13,000 then your picture will be complete, right. 1002 00:47:13,000 --> 00:47:16,000 It wants to do fast searching, fast updating, add and remove, and it also has all 1003 00:47:16,000 --> 00:47:21,000 these high level operations, such as intersect, union and difference, 1004 00:47:21,000 --> 00:47:24,000 that are kind of its primary set of things to do. 1005 00:47:24,000 --> 00:47:25,000 1006 00:47:25,000 --> 00:47:27,000 A bunch of the things we've already seen, right, 1007 00:47:27,000 --> 00:47:30,000 would actually be a reasonable strategy for implementing set, right, using some kind 1008 00:47:30,000 --> 00:47:31,000 of vector or array. 1009 00:47:31,000 --> 00:47:34,000 Most likely you'd want to keep it in sorted order, because that would buy you 1010 00:47:34,000 --> 00:47:37,000 the fast lookup for contains, 1011 00:47:37,000 --> 00:47:41,000 but the add and remove, right, would necessarily be slow in those cases because of the 1012 00:47:41,000 --> 00:47:43,000 contiguous memory.
1013 00:47:43,000 --> 00:47:44,000 1014 00:47:44,000 --> 00:47:47,000 The linked list? Not really too good of an option here, 1015 00:47:47,000 --> 00:47:50,000 because contains then becomes 1016 00:47:50,000 --> 00:47:51,000 linear, and 1017 00:47:51,000 --> 00:47:54,000 add and remove similarly, right, linear, so it doesn't actually get you anywhere; in 1018 00:47:54,000 --> 00:47:57,000 fact it just adds the memory in pointers and gunk like that. 1019 00:47:57,000 --> 00:47:59,000 The binary search tree? Probably a 1020 00:47:59,000 --> 00:48:00,000 pretty good call, 1021 00:48:00,000 --> 00:48:03,000 right; that 1022 00:48:03,000 --> 00:48:06,000 kinda meshes the advantage of a dynamic structure with the sorted property, to 1023 00:48:06,000 --> 00:48:11,000 be able to give it fast updating; adding and removing and searching all can be done 1024 00:48:11,000 --> 00:48:14,000 in logarithmic time if you have balancing. 1025 00:48:14,000 --> 00:48:17,000 And then it actually also enables the high level ops, so the idea of being able to do union, 1026 00:48:17,000 --> 00:48:19,000 intersection and difference, and actually being able to 1027 00:48:19,000 --> 00:48:23,000 walk over both of the trees in sorted order, which is very easily done 1028 00:48:23,000 --> 00:48:24,000 with an [inaudible] traversal, 1029 00:48:24,000 --> 00:48:28,000 makes those operations actually more efficient to implement than kinda having to deal with 1030 00:48:28,000 --> 00:48:32,000 data coming at you all different ways. It 1031 00:48:32,000 --> 00:48:33,000 turns out hashing really 1032 00:48:33,000 --> 00:48:35,000 won't quite do it 1033 00:48:35,000 --> 00:48:36,000 1034 00:48:36,000 --> 00:48:37,000 very easily for us, 1035 00:48:37,000 --> 00:48:40,000 because of the same need about the template situation: like, well, you have to have a hash function that 1036 00:48:40,000 --> 00:48:43,000 can hash these things and do the right thing, and so the tree happens to be a 1037 00:48:43,000 --> 00:48:46,000 good fit; if the client can just give you ordering information, you can 1038 00:48:46,000 --> 00:48:48,000 do all the hard work without 1039 00:48:48,000 --> 00:48:50,000 pushing hashing onto them. So in fact 1040 00:48:50,000 --> 00:48:53,000 the way our set really does work is it does use a binary search tree. It happens to 1041 00:48:53,000 --> 00:48:56,000 use it through another layered abstraction, 1042 00:48:56,000 --> 00:48:59,000 in that there actually is a BST class that we've never really 1043 00:48:59,000 --> 00:49:02,000 encountered directly, but the BST class just takes the idea of a binary search 1044 00:49:02,000 --> 00:49:06,000 tree, adds balancing and 1045 00:49:06,000 --> 00:49:08,000 all the properties about finding and inserting and adding, 1046 00:49:08,000 --> 00:49:12,000 and packages it up, and then set actually just turns out to be one-liners 1047 00:49:12,000 --> 00:49:14,000 making calls into the binary search tree. 1048 00:49:14,000 --> 00:49:17,000 This is where all the heavy lifting 1049 00:49:17,000 --> 00:49:18,000 goes, down there. 1050 00:49:18,000 --> 00:49:23,000 That kinda completes the big picture. So, summing stuff up, right: our map uses our hash table, right; our 1051 00:49:23,000 --> 00:49:26,000 set uses the BST,
which is the binary search tree; 1052 00:49:26,000 --> 00:49:28,000 our vector, right, uses an array; 1053 00:49:28,000 --> 00:49:33,000 and then the stacks and queues have viable strategies both for array 1054 00:49:33,000 --> 00:49:37,000 or vector based backing as well as linked lists, both getting good time performance. And then 1055 00:49:37,000 --> 00:49:38,000 you saw with the PQ, right, 1056 00:49:38,000 --> 00:49:40,000 even 1057 00:49:40,000 --> 00:49:44,000 yet another structure, the heap, right, which is kinda a variant of the tree. And so those tend to be the 1058 00:49:44,000 --> 00:49:46,000 kind of really classic structures on which a lot of things get built, 1059 00:49:46,000 --> 00:49:51,000 and so having a chance to use them both as a client and kinda dig 1060 00:49:51,000 --> 00:49:53,000 around as an implementer kinda gives you this big picture of 1061 00:49:53,000 --> 00:49:55,000 some of the more common ways that 1062 00:49:55,000 --> 00:49:57,000 cool data structures get built. 1063 00:49:57,000 --> 00:50:00,000 So that's all for Friday. We're gonna do some more good stuff next week. I'm gonna go to the Terman 1064 00:50:00,000 --> 00:50:03,000 cafe in a little bit; if you've got some time, please come, and have 1065 00:50:03,000 --> 00:50:04,000 a good weekend. 1066 00:50:04,000 --> 00:50:07,000 Work hard on Pathfinder.