This presentation is delivered by the Stanford Center for Professional Development.

The piece I need is not here, so I'm going to actually just talk to you about what it says on my slides, and then when Jason arrives we'll plug it back in and see what's going on. Oh, there's Jason. There we go.

So what I'm going to do today is actually something kind of fun. I'm going to take a case study of a particular data structure design, and the one I'm looking at is the lexicon. So how many of you actually ever decided to just open up the lexicon file, just to see what was in there, at some point during the quarter? We used it a couple of times. Anybody want to tell us what they learned by opening up that file? Anything interesting?

It's a data structure called a DAWG. You're like, well, that's a pretty funny name for something. And was the code particularly well-[inaudible], easy to read, easy to understand? No. Who wrote that code? She should totally be fired. Yeah, so this is what I wrote, kind of a long time ago actually, to solve this particular problem. And I would like to walk through how it is I came to discover the DAWG, and why the DAWG was the choice that made sense for the usage pattern we had for the lexicon.

So let me just refresh what the lexicon does. It is a special case of a set or map: it is a container where you're going to stick things in, and they're going to be unique, right? There's no need for any sort of duplicate. The keys are always strings, and in fact they're English words. That's going to be an important thing to remember: they're not just arbitrary sequences of letters, right? They have certain patterns to them and certain sizes that might actually be something we could capitalize on. When we know something about the domain, we're able to special-case or tune our data structure to solve that problem particularly well.

It has no associated value, so it doesn't have dictionary definitions or synonyms or anything like that, right? It really is just "exists or doesn't exist" that we were interested in supporting.
And in addition to being able to add words and check if a word is contained, there's also this prefixing operation that we had talked a little about in Boggle, and that came back on the midterm. Why was prefixing important, right? It was critical for pruning the dead-end paths, the searches on the Boggle board that otherwise would really have taken a lot of time and led to nothing of any value.

So what I'm going to talk you through is some of the obvious choices based on data structures we already know about, and see where they would go. This is a thought experiment. It's not that I wrote all of these to find out; actually, I mostly did a lot of this on paper, trying to figure out what options we had, how they would play out, and what kind of trade-offs I'd have to make to make them work.

So the simplest case of any kind of collection is to think about whether a vector or array can do the job for you. In this case, having a vector that's kept in sorted - alphabetically sorted - order seems like a pretty good, easy way to get this set up and running. And how would it behave? In order to do a contains(word) on this data structure, what algorithm are you going to use to do your searching? Binary search, right? It's in sorted order; we've got to take advantage of that. If we do a linear search and I'm looking for the word "mediocre," I'm going to be trucking through, you know, thousands of words before I find it. So we definitely want to be using binary search: equals-equals comparing my strings, less than, greater than, dividing in half. And so it should run in logarithmic time. Logarithmic time is one of those great performers that you learned about in the PQ assignment, hopefully. It's basically not measurable on today's computers, so even for 100,000 or 500,000 entries, it's basically free.

How do we do contains(prefix)? Can we also do a fast prefix search given this data structure? I hope so, right? It's the same kind of binary search. In this case, though, we're looking at substrings: if I'm looking for a three-character prefix, I compare against only the first three characters to decide which way to go.
Once I find something that matches in the first three characters, I'm done. So again, we're still using the sorted property to quickly narrow in on where it could be.

When it's time to add a new word to this - if someone asked you to put in the word "abalone" - I've got to use that binary search to find out where it goes. So in logarithmic time I can tell you where the new word goes, but then, to put it into position, there's going to be some shuffling. And that's where this operation starts to bog down, given that you might be moving half or more of your items, on average, to make that space open up. It's going to require a linear-time operation to get new words into the data structure.

Okay, but probably still a reasonable approach, right? Contains(word) and contains(prefix) are the operations that get done immense numbers of times in Boggle - you're doing a ton of them as you're exhaustively searching that big board - but the add's are done once, at the beginning. You build that big dictionary and then you subsequently use it. And so if it took a little bit more time to build that thing - sort of an n-squared operation to load it up - well, maybe that would be okay.

The other thing we want to look at is space usage, and some of this is a little bit of an interesting artifact of the time I was working on this. It turns out space was actually a limiting factor in what we had available to us, in terms of the core memories of our machines and what we could take up for our dictionary. In this case, the cost per entry is the size of the string. We don't exactly know how the string is represented - it's an abstraction to us - but we can make a simplifying assumption that it's probably about the length of the string times the size of a character. There might be a little bit more housekeeping associated with it, but that's a pretty good estimate of what it's taking. So in fact, it's the string data itself that's being represented in this vector, without much other overhead.
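To make that concrete, here's a minimal sketch of the sorted-vector approach. The class and member names are my own, not the actual lexicon code, but the shape of the operations is the same: logarithmic lookups against the sorted order, and a linear-time shuffle on add.

```cpp
// A minimal sketch of the sorted-vector lexicon (hypothetical names, not
// the shipped implementation). Words are kept in alphabetical order so
// both lookups can lean on binary search.
#include <algorithm>
#include <string>
#include <vector>

class SortedVectorLexicon {
public:
    bool containsWord(const std::string& word) const {
        // O(log n): straight binary search over the sorted words.
        return std::binary_search(words.begin(), words.end(), word);
    }

    bool containsPrefix(const std::string& prefix) const {
        // O(log n): the first word >= prefix starts with the prefix
        // exactly when some word in the lexicon does.
        auto it = std::lower_bound(words.begin(), words.end(), prefix);
        return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
    }

    void add(const std::string& word) {
        // O(log n) to find the slot, but O(n) to shuffle entries over.
        auto it = std::lower_bound(words.begin(), words.end(), word);
        if (it == words.end() || *it != word) words.insert(it, word);
    }

private:
    std::vector<std::string> words;  // maintained in sorted order
};
```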
So if we say words average about eight characters, that's about eight bytes per word. And if we have 100,000 words, we expect to have about 800,000 bytes invested in this sorted-vector arrangement. Excess capacity and a couple of other things might tweak that a little bit, but that gives us at least a good starting point for thinking about how this compares to the alternatives.

Okay. So then we said that sortedness - binary search - is the real advantage of keeping it in sorted order, but the contiguous nature of the array bites us in terms of insert. If we reorganize into a binary search tree, to give ourselves the dynamic flexibility of wiring in the left and right halves through pointers as opposed to contiguous memory, then we know we can also get add to run in logarithmic time.

So looking at these operations: contains(word) is a tree search that's using equals-equals and less-than, deciding whether to go left or right, and it also performs logarithmically if the tree is balanced. Contains(prefix), same deal: keep looking at the [inaudible], working our way down, knowing which way to go. And add can also perform in logarithmic time, because it does the same search, either falling off the bottom and inserting the new word there, or finding it along the way and not having any work to do. So we can get all our operations done in logarithmic time, which we know to be fast enough that it's not worth fighting any harder.

Where did we pay for this? Nothing really comes for free, right? Overhead. And in this case, memory is certainly one of the places where there's going to be a pretty big cost: we now have the string data for each entry, plus a left pointer and a right pointer. And if we're doing the work to keep it balanced - which we probably ought to do, if we want to avoid the degenerate cases of it getting totally out of whack and degenerating into linear time - we're going to have a couple more bytes thrown in for a balance factor, too.
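Just to picture where those bytes go, here's a hypothetical node layout for the balanced-BST version; the field names and sizes are my assumptions, using the four-byte pointers of the era.

```cpp
// Hypothetical per-entry layout for the balanced BST version; sizes
// assume the 4-byte pointers of a 1990s machine.
#include <string>

struct TreeNode {
    std::string word;     // the string data itself (~8 characters on average)
    TreeNode* left;       // 4 bytes
    TreeNode* right;      // 4 bytes
    short balanceFactor;  // 2 bytes of bookkeeping to keep the tree balanced
};
```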
And so if we assume there are ten bytes of overhead - four bytes in each of those pointers, plus another two for the balance factor - then we've added ten bytes of overhead on top of the eight bytes for the word. Getting a little bit hefty. Something to worry about.

There's also code complexity, which I didn't put a bullet up for, but it's worth weighing in your mind too: building the fancier, more complicated data structure probably meant you spent more time developing and debugging it, right? And it will be harder for someone later to maintain and deal with, because it's just more complicated code. Say someone wants to add a new operation - let's say they add remove, so you're able to take a word out of the lexicon; the lexicon doesn't currently support that. Somebody adding that to the sorted-vector implementation probably won't find themselves too tripped up. Someone who has to do a remove out of a binary search tree - removing the node, reattaching the subtrees, and not destroying the balancing - that's a pretty complicated operation. So we've invested in a data structure that will keep creating further development hassles for any modifications down the road. It is a big investment in your brainpower to keep it working.

So let's think about what a hash table can do for us. The hash table is the last thing that we looked at on Friday, moving away from this whole idea of sortedness and handing the problem off to a hashing function. So if I have a hash table with a certain number of buckets, using a linked list for chaining to handle collisions, then what I have is my words just scattered around the table, allowing me quick access by hashing to the right bucket and then working my way down that bucket's chain to see. So contains(word) does a standard hash table lookup: take the word, you know, "mediate," and run it through the hash function.
It says, oh, it's bucket three in this case. I go to bucket three, I walk down the linked list, and I find it or not. I can tell you that the word "mediate," if it's in the table, had to be in bucket three. No questions asked. In that case, the expected time is going to be n over b, where n is the number of entries and b is the number of buckets. If we set the number of buckets to be on the rough order of the number of entries, then we're going to get constant-time access there.

Now we want to do contains(prefix). I want to know if there are any words in this table that begin with "med." Where do I look? If I run "med" through the hash function, do I expect that I would get the same number as other words that begin with "med," like "mediate" and "median" and "mediator"? Yes? You say yes. All right, let's take that to an extreme. If a prefix and any longer words built on it all hashed to the same thing, then think about the simplest case of a prefix, "m." Should all the words that begin with "m" hash to the same place? If they did, we're in trouble, right? If all prefixes hashed to the same place, what I'd end up with is a really heavily clustered table where "a," "b," "c," down to "z" - there are 26 buckets that have all the "a" words, all the "b" words, and so on - and none of the other buckets would be used.

So in fact, as part of the good behavior of a hash function, we're hoping that it jumbles things up - that a word that looks like another word but has one more letter added on the end goes somewhere totally different, that "median" and "medians" go to two totally disparate places. If there were patterns where words that were substrings of each other all got hashed to the same location, we would end up with heavy clustering, because there are a lot of words that repeat parts of other words in the dictionary.

So in fact, if I want to know whether there are any words that begin with "med," I have no a priori information about where to look for them. They could be anywhere.
I could hash "med" and see if "med" itself was a word, right? But suppose it's something that is only ever the beginning of something - "st," "str" - there are a lot of words that begin with that, but "str" itself is not a word. Hashing to that bucket and looking would only tell me if there's exactly "str" in the table, and occasionally there might be some "str" words there too, but other than that, we've just got to look. And where do we look? Everywhere. Everywhere, everywhere, everywhere. To conclude there is no word that begins with the prefix we have, the only way to be confident is to either find something, or to have searched the entire table and not found anything. And so it does require a completely linear, exhaustive look through the entire contents of the hash table to know that something is not a prefix. You might find one quickly - in the best case, you might happen upon one in the first couple of buckets - but no guarantees overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. That's kind of a bummer.

Adding a word: back to the fast behavior again. Hash to the bucket, check to see if it's already there, add if needed [inaudible]. So the hash table gets good times on those two operations, but then really bottoms out when it comes to supporting the prefix search.

Its space usage is roughly comparable to that of the binary search tree. We've got the string data itself, plus the four-byte next pointer on the linked list. There's also the four-byte bucket pointer, which is over in this other structure, the array of buckets. If the number of buckets is roughly equal to the number of cells, you can count each entry as having its 12-byte cell plus a four-byte bucket pointer associated with it on a per-entry basis, which tells you it's about eight bytes of overhead on top of the string. So, 16 bytes for each entry in the table. Okay, that's what we have so far.
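Before moving on, here's a sketch that pulls those behaviors together - my own simplified version, with vectors standing in for the linked-list chains. The point to notice is that containsPrefix has nothing better to do than scan every chain in every bucket.

```cpp
// A simplified sketch of the chained hash table lexicon (not the original
// code; std::vector stands in for the linked-list chains).
#include <functional>
#include <string>
#include <vector>

class HashLexicon {
public:
    explicit HashLexicon(size_t numBuckets) : buckets(numBuckets) {}

    bool containsWord(const std::string& word) const {
        // Expected O(n/b): hash straight to the one bucket that could hold it.
        for (const std::string& w : buckets[bucketFor(word)])
            if (w == word) return true;
        return false;
    }

    bool containsPrefix(const std::string& prefix) const {
        // O(n): hashing tells us nothing about prefixes, so we must look
        // everywhere before we can conclude "no."
        for (const auto& chain : buckets)
            for (const std::string& w : chain)
                if (w.compare(0, prefix.size(), prefix) == 0) return true;
        return false;
    }

    void add(const std::string& word) {
        // Hash to the bucket, check if it's already there, append if not.
        if (!containsWord(word)) buckets[bucketFor(word)].push_back(word);
    }

private:
    size_t bucketFor(const std::string& s) const {
        return std::hash<std::string>{}(s) % buckets.size();
    }
    std::vector<std::vector<std::string>> buckets;  // one chain per bucket
};
```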
We've got pretty good performance on contains(word) no matter what we choose: we can get either logarithmic or constant time with any of these three data structures, so that's perfectly acceptable performance. We cannot get contains(prefix) to run well on a hash table; it just doesn't happen, the way hashing works. And add can be fast on the two complicated pointer-based data structures, but can't be fast on the sorted vector. Then we see some trade-offs in terms of space versus time: there's very little space overhead in the sorted-vector case, but that blossoms into two, almost three times the storage needed in the BST and hash table cases. The code also got a lot more complicated when we moved up to those data structures.

So then I'll tell you that it turns out the one kind of performance that was not just interesting but in fact more important at the time was the memory usage. Memory usage was really the big obstacle we had. The Official Scrabble Players Dictionary II is the source for the word list we're using. It has about 128,000 words, I think, is exactly the number, and the average length of those is eight characters. The file itself is over a megabyte - so on disk, if you were just to look at the listing of all the words, it's a little over a megabyte. If we loaded it into the sorted vector, we'd get something roughly equal to that: about a megabyte of space, plus a little bit of noise. The ones that double up take it to two, two and a half megabytes total.

At the time we were working - and I know this is going to seem completely archaic to you - we had these stone tablets that we worked on, and they typically had main memories of about a megabyte; maybe, in an exciting case, four megabytes. Four whole megabytes of RAM. So it was not possible to dedicate one or two megabytes of your main memory to just your lexicon. There just wasn't enough space to go around. So we were working in a very tight environment in that regard.
It's a little bit like today's world of embedded devices: if you're writing something for an iPod or a cell phone, you have a lot less memory than your average desktop machine with its gigabyte of RAM, where you're like, whatever - a megabyte here, ten megabytes there, free, all the memory you want. So this may seem like a silly thing to focus on, but this was 1995, and yeah, there was certainly more interest in keeping the memory tight.

The Boggle lexicon that we have actually takes under a third of a megabyte. The memory used while it's working - searching and checking for words and prefixes - is actually smaller than the on-disk form of the data we started with. So it actually does a kind of compression as part of its storage strategy, which is pretty interesting to think about.

So I'll tell you the truth about how we first did Boggle, because it's a little bit of interesting historical perspective on it. Boggle was given for the first time by a colleague of mine, Todd Feldman. He was the guy who came up with the idea, which is a great idea, and he said, "Oh, this is a great assignment." And when he wrote it, he said, "Oh, I need a lexicon." So as part of putting the assignment together, he looked around for word lists. I think we first found the big word list and said, "Oh, it's just impossible - it's going to take a megabyte or more, and we just don't have that kind of space. We need much smaller data files." So we had a data file that had about 20,000 words, which takes about 200K. And he wrote it using a hash table. Given that he wrote it as a hash table, he just didn't support contains(prefix). He just didn't put it in, because you can't do it efficiently. So it took a smaller amount of space, it had a smaller word list, and it wouldn't do contains(prefix). So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without contains(prefix).
It spends a lot more time looking down these dead ends, but it eventually bottoms out when it runs off the board and uses all the cubes. And in one odd way it actually almost made the program a little more satisfying to write. It's the computer's turn that uses the prefix pruning. So you would type in your words and it would be perfectly fine finding them - it doesn't use the prefix part during the human's turn. You're typing in words; it's finding them. Then you would go to do the computer's turn and you'd say, "Okay, go," and now it would take a really long time. It'd be hunting around the board, working so hard your fan would be coming on. You'd feel like, oh, my computer's doing something. And then it would show up with twice as many words as you found, or ten times as many, whatever, and you would say, "Well, at least it had to work hard to beat me." You felt like, I wrote this program that causes my processor to get busy - I feel good.

And in fact, because the word list was pretty small, it wasn't even as likely to beat you, because it didn't know as many words. 125,000 words is an enormous number of words. I've heard different figures on this, but the average person's English vocabulary is about 20,000 to 30,000 words; it tends to be about 10,000 higher if you've gone to college, so maybe 30,000 to 40,000. The fact that our lexicon knows 80,000 more words than the average college graduate means it knows a lot of obscure words, too. So not only does it very quickly find all those words and then produce these ridiculous words that you've never heard of - it's almost like it's taunting you by doing it in a millisecond. Back then it took a while, so it was at least: okay, it beat me, but it had to work hard.

But then I took over Boggle. I taught X maybe a quarter or two later, and I said, "I'm going to write a lexicon.
I'm going to go make a little project for myself, because I don't have enough work to do." And my goal was to make one that would do prefix, but that would also be tight enough on memory that we could actually load a much bigger dictionary file, because I thought it would make the game even more fun to play.

All right, so here's where I started. This is a little excerpt from the middle of the dictionary: "stra." Yeah, that's totally the thing you notice when you look at this. You see "straddle," "straddler," "straddlers," "straddles," "straddling," and then these words where you can't believe you're allowed to do this, but apparently you can put "er" and "ier" and "ies" and "ing" on almost anything out there and it makes a valid word. So although you never really thought about the fact that "straightforward," "straightforwardly," and "straightforwardness" are all in there, there is a huge amount of repetition. And this is where I said: we're going to use some information about the domain. These aren't just arbitrary strings. They're not random sequences of letters picked out of the air; they developed over time. They're word roots, they're suffixes, they're prefixes that mean things. Looking at this little excerpt out of the middle, there are an awful lot of words that repeat portions of other words. And that may be part of what we can capitalize on. Instead of really recording "straightaway" and "straightaways" - repeating those first eleven characters and adding an "s" - maybe there's some way I can unify the data I have here. It's the same thing we do in code: when you have the same passage of code repeated two or three times, we're always saying, unify it, move it into a helper, and call it from the three places. Can we do the same thing with our data? Can we share the parts of the words that are in common, and only distinguish where they're different?

Take a look at this. This is a little portion of something called a trie. A trie is spelled t-r-i-e - it's actually from "retrieval" - but sadly, it's still pronounced "try," just to confuse you.
445 00:20:23,000 --> 00:20:28,000 And it is a variant of a tree. 446 00:20:28,000 --> 00:20:30,000 In this case it's a tree that has 26 children coming out of any particular node, 447 00:20:30,000 --> 00:20:32,000 one for each letter of the alphabet. 448 00:20:32,000 --> 00:20:35,000 And so starting from the root node at the top, 449 00:20:35,000 --> 00:20:39,000 all the words that begin with ''a'' are actually 450 00:20:39,000 --> 00:20:42,000 down the A branch, and then there's a B branch. There'd be a C branch and a D 451 00:20:42,000 --> 00:20:45,000 branch. And so the idea is that all the words, like if you think of the levels of the 452 00:20:45,000 --> 00:20:46,000 tree are actually like - these are all the zero if characters of the word. 453 00:20:46,000 --> 00:20:49,000 And then that second level is 454 00:20:49,000 --> 00:20:52,000 all the characters that follow that zero if character. So here's the ''ab'' 455 00:20:52,000 --> 00:20:56,000 words and the ''ac'' words and the ''ax'' words. 456 00:20:56,000 --> 00:20:57,000 Over here is the ''st'' words, but there's not a ''sx,'' so there's some places that actually aren't 457 00:20:57,000 --> 00:21:00,000 filled out in this tree. 458 00:21:00,000 --> 00:21:02,000 And as you keep going down further, so that the depth 459 00:21:02,000 --> 00:21:04,000 you would trace to a tree 460 00:21:04,000 --> 00:21:08,000 would actually be 461 00:21:08,000 --> 00:21:10,000 tracing through the characters in sequence that make words. 462 00:21:10,000 --> 00:21:11,000 And so in this form, tracing out a 463 00:21:11,000 --> 00:21:12,000 path through this tree 464 00:21:12,000 --> 00:21:14,000 from the root 465 00:21:14,000 --> 00:21:17,000 down to a particular node 466 00:21:17,000 --> 00:21:19,000 tells you about prefixes. So there are words that begin with ''a'' and words that begin with ''ac'' 467 00:21:19,000 --> 00:21:23,000 and ''ace,'' 468 00:21:23,000 --> 00:21:23,000 and then there's actually a little notation there that's done by a 469 00:21:23,000 --> 00:21:24,000 visual 470 00:21:24,000 --> 00:21:26,000 of the 471 00:21:26,000 --> 00:21:29,000 bold and circle around it 472 00:21:29,000 --> 00:21:30,000 that says, ''Oh, and the path that led here from the root is a word.'' So 473 00:21:30,000 --> 00:21:33,000 ''a'' is a word 474 00:21:33,000 --> 00:21:35,000 and ''ace'' is a word and ''aces'' is a word. 475 00:21:35,000 --> 00:21:39,000 ''Act'' is a word and so is ''acted'' and ''actor.'' 476 00:21:39,000 --> 00:21:44,000 And in this case, the ''act'' part is totally shared, 477 00:21:44,000 --> 00:21:46,000 so everything that comes off of ''act,'' ''acting,'' ''acted,'' ''actor,'' ''actors,'' 478 00:21:46,000 --> 00:21:50,000 can all share that initial part. 479 00:21:50,000 --> 00:21:52,000 And for a word like ''act,'' it doesn't seem that profound. But you start thinking about words like 480 00:21:52,000 --> 00:21:55,000 ''strawberry'' or ''straightforward'' or ''stratosphere'' 481 00:21:55,000 --> 00:21:59,000 and realize that ''stratospheric'' and ''strawberries'' and ''straightforwardly'' 482 00:21:59,000 --> 00:22:01,000 can all leverage the other ten or 12 characters that were already 483 00:22:01,000 --> 00:22:04,000 invested in the root word, 484 00:22:04,000 --> 00:22:06,000 and that ''straightjacket'' and ''straightforward'' can even share ''straight.'' 485 00:22:06,000 --> 00:22:10,000 But there's actually a lot of 486 00:22:10,000 --> 00:22:15,000 prefixes that can be unified across the space of the dictionary. 
487 00:22:15,000 --> 00:22:17,000 So knowing that words have these patterns helps us to do this. 488 00:22:17,000 --> 00:22:19,000 So let's build that. 489 00:22:19,000 --> 00:22:22,000 So the first form of this - 490 00:22:22,000 --> 00:22:25,000 and this, again, is a thought experiment. I did not actually write it this way because 491 00:22:25,000 --> 00:22:27,000 it would just be absolutely astronomically expensive. But it was a good way to kinda think 492 00:22:27,000 --> 00:22:30,000 about where I needed to go. 493 00:22:30,000 --> 00:22:34,000 I designed a new kinda trie node 494 00:22:34,000 --> 00:22:35,000 that had the letter and a little Boolean flag of is this the word, so whether it had 495 00:22:35,000 --> 00:22:39,000 the dark circle around it. 496 00:22:39,000 --> 00:22:41,000 And then it had a 26-member children array, so pointers to other nodes like 497 00:22:41,000 --> 00:22:44,000 this one. So totally recursive, 498 00:22:44,000 --> 00:22:47,000 staring from that root and then the 499 00:22:47,000 --> 00:22:50,000 idea is to use those 26 children as ''a'' being in the first slot and ''z'' in the last 500 00:22:50,000 --> 00:22:52,000 slot to make it really easy to kinda just trace letter by letter 501 00:22:52,000 --> 00:22:55,000 down to the levels of the tree. 502 00:22:55,000 --> 00:22:59,000 So I'd have this one root node that pointed to the initial one, and then all the words 503 00:22:59,000 --> 00:23:03,000 that begin with ''a'' are off that first lot; all from ''z'' off the last slot. 504 00:23:03,000 --> 00:23:08,000 So when I want to know if a word is in the dictionary, 505 00:23:08,000 --> 00:23:12,000 what do I have to do in this data structure? How do I find it? I 506 00:23:12,000 --> 00:23:18,000 want to find the word ''far.'' Where do I look? 507 00:23:18,000 --> 00:23:27,000 Left? Right? Down? Which way? It would be great 508 00:23:27,000 --> 00:23:31,000 if you would tell me. Then I would understand what my data structure is. [Inaudible] 509 00:23:31,000 --> 00:23:35,000 Exactly. So at the very top I say okay, ''f'' is my first letter. Match my ''f,'' get me to the 510 00:23:35,000 --> 00:23:35,000 F node, and now it says recursively find ''ar'' from here. So it's very 511 00:23:35,000 --> 00:23:37,000 recursive. It'll be like 512 00:23:37,000 --> 00:23:39,000 find the first letter 513 00:23:39,000 --> 00:23:43,000 and then recursively find the substring from here, working down. So find the ''a'' out of the next node, find 514 00:23:43,000 --> 00:23:46,000 an ''r'' out of the next node, and then when you get to that node, 515 00:23:46,000 --> 00:23:47,000 you'll check this little ''is word Boolean,'' 516 00:23:47,000 --> 00:23:49,000 and say okay, 517 00:23:49,000 --> 00:23:53,000 was that path I traced 518 00:23:53,000 --> 00:23:58,000 marked in the dictionary as leading to something that's known to be a word? 519 00:23:58,000 --> 00:24:01,000 The prefix looks exactly the same, except for that last check for the as word. So 520 00:24:01,000 --> 00:24:05,000 given that nodes exist in this, because further down there there's some words, I just 521 00:24:05,000 --> 00:24:08,000 trace out a sequence of letters ''str,'' and as long as there's something there, 522 00:24:08,000 --> 00:24:11,000 then I know that away from there are subsequent children beneath it 523 00:24:11,000 --> 00:24:15,000 that lead to words. 524 00:24:15,000 --> 00:24:17,000 And adding a word does the same operation. 
If I need to add the word "strawberry," then I start tracing "s," "t," "r," and at some point I find places where the nodes don't exist. From there on down, I start creating them. So it traces what exists, adds new nodes off the bottom, and then marks the final one as: yes, this is a word.

What's interesting is that all three of these operations make no reference whatsoever to how many words are in the dictionary. They all depend only on the length of the word you're searching for or adding - whether there are ten words in there, a thousand, or a million. In no way does the big-O depend on how big or how loaded the data structure is. That's kind of wacky.

The longest word, according to the Oxford English Dictionary? There you go. You're not going to make that on the Boggle board anytime soon, because the board's only got 16 cubes; even Big Boggle can't really make that puppy. But just so you know, 45 is not so many, right?

And in fact, here's a little thought for you. Not only is it not dependent on the number of words, it actually has a little bit of an inverse relationship with the number of words, especially the add operation. As the data structure becomes more loaded, there's typically less work to do in add. If "strawberry" is in there and you go to add "strawberries," you already have "strawberry" to build off of. The more words that are in there, the more likely that some of the nodes you need are already in place, and you can just blow past them and add your new nodes. Now, that's odd: you think of most data structures as clearly taking more work to install new things as they get heavier, not less. [Inaudible]

Well, the thing is, they're organized by letter in that 26-member array, so I don't even have to look - I just go straight to the slot. If I'm looking for "strawberry," I go straight to the "s" slot, then the "t" slot, then the "r" slot. If the node is already in place, I'm not searching for it; it really is exactly in the "r" slot. That's why there are 26 of them.
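Here's a rough sketch of that first-cut node and the recursive walk, under my own assumptions about names ("a" in slot 0, "z" in slot 25, lowercase input):

```cpp
// A sketch of the first-cut trie node (a thought experiment, like the
// design itself). 'a' maps to slot 0 and 'z' to slot 25, and words are
// assumed to arrive in lowercase.
#include <string>

struct TrieNode {
    char letter;             // which letter this node represents
    bool isWord;             // the "dark circle": path from the root is a word
    TrieNode* children[26];  // one slot per possible next letter, null if unused
};

// Walk one node per character: O(length of word), independent of how
// many words the trie holds.
TrieNode* traverse(TrieNode* node, const std::string& s) {
    for (char ch : s) {
        if (node == nullptr) return nullptr;  // fell off the tree
        node = node->children[ch - 'a'];      // straight to the slot, no matching
    }
    return node;
}

bool containsWord(TrieNode* root, const std::string& word) {
    TrieNode* node = traverse(root, word);
    return node != nullptr && node->isWord;   // path exists AND is marked
}

bool containsPrefix(TrieNode* root, const std::string& prefix) {
    return traverse(root, prefix) != nullptr; // same walk, skip the flag check
}
```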
In each array? Each array is 26, "a" through "z," and if there's an empty slot, we're using a null. So in this case, it's optimized to make it super fast. I don't even have to look through the letters to find the one that matches - there's no matching. It just goes straight to the "r" slot, the "z" slot, the "s" slot. We're going to see that this has a consequence we can't tolerate, but it is at least the starting point for thinking about how to build the data structure.

The space usage is where this thing really - well, this is an extreme example of a space-time trade-off, right? We've got this thing that optimizes beautifully to O(length of the word), which means typically eight operations, on average, to add or to search for something. The space we're using is 106 bytes per node: the 26-member array of four-byte pointers, plus two bytes for the character and the Boolean. One hundred six bytes is a lot of memory, and the tree has about a quarter of a million nodes. Given the 125,000 words that I said are in the input, built into a tree they take about 250,000 nodes. That's an interesting number to think about: 125,000 words take 250,000 nodes, which tells you that on average, each word contributes about two unique nodes. Even words like "straightforward" and "straightforwardly" are, on average, sharing enough of their common prefix parts that there's very little unique data added by any particular word. That's pretty neat to know.

However, this is just not going to fly. I was telling you that my main memory had four megabytes, maybe, on a good day, and now I've built a data structure that's going to require six times that in storage. Okay, we've got to squeeze that down.
So the first thing we can do is realize that allocating a full 26-member array, of which I only plan on using some of the slots, is pretty wasteful. Instead of committing to 26 per node, how about an array that's sized tight to the number of children actually coming out of this node? The very first node will have 26 - there are words that start with every letter of the alphabet - but from there it very quickly winnows down. Take the letter "x": you do not need the full 26 children coming off of "x." Only a few letters follow "x" in English and need to be fleshed out. You need the "xa" and the "xe," but you don't need "xj," you don't need "xk," and a whole bunch of things like that.

So I change the structure: instead of a 26-member array of node pointers, it's a pointer to a dynamic array of node pointers, with the number of children tracked in a second field. The first node's array is allocated to 26; another's might be allocated to eight. This changes our running times a little bit, because now we can't go exactly to the "s" slot - we have to go look for it. Each step down the tree will be looking through that node's array to find the right match. But that adds a constant factor, basically, on top of O(length of the word).

The space win is that we're just not storing all those nulls - and there were a lot of nulls. The node itself is now just ten bytes, plus some space in the dynamic array out there, which is the number of children times the four-byte pointer. On average, a node has six children. Across the 250,000 nodes in the data structure being built from this, 26 was way overkill: for most nodes, about 20 of those slots were nulls. And many nodes - for example, at the very bottom of the tree - have no children at all. You get to a word like "straightforwardly": there's nothing after that "y," so it has zero children coming out the back of it.
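The slimmed-down node might look something like this (again, hypothetical field names; the trade is a short linear scan per step in exchange for dropping all those nulls):

```cpp
// A sketch of the slimmed-down node: the fixed 26-slot array becomes a
// heap array sized to the children that actually exist.
struct TrieNode {
    char letter;          // this node's letter
    bool isWord;          // end-of-word marker
    short numChildren;    // how many children are actually present
    TrieNode** children;  // dynamic array, allocated to exactly numChildren
};

// Stepping down now searches the small array instead of indexing directly:
// a constant-factor cost on top of O(length of the word).
TrieNode* findChild(TrieNode* node, char ch) {
    for (int i = 0; i < node->numChildren; i++)
        if (node->children[i]->letter == ch) return node->children[i];
    return nullptr;  // no child with that letter
}
```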
Having 26 null children pointing away from a node like that was waste we can tighten up. When we get it down to this, we'll have 34 bytes per node. Still a quarter million nodes; still eight and a half megabytes. So we're not winning any prizes just yet, but we are making good progress - that got us down a factor of three or four right there. I still have the same number of nodes. All I changed was the pointers to the subsequent nodes further down the tree; the letters and how they connect to each other - the connectivity - is still pretty much the same. It's just that in any particular node, however many children it has, I was storing a whole bunch of nulls and now I'm not. The number of nodes is exactly as it was before.

I'm going to do one other little squishing thing on this data structure, and then I'm going to flip gears for a second. Still using my CS side of things - how can I squash things down? I'm going to use a technique that we used in the heap. In the case of the binary heap that we implemented for the PQ, you'll remember that we drew pictures in the handout that showed a heap with a parent with two children, which has four grandchildren, and so on, but that we actually compressed that down into an array. And in the array there was this pattern: here is the root node - call this level zero - then the next two nodes are level one, and the next four nodes are level two, and so on. There's one here, two here, four here, eight here. We flattened it down by generation, by what was at the top and what was underneath it. When we did that, we removed all the space for pointers. Although we drew those pictures of the heap as though there were pointers there, in fact there were not. There were no pointers stored in the heap in this version of the PQ.
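As a reminder, that worked because a complete binary tree flattened by generation makes the links pure index arithmetic - shown here assuming the 1-based layout with the root at index 1:

```cpp
// The heap's implicit links, assuming the 1-based layout with the root
// stored at index 1: no pointers stored, just divide and multiply by two.
int parentOf(int i)     { return i / 2; }
int leftChildOf(int i)  { return 2 * i; }
int rightChildOf(int i) { return 2 * i + 1; }
```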
And so flattening it down into an array gets us back to array-like storage requirements, but by still using the tree conceptually to work our way through it, we keep the advantage of traversal based on the height of the tree, not the length of a vector. If we do the same thing with the trie, it's a little more complicated, because a couple of things about the heap made it easier. The heap's shape followed strict fill-in rules, so you could always compute parent and child positions using that divide-by-two, multiply-by-two index arithmetic. In this case, we'll still lay the nodes down by generation, but we don't know how big a generation is going to be, so we have to keep track of it manually. The root node leads to all of A through Z. So level one is A through Z, and then level two is a bunch of slots: the A followed by some letter, then the B followed by some letter, then the C's, and so on. Further down is level three, the three-character paths. So all the single-character paths, two-character paths, and three-character paths get flattened down in order. Look at what changed: my first node, the root, has no letter, it's not a word, and it says its children begin at index one, the very next location. It's not a multiply-by-two-and-add-one scheme; each node simply records where its children begin. The root's children are at slot one in the array, and there are 26 of them, so A, B, C, D sit in order there. If I look at A: "a" is a word, and its children begin at index 27, and there are 12 of them, so only 12 of the 26 letters can follow an "a" in a valid word. And then B's children begin at index 39, which is 27 plus 12.
So that's where A's children sit, at indices 27 to 38, and then 39 onward is B's, which has eight or so children, I think, and then C's, and so on. We flattened it all down by generation into an array, which means all the space for pointers has been removed. One serious consequence, though, is that the pointers bought us flexibility: insert and remove depended on pointers really being there, letting us splice nodes in and out without doing a shuffle. Now that I've flattened it, I've bought back the same problems arrays have. Say I have "ab" on the way to "abalone" and "ac" on the way to "act," and I don't happen to have any words with "ad." If I try to add the word "add," then to put "ad" in there, everything has to move over, and there's a lot of shuffling and whatnot. So I'm building a data structure that's losing flexibility but saving space. In some sense, I'm focusing on how to build a data structure that's really fast to search, even if it's a little harder to build, because the thing I most care about is very, very efficient word lookup. Building it once, a little slowly, isn't so painful. Space-wise, this gets us down to just ten bytes per node. All the pointers are gone: that's 24 bytes of pointers, on average, that we just eliminated. And that means a two-and-a-half-megabyte memory footprint, so we're getting close to the range of the binary search tree and the hash table.
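A sketch of that flattened layout, with hypothetical field names; the per-node count field is exactly what the final version will later replace with a last-child bit:

#include <cctype>
#include <string>
#include <vector>

// Flattened trie: every node lives in one big array, laid out generation
// by generation. Children are contiguous, so a node only needs to record
// where its block of children starts and how long it is.
struct FlatNode {
    char letter;       // letter on the edge into this node ('\0' for root)
    bool isWord;       // true if the path from the root to here is a word
    int firstChild;    // array index where this node's children begin
    int numChildren;   // how many consecutive slots those children occupy
};

// Same walk as the pointer-based trie: for each letter of the word,
// scan the current node's block of children for a match.
bool contains(const std::vector<FlatNode>& nodes, const std::string& word) {
    int cur = 0;  // slot 0 is the root
    for (char raw : word) {
        char ch = static_cast<char>(std::tolower(static_cast<unsigned char>(raw)));
        bool found = false;
        int start = nodes[cur].firstChild;
        for (int i = start; i < start + nodes[cur].numChildren; i++) {
            if (nodes[i].letter == ch) { cur = i; found = true; break; }
        }
        if (!found) return false;  // fell off the trie: not a word
    }
    return nodes[cur].isWord;
}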
Now I'm going to go back to the domain again. I've worked on it from the CS side, thinking about what I know about data structures, reorganizing arrays and heaps and pointers. Let me look at the domain again and see if there's something else I can take advantage of. Here's a different cross-section of the dictionary, showing "adapted," "adaptor," "adaptors," "adapting," "adaption," "adaptions," "adapts": eight words all built off the "adapt" root with a bunch of suffixes. If I look a little further over, I've got "desert," "deserted," "deserter," "deserting," "desertings," "desertion," "desertions," "deserts." And "detect" and "insert" and "perfect" and "invent" and "interrupt," all verbs onto which you can tack the past tense, the gerund form, the "ion"s, building these suffixed forms off the root word. Each of the words shown here takes exactly the same set of suffixes. Now, I already did this unification trick on the prefixes: I went out of my way to make "assert," "asserter," "assertings," "asserting" all share that same prefix. But is there some way to take "corrupt" and "assert" and "insert," and once we get down to the final "t" where the roots end, let them share their back-end parts as well? Those "ing"s and "ion"s, can they be unified too? Well, why not?
Why not? So here is the first version of what's going to be the DAWG, the Directed Acyclic Word Graph. It takes the idea of unifying the prefixes and applies it to the suffixes as well: a lot of words end in the same letters, as well as start in the same letters, so why not apply the same kind of unification and sharing on the back end? In come "adapt" and "adopt" and "baste" and "port," and out the back end are "ports" and "portion" and "porting" and "porter" and "adopter" and "adopted" and "basting" and "bastion," all sharing the endings, because they have the same viable suffixes and the same connections between them. The "er"s are shared, and that "er" leads into the "s." And that same "s" is shared from "portion" and "baste" and "adopt" and "adopter": everything that ends in "s" to form a word flows into one "s" node they all share. So it is a graph, and it is directed. Once we remove the idea that there's a single path from the root to any node, we're no longer a tree structure. We can get to that "t" by going through "adapt" or "adopt" or "port," and that no longer fits the definition of a tree; it's become a graph, because we can reach nodes from different places. The arcs go one way; we can't go back up the structure to find the words in reverse. And there are no cycles, so it's acyclic: you can't just loop around some portion of the graph and, say, spin out as many "na"s in "banana" as you want. And certain nodes are marked, the same way they were in the trie: a path from the root to a marked node represents a word. Now, building this is a little complicated. I'll give you the intuition for it without getting too deep. Largely, the way you do it is you first build the trie, which unifies only the prefixes, not the suffixes, and then you work backwards on the suffixes. You look at all the words that end in, say, "s." You find all the ones where that "s" goes nowhere further, and you say: look, this "s" looks like every other such "s," all these plurals and verbs with an "s" tacked on.
Let's share that "s." However many "s" nodes there were, say a thousand of them, you declare these are all the same "s." Then you look at all the nodes that lead into that "s," traversing the graph in reverse, and you say: here's a bunch that lead in on an "e." Some of those "e"s are themselves words along the way, like "apple" leading to "apples"; for others, the "e" isn't a stopping place, as in "parties." All the ones where the "e" is a word can unify together, and all the ones where it isn't can unify together. Then you work your way backwards from there: what letters led into that "e"? So from the ends of the words backwards, you find everything that ends in an "s," ends in an "es," ends in an "ies," and unify them up, making sure they have all the same outgoing connections. You do have to be careful: "running" ends in "ing" and so does "swimming," but there's "swimmingly" and no "runningly," so two nodes are only unifiable if everything from that point down to the bottom is identical. It's a pretty neat process; it's essentially DFA minimization, so if you want to learn a little more about it, that's the term to look up. And if I do that, building this as a DAWG, I'm still using the same structure definition, but by unifying suffixes as well as prefixes, the number of nodes I need changes drastically: it goes down to 80,000 from my initial 250,000. And think about the number of words again: there were about 125,000 words in the dictionary, and the DAWG has just 80,000 nodes.
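Here is a compact sketch of that merge-by-equivalence idea on a pointer-based trie; the names and the signature encoding are my own illustration of the technique, not the actual build code:

#include <cstdint>
#include <map>
#include <string>

// Hypothetical trie node: 'children' maps a letter to a child subtree,
// 'isWord' marks the end of a valid word.
struct Node {
    bool isWord = false;
    std::map<char, Node*> children;
};

// Merge equivalent subtrees bottom-up. Two nodes can be unified when they
// agree on isWord and have identical (letter, unified-child) sets. The
// registry maps each signature to its one canonical representative.
// (A real build would also free the duplicates; this sketch leaks them.)
Node* unify(Node* node, std::map<std::string, Node*>& registry) {
    std::string signature = node->isWord ? "1" : "0";
    for (auto& [letter, child] : node->children) {
        child = unify(child, registry);   // canonicalize children first
        // Canonical children are unique, so their addresses identify them.
        signature += letter + std::to_string(reinterpret_cast<uintptr_t>(child)) + ";";
    }
    auto [it, inserted] = registry.try_emplace(signature, node);
    return it->second;  // an existing equivalent node, or this one if new
}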
So once you unify the suffixes, many words don't even have any nodes uniquely their own; they exist only as portions of other words that have all been joined together and compressed into one package. 80,000 nodes total, about two-thirds the number of words. With ten-byte nodes, 80,000 of them, we are down to under a megabyte. We've crossed an interesting barrier: the data fed into it, the word list on disk, was over a megabyte, and we've now created a representation that is genuinely compressed, using less space than the text file did in the first place. And that's because it finds all the ways in which words are like each other and shares, using the existing structure in the DAWG to absorb new words without adding bulk. Having done this, I have to say I've made the data structure even less flexible. To be clear, each of these steps has a cost somewhere else: the code got more complicated, and now that there's all this sharing at the back end, adding a new word takes care. Go back to that picture: suppose I go to add the word "portioning," which is a word, while "adoptioning" is not. Adding "portioning" requires splitting it off of its existing unified branch, because if I just tacked an "ing" onto the shared ending of "portion," then "adaptioning" and "adoptioning" would suddenly become words too. So once it's unified this way, I have to be extra careful that adding new words doesn't accidentally add some additional words I didn't plan on, because these paths are shared. It requires more careful management. Again, though, the goal here was to build a data structure that was kind of freeze-dried: I build it from the input data file I have, it then operates very, very efficiently, and I'm not planning on making a lot of runtime changes to it.
I get it down to this, and then the last thing I'm going to do is go back to the CS side and see if I can squeeze down what's being stored per node. I have 80,000 of these. Typically, when you look at a structure, it's very rare that changing one field from, say, a two-byte value to a one-byte char has any profound impact on the total size needed. But if you happen to know you're going to make a lot of these structures, tens of thousands, almost 100,000 of them, then every byte counts. If I can get this thing down from ten bytes to eight, or to six or four, it cuts my memory by quite a factor. So what I did in the final version is use a little technique called bit packing, which is stashing things in the absolute minimum space they fit into. It takes advantage of facts like this: there are only 26 letters in the alphabet, but a char is a full eight-bit value that can hold numbers from zero to 255, and I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses, I'm not using the digits. The capacity to store all those distinct characters is a cost I'm paying on every character for nothing. So I squeezed it down by dividing things up at the sub-byte level. This is touched on a little in Chapter 15 if you're at all curious; it's not what I consider core material for 106B, but I mention it to complete the picture. With five bits, each either zero or one, five of them together give me 2^5, or 32, different patterns, which is enough for my 26 characters plus a little slack. And I use exactly one bit for is-word, rather than a full Boolean. All I need to know is yes or no.
I also changed how you know you're the last child of a generation. Rather than having A say "there are 12 children here," I just say: go to index 27 and start reading, and one of the children will tell you it's the last. Each one says "no, I'm not the last," "no, I'm not the last," and then the final one is marked "yes," and that's how you know where A's children end and the next node's begin. Getting it down this way, I've taken what was a ten-byte structure down to four. And it turns out there are often reasons why certain size boundaries are the critical ones. If it were five bytes, it would actually likely occupy eight, because of what are called alignment and padding restrictions. Squeezing from seven down to six might make no change, and down to five still no change; behind the scenes, each structure may be artificially rounded up to eight. But getting it down to four tends to cross another of those critical barriers, where four actually is smaller than what ten was. So I take my 80,000 nodes, get them to four bytes each, and now I have a data structure that's about a third of a megabyte in memory, as opposed to the over-a-megabyte text form on disk. And that's how it does what it does. If you go look at the code, you'll see this compact bit-packed structure, and lookup works character by character with a recursive strategy: start at the top, find the matching character, go to its children's index, scan those children for the next matching character, and so on down through the data structure. The child-index field is where, for example, A records that its children start at 27, or B that its start at 39. I just need a number there, and I use something slightly smaller than a full integer, to make everything fit very tightly.
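Put together, the packed node might look something like this; the exact field widths and layout are my guess at a plausible split (bitfield layout is compiler-dependent), not necessarily the real lexicon.dat format:

#include <cstdint>

// One DAWG node bit-packed into 32 bits: 5 bits of letter, one is-word
// flag, one last-child flag (replacing the child count), and the index
// of the first child. 25 index bits cover ~33 million nodes, far more
// than the 80,000 we need.
struct PackedNode {
    uint32_t letter      : 5;   // 2^5 = 32 patterns, enough for 26 letters
    uint32_t isWord      : 1;   // does the path to this node spell a word?
    uint32_t isLastChild : 1;   // marks the end of the parent's child block
    uint32_t firstChild  : 25;  // array index where this node's children start
};
static_assert(sizeof(PackedNode) == 4, "should pack into four bytes");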
So I'll tell you a couple of interesting things about it before I close this off. The lexicon.dat file is just an image of what the in-memory DAWG data structure looks like, so it's very easy to dump the whole thing. It has no pointers in it. Pointers tend to make data structures harder to dehydrate to disk and rebuild, because they have to be wired up in memory; you don't know what memory you're going to get from new and what the addresses are. This one has no pointers at all. It's self-contained, one big array of these blocks, and all the information in it is relative: go to index 27, go to index 45. So you can take the whole block and just write it to disk, a third of a megabyte or so worth, read the same amount back in, and it's rehydrated without any extra effort. That's what lexicon.dat is: a dehydrated form of the DAWG that was built in memory, and it makes loading very, very fast. At that point, though, it's super inflexible. If you wanted to add a word, given the way the structure is built, you'd have to add nodes while making sure you weren't glomming onto existing words nearby, so it would be a lot of work to actually edit the DAWG in place.
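A sketch of that dehydrate/rehydrate step, reusing the PackedNode above; the file name, format details, and error handling are illustrative, not the actual lexicon.dat layout:

#include <cstdio>
#include <vector>

// Because the flattened DAWG holds only relative indices, never pointers,
// one block write captures the whole structure.
bool saveDawg(const std::vector<PackedNode>& nodes, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    size_t n = std::fwrite(nodes.data(), sizeof(PackedNode), nodes.size(), f);
    std::fclose(f);
    return n == nodes.size();
}

// One block read brings it back; the file size tells us the node count.
std::vector<PackedNode> loadDawg(const char* path) {
    std::vector<PackedNode> nodes;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return nodes;
    std::fseek(f, 0, SEEK_END);
    long bytes = std::ftell(f);                 // node count * sizeof(PackedNode)
    std::fseek(f, 0, SEEK_SET);
    nodes.resize(bytes / sizeof(PackedNode));
    size_t got = std::fread(nodes.data(), sizeof(PackedNode), nodes.size(), f);
    std::fclose(f);
    if (got != nodes.size()) nodes.clear();     // short read: treat as failure
    return nodes;
}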
In fact, when you ask it to add a word, it doesn't edit the DAWG at all. You say you'd like to put a new word in, and it says: this one's so perfect the way it is; I've built this one, I love this one, I'm very happy with it. If you want to put other words in, it just keeps a second data structure on the side, an auxiliary one, where it sticks the additional words you ask it to add. I have to say that seems like a very obvious thing to do, but in fact, I didn't even think of it for years. We had the lexicon, and I said: you can't use the lexicon to make new word lists or edit them; you can only load this one beautiful, perfect one. And people kept saying, "I wish I could put some words in the lexicon. I wish I could use it for the word lists that keep track of the human and computer players." And I'm like, "You just can't do it. You don't understand. It's very complicated. You just don't understand my pain." Whatever. I had excuses. And then, after some number of years of just being a baby about it: oh, of course, why don't I just keep another structure off to the side? It's an abstraction. It's a word list. You don't need to know what I do. The fact that I keep a DAWG for some of the words and just an ordinary set for the others is completely irrelevant, as long as when you ask for a word, I look in both and tell you whether it's there. Why do you need to know where I keep it? So it took me a while to buy abstraction, even though I teach it all the time; there's a little sketch of the idea below. It turns out I learned about this structure from a paper I read in the early '90s about somebody building a Scrabble player. They wanted a dictionary that supported finding plays on the board and extending words, so knowing about root words, prefixes, and suffixes was the original motivation. And this kind of data structure is really great for any kind of word play: anagrams you want to rearrange, words you want to extend past a prefix or suffix. It's optimized for that kind of traversal, leveraging the existing patterns in words to save space while doing so. So it was a really fun little project, and neat to think about. Even though today, if you needed a lexicon, you could just use a set: it's fast enough, and it takes a ton of memory, but memory's cheap. This is not the kind of thing you would need to do today, except in some embedded, low-memory, very tight processor environment. But it still makes for an interesting artifact to think about: how you get there, and what kinds of techniques you can use.
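And here is that abstraction in miniature: one word-list interface hiding two representations. A hypothetical sketch, with dawgContains standing in for the packed-array walk shown earlier:

#include <set>
#include <string>

// One lexicon abstraction over two stores: the frozen DAWG for the
// shipped dictionary, plus an ordinary set for words added afterwards.
class Lexicon {
public:
    bool contains(const std::string& word) const {
        return dawgContains(word) || extras.count(word) > 0;
    }
    void add(const std::string& word) {
        if (!contains(word)) extras.insert(word);  // never edits the DAWG
    }
private:
    bool dawgContains(const std::string& word) const {
        // would walk the packed node array as sketched earlier; stubbed here
        return false;
    }
    std::set<std::string> extras;  // the auxiliary word list on the side
};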
So Wednesday we will talk about some big-picture design, some wrap-up stuff, and I will see you then.