This presentation is delivered by the Stanford Center for Professional Development.

The piece I need is not here, so I'm going to actually just talk to you about what it says on my slides, and then when Jason arrives we'll plug it back in and see what's going on. Oh, there's Jason. There we go.

So what I'm going to do today is actually something kind of fun. I'm going to take a case study of a particular data structure design, and the one I'm looking at is the lexicon. So how many of you actually ever decided to just open up the lexicon file, just to see what was in there, at some point during the quarter? We used it a couple of times. Anybody want to tell us what they learned by opening up that file? Anything interesting?

It's a data structure called a DAWG. You're like, well, that's a pretty funny name for something. And was the code particularly well-[inaudible], easy to read, easy to understand? No. Who wrote that code? She should totally be fired. Yeah, so this is what I wrote, kind of a long time ago actually, to solve this particular problem. And I would like to walk through how it is I came to discover the DAWG, and why the DAWG was the choice that made sense for the usage pattern we had for the lexicon.

So let me just refresh what the lexicon does. It is a special case of a set or map: it is a container where you're going to stick things in, and they're going to be unique, right? There's no need for any sort of duplicate. The keys are always strings, and in fact they're English words. That's going to be an important thing to remember: they're not just arbitrary sequences of letters, right? They have certain patterns to them and certain sizes that might actually be something we could capitalize on. When we know something about the domain, we're able to special-case or tune our data structure to solve that problem particularly well.

It has no associated value, so it doesn't have dictionary definitions or synonyms or anything like that, right? It really is just "exists or doesn't exist" that we were interested in supporting.
And in addition to being able to add words and check if a word is contained, there's also this prefixing operation that we had talked a little about in Boggle, and that came back on the midterm. Why was prefixing important, right? It was critical for pruning the dead-end paths, the searches on the Boggle board that otherwise would really have taken a lot of time and led to nothing of any value.

So what I'm going to talk you through is some of the obvious choices based on data structures we already know about, and see where they would go. This is a thought experiment. It's not that I wrote all of these to find out; actually, I mostly did a lot of this on paper, trying to figure out what options we had, how they would play out, and what kind of trade-offs I'd have to make to make them work.

So the simplest case of any kind of collection is to think about whether a vector or array can do the job for you. In this case, having a vector that's kept in sorted - alphabetically sorted - order seems like a pretty good, easy way to get this set up and running. And how would it behave? In order to do a contains(word) on this data structure, what algorithm are you going to use to do your searching? Binary search, right? It's in sorted order; we've got to take advantage of that. If we do a linear search and I'm looking for the word "mediocre," I'm going to be trucking through, you know, thousands of words before I find it. So we definitely want to be using binary search: equals-equals comparing my strings, less than, greater than, dividing in half. And so it should run in logarithmic time. Logarithmic time is one of those great performers that you learned about in the PQ assignment, hopefully. It's basically not measurable on today's computers, so even for 100,000 or 500,000 entries, it's basically free.

How do we do contains(prefix)? Can we also do a fast prefix search given this data structure? I hope so, right? It's the same kind of binary search. In this case, though, we're looking at substrings: if I'm looking for a three-character prefix, I compare against only the first three characters to decide which way to go.
Once I find something that matches in the first three characters, I'm done. So again, we're still using the sorted property to quickly narrow in on where it could be.

When it's time to add a new word to this - if someone asked you to put in the word "abalone" - I've got to use that binary search to find out where it goes. So in logarithmic time I can tell you where the new word goes, but then, to put it into position, there's going to be some shuffling. And that's where this operation starts to bog down, given that you might be moving half or more of your items, on average, to make that space open up. It's going to require a linear-time operation to get new words into the data structure.

Okay, but probably still a reasonable approach, right? Contains(word) and contains(prefix) are the operations that get done immense numbers of times in Boggle - you're doing a ton of them as you're exhaustively searching that big board - but the add's are done once, at the beginning. You build that big dictionary and then you subsequently use it. And so if it took a little bit more time to build that thing - sort of an n-squared operation to load it up - well, maybe that would be okay.

The other thing we want to look at is space usage, and some of this is a little bit of an interesting artifact of the time I was working on this. It turns out space was actually a limiting factor in what we had available to us, in terms of the core memories of our machines and what we could take up for our dictionary. In this case, the cost per entry is the size of the string. We don't exactly know how the string is represented - it's an abstraction to us - but we can make a simplifying assumption that it's probably about the length of the string times the size of a character. There might be a little bit more housekeeping associated with it, but that's a pretty good estimate of what it's taking. So in fact, it's the string data itself that's being represented in this vector, without much other overhead.
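To make that concrete, here's a minimal sketch of the sorted-vector approach. The class and member names are my own, not the actual lexicon code, but the shape of the operations is the same: logarithmic lookups against the sorted order, and a linear-time shuffle on add.

```cpp
// A minimal sketch of the sorted-vector lexicon (hypothetical names, not
// the shipped implementation). Words are kept in alphabetical order so
// both lookups can lean on binary search.
#include <algorithm>
#include <string>
#include <vector>

class SortedVectorLexicon {
public:
    bool containsWord(const std::string& word) const {
        // O(log n): straight binary search over the sorted words.
        return std::binary_search(words.begin(), words.end(), word);
    }

    bool containsPrefix(const std::string& prefix) const {
        // O(log n): the first word >= prefix starts with the prefix
        // exactly when some word in the lexicon does.
        auto it = std::lower_bound(words.begin(), words.end(), prefix);
        return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
    }

    void add(const std::string& word) {
        // O(log n) to find the slot, but O(n) to shuffle entries over.
        auto it = std::lower_bound(words.begin(), words.end(), word);
        if (it == words.end() || *it != word) words.insert(it, word);
    }

private:
    std::vector<std::string> words;  // maintained in sorted order
};
```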
So if we say words average about eight characters, that's about eight bytes per word. And if we have 100,000 words, we expect to have about 800,000 bytes invested in this sorted-vector arrangement. Excess capacity and a couple of other things might tweak that a little bit, but that gives us at least a good starting point for thinking about how this compares to the alternatives.

Okay. So then we said that sortedness - binary search - is the real advantage of keeping it in sorted order, but the contiguous nature of the array bites us in terms of insert. If we reorganize into a binary search tree, to give ourselves the dynamic flexibility of wiring in the left and right halves through pointers as opposed to contiguous memory, then we know we can also get add to run in logarithmic time.

So looking at these operations: contains(word) is a tree search that's using equals-equals and less-than, deciding whether to go left or right, and it also performs logarithmically if the tree is balanced. Contains(prefix), same deal: keep looking at the [inaudible], working our way down, knowing which way to go. And add can also perform in logarithmic time, because it does the same search, either falling off the bottom and inserting the new word there, or finding it along the way and not having any work to do. So we can get all our operations done in logarithmic time, which we know to be fast enough that it's not worth fighting any harder.

Where did we pay for this? Nothing really comes for free, right? Overhead. And in this case, memory is certainly one of the places where there's going to be a pretty big cost: we now have the string data for each entry, plus a left pointer and a right pointer. And if we're doing the work to keep it balanced - which we probably ought to do, if we want to avoid the degenerate cases of it getting totally out of whack and degenerating into linear time - we're going to have a couple more bytes thrown in for a balance factor, too.
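Just to picture where those bytes go, here's a hypothetical node layout for the balanced-BST version; the field names and sizes are my assumptions, using the four-byte pointers of the era.

```cpp
// Hypothetical per-entry layout for the balanced BST version; sizes
// assume the 4-byte pointers of a 1990s machine.
#include <string>

struct TreeNode {
    std::string word;     // the string data itself (~8 characters on average)
    TreeNode* left;       // 4 bytes
    TreeNode* right;      // 4 bytes
    short balanceFactor;  // 2 bytes of bookkeeping to keep the tree balanced
};
```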
And so if we assume there are ten bytes of overhead - four bytes in each of those pointers, plus another two for the balance factor - then we've added ten bytes of overhead on top of the eight bytes for the word. Getting a little bit hefty. Something to worry about.

There's also code complexity, which I didn't put a bullet up for, but it's worth weighing in your mind too: building the fancier, more complicated data structure probably meant you spent more time developing and debugging it, right? And it will be harder for someone later to maintain and deal with, because it's just more complicated code. Say someone wants to add a new operation - let's say they add remove, so you're able to take a word out of the lexicon; the lexicon doesn't currently support that. Somebody adding that to the sorted-vector implementation probably won't find themselves too tripped up. Someone who has to do a remove out of a binary search tree - removing the node, reattaching the subtrees, and not destroying the balancing - that's a pretty complicated operation. So we've invested in a data structure that will keep creating further development hassles for any modifications down the road. It is a big investment in your brainpower to keep it working.

So let's think about what a hash table can do for us. The hash table is the last thing that we looked at on Friday, moving away from this whole idea of sortedness and handing the problem off to a hashing function. So if I have a hash table with a certain number of buckets, using a linked list for chaining to handle collisions, then what I have is my words just scattered around the table, allowing me quick access by hashing to the right bucket and then working my way down that bucket's chain to see. So contains(word) does a standard hash table lookup: take the word, you know, "mediate," and run it through the hash function.
It says, oh, it's bucket three in this case. I go to bucket three, I walk down the linked list, and I find it or not. I can tell you that the word "mediate," if it's in the table, had to be in bucket three. No questions asked. In that case, the expected time is going to be n over b, where n is the number of entries and b is the number of buckets. If we set the number of buckets to be on the rough order of the number of entries, then we're going to get constant-time access there.

Now we want to do contains(prefix). I want to know if there are any words in this table that begin with "med." Where do I look? If I run "med" through the hash function, do I expect that I would get the same number as other words that begin with "med," like "mediate" and "median" and "mediator"? Yes? You say yes. All right, let's take that to an extreme. If a prefix and any longer words built on it all hashed to the same thing, then think about the simplest case of a prefix, "m." Should all the words that begin with "m" hash to the same place? If they did, we're in trouble, right? If all prefixes hashed to the same place, what I'd end up with is a really heavily clustered table where "a," "b," "c," down to "z" - there are 26 buckets that have all the "a" words, all the "b" words, and so on - and none of the other buckets would be used.

So in fact, as part of the good behavior of a hash function, we're hoping that it jumbles things up - that a word that looks like another word but has one more letter added on the end goes somewhere totally different, that "median" and "medians" go to two totally disparate places. If there were patterns where words that were substrings of each other all got hashed to the same location, we would end up with heavy clustering, because there are a lot of words that repeat parts of other words in the dictionary.

So in fact, if I want to know whether there are any words that begin with "med," I have no a priori information about where to look for them. They could be anywhere.
I could hash "med" and see if "med" itself was a word, right? But suppose it's something that is only ever the beginning of something - "st," "str" - there are a lot of words that begin with that, but "str" itself is not a word. Hashing to that bucket and looking would only tell me if there's exactly "str" in the table, and occasionally there might be some "str" words there too, but other than that, we've just got to look. And where do we look? Everywhere. Everywhere, everywhere, everywhere. To conclude there is no word that begins with the prefix we have, the only way to be confident is to either find something, or to have searched the entire table and not found anything. And so it does require a completely linear, exhaustive look through the entire contents of the hash table to know that something is not a prefix. You might find one quickly - in the best case, you might happen upon one in the first couple of buckets - but no guarantees overall. And certainly when it's not a prefix, it's going to require a complete look-see of everything. That's kind of a bummer.

Adding a word: back to the fast behavior again. Hash to the bucket, check to see if it's already there, add if needed [inaudible]. So the hash table gets good times on those two operations, but then really bottoms out when it comes to supporting the prefix search.

Its space usage is roughly comparable to that of the binary search tree. We've got the string data itself, plus the four-byte next pointer on the linked list. There's also the four-byte bucket pointer, which is over in this other structure, the array of buckets. If the number of buckets is roughly equal to the number of cells, you can count each entry as having its 12-byte cell plus a four-byte bucket pointer associated with it on a per-entry basis, which tells you it's about eight bytes of overhead on top of the string. So, 16 bytes for each entry in the table. Okay, that's what we have so far.
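Before moving on, here's a sketch that pulls those behaviors together - my own simplified version, with vectors standing in for the linked-list chains. The point to notice is that containsPrefix has nothing better to do than scan every chain in every bucket.

```cpp
// A simplified sketch of the chained hash table lexicon (not the original
// code; std::vector stands in for the linked-list chains).
#include <functional>
#include <string>
#include <vector>

class HashLexicon {
public:
    explicit HashLexicon(size_t numBuckets) : buckets(numBuckets) {}

    bool containsWord(const std::string& word) const {
        // Expected O(n/b): hash straight to the one bucket that could hold it.
        for (const std::string& w : buckets[bucketFor(word)])
            if (w == word) return true;
        return false;
    }

    bool containsPrefix(const std::string& prefix) const {
        // O(n): hashing tells us nothing about prefixes, so we must look
        // everywhere before we can conclude "no."
        for (const auto& chain : buckets)
            for (const std::string& w : chain)
                if (w.compare(0, prefix.size(), prefix) == 0) return true;
        return false;
    }

    void add(const std::string& word) {
        // Hash to the bucket, check if it's already there, append if not.
        if (!containsWord(word)) buckets[bucketFor(word)].push_back(word);
    }

private:
    size_t bucketFor(const std::string& s) const {
        return std::hash<std::string>{}(s) % buckets.size();
    }
    std::vector<std::vector<std::string>> buckets;  // one chain per bucket
};
```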
We've got pretty good performance on contains(word) no matter what we choose: we can get either logarithmic or constant time with any of these three data structures, so that's perfectly acceptable performance. We cannot get contains(prefix) to run well on a hash table; it just doesn't happen, the way hashing works. And add can be fast on the two complicated pointer-based data structures, but can't be fast on the sorted vector. Then we see some trade-offs in terms of space versus time: there's very little space overhead in the sorted-vector case, but that blossoms into two, almost three times the storage needed in the BST and hash table cases. The code also got a lot more complicated when we moved up to those data structures.

So then I'll tell you that it turns out the one kind of performance that was not just interesting but in fact more important at the time was the memory usage. Memory usage was really the big obstacle we had. The Official Scrabble Players Dictionary II is the source for the word list we're using. It has about 128,000 words, I think, is exactly the number, and the average length of those is eight characters. The file itself is over a megabyte - so on disk, if you were just to look at the listing of all the words, it's a little over a megabyte. If we loaded it into the sorted vector, we'd get something roughly equal to that: about a megabyte of space, plus a little bit of noise. The ones that double up take it to two, two and a half megabytes total.

At the time we were working - and I know this is going to seem completely archaic to you - we had these stone tablets that we worked on, and they typically had main memories of about a megabyte; maybe, in an exciting case, four megabytes. Four whole megabytes of RAM. So it was not possible to dedicate one or two megabytes of your main memory to just your lexicon. There just wasn't enough space to go around. So we were working in a very tight environment in that regard.
It's a little bit like today's world of embedded devices: if you're writing something for an iPod or a cell phone, you have a lot less memory than your average desktop machine with its gigabyte of RAM, where you're like, whatever - a megabyte here, ten megabytes there, free, all the memory you want. So this may seem like a silly thing to focus on, but this was 1995, and yeah, there was certainly more interest in keeping the memory tight.

The Boggle lexicon that we have actually takes under a third of a megabyte. The memory used while it's working - searching and checking for words and prefixes - is actually smaller than the on-disk form of the data we started with. So it actually does a kind of compression as part of its storage strategy, which is pretty interesting to think about.

So I'll tell you the truth about how we first did Boggle, because it's a little bit of interesting historical perspective on it. Boggle was given for the first time by a colleague of mine, Todd Feldman. He was the guy who came up with the idea, which is a great idea, and he said, "Oh, this is a great assignment." And when he wrote it, he said, "Oh, I need a lexicon." So as part of putting the assignment together, he looked around for word lists. I think we first found the big word list and said, "Oh, it's just impossible - it's going to take a megabyte or more, and we just don't have that kind of space. We need much smaller data files." So we had a data file that had about 20,000 words, which takes about 200K. And he wrote it using a hash table. Given that he wrote it as a hash table, he just didn't support contains(prefix). He just didn't put it in, because you can't do it efficiently. So it took a smaller amount of space, it had a smaller word list, and it wouldn't do contains(prefix). So then people wrote Boggle. And as you learned from that midterm question, you can write Boggle without contains(prefix).
It spends a lot more time looking down these dead ends, but it eventually bottoms out when it runs off the board and uses all the cubes. And in one odd way it actually almost made the program a little more satisfying to write. It's the computer's turn that uses the prefix pruning. So you would type in your words and it would be perfectly fine finding them - it doesn't use the prefix part during the human's turn. You're typing in words; it's finding them. Then you would go to do the computer's turn and you'd say, "Okay, go," and now it would take a really long time. It'd be hunting around the board, working so hard your fan would be coming on. You'd feel like, oh, my computer's doing something. And then it would show up with twice as many words as you found, or ten times as many, whatever, and you would say, "Well, at least it had to work hard to beat me." You felt like, I wrote this program that causes my processor to get busy - I feel good.

And in fact, because the word list was pretty small, it wasn't even as likely to beat you, because it didn't know as many words. 125,000 words is an enormous number of words. I've heard different figures on this, but the average person's English vocabulary is about 20,000 to 30,000 words; it tends to be about 10,000 higher if you've gone to college, so maybe 30,000 to 40,000. The fact that our lexicon knows 80,000 more words than the average college graduate means it knows a lot of obscure words, too. So not only does it very quickly find all those words and then produce these ridiculous words that you've never heard of - it's almost like it's taunting you by doing it in a millisecond. Back then it took a while, so it was at least: okay, it beat me, but it had to work hard.

But then I took over Boggle. I taught X maybe a quarter or two later, and I said, "I'm going to write a lexicon.
I'm going to go make a little project for myself, because I don't have enough work to do." And my goal was to make one that would do prefix, but that would also be tight enough on memory that we could actually load a much bigger dictionary file, because I thought it would make the game even more fun to play.

All right, so here's where I started. This is a little excerpt from the middle of the dictionary: "stra." Yeah, that's totally the thing you notice when you look at this. You see "straddle," "straddler," "straddlers," "straddles," "straddling," and then these words where you can't believe you're allowed to do this, but apparently you can put "er" and "ier" and "ies" and "ing" on almost anything out there and it makes a valid word. So although you never really thought about the fact that "straightforward," "straightforwardly," and "straightforwardness" are all in there, there is a huge amount of repetition. And this is where I said: we're going to use some information about the domain. These aren't just arbitrary strings. They're not random sequences of letters picked out of the air; they developed over time. They're word roots, they're suffixes, they're prefixes that mean things. Looking at this little excerpt out of the middle, there are an awful lot of words that repeat portions of other words. And that may be part of what we can capitalize on. Instead of really recording "straightaway" and "straightaways" - repeating those first eleven characters and adding an "s" - maybe there's some way I can unify the data I have here. It's the same thing we do in code: when you have the same passage of code repeated two or three times, we're always saying, unify it, move it into a helper, and call it from the three places. Can we do the same thing with our data? Can we share the parts of the words that are in common, and only distinguish where they're different?

Take a look at this. This is a little portion of something called a trie. A trie is spelled t-r-i-e - it's actually from "retrieval" - but sadly, it's still pronounced "try," just to confuse you.
445 00:20:23,000 --> 00:20:28,000 And it is a variant of a tree. 446 00:20:28,000 --> 00:20:30,000 In this case it's a tree that has 26 children coming out of any particular node, 447 00:20:30,000 --> 00:20:32,000 one for each letter of the alphabet. 448 00:20:32,000 --> 00:20:35,000 And so starting from the root node at the top, 449 00:20:35,000 --> 00:20:39,000 all the words that begin with ''a'' are actually 450 00:20:39,000 --> 00:20:42,000 down the A branch, and then there's a B branch. There'd be a C branch and a D 451 00:20:42,000 --> 00:20:45,000 branch. And so the idea is that all the words, like if you think of the levels of the 452 00:20:45,000 --> 00:20:46,000 tree are actually like - these are all the zero if characters of the word. 453 00:20:46,000 --> 00:20:49,000 And then that second level is 454 00:20:49,000 --> 00:20:52,000 all the characters that follow that zero if character. So here's the ''ab'' 455 00:20:52,000 --> 00:20:56,000 words and the ''ac'' words and the ''ax'' words. 456 00:20:56,000 --> 00:20:57,000 Over here is the ''st'' words, but there's not a ''sx,'' so there's some places that actually aren't 457 00:20:57,000 --> 00:21:00,000 filled out in this tree. 458 00:21:00,000 --> 00:21:02,000 And as you keep going down further, so that the depth 459 00:21:02,000 --> 00:21:04,000 you would trace to a tree 460 00:21:04,000 --> 00:21:08,000 would actually be 461 00:21:08,000 --> 00:21:10,000 tracing through the characters in sequence that make words. 462 00:21:10,000 --> 00:21:11,000 And so in this form, tracing out a 463 00:21:11,000 --> 00:21:12,000 path through this tree 464 00:21:12,000 --> 00:21:14,000 from the root 465 00:21:14,000 --> 00:21:17,000 down to a particular node 466 00:21:17,000 --> 00:21:19,000 tells you about prefixes. So there are words that begin with ''a'' and words that begin with ''ac'' 467 00:21:19,000 --> 00:21:23,000 and ''ace,'' 468 00:21:23,000 --> 00:21:23,000 and then there's actually a little notation there that's done by a 469 00:21:23,000 --> 00:21:24,000 visual 470 00:21:24,000 --> 00:21:26,000 of the 471 00:21:26,000 --> 00:21:29,000 bold and circle around it 472 00:21:29,000 --> 00:21:30,000 that says, ''Oh, and the path that led here from the root is a word.'' So 473 00:21:30,000 --> 00:21:33,000 ''a'' is a word 474 00:21:33,000 --> 00:21:35,000 and ''ace'' is a word and ''aces'' is a word. 475 00:21:35,000 --> 00:21:39,000 ''Act'' is a word and so is ''acted'' and ''actor.'' 476 00:21:39,000 --> 00:21:44,000 And in this case, the ''act'' part is totally shared, 477 00:21:44,000 --> 00:21:46,000 so everything that comes off of ''act,'' ''acting,'' ''acted,'' ''actor,'' ''actors,'' 478 00:21:46,000 --> 00:21:50,000 can all share that initial part. 479 00:21:50,000 --> 00:21:52,000 And for a word like ''act,'' it doesn't seem that profound. But you start thinking about words like 480 00:21:52,000 --> 00:21:55,000 ''strawberry'' or ''straightforward'' or ''stratosphere'' 481 00:21:55,000 --> 00:21:59,000 and realize that ''stratospheric'' and ''strawberries'' and ''straightforwardly'' 482 00:21:59,000 --> 00:22:01,000 can all leverage the other ten or 12 characters that were already 483 00:22:01,000 --> 00:22:04,000 invested in the root word, 484 00:22:04,000 --> 00:22:06,000 and that ''straightjacket'' and ''straightforward'' can even share ''straight.'' 485 00:22:06,000 --> 00:22:10,000 But there's actually a lot of 486 00:22:10,000 --> 00:22:15,000 prefixes that can be unified across the space of the dictionary. 
487 00:22:15,000 --> 00:22:17,000 So knowing that words have these patterns helps us to do this. 488 00:22:17,000 --> 00:22:19,000 So let's build that. 489 00:22:19,000 --> 00:22:22,000 So the first form of this - 490 00:22:22,000 --> 00:22:25,000 and this, again, is a thought experiment. I did not actually write it this way because 491 00:22:25,000 --> 00:22:27,000 it would just be absolutely astronomically expensive. But it was a good way to kinda think 492 00:22:27,000 --> 00:22:30,000 about where I needed to go. 493 00:22:30,000 --> 00:22:34,000 I designed a new kinda trie node 494 00:22:34,000 --> 00:22:35,000 that had the letter and a little Boolean flag of is this the word, so whether it had 495 00:22:35,000 --> 00:22:39,000 the dark circle around it. 496 00:22:39,000 --> 00:22:41,000 And then it had a 26-member children array, so pointers to other nodes like 497 00:22:41,000 --> 00:22:44,000 this one. So totally recursive, 498 00:22:44,000 --> 00:22:47,000 staring from that root and then the 499 00:22:47,000 --> 00:22:50,000 idea is to use those 26 children as ''a'' being in the first slot and ''z'' in the last 500 00:22:50,000 --> 00:22:52,000 slot to make it really easy to kinda just trace letter by letter 501 00:22:52,000 --> 00:22:55,000 down to the levels of the tree. 502 00:22:55,000 --> 00:22:59,000 So I'd have this one root node that pointed to the initial one, and then all the words 503 00:22:59,000 --> 00:23:03,000 that begin with ''a'' are off that first lot; all from ''z'' off the last slot. 504 00:23:03,000 --> 00:23:08,000 So when I want to know if a word is in the dictionary, 505 00:23:08,000 --> 00:23:12,000 what do I have to do in this data structure? How do I find it? I 506 00:23:12,000 --> 00:23:18,000 want to find the word ''far.'' Where do I look? 507 00:23:18,000 --> 00:23:27,000 Left? Right? Down? Which way? It would be great 508 00:23:27,000 --> 00:23:31,000 if you would tell me. Then I would understand what my data structure is. [Inaudible] 509 00:23:31,000 --> 00:23:35,000 Exactly. So at the very top I say okay, ''f'' is my first letter. Match my ''f,'' get me to the 510 00:23:35,000 --> 00:23:35,000 F node, and now it says recursively find ''ar'' from here. So it's very 511 00:23:35,000 --> 00:23:37,000 recursive. It'll be like 512 00:23:37,000 --> 00:23:39,000 find the first letter 513 00:23:39,000 --> 00:23:43,000 and then recursively find the substring from here, working down. So find the ''a'' out of the next node, find 514 00:23:43,000 --> 00:23:46,000 an ''r'' out of the next node, and then when you get to that node, 515 00:23:46,000 --> 00:23:47,000 you'll check this little ''is word Boolean,'' 516 00:23:47,000 --> 00:23:49,000 and say okay, 517 00:23:49,000 --> 00:23:53,000 was that path I traced 518 00:23:53,000 --> 00:23:58,000 marked in the dictionary as leading to something that's known to be a word? 519 00:23:58,000 --> 00:24:01,000 The prefix looks exactly the same, except for that last check for the as word. So 520 00:24:01,000 --> 00:24:05,000 given that nodes exist in this, because further down there there's some words, I just 521 00:24:05,000 --> 00:24:08,000 trace out a sequence of letters ''str,'' and as long as there's something there, 522 00:24:08,000 --> 00:24:11,000 then I know that away from there are subsequent children beneath it 523 00:24:11,000 --> 00:24:15,000 that lead to words. 524 00:24:15,000 --> 00:24:17,000 And adding a word does the same operation. 
If I need to add the word "strawberry," then I start tracing "s," "t," "r," and at some point I find places where the nodes don't exist. From there on down, I start creating them. So it traces what exists, adds new nodes off the bottom, and then marks the final one as: yes, this is a word.

What's interesting is that all three of these operations make no reference whatsoever to how many words are in the dictionary. They all depend only on the length of the word you're searching for or adding - whether there are ten words in there, a thousand, or a million. In no way does the big-O depend on how big or how loaded the data structure is. That's kind of wacky.

The longest word, according to the Oxford English Dictionary? There you go. You're not going to make that on the Boggle board anytime soon, because the board's only got 16 cubes; even Big Boggle can't really make that puppy. But just so you know, 45 is not so many, right?

And in fact, here's a little thought for you. Not only is it not dependent on the number of words, it actually has a little bit of an inverse relationship with the number of words, especially the add operation. As the data structure becomes more loaded, there's typically less work to do in add. If "strawberry" is in there and you go to add "strawberries," you already have "strawberry" to build off of. The more words that are in there, the more likely that some of the nodes you need are already in place, and you can just blow past them and add your new nodes. Now, that's odd: you think of most data structures as clearly taking more work to install new things as they get heavier, not less. [Inaudible]

Well, the thing is, they're organized by letter in that 26-member array, so I don't even have to look - I just go straight to the slot. If I'm looking for "strawberry," I go straight to the "s" slot, then the "t" slot, then the "r" slot. If the node is already in place, I'm not searching for it; it really is exactly in the "r" slot. That's why there are 26 of them.
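Here's a rough sketch of that first-cut node and the recursive walk, under my own assumptions about names ("a" in slot 0, "z" in slot 25, lowercase input):

```cpp
// A sketch of the first-cut trie node (a thought experiment, like the
// design itself). 'a' maps to slot 0 and 'z' to slot 25, and words are
// assumed to arrive in lowercase.
#include <string>

struct TrieNode {
    char letter;             // which letter this node represents
    bool isWord;             // the "dark circle": path from the root is a word
    TrieNode* children[26];  // one slot per possible next letter, null if unused
};

// Walk one node per character: O(length of word), independent of how
// many words the trie holds.
TrieNode* traverse(TrieNode* node, const std::string& s) {
    for (char ch : s) {
        if (node == nullptr) return nullptr;  // fell off the tree
        node = node->children[ch - 'a'];      // straight to the slot, no matching
    }
    return node;
}

bool containsWord(TrieNode* root, const std::string& word) {
    TrieNode* node = traverse(root, word);
    return node != nullptr && node->isWord;   // path exists AND is marked
}

bool containsPrefix(TrieNode* root, const std::string& prefix) {
    return traverse(root, prefix) != nullptr; // same walk, skip the flag check
}
```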
In each array? Each array is 26, "a" through "z," and if there's an empty slot, we're using a null. So in this case, it's optimized to make it super fast. I don't even have to look through the letters to find the one that matches - there's no matching. It just goes straight to the "r" slot, the "z" slot, the "s" slot. We're going to see that this has a consequence we can't tolerate, but it is at least the starting point for thinking about how to build the data structure.

The space usage is where this thing really - well, this is an extreme example of a space-time trade-off, right? We've got this thing that optimizes beautifully to O(length of the word), which means typically eight operations, on average, to add or to search for something. The space we're using is 106 bytes per node: the 26-member array of four-byte pointers, plus two bytes for the character and the Boolean. One hundred six bytes is a lot of memory, and the tree has about a quarter of a million nodes. Given the 125,000 words that I said are in the input, built into a tree they take about 250,000 nodes. That's an interesting number to think about: 125,000 words take 250,000 nodes, which tells you that on average, each word contributes about two unique nodes. Even words like "straightforward" and "straightforwardly" are, on average, sharing enough of their common prefix parts that there's very little unique data added by any particular word. That's pretty neat to know.

However, this is just not going to fly. I was telling you that my main memory had four megabytes, maybe, on a good day, and now I've built a data structure that's going to require six times that in storage. Okay, we've got to squeeze that down.
So the first thing we can do is realize that allocating a full 26-member array, of which I only plan on using some of the slots, is pretty wasteful. Instead of committing to 26 per node, how about an array that's sized tight to the number of children actually coming out of this node? The very first node will have 26 - there are words that start with every letter of the alphabet - but from there it very quickly winnows down. Take the letter "x": you do not need the full 26 children coming off of "x." Only a few letters follow "x" in English and need to be fleshed out. You need the "xa" and the "xe," but you don't need "xj," you don't need "xk," and a whole bunch of things like that.

So I change the structure: instead of a 26-member array of node pointers, it's a pointer to a dynamic array of node pointers, with the number of children tracked in a second field. The first node's array is allocated to 26; another's might be allocated to eight. This changes our running times a little bit, because now we can't go exactly to the "s" slot - we have to go look for it. Each step down the tree will be looking through that node's array to find the right match. But that adds a constant factor, basically, on top of O(length of the word).

The space win is that we're just not storing all those nulls - and there were a lot of nulls. The node itself is now just ten bytes, plus some space in the dynamic array out there, which is the number of children times the four-byte pointer. On average, a node has six children. Across the 250,000 nodes in the data structure being built from this, 26 was way overkill: for most nodes, about 20 of those slots were nulls. And many nodes - for example, at the very bottom of the tree - have no children at all. You get to a word like "straightforwardly": there's nothing after that "y," so it has zero children coming out the back of it.
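The slimmed-down node might look something like this (again, hypothetical field names; the trade is a short linear scan per step in exchange for dropping all those nulls):

```cpp
// A sketch of the slimmed-down node: the fixed 26-slot array becomes a
// heap array sized to the children that actually exist.
struct TrieNode {
    char letter;          // this node's letter
    bool isWord;          // end-of-word marker
    short numChildren;    // how many children are actually present
    TrieNode** children;  // dynamic array, allocated to exactly numChildren
};

// Stepping down now searches the small array instead of indexing directly:
// a constant-factor cost on top of O(length of the word).
TrieNode* findChild(TrieNode* node, char ch) {
    for (int i = 0; i < node->numChildren; i++)
        if (node->children[i]->letter == ch) return node->children[i];
    return nullptr;  // no child with that letter
}
```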
Having 26 null children pointing away from a node like that was waste we can tighten up. When we get it down to this, we'll have 34 bytes per node. Still a quarter million nodes; still eight and a half megabytes. So we're not winning any prizes just yet, but we are making good progress - that got us down a factor of three or four right there. I still have the same number of nodes. All I changed was the pointers to the subsequent nodes further down the tree; the letters and how they connect to each other - the connectivity - is still pretty much the same. It's just that in any particular node, however many children it has, I was storing a whole bunch of nulls and now I'm not. The number of nodes is exactly as it was before.

I'm going to do one other little squishing thing on this data structure, and then I'm going to flip gears for a second. Still using my CS side of things - how can I squash things down? I'm going to use a technique that we used in the heap. In the case of the binary heap that we implemented for the PQ, you'll remember that we drew pictures in the handout that showed a heap with a parent with two children, which has four grandchildren, and so on, but that we actually compressed that down into an array. And in the array there was this pattern: here is the root node - call this level zero - then the next two nodes are level one, and the next four nodes are level two, and so on. There's one here, two here, four here, eight here. We flattened it down by generation, by what was at the top and what was underneath it. When we did that, we removed all the space for pointers. Although we drew those pictures of the heap as though there were pointers there, in fact there were not. There were no pointers stored in the heap in this version of the PQ.
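As a reminder, that worked because a complete binary tree flattened by generation makes the links pure index arithmetic - shown here assuming the 1-based layout with the root at index 1:

```cpp
// The heap's implicit links, assuming the 1-based layout with the root
// stored at index 1: no pointers stored, just divide and multiply by two.
int parentOf(int i)     { return i / 2; }
int leftChildOf(int i)  { return 2 * i; }
int rightChildOf(int i) { return 2 * i + 1; }
```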
And so flattening it down into an array gets us back to array-like storage requirements, but by still using the tree conceptually to work our way through it, we keep the advantage of traversal based on the height of the tree, not the length of a vector. If we do the same thing with the trie, it's a little more complicated, because a couple of things about the heap made it easier. The heap's shape followed strict fill-in rules, so you could always compute parent and child positions using that divide-by-two, multiply-by-two index arithmetic. In this case, we'll still lay the nodes down by generation, but we don't know how big a generation is going to be, so we have to keep track of it manually. The root node leads to all of A through Z. So level one is A through Z, and then level two is a bunch of slots: the A followed by some letter, then the B followed by some letter, then the C's, and so on. Further down is level three, the three-character paths. So all the single-character paths, two-character paths, and three-character paths get flattened down in order. Look at what changed: my first node, the root, has no letter, it's not a word, and it says its children begin at index one, the very next location. It's not a multiply-by-two-and-add-one scheme; each node simply records where its children begin. The root's children are at slot one in the array, and there are 26 of them, so A, B, C, D sit in order there. If I look at A: "a" is a word, and its children begin at index 27, and there are 12 of them, so only 12 of the 26 letters can follow an "a" in a valid word. And then B's children begin at index 39, which is 27 plus 12.
So that's where A's children sit, at indices 27 to 38, and then 39 onward is B's, which has eight or so children, I think, and then C's, and so on. We flattened it all down by generation into an array, which means all the space for pointers has been removed. One serious consequence, though, is that the pointers bought us flexibility: insert and remove depended on pointers really being there, letting us splice nodes in and out without doing a shuffle. Now that I've flattened it, I've bought back the same problems arrays have. Say I have "ab" on the way to "abalone" and "ac" on the way to "act," and I don't happen to have any words with "ad." If I try to add the word "add," then to put "ad" in there, everything has to move over, and there's a lot of shuffling and whatnot. So I'm building a data structure that's losing flexibility but saving space. In some sense, I'm focusing on how to build a data structure that's really fast to search, even if it's a little harder to build, because the thing I most care about is very, very efficient word lookup. Building it once, a little slowly, isn't so painful. Space-wise, this gets us down to just ten bytes per node. All the pointers are gone: that's 24 bytes of pointers, on average, that we just eliminated. And that means a two-and-a-half-megabyte memory footprint, so we're getting close to the range of the binary search tree and the hash table.
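A sketch of that flattened layout, with hypothetical field names; the per-node count field is exactly what the final version will later replace with a last-child bit:

#include <cctype>
#include <string>
#include <vector>

// Flattened trie: every node lives in one big array, laid out generation
// by generation. Children are contiguous, so a node only needs to record
// where its block of children starts and how long it is.
struct FlatNode {
    char letter;       // letter on the edge into this node ('\0' for root)
    bool isWord;       // true if the path from the root to here is a word
    int firstChild;    // array index where this node's children begin
    int numChildren;   // how many consecutive slots those children occupy
};

// Same walk as the pointer-based trie: for each letter of the word,
// scan the current node's block of children for a match.
bool contains(const std::vector<FlatNode>& nodes, const std::string& word) {
    int cur = 0;  // slot 0 is the root
    for (char raw : word) {
        char ch = static_cast<char>(std::tolower(static_cast<unsigned char>(raw)));
        bool found = false;
        int start = nodes[cur].firstChild;
        for (int i = start; i < start + nodes[cur].numChildren; i++) {
            if (nodes[i].letter == ch) { cur = i; found = true; break; }
        }
        if (!found) return false;  // fell off the trie: not a word
    }
    return nodes[cur].isWord;
}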
Now I'm going to go back to the domain again. I've worked on it from the CS side, thinking about what I know about data structures, reorganizing arrays and heaps and pointers. Let me look at the domain again and see if there's something else I can take advantage of. Here's a different cross-section of the dictionary, showing "adapted," "adaptor," "adaptors," "adapting," "adaption," "adaptions," "adapts": eight words all built off the "adapt" root with a bunch of suffixes. If I look a little further over, I've got "desert," "deserted," "deserter," "deserting," "desertings," "desertion," "desertions," "deserts." And "detect" and "insert" and "perfect" and "invent" and "interrupt," all verbs onto which you can tack the past tense, the gerund form, the "ion"s, building these suffixed forms off the root word. Each of the words shown here takes exactly the same set of suffixes. Now, I already did this unification trick on the prefixes: I went out of my way to make "assert," "asserter," "assertings," "asserting" all share that same prefix. But is there some way to take "corrupt" and "assert" and "insert," and once we get down to the final "t" where the roots end, let them share their back-end parts as well? Those "ing"s and "ion"s, can they be unified too? Well, why not?
Why not? So here is the first version of what's going to be the DAWG, the Directed Acyclic Word Graph. It takes the idea of unifying the prefixes and applies it to the suffixes as well: a lot of words end in the same letters, as well as start in the same letters, so why not apply the same kind of unification and sharing on the back end? In come "adapt" and "adopt" and "baste" and "port," and out the back end are "ports" and "portion" and "porting" and "porter" and "adopter" and "adopted" and "basting" and "bastion," all sharing the endings, because they have the same viable suffixes and the same connections between them. The "er"s are shared, and that "er" leads into the "s." And that same "s" is shared from "portion" and "baste" and "adopt" and "adopter": everything that ends in "s" to form a word flows into one "s" node they all share. So it is a graph, and it is directed. Once we remove the idea that there's a single path from the root to any node, we're no longer a tree structure. We can get to that "t" by going through "adapt" or "adopt" or "port," and that no longer fits the definition of a tree; it's become a graph, because we can reach nodes from different places. The arcs go one way; we can't go back up the structure to find the words in reverse. And there are no cycles, so it's acyclic: you can't just loop around some portion of the graph and, say, spin out as many "na"s in "banana" as you want. And certain nodes are marked, the same way they were in the trie: a path from the root to a marked node represents a word. Now, building this is a little complicated. I'll give you the intuition for it without getting too deep. Largely, the way you do it is you first build the trie, which unifies only the prefixes, not the suffixes, and then you work backwards on the suffixes. You look at all the words that end in, say, "s." You find all the ones where that "s" goes nowhere further, and you say: look, this "s" looks like every other such "s," all these plurals and verbs with an "s" tacked on.
Let's share that "s." However many "s" nodes there were, say a thousand of them, you declare these are all the same "s." Then you look at all the nodes that lead into that "s," traversing the graph in reverse, and you say: here's a bunch that lead in on an "e." Some of those "e"s are themselves words along the way, like "apple" leading to "apples"; for others, the "e" isn't a stopping place, as in "parties." All the ones where the "e" is a word can unify together, and all the ones where it isn't can unify together. Then you work your way backwards from there: what letters led into that "e"? So from the ends of the words backwards, you find everything that ends in an "s," ends in an "es," ends in an "ies," and unify them up, making sure they have all the same outgoing connections. You do have to be careful: "running" ends in "ing" and so does "swimming," but there's "swimmingly" and no "runningly," so two nodes are only unifiable if everything from that point down to the bottom is identical. It's a pretty neat process; it's essentially DFA minimization, so if you want to learn a little more about it, that's the term to look up. And if I do that, building this as a DAWG, I'm still using the same structure definition, but by unifying suffixes as well as prefixes, the number of nodes I need changes drastically: it goes down to 80,000 from my initial 250,000. And think about the number of words again: there were about 125,000 words in the dictionary, and the DAWG has just 80,000 nodes.
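Here is a compact sketch of that merge-by-equivalence idea on a pointer-based trie; the names and the signature encoding are my own illustration of the technique, not the actual build code:

#include <cstdint>
#include <map>
#include <string>

// Hypothetical trie node: 'children' maps a letter to a child subtree,
// 'isWord' marks the end of a valid word.
struct Node {
    bool isWord = false;
    std::map<char, Node*> children;
};

// Merge equivalent subtrees bottom-up. Two nodes can be unified when they
// agree on isWord and have identical (letter, unified-child) sets. The
// registry maps each signature to its one canonical representative.
// (A real build would also free the duplicates; this sketch leaks them.)
Node* unify(Node* node, std::map<std::string, Node*>& registry) {
    std::string signature = node->isWord ? "1" : "0";
    for (auto& [letter, child] : node->children) {
        child = unify(child, registry);   // canonicalize children first
        // Canonical children are unique, so their addresses identify them.
        signature += letter + std::to_string(reinterpret_cast<uintptr_t>(child)) + ";";
    }
    auto [it, inserted] = registry.try_emplace(signature, node);
    return it->second;  // an existing equivalent node, or this one if new
}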
So once you unify the suffixes, many words don't even have any nodes uniquely their own; they exist only as portions of other words that have all been joined together and compressed into one package. 80,000 nodes total, about two-thirds the number of words. With ten-byte nodes, 80,000 of them, we are down to under a megabyte. We've crossed an interesting barrier: the data fed into it, the word list on disk, was over a megabyte, and we've now created a representation that is genuinely compressed, using less space than the text file did in the first place. And that's because it finds all the ways in which words are like each other and shares, using the existing structure in the DAWG to absorb new words without adding bulk. Having done this, I have to say I've made the data structure even less flexible. To be clear, each of these steps has a cost somewhere else: the code got more complicated, and now that there's all this sharing at the back end, adding a new word takes care. Go back to that picture: suppose I go to add the word "portioning," which is a word, while "adoptioning" is not. Adding "portioning" requires splitting it off of its existing unified branch, because if I just tacked an "ing" onto the shared ending of "portion," then "adaptioning" and "adoptioning" would suddenly become words too. So once it's unified this way, I have to be extra careful that adding new words doesn't accidentally add some additional words I didn't plan on, because these paths are shared. It requires more careful management. Again, though, the goal here was to build a data structure that was kind of freeze-dried: I build it from the input data file I have, it then operates very, very efficiently, and I'm not planning on making a lot of runtime changes to it.
I get it down to this, and then the last thing I'm going to do is go back to the CS side and see if I can squeeze down what's being stored per node. I have 80,000 of these. Typically, when you look at a structure, it's very rare that changing one field from, say, a two-byte value to a one-byte char has any profound impact on the total size needed. But if you happen to know you're going to make a lot of these structures, tens of thousands, almost 100,000 of them, then every byte counts. If I can get this thing down from ten bytes to eight, or to six or four, it cuts my memory by quite a factor. So what I did in the final version is use a little technique called bit packing, which is stashing things in the absolute minimum space they fit into. It takes advantage of facts like this: there are only 26 letters in the alphabet, but a char is a full eight-bit value that can hold numbers from zero to 255, and I just don't need that much space. I'm not using the yen sign, I'm not using the parentheses, I'm not using the digits. The capacity to store all those distinct characters is a cost I'm paying on every character for nothing. So I squeezed it down by dividing things up at the sub-byte level. This is touched on a little in Chapter 15 if you're at all curious; it's not what I consider core material for 106B, but I mention it to complete the picture. With five bits, each either zero or one, five of them together give me 2^5, or 32, different patterns, which is enough for my 26 characters plus a little slack. And I use exactly one bit for is-word, rather than a full Boolean. All I need to know is yes or no.
I also changed how you know you're the last child of a generation. Rather than having A say "there are 12 children here," I just say: go to index 27 and start reading, and one of the children will tell you it's the last. Each one says "no, I'm not the last," "no, I'm not the last," and then the final one is marked "yes," and that's how you know where A's children end and the next node's begin. Getting it down this way, I've taken what was a ten-byte structure down to four. And it turns out there are often reasons why certain size boundaries are the critical ones. If it were five bytes, it would actually likely occupy eight, because of what are called alignment and padding restrictions. Squeezing from seven down to six might make no change, and down to five still no change; behind the scenes, each structure may be artificially rounded up to eight. But getting it down to four tends to cross another of those critical barriers, where four actually is smaller than what ten was. So I take my 80,000 nodes, get them to four bytes each, and now I have a data structure that's about a third of a megabyte in memory, as opposed to the over-a-megabyte text form on disk. And that's how it does what it does. If you go look at the code, you'll see this compact bit-packed structure, and lookup works character by character with a recursive strategy: start at the top, find the matching character, go to its children's index, scan those children for the next matching character, and so on down through the data structure. The child-index field is where, for example, A records that its children start at 27, or B that its start at 39. I just need a number there, and I use something slightly smaller than a full integer, to make everything fit very tightly.
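Put together, the packed node might look something like this; the exact field widths and layout are my guess at a plausible split (bitfield layout is compiler-dependent), not necessarily the real lexicon.dat format:

#include <cstdint>

// One DAWG node bit-packed into 32 bits: 5 bits of letter, one is-word
// flag, one last-child flag (replacing the child count), and the index
// of the first child. 25 index bits cover ~33 million nodes, far more
// than the 80,000 we need.
struct PackedNode {
    uint32_t letter      : 5;   // 2^5 = 32 patterns, enough for 26 letters
    uint32_t isWord      : 1;   // does the path to this node spell a word?
    uint32_t isLastChild : 1;   // marks the end of the parent's child block
    uint32_t firstChild  : 25;  // array index where this node's children start
};
static_assert(sizeof(PackedNode) == 4, "should pack into four bytes");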
So I'll tell you a couple of interesting things about it before I close this off. The lexicon.dat file is just an image of what the in-memory DAWG data structure looks like, so it's very easy to dump the whole thing. It has no pointers in it. Pointers tend to make data structures harder to dehydrate to disk and rebuild, because they have to be wired up in memory; you don't know what memory you're going to get from new and what the addresses are. This one has no pointers at all. It's self-contained, one big array of these blocks, and all the information in it is relative: go to index 27, go to index 45. So you can take the whole block and just write it to disk, a third of a megabyte or so worth, read the same amount back in, and it's rehydrated without any extra effort. That's what lexicon.dat is: a dehydrated form of the DAWG that was built in memory, and it makes loading very, very fast. At that point, though, it's super inflexible. If you wanted to add a word, given the way the structure is built, you'd have to add nodes while making sure you weren't glomming onto existing words nearby, so it would be a lot of work to actually edit the DAWG in place.
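A sketch of that dehydrate/rehydrate step, reusing the PackedNode above; the file name, format details, and error handling are illustrative, not the actual lexicon.dat layout:

#include <cstdio>
#include <vector>

// Because the flattened DAWG holds only relative indices, never pointers,
// one block write captures the whole structure.
bool saveDawg(const std::vector<PackedNode>& nodes, const char* path) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    size_t n = std::fwrite(nodes.data(), sizeof(PackedNode), nodes.size(), f);
    std::fclose(f);
    return n == nodes.size();
}

// One block read brings it back; the file size tells us the node count.
std::vector<PackedNode> loadDawg(const char* path) {
    std::vector<PackedNode> nodes;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return nodes;
    std::fseek(f, 0, SEEK_END);
    long bytes = std::ftell(f);                 // node count * sizeof(PackedNode)
    std::fseek(f, 0, SEEK_SET);
    nodes.resize(bytes / sizeof(PackedNode));
    size_t got = std::fread(nodes.data(), sizeof(PackedNode), nodes.size(), f);
    std::fclose(f);
    if (got != nodes.size()) nodes.clear();     // short read: treat as failure
    return nodes;
}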
In fact, when you ask it to add a word, it doesn't edit the DAWG at all. You say you'd like to put a new word in, and it says: this one's so perfect the way it is; I've built this one, I love this one, I'm very happy with it. If you want to put other words in, it just keeps a second data structure on the side, an auxiliary one, where it sticks the additional words you ask it to add. I have to say that seems like a very obvious thing to do, but in fact, I didn't even think of it for years. We had the lexicon, and I said: you can't use the lexicon to make new word lists or edit them; you can only load this one beautiful, perfect one. And people kept saying, "I wish I could put some words in the lexicon. I wish I could use it for the word lists that keep track of the human and computer players." And I'm like, "You just can't do it. You don't understand. It's very complicated. You just don't understand my pain." Whatever. I had excuses. And then, after some number of years of just being a baby about it: oh, of course, why don't I just keep another structure off to the side? It's an abstraction. It's a word list. You don't need to know what I do. The fact that I keep a DAWG for some of the words and just an ordinary set for the others is completely irrelevant, as long as when you ask for a word, I look in both and tell you whether it's there. Why do you need to know where I keep it? So it took me a while to buy abstraction, even though I teach it all the time; there's a little sketch of the idea below. It turns out I learned about this structure from a paper I read in the early '90s about somebody building a Scrabble player. They wanted a dictionary that supported finding plays on the board and extending words, so knowing about root words, prefixes, and suffixes was the original motivation. And this kind of data structure is really great for any kind of word play: anagrams you want to rearrange, words you want to extend past a prefix or suffix. It's optimized for that kind of traversal, leveraging the existing patterns in words to save space while doing so. So it was a really fun little project, and neat to think about. Even though today, if you needed a lexicon, you could just use a set: it's fast enough, and it takes a ton of memory, but memory's cheap. This is not the kind of thing you would need to do today, except in some embedded, low-memory, very tight processor environment. But it still makes for an interesting artifact to think about: how you get there, and what kinds of techniques you can use.
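And here is that abstraction in miniature: one word-list interface hiding two representations. A hypothetical sketch, with dawgContains standing in for the packed-array walk shown earlier:

#include <set>
#include <string>

// One lexicon abstraction over two stores: the frozen DAWG for the
// shipped dictionary, plus an ordinary set for words added afterwards.
class Lexicon {
public:
    bool contains(const std::string& word) const {
        return dawgContains(word) || extras.count(word) > 0;
    }
    void add(const std::string& word) {
        if (!contains(word)) extras.insert(word);  // never edits the DAWG
    }
private:
    bool dawgContains(const std::string& word) const {
        // would walk the packed node array as sketched earlier; stubbed here
        return false;
    }
    std::set<std::string> extras;  // the auxiliary word list on the side
};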
So Wednesday we will talk about some big-picture design, some wrap-up stuff, and I will see you then.