1 00:00:00,000 --> 00:00:11,000 2 00:00:11,000 --> 00:00:14,000 This presentation is delivered by the Stanford Center for Professional 3 00:00:14,000 --> 00:00:25,000 Development. 4 00:00:25,000 --> 00:00:27,000 Okay, we got volume and everything. All 5 00:00:27,000 --> 00:00:28,000 right, here we go. Lots of you think your 6 00:00:28,000 --> 00:00:32,000 learning things that aren't gonna actually someday help you in your future 7 00:00:32,000 --> 00:00:34,000 career. Let's say you plan to be president. You 8 00:00:34,000 --> 00:00:38,000 think, ''Should I take 106b; should I not take 106b?'' Watch this interview and 9 00:00:38,000 --> 00:00:39,000 you will see. [Video] 10 00:00:39,000 --> 00:00:41,000 We have questions, and we ask 11 00:00:41,000 --> 00:00:45,000 our candidates questions, and this one is from Larry Schwimmer. What 12 00:00:45,000 --> 00:00:48,000 - you 13 00:00:48,000 --> 00:00:52,000 guys think I'm kidding; it's right here. 14 00:00:52,000 --> 00:00:59,000 What is the most efficient way to sort a million 32-bit integers? 15 00:00:59,000 --> 00:01:01,000 Well - I'm 16 00:01:01,000 --> 00:01:07,000 sorry. Maybe we should - that's not a - I think the bubble sort 17 00:01:07,000 --> 00:01:09,000 would be the wrong way to go. Come on; who told him this? 18 00:01:09,000 --> 00:01:14,000 There you go. So, apparently - 19 00:01:14,000 --> 00:01:15,000 the question was 20 00:01:15,000 --> 00:01:16,000 actually 21 00:01:16,000 --> 00:01:19,000 by a former student of mine, in 22 00:01:19,000 --> 00:01:22,000 fact, Larry Schwimmer. So apparently, he's been prepped well. As part of going on the campaign trail, not only do you 23 00:01:22,000 --> 00:01:25,000 have to know all your policy statements, you also have to have your 24 00:01:25,000 --> 00:01:29,000 N2 and login sorts kept straight in your mind, so - just so 25 00:01:29,000 --> 00:01:31,000 you know. So let me give you 26 00:01:31,000 --> 00:01:35,000 some examples of things that are graphs, just to get you thinking about 27 00:01:35,000 --> 00:01:36,000 28 00:01:36,000 --> 00:01:39,000 the data structure that we're gonna be working with and playing around with 29 00:01:39,000 --> 00:01:40,000 today. Sometimes you 30 00:01:40,000 --> 00:01:44,000 think of graphs of being those, like, bar graphs, right, your graphing 31 00:01:44,000 --> 00:01:46,000 histograms. This is actually a different kind of graph. The computer scientist's form of 32 00:01:46,000 --> 00:01:50,000 graph is just a generalized, recursive data structure 33 00:01:50,000 --> 00:01:51,000 that, kind of, 34 00:01:51,000 --> 00:01:54,000 starts with something that looks a little bit like a tree and extends it to 35 00:01:54,000 --> 00:01:55,000 this notion that there are 36 00:01:55,000 --> 00:01:57,000 any number of nodes, in 37 00:01:57,000 --> 00:02:00,000 this case, represented as the cities in this airline route map, 38 00:02:00,000 --> 00:02:04,000 and then there are connections between them, sometimes called edges or arcs that wire up 39 00:02:04,000 --> 00:02:08,000 the connections that in the case, for example, of this graph are showing you 40 00:02:08,000 --> 00:02:10,000 which cities have connecting flights. You have a direct 41 00:02:10,000 --> 00:02:15,000 flight that takes you from Las Vegas to Phoenix, right, one from Phoenix to Oklahoma 42 00:02:15,000 --> 00:02:15,000 City. 43 00:02:15,000 --> 00:02:18,000 There is not a direct flight from Phoenix to Kansas City, but there's a path 44 00:02:18,000 --> 00:02:21,000 by which you could get there by taking a sequence 45 00:02:21,000 --> 00:02:25,000 of connecting flights, right, going from Phoenix, to Oklahoma City, to Kansas City is 46 00:02:25,000 --> 00:02:26,000 one way to get there. 47 00:02:26,000 --> 00:02:30,000 And so the entire, kind of, map of airline routes forms a graph of the city, 48 00:02:30,000 --> 00:02:34,000 and there are a lot of interesting questions that having represented that data you might want to 49 00:02:34,000 --> 00:02:35,000 answer about finding 50 00:02:35,000 --> 00:02:36,000 flights that are 51 00:02:36,000 --> 00:02:40,000 cheap, or meet your time specification, or get you where you want to go, 52 00:02:40,000 --> 00:02:47,000 and that this data would be, kind of, primary in solving that problem. 53 00:02:47,000 --> 00:02:48,000 Another one, 54 00:02:48,000 --> 00:02:49,000 right, another 55 00:02:49,000 --> 00:02:52,000 idea of something that's represented as a graph is something that would support 56 00:02:52,000 --> 00:02:54,000 the notion of word ladders, those little puzzles where you're 57 00:02:54,000 --> 00:02:56,000 given the word ''chord,'' 58 00:02:56,000 --> 00:03:00,000 and you want to change it into the word ''worm,'' and you have these rules that say, well, you 59 00:03:00,000 --> 00:03:02,000 can change one letter at a time. 60 00:03:02,000 --> 00:03:04,000 And so the 61 00:03:04,000 --> 00:03:07,000 connections in the case of the word letter have to do with words that are 62 00:03:07,000 --> 00:03:12,000 one character different from their neighbors, have direct connections or arcs 63 00:03:12,000 --> 00:03:16,000 between them. Each of the nodes themselves represents a word, and then 64 00:03:16,000 --> 00:03:19,000 paths between that represent word ladders. That if you can go from here 65 00:03:19,000 --> 00:03:22,000 through a sequence of steps directly connecting your way, each of those stepping 66 00:03:22,000 --> 00:03:26,000 stones is a word in the chain that makes a word ladder. So 67 00:03:26,000 --> 00:03:29,000 things you might want to solve there is what's the shortest word ladder? How many different 68 00:03:29,000 --> 00:03:31,000 word ladders - how many different ways can I go from this word to that one 69 00:03:31,000 --> 00:03:36,000 could be answered by searching around in a space like this. 70 00:03:36,000 --> 00:03:39,000 Any kind of prerequisite structure, 71 00:03:39,000 --> 00:03:42,000 such as the one for the major that might say you need to take this class before you take 72 00:03:42,000 --> 00:03:45,000 that class, and this class before that class, right, 73 00:03:45,000 --> 00:03:46,000 forms a 74 00:03:46,000 --> 00:03:49,000 structure that also fits into the main graph. 75 00:03:49,000 --> 00:03:52,000 In this case, there's a connection to those arcs. 76 00:03:52,000 --> 00:03:54,000 That the prerequisite is such that 77 00:03:54,000 --> 00:03:58,000 you take 106a, and it leads into 106b, and that connection doesn't really go in 78 00:03:58,000 --> 00:04:01,000 the reverse, right? You don't start in 106b and move backwards. 79 00:04:01,000 --> 00:04:05,000 Whereas in the case, for example, of the word ladder, all of these arcs are what we called undirected, 80 00:04:05,000 --> 00:04:09,000 right? You can traverse from wood to word and back again, right? 81 00:04:09,000 --> 00:04:10,000 They're equally 82 00:04:10,000 --> 00:04:12,000 83 00:04:12,000 --> 00:04:15,000 visitable or neighbors, in this case. So there may be a situation where you have directed 84 00:04:15,000 --> 00:04:20,000 arcs where they really imply a one-way directionality through the paths. 85 00:04:20,000 --> 00:04:22,000 Another one, lots and lots of 86 00:04:22,000 --> 00:04:23,000 arrows that 87 00:04:23,000 --> 00:04:25,000 tell you about the lifecycle of 88 00:04:25,000 --> 00:04:26,000 89 00:04:26,000 --> 00:04:30,000 Stanford relationships all captured on one 90 00:04:30,000 --> 00:04:31,000 slide 91 00:04:31,000 --> 00:04:34,000 in, sort of, the flowchart. So flowcharts have a lot of things about directionality. Like, you're 92 00:04:34,000 --> 00:04:37,000 in this state, and you can move to this state, so that's your neighboring state, and 93 00:04:37,000 --> 00:04:39,000 the arcs in there connect those things. 94 00:04:39,000 --> 00:04:43,000 And so, as you will see, like, you get to start over here and, like, be single and love 95 00:04:43,000 --> 00:04:47,000 it, and then you can, kind of, flirt with various degrees of seriousness, 96 00:04:47,000 --> 00:04:50,000 and then eventually there evolves a trombone and serenade, which, you know, 97 00:04:50,000 --> 00:04:53,000 you can't go back to flirting aimlessly after the trombone and serenade; there's no 98 00:04:53,000 --> 00:04:55,000 arc that direction. 99 00:04:55,000 --> 00:04:57,000 When somebody gets the trombone out, 100 00:04:57,000 --> 00:04:59,000 you know, you take them seriously, 101 00:04:59,000 --> 00:05:02,000 or you completely blow them off. There's the [inaudible] there. 102 00:05:02,000 --> 00:05:04,000 And then there's dating, and then there's 103 00:05:04,000 --> 00:05:08,000 two-phase commit, which is a very important part of any computer scientist's education, 104 00:05:08,000 --> 00:05:10,000 and then, potentially, other 105 00:05:10,000 --> 00:05:13,000 tracks of children and what not. But it does 106 00:05:13,000 --> 00:05:15,000 show you, kind of, places you can go, right? 107 00:05:15,000 --> 00:05:18,000 It turns out you actually don't get to go from flirt aimlessly to have baby, so 108 00:05:18,000 --> 00:05:20,000 write that down - 109 00:05:20,000 --> 00:05:23,000 not allowed in this graph. And 110 00:05:23,000 --> 00:05:28,000 then it turns out some things that you've already seen are actually graphs in disguise. 111 00:05:28,000 --> 00:05:32,000 I didn't tell you when I gave you the maze problem and its solution, the solving and 112 00:05:32,000 --> 00:05:33,000 building of a maze, 113 00:05:33,000 --> 00:05:37,000 that, in fact, what you're working on though is something that is a graph. 114 00:05:37,000 --> 00:05:39,000 That a perfect maze, right, 115 00:05:39,000 --> 00:05:40,000 so we started with this 116 00:05:40,000 --> 00:05:44,000 fully disconnected maze. If you remember how the Aldis/Broder algorithm worked, it's like 117 00:05:44,000 --> 00:05:47,000 you have all these nodes that have no connections between them, 118 00:05:47,000 --> 00:05:51,000 and then you went bopping around breaking down walls, which is effectively 119 00:05:51,000 --> 00:05:53,000 connecting up the nodes to each other, 120 00:05:53,000 --> 00:05:57,000 and we kept doing that until actually all of the nodes were connected. So, in effect, we 121 00:05:57,000 --> 00:06:00,000 were building a graph out of this big grid of cells 122 00:06:00,000 --> 00:06:03,000 by knocking down those walls to build the connections, and when we were done, then 123 00:06:03,000 --> 00:06:07,000 we made it so that it was possible to trace paths starting from that lower corner 124 00:06:07,000 --> 00:06:10,000 and working our way through the neighbors all the way over to there. 125 00:06:10,000 --> 00:06:13,000 So in the case of this, if you imagine that each of these cells was broken out in this little 126 00:06:13,000 --> 00:06:16,000 thing, there'd be these neighboring connections where this guy has this 127 00:06:16,000 --> 00:06:19,000 neighbor but no other ones because actually it has no 128 00:06:19,000 --> 00:06:22,000 other directions you can move from there, but some of them, for example like this 129 00:06:22,000 --> 00:06:23,000 one, 130 00:06:23,000 --> 00:06:26,000 has a full set of four neighbors you can reach 131 00:06:26,000 --> 00:06:28,000 depending on which walls are intact or not. 132 00:06:28,000 --> 00:06:29,000 And so the 133 00:06:29,000 --> 00:06:31,000 creation of that was creating a graph out 134 00:06:31,000 --> 00:06:34,000 of a set of disconnected nodes, and then solving it is searching that 135 00:06:34,000 --> 00:06:38,000 graph for a path. Here's one that is a little horrifying that came from the newspaper, which is social networking, kind of, a very big [inaudible] of, like, on the net who do you know, and who do they know, and how do those connections, like, linked in maybe help you get a job. This is one that was actually based on the liaisons, let's say, of a group of high school students, right, and it turns out there was a lot of more interconnected in that graph than I would expect, or appreciate, or approve of but interesting to, kind of, look at. So 136 00:06:38,000 --> 00:06:39,000 let's talk, like, concretely. 137 00:06:39,000 --> 00:06:43,000 How are we gonna make stuff work on graphs? Graphs turn out to actually - by showing you that, I'm trying to 138 00:06:43,000 --> 00:06:46,000 get you this idea that a lot of things are really just graphs in disguise. That a lot 139 00:06:46,000 --> 00:06:50,000 of the problems you may want to solve, if you can represent it as a graph, then things 140 00:06:50,000 --> 00:06:53,000 you learn about how to manipulate graphs will help you to solve that kind of 141 00:06:53,000 --> 00:06:54,000 problem. 142 00:06:54,000 --> 00:06:59,000 So the basic idea of a graph, it is a recursive data structure based on the node where 143 00:06:59,000 --> 00:07:02,000 nodes have connections to other nodes. 144 00:07:02,000 --> 00:07:04,000 Okay. So that's, kind of, like a tree, 145 00:07:04,000 --> 00:07:07,000 but it actually is more freeform than a tree, right? 146 00:07:07,000 --> 00:07:11,000 In a tree, right, there's that single root node that has, kind of, a special place, and 147 00:07:11,000 --> 00:07:15,000 there's this restriction about the connections that there's a single path 148 00:07:15,000 --> 00:07:17,000 from every other node in the tree 149 00:07:17,000 --> 00:07:20,000 starting from the root that leads to it. So you never can get there by 150 00:07:20,000 --> 00:07:24,000 multiple ways, and there's no cycles, and it's all connected, and things like 151 00:07:24,000 --> 00:07:25,000 that. 152 00:07:25,000 --> 00:07:26,000 So, in terms of a graph, 153 00:07:26,000 --> 00:07:29,000 it just doesn't place those restrictions. It says there's a node. 154 00:07:29,000 --> 00:07:32,000 It has any number of pointers to other nodes. It potentially could even 155 00:07:32,000 --> 00:07:36,000 have zero pointers to other's nodes, so it could be on a little island of its own out 156 00:07:36,000 --> 00:07:38,000 here that has no incoming or outgoing flights, right? You 157 00:07:38,000 --> 00:07:40,000 have to swim if you want to get there. 158 00:07:40,000 --> 00:07:42,000 159 00:07:42,000 --> 00:07:45,000 And then there's also not a rule that says that you can't have 160 00:07:45,000 --> 00:07:47,000 multiple ways to get to the same 161 00:07:47,000 --> 00:07:50,000 node. So if you're at B, and you're interested in getting to C, 162 00:07:50,000 --> 00:07:52,000 you could take the direction connection here, 163 00:07:52,000 --> 00:07:54,000 but you could also take a longer path 164 00:07:54,000 --> 00:07:58,000 that bops through A and hits to C, and so there's more than one way to get there, 165 00:07:58,000 --> 00:07:59,000 166 00:07:59,000 --> 00:08:02,000 which would not be true in a tree structure where it has that constraint, 167 00:08:02,000 --> 00:08:05,000 but that adds some interesting challenges to solving problems in the graphs 168 00:08:05,000 --> 00:08:08,000 because there actually are so many different ways you can 169 00:08:08,000 --> 00:08:11,000 get to the same place. We have to be careful avoiding 170 00:08:11,000 --> 00:08:13,000 getting into circularities where we're 171 00:08:13,000 --> 00:08:16,000 doing that, and also not redundantly doing work we've already done before because 172 00:08:16,000 --> 00:08:18,000 we've visited it previously. 173 00:08:18,000 --> 00:08:21,000 There can be cycles where 174 00:08:21,000 --> 00:08:22,000 you can 175 00:08:22,000 --> 00:08:24,000 completely come around. I this one, actually, 176 00:08:24,000 --> 00:08:26,000 doesn't have a cycle in it. 177 00:08:26,000 --> 00:08:27,000 Nope, it doesn't, 178 00:08:27,000 --> 00:08:31,000 but if I add it - I could add a link, for example, from 179 00:08:31,000 --> 00:08:34,000 C back to B, and then there would be this place you could just, kind of, keep 180 00:08:34,000 --> 00:08:36,000 rolling around in the B, A, C 181 00:08:36,000 --> 00:08:38,000 area. 182 00:08:38,000 --> 00:08:41,000 So we call each of those guys, the circles here, the nodes, right? 183 00:08:41,000 --> 00:08:44,000 The connection between them arcs. We 184 00:08:44,000 --> 00:08:47,000 will talk about the arcs as being directed and undirected in different situations 185 00:08:47,000 --> 00:08:51,000 where we imply that there's a directionality; you can travel the arc only one-way, 186 00:08:51,000 --> 00:08:54,000 or in cases where it's undirected, it means it goes both ways. 187 00:08:54,000 --> 00:08:56,000 We'll talk about two nodes being connected, 188 00:08:56,000 --> 00:08:58,000 meaning there is a direct connection, an arc, between them, 189 00:08:58,000 --> 00:09:02,000 and if they are not directly connected, there may possibly be a path 190 00:09:02,000 --> 00:09:05,000 between them. A sequence of arcs forms a path that you can follow to get 191 00:09:05,000 --> 00:09:07,000 from one node to another, 192 00:09:07,000 --> 00:09:11,000 and then cycles just meaning a path that revisits one of the nodes that 193 00:09:11,000 --> 00:09:14,000 was previously traveled on that path. So I've got 194 00:09:14,000 --> 00:09:16,000 a little terminology. 195 00:09:16,000 --> 00:09:18,000 Let's talk a little bit about how we're gonna represent it. 196 00:09:18,000 --> 00:09:22,000 So before we start making stuff happen on them, it's, like, well, how does this - 197 00:09:22,000 --> 00:09:26,000 in C++, right, what's the best way, or what are the reasonable ways you 198 00:09:26,000 --> 00:09:27,000 might represent 199 00:09:27,000 --> 00:09:30,000 this kind of connected structure? 200 00:09:30,000 --> 00:09:31,000 So if I have 201 00:09:31,000 --> 00:09:36,000 this four-node graph, A, B, C, D over where with the connections that are shown with 202 00:09:36,000 --> 00:09:38,000 direction, in this case, I have directed arcs. 203 00:09:38,000 --> 00:09:41,000 That one way to think of it is that what a 204 00:09:41,000 --> 00:09:45,000 graph really is is a set of nodes and a set of arcs, and by set, I mean, sort of, just the 205 00:09:45,000 --> 00:09:48,000 generalization. There might be a vector of arcs; there might be a vector of 206 00:09:48,000 --> 00:09:51,000 nodes, whatever, but some collection of arcs, some collection of nodes, 207 00:09:51,000 --> 00:09:53,000 and that the 208 00:09:53,000 --> 00:09:56,000 nodes might have information about what location they represent, or 209 00:09:56,000 --> 00:09:59,000 what word, or what entity, you know, in a social network that they are, 210 00:09:59,000 --> 00:10:03,000 and then the arcs would show who is directly connected to whom. 211 00:10:03,000 --> 00:10:06,000 And so one way to manage that, right, is just to have two independent collections - a collection 212 00:10:06,000 --> 00:10:08,000 of nodes and a 213 00:10:08,000 --> 00:10:11,000 collection of arcs, and that when I need to know things about connections, I'll have to, kind of, use both 214 00:10:11,000 --> 00:10:12,000 in tandem. 215 00:10:12,000 --> 00:10:14,000 So if I want to know if A is connected to B, 216 00:10:14,000 --> 00:10:18,000 I might actually have to just walk through the entire set of arcs trying to find 217 00:10:18,000 --> 00:10:22,000 one that connects A and B as its endpoints. If I want to see if 218 00:10:22,000 --> 00:10:25,000 C is connected to B, same thing, just walk down the whole set. 219 00:10:25,000 --> 00:10:27,000 220 00:10:27,000 --> 00:10:29,000 221 00:10:29,000 --> 00:10:31,000 That's probably the simplest thing to do, right, is to just, kind of, 222 00:10:31,000 --> 00:10:35,000 have them just be maintained as the two things that are used in conjunction. 223 00:10:35,000 --> 00:10:37,000 More likely, what you're gonna want to do 224 00:10:37,000 --> 00:10:39,000 is provide some access that 225 00:10:39,000 --> 00:10:42,000 when at a current node, you're currently trying to do some processing 226 00:10:42,000 --> 00:10:47,000 from the Node B, it would be convenient if you could easily access those nodes 227 00:10:47,000 --> 00:10:51,000 that emanate or outgo from the B node. That's typically the thing you're trying 228 00:10:51,000 --> 00:10:52,000 to do is 229 00:10:52,000 --> 00:10:55,000 at B try to visit its neighbors, 230 00:10:55,000 --> 00:10:58,000 and rather that having to, kind of, pluck them out of the entire collection, 231 00:10:58,000 --> 00:11:01,000 it might be handy if they were already, kind of, stored in such a way to make it 232 00:11:01,000 --> 00:11:03,000 easy for you to get to that. 233 00:11:03,000 --> 00:11:06,000 So the two ways that try to represent adjacency 234 00:11:06,000 --> 00:11:10,000 in a more efficient way are the adjacency list and the adjacency matrix. 235 00:11:10,000 --> 00:11:14,000 So in an adjacency list representation, we have a way of associating with each node, 236 00:11:14,000 --> 00:11:17,000 probably just by storing it in the node structure itself, 237 00:11:17,000 --> 00:11:21,000 those arcs that outgo from there. So, in this case, A has as direct 238 00:11:21,000 --> 00:11:22,000 connection to the B 239 00:11:22,000 --> 00:11:23,000 and C Nodes, 240 00:11:23,000 --> 00:11:26,000 and so I would have that information stored with the A Node, which is my 241 00:11:26,000 --> 00:11:29,000 vector or set of outgoing connections. 242 00:11:29,000 --> 00:11:33,000 Similarly, B has one outgoing connection, which goes to D. C has an outgoing connection to D, and 243 00:11:33,000 --> 00:11:37,000 then D has outgoing connections that lead back to A and 244 00:11:37,000 --> 00:11:40,000 to B. So at a particular node I could say, well, who's my neighbors? 245 00:11:40,000 --> 00:11:41,000 It would be 246 00:11:41,000 --> 00:11:46,000 easily accessible and not have to be retrieved, or searched, or filtered. 247 00:11:46,000 --> 00:11:48,000 Another way of doing that 248 00:11:48,000 --> 00:11:51,000 is to realize that, in a sense, what we have is this, kind of, 249 00:11:51,000 --> 00:11:53,000 end by end grid 250 00:11:53,000 --> 00:11:56,000 where each node is potentially connected to 251 00:11:56,000 --> 00:12:00,000 the other N -1 nodes in that graph. And so if I just build 252 00:12:00,000 --> 00:12:03,000 a 2x2 matrix that's labeled with 253 00:12:03,000 --> 00:12:06,000 all the node names on this side and all the node names across the top, that the 254 00:12:06,000 --> 00:12:08,000 intersection of this says 255 00:12:08,000 --> 00:12:11,000 is there an arc that leads away from A and 256 00:12:11,000 --> 00:12:14,000 ends at B? Yes, there is. Is there one from A to C? 257 00:12:14,000 --> 00:12:18,000 And in the places where there is not an existing connection, a self loop or 258 00:12:18,000 --> 00:12:19,000 259 00:12:19,000 --> 00:12:23,000 just a connection that doesn't exist, right, I have an empty slot. So maybe these would be Booleans 260 00:12:23,000 --> 00:12:25,000 true and false or something, or maybe there'd be some more information here 261 00:12:25,000 --> 00:12:27,000 about the distance 262 00:12:27,000 --> 00:12:30,000 or other associated arc properties for that 263 00:12:30,000 --> 00:12:31,000 particular connection. 264 00:12:31,000 --> 00:12:34,000 In this case, it's actually very easy then to actually do the really quick 265 00:12:34,000 --> 00:12:35,000 search of 266 00:12:35,000 --> 00:12:38,000 I'm at A and I want to know if I can get to B, then it's really just a 267 00:12:38,000 --> 00:12:41,000 matter of reaching right into the 268 00:12:41,000 --> 00:12:45,000 slot in constant time in that matrix to pull it out, and 269 00:12:45,000 --> 00:12:47,000 so 270 00:12:47,000 --> 00:12:49,000 this one involves a lot of searching to find things. This one involves, perhaps, a 271 00:12:49,000 --> 00:12:52,000 little bit of searching. If I want to know if A is connected to B, I might have to look 272 00:12:52,000 --> 00:12:55,000 through all its outgoing connections, right? This one gives me that direct 273 00:12:55,000 --> 00:12:56,000 access, 274 00:12:56,000 --> 00:13:00,000 but the tradeoffs here has to do with where the space starts building up, right? 275 00:13:00,000 --> 00:13:03,000 A full NxN matrix could be very big, 276 00:13:03,000 --> 00:13:06,000 and sometimes we're allocating space as though there are a full set of 277 00:13:06,000 --> 00:13:09,000 connections between all the nodes, every node connected to every other, 278 00:13:09,000 --> 00:13:12,000 and in the cases where a lot of those things are false, there's a lot of 279 00:13:12,000 --> 00:13:13,000 capacity 280 00:13:13,000 --> 00:13:16,000 that we have set aside that we're not really tapping into. 281 00:13:16,000 --> 00:13:19,000 And so in the case of a graph that we call dense, 282 00:13:19,000 --> 00:13:23,000 where there's a lot of connections, it will use a lot of that capacity, and it might make sense 283 00:13:23,000 --> 00:13:27,000 in a graph that's more sparse, where there's a few connections. So you have 284 00:13:27,000 --> 00:13:31,000 thousands of nodes, but maybe only three or four on average are connected to one. 285 00:13:31,000 --> 00:13:35,000 Then having allocated 1,000 slots of which to only fill in three 286 00:13:35,000 --> 00:13:36,000 might become 287 00:13:36,000 --> 00:13:40,000 rather space inefficient. You might prefer an adjacency list representation that 288 00:13:40,000 --> 00:13:41,000 can get you 289 00:13:41,000 --> 00:13:44,000 to the ones you're actually using versus the other, 290 00:13:44,000 --> 00:13:47,000 but in terms of the adjacency matrices is very fast, 291 00:13:47,000 --> 00:13:48,000 right? A lot of space thrown at this 292 00:13:48,000 --> 00:13:51,000 gives you this immediate 0,1 access to know 293 00:13:51,000 --> 00:13:55,000 who your neighbors are. So 294 00:13:55,000 --> 00:13:59,000 we're gonna use the adjacency list, kind of, strategy here 295 00:13:59,000 --> 00:14:02,000 for the code we're gonna do today, and it's, kind 296 00:14:02,000 --> 00:14:07,000 of, a good compromise between the two alternatives that we've seen there. 297 00:14:07,000 --> 00:14:08,000 So I have a struct node. 298 00:14:08,000 --> 00:14:11,000 It has some information for the node. Given that I'm not being very specific 299 00:14:11,000 --> 00:14:14,000 about what the graph is, I'm just gonna, kind of, leave that unspecified. It might be that it's 300 00:14:14,000 --> 00:14:17,000 the city name. It might be that it's a word. It might be that it's a person in a 301 00:14:17,000 --> 00:14:19,000 social network, you know, 302 00:14:19,000 --> 00:14:23,000 some information that represents what this node is an entity, 303 00:14:23,000 --> 00:14:28,000 and then there's gonna be a set of arcs or a set of nodes it's connected to, 304 00:14:28,000 --> 00:14:32,000 and so I'm using vector here. I could be using set. I could be using a 305 00:14:32,000 --> 00:14:35,000 raw array, or a link list, or any of these things. 306 00:14:35,000 --> 00:14:38,000 I need to use some unbounded collection though because there's no guarantee that 307 00:14:38,000 --> 00:14:42,000 they will be 0, or 1, or 2 the way there is in a link list or a tree where there's a 308 00:14:42,000 --> 00:14:45,000 specified number of outgoing links. There can be any number of them, so I'm just leaving 309 00:14:45,000 --> 00:14:46,000 310 00:14:46,000 --> 00:14:47,000 a 311 00:14:47,000 --> 00:14:51,000 variable size collection here to do that work for me. 312 00:14:51,000 --> 00:14:52,000 The graph itself, 313 00:14:52,000 --> 00:14:55,000 so I have this idea of all the nodes in the graph, right, would each get a new 314 00:14:55,000 --> 00:14:58,000 node structure and then be wired up to the other nodes that they are connected 315 00:14:58,000 --> 00:15:00,000 to, 316 00:15:00,000 --> 00:15:02,000 but then when I'm trying to operate on that graph, 317 00:15:02,000 --> 00:15:06,000 I can't just take one pointer and say here's a pointer to some node in the 318 00:15:06,000 --> 00:15:09,000 graph and say this is - and from here you have accessed everything. In the way 319 00:15:09,000 --> 00:15:11,000 that a tree, 320 00:15:11,000 --> 00:15:13,000 the only thing we need to keep track of is the pointer to the root, 321 00:15:13,000 --> 00:15:17,000 or a link list, the only thing we keep a pointer to, typically, is that frontmost 322 00:15:17,000 --> 00:15:19,000 cell, and from there, we can reach everything else. 323 00:15:19,000 --> 00:15:23,000 There is no special head or root cell in a graph. 324 00:15:23,000 --> 00:15:27,000 A graph is this 325 00:15:27,000 --> 00:15:30,000 being, kind of, a crazy collection without a lot of rules about 326 00:15:30,000 --> 00:15:35,000 how they are connected. In fact, it's not even guaranteed that they totally are 327 00:15:35,000 --> 00:15:36,000 connected. 328 00:15:36,000 --> 00:15:38,000 That I can't 329 00:15:38,000 --> 00:15:41,000 guarantee that if I had a pointer, for example, to C, 330 00:15:41,000 --> 00:15:44,000 right, I may or may not be able to reach all the nodes from there. 331 00:15:44,000 --> 00:15:49,000 If I had a pointer to A, right, A can get to C and B, and then also 332 00:15:49,000 --> 00:15:52,000 down to D, but, for example, there could just be an E node over here that was only 333 00:15:52,000 --> 00:15:55,000 connected to C or not connected to anything at all for that matter, connected 334 00:15:55,000 --> 00:15:57,000 to F in their own little island. 335 00:15:57,000 --> 00:16:01,000 That there is no way to identify some special node from which you could reach 336 00:16:01,000 --> 00:16:02,000 all the nodes. 337 00:16:02,000 --> 00:16:04,000 There isn't a special root node, 338 00:16:04,000 --> 00:16:07,000 and you can't even just arbitrarily pick one to be the root because actually there's no 339 00:16:07,000 --> 00:16:10,000 guarantee that it will be connected to all of the others. So it would not give you 340 00:16:10,000 --> 00:16:13,000 access to the full entirety of your graph if you just picked on and said I want anything 341 00:16:13,000 --> 00:16:16,000 reachable from here, it might not get you the whole thing. So, 342 00:16:16,000 --> 00:16:18,000 typically, what you need to really operate on is 343 00:16:18,000 --> 00:16:21,000 the entire collection of node stars. You'll have a set or a vector 344 00:16:21,000 --> 00:16:24,000 that contains all of the nodes that are in the graph, 345 00:16:24,000 --> 00:16:26,000 and then within that, there's a bunch of other connections that are 346 00:16:26,000 --> 00:16:35,000 being made on the graph connectivity that you're also exploring. 347 00:16:35,000 --> 00:16:37,000 So let's make it a little bit more concrete 348 00:16:37,000 --> 00:16:39,000 349 00:16:39,000 --> 00:16:40,000 and talk a little bit about 350 00:16:40,000 --> 00:16:43,000 what really is a node; what really is an arc? 351 00:16:43,000 --> 00:16:46,000 There may be, actually, be some more information I need to store than just a 352 00:16:46,000 --> 00:16:48,000 node is connected to other nodes. 353 00:16:48,000 --> 00:16:51,000 So in the case of a list or a tree, 354 00:16:51,000 --> 00:16:54,000 the next field and the left and the right field are really just pointers for the 355 00:16:54,000 --> 00:16:56,000 purpose of organization. 356 00:16:56,000 --> 00:16:58,000 That's the pointer to the next cell. 357 00:16:58,000 --> 00:17:02,000 There's no data that is really associated with that link itself. 358 00:17:02,000 --> 00:17:05,000 That's not true in the case of a graph. 359 00:17:05,000 --> 00:17:09,000 That often what the links that connect up the nodes 360 00:17:09,000 --> 00:17:12,000 actually do have a real role to play in terms of being data. 361 00:17:12,000 --> 00:17:14,000 It may be that what they tell you is 362 00:17:14,000 --> 00:17:18,000 what road you're taking here, or how long this path is, or how much this 363 00:17:18,000 --> 00:17:22,000 flight costs, or what time this flight leaves, or how full this flight already is, 364 00:17:22,000 --> 00:17:24,000 or who knows, you know, just other information 365 00:17:24,000 --> 00:17:27,000 that is more than just this nodes connected to that one. It's, like, how is it 366 00:17:27,000 --> 00:17:30,000 connected; what is it connected by, and what are the details of how that 367 00:17:30,000 --> 00:17:31,000 connection 368 00:17:31,000 --> 00:17:32,000 369 00:17:32,000 --> 00:17:34,000 is being represented? 370 00:17:34,000 --> 00:17:38,000 So it's likely that what you really want is not just pointers to other nodes but 371 00:17:38,000 --> 00:17:42,000 information about that actual link stored with an arc structure. 372 00:17:42,000 --> 00:17:44,000 So, typically, we're gonna have 373 00:17:44,000 --> 00:17:45,000 an arc 374 00:17:45,000 --> 00:17:48,000 which has information about that arc, how long, how much it costs, you know, 375 00:17:48,000 --> 00:17:50,000 what flight number it is, 376 00:17:50,000 --> 00:17:52,000 that will have pointers to the start and end node that it is connecting. 377 00:17:52,000 --> 00:17:54,000 The node 378 00:17:54,000 --> 00:17:57,000 then has a collection of arcs, 379 00:17:57,000 --> 00:18:00,000 and those arcs are expected to, in the adjacency list form, will all be arcs 380 00:18:00,000 --> 00:18:02,000 that start at this node. 381 00:18:02,000 --> 00:18:05,000 So their start node is equal to the one that's holding onto to them, and then 382 00:18:05,000 --> 00:18:06,000 they have an end 383 00:18:06,000 --> 00:18:10,000 pointer that points to the node at the other end of the arc. Well, 384 00:18:10,000 --> 00:18:13,000 this gets us into a little C++ bind 385 00:18:13,000 --> 00:18:16,000 because I've described these data structures 386 00:18:16,000 --> 00:18:17,000 in a way 387 00:18:17,000 --> 00:18:22,000 that they both depend on each other; that an arc has pointers to the start and the end 388 00:18:22,000 --> 00:18:22,000 node. 389 00:18:22,000 --> 00:18:27,000 A node has a collection of arcs, which is probably a vector or a set of arc pointers, 390 00:18:27,000 --> 00:18:28,000 and so 391 00:18:28,000 --> 00:18:31,000 I'm starting to define the structure for arc, 392 00:18:31,000 --> 00:18:33,000 and then I want to talk about node in it, 393 00:18:33,000 --> 00:18:35,000 and I think, oh, okay, well, then I better define node first because C++ always likes to 394 00:18:35,000 --> 00:18:38,000 see things in order. Remember, it always wants to see A, 395 00:18:38,000 --> 00:18:42,000 and then B if B uses A, and so in terms of how your functions and data 396 00:18:42,000 --> 00:18:45,000 structures work, you've always gotten this idea like, well, go do the one first. Well, if I say, 397 00:18:45,000 --> 00:18:48,000 okay, arc's gonna need node. I start to write arc, and I say, oh, I 398 00:18:48,000 --> 00:18:50,000 need to talk about node. Well, I better put node up here, 399 00:18:50,000 --> 00:18:54,000 right? But then when I'm starting to write node, I start talking about arc, and they have a 400 00:18:54,000 --> 00:18:57,000 circular reference to each other, right? They both 401 00:18:57,000 --> 00:19:00,000 want to depend on the other one before we've gotten around to telling the 402 00:19:00,000 --> 00:19:01,000 compiler about it. 403 00:19:01,000 --> 00:19:04,000 One of them has to go first, right? What are we gonna do? 404 00:19:04,000 --> 00:19:08,000 What we need to do is use the C++ forward reference 405 00:19:08,000 --> 00:19:09,000 as a 406 00:19:09,000 --> 00:19:12,000 little hint to the compiler 407 00:19:12,000 --> 00:19:13,000 that 408 00:19:13,000 --> 00:19:17,000 we are gonna be defining a Node T structure in a little bit; 409 00:19:17,000 --> 00:19:18,000 we're gonna get to it, 410 00:19:18,000 --> 00:19:22,000 but first we're gonna define the Arc T structure that uses that Node T, and then 411 00:19:22,000 --> 00:19:25,000 we're gonna get around to it. So we're gonna give it, like, a little heads up that says 412 00:19:25,000 --> 00:19:29,000 there later will be an Act 2 of this play. We're gonna introduce a 413 00:19:29,000 --> 00:19:31,000 character called Node T. 414 00:19:31,000 --> 00:19:33,000 So you can now 415 00:19:33,000 --> 00:19:34,000 start talking about Node 416 00:19:34,000 --> 00:19:37,000 T in some simple ways 417 00:19:37,000 --> 00:19:40,000 based on your agreement to the compiler that you plan on telling it more 418 00:19:40,000 --> 00:19:41,000 about Node T later. 419 00:19:41,000 --> 00:19:45,000 So the struct Node T says there will be a struct called Node T. 420 00:19:45,000 --> 00:19:47,000 Okay, the compiler says, okay, I'll write that down. 421 00:19:47,000 --> 00:19:51,000 You start defining struct Arc T, and it says, oh, I see you have these pointers 422 00:19:51,000 --> 00:19:54,000 to the node T. Okay. Well, you told me there'd be a struct like that; I guess that'll work out. 423 00:19:54,000 --> 00:19:58,000 And now you told it what struct Node T is. It's, like, oh, I have a vector 424 00:19:58,000 --> 00:19:59,000 with these pointers to arc, 425 00:19:59,000 --> 00:20:02,000 and then it says, okay, now I see the whole picture, and 426 00:20:02,000 --> 00:20:03,000 it all worked out. 427 00:20:03,000 --> 00:20:08,000 So it's just a little bit of a 428 00:20:08,000 --> 00:20:11,000 requirement of how the C++ compiler likes to see stuff in order without ever having 429 00:20:11,000 --> 00:20:16,000 to go backwards to check anything out again. Okay. So 430 00:20:16,000 --> 00:20:19,000 I got myself some node structures, got 431 00:20:19,000 --> 00:20:21,000 some arc structures; they're all 432 00:20:21,000 --> 00:20:22,000 working together. 433 00:20:22,000 --> 00:20:24,000 I'm gonna try to do some traversals. 434 00:20:24,000 --> 00:20:26,000 Try to do some operations that 435 00:20:26,000 --> 00:20:29,000 work their way around the graph. 436 00:20:29,000 --> 00:20:30,000 So 437 00:20:30,000 --> 00:20:33,000 tree traversals, right, the pre, and post, and in order 438 00:20:33,000 --> 00:20:36,000 are ways that you start at the root, and you visit all the nodes on the tree, 439 00:20:36,000 --> 00:20:40,000 a very common need to, kind of, process, to delete nodes, 440 00:20:40,000 --> 00:20:43,000 to print the nodes, to whatnot. So we might want to do the same thing on graphs. I've got a 441 00:20:43,000 --> 00:20:47,000 graph out there, and I'd like to go exploring it. Maybe I just want to print, and if 442 00:20:47,000 --> 00:20:50,000 I'm in the Gates building, what are all the places that I can reach from 443 00:20:50,000 --> 00:20:53,000 the Gates building? Where can I go to; what 444 00:20:53,000 --> 00:20:54,000 options do I have? 445 00:20:54,000 --> 00:20:57,000 What I'm gonna do is a traversal 446 00:20:57,000 --> 00:21:01,000 starting at a node and, kind of, working my way outward to the visible, reachable 447 00:21:01,000 --> 00:21:04,000 nodes that I can follow those paths to get there. 448 00:21:04,000 --> 00:21:07,000 We're gonna look at two different ways to do this. 449 00:21:07,000 --> 00:21:09,000 One is Depth-first and one is Breadth-first, 450 00:21:09,000 --> 00:21:13,000 and I'll show you both algorithms, and they 451 00:21:13,000 --> 00:21:15,000 will visit the same nodes; 452 00:21:15,000 --> 00:21:19,000 when all is done and said, they will visit all the reachable nodes starting from a 453 00:21:19,000 --> 00:21:20,000 position, 454 00:21:20,000 --> 00:21:21,000 but they will visit them in different order. 455 00:21:21,000 --> 00:21:25,000 So just like the pre/post in order tree traversals, they visit all the same 456 00:21:25,000 --> 00:21:28,000 notes. It's just a matter of when they get around to doing it 457 00:21:28,000 --> 00:21:29,000 is how 458 00:21:29,000 --> 00:21:32,000 we distinguish depth and breadth-first traversals. 459 00:21:32,000 --> 00:21:34,000 Because we have this graph structure 460 00:21:34,000 --> 00:21:40,000 that has this loose connectivity and the possibility of cycles and multiple paths 461 00:21:40,000 --> 00:21:41,000 to the same node, 462 00:21:41,000 --> 00:21:44,000 we are gonna have to be a little bit careful at how we do our work to make 463 00:21:44,000 --> 00:21:48,000 sure that we don't end up getting stuck in some infinite loop when we keep 464 00:21:48,000 --> 00:21:51,000 going around the ABC cycle in a particular graph. 465 00:21:51,000 --> 00:21:55,000 That we need to realize when we're revisiting something we've seen before, 466 00:21:55,000 --> 00:21:56,000 and then not 467 00:21:56,000 --> 00:21:57,000 468 00:21:57,000 --> 00:21:59,000 trigger, kind of, an 469 00:21:59,000 --> 00:22:03,000 exploration we've already done. 470 00:22:03,000 --> 00:22:06,000 So let's look at depth-first first. 471 00:22:06,000 --> 00:22:09,000 This is probably the simpler of the two. They're both, actually, pretty simple. 472 00:22:09,000 --> 00:22:13,000 Depth-first traversal uses recursion to do its work, 473 00:22:13,000 --> 00:22:18,000 and the idea is that you pick a starting node - let's say I want to start at Gates, 474 00:22:18,000 --> 00:22:21,000 and that maybe what I'm trying to find is all the reachable nodes from Gates. 475 00:22:21,000 --> 00:22:24,000 If I don't have a starting node, then I can just pick one arbitrarily from the 476 00:22:24,000 --> 00:22:27,000 graph because there is actually no root node of special status, 477 00:22:27,000 --> 00:22:30,000 and the idea is to go deep. That 478 00:22:30,000 --> 00:22:33,000 in the case of, for example, a maze, it might be that I 479 00:22:33,000 --> 00:22:36,000 choose to go north, and then I just keep going north. I just go north, north, north, north, north until 480 00:22:36,000 --> 00:22:40,000 I run into a dead wall, and then I say, oh, I have to go east, and I go east, east, east, east, east. The idea is to 481 00:22:40,000 --> 00:22:42,000 just go as far away from 482 00:22:42,000 --> 00:22:46,000 the original state as I can, just keep going outward, outward, outward, outward, 483 00:22:46,000 --> 00:22:47,000 outward, outward, outward, 484 00:22:47,000 --> 00:22:50,000 and eventually I'm gonna run into some dead end, and that dead end could be I 485 00:22:50,000 --> 00:22:52,000 get to a node that has no outgoing arcs, 486 00:22:52,000 --> 00:22:54,000 or it could be that I get to a node 487 00:22:54,000 --> 00:22:57,000 that whose only arcs come back to places I've already been, 488 00:22:57,000 --> 00:22:58,000 so it cycles back on itself, in 489 00:22:58,000 --> 00:23:01,000 which case, that also is really a dead end. 490 00:23:01,000 --> 00:23:03,000 So you go deep. 491 00:23:03,000 --> 00:23:06,000 You go north, and you see as far as you can get, and 492 00:23:06,000 --> 00:23:10,000 only after you've, kind of, fully explored everything reachable from starting in 493 00:23:10,000 --> 00:23:11,000 the north direction 494 00:23:11,000 --> 00:23:13,000 do you backtrack, 495 00:23:13,000 --> 00:23:16,000 unmake that decision, and say, well, how about not north; how about east, right? And then 496 00:23:16,000 --> 00:23:19,000 find everything you can get to from the east, and 497 00:23:19,000 --> 00:23:22,000 after, kind of, going all through that, come back and try south, and so on. So all of the 498 00:23:22,000 --> 00:23:23,000 options you have 499 00:23:23,000 --> 00:23:26,000 exhaustively getting through all your 500 00:23:26,000 --> 00:23:26,000 neighbors 501 00:23:26,000 --> 00:23:29,000 in this, kind of, go-deep strategy. Well, 502 00:23:29,000 --> 00:23:32,000 the important thing is you realize if you've got this recursion going, so what 503 00:23:32,000 --> 00:23:35,000 you're actually doing, basically, is saying I'm gonna step to my neighbor to the 504 00:23:35,000 --> 00:23:40,000 right, and I'm gonna explore all the things, so do a depth-first search from there 505 00:23:40,000 --> 00:23:43,000 to find all the places you can reach, and that one says, well, I'm gonna step through this spot 506 00:23:43,000 --> 00:23:45,000 and go to one of my neighbors. 507 00:23:45,000 --> 00:23:48,000 And so we can't do that infinitely without getting into trouble. We need 508 00:23:48,000 --> 00:23:53,000 to have a base case that stops that recursion, and the two cases that I've 509 00:23:53,000 --> 00:23:56,000 talked about - one is just when you have no neighbors left to go. 510 00:23:56,000 --> 00:23:57,000 So when you 511 00:23:57,000 --> 00:23:59,000 512 00:23:59,000 --> 00:24:04,000 get to a node that actually, kind of, is a dead end, or that neighbor only has visited 513 00:24:04,000 --> 00:24:05,000 neighbors 514 00:24:05,000 --> 00:24:09,000 that surround it. So a 515 00:24:09,000 --> 00:24:10,000 very simple little piece of 516 00:24:10,000 --> 00:24:13,000 code that depends on recursion doing its job. 517 00:24:13,000 --> 00:24:17,000 In this case, we need to know which nodes we have visited, right? We're gonna have to 518 00:24:17,000 --> 00:24:19,000 have some kind of strategy for saying I 519 00:24:19,000 --> 00:24:21,000 have been here before. 520 00:24:21,000 --> 00:24:24,000 In this case, I'm using just a set because that's an easy thing to do. 521 00:24:24,000 --> 00:24:27,000 I make a set of node pointers, and I update it 522 00:24:27,000 --> 00:24:29,000 each time I see a new node; I add it to 523 00:24:29,000 --> 00:24:34,000 that set, and if I ever get to a node who is already previously visited 524 00:24:34,000 --> 00:24:38,000 from that, there's no reason to do any work from there; we can immediately 525 00:24:38,000 --> 00:24:41,000 stop. So 526 00:24:41,000 --> 00:24:43,000 starting that step would be empty. 527 00:24:43,000 --> 00:24:46,000 If the visit already contains the node that we're currently trying to visit, 528 00:24:46,000 --> 00:24:50,000 then there's nothing to do. Otherwise, we go ahead and add it, and there's a little node 529 00:24:50,000 --> 00:24:53,000 here that says we'll do something would occur. What am I trying to do here? Am I trying to print those nodes? Am 530 00:24:53,000 --> 00:24:56,000 I trying to draw pictures on those nodes, highlight those nodes, you know, 531 00:24:56,000 --> 00:25:00,000 something I'm doing with them. I don't know what it is, but this is the structure for the depth-first search 532 00:25:00,000 --> 00:25:01,000 here, and 533 00:25:01,000 --> 00:25:02,000 then 534 00:25:02,000 --> 00:25:04,000 for all of the neighbors, 535 00:25:04,000 --> 00:25:07,000 using the vector of outgoing connections, 536 00:25:07,000 --> 00:25:08,000 then I run a depth-first search 537 00:25:08,000 --> 00:25:11,000 on each of those neighbors 538 00:25:11,000 --> 00:25:14,000 passing that visited set by reference 539 00:25:14,000 --> 00:25:16,000 through the whole operation. So I will 540 00:25:16,000 --> 00:25:20,000 always be adding to and modifying this one set, so I will 541 00:25:20,000 --> 00:25:24,000 be able to know where I am and where I still need to go. 542 00:25:24,000 --> 00:25:27,000 So if we trace this guy in action - 543 00:25:27,000 --> 00:25:31,000 if I start arbitrarily from A, I say I'd like to begin with A. 544 00:25:31,000 --> 00:25:35,000 Then the depth-first search is gonna pick a neighbor, and let's say I happen to just work 545 00:25:35,000 --> 00:25:38,000 over my neighbors in alphabetical order. 546 00:25:38,000 --> 00:25:41,000 It would say okay, well, I've picked B. Let's go to everywhere we can get to from B. So A 547 00:25:41,000 --> 00:25:42,000 gets marked as visited. 548 00:25:42,000 --> 00:25:44,000 I move onto B; I say, well, 549 00:25:44,000 --> 00:25:48,000 B, search everywhere you can to from B, and B says okay, well, I need to pick a 550 00:25:48,000 --> 00:25:50,000 neighbor, how about I pick 551 00:25:50,000 --> 00:25:51,000 C? And says 552 00:25:51,000 --> 00:25:53,000 and now, 553 00:25:53,000 --> 00:25:56,000 having visited C, go everywhere you can get to from C. Well, 554 00:25:56,000 --> 00:25:59,000 C has no neighbors, right, no outgoing connections whatsoever. 555 00:25:59,000 --> 00:26:02,000 So C says, well, okay, everywhere you can get to from C is C, 556 00:26:02,000 --> 00:26:04,000 you know, I'm done. 557 00:26:04,000 --> 00:26:09,000 That will cause it to backtrack to this place, to B, and B says okay, I explored everywhere I can get to 558 00:26:09,000 --> 00:26:10,000 from C; 559 00:26:10,000 --> 00:26:13,000 what other neighbors do I have? B has no other neighbors. So, in fact, B backtracks all the way 560 00:26:13,000 --> 00:26:15,000 back to A. 561 00:26:15,000 --> 00:26:19,000 So now A says okay, well, I went everywhere I can get to from B, 562 00:26:19,000 --> 00:26:20,000 right? Let me look at my next neighbor. 563 00:26:20,000 --> 00:26:23,000 My next neighbor is C. Ah, 564 00:26:23,000 --> 00:26:25,000 but C, already visited, 565 00:26:25,000 --> 00:26:27,000 so there's no reason to 566 00:26:27,000 --> 00:26:29,000 go crazy on that thing. So, in fact, I can 567 00:26:29,000 --> 00:26:33,000 immediately see that I've visited that, so I don't have any further path to look at 568 00:26:33,000 --> 00:26:33,000 there. 569 00:26:33,000 --> 00:26:35,000 I'll hit D then 570 00:26:35,000 --> 00:26:37,000 as the next one in sequence. So 571 00:26:37,000 --> 00:26:40,000 at this point, I've done everything reachable from the B arm, everything 572 00:26:40,000 --> 00:26:43,000 reachable from the C arm, and now I'm starting on the D arm. I said okay, where can I get to 573 00:26:43,000 --> 00:26:44,000 from D? 574 00:26:44,000 --> 00:26:46,000 Well, I can get to E. All right, 575 00:26:46,000 --> 00:26:48,000 where can I get to from E? 576 00:26:48,000 --> 00:26:52,000 E goes to F, so the idea is you think of it going deep. It's going as far away as it 577 00:26:52,000 --> 00:26:56,000 can, as deep as possible, before it, kind of, unwinds 578 00:26:56,000 --> 00:26:58,000 getting to the F that has no neighbors 579 00:26:58,000 --> 00:27:01,000 and saying okay. Well, how about - where can I get to 580 00:27:01,000 --> 00:27:04,000 also from E? I can get to G. 581 00:27:04,000 --> 00:27:07,000 From G where can I get to? Nowhere, 582 00:27:07,000 --> 00:27:09,000 so we, kind of, unwind the 583 00:27:09,000 --> 00:27:12,000 D arm, and then we will eventually get back to and hit the H. 584 00:27:12,000 --> 00:27:16,000 So the order they actually were hit in this case happened to be alphabetical. I 585 00:27:16,000 --> 00:27:18,000 assigned them such that that would come out that way. 586 00:27:18,000 --> 00:27:21,000 That's not really a property of what depth-first search does, 587 00:27:21,000 --> 00:27:24,000 but think of it as like it's going as deep as possible, as far away as possible, 588 00:27:24,000 --> 00:27:28,000 and so it makes a choice - it's pretty much like the recursive backtrackers that we've seen. In 589 00:27:28,000 --> 00:27:31,000 fact, they make a choice, commit to it, and just move forward on it, never, kind of, 590 00:27:31,000 --> 00:27:35,000 giving a backwards glance, and only if the whole process 591 00:27:35,000 --> 00:27:35,000 bottoms out 592 00:27:35,000 --> 00:27:39,000 does it come back and start unmaking decisions and try alternatives, 593 00:27:39,000 --> 00:27:43,000 eventually exploring everything reachable from A 594 00:27:43,000 --> 00:27:46,000 when all is done and said. So if 595 00:27:46,000 --> 00:27:49,000 you, for example, use that on the maze, 596 00:27:49,000 --> 00:27:52,000 right, the depth-first search on a maze is very much the same strategy we're talking about 597 00:27:52,000 --> 00:27:53,000 here. 598 00:27:53,000 --> 00:27:56,000 Instead of the way we did it was actually breadth-first search, if you 599 00:27:56,000 --> 00:27:57,000 remember back to that assignment, 600 00:27:57,000 --> 00:28:00,000 the depth-first search alternative is actually doing it the - 601 00:28:00,000 --> 00:28:01,000 just go deep, 602 00:28:01,000 --> 00:28:04,000 make a choice, go with it, run all the way until it bottoms out, 603 00:28:04,000 --> 00:28:12,000 and only then back up and try some other alternatives. 604 00:28:12,000 --> 00:28:15,000 The breadth-first traversal, gonna hit 605 00:28:15,000 --> 00:28:16,000 the same nodes 606 00:28:16,000 --> 00:28:19,000 but just gonna hit them in a different way. 607 00:28:19,000 --> 00:28:21,000 If my goal were to actually hit all of them, 608 00:28:21,000 --> 00:28:25,000 then either of them is gonna be a fine strategy. There may be some reason why 609 00:28:25,000 --> 00:28:25,000 610 00:28:25,000 --> 00:28:29,000 I'm hoping to stop the search early, and that actually might dictate why I would prefer 611 00:28:29,000 --> 00:28:31,000 which ones to look at first. 612 00:28:31,000 --> 00:28:35,000 The breadth-first traversal is going to explore things in terms of distance, 613 00:28:35,000 --> 00:28:40,000 in this case, expressed in terms of number of hops away from the starting 614 00:28:40,000 --> 00:28:40,000 node. 615 00:28:40,000 --> 00:28:44,000 So it's gonna look at everything that's one hop away, and then go back and look at 616 00:28:44,000 --> 00:28:46,000 everything two hops away, and then three hops away. 617 00:28:46,000 --> 00:28:49,000 And so if the goal, for example, was to find a node 618 00:28:49,000 --> 00:28:51,000 in a shortest path connection between it, 619 00:28:51,000 --> 00:28:55,000 then breadth-first search, by looking at all the ones one hop away, if it was a one hop 620 00:28:55,000 --> 00:28:56,000 path, it'll find it, 621 00:28:56,000 --> 00:28:59,000 and if it's a two hop path, it'll find it on that next round, and then the three hop 622 00:28:59,000 --> 00:29:00,000 path, 623 00:29:00,000 --> 00:29:02,000 and there might be paths that are 10, and 20, and 100 624 00:29:02,000 --> 00:29:03,000 steps long, 625 00:29:03,000 --> 00:29:06,000 but if I find it earlier, I won't go looking at them. It actually might prove to 626 00:29:06,000 --> 00:29:09,000 be a more efficient way to get to the shortest solution as opposed to depth-first 627 00:29:09,000 --> 00:29:11,000 search which go off and try all these deep 628 00:29:11,000 --> 00:29:14,000 paths that may not ever end up where I want it to be 629 00:29:14,000 --> 00:29:16,000 before it eventually made its way to the short 630 00:29:16,000 --> 00:29:20,000 solution I was wanting to find. So the thing 631 00:29:20,000 --> 00:29:21,000 is we have the starting node. 632 00:29:21,000 --> 00:29:24,000 We're gonna visit all the immediate neighbors. 633 00:29:24,000 --> 00:29:27,000 So, basically, a loop over the 634 00:29:27,000 --> 00:29:28,000 immediate one, 635 00:29:28,000 --> 00:29:31,000 and then while we're doing that, we're gonna actually, kind of, be gathering up that 636 00:29:31,000 --> 00:29:33,000 next generation. 637 00:29:33,000 --> 00:29:33,000 638 00:29:33,000 --> 00:29:37,000 Sometimes that's called the Frontier, sort of, advancing the frontier 639 00:29:37,000 --> 00:29:39,000 of the nodes to explore next, 640 00:29:39,000 --> 00:29:42,000 so all the nodes that are two hops away, 641 00:29:42,000 --> 00:29:44,000 and then while we're processing the ones that are two hops away, we're gonna 642 00:29:44,000 --> 00:29:47,000 be gathering the frontier 643 00:29:47,000 --> 00:29:50,000 that is three hops away. And so at each stage we'll be, kind of, processing a 644 00:29:50,000 --> 00:29:54,000 generation but while assembling the generation N+1 645 00:29:54,000 --> 00:29:58,000 in case we need to go further to find those things. 646 00:29:58,000 --> 00:30:00,000 We'll use an auxiliary data structure during 647 00:30:00,000 --> 00:30:04,000 this. That management of the frontier, keeping track of the nodes that are, kind 648 00:30:04,000 --> 00:30:05,000 of, 649 00:30:05,000 --> 00:30:07,000 staged for the 650 00:30:07,000 --> 00:30:09,000 subsequent iterations, 651 00:30:09,000 --> 00:30:12,000 the queue is a perfect data structure for that. If I put the 652 00:30:12,000 --> 00:30:14,000 initial neighbors into a queue, 653 00:30:14,000 --> 00:30:19,000 as I dequeue them, if I put their neighbors on the back of the queue, so enqueue them 654 00:30:19,000 --> 00:30:21,000 behind all the one hop neighbors, then 655 00:30:21,000 --> 00:30:24,000 if I do that with code, as I'm processing Generation 1 off the front of the queue, I'll be 656 00:30:24,000 --> 00:30:27,000 stacking Generation 2 in the back of the queue, 657 00:30:27,000 --> 00:30:30,000 and then when all of Generation 1 is exhausted, I'll be processing Generation 658 00:30:30,000 --> 00:30:31,000 2, 659 00:30:31,000 --> 00:30:35,000 but, kind of, enqueuing Generation 3 behind it. 660 00:30:35,000 --> 00:30:39,000 Same deal that we had with depth-first search though is we will have to be 661 00:30:39,000 --> 00:30:41,000 careful about cycles will pull past that 662 00:30:41,000 --> 00:30:44,000 when there's more than one possibility of how we get to something, we don't 663 00:30:44,000 --> 00:30:46,000 want to, kind of, keep going around that cycle 664 00:30:46,000 --> 00:30:52,000 or do a lot of redundant work visiting nodes that we've already seen. So 665 00:30:52,000 --> 00:30:55,000 I have the same idea of keeping track of a set 666 00:30:55,000 --> 00:30:57,000 that knows what 667 00:30:57,000 --> 00:31:00,000 we've previously visited, 668 00:31:00,000 --> 00:31:01,000 and the 669 00:31:01,000 --> 00:31:05,000 initial node gets enqueued just, kind of, by itself, just like Generation 0 gets 670 00:31:05,000 --> 00:31:07,000 put into the queue, 671 00:31:07,000 --> 00:31:10,000 and then while the queue is not empty, so while there's still neighbors I haven't 672 00:31:10,000 --> 00:31:11,000 yet visited, 673 00:31:11,000 --> 00:31:13,000 I dequeue the frontmost one, 674 00:31:13,000 --> 00:31:16,000 if it hasn't already been visited on some previous iteration, then I mark it as 675 00:31:16,000 --> 00:31:17,000 visited, 676 00:31:17,000 --> 00:31:22,000 and then I enqueue all of its children or neighbors at the back of the queue. 677 00:31:22,000 --> 00:31:26,000 And so, as of right now, I'm processing Generation 0, that means all of Generation 1, so the 678 00:31:26,000 --> 00:31:29,000 neighbors that are one hop away, get placed in the queue, and then 679 00:31:29,000 --> 00:31:34,000 the next iteration coming out here will pull off a Generation 1, 680 00:31:34,000 --> 00:31:37,000 and then enqueue all of these two hop neighbors. 681 00:31:37,000 --> 00:31:40,000 Come back and other Generation 1's will get pulled off, enqueue all their two hop 682 00:31:40,000 --> 00:31:44,000 neighbors, and so the two hop generation will, kind of, be built up behind, and then 683 00:31:44,000 --> 00:31:48,000 once the last of the one hop generation gets managed, 684 00:31:48,000 --> 00:31:51,000 start processing the Generation 2 and so on. 685 00:31:51,000 --> 00:31:55,000 And so this will keep going while the queue has some 686 00:31:55,000 --> 00:31:58,000 contents in there. So that means some neighbors 687 00:31:58,000 --> 00:32:01,000 we have connections to that we haven't yet had a chance to explore, and 688 00:32:01,000 --> 00:32:05,000 either we will find out that they have all been visited, so this will end when 689 00:32:05,000 --> 00:32:09,000 all of the nodes that are in the queue have been visited or 690 00:32:09,000 --> 00:32:14,000 there are no more neighbors to enqueue. So we've gotten to those dead ends or those cycles 691 00:32:14,000 --> 00:32:25,000 that mean we've reach everything that was reachable from that start node. 692 00:32:25,000 --> 00:32:28,000 And so I've traced this guy 693 00:32:28,000 --> 00:32:30,000 to see it doing its work. 694 00:32:30,000 --> 00:32:33,000 So if I start again with A, 695 00:32:33,000 --> 00:32:36,000 and, again, assuming that I'm gonna process my nodes 696 00:32:36,000 --> 00:32:40,000 in alphabetical order when I have children, so I will through and I 697 00:32:40,000 --> 00:32:44,000 will enqueue, so the queue right now has just node A in it. I pull it off. I say 698 00:32:44,000 --> 00:32:45,000 have I visited? No, I have not. 699 00:32:45,000 --> 00:32:47,000 So then I mark it as visited, 700 00:32:47,000 --> 00:32:52,000 and now for each of its neighboring nodes, B, C, D, H, I'm gonna put those into 701 00:32:52,000 --> 00:32:54,000 the queue, 702 00:32:54,000 --> 00:32:57,000 and so then they're loaded up, and then I come back around, I say do I still have stuff in the queue? I 703 00:32:57,000 --> 00:33:01,000 do, so let's take out the first thing that was put in; it was B. Okay, 704 00:33:01,000 --> 00:33:05,000 and I'll say okay, well, where can you get to from B? You can get to C, so I'm gonna put C in the queue. Now, 705 00:33:05,000 --> 00:33:09,000 C is actually already in the queue, but we're gonna pull it out 706 00:33:09,000 --> 00:33:12,000 earlier in terms it'll get handled by the 707 00:33:12,000 --> 00:33:14,000 visiting status later. 708 00:33:14,000 --> 00:33:16,000 So, okay, we'll put C, D, and E in 709 00:33:16,000 --> 00:33:20,000 and then put C behind it. So right now, we'll have C, D, H, and C again. 710 00:33:20,000 --> 00:33:24,000 It'll find the C that's in the front there. It'll say where can you get to from C? 711 00:33:24,000 --> 00:33:24,000 Nowhere, 712 00:33:24,000 --> 00:33:26,000 so no additional 713 00:33:26,000 --> 00:33:28,000 Generation 2's are added by that one. 714 00:33:28,000 --> 00:33:31,000 I'll look at D, and I'll say okay, well, what Generation 2's do we have from D? It says, well, I can 715 00:33:31,000 --> 00:33:32,000 get to E. All right, so put E on the 716 00:33:32,000 --> 00:33:34,000 back of the 717 00:33:34,000 --> 00:33:36,000 718 00:33:36,000 --> 00:33:39,000 queue. Look at H; where can H get to? H can get to G, 719 00:33:39,000 --> 00:33:42,000 so go ahead and put G on the back of the queue. 720 00:33:42,000 --> 00:33:45,000 So now, it comes back around to the things that are two hops away. 721 00:33:45,000 --> 00:33:47,000 The first one it's gonna pull off is C. It's 722 00:33:47,000 --> 00:33:49,000 gonna say, well, C is already visited, 723 00:33:49,000 --> 00:33:53,000 so there's no need to redundantly visit that or do anything with that. 724 00:33:53,000 --> 00:33:54,000 It passes over C. It 725 00:33:54,000 --> 00:33:56,000 finds the E 726 00:33:56,000 --> 00:33:58,000 that's behind that, 727 00:33:58,000 --> 00:34:00,000 and then it finds the G that was behind the 728 00:34:00,000 --> 00:34:03,000 H. The E also enqueued the F 729 00:34:03,000 --> 00:34:07,000 that was four hops, and the very last one, right, output will be that F that was 730 00:34:07,000 --> 00:34:09,000 only reachable by taking 731 00:34:09,000 --> 00:34:12,000 three hops away from the thing. So this one's, kind of, 732 00:34:12,000 --> 00:34:13,000 working radially. The 733 00:34:13,000 --> 00:34:16,000 idea that I'm looking at all the things that I can get to, kind of, in one hop 734 00:34:16,000 --> 00:34:19,000 are my first generation, then two hops, then three hops, 735 00:34:19,000 --> 00:34:23,000 and so it's like throwing a stone in the water and watching it ripple out 736 00:34:23,000 --> 00:34:26,000 that the visiting 737 00:34:26,000 --> 00:34:29,000 is managed in terms of, kind of, growing length of path, 738 00:34:29,000 --> 00:34:32,000 and number of hops that it took to get there. 739 00:34:32,000 --> 00:34:37,000 So this is the strategy that you used for maze solving, if you remember. 740 00:34:37,000 --> 00:34:37,000 That 741 00:34:37,000 --> 00:34:42,000 what you were tracking was the queue of paths where the path was 742 00:34:42,000 --> 00:34:44,000 AB, or ADC, or 743 00:34:44,000 --> 00:34:46,000 ADE, and that they grew 744 00:34:46,000 --> 00:34:47,000 745 00:34:47,000 --> 00:34:50,000 step wise. That, kind of, each iteration through the 746 00:34:50,000 --> 00:34:54,000 traversal there where you're saying, okay, well, I've seen all the paths that are one hop. Now I've seen all the ones of two, 747 00:34:54,000 --> 00:34:56,000 now the ones three, and four, and five, 748 00:34:56,000 --> 00:34:58,000 and when you keep doing that, eventually you get to the hop, 749 00:34:58,000 --> 00:35:01,000 you know, 75, which leads you 750 00:35:01,000 --> 00:35:05,000 all the way to the goal, and since you have looked at all the paths that were 751 00:35:05,000 --> 00:35:09,000 74 and shorter, the first path that you dequeue that leads to your goal 752 00:35:09,000 --> 00:35:12,000 must be the best or the shortest overall 753 00:35:12,000 --> 00:35:13,000 because 754 00:35:13,000 --> 00:35:15,000 of the order you processed them. 755 00:35:15,000 --> 00:35:18,000 That's not true in depth-first search, right? That depth-first search might find a much 756 00:35:18,000 --> 00:35:21,000 longer and more circuitous path that led to the goal 757 00:35:21,000 --> 00:35:25,000 when there is a shorter one that would be eventually found if I had a graph that 758 00:35:25,000 --> 00:35:28,000 looked, 759 00:35:28,000 --> 00:35:36,000 you know, had this long path, let's say, 760 00:35:36,000 --> 00:35:41,000 and so this is goal node, let's say, and this is the start node that 761 00:35:41,000 --> 00:35:44,000 there could be a two hop path, kind of, off to this angle, 762 00:35:44,000 --> 00:35:47,000 but if this was the first one explored in the depth-first, right, 763 00:35:47,000 --> 00:35:50,000 it could eventually, kind of, work its way all the way around through this 764 00:35:50,000 --> 00:35:52,000 eight hop path, and say, okay, well, I found it, right? 765 00:35:52,000 --> 00:35:55,000 It would be the first one we found, but there's no guarantee that would be the 766 00:35:55,000 --> 00:35:58,000 shortest in depth-first search, right? It would just happen to be the one that based on 767 00:35:58,000 --> 00:36:01,000 its arbitrary choice of which 768 00:36:01,000 --> 00:36:04,000 neighbors to pursue it happened to get to that there could be a shorter 769 00:36:04,000 --> 00:36:08,000 one that would be found later in the traversal. That's not true in 770 00:36:08,000 --> 00:36:12,000 the breadth-first search strategy because it will be looking at this, and then these two, 771 00:36:12,000 --> 00:36:16,000 and then these two, and, kind of, working its way outward down those edges. 772 00:36:16,000 --> 00:36:19,000 So as soon as we get to one that hits the goal, we 773 00:36:19,000 --> 00:36:25,000 know we have found the shortest we can have. Question? 774 00:36:25,000 --> 00:36:28,000 In that case, 775 00:36:28,000 --> 00:36:33,000 you could return, okay, we need two 776 00:36:33,000 --> 00:36:34,000 jumps to get to 777 00:36:34,000 --> 00:36:35,000 the goal, 778 00:36:35,000 --> 00:36:36,000 like, as far 779 00:36:36,000 --> 00:36:40,000 winning the goal. Yeah. How do we remember which path led us there? 780 00:36:40,000 --> 00:36:42,000 So if you were really doing that, you know, in terms of depth-first search, what 781 00:36:42,000 --> 00:36:46,000 you'd probably be tracking is what's the path? Probably a stack that said here's the stack 782 00:36:46,000 --> 00:36:49,000 that led to that path. Let me hold 783 00:36:49,000 --> 00:36:53,000 onto this, right, and I will compare it to these alternatives. So I'll try it on Neighbor 1, 784 00:36:53,000 --> 00:36:54,000 get the stack back from that; it's the best. 785 00:36:54,000 --> 00:36:58,000 Try it on Neighbor 2, get that stack back, see which one is smaller and use 786 00:36:58,000 --> 00:37:01,000 that as the better choice, and so, eventually, when I tried it out this side, I'd get 787 00:37:01,000 --> 00:37:03,000 the stack back that was just two, 788 00:37:03,000 --> 00:37:06,000 and I could say, yeah, that was better than this ten step path I had over here, so I 789 00:37:06,000 --> 00:37:09,000 will prefer that. So just using your standard, kind of, backtracking with 790 00:37:09,000 --> 00:37:13,000 using your return values, incorporating them, and deciding which was better. 791 00:37:13,000 --> 00:37:14,000 And so in this case, 792 00:37:14,000 --> 00:37:16,000 the depth-first search 793 00:37:16,000 --> 00:37:19,000 can't really be pruned very easily 794 00:37:19,000 --> 00:37:24,000 because it has to, kind of, explore the whole thing to know that this 795 00:37:24,000 --> 00:37:26,000 short little path right over here happened to be just 796 00:37:26,000 --> 00:37:29,000 arbitrarily examined last, whereas, the breadth-first search actually does, kind of, because 797 00:37:29,000 --> 00:37:31,000 it's prioritizing by 798 00:37:31,000 --> 00:37:35,000 order of length. If the goal was shortest path, it will find it sooner and be able to 799 00:37:35,000 --> 00:37:35,000 stop 800 00:37:35,000 --> 00:37:39,000 when it found it rather than look at these longer paths. So it won't end up - 801 00:37:39,000 --> 00:37:42,000 if there were hundreds of other paths over here, breadth-first search won't actually 802 00:37:42,000 --> 00:37:47,000 have to ever reach them or explore them. 803 00:37:47,000 --> 00:37:52,000 So there are a lot of things that actually end up being graph search in disguise. That's why 804 00:37:52,000 --> 00:37:54,000 a, kind of, a graph is a great data structure to, kind of, get to in 805 00:37:54,000 --> 00:37:57,000 106b because there are lots of problems 806 00:37:57,000 --> 00:38:01,000 that, in the end, if you can model them using the same 807 00:38:01,000 --> 00:38:03,000 setup of nodes, and arcs, and traversals, 808 00:38:03,000 --> 00:38:06,000 that you can answer interesting questions about that data by applying 809 00:38:06,000 --> 00:38:09,000 the same graph search techniques. 810 00:38:09,000 --> 00:38:12,000 So as long as you want to know things like, well, which nodes are reachable from this node at 811 00:38:12,000 --> 00:38:12,000 all? 812 00:38:12,000 --> 00:38:14,000 And then 813 00:38:14,000 --> 00:38:17,000 that just means, okay, I'm here and I want to get to these places; what places could 814 00:38:17,000 --> 00:38:18,000 I get to? Or 815 00:38:18,000 --> 00:38:21,000 knowing what courses I could take 816 00:38:21,000 --> 00:38:25,000 that would lead me to a CS degree, it's, kind of, like well, looking for the ones 817 00:38:25,000 --> 00:38:29,000 that lead to having all your graduation requirements 818 00:38:29,000 --> 00:38:32,000 satisfied, and each of those arcs could be a course you could take, right, how close did 819 00:38:32,000 --> 00:38:35,000 that get you to the goal? How can I get to the 820 00:38:35,000 --> 00:38:40,000 place I'd like to be, or what are the places that this course satisfies 821 00:38:40,000 --> 00:38:41,000 requirements that lead to? 822 00:38:41,000 --> 00:38:44,000 Knowing things about 823 00:38:44,000 --> 00:38:47,000 connectivity is, kind of, useful in things like the, kind of, bacon numbers or 824 00:38:47,000 --> 00:38:50,000 erdish numbers where people want to know, well, how close am I to greatness by - 825 00:38:50,000 --> 00:38:54,000 have I been in a movie with somebody who was in a movie, who was in a movie with Kevin Bacon? And, 826 00:38:54,000 --> 00:38:55,000 for me, it turns out 827 00:38:55,000 --> 00:38:57,000 zero; I have been in no movies. But have 828 00:38:57,000 --> 00:38:59,000 you written a paper with someone who then 829 00:38:59,000 --> 00:39:02,000 connects through a chain to some other famous person, right? Tells you about 830 00:39:02,000 --> 00:39:06,000 your, kind of, social network distance or something, or I'm linked in. You're trying to get a 831 00:39:06,000 --> 00:39:09,000 new job, and you want to find somebody to introduce you to 832 00:39:09,000 --> 00:39:13,000 the head of this cool company. It's, like, figuring who you know who knows them, right? 833 00:39:13,000 --> 00:39:18,000 It's finding paths through a social network graph. 834 00:39:18,000 --> 00:39:19,000 Finding out things like 835 00:39:19,000 --> 00:39:22,000 are there paths with cycles? What's the shortest path 836 00:39:22,000 --> 00:39:25,000 through the cycle, longest path through the cycle often tells you about things about 837 00:39:25,000 --> 00:39:27,000 redundancies in that structure 838 00:39:27,000 --> 00:39:31,000 that, for example, when you're building a network infrastructure, sometimes you do have 839 00:39:31,000 --> 00:39:34,000 cycles in it because you actually want to have multiple ways to get to things in case 840 00:39:34,000 --> 00:39:37,000 there's a breakdown. Like if there's a freeway accident, you want to have another way 841 00:39:37,000 --> 00:39:38,000 to get around it, 842 00:39:38,000 --> 00:39:41,000 but you are interested in, maybe, trying to keep your cost down is trying to find 843 00:39:41,000 --> 00:39:47,000 out where are the right places to put in those redundancies, those cycles that 844 00:39:47,000 --> 00:39:50,000 give you the best benefit with the least cost. 845 00:39:50,000 --> 00:39:54,000 Trying to find things like a continuous path that visits all the nodes once and exactly 846 00:39:54,000 --> 00:39:57,000 once, especially doing so efficiently, picking a short 847 00:39:57,000 --> 00:40:00,000 overall path for that. We talked a little bit about that in terms of what's called the 848 00:40:00,000 --> 00:40:02,000 traveling salesman problem. 849 00:40:02,000 --> 00:40:04,000 You are trying to get from - you have ten 850 00:40:04,000 --> 00:40:06,000 cities to visit on your book tour. You don't want to spend all your lifetime 851 00:40:06,000 --> 00:40:08,000 on a plane. What's the 852 00:40:08,000 --> 00:40:13,000 path of crisscrossing the country that gets you to all ten of those with the least 853 00:40:13,000 --> 00:40:15,000 time painfully 854 00:40:15,000 --> 00:40:17,000 spent on an airline? And so 855 00:40:17,000 --> 00:40:20,000 those are just graph search problems. You have these nodes. You have these arcs. 856 00:40:20,000 --> 00:40:21,000 How can you 857 00:40:21,000 --> 00:40:23,000 traverse them and visit 858 00:40:23,000 --> 00:40:25,000 and come up with 859 00:40:25,000 --> 00:40:27,000 a good solution. There's other 860 00:40:27,000 --> 00:40:30,000 problems I think that are kind of neat. Word letters, we used to give out an assignment on word letters. We 861 00:40:30,000 --> 00:40:31,000 didn't this time, but I thought it was an 862 00:40:31,000 --> 00:40:34,000 interesting problem to think about. The idea of 863 00:40:34,000 --> 00:40:36,000 how it is you can transform 864 00:40:36,000 --> 00:40:41,000 one letter at a time from one word to another is very much a 865 00:40:41,000 --> 00:40:45,000 graph problem behind the scenes, and that largely a breadth-first 866 00:40:45,000 --> 00:40:47,000 traversal is exactly what you're looking for there. How 867 00:40:47,000 --> 00:40:51,000 can I transform this with the shortest number of steps is, kind of, 868 00:40:51,000 --> 00:40:54,000 working radially out from the starting word. 869 00:40:54,000 --> 00:40:54,000 870 00:40:54,000 --> 00:40:58,000 Your boggle board turns out, really, is just a graph. If 871 00:40:58,000 --> 00:41:01,000 you think of 872 00:41:01,000 --> 00:41:07,000 having this sequence of letters 873 00:41:07,000 --> 00:41:12,000 that you're trying to work through, 874 00:41:12,000 --> 00:41:16,000 that really each of these letters can actually just be thought of as a node, 875 00:41:16,000 --> 00:41:20,000 and then the connections it has are to its eight neighbors, and that finding 876 00:41:20,000 --> 00:41:24,000 words in there, especially finding paths, starting from 877 00:41:24,000 --> 00:41:27,000 this node that lead away, that don't revisit, so, in fact, doing breadth-first 878 00:41:27,000 --> 00:41:31,000 or depth-first search. You most likely did depth-first search when you were writing the 879 00:41:31,000 --> 00:41:32,000 implementation of boggle, 880 00:41:32,000 --> 00:41:35,000 but that's, kind of, a deep search on I found an A. I found a P. I found a P, and, kind 881 00:41:35,000 --> 00:41:40,000 of, keeps going as long as that path looks viable. So, in this case, that the arc 882 00:41:40,000 --> 00:41:41,000 information we're using is 883 00:41:41,000 --> 00:41:44,000 having a sub of those letters, do they form a prefix that could be leading 884 00:41:44,000 --> 00:41:45,000 somewhere good, 885 00:41:45,000 --> 00:41:47,000 and then the, kind of, recursion bottoming out when 886 00:41:47,000 --> 00:41:52,000 we have run ourselves into a corner where we have no unvisited nodes 887 00:41:52,000 --> 00:41:53,000 that neighbor us, 888 00:41:53,000 --> 00:41:57,000 or only nodes that would extend to a prefix that no longer forms any 889 00:41:57,000 --> 00:41:58,000 viable words 890 00:41:58,000 --> 00:42:01,000 as a search problem. 891 00:42:01,000 --> 00:42:04,000 So it feels like everything you've done this quarter has actually been a graph in disguise; I just didn't 892 00:42:04,000 --> 00:42:08,000 tell you. The maze problem is very much a graph search problem, building a 893 00:42:08,000 --> 00:42:10,000 graph and searching it. 894 00:42:10,000 --> 00:42:13,000 Another one that I think is, kind of, neat to think about is, kind of, related to the idea of word letters 895 00:42:13,000 --> 00:42:18,000 is that often when you mistype a word, a word 896 00:42:18,000 --> 00:42:19,000 processor will suggest to you 897 00:42:19,000 --> 00:42:22,000 here's a word that you might have meant instead. You type a word that it doesn't 898 00:42:22,000 --> 00:42:24,000 have in its lexicon, and it says, well, 899 00:42:24,000 --> 00:42:25,000 that could just be 900 00:42:25,000 --> 00:42:28,000 a word I don't know, but it also could mean that you actually mistyped something, and so 901 00:42:28,000 --> 00:42:31,000 what you want to look at is having neighboring nodes, 902 00:42:31,000 --> 00:42:33,000 sort of, diagramming a graph 903 00:42:33,000 --> 00:42:35,000 of all the words you know of 904 00:42:35,000 --> 00:42:39,000 and then the, kind of, permutations, the tiny little twiddles to that word that 905 00:42:39,000 --> 00:42:41,000 are neighbors to it so that 906 00:42:41,000 --> 00:42:44,000 you could describe a set of suggestions 907 00:42:44,000 --> 00:42:47,000 as being those things that are within a certain 908 00:42:47,000 --> 00:42:50,000 hop distance from the original mistyped word. And 909 00:42:50,000 --> 00:42:53,000 so you could say, yeah, things that have two letters different are probably 910 00:42:53,000 --> 00:42:56,000 close enough, but things that have three or more letters different aren't actually 911 00:42:56,000 --> 00:42:58,000 likely to be 912 00:42:58,000 --> 00:43:00,000 the mistyped word, 913 00:43:00,000 --> 00:43:04,000 and then that, kind of, leads to supporting things like wildcard searches 914 00:43:04,000 --> 00:43:08,000 where you're trying to remember whether it's E-R-A-T or A-R-T-E at the 915 00:43:08,000 --> 00:43:10,000 end of desperate. 916 00:43:10,000 --> 00:43:14,000 Those things could be modeled as, kind of, like they search through a graph space where 917 00:43:14,000 --> 00:43:16,000 you're modeling those letters, and then there's, like, a 918 00:43:16,000 --> 00:43:20,000 branching where you try all of the letters coming out of D-E-S-P, seeing if any 919 00:43:20,000 --> 00:43:23,000 of them lead to an R-A-T-E. 920 00:43:23,000 --> 00:43:26,000 So the last thing I wanted 921 00:43:26,000 --> 00:43:28,000 to just bring up, and this is the one that's going into this week's assignment. I'm just 922 00:43:28,000 --> 00:43:31,000 gonna actually show it to you to, kind of, get you thinking about it because it's 923 00:43:31,000 --> 00:43:35,000 actually the problem at hand that you're gonna be solving 924 00:43:35,000 --> 00:43:36,000 is that 925 00:43:36,000 --> 00:43:41,000 the one of where I'm interested in finding the shortest path, but I'm gonna add in 926 00:43:41,000 --> 00:43:43,000 this new idea that 927 00:43:43,000 --> 00:43:45,000 all hops are not created equal. 928 00:43:45,000 --> 00:43:49,000 That a flight from San Francisco to Seattle is a shorter distance than the 929 00:43:49,000 --> 00:43:52,000 one that goes from San Francisco to Boston, 930 00:43:52,000 --> 00:43:53,000 and if I'm interested in getting 931 00:43:53,000 --> 00:43:57,000 to someplace, it's not just that I would like to take a direct flight. I'd also 932 00:43:57,000 --> 00:44:00,000 like to take the set of flights that involves the least, kind of, going out of my way, 933 00:44:00,000 --> 00:44:03,000 and going through other cities, and other directions. 934 00:44:03,000 --> 00:44:07,000 And so in this case, a breadth-first search is a little bit too simplistic. 935 00:44:07,000 --> 00:44:10,000 That if I just look at all the places I can get to in one hop, 936 00:44:10,000 --> 00:44:11,000 937 00:44:11,000 --> 00:44:15,000 those are not really equally 938 00:44:15,000 --> 00:44:16,000 weighted 939 00:44:16,000 --> 00:44:19,000 in terms of my goal of minimizing my total path distance, 940 00:44:19,000 --> 00:44:22,000 but if I, instead, prioritize - 941 00:44:22,000 --> 00:44:25,000 prioritize in your mind, immediately conjure up images of the 942 00:44:25,000 --> 00:44:26,000 priority queue, 943 00:44:26,000 --> 00:44:28,000 to look at paths that are, 944 00:44:28,000 --> 00:44:32,000 if I'm trying to minimize total distance, the shortest path, even if they 945 00:44:32,000 --> 00:44:36,000 represent several hops, if they're still shorter - if I have three flights, and each are 946 00:44:36,000 --> 00:44:37,000 100 miles, 947 00:44:37,000 --> 00:44:40,000 then they represent a flight of 300 total miles, 948 00:44:40,000 --> 00:44:44,000 and that's still preferred to one that actually is 2,000 miles or 7,000 miles 949 00:44:44,000 --> 00:44:45,000 going to Asia 950 00:44:45,000 --> 00:44:49,000 alternatively. And so if I worked on a breadth-first that's been modified by a total 951 00:44:49,000 --> 00:44:50,000 distance, 952 00:44:50,000 --> 00:44:51,000 that it would be a way 953 00:44:51,000 --> 00:44:54,000 to work out the shortest path 954 00:44:54,000 --> 00:44:55,000 in terms of total weight 955 00:44:55,000 --> 00:44:58,000 from San Francisco to my destination 956 00:44:58,000 --> 00:45:01,000 by a 957 00:45:01,000 --> 00:45:03,000 modified breadth-first, kind of, weighted strategy, 958 00:45:03,000 --> 00:45:06,000 and that is exactly what you're gonna be doing in this assignment. I'm not gonna 959 00:45:06,000 --> 00:45:08,000 give away too much by, kind of, 960 00:45:08,000 --> 00:45:11,000 delving too much in it, but the handout does a pretty good job of talking you through what 961 00:45:11,000 --> 00:45:15,000 you're doing, but it's a, kind of, neat way to think about that's how the GPS, the MapQuest, and things like 962 00:45:15,000 --> 00:45:16,000 that are working is 963 00:45:16,000 --> 00:45:17,000 you're at a site; 964 00:45:17,000 --> 00:45:21,000 you have a goal, and you have this information that you're trying to 965 00:45:21,000 --> 00:45:24,000 guide that search, and you're gonna use heuristics about 966 00:45:24,000 --> 00:45:28,000 appears to be the, kind of, optimal path to, kind of, pick that, and go with it, 967 00:45:28,000 --> 00:45:31,000 and see where it leads you. So 968 00:45:31,000 --> 00:45:32,000 that will be your 969 00:45:32,000 --> 00:45:40,000 task for this week. What we'll talk about on Friday is caching.