New York Times' Connections: A Case Study on NLP in Word Games — Shafik Quoraishee, NYT Games

Chapters
0:00 Timestamps
1:45 Introduction to Connections
3:50 Why Connections is Interesting for AI
5:18 Human vs. AI Problem Solving
6:55 AI Analysis and Methodology
9:26 Semantic Similarity
10:45 Relational Alignment
12:16 Multi-dimensional Analysis
15:44 Graph Neural Networks (GNNs)
17:42 Motivation and Next Steps
Yes: AI, and a case study on The New York Times' Connections. The team I work on is our Games team, and I work on our popular games; some of you may have heard of or played Connections before. A little bit about me: I'm a game developer at the Times, I worked in media for most of my career, and I have a background in machine learning, mobile development, and data science, so this talk will probably involve a little bit of all of those.

Some caveats about what you're about to see, and I'll get through as much of it as I can, because some of it is a little dense. This is my own independent research and experimentation; it's not based on our internal research. We do have internal research, and I just need to clarify that separation. The results of this work are preliminary and subject to refinement with additional experimentation, and the purpose of this work is investigatory, not authoritative. It's mostly me trying to look into the realm of AI problem solving in the game space through our games, and it's mostly a fun exercise. There does exist third-party research on AI as it relates to solving Connections and other games, and that's an inspiration for this project, but the processes I'm describing here I did on my own, with my own research.
For those of you who are not familiar with the game: Connections was launched by The New York Times in beta in June 2023 and officially released to the greater world in August 2023. The game is edited by Wyna Liu, an awesome editor at the Times whom I work with on games; she creates the puzzles for Connections. It quickly became one of The New York Times' most played games, second only to Wordle, with hundreds of millions of plays within its first year. And just to mention: all the Connections puzzles, the game itself, the game mechanics, the programming, and so on are human-made, now and forever, as are all of our properties that fall into that category. The game is on the right, in case you have your mobile apps and want to download it.

How Connections works: you have a daily puzzle, and each puzzle provides 16 words to be grouped; the visualization on the right demonstrates play. The goal is to form four groups of four related words. Each word belongs to only one group, no overlaps, and players can make up to four incorrect guesses before losing. You can see that as the game finishes in this particular situation, the person has won.
This is where the AI and data science start to come into play. There's a difficulty structure to Connections. People who play the game probably know that yellow is the easiest category: when you actually solve the game, the most obvious connections are within the yellow category. Green is slightly less obvious; you have to stretch a little to think, but once you get those it's kind of like, okay, it's fairly clear how these things are related. The blue category is trickier themes, where you have things like sayings and idiomatic, lexical, or trivia-based themes. And then there's the infamous purple: for those of you who think of this as a difficult game, purple is the category that usually trips people up on the way to a perfect win, because there's a concept of decoys, overlaps, and misleads associated with the purple category.
So why is it an interesting game for AI analysis? It can actually challenge AI's ability to abstractly reason. At some point we all came to believe that LLMs can do absolutely everything, and some people think that's true, but one thing Connections has done throughout its lifetime is challenge LLMs to actually reach a hundred percent solve rate. The game's intentional decoys test whether AI can avoid overfitting to superficial similarities. Each solution requires clean and explainable reasoning, which aligns well with AI transparency: if you want to figure out how an AI reasoned through a game, this is a good process, because you do need some sort of reasoning, perhaps abstract reasoning, to solve the difficult purple category, and you get to see how the AI thinks about that.

Its fixed inputs and solutions make it a reproducible and scalable testbed. One of the things I particularly like about this game is that it could be a potential benchmarking tool, and people have used it as such to test the capabilities of AI; since the puzzles are the same and replayable, you can repeat the process. This is just me entering the puzzle into ChatGPT and ChatGPT giving the wrong solution. It's sort of an unfair comparison because this is the 4.0 model; the higher models do give better reasoning, but that's not exactly the point. I'll go into that a little later if I have time.
So how do humans solve Connections? There's the concept of System 1 versus System 2 thinking; many people might be familiar with this concept, which came out in the 90s. System 1 thinking is when you intuit relationships very fast: when you see two things that are obviously part of the same category, you don't have to think much, your brain makes that automatic judgment. System 2, slow and deliberate thinking, is where the deep kind of reasoning happens: you're like, oh, I'm struggling, I don't know which group this belongs to, let me use my knowledge base. To solve the games most effectively, most people use a hybrid, combining fast intuitive thinking with slow deep thinking, and that's what takes a lot of time in actually playing the game. Then you can suffer from canonical failures: System 1 failures, where you thought something looked obvious but you messed up because it actually belonged to a different category, or System 2 failures, where you thought too deeply about something that was actually the obvious one, so you overthought it. That's one of the fun parts of the game, if you think that's fun.

Connections studies and benchmarks do exist: there are several benchmarks for solving Connections with the later models that have come out. They're not by us, they're third-party benchmarks, but they do demonstrate the progressive capability of LLMs to solve the game. It's still not perfect, though, and as I mentioned before, there are caveats to that solvability.
Now I'm going into the deeper part, and I'll see how far I get through the rest of this, because this is where my AI analysis comes into play.

Here's the mathematics: your chances of winning the game if you're guessing randomly, using no intuition whatsoever. If you're totally randomly guessing from the complete initial board, you have about a zero percent chance of winning. Then the combinatorics of the Connections board indicate that if you get one category right and still need to randomly guess through the next three, you have roughly a 1 in 5,000 to 1 in 6,000 chance of winning, which is tiny. And most people get stuck on the third and fourth categories, where it's like, oh, I don't know, I got two, and now how do all these remaining words relate to each other?
So if at that point you're like, screw it, I'm going to guess randomly, then you have a 1 in 35 chance of winning, about a 3 percent chance. And that means, okay, you're doing pretty good, I think.
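As a sanity check on those odds, here's a minimal sketch of the combinatorics. This is my own illustration, not code from the talk, and it assumes one random guess per remaining group with no wrong guesses allowed:

```python
from math import comb

def p_random_solve(groups_left: int) -> float:
    """Probability of randomly guessing every remaining group in a row,
    given `groups_left` unsolved groups of 4 words each."""
    p = 1.0
    for g in range(groups_left, 1, -1):   # the final group is forced
        words_left = 4 * g
        # g correct groups hidden among all 4-word subsets of the board
        p *= g / comb(words_left, 4)
    return p

print(f"full board : 1 in {1 / p_random_solve(4):,.0f}")  # ~1 in 2.6 million
print(f"one solved : 1 in {1 / p_random_solve(3):,.0f}")  # 1 in 5,775
print(f"two solved : 1 in {1 / p_random_solve(2):,.0f}")  # 1 in 35
```

The middle figure, 1 in 5,775, is the "1 in 5,000 to 6,000" number from the talk.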
Now, the graph coloring problem; we're going to go into a little bit of CS. Who's familiar with the graph coloring problem? Graph coloring involves assigning colors to the vertices of a graph, where the graph is the structure on the right: basically a bunch of nodes and edges. There are some numbers associated with the graph that I'm not going to get too deep into. When you want to solve a graph coloring problem, you use algorithms that are popularly known. And the graph coloring problem, just for interest, has applications in all kinds of areas outside of pure science: scheduling, frequency assignment in wireless networks, and, as it turns out, Connections, because you can model Connections as an augmented graph coloring problem.

On the right is a Connections solver that I built, running over a bunch of puzzles and attempts. Each of the 16 words in Connections can be a vertex in this graph. The hidden categories that we already went over are color-coded. The goal is to color each word node with one of the four categories, such that all four words belonging to a specific category receive the same color. An edge is the strength of the connection that is thought to exist between two words; basically, how related they are. The reason that's important is that it creates the search space for an algorithm or an AI to play, or solve, the game effectively. Without it, you fall into the random range of sorting words, and the problem becomes much more difficult.
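To make the "augmented graph coloring" framing concrete, here's a minimal sketch of the search space it induces. This is my own illustration, not the solver from the talk, and `similarity` is a hypothetical stand-in for whatever edge-weight metric you choose:

```python
from itertools import combinations

def best_partition(words, similarity):
    """Exhaustively split 16 words into 4 groups of 4, scoring each
    partition by total intra-group edge weight.
    `similarity(a, b)` is any pairwise relatedness metric."""
    def group_score(group):
        return sum(similarity(a, b) for a, b in combinations(group, 2))

    def search(remaining):
        if not remaining:
            return 0.0, []
        anchor = remaining[0]   # fix one word to avoid duplicate partitions
        best = (float("-inf"), None)
        for rest in combinations(remaining[1:], 3):
            group = (anchor,) + rest
            left = [w for w in remaining if w not in group]
            sub_score, sub_groups = search(left)
            score = group_score(group) + sub_score
            if score > best[0]:
                best = (score, [group] + sub_groups)
        return best

    return search(list(words))
```

The 2,627,625 leaf partitions this brute force visits are exactly the count behind the random-guessing odds above; good edge weights are what let a real solver prune that space instead of enumerating it.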
I've named this section "semantic similarity is not all you need", if you get the transformer joke. It's hard to see here, but there's a tree of word relationship types. For example, anagrams could form a relationship type called orthography, the surface form of words; that alone could be an entire category. There's morphology, meaning things that have the same suffix, like kingdom and fiefdom. Then you have other semantic relationships that are not so obvious, such as Globe, Mirror, Post, and Sun all being part of a newspapers category. And then you have associative relationships: things that are red, things that are green, and so on. And going back to polysemy, things that can have multiple meanings: for example, a mole can be an animal, a birthmark, a spy, or a unit. Just for interest's sake, that polysemy, the multiple-meanings section, is where Connections is the most complicated on a base level; that's where people get tripped up.
So you have this concept of relational alignment: creating a metric that associates two words together. There are different metrics associated with relational alignment, and I created this heat map simulation on the right, which can pick different metrics and show, based on the metric calculations (which I'm not showing here), how strongly two words are associated.

The story I'm building up to is that relational alignment within puzzles can help you determine, in a computational way, whether a puzzle is easy or hard. This is an example from an easy puzzle from a while back versus a hard puzzle; the hard puzzle has been described as having a 19% solve rate. And you can see that there's a rough overarching coherence, a relational alignment score differential, between them. That differential tells you there's a computational process you can apply. You can also track time-variant relational alignment scores across categories and over time: some puzzles are easy, some are hard, and you can compute this metric and draw a graph over time. I'm not sure what happened here on 12/12; I think we had a broken puzzle that day. But the idea is that there could be patterns established from this alignment.
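A minimal sketch of one such coherence metric, assuming cosine similarity over word embeddings (`embed` is a hypothetical lookup into whatever embedding model you use; the talk doesn't specify its metrics):

```python
import numpy as np
from itertools import combinations

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def category_coherence(words, embed) -> float:
    """Mean pairwise cosine similarity within one 4-word category."""
    pairs = list(combinations(words, 2))
    return sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs)

def puzzle_alignment_profile(categories, embed):
    """One coherence score per category. Intuitively, an easy puzzle shows
    uniformly high scores; a hard one shows a lower, wider spread."""
    return {name: category_coherence(ws, embed)
            for name, ws in categories.items()}
```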
Now, if it were that easy, that would be great, but it's not, because just as we have multiple different semantic relationship types, we have multiple relational alignment scores. Two words can be related in multiple ways, with a different score for each; some pairs can be, for example, more strongly related morphologically than semantically. But the idea is that you can create a radar chart that maps out a surface you can analyze to see how the semantic space of a puzzle is shaped. That means you have yet another dimension along which your AI, your solver, or you yourself, if you want to think about this stuff more deeply, can analyze how to solve a puzzle.
Multi-dimensional relational alignment distribution was one of the things I was exploring. Another component of the system I built was a semantic distribution evaluation framework, where I take a bunch of our puzzles and build a categorical distribution over time. On the lower right, it may be hard to see, but there are the different relationship types I talked about, like hypernymy, morphology, and orthography, and the distribution of Connections categories across them over time. I'm counting which categories fall into which of these buckets day by day, shown as the dots on the right, and building a histogram of that association. You can then look at trends and use that trend data.
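A minimal sketch of that bucketing step. The scorers here are deliberately crude stand-ins of my own (the talk doesn't show its metric calculations); each maps a 4-word category to a score, and a category is binned under whichever relationship type scores it highest:

```python
from collections import Counter
from itertools import combinations

def shared_suffix(words):
    """Crude morphology score: fraction of pairs sharing a 3-letter suffix."""
    pairs = list(combinations(words, 2))
    return sum(a[-3:] == b[-3:] for a, b in pairs) / len(pairs)

def letter_overlap(words):
    """Crude orthography score: mean Jaccard overlap of letter sets."""
    pairs = list(combinations(words, 2))
    return sum(len(set(a) & set(b)) / len(set(a) | set(b))
               for a, b in pairs) / len(pairs)

# A real system would add embedding-based semantic/associative scorers too.
RELATION_SCORERS = {"morphology": shared_suffix, "orthography": letter_overlap}

def dominant_relation(words, scorers=RELATION_SCORERS):
    """Bin one category under its highest-scoring relationship type."""
    return max(scorers, key=lambda name: scorers[name](words))

def relation_histogram(daily_puzzles, scorers=RELATION_SCORERS):
    """Tally which relationship type dominates each category across days.
    `daily_puzzles` is an iterable of puzzles, each a list of four
    4-word categories."""
    return Counter(dominant_relation(cat, scorers)
                   for puzzle in daily_puzzles for cat in puzzle)
```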
Now it gets a little more involved. The graph coloring approximation is a search space reduction process, but then you get to a more complex structure: once you add the semantic relationships into the formulation, you actually get hypergraphs. This is just a three-dimensional hypergraph, but if you have multiple semantic relationship types, you have multi-dimensional hypergraphs. This is a demonstration of how the hypergraph converges in three dimensions: with all these semantic relationships, you get inter-cluster strengths between nodes in different categories, but you also get intra-cluster strength, which shows how strong the relationships are within the clusters that form. That's important because this graph gives you more dimensionality and, again, a more computational way for an AI, or any solver, to attack the problem.
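Here's a minimal sketch of those two quantities; the formulation (mean edge weight inside versus across proposed clusters) is my own reading of "intra-" and "inter-cluster strength", not a definition from the talk:

```python
import numpy as np

def cluster_strengths(weights: np.ndarray, labels: np.ndarray):
    """`weights` is a symmetric 16x16 relatedness matrix; `labels[i]` is the
    proposed category (0-3) of word i. Returns (intra, inter): mean edge
    weight within clusters vs. between them. A good grouping has
    intra >> inter."""
    iu = np.triu_indices(len(labels), k=1)   # each unordered pair once
    same = labels[iu[0]] == labels[iu[1]]    # True where pair shares a cluster
    return weights[iu][same].mean(), weights[iu][~same].mean()
```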
You can use different types of lexical databases or lexical constructors: for example, WordNet, ConceptNet, and other word embeddings can be used to construct these relationships. This is a flat example, a 2D graph of one puzzle, where you see relationships between different words: two words co-occur together, a cat hunts a mouse, a dog is related to a cat, a cat is capable of play, a cat is used as a pet. The complex dimensionality of these conceptual semantic graphs, derived from WordNet, ConceptNet, and word embeddings, gives you even more space. You see, I'm not just running a reasoning model or dumbly putting the puzzle into an LLM; I'm trying to increase the intelligence space procedurally, so that you have a trackable and explainable approach. Explainable AI, that's what I'm all about.
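As one concrete way to populate those edges, here's a minimal WordNet sketch using NLTK's interface (my choice of metric; the talk names WordNet and ConceptNet as sources but not the exact similarity function):

```python
# pip install nltk; then run nltk.download("wordnet") once
from nltk.corpus import wordnet as wn

def wordnet_relatedness(word_a: str, word_b: str) -> float:
    """Best path similarity between any senses of two words, in [0, 1].
    Taking the max over all synset pairs handles polysemy: 'mole' can
    score well against both 'spy' and 'animal'."""
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            sim = syn_a.path_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best

print(wordnet_relatedness("cat", "dog"))     # relatively high: both mammals
print(wordnet_relatedness("cat", "mirror"))  # much lower
```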
To get to the actual model that does this: is anyone familiar with graph neural networks? If you've used GNNs before, that's the primary approach here, since this is a geometric problem in a multi-dimensional space. The primary solver I'm using is a graph convolutional neural network, which takes the graph as input and produces candidate subgraphs, which could be solutions, as outputs. It's part of a two-part problem. Once the graph neural network is outputting candidate graphs, you have edge weights and node weights being optimized, and then you find out which graph fits into the candidate solutions, so you can track the actual structure of the graph, its cluster morphology, or how the graph looks.
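For flavor, here's a minimal hand-rolled graph convolution in PyTorch. It's a sketch of the general technique, not the architecture from the talk: two GCN layers that mix each word node's features with its neighbors', ending in four category logits per node:

```python
import torch
import torch.nn as nn

class MiniGCN(nn.Module):
    """Two-layer graph convolutional network over the 16-word graph."""
    def __init__(self, in_dim: int, hidden: int = 64, n_groups: int = 4):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.w2 = nn.Linear(hidden, n_groups)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (16, in_dim) word features; adj: (16, 16) edge weights
        a_hat = adj + torch.eye(adj.size(0))                    # self-loops
        deg = a_hat.sum(dim=1)
        norm = a_hat / torch.sqrt(deg[:, None] * deg[None, :])  # D^-1/2 A D^-1/2
        h = torch.relu(self.w1(norm @ x))
        return self.w2(norm @ h)                                # (16, 4) logits

# logits.argmax(dim=1) proposes a coloring; turning that into valid groups of
# exactly four words each is where the candidate-subgraph search comes in.
```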
The next visualization is the system diagram; I'm not going to go into it, it's a lot, but I'll just say it's the combination of a reinforcement learning agent with the graph neural network, which isolates the candidate graphs. I'm almost out of time, but this is how the visualization looks in 3-space. Once your semantic graphs are constructed and you have your graph candidates, the reinforcement learning system is the appropriate way to navigate these subclusters. You can't actually see the cluster points here because the visualization wasn't great, but the actual output is essentially: okay, this traversal is allowed to exist.
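Since the talk doesn't spell out the reinforcement learning formulation, here's only a loose, heavily hypothetical sketch of what an agent's guessing loop over GNN candidates could look like (epsilon-greedy selection, with the game's feedback as the reward signal):

```python
import random

def play_episode(candidates, check_guess, epsilon=0.1, max_mistakes=4):
    """`candidates` is a list of (group, score) pairs from a solver;
    `check_guess(group)` is the game oracle, True for a correct category.
    Mostly exploit the top-scoring candidate, occasionally explore."""
    mistakes, solved = 0, []
    pool = sorted(candidates, key=lambda c: c[1], reverse=True)
    while pool and mistakes < max_mistakes and len(solved) < 4:
        group, _ = random.choice(pool) if random.random() < epsilon else pool[0]
        if check_guess(group):
            solved.append(group)
            # drop candidates that reuse any solved word
            pool = [c for c in pool if not set(c[0]) & set(group)]
        else:
            mistakes += 1
            pool = [c for c in pool if c[0] != group]  # never repeat a miss
    return solved, mistakes
```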
So, after all of this, the solvability rate from before to after, for a small subset of hard puzzles, increases somewhat reasonably. But again, this is a work in progress; I tried it against a few puzzles before and after, and the idea is to extend it, make it more involved across even more puzzles, and then get to game development.

LLMs are great at a bunch of stuff, but they can often make their own mistakes. LLMs are often trained on internet data, and the puzzle solutions are available on the internet, so who knows if they're just pulling the solutions from the internet; that's one thing that bothered me about LLMs solving these problems. And LLM solutions are still a black box. So the idea is to connect this to the ARC-AGI benchmark and that kind of thing; that benchmark has a solvability association. We have some next steps, and this is pretty much the end of my presentation. I ran a few seconds over, but if you're interested, talk to me later.