New York Times' Connections: A Case Study on NLP in Word Games — Shafik Quoraishee, NYT Games

Chapters
0:00 Timestamps
1:45 Introduction to Connections
3:50 Why Connections is Interesting for AI
5:18 Human vs. AI Problem Solving
6:55 AI Analysis and Methodology
9:26 Semantic Similarity
10:45 Relational Alignment
12:16 Multi-dimensional Analysis
15:44 Graph Neural Networks (GNNs)
17:42 Motivation and Next Steps
Yes: AI, and a case study on The New York Times' Connections. The team I work on is our Games team, and I work on our popular games; some of you may have heard of or played Connections before. A little bit about me: I'm a game developer at the Times, I worked in media for most of my career, and I have a background in machine learning, mobile development, and data science, so this talk will probably involve a little bit of all of those.

Some caveats about what you're about to see, and I'll get through as much of it as I can, because some of it is a little dense. This is my own independent research and experimentation; it's not based on our internal research. We do have internal research, and I just need to clarify that separation. The results of this work are preliminary and subject to refinement with additional experimentation, and the purpose of this work is investigatory, not authoritative. It's mostly me trying to look into the realm of AI problem solving in the game space through our games, and it's mostly a fun exercise. There does exist third-party research on AI as it relates to solving Connections and other games, and that's an inspiration for this project, but the processes I'm describing here I did on my own, with my own research.
For those of you who are not familiar with the game: Connections was launched by The New York Times in beta in June 2023 and officially released to the greater world in August 2023. The game is edited by Wyna Liu, an awesome editor at the Times whom I work with on games; she creates the puzzles for Connections. It quickly became one of The New York Times' most played games, second only to Wordle, with hundreds of millions of plays within its first year. And just to mention: all the Connections puzzles, the game itself, the game mechanics, the programming, and so on are human-made, now and forever, as are all of our properties that fall into that category. The game is on the right, in case you have your mobile apps and want to download it.

How Connections works: you have a daily puzzle, and each puzzle provides 16 words to be grouped; the visualization on the right demonstrates play. The goal is to form four groups of four related words. Each word belongs to only one group, no overlaps, and players can make up to four incorrect guesses before losing. You can see that as the game finishes in this particular situation, the person has won.
This is where the AI and data science start to come into play. There's a difficulty structure to Connections. People who play the game probably know that yellow is the easiest category: when you actually solve the game, the most obvious connections are within the yellow category. Green is slightly less obvious; you have to stretch a little to think, but once you get those it's kind of like, okay, it's fairly clear how these things are related. The blue category is trickier themes, where you have things like sayings and idiomatic, lexical, or trivia-based themes. And then there's the infamous purple: for those of you who think of this as a difficult game, purple is the category that usually trips people up on the way to a perfect win, because there's a concept of decoys, overlaps, and misleads associated with the purple category.
So why is it an interesting game for AI analysis? It can actually challenge AI's ability to abstractly reason. At some point we all came to believe that LLMs can do absolutely everything, and some people think that's true, but one thing Connections has done throughout its lifetime is challenge LLMs to actually reach a hundred percent solve rate. The game's intentional decoys test whether AI can avoid overfitting to superficial similarities. Each solution requires clean and explainable reasoning, which aligns well with AI transparency: if you want to figure out how an AI reasoned through a game, this is a good process, because you do need some sort of reasoning, perhaps abstract reasoning, to solve the difficult purple category, and you get to see how the AI thinks about that.

Its fixed inputs and solutions make it a reproducible and scalable testbed. One of the things I particularly like about this game is that it could be a potential benchmarking tool, and people have used it as such to test the capabilities of AI; since the puzzles are the same and replayable, you can repeat the process. This is just me entering the puzzle into ChatGPT and ChatGPT giving the wrong solution. It's sort of an unfair comparison because this is the 4.0 model; the higher models do give better reasoning, but that's not exactly the point. I'll go into that a little later if I have time.
So how do humans solve Connections? There's the concept of System 1 versus System 2 thinking; many people might be familiar with this concept, which came out in the 90s. System 1 thinking is when you intuit relationships very fast: when you see two things that are obviously part of the same category, you don't have to think much, your brain makes that automatic judgment. System 2, slow and deliberate thinking, is where the deep kind of reasoning happens: you're like, oh, I'm struggling, I don't know which group this belongs to, let me use my knowledge base. To solve the games most effectively, most people use a hybrid, combining fast intuitive thinking with slow deep thinking, and that's what takes a lot of time in actually playing the game. Then you can suffer from canonical failures: System 1 failures, where you thought something looked obvious but you messed up because it actually belonged to a different category, or System 2 failures, where you thought too deeply about something that was actually the obvious one, so you overthought it. That's one of the fun parts of the game, if you think that's fun.

Connections studies and benchmarks do exist: there are several benchmarks for solving Connections with the later models that have come out. They're not by us, they're third-party benchmarks, but they do demonstrate the progressive capability of LLMs to solve the game. It's still not perfect, though, and as I mentioned before, there are caveats to that solvability.
Now I'm going into the deeper part, and I'll see how far I get through the rest of this, because this is where my AI analysis comes into play.

Here's the mathematics: your chances of winning the game if you're guessing randomly, using no intuition whatsoever. If you're totally randomly guessing from the complete initial board, you have about a zero percent chance of winning. Then the combinatorics of the Connections board indicate that if you get one category right and still need to randomly guess through the next three, you have roughly a 1 in 5,000 to 1 in 6,000 chance of winning, which is tiny. And most people get stuck on the third and fourth categories, where it's like, oh, I don't know, I got two, and now how do all these remaining words relate to each other?
So if at that point you're like, screw it, I'm going to guess randomly, then you have a 1 in 35 chance of winning, about a 3 percent chance. And that means, okay, you're doing pretty good, I think.
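As a sanity check on those odds, here's a minimal sketch of the combinatorics. This is my own illustration, not code from the talk, and it assumes one random guess per remaining group with no wrong guesses allowed:

```python
from math import comb

def p_random_solve(groups_left: int) -> float:
    """Probability of randomly guessing every remaining group in a row,
    given `groups_left` unsolved groups of 4 words each."""
    p = 1.0
    for g in range(groups_left, 1, -1):   # the final group is forced
        words_left = 4 * g
        # g correct groups hidden among all 4-word subsets of the board
        p *= g / comb(words_left, 4)
    return p

print(f"full board : 1 in {1 / p_random_solve(4):,.0f}")  # ~1 in 2.6 million
print(f"one solved : 1 in {1 / p_random_solve(3):,.0f}")  # 1 in 5,775
print(f"two solved : 1 in {1 / p_random_solve(2):,.0f}")  # 1 in 35
```

The middle figure, 1 in 5,775, is the "1 in 5,000 to 6,000" number from the talk.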
Now, the graph coloring problem; we're going to go into a little bit of CS. Who's familiar with the graph coloring problem? Graph coloring involves assigning colors to the vertices of a graph, where the graph is the structure on the right: basically a bunch of nodes and edges. There are some numbers associated with the graph that I'm not going to get too deep into. When you want to solve a graph coloring problem, you use algorithms that are popularly known. And the graph coloring problem, just for interest, has applications in all kinds of areas outside of pure science: scheduling, frequency assignment in wireless networks, and, as it turns out, Connections, because you can model Connections as an augmented graph coloring problem.

On the right is a Connections solver that I built, running over a bunch of puzzles and attempts. Each of the 16 words in Connections can be a vertex in this graph. The hidden categories that we already went over are color-coded. The goal is to color each word node with one of the four categories, such that all four words belonging to a specific category receive the same color. An edge is the strength of the connection that is thought to exist between two words; basically, how related they are. The reason that's important is that it creates the search space for an algorithm or an AI to play, or solve, the game effectively. Without it, you fall into the random range of sorting words, and the problem becomes much more difficult.
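To make the "augmented graph coloring" framing concrete, here's a minimal sketch of the search space it induces. This is my own illustration, not the solver from the talk, and `similarity` is a hypothetical stand-in for whatever edge-weight metric you choose:

```python
from itertools import combinations

def best_partition(words, similarity):
    """Exhaustively split 16 words into 4 groups of 4, scoring each
    partition by total intra-group edge weight.
    `similarity(a, b)` is any pairwise relatedness metric."""
    def group_score(group):
        return sum(similarity(a, b) for a, b in combinations(group, 2))

    def search(remaining):
        if not remaining:
            return 0.0, []
        anchor = remaining[0]   # fix one word to avoid duplicate partitions
        best = (float("-inf"), None)
        for rest in combinations(remaining[1:], 3):
            group = (anchor,) + rest
            left = [w for w in remaining if w not in group]
            sub_score, sub_groups = search(left)
            score = group_score(group) + sub_score
            if score > best[0]:
                best = (score, [group] + sub_groups)
        return best

    return search(list(words))
```

The 2,627,625 leaf partitions this brute force visits are exactly the count behind the random-guessing odds above; good edge weights are what let a real solver prune that space instead of enumerating it.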
I've named this section "semantic similarity is not all you need", if you get the transformer joke. It's hard to see here, but there's a tree of word relationship types. For example, anagrams could form a relationship type called orthography, the surface form of words; that alone could be an entire category. There's morphology, meaning things that have the same suffix, like kingdom and fiefdom. Then you have other semantic relationships that are not so obvious, such as Globe, Mirror, Post, and Sun all being part of a newspapers category. And then you have associative relationships: things that are red, things that are green, and so on. And going back to polysemy, things that can have multiple meanings: for example, a mole can be an animal, a birthmark, a spy, or a unit. Just for interest's sake, that polysemy, the multiple-meanings section, is where Connections is the most complicated on a base level; that's where people get tripped up.
So you have this concept of relational alignment: creating a metric that associates two words together. There are different metrics associated with relational alignment, and I created this heat map simulation on the right, which can pick different metrics and show, based on the metric calculations (which I'm not showing here), how strongly two words are associated.

The story I'm building up to is that relational alignment within puzzles can help you determine, in a computational way, whether a puzzle is easy or hard. This is an example from an easy puzzle from a while back versus a hard puzzle; the hard puzzle has been described as having a 19% solve rate. And you can see that there's a rough overarching coherence, a relational alignment score differential, between them. That differential tells you there's a computational process you can apply. You can also track time-variant relational alignment scores across categories and over time: some puzzles are easy, some are hard, and you can compute this metric and draw a graph over time. I'm not sure what happened here on 12/12; I think we had a broken puzzle that day. But the idea is that there could be patterns established from this alignment.
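A minimal sketch of one such coherence metric, assuming cosine similarity over word embeddings (`embed` is a hypothetical lookup into whatever embedding model you use; the talk doesn't specify its metrics):

```python
import numpy as np
from itertools import combinations

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def category_coherence(words, embed) -> float:
    """Mean pairwise cosine similarity within one 4-word category."""
    pairs = list(combinations(words, 2))
    return sum(cosine(embed(a), embed(b)) for a, b in pairs) / len(pairs)

def puzzle_alignment_profile(categories, embed):
    """One coherence score per category. Intuitively, an easy puzzle shows
    uniformly high scores; a hard one shows a lower, wider spread."""
    return {name: category_coherence(ws, embed)
            for name, ws in categories.items()}
```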
Now, if it were that easy, that would be great, but it's not, because just as we have multiple different semantic relationship types, we have multiple relational alignment scores. Two words can be related in multiple ways, with a different score for each; some pairs can be, for example, more strongly related morphologically than semantically. But the idea is that you can create a radar chart that maps out a surface you can analyze to see how the semantic space of a puzzle is shaped. That means you have yet another dimension along which your AI, your solver, or you yourself, if you want to think about this stuff more deeply, can analyze how to solve a puzzle.
Multi-dimensional relational alignment distribution was one of the things I was exploring. Another component of the system I built was a semantic distribution evaluation framework, where I take a bunch of our puzzles and build a categorical distribution over time. On the lower right, it may be hard to see, but there are the different relationship types I talked about, like hypernymy, morphology, and orthography, and the distribution of Connections categories across them over time. I'm counting which categories fall into which of these buckets day by day, shown as the dots on the right, and building a histogram of that association. You can then look at trends and use that trend data.
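A minimal sketch of that bucketing step. The scorers here are deliberately crude stand-ins of my own (the talk doesn't show its metric calculations); each maps a 4-word category to a score, and a category is binned under whichever relationship type scores it highest:

```python
from collections import Counter
from itertools import combinations

def shared_suffix(words):
    """Crude morphology score: fraction of pairs sharing a 3-letter suffix."""
    pairs = list(combinations(words, 2))
    return sum(a[-3:] == b[-3:] for a, b in pairs) / len(pairs)

def letter_overlap(words):
    """Crude orthography score: mean Jaccard overlap of letter sets."""
    pairs = list(combinations(words, 2))
    return sum(len(set(a) & set(b)) / len(set(a) | set(b))
               for a, b in pairs) / len(pairs)

# A real system would add embedding-based semantic/associative scorers too.
RELATION_SCORERS = {"morphology": shared_suffix, "orthography": letter_overlap}

def dominant_relation(words, scorers=RELATION_SCORERS):
    """Bin one category under its highest-scoring relationship type."""
    return max(scorers, key=lambda name: scorers[name](words))

def relation_histogram(daily_puzzles, scorers=RELATION_SCORERS):
    """Tally which relationship type dominates each category across days.
    `daily_puzzles` is an iterable of puzzles, each a list of four
    4-word categories."""
    return Counter(dominant_relation(cat, scorers)
                   for puzzle in daily_puzzles for cat in puzzle)
```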
Now it gets a little more involved. The graph coloring approximation is a search space reduction process, but then you get to a more complex structure: once you add the semantic relationships into the formulation, you actually get hypergraphs. This is just a three-dimensional hypergraph, but if you have multiple semantic relationship types, you have multi-dimensional hypergraphs. This is a demonstration of how the hypergraph converges in three dimensions: with all these semantic relationships, you get inter-cluster strengths between nodes in different categories, but you also get intra-cluster strength, which shows how strong the relationships are within the clusters that form. That's important because this graph gives you more dimensionality and, again, a more computational way for an AI, or any solver, to attack the problem.
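Here's a minimal sketch of those two quantities; the formulation (mean edge weight inside versus across proposed clusters) is my own reading of "intra-" and "inter-cluster strength", not a definition from the talk:

```python
import numpy as np

def cluster_strengths(weights: np.ndarray, labels: np.ndarray):
    """`weights` is a symmetric 16x16 relatedness matrix; `labels[i]` is the
    proposed category (0-3) of word i. Returns (intra, inter): mean edge
    weight within clusters vs. between them. A good grouping has
    intra >> inter."""
    iu = np.triu_indices(len(labels), k=1)   # each unordered pair once
    same = labels[iu[0]] == labels[iu[1]]    # True where pair shares a cluster
    return weights[iu][same].mean(), weights[iu][~same].mean()
```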
You can use different types of lexical databases or lexical constructors: for example, WordNet, ConceptNet, and other word embeddings can be used to construct these relationships. This is a flat example, a 2D graph of one puzzle, where you see relationships between different words: two words co-occur together, a cat hunts a mouse, a dog is related to a cat, a cat is capable of play, a cat is used as a pet. The complex dimensionality of these conceptual semantic graphs, derived from WordNet, ConceptNet, and word embeddings, gives you even more space. You see, I'm not just running a reasoning model or dumbly putting the puzzle into an LLM; I'm trying to increase the intelligence space procedurally, so that you have a trackable and explainable approach. Explainable AI, that's what I'm all about.
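As one concrete way to populate those edges, here's a minimal WordNet sketch using NLTK's interface (my choice of metric; the talk names WordNet and ConceptNet as sources but not the exact similarity function):

```python
# pip install nltk; then run nltk.download("wordnet") once
from nltk.corpus import wordnet as wn

def wordnet_relatedness(word_a: str, word_b: str) -> float:
    """Best path similarity between any senses of two words, in [0, 1].
    Taking the max over all synset pairs handles polysemy: 'mole' can
    score well against both 'spy' and 'animal'."""
    best = 0.0
    for syn_a in wn.synsets(word_a):
        for syn_b in wn.synsets(word_b):
            sim = syn_a.path_similarity(syn_b)
            if sim is not None and sim > best:
                best = sim
    return best

print(wordnet_relatedness("cat", "dog"))     # relatively high: both mammals
print(wordnet_relatedness("cat", "mirror"))  # much lower
```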
To get to the actual model that does this: is anyone familiar with graph neural networks? If you've used GNNs before, that's the primary approach here, since this is a geometric problem in a multi-dimensional space. The primary solver I'm using is a graph convolutional neural network, which takes the graph as input and produces candidate subgraphs, which could be solutions, as outputs. It's part of a two-part problem. Once the graph neural network is outputting candidate graphs, you have edge weights and node weights being optimized, and then you find out which graph fits into the candidate solutions, so you can track the actual structure of the graph, its cluster morphology, or how the graph looks.
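For flavor, here's a minimal hand-rolled graph convolution in PyTorch. It's a sketch of the general technique, not the architecture from the talk: two GCN layers that mix each word node's features with its neighbors', ending in four category logits per node:

```python
import torch
import torch.nn as nn

class MiniGCN(nn.Module):
    """Two-layer graph convolutional network over the 16-word graph."""
    def __init__(self, in_dim: int, hidden: int = 64, n_groups: int = 4):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.w2 = nn.Linear(hidden, n_groups)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (16, in_dim) word features; adj: (16, 16) edge weights
        a_hat = adj + torch.eye(adj.size(0))                    # self-loops
        deg = a_hat.sum(dim=1)
        norm = a_hat / torch.sqrt(deg[:, None] * deg[None, :])  # D^-1/2 A D^-1/2
        h = torch.relu(self.w1(norm @ x))
        return self.w2(norm @ h)                                # (16, 4) logits

# logits.argmax(dim=1) proposes a coloring; turning that into valid groups of
# exactly four words each is where the candidate-subgraph search comes in.
```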
The next visualization is the system diagram; I'm not going to go into it, it's a lot, but I'll just say it's the combination of a reinforcement learning agent with the graph neural network, which isolates the candidate graphs. I'm almost out of time, but this is how the visualization looks in 3-space. Once your semantic graphs are constructed and you have your graph candidates, the reinforcement learning system is the appropriate way to navigate these subclusters. You can't actually see the cluster points here because the visualization wasn't great, but the actual output is essentially: okay, this traversal is allowed to exist.
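Since the talk doesn't spell out the reinforcement learning formulation, here's only a loose, heavily hypothetical sketch of what an agent's guessing loop over GNN candidates could look like (epsilon-greedy selection, with the game's feedback as the reward signal):

```python
import random

def play_episode(candidates, check_guess, epsilon=0.1, max_mistakes=4):
    """`candidates` is a list of (group, score) pairs from a solver;
    `check_guess(group)` is the game oracle, True for a correct category.
    Mostly exploit the top-scoring candidate, occasionally explore."""
    mistakes, solved = 0, []
    pool = sorted(candidates, key=lambda c: c[1], reverse=True)
    while pool and mistakes < max_mistakes and len(solved) < 4:
        group, _ = random.choice(pool) if random.random() < epsilon else pool[0]
        if check_guess(group):
            solved.append(group)
            # drop candidates that reuse any solved word
            pool = [c for c in pool if not set(c[0]) & set(group)]
        else:
            mistakes += 1
            pool = [c for c in pool if c[0] != group]  # never repeat a miss
    return solved, mistakes
```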
So, after all of this, the solvability rate from before to after, for a small subset of hard puzzles, increases somewhat reasonably. But again, this is a work in progress; I tried it against a few puzzles before and after, and the idea is to extend it, make it more involved across even more puzzles, and then get to game development.

LLMs are great at a bunch of stuff, but they can often make their own mistakes. LLMs are often trained on internet data, and the puzzle solutions are available on the internet, so who knows if they're just pulling the solutions from the internet; that's one thing that bothered me about LLMs solving these problems. And LLM solutions are still a black box. So the idea is to connect this to the ARC-AGI benchmark and that kind of thing; that benchmark has a solvability association. We have some next steps, and this is pretty much the end of my presentation. I ran a few seconds over, but if you're interested, talk to me later.