Monday, January 17, 2022

Investigating Wordle guesses

Introduction

Those of you who know me know that I like to play somewhat complicated games. One of the reasons I enjoy that is the challenge of figuring out how the game works and how I might play the game better by using my understanding to develop rules of thumb for making decisions during the game. 

I also enjoy writing Python scripts when I've found an interesting problem to solve. In this particular case, I ended up spending most of the MLK holiday weekend working on a Python script to investigate the best words to use as guesses in Wordle, that online game that has gone viral.

I've been playing for less than two weeks, and during that time I've had an excessive number of discussions with my wife and two daughters about strategies for word choices. One such discussion was about "what's the best first word?" It seemed clear that one would like a word that contained common letters in order to increase the probability of getting correct guesses early. A much longer discussion was centered around whether it would be a better strategy to pick a second word that was complementary to the first word (containing different common letters, e.g. vowels) in order to get the most information during the first two guesses, or whether the second move should capitalize on information gained in the first guess (e.g. limiting the second guess to words that would both help discover new letters and determine the positions of any letters discovered in the first guess).

One appealing feature of this problem is that it is highly tractable using a modern computer. The number of 5-letter English words is limited and a script could sort them out in a minuscule amount of time. Thus a simple and brute-force approach would be totally practical.

Background knowledge

There are several key things that one would want to know before starting out. The first is whether game creator Josh Wardle is tricky and tries to pick "hard" words as a human opponent would, or whether the game words are random. In an interview on The World, he said that he wanted the words to be random so that he could play himself.

Another important thing to know is what set of words is actually used as the source for the game. I don't know the answer to that question, but in the interview, he mentioned that the words were drawn from a list of about 2500 English words. There was also a controversy about the January 12 word "favor", which raised the ire of British users who didn't consider it a proper five-letter word because they thought it should be spelled "favour". This is an indication that the word list might be derived from an American English rather than a British English word list.

When I first started playing around, I tried extracting words from several random English word lists that I found on the Internet. However, they either included a lot of questionable words to give Scrabble players an edge, or were polluted with capitalized proper names, abbreviations, etc. I finally came across a very high quality curated list of words called the "6 of 12" list (so named because it contains words found in at least 6 of the 12 dictionaries used as sources). This list is heavily curated and very clean: proper names are all capitalized, and abbreviations without periods or spaces are specially marked so they can be excluded. After running some code to screen out names, abbreviations, and words with lengths other than 5, I came up with a clean list of 2529 words as a source for experimentation. Not only is this list about the same size as the list used by Wardle for the game, but all of the words used in the game since 1 January 2022 are on it. (Spoiler alert: the previous link is updated daily, so following it may reveal the answer to today's puzzle if you haven't done it already.) So I consider my list to be a very good proxy for the possible words in the game.

The script

If you want to look at and try the code I'm going to discuss, you can run it yourself on a Colab notebook without installing anything. (You must have and be logged in to a Google account, however.) If you want to edit and save the code, select "Save a copy in Drive" from the file menu and you will be able to save your work. The data are on GitHub, so you don't have to download anything either. Before all of you software engineers start sharpening your knives, I'm not a professional coder, so be kind.

A major part of the code is a Wordle_list object. If you instantiate it without any arguments, you just get the full word list from GitHub. If you pass in a "guess code" and a Python list when you instantiate a Wordle_list, the code will be used to screen out words from the input list according to the information given in the guess code (equivalent to what you see on the screen after you've made a guess in the game). Values of instance variables and results of methods on a Wordle_list instance will tell you things about the screened list, such as how many words are in it and information about the frequencies of letters in the words.
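As a rough sketch of the kind of screening a guess code implies (my own illustration, not the notebook's actual Wordle_list code; it ignores the repeated-letter subtleties discussed later), consider:

```python
def screen(words, code):
    """Filter a word list using a guess code like 'S,-t,a,-l,E'.
    Uppercase token: keep words with that letter in that position.
    Lowercase token: keep words containing the letter, but elsewhere.
    '-x' token: drop words containing the letter at all.
    (Simplified: repeated guess letters are not handled.)"""
    survivors = []
    for word in words:
        keep = True
        for pos, token in enumerate(code.split(',')):
            if token.startswith('-'):
                keep = token[1:] not in word
            elif token.isupper():
                keep = word[pos] == token.lower()
            else:
                keep = token in word and word[pos] != token
            if not keep:
                break
        if keep:
            survivors.append(word)
    return survivors

# 'sauce' survives: it starts with S, ends with E, has an A (not in
# the middle slot), and contains neither T nor L.
print(screen(['sauce', 'stare', 'space', 'arose'], 'S,-t,a,-l,E'))
```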

The other major part of the code is the score_words_by_frequency() function. This is the actual "experimental" part of the code where I played around with various ways to score the words on the list to identify which words would make the best next guess. I will talk a lot more about this later.

To actually use the code, run the first cell to define everything, then scroll down to the "Screening algorithm test" section and start running the code in the next cell. There are some instructions you can read there, but here's the TLDR information:

1. To use the code with an actual puzzle, change the value of actual_word to an empty string (two consecutive single quote characters: '').  That will stop the script from trying to suggest guess codes for some other word. If you want to test one of the previous puzzle words (or any word) set the value of  actual_word to the word you want. 

2. Before you run each of the guess cells, you have to set the value of the next_guess_code variable. If you are testing using an actual known word entered in step 1, a guess code will be suggested for you at the end of the output of the previous cell using the highest ranking word based on the criteria used by the word-scoring function.  It will be used automatically if you run the next cell without changing the value of next_guess_code. If you are using the output from the game, set the value of next_guess_code using an uppercase letter for a correct letter in the correct position, a lowercase letter for a correct letter in the wrong place, and a lowercase letter prefixed by a dash for an incorrect letter. Separate the individual letter codes by commas but no spaces.

Here's the guess code for a guess of "serge" where the game showed the S and the final E in the correct positions, the R in the wrong position, and the first E and the G absent: 'S,-e,r,-g,E'.

3. As you run each cell, it will show you how it has reduced the number of possible answer words by applying various screens based on the guess, then show you the top five scoring words for the next guess.

4. Repeat entering guess codes and running subsequent cells until there is only a single word remaining.

Scoring words

The most important open question is how words should be scored for selecting the next guess. Here's the general approach I took:

1. Determine the distribution of letters in the words by counting up how many times they occurred.

2. Order the letters of the alphabet from highest frequency to lowest.

3. Assign a rank to each letter based on its position (0=most frequently used, 25=least frequently used). Ties get the same rank, and the following ranks are skipped until the next non-tied letter (e.g. 0, 1, 1, 3, ...).

4. Score a word by adding up the ranks for each of the five letters in the word. The lowest score is the best word.
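The four steps can be sketched like this (a simplified stand-alone version; the function names are mine, and the notebook's implementation differs in detail):

```python
from collections import Counter

def letter_ranks(words):
    """Rank letters from 0 (most frequent) to 25 (least frequent),
    counting every occurrence of every letter. Tied letters share a
    rank and the following ranks are skipped (e.g. 0, 1, 1, 3, ...)."""
    counts = Counter()
    for word in words:
        counts.update(word)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        counts.setdefault(letter, 0)   # unseen letters rank worst
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    ranks = {}
    for position, (letter, count) in enumerate(ordered):
        if position > 0 and count == ordered[position - 1][1]:
            ranks[letter] = ranks[ordered[position - 1][0]]  # tie
        else:
            ranks[letter] = position
    return ranks

def word_score(word, ranks):
    """Sum of letter ranks; the lowest-scoring word is the best."""
    return sum(ranks[letter] for letter in word)
```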

This sounds pretty simple, and most people use some variation of it in their heads based on knowledge they have about letter use from everyday life, playing Scrabble, trying to crack ciphers (didn't everyone do that?), etc. A lot of our early discussion about Wordle guesses revolved around what the most common letters were. For example, we realized that what counts is the frequency of letters in dictionary words, not their frequency in words "in the wild", since a relatively small number of words are used a lot in normal text (e.g. "their", "there", "where", etc.). But what matters the most is the frequency of letters in the 2529 words at play in the game, and since we have the list and a computer, we can know anything we want about letter distributions.

A key consideration is whether we should care more about the overall distribution of letters anywhere in the words or if we should be more concerned about the distribution of letters in particular positions within the words. For example, the letter "y" is not that common overall, but is a lot more frequent in the last position of the word. The overall distribution is particularly important early in the game when nothing is known about any position, but later in the game, the distribution in particular positions becomes more important as only some positions remain undetermined. 

The other consideration is that on any particular turn, we don't really care about the distributions of letters in all 2529 words, but only about the distributions of the words that haven't yet been eliminated. An unassisted human player would lack concrete information about this, but with a script, it's easily known. 

When a new Wordle_list object is created after each guess, the frequencies and ordinal positions of every letter are calculated both for the words as a whole and for each of the five positions, using the words that remain after eliminating words using the guess code. The ordinal positions for the letters are then available for calculating the word scores.
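Per-position frequencies are just five separate counts. A minimal illustration (again my own sketch, not the notebook's code):

```python
from collections import Counter

def position_counts(words):
    """counts[i][c] is the number of words with letter c in slot i."""
    counts = [Counter() for _ in range(5)]
    for word in words:
        for i, letter in enumerate(word):
            counts[i][letter] += 1
    return counts

counts = position_counts(['arose', 'stare', 'aster'])
# counts[4] shows how often each letter ends a word in this tiny list
print(counts[4])
```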

When I first started playing with this, I calculated two sets of scores: "overall scores" for words with unique letter combinations (no repeated letters) based on the overall letter frequencies, and "position scores" for all words based on the separate letter frequencies of each of the five positions in the word. Words that scored well by the first system were the most efficient at discovering/eliminating common letters. Words that scored well by the second system weren't as efficient at discovering letters, but were more efficient at placing letters in the correct position. There are some pairs of words like "aster" and "stare" that have equally good overall scores because they have the same letters. But "stare" has a much better position score than "aster" (14 vs. 30), partly because a lot of words end in "e". So if two words have the same overall score, pick the one with the better position score.

Having two different scores was not a good solution, so I then tried to think how I could combine them into a single score. The easy solution was to assign weights to the two scores. I somewhat arbitrarily chose 0.7 for the overall score and 0.3 for the position score. This gave more weight to efficient letter discovery, while letting the position score act as a sort of "tiebreaker". This solution worked fairly well, but I quickly discovered two problems.

The most obvious problem was that I was only calculating the overall scores for words with unique letters. How should I score words in which the same letter occurred more than once? The answer came to me when I realized that from the standpoint of letter discovery, repeating a letter is worse than using even the worst letter because it gives you no new information. Therefore, I assigned a score of 26 to the second (or more) instance of a letter in a word. 

However, this caused an unintended consequence. Particularly late in the game, it may actually be more efficient to use words with repeated letters, depending on the distribution of letters in the undetermined positions. The 26-point penalty on the overall score for repeated letters was so severe that it basically prevented repeat-letter words from ever being selected. The solution that I chose was to reduce the overall score weight (from 0.7 to 0) and increase the position weight (from 0.3 to 1) as the number of words decreased towards zero. That made the repeat-letter penalty disappear late in the game and had a secondary benefit of emphasizing letter selection (through the overall score) in the first few guesses and emphasizing position selection in the later guesses.

This "sliding scale" of weights is what is used in the final version of the script, although you can manually adjust the base scoring weights of 0.7 and 0.3 in the initial setup.
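Here's a sketch of how such a sliding scale could look. The linear interpolation and the argument names are my assumptions; the notebook's exact formula may differ:

```python
def combined_score(word, overall_ranks, position_ranks,
                   words_remaining, total_words,
                   base_overall=0.7, base_position=0.3):
    """Blend an overall-frequency score with a positional score.
    A repeated letter gets a penalty rank of 26 in the overall score;
    as the candidate list shrinks, the overall weight slides from
    base_overall toward 0 and the position weight toward 1, so the
    repeat-letter penalty fades late in the game."""
    seen = set()
    overall = 0
    for letter in word:
        # A repeat discovers nothing new, so it scores worse than
        # even the rarest letter (rank 25).
        overall += 26 if letter in seen else overall_ranks[letter]
        seen.add(letter)
    positional = sum(position_ranks[i][letter]
                     for i, letter in enumerate(word))
    fraction = words_remaining / total_words   # ~1.0 early, ~0.0 late
    w_overall = base_overall * fraction
    w_position = base_position + (1 - base_position) * (1 - fraction)
    return w_overall * overall + w_position * positional
```

With all letter ranks set to zero, a word like "eerie" still pays 26 for each of its two repeated e's early in the game, but that cost vanishes as words_remaining approaches zero.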

The optimal second guess

One of the longest points of discussion between my wife and me was whether one should always use two great "letter discovery" words in the first two guesses, or to vary the second guess by choosing the best word based on information gained in the first guess. I could see benefits to both strategies. The "letter discovery" system could be very efficient at nailing down the letters and eliminating a lot of words in the first two guesses. It is also a very simple strategy that doesn't require any thinking about what words might or might not have been eliminated until the third guess. The "information based" second-guess system makes use of what has been learned about letters in the first guess to allow the selection of a tailored second guess based on the remaining words (at least if you have a computer to keep track of the words for you). Once the program was finished, I had an opportunity to test the two approaches using the script. 

One thing that you'll notice if you use the script is that the first guess is always "arose", since the selection of the best scoring word from the set is deterministic. There is a section of the notebook labeled "Test of rating words by frequencies" whose first cell finds the 10 best-scoring words with unique letters. The second cell then eliminates all of the unique-letter words that have any of the letters from the first word in order to score the best "complementary" second word to go with the first word. The default is to use "arose" as the first word, and it results in "glint" as the best second word to go with it. You can hard code one of the other first words as the value of first_word in the second cell to find its complement, e.g. "stare" (first word) and "doily" (second word).

If you run the cell in the "Calculate stats for raw list" section, you'll find that the overall order of frequencies for letters in the five-letter word set is e, a, r, o, t, s, l, i, n, u, c, y, d, h, ... . "arose" includes five of the six most common letters, and adding "glint" gets nine of the top ten. "stare" includes five of the top six, and adding "doily" gets all of the top eight and ten of the top thirteen.

I ran a test using the 17 game words from 1 January to today, comparing the strategies of always using "arose" as the first word and "glint" as the second word vs. using "arose" as the first word and letting the script's scoring system select the second word. The metric was the number of words that remained after screening words out using the guesses. In three cases, it was a tie, with both strategies resulting in the same number of words (1 or 2 words left). In seven cases, the "glint" choice did better, and in seven cases, using the best scoring word did better.

Based on this metric, both systems worked about equally well. However, in six of the seven cases where "glint" as a second choice won, the number of possible words remaining after the second guess was 1 (meaning that the strategy would have produced a pretty amazing score of 3 for each of those games). In only two of the cases where the scoring algorithm chose the second word was there just one word remaining after the second guess.

Although this is a very low sample size, the somewhat surprising take-home message of this test is that the extremely simple strategy of always guessing "arose" and "glint" as your first two guesses is highly effective at bringing the number of words down to a low number for the third guess.

Other possible investigations

The obvious follow-up to this is to automate the process of guessing to the point where many tests could be run to compare how various strategies work. Since there are only 2529 words, it wouldn't even have to be a random sampling exercise -- one could literally try the strategy on every possible word. There are a few complications. One is that I'm not entirely confident about how the script is handling the evaluation of choices where the guess word contains two of the same letter (neither in the correct position) and the answer has only one of that letter. I was having some problems with coding that and with uncertainty about how the game would indicate that result.
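For what it's worth, the rule the real game appears to follow (as most players have worked out, though I haven't verified it against the game's code) is that exact matches are marked first, and any remaining copies of a guessed letter are matched left-to-right against the answer's leftover letters, with surplus copies marked as absent. A sketch of generating a guess code that way:

```python
from collections import Counter

def guess_code(guess, answer):
    """Generate a guess code (in the comma-separated format used
    above) for a guess against a known answer. Exact matches are
    marked first; remaining guess letters are matched left-to-right
    until the answer's supply of that letter runs out, and extra
    copies are marked absent."""
    tokens = [None] * 5
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            tokens[i] = g.upper()       # correct letter, correct place
        else:
            remaining[a] += 1           # available for "wrong place" matches
    for i, g in enumerate(guess):
        if tokens[i] is None:
            if remaining[g] > 0:
                tokens[i] = g           # correct letter, wrong place
                remaining[g] -= 1
            else:
                tokens[i] = '-' + g     # no unmatched copy left
    return ','.join(tokens)

# Guessing "geese" against "serge": both answer e's are consumed by
# exact matches, so the third e in the guess is marked absent.
print(guess_code('geese', 'serge'))
```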

The other problem is that it is common near the end of the game for there to be two or even three possible guess words with the same score. So to have the computer fully play out the game would require a random selection. Many human players have been in this situation where there are two possible words that could fit in the final guess and it was necessary to just pick one and hope to get lucky (today for example when my last two choices were "spire" and "shire" -- I got unlucky and incorrectly guessed "spire"). There would probably be a more graceful way to handle the situation, but one could just have the script guess at random, then run it enough times for the probabilities to come out in the wash.

With this kind of automation, one could try many possible combinations of fixed first two words, adjust the base scoring weights to be something other than 0.7 and 0.3, or try out some other strategies I haven't thought of yet.

However, I've totally burned up my holiday weekend, so if I do this it will have to be later! :-)