Crossword Times

With a word on density plots

A consideration of cruciverbal complexity.
Published: January 2, 2023

Modified: May 11, 2024

Introduction

Down For Across is a website which allows you to solve user-uploaded crosswords solo or with friends.

The crosswords published in major American newspapers often differ in difficulty according to the day of the week. For example, the New York Times’ guide explains that their Monday puzzles are the easiest and their Saturday puzzles the hardest. Meanwhile, the Sunday puzzles have easy clues but a large grid, and the Thursday puzzles have some unique gimmick. By contrast, the New Yorker’s crosswords decrease in difficulty over the week.

I’ve started keeping track of my solo completion times for the New York Times, Los Angeles Times, New Yorker, and Wall Street Journal crosswords.

Data

I’m accustomed to thinking of the week as beginning with Sunday, but for the purposes of crossword difficulty analysis, Monday is the better starting point. We’ll also convert the solve times from their mm:ss format to fractional minutes.

Code
read_csv("crossword_times.csv", col_types = "ccD") |>
  mutate(Publisher,
         # Monday-first day-of-week factor
         day = factor(wday(Date, week_start = 1, label = TRUE, abbr = FALSE)),
         # convert mm:ss strings to fractional minutes
         minutes = period_to_seconds(ms(Time))/60,
         .keep = "none") |>
  rename(publisher = Publisher) ->
  crossword_times
Code
crossword_times |>
  mutate(minutes = round(minutes, 2)) |>
  datatable()
Code
crossword_times |>
  group_by(publisher, day) |>
  summarize(n = n(), mean = mean(minutes), sd = sd(minutes)) |>
  mutate(across(c("mean", "sd"), \(x) round(x, 2))) |>
  mutate(publisher, 
         day, 
         stats = paste0("n: ", n, "<br/>mean: ", mean, "<br/>sd: ", sd),
         .keep = "none") |>
  pivot_wider(names_from = publisher, values_from = stats) |>
  datatable(rownames = FALSE, escape = FALSE)

Modelling

It’s plausible to think that crossword solve times would be distributed log-normally. As the name suggests, a log-normally distributed variable has a normally distributed logarithm. The relevant Wikipedia article lists a variety of phenomena which seem to follow a log-normal distribution, including the length of chess games, the incubation period of diseases, and incomes (except those of the very rich).
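As a quick, informal check of that assumption (an aside of mine, reusing the crossword_times tibble built above): if the log-normal story is roughly right, the logarithms of the solve times should fall near a straight line on a normal QQ plot.

Code
crossword_times |>
  ggplot(aes(sample = log(minutes))) +
  # QQ points and reference line for the log-transformed solve times
  stat_qq() +
  stat_qq_line() +
  facet_wrap(vars(publisher)) +
  labs(title = "Normal QQ Plots of log(Solve Time)",
       x = "Theoretical Quantiles",
       y = "Observed log(Minutes)") +
  theme_bw()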

Fitting a Bayesian Log-Normal Regression

If we imagine that the distribution of solve times within each combination of publisher and day of the week is log-normal, with the underlying normal distribution having some particular mean and standard deviation, we can provide that model specification to brms, accept some default priors, and have it estimate posterior distributions of those means and standard deviations based on the observed solve times.

Code
crossword_times |>
  brm(formula = minutes ~ (1 + day | publisher),
      data = _,
      family = lognormal(),
      # 4 chains of 2000 iterations, the first 1000 of each used as warmup
      warmup = 1000,
      iter = 2000,
      chains = 4,
      # raise adapt_delta to reduce divergent transitions
      control = list(adapt_delta = 0.95)) ->
  solve_time_model
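We accepted brms’s default priors above; if you want to see exactly what those defaults are for this formula, data set, and family, get_prior() will list them. (A quick check of my own, not part of the original post.)

Code
# inspect the default priors brms would use for this model specification
get_prior(formula = minutes ~ (1 + day | publisher),
          data = crossword_times,
          family = lognormal())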

Visualization

There’s some debate among data visualization practitioners about when to use density plots, such as violin plots, instead of histograms. Or indeed about whether density plots should ever be used. Or even whether histograms should ever be used!

I disagree with several of the points Angela Collier makes in her video “violin plots should not exist”, but one that I find compelling is that drawing density plots usually involves what amounts to fitting an unjustified model.

In most situations, ggplot builds these densities with kernel density estimation by default, which centres a kernel (a Gaussian, unless you say otherwise) on each data point and sums the contributions, with the bandwidth picked by a rule of thumb. It (usually) makes nice-looking violin plots, but you wouldn’t expect the result to reflect the “actual” theoretical distribution of the data.
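To make that concrete, here is a small illustration of my own on simulated data rather than the crossword times: the default kernel density estimate of a modest log-normal sample, overlaid on the true density it was drawn from. With few observations, the estimate can smooth away the skew and put probability mass where none belongs.

Code
set.seed(1)
# 60 draws from a known log-normal distribution
sim <- tibble(minutes = rlnorm(60, meanlog = 2, sdlog = 0.5))

ggplot(sim, aes(x = minutes)) +
  # default kernel density estimate of the sample
  geom_density(fill = "grey", colour = NA) +
  # the true density the sample was drawn from
  stat_function(fun = dlnorm, args = list(meanlog = 2, sdlog = 0.5)) +
  labs(title = "Default Density Estimate vs True Log-Normal Density") +
  theme_bw()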

It seems to me that this sort of thing is a symptom of a general desire to avoid having to actually specify models by pretending that there’s some bright-line distinction between descriptive statistics and statistical inference.

Since we were willing to actually specify a model, we can make density plots that show something meaningful: the posterior predictive distributions corresponding to our model. We’ll use add_predicted_draws() from tidybayes to generate draws from those distributions and use them to draw violins reflecting the data we’d expect to observe in the future, given what we’ve observed so far. (We also enlist the help of data_grid() from modelr.)

Code
crossword_times |>
  data_grid(publisher, day) |>
  add_predicted_draws(solve_time_model) |>
  # filter out predictions for groups that do not actually occur
  filter(!(publisher == "New Yorker" & as.integer(day) %in% c(6, 7)),
         !(publisher == "Wall Street Journal" & as.integer(day) == 7)) |>
  ggplot() +
  # plot structure
  facet_grid(cols = vars(publisher)) +
  # aesthetic mapping for the violins
  aes(y = .prediction, x = day) +
  # visual elements representing the posterior predictive model
  stat_eye(colour = NA, fill = "grey") +
  # visual elements representing the observed data
  geom_sina(data = crossword_times,
            aes(y = minutes, fill = publisher),
            pch = 21, 
            alpha = 0.5) +
  # scales
  scale_x_discrete(labels = c("M", "T", "W", "T", "F", "S", "S")) +
  scale_y_continuous(breaks = seq(0, 70, 5), 
                     limits = c(0, 70), 
                     expand = c(0, 0)) +
  scale_fill_viridis(discrete = TRUE) +
  # labels
  labs(title = "Crossword Solve Times",
       subtitle = "With posterior predictive log-normal distributions",
       x = "Day of the Week",
       y = "Solve Time (Minutes)") +
  guides(fill = "none") +
  # theming
  theme_bw() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(hjust = 0.5))

Code
crossword_times |>
  # group by publisher and day, add per-group row numbers
  group_by(publisher, day) |>
  mutate(id = row_number()) |>
  ggplot() +
  # plot structure
  facet_grid2(cols = vars(publisher), 
              rows = vars(day),
              labeller = labeller(day = ~ substr(., 1, 1)),
              scales = "free_x",
              independent = "x") +
  # aesthetic mapping
  aes(x = id, y = minutes, colour = publisher) +
  geom_point() +
  # visual elements representing the data
  geom_line() +
  # scales
  scale_y_continuous(n.breaks = 6) +
  scale_colour_viridis(discrete = TRUE) +
  # labels
  labs(title = "Solve Time Trend by Publisher and Day of Week",
       y = "Solve Time (Minutes)") +
  # theming
  theme_bw() +
  theme(legend.position = "none", 
        panel.spacing.y = unit(0.02, "npc"),
        axis.text.y = element_text(size = rel(0.65)),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.margin = margin(0, 0, 0, 0),
        strip.text = element_text(hjust = 0.5))

geom_violin vs Log-Normal Densities

What’s the difference between plotting kernel-density-based violins and posterior predictive distributions? I think this image illustrates the shortcomings of the former approach well:
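Here is a rough sketch of how such a comparison could be drawn, reusing the objects built above. (This is my reconstruction; the original figure may have been produced differently.)

Code
crossword_times |>
  data_grid(publisher, day) |>
  add_predicted_draws(solve_time_model) |>
  # drop combinations that never occur in the data
  filter(!(publisher == "New Yorker" & as.integer(day) %in% c(6, 7)),
         !(publisher == "Wall Street Journal" & as.integer(day) == 7)) |>
  ggplot(aes(x = day, y = .prediction)) +
  facet_grid(cols = vars(publisher)) +
  # grey posterior predictive distributions
  stat_eye(colour = NA, fill = "grey") +
  # black kernel-density violins fitted directly to the observed times
  geom_violin(data = crossword_times,
              aes(y = minutes),
              fill = NA, colour = "black") +
  scale_x_discrete(labels = c("M", "T", "W", "T", "F", "S", "S")) +
  coord_cartesian(ylim = c(0, 70)) +
  labs(x = "Day of the Week", y = "Solve Time (Minutes)") +
  theme_bw()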

geom_violin (black) vs our posterior predictive distributions (grey).

References