read_csv("crossword_times.csv", col_types = "ccD") |>
mutate(Publisher,
day = factor(wday(Date, week_start = 1, label = TRUE, abbr = FALSE)),
minutes = period_to_seconds(ms(Time))/60,
.keep = "none") |>
rename(publisher = Publisher) ->
crossword_times
Crossword Times
With a word on density plots
Introduction
Down For Across is a website which allows you to solve user-uploaded crosswords solo or with friends.
The crosswords published in major American newspapers often differ in difficulty according to the day of the week. For example, the New York Times’ guide explains that their Monday puzzles are the easiest and their Saturday puzzles the hardest. Meanwhile, the Sunday puzzles have easy clues but a large grid, and the Thursday puzzles have some unique gimmick. By contrast, the New Yorker’s crosswords decrease in difficulty over the week.
I’ve started keeping track of my solo completion times for the New York Times, Los Angeles Times, New Yorker, and Wall Street Journal crosswords.
Data
I’m accustomed to thinking of the week as beginning with Sunday, but for the purposes of crossword difficulty analysis, Monday is better. We’ll also convert the solve times from their mm:ss format to minutes.
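For a concrete sense of the conversion (the time below is an illustrative value, not one of my actual solves), lubridate parses the mm:ss string and period_to_seconds() turns it into decimal minutes:

library(lubridate)

period_to_seconds(ms("12:34")) / 60
#> [1] 12.56667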
Code
crossword_times |>
  mutate(minutes = round(minutes, 2)) |>
  datatable()
Code
crossword_times |>
  group_by(publisher, day) |>
  summarize(n = n(), mean = mean(minutes), sd = sd(minutes)) |>
  mutate(across(c("mean", "sd"), \(x) round(x, 2))) |>
  mutate(publisher,
         day,
         stats = paste0("n: ", n, "<br/>mean: ", mean, "<br/>sd: ", sd),
         .keep = "none") |>
  pivot_wider(names_from = publisher, values_from = stats) |>
  datatable(rownames = FALSE, escape = FALSE)
Modelling
It’s plausible to think that crossword solve times would be distributed log-normally. As the name suggests, a log-normally distributed variable has a normally distributed logarithm. The relevant Wikipedia article lists a variety of phenomena which seem to follow a log-normal distribution, including the length of chess games, the incubation period of diseases, and incomes (except those of the very rich).
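As a quick illustration of that relationship (simulated values, not the crossword data), draws from a log-normal distribution have logarithms that look normal:

set.seed(2024)
x <- rlnorm(1e4, meanlog = log(8), sdlog = 0.5)  # hypothetical "solve times"
c(mean(log(x)), sd(log(x)))                      # close to log(8) and 0.5
hist(log(x), breaks = 50)                        # roughly bell-shaped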
Fitting a Bayesian Log-Normal Regression
If we imagine that the distribution of solve times within each combination of publisher and day of the week is log-normal, where the underlying normal distribution has some particular mean and standard deviation, we can provide that model specification to brms, accept some default priors, and have it estimate posterior distributions of those means and standard deviations based on the observed solve times.
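If you’re curious what those defaults are, brms will report them for a given formula and family before anything is fit; a quick sketch using the same specification as the model below:

library(brms)

get_prior(minutes ~ (1 + day | publisher),
          data = crossword_times,
          family = lognormal())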
crossword_times |>
  brm(formula = minutes ~ (1 + day | publisher),
      data = _,
      family = lognormal(),
      warmup = 1000,
      iter = 2000,
      chains = 4,
      control = list(adapt_delta = 0.95)) ->
  solve_time_model
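After fitting, the standard brms checks are worth a look; for example, summary() reports posterior summaries and convergence diagnostics, while pp_check() overlays draws from the posterior predictive distribution on the observed solve times:

summary(solve_time_model)               # estimates and convergence diagnostics
pp_check(solve_time_model, ndraws = 50) # posterior predictive check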
Visualization
There’s some debate among data visualization practitioners about when to use density plots, such as violin plots, instead of histograms. Or indeed whether density plots should ever be used. Or even whether histograms should ever be used!
I disagree with several of the points Angela Collier makes in her video “violin plots should not exist”, but one that I find compelling is that drawing density plots usually involves what amounts to fitting an unjustified model.
By default, ggplot estimates these densities with a Gaussian kernel density estimate: it centres a small normal bump at each observation and averages them, with a bandwidth chosen by a rule of thumb rather than by any model of how the data arose. It (usually) makes nice looking violin plots, but you wouldn’t expect it to reflect the “actual” theoretical distribution of the data.
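A minimal sketch of what that default estimate amounts to, computed directly (bw.nrd0() is the rule-of-thumb bandwidth that stats::density() uses by default):

x    <- crossword_times$minutes
bw   <- bw.nrd0(x)                      # rule-of-thumb bandwidth
grid <- seq(min(x), max(x), length.out = 200)

# the estimate is just an average of normal bumps centred on the observations
kde  <- sapply(grid, \(g) mean(dnorm(g, mean = x, sd = bw)))
plot(grid, kde, type = "l")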
It seems to me that this sort of thing is a symptom of a general desire to avoid having to actually specify models by pretending that there’s some bright-line distinction between descriptive statistics and statistical inference.
Since we were willing to actually specify a model, we can make density plots that show something meaningful: the posterior predictive distributions corresponding to our model. We’ll use add_predicted_draws() from tidybayes to generate draws from those distributions and use these to draw violins that reflect the data we expect to observe in the future based on those we have observed so far. (We also enlist the help of data_grid() from modelr.)
Code
crossword_times |>
  data_grid(publisher, day) |>
  add_predicted_draws(solve_time_model) |>
  # filter out predictions for groups that do not actually occur
  filter(!(publisher == "New Yorker" & as.integer(day) %in% c(6, 7)),
         !(publisher == "Wall Street Journal" & as.integer(day) == 7)) |>
  ggplot() +
  # plot structure
  facet_grid(cols = vars(publisher)) +
  # aesthetic mapping for the violins
  aes(y = .prediction, x = day) +
  # visual elements representing the posterior predictive model
  stat_eye(colour = NA, fill = "grey") +
  # visual elements representing the observed data
  geom_sina(data = crossword_times,
            aes(y = minutes, fill = publisher),
            pch = 21,
            alpha = 0.5) +
  # scales
  scale_x_discrete(labels = c("M", "T", "W", "T", "F", "S", "S")) +
  scale_y_continuous(breaks = seq(0, 70, 5),
                     limits = c(0, 70),
                     expand = c(0, 0)) +
  scale_fill_viridis(discrete = TRUE) +
  # labels
  labs(title = "Crossword Solve Times",
       subtitle = "With posterior predictive log-normal distributions",
       x = "Day of the Week",
       y = "Solve Time (Minutes)") +
  guides(fill = "none") +
  # theming
  theme_bw() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(hjust = 0.5))
Code
crossword_times |>
  # group by publisher and day, add per-group row numbers
  group_by(publisher, day) |>
  mutate(id = row_number()) |>
  ggplot() +
  # plot structure
  facet_grid2(cols = vars(publisher),
              rows = vars(day),
              labeller = labeller(day = ~ substr(., 1, 1)),
              scales = "free_x",
              independent = "x") +
  # aesthetic mapping
  aes(x = id, y = minutes, colour = publisher) +
  # visual elements representing the data
  geom_point() +
  geom_line() +
  # scales
  scale_y_continuous(n.breaks = 6) +
  scale_colour_viridis(discrete = TRUE) +
  # labels
  labs(title = "Solve Time Trend by Publisher and Day of Week",
       y = "Solve Time (Minutes)") +
  # theming
  theme_bw() +
  theme(legend.position = "none",
        panel.spacing.y = unit(0.02, "npc"),
        axis.text.y = element_text(size = rel(0.65)),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.margin = margin(0, 0, 0, 0),
        strip.text = element_text(hjust = 0.5))
geom_violin vs Log-Normal Densities
What’s the difference between plotting kernel-density-estimate violins and posterior predictive distributions? I think this image illustrates the shortcomings of the former approach well: