---
title: "Using the R package anticlust for stimulus selection in experiments"
output: 
  rmarkdown::html_vignette:
    number_sections: false
    df_print: kable
author: Martin Papenberg
vignette: >
  %\VignetteIndexEntry{Using the R package anticlust for stimulus selection in experiments}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# define a print method for objects of class matrix
knit_print.matrix = function(x, ...) {
    res = paste(c("", "", kable(x, row.names = TRUE)), collapse = "\n")
    asis_output(res)
}
# register the method
registerS3method("knit_print", "matrix", knit_print.matrix)

```

This tutorial teaches you how to use the `R` package `anticlust` for 
stimulus selection in psychological experiments. All code can easily be 
reproduced via Copy & Paste. The tutorial discusses the following 
functionalities:

- Match similar stimuli based on covariates of interest
- Minimize differences between stimulus sets with regard to some variables
- Maximize differences between stimulus sets with regard to some variables
- Balance the occurrence of a categorical variable between stimulus sets

To follow the code in this tutorial, load the `anticlust` 
package first:

```{r}
library(anticlust)
```

For the examples in this document, we use norming data for a stimulus 
set provided by Schaper, Kuhlmann and Bayen (2019a, 2019b). It is 
available when the package `anticlust` is loaded: 

```{r}
data("schaper2019")
# look at the data
head(schaper2019)
```

```{r, echo = FALSE}
cols <- toString(paste0("\`", names(schaper2019)[3:6], "\`"))
```

The item pool consists of 96 German words, given in the column `item`. 
Each word represents an object that is either typically found in a 
bathroom or in a kitchen. For their experiments, Schaper et al. 
partitioned the pool into 3 word lists that should be as similar as 
possible with regard to four numeric criteria (that is: `r cols`). 
Typically, stimulus pools may contain more than 96 elements (and the 
selection usually becomes more effective when the pool is larger), but 
the logic for stimulus selection that is applied in this tutorial can be 
transferred to arbitrarily large stimulus sets.
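
For example, we can count how many items belong to each room:

```{r}
# Number of items per room
table(schaper2019$room)
```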

## Example 1a: Maximize differences in frequency

In the first example, I select two word lists that differ on frequency 
but are as similar as possible with regard to consistency ratings and 
the number of syllables. First, I need to set the boundaries that 
define "high" and "low" frequency. Note that `frequency` is 
reverse-coded such that low values actually indicate high frequency. 
Here, I arbitrarily define values below 18 as "high" frequency and above 
19 as "low" frequency, but any user-defined splits are possible.

```{r}
schaper2019 <- within(schaper2019, {
  freq <- ifelse(frequency < 18, "high", NA)
  freq <- ifelse(frequency > 19, "low", freq)
})
```

This code defines a new column `freq` in the data set `schaper2019` 
that is either "high", "low", or missing (`NA`). Let's check it out:

```{r}
schaper2019$freq
```

Before matching high and low frequency words, it is necessary to remove 
cases that are not selected, i.e. where the entry in `freq` is `NA`:

```{r}
selected <- subset(schaper2019, !is.na(freq))
# see how many cases were selected:
table(selected$freq)
```

Now, I use the `anticlust` function `matching()` to find paired words of 
high and low frequency that are as similar as possible on consistency 
and the number of syllables:

```{r}
# Match the conditions based on covariates
covariates <- scale(selected[, 3:5])
selected$matches <- matching(
  covariates, 
  match_between = selected$freq,
  match_within = selected$room,
  match_extreme_first = FALSE
)
```

The first argument defines the covariates that should be similar in both 
sets. I used the function `scale()` before passing the covariates to the 
matching function to standardize the variables. This way, each variable 
has the same weight in the matching process, which is usually desirable. 
The argument `match_between` is the grouping variable based on `frequency` 
that I just defined. The argument `match_within` is used to ensure that 
matches are selected between words belonging to the same room. The 
argument `match_extreme_first` is used to guide the behaviour of the 
matching algorithm, but its precise meaning is not important for now 
(for more information, see `?matching`). Generally, if we plan to only 
keep a subset of all matches -- as we do in the current application -- 
we set `match_extreme_first` to `FALSE`. If we want to keep all 
elements, `match_extreme_first = TRUE` is usually better. 
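
For illustration, a call that keeps all elements could look as follows (a 
sketch that is not evaluated here, so it does not overwrite the matches used 
below):

```{r, eval = FALSE}
# Sketch (not run): match the most extreme cases first, which is usually
# preferable when all matched elements are kept
matching(
  covariates, 
  match_between = selected$freq,
  match_within = selected$room,
  match_extreme_first = TRUE
)
```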

Next, let us check out some of the matches we created:

```{r}
subset(selected, matches == 1)
subset(selected, matches == 2)
```

The matches are numbered by similarity, meaning that matches with 
grouping number 1 are most similar. Therefore, we can easily select the 
8 best-matched groups of words that we would like to include in our 
experiment:

```{r}
# Select the 8 best matches:
final_selection <- subset(selected, matches <= 8)
```

Last, we check the quality of the results by investigating the 
descriptive statistics -- means and standard deviations (in brackets) --
by condition:

```{r}
# Check quality of the selection:
mean_sd_tab(
  final_selection[, 3:6], 
  final_selection$freq
)
```

The descriptive statistics are similar across the sets for the 
consistency ratings and the number of syllables, and dissimilar for 
frequency, which is good. In general, the results can even be improved 
if we can select from a larger item pool (then, more matches are 
possible).

If we are not sure how many items should be part of our experiment, the 
function `plot_similarity()` may help. It plots an index of similarity 
(see `?plot_similarity`) for each match:

```{r, similarity-plot}
plot_similarity(
  covariates, 
  groups = selected$matches
)
```

The figure depicts the sum of pairwise dissimilarities (based on the 
Euclidean distance) per match; the larger the value, the less 
homogeneous the match. Indeed, the first eight matches are the most 
similar; after the 8th match, there is already a notable decline in 
similarity.
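
If we want the numbers behind the plot, we can also compute the index by 
hand -- a small sketch, assuming that the plotted index is the within-match 
sum of pairwise Euclidean distances described above:

```{r}
# Sum of pairwise Euclidean distances within each match
# (smaller values indicate a more homogeneous match)
match_ids <- sort(unique(na.omit(selected$matches)))
sapply(match_ids, function(i) sum(dist(covariates[which(selected$matches == i), ])))
```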

## Example 1b: Two-factorial design

```{r}
# Reload the data for next example
data("schaper2019")
```

In another example, we construct a two-factorial design: we create 
groups that differ on `rating_consistent` and `rating_inconsistent` 
(these properties are orthogonally crossed) but are similar on frequency 
and the number of syllables. First, we categorize the variables 
`rating_consistent` and `rating_inconsistent` into two levels, respectively. 
This time, I just use median splits, which means that I do not exclude any 
data in this first step:

```{r}
schaper2019 <- within(schaper2019, {
  incon <- ifelse(rating_inconsistent < median(rating_inconsistent), "low incon", NA)
  incon <- ifelse(rating_inconsistent >= median(rating_inconsistent), "high incon", incon)
  con <- ifelse(rating_consistent <= median(rating_consistent), "low con", NA)
  con <- ifelse(rating_consistent > median(rating_consistent), "high con", con)
})
```

Let us check how many words fall into each condition: 

```{r}
table(schaper2019$con, schaper2019$incon)
```

Next, we conduct a matching between the four groups that result from 
crossing the consistency and inconsistency ratings:

```{r}
# Match the conditions based on covariates
covariates <- scale(schaper2019[, c("frequency", "syllables")])
schaper2019$matches <- matching(
  covariates, 
  match_between = schaper2019[, c("con", "incon")],
  match_extreme_first = FALSE
)
```

In this application, each match consists of 4 words because we have 4 
conditions, e.g.:

```{r}
subset(schaper2019, matches == 1)
```

To decide how many matched groups we would like to keep, let's check out 
the similarity plot:

```{r}
# Plot covariate similarity by match:
plot_similarity(covariates, schaper2019$matches)
```

Based on the plot, we select the 10 best matches:

```{r}
# Select the 10 best matches:
final_selection <- subset(schaper2019, matches <= 10)
```

Last, we check the quality of the selection by printing the descriptive 
statistics by condition:

```{r}
mean_sd_tab(
  subset(final_selection, select = 3:6), 
  paste(final_selection$con, final_selection$incon)
)
```

Again, the covariates are quite similar between sets, but with a larger 
item pool even better results could be achieved.

## Example 2a: Anticlustering

```{r}
# Reload the data for next example
data("schaper2019")
```

In the next example, we wish to partition the entire pool of 96 items 
into 3 sets that are as similar as possible on all variables. That means 
that there is no independent variable that varies between conditions; 
the experimental manipulation is independent of the intrinsic stimulus 
properties. Creating stimulus sets that are overall similar to each 
other can be done using anticlustering (Papenberg & Klau, 2021):

```{r}
## Conduct anticlustering (assign all items to three similar groups)
schaper2019$anticluster <- anticlustering(
  schaper2019[, 3:6], 
  K = 3,
  objective = "variance"
)

## check out quality of the solution
mean_sd_tab(
  subset(schaper2019, select = 3:6), 
  schaper2019$anticluster
)
```
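
By default, `anticlustering()` creates groups of equal size (here, 96 / 3 = 
32 items each), which we can verify with a quick count:

```{r}
# Each of the three anticlusters contains 32 items
table(schaper2019$anticluster)
```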

## Example 2b: Anticlustering with subset selection

```{r}
# Reload the data for next example
data("schaper2019")
```

If we do not want to include all 96 words in our experiment, we can 
again use the `matching()` function to select a subset of similar 
stimuli. As before, we wish to select three groups that are as similar 
as possible on all variables, but we only use a subset of all 96 items. 
To achieve this goal, we create triplets of similar stimuli using 
`matching()`; afterwards, the items belonging to the same triplet will 
be assigned to different experimental sets. We use the argument 
`match_within` to ensure that all triplets consist of items belonging 
to the same room:

```{r}
# First, identify triplets of similar words, within each room
covariates <- scale(schaper2019[, 3:6])
schaper2019$triplet <- matching(
  covariates,
  p = 3,
  match_within = schaper2019$room
)

# check out the two most similar triplets:
subset(schaper2019, triplet == 1)
subset(schaper2019, triplet == 2)

# Select the 10 best triplets
best <- subset(schaper2019, triplet <= 10)
```

Now, we use `anticlustering()` to assign the matched words to different
sets:


```{r}
best$anticluster <- anticlustering(
  best[, 3:6], 
  K = 3,
  categories = best$triplet,
  objective = "variance"
)
```

In this function call, we pass the `triplet` column to the argument `categories`, 
thus ensuring that the matched items are assigned to different sets. We can confirm 
that this worked by looking at the cross table of `triplet` and `anticluster`:

```{r}
table(best$triplet, best$anticluster)
```

Anticlustering strives to assign the matched triplets to the different sets in 
such a way that all sets are as similar as possible. Note that because each 
triplet only consists of items from the same room, the variable `room` is 
also balanced across the anticlusters: 

```{r}
table(best$room, best$anticluster)
```

Last, we check out the descriptive statistics by anticluster, confirming 
that the sets are indeed quite similar on all numeric attributes:

```{r}
## check out quality of the solution
mean_sd_tab(
  subset(best, select = 3:6), 
  best$anticluster
)
```

## References

Papenberg, M., & Klau, G. W. (2021). Using anticlustering to partition data sets into equivalent parts. *Psychological Methods, 26*(2), 161–174. https://doi.org/10.1037/met0000301

Schaper, M. L., Kuhlmann, B. G., & Bayen, U. J. (2019a). Metacognitive expectancy effects in source monitoring: Beliefs, in-the-moment experiences, or both? *Journal of Memory and Language, 107*, 95–110. https://doi.org/10.1016/j.jml.2019.03.009

Schaper, M. L., Kuhlmann, B. G., & Bayen, U. J. (2019b). Metamemory expectancy illusion and schema-consistent guessing in source monitoring. *Journal of Experimental Psychology: Learning, Memory, and Cognition, 45*, 470. https://doi.org/10.1037/xlm0000602