---
title: Introduction to `accumulate`
author: Mark P.J. van der Loo
css: "style.css"
---
Package version `packageVersion("accumulate")`{.R}.
Use `citation('accumulate')` to cite the package.
## Introduction
`Accumulate` is a package for grouped aggregation, where the groups can be
dynamically collapsed into larger groups. When this collapsing takes place and
how collapsing takes place is user-defined.
## Installation
The latest CRAN release can be installed as follows.
```
install.packages("accumulate")
```
Next, the package can be loaded. You can use `packageVersion` (from base R) to
check which version you have installed.
```{#load_package .R}
library(accumulate)
# check the package version
packageVersion("accumulate")
```
## A first example
We will use a built-in dataset as example.
```{#loading_data .R}
data(producers)
head(producers)
```
This synthetic dataset contains information on various sources of turnover from
producers, that are labeled with an economic activity classification (`sbi`)
and a `size` class (0-9).
We wish to find a group mean by `sbi x size`. However, we demand that the group has
at least five records, otherwise we combine the size classes of a single `sbi` group.
This can be done as follows.
```{#first_example .R}
a <- accumulate(producers
, collapse = sbi*size ~ sbi
, test = min_records(5)
, fun = mean, na.rm=TRUE)
head(round(a))
```
The accumulate function does the following:
- For each combination of `sbi` and `size` occurring in the data, it checks whether
`test` is satisfied. Here, it tests whether there are at least five records.
- If the test is satisfied, the mean is computed for each non-grouping variable
in the data. The output column `level` is set to 0 (no collapsing took place).
- If the test is _not_ satisfied, it will only use `sbi` as grouping variable
for the current combination of `sbi` and `size`. Then, if there are enough
records, the mean is computed for each variable and the output variable `level`
is set to 1 (first level of collapsing has been used).
- If the test is still not satisfied, no computation is possible
and all outputs are `NA` for the current `sbi` and `size` combination.
Explicitly, for this example we see that for `(sbi,size)==(2752,5)` no
satisfactory group of records was found under the current collapsing scheme.
Therefore the `level` variable equals `NA` and all aggregated variables are
missing as well. For `(sbi,size)==(2840,7)` there are sufficient records, and
since `level=0` no collapsing was necessary. For the group
`(sbi,size)=(3410,8)` there were not enough records to compute a mean, but
taking all records in `sbi==3410` gave enough records. This is signified by
`level=1`, meaning that one collapsing step has taken place (from `sbi x size`
to `sbi`).
Let us see how we specified this call to `accumulate`
- The first argument is the data to be aggregated.
- The second argument is a formula of the form `target groups ~ collapsing scheme`.
The output is always at the level of the target groups. The collapsing scheme determines
which records are used to compute a value for the target groups if the `test` is not
satisfied.
- The third argument, called `test` is a function that should accept any subset of
records of `producers` and return `TRUE` or `FALSE`. In this case we used the convenience
function `min_records(5)` provided by `accumulate`. The function `min_records()` creates
a testing function for us that we can pass as testing function.
- Finally, the argument `fun` is the aggregation function that will be applied to each
group.
Observe that the accumulate function is similar to R's built-in `aggregate` function (this is
by design). There is a second function called `cumulate` that has an interface that
is similar to `dplyr::summarise`.
```{#cumulate_formula .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
, test = function(d) nrow(d) >= 5
, mu_industrial = mean(industrial, na.rm=TRUE)
, sd_industrial = sd(industrial, na.rm=TRUE))
head(round(a))
```
Notice that here, we wrote our own test function.
### Exercises
1. How many combinations of `(sbi, size)` could not be computed, even when
collapsing to `sbi`? (You need to run the code and investigate the output).
2. Compute the trimmed mean of all numeric variables where you trim
5% of each side the distribution. See `?mean` on how to compute trimmed
means.
## The formula interface for specifying collapsing schemes
A collapsing scheme can be defined in a data frame or with a
formula of the form
```
target grouping ~ collapse1 + collapse2 + ... + collapseN
```
Here, the `target grouping` is a variable or product of variables. Each
`collapse` term is also a variable or product of variables. Each subsequent
term defines the next collapsing step. Let us show the idea with a
more involved example.
The `sbi` variable in the `producers` dataset encodes a hierarchical classification
where longer digit sequences indicate higher level of detail. Hence we can collapse
to lower levels of detail by deleting digits at the end. Let us enrich the
`producers` dataset with extra grouping levels.
```{#derive_sbi_levels .R}
producers$sbi3 <- substr(producers$sbi,1,3)
producers$sbi2 <- substr(producers$sbi,1,2)
head(producers,3)
```
We can now use a more involved collapsing scheme as follows.
```{#accumulate_formula .R}
a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
, test = min_records(5), fun = mean, na.rm=TRUE)
head(round(a))
```
For `(sbi,size) == (2752,5)` we have 2 levels of collapsing. In other
words, for that aggregate, all records in `sbi3 == 275` were used.
### Exercises
1. Compute standard deviation for `trade` and `total` using the `cumulate` function
under the same collapsing scheme as defined above.
2. What is the maximum collapsing level in the collapsing scheme above?
3. Find out how many combinations of `(sbi,size)` have been collapsed to
level 0, 1, 2, or 3. Tabulate them.
4. Define a collapsing scheme that ends with a single-digit `sbi` code and compute
the means of all variables.
## The data frame interface for defining collapsing schemes
Collapsing schemes can be represented in data frames that have the
form
```
[target group, parent of target group, parent of parent of target group,...].
```
The package comes with a helper function that creates such a scheme
from hierarchical classifications that are encoded as digits.
For the `sbi` example we can do the following to derive a collapsing scheme.
```{#dataframe_construction .R}
sbi <- unique(producers$sbi)
csh <- csh_from_digits(sbi)
names(csh)[1] <- "sbi"
head(csh)
```
Here, the column `sbi` denotes the original (maximally) 5-digit codes,
`A1` the 4-digit codes, and so on. It is important that the name of
the first column matches a column in the data to be agregated.
Both `cumlate` and `accumulate` accept such a data frame as an argument.
Here is an example with `cumulate`.
```{#dataframe_cumulate .R}
a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
, mu_total = mean(total, na.rm=TRUE)
, sd_total = sd(total, na.rm=TRUE))
head(a)
```
In this representation is is not possible to use multiple grouping
variables, unless you combine multiple grouping variables into a single
one, for example by pasting them together.
The advantage of this representation is that it allows users to externally
define a (manually edited) collapsing scheme.
### Exercises
1. Use `csh` to compute the median of all numerical variables of
the `producers` dataset with `accumulate` (hint: you need to remove
the `size` variable).
## Convenience functions to define tests
There are several options to define test on groups of records:
1. Use one of the built-in functions to specify common test conditions:
`min_records()`, `min_complete()`, or `frac_complete()`.
2. Use a ruleset defined with the [validate](https://cran.r-project.org/package=validate)
package, with the `from_validator()` function.
3. Write your own custom test function.
Let us look at a small example for each case. For comparison we will
always test that there are a minimum of five records.
```{#helpers .R}
# load the data again to loose columns 'sbi2' and 'sbi3' and work
# with the original data.
data(producers)
# 1. using a helper function
a <- accumulate(producers, collapse = sbi*size ~ sbi
, test = min_records(5)
, fun = mean, na.rm=TRUE)
# 2. using a 'validator' object
rules <- validate::validator(nrow(.) >= 5)
a <- accumulate(producers, collapse = sbi*size ~ sbi
, test = from_validator(rules)
, fun = mean, na.rm=TRUE)
# 3. using a custom function
a <- accumulate(producers, collapse=sbi*size ~ sbi
, test = function(d) nrow(d) >= 5
, fun = mean, na.rm=TRUE)
```
## Complex aggregates
An aggregate may be something more complex than a scalar. The `accumulate`
package also supports complex aggregates such as linear models.
```{#complex .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
, test = min_complete(5, c("other_income","trade"))
, model = lm(other_income ~ trade)
, mean_other = mean(other_income, na.rm=TRUE))
head(a)
```
Here, we demand that there are at least five records available for estimating
the model.
The linear models are stored in a `list` of type `object_list`. Subsets or individual
elements can be accessed as usual with data frames.
```{#objlist .R}
a$model[[1]]
a$model[[2]]
```
### Smoke-testing your test function
If you write your own test function from scratch, it is easy to overlook some
edge cases like the occurrence of missing data, a column that is completely
`NA`, or receiving zero records. The function `smoke_test()` accepts a data set
and a test function and runs the test function on several common edge cases
based on the dataset. It does _not_ check whether the test function works as
expected, but it checks that the output is `TRUE` or `FALSE` in all cases and
reports errors, warnings and mesages if they occur.
As an example we construct a test function that checks whether one
of the variables has sufficient non-zero values.
```{#smoketest1 .R}
my_test <- function(d) sum(other != 0) > 3
smoke_test(producers, my_test)
```
Oops, we forgot to refer to the data set. Let's try it again.
```{#smoketest2 .R}
my_test <- function(d) sum(d$other != 0) > 3
smoke_test(producers, my_test)
```
Our function is not robust against occurrence of `NA`. Here's a third attempt.
```{#smoketest3 .R}
my_test <- function(d) sum(d$other != 0,na.rm=TRUE) > 3
smoke_test(producers, my_test)
```
### Exercises
1. Compute the mean of all variables using `sbi*size ~ sbi1 + sbi2` as collapsing
scheme. Make sure there are at least 10 records in each group.
2. Compute the mean of the ratio between `industrial` and `total`, but demand
that there are not more than 20% zeros in `other`. Use `csh` as collapsing scheme.