The aim of this guide is to give a brief introduction and explanation to the functions contained inside the vivid package. vivid (variable importance and variable interaction displays) is used for investigating relationships within a machine learning model fit. All of the visualisations in this vignette are highly customizable (see the long-form vivid vignette to see examples) and return ggplot objects which can be customized using the normal ggplot options.
Note: For the purposes of speed, the grid size (i.e., gridSize - the size of the gid on which the evaluations are made) and the number of rows subsetted (nmax) are small. This achieve more accurate results, incerease both the grid size and the number of rows used.
Some of the plots used by vivid are built upon the zenplots package which requires the graph package from BioConductor. To install the graph and zenplots packages use:
if (!requireNamespace("graph", quietly = TRUE)){
install.packages("BiocManager")
BiocManager::install("graph")
}
install.packages("zenplots")
Now we can install and load vivid by using:
install.packages("vivid")
We then load the other required packages.
library(vivid) # for visualisations
library(randomForest) # for model fit
library(ranger) # for model fit
library(ggplot2) The data used in the following examples is simulated from the Friedman benchmark problem 11. This benchmark problem is commonly used for testing purposes. The output is created according to the equation:
Create the data:
set.seed(101)
genFriedman <- function(noFeatures = 10,
noSamples = 100,
sigma = 1) {
# Set Values
n <- noSamples # no of rows
p <- noFeatures # no of variables
e <- rnorm(n, sd = sigma)
# Create matrix of values
xValues <- matrix(runif(n * p, 0, 1), nrow = n) # Create matrix
colnames(xValues) <- paste0("x", 1:p) # Name columns
df <- data.frame(xValues) # Create dataframe
# Equation:
# y = 10sin(πx1x2) + 20(x3−0.5)^2 + 10x4 + 5x5 + ε
y <- (10 * sin(pi * df$x1 * df$x2) + 20 * (df$x3 - 0.5)^2 + 10 * df$x4 + 5 * df$x5 + e)
# Adding y to df
df$y <- y
df
}
myData <- genFriedman(noFeatures = 9, noSamples = 350, sigma = 1)In the following examples, we use a ranger random forest model fit on the data, with the importance set to permutation.
set.seed(101)
fit <- randomForest(y ~ ., data = myData)Next, we create the ‘vivi-matrix’, which will contain variable importance on the diagonal and variable interactions in the upper and lower triangle. This matrix can then be supplied to the vivid plotting functions.
set.seed(101)
viFit <- vivi(
fit = fit,
data = myData,
response = "y",
gridSize = 10,
importanceType = NULL,
nmax = 100,
reorder = TRUE,
class = 1,
predictFun = NULL
)#Section 2: Visualizing the results
To create a heatmap of the vivi-matrix, we use:
viviHeatmap(mat = viFit) + ggtitle("random forest fit heatmap")To create a network graph of the vivi-matrix, we use:
viviNetwork(mat = viFit)In this plot we use a generalized pairs plot matrix style layout (which we call GPDP) to display partial dependence plots (PDPs) in the upper triangle, individual conditional exception curves (along with the aggregated 1-way partial dependence) on the diagonal and a scatterplot in the lower triangle.
To create the plot we supply the model fit to the plotting function.
set.seed(1701)
pdpPairs(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdpFor this plot, we calculate the bivariate partial dependence and display them in a zenplots layout, which we call (ZPDP). The ZPDP is based on graph Eulerians and focuses on key subsets. ‘Zenplots’ create a zigzag expanded navigation plot (‘zenplot’) of the partial dependence values. This results in an alternating sequence of two-dimensional plots laid out in a zigzag structure, as shown in Fig 4.0 below and can be used as a useful space-saving plot that displays the most influential variables.
set.seed(1701)
pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdpIn Fig 4.0, we can see PDPs laid out in a zigzag structure, with the most influential variable pairs displayed at the top. As we move down the plot, we also move down in influence of the variable pairs.
Using the zpath argument, we can filter out any interactions below a set value. zpath takes the vivi matrix as a function argument and then, using cutoff, we can filter out any interactions below the chosen value. For example:
set.seed(1701)
zpath <- zPath(viv = viFit, cutoff = 0.1)
pdpZen(data = myData, fit = fit, response = "y", nmax = 50, gridSize = 10, zpath = zpath)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdpIn this section, we briefly describe how to apply the above visualisations to a classification example using the iris data set.
To begin we fit a ranger random forest model with “Species” as the response and create the vivi matrix setting the category for classification to be “setosa” using class.
set.seed(1701)
rfClassif <- ranger(Species ~ .,
data = iris, probability = T,
importance = "impurity"
)
set.seed(101)
viviClassif <- vivi(
fit = rfClassif,
data = iris,
response = "Species",
gridSize = 10,
importanceType = NULL,
nmax = 50,
reorder = TRUE,
class = "setosa",
predictFun = NULL
)
#> Embedded impurity variable importance method used.
#> Calculating interactions...Next we plot the heatmap and network plot of the iris data.
set.seed(1701)
viviHeatmap(mat = viviClassif)set.seed(1701)
viviNetwork(mat = viviClassif)As PDPs are evaluated on a grid, they can extrapolate where there is no data. To solve this issue we calculate a convex hull around the data and remove any points that fall outside the convex hull. This can be seen in the GPDP in Fig 3.2 below.
set.seed(1701)
pdpPairs(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdpFinally, a ZPDP for the random forest fit on the iris data with extrapolated data removed:
set.seed(1701)
pdpZen(data = iris, fit = rfClassif, response = "Species", class = "setosa", convexHull = T, gridSize = 10, nmax = 50)
#> Generating ice/pdp fits... waiting...
#> Finished ice/pdpFriedman, Jerome H. (1991) Multivariate adaptive regression splines. The Annals of Statistics 19 (1), pages 1-67.↩︎