Variable selection in DEA deserves careful attention before the results of an analysis are used in a real case, because those results can change significantly depending on the variables included in the model. Variable selection is therefore a key step in every DEA application.
adea provides a measure, called load, of the contribution of a variable to a DEA model. In the ideal case, when all variables contribute equally, all loads are 1. For example, an output variable with a load of 0.75 contributes 75% of the average contribution of all outputs. A variable load lower than 0.6 means that its contribution to the DEA model is negligible.
For more information see (Fernandez-Palacin, Lopez-Sanchez, and Munoz-Marquez 2018) and (Villanueva-Cantillo and Munoz-Marquez 2021).
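As a rough illustration of how loads can be read, the following base R sketch takes the loads reported by summary() for the four-variable model shown later in this document and checks them against the 0.6 threshold. The object names are only illustrative, and treating the smallest variable load as the model load is an assumption based on that summary output, not a documented rule.

```r
# Loads copied from the summary() output of the four-variable model in this document.
input_loads <- c(Books.I2 = 1.193651, Staff.I3 = 0.9031744, Populations.I4 = 0.9031744)
output_loads <- c(Borrow.O2 = 1)

# Loads are expressed relative to the average contribution,
# so the input loads should average (approximately) 1.
mean_input_load <- mean(input_loads)

# Threshold quoted in the text: below 0.6 the contribution is considered negligible.
negligible_threshold <- 0.6
all_loads <- c(input_loads, output_loads)
negligible <- names(all_loads)[all_loads < negligible_threshold]

# Assumption: the model load reported by summary() matches the smallest variable load.
model_load <- min(all_loads)

print(round(mean_input_load, 4))
print(negligible)
print(model_load)
```

In this model no variable falls below the threshold, so none would be flagged as negligible.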
Let’s load the tokyo_libraries dataset and have a look at it:
data(tokyo_libraries)
head(tokyo_libraries)
#> Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1 2.249 163.523 26 49.196 5.561 105.321
#> 2 4.617 338.671 30 78.599 18.106 314.682
#> 3 3.873 281.655 51 176.381 16.498 542.349
#> 4 5.541 400.993 78 189.397 30.810 847.872
#> 5 11.381 363.116 69 192.235 57.279 758.704
#> 6 10.086 541.658 114 194.091 66.137 1438.746
Two stepwise variable selection functions are provided. The first one drops variables one by one, giving a set of nested models. The following code sets up the input and output variables and makes the call:
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 inoutput 6 6 4 2
#> 5 inoutput 6 5 3 2
#> 4 inoutput 4 4 3 1
#> 3 inoutput 2 3 2 1
#> 2 inoutput 1 2 1 1
#> 1 inoutput 0 1 0 0
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 4 Books.I2, Staff.I3, Populations.I4 Borrow.O2
#> 3 Books.I2, Populations.I4 Borrow.O2
#> 2 Books.I2 Borrow.O2
#> 1
The load of the first model is 0.455467, which is below the minimum significance level of 0.6, so Area.I1 can be removed from the model.
When a variable is removed, one expects the loads of all remaining variables to rise, but after the second model this does not happen. The third model is therefore poorer than the second, and there is no statistical reason to select it.
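The comparison between consecutive models can be sketched in base R using only figures that appear in this document. The load of the five-variable model is taken from the adea_parametric output, on the assumption that it is the same model because it contains the same variables. The load of the four-variable model is taken from the summary(m4) output.

```r
# Model loads gathered from this document, indexed by number of variables:
#   6 variables: 0.455467 (quoted in the text)
#   5 variables: 0.990164 (adea_parametric output; assumed to be the same model)
#   4 variables: 0.903174 (Model load in the summary(m4) output, rounded)
model_loads <- c("6" = 0.455467, "5" = 0.990164, "4" = 0.903174)

# Change in load at each removal step; a negative value means the load fell.
load_change <- diff(model_loads)
print(load_change)

# First step at which removing a variable lowered the model load.
first_drop <- names(load_change)[which(load_change < 0)[1]]
print(first_drop)
```

The load rises when going from six to five variables and falls when going from five to four, which is the behaviour described above.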
To avoid this, a second stepwise variable selection function is provided. The new call is:
adea_parametric(input, output)
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 0.455467 6 6 4 2
#> 5 0.990164 6 5 3 2
#> 2 1.000000 1 2 1 1
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 2 Books.I2 Borrow.O2
In both cases, all variables are considered as candidates for removal, but the load.orientation parameter selects which variables are included in the load analysis: input for input variables only, output for output variables only, and inoutput, the default value, for all variables. The next call considers only output variables as candidates for removal:
adea_parametric(input, output, load.orientation = 'output')
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 1 6 6 4 2
#> 5 1 4 5 4 1
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Area.I1, Books.I2, Staff.I3, Populations.I4 Borrow.O2
adea_hierarchical and adea_parametric return an object containing a list, called models, with all computed models. An individual model can be accessed as follows:
m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4
#> 1 2 3 4 5 6 7 8
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942
#> 9 10 11 12 13 14 15 16
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710
#> 17 18 19 20 21 22 23
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000
where the number in square brackets is the total number of variables in the model.
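The printed efficiencies can be explored with ordinary base R functions. The sketch below re-enters the values printed above as a plain numeric vector, since the internal structure of the adea object is not described here. The tolerance used to decide which units count as efficient is an assumption made for illustration.

```r
# Efficiencies of the 23 units, copied from the printed output of m4.
eff <- c(0.3026132, 0.6425505, 0.5733000, 0.7164871, 0.6733832, 1.0000000,
         0.6967419, 0.4476942, 1.0000000, 0.7051438, 0.5336592, 0.7583527,
         0.5915395, 0.7215430, 0.7832606, 0.5822710, 0.8451129, 0.7867065,
         1.0000000, 0.8485716, 0.7285929, 0.7849437, 1.0000000)
names(eff) <- seq_along(eff)

# Units whose efficiency is 1, up to a small tolerance (assumed value).
tol <- 1e-6
efficient_units <- names(eff)[abs(eff - 1) < tol]
n_efficient <- length(efficient_units)

# Basic descriptive statistics of the efficiency scores.
mean_eff <- mean(eff)
median_eff <- median(eff)

print(efficient_units)
print(n_efficient)
print(round(mean_eff, 7))
print(median_eff)
```

A tolerance is used instead of an exact comparison because efficiencies computed by a solver may differ from 1 by rounding error.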
By default, when the print function is called on an adea model, it prints only the efficiencies. The summary function gives more detailed output:
summary(m4)
#> Model name:
#> Orientation is input
#> Inputs: Books.I2 Staff.I3 Populations.I4
#> Outputs: Borrow.O2
#> Input loads: 1.193651 0.9031744 0.9031744
#> Output loads: 1
#> Model load: 0.903174350658053
#> #Efficients: 4
#> Efficiencies:
#> 1 2 3 4 5 6 7 8
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942
#> 9 10 11 12 13 14 15 16
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710
#> 17 18 19 20 21 22 23
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000
#> Summary of efficiencies:
#> Mean sd Min. 1st Qu. Median 3rd Qu. Max.
#> 0.7270638 0.1793772 0.3026132 0.6170450 0.7215430 0.8159097 1.0000000
Fernandez-Palacin, Fernando, María Auxiliadora Lopez-Sanchez, and Manuel Munoz-Marquez. 2018. “Stepwise selection of variables in DEA using contribution loads.” Pesquisa Operacional 38 (1): 31–52. http://dx.doi.org/10.1590/0101-7438.2018.038.01.0031.
Villanueva-Cantillo, Jeyms, and Manuel Munoz-Marquez. 2021. “Methodology for Calculating Critical Values of Relevance Measures in Variable Selection Methods in Data Envelopment Analysis.” European Journal of Operational Research 290 (2): 657–70. https://doi.org/10.1016/j.ejor.2020.08.021.
Universidad de Cádiz, fernando.fernandez@uca.es
Universidad de Cádiz, manuel.munoz@uca.es