The Shapley value is a concept from game theory that quantifies how much each player contributes to the game outcome (Shapley 1953). The concept, however, has many more use cases: it provides a method to quantify the importance of predictors in regression analysis or machine learning models, and can be used in a wide variety of decomposition problems (Shorrocks 2013). Most implementations focus on one narrow use case, although the algorithm for the Shapley value decomposition is always the same – it is just the concrete value function that varies. This package provides a simple algorithm for the Shapley value decomposition, and also supports hierarchical decomposition using the Owen value.
The key advantage of the Shapley decomposition framework is the connection with counterfactuals: Once appropriate counterfactuals for each combination of factors have been identified, the method will produce an appropriate decomposition.
devtools::install_github("elbersb/shapley")
The package provides a shapley
function that takes two main arguments:
the value function and a vector of factor names. The value function
needs to be an R function that takes one or more arguments, where the
first argument defines the factors that are included in the calculation
of the outcome value. The shapley
function will call the value
function repeatedly, each time with a different set of factors.
For a very simple example, consider that an outcome is determined by two factors “A” and “B”, which contribute 1 and 2, respectively. (The factors are linearly additive, which makes the use of Shapley value decomposition unnecessary, but it works as an illustration.) The value function is thus defined as:
simple <- function(factors = c()) {
value <- 0
if ("A" %in% factors) value <- value + 1
if ("B" %in% factors) value <- value + 2
return(value)
}
We now supply the value function to shapley
, along with the factor
names:
shapley(simple, c("A", "B"), silent = TRUE)
#> factor value
#> 1 A 1
#> 2 B 2
As expected, the marginal contributions of the two factors are 1 and 2, respectively. For the two factors, we can manually compute the contribution as follows:
# A:
1/2 * (simple("A") - simple()) + 1/2 * (simple(c("A", "B")) - simple("B"))
#> [1] 1
# B:
1/2 * (simple("B") - simple()) + 1/2 * (simple(c("B", "A")) - simple("A"))
#> [1] 2
Across the two computations, most terms occur twice. Also note that
simple(c("A", "B")) == simple(c("B", "A"))
. The shapley
function
only calculates each term once, and then caches the result. This leads
to great speed improvements once we consider a greater number of
factors.
For this example (taken from
Wikipedia),
consider three players. Players 1 and 2 supply right-hand gloves, while
Player 3 supplies a left-hand glove. The game is only successful if
players with both types of gloves enter into a coalition. We thus define
the value function as 1 if pairs {1,3}
, {2,3}
or {1,2,3}
are
formed, and 0 otherwise. In R code:
glove <- function(factors) {
if (length(factors) > 1 & 3 %in% factors) return(1)
return(0)
}
To compute the marginal contributions of each player, use:
shapley(glove, c(1, 2, 3), silent = TRUE)
#> factor value
#> 1 1 0.1666667
#> 2 2 0.1666667
#> 3 3 0.6666667
Consider this simple regression model and its R2:
model <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(model)$r.squared
#> [1] 0.8496636
The Shapley value decomposition allows us to determine how much each predictor contributes to the R2. To do this, we need to define the value function in a way that it runs the regression with the appropriate subset of predictors. It should return 0 when there are no predictors:
reg_mtcars <- function(factors) {
if (length(factors) == 0) return(0)
formula <- paste("mpg ~", paste(factors, collapse = "+"))
m <- lm(formula, data = mtcars)
summary(m)$r.squared
}
# test - should be the same as above:
reg_mtcars(c("wt", "qsec", "am"))
#> [1] 0.8496636
shapley(reg_mtcars, c("wt", "qsec", "am"), silent = TRUE)
#> factor value
#> 1 wt 0.4792448
#> 2 qsec 0.1574791
#> 3 am 0.2129397
We can also generalize the value function to apply to any dataset and dependent variable:
reg <- function(factors, dv, data) {
if (length(factors) == 0) return(0)
formula <- paste(dv, "~", paste(factors, collapse = "+"))
m <- lm(formula, data = data)
summary(m)$r.squared
}
shapley(reg, c("cyl", "hp", "am"), silent = TRUE, dv = "wt", data = mtcars)
#> factor value
#> 1 cyl 0.2791418
#> 2 hp 0.1960524
#> 3 am 0.2740727
Note that there are many packages (e.g., relaimpo) that provide this functionality specifically for regression analysis.
Another classic use case for the Shapley value is the decomposition of inequality indices (see Shorrocks 2013 among others). Enami et al. (2018) provide a simple example to show such a decomposition in the context of measuring the impact of taxes and transfers on income inequality.
Consider the following dataset income
, showing the market incomes,
taxes paid, transfers received, and the resulting final incomes for five
individuals:
MarketIncome | Tax | Transfer | FinalIncome |
---|---|---|---|
1 | -5 | 9 | 5 |
20 | -5 | 7 | 22 |
30 | -5 | 5 | 30 |
40 | -5 | 3 | 38 |
50 | -5 | 1 | 46 |
The Gini indices of the market and final incomes are:
gini_market <- ineq::Gini(income[["MarketIncome"]])
gini_final <- ineq::Gini(income[["FinalIncome"]])
round(c(gini_market, gini_final, gini_final - gini_market), 3)
#> [1] 0.335 0.278 -0.057
Taxes and transfers combined thus reduced inequality by about 0.057. There are now two different approaches to dividing this difference among the two factors (i.e., taxes and transfers). In what Sastre and Trannoy (2002) call the “zero income decomposition” (ZID), sources not under consideration are set to zero. In the alternative scenario, “equalized income decomposition” (EID), those sources are distributed evenly among the population. Both scenarios are easily implemented using different value functions:
zid <- function(factors, data) {
cntf <- data[["MarketIncome"]] # baseline for counterfactual income
for (f in factors)
cntf <- cntf + data[[f]]
ineq::Gini(cntf)
}
eid <- function(factors, data) {
cntf <- data[["MarketIncome"]]
if ("Tax" %in% factors)
cntf <- cntf + data[["Tax"]]
else
cntf <- cntf + mean(data[["Tax"]])
if ("Transfer" %in% factors)
cntf <- cntf + data[["Transfer"]]
else
cntf <- cntf + mean(data[["Transfer"]])
ineq::Gini(cntf)
}
These equalities hold in both scenarios:
zid(c(), income) == gini_market
#> [1] TRUE
zid(c("Tax", "Transfer"), income) == gini_final
#> [1] TRUE
eid(c(), income) == gini_market
#> [1] TRUE
eid(c("Tax", "Transfer"), income) == gini_final
#> [1] TRUE
Note that for EID, the first equality only holds because the sum of taxes and transfers is zero, i.e., those two sources cancel each other out. Once this is no longer the case, the EID method runs into problems (see Enami et al. for a detailed discussion). In any case, ZID and EID give different answers when only one factor is included:
zid("Tax", income)
#> [1] 0.4068966
eid("Tax", income)
#> [1] 0.3347518
This is because in the zero income scenario, transfers are set to zero when only taxes are considered, while in the equalized income scenario, transfers are distributed equally among the individuals. The Shapley values of the two scenarios are the following:
shapley(zid, c("Tax", "Transfer"), silent = TRUE, data = income)
#> factor value
#> 1 Tax 0.05700719
#> 2 Transfer -0.11374478
shapley(eid, c("Tax", "Transfer"), silent = TRUE, data = income)
#> factor value
#> 1 Tax 0.00000000
#> 2 Transfer -0.05673759
Whether ZID or EID is appropriate depends on the context. Sastre and Trannoy (2002) and Enami et al. (2018) address this question in further detail.
Continuing from the previous example (and again borrowing from Enami et
al.), consider the case that the tax shown above is actually composed of
two different taxes, Tax1
and Tax2
(these two columns sum to the
column Tax
in the previous example):
MarketIncome | Tax1 | Tax2 | Transfer | FinalIncome |
---|---|---|---|---|
1 | 0 | -5 | 9 | 5 |
20 | -1 | -4 | 7 | 22 |
30 | -2 | -3 | 5 | 30 |
40 | -3 | -2 | 3 | 38 |
50 | -4 | -1 | 1 | 46 |
When we now decompose this dataset (income2
) by three factors, we get
the following results:
# we can reuse the `zid` function from above,
# while the `eid` function would need to be adapted
owen(zid, c("Tax1", "Tax2", "Transfer"), silent = TRUE, data = income2)
#> factor value
#> 1 Tax1 -0.006012472
#> 2 Tax2 0.062502480
#> 3 Transfer -0.113227597
Note that the sum of the contributions of the two taxes does not equal the contribution for the tax above, although this is just the sum of the two separate taxes. Furthermore, the size of the transfer component is affected. As Enami et al. (2018, p. 108) write:
Given that no new tax has been added and that the only change is that some additional information about the sources of taxes has been included in the analysis, it is inconvenient that the Shapley value for transfers has also changed.
This is a unfortunate property of the Shapley decomposition, but it can
be partially remedied by using a hierarchical procedure, the Owen value
decomposition (Owen 1977). (An alternative is the Nested Shapley
decomposition recommended by Sastre and Trannoy (2002), which introduces
a new set of problems, though.) The shapley
package allows the
computation of Owen values by specifying the group structure using a
list of vectors:
owen(zid, list(c("Tax1", "Tax2"), c("Transfer")), silent = TRUE, data = income2)
#> group factor value
#> 1 1 Tax1 -0.00575388
#> 2 1 Tax2 0.06276107
#> 3 2 Transfer -0.11374478
Using this notation, we have grouped Tax1
and Tax2
together in one
group, while Transfer
is a group in itself. The results now line up
with the results of the Shapley decomposition above, where the taxes
were jointly entered as a single factor.
Note that the hierarchical procedure can also be used as an effective tool to increase the speed of computation when a large number of factors is included. For instance, when 8 factors are considered, 8! = 40320 permutations need to be calculated for each factor. Once the 8 factors are grouped into two groups with 4 factors each, the number of permutations that need to be calculated for each factor is only 2! * 4! * 4! = 1152.
Enami, A., N. Lustig, and R. Aranda. 2018. Analytic Foundations: Measuring the Redistributive Impact of Taxes and Transfers. In: N. Lustig (Ed.), Commitment to Equity Handbook. Estimating the Impact of Fiscal Policy on Inequality and Poverty, Washington, D.C.: Brookings, 56-115.
Owen, G. 1977. Values of Games with a Priori Unions. In: R. Henn and O. Moeschlin (Eds.), Mathematical Economics and Game Theory, Berlin and Heidelberg: Springer, 76-88.
Sastre, M. and A. Trannoy. 2002. Shapley Inequality Decomposition by Factor Components: Some Methodological Issues. Journal of Economics 77(1): 51-89. https://doi.org/10.1007/BF03052500
Shapley, L. S. 1953. A value for n-person games. In: A. W. Tucker and H. W. Kuhn (Eds.), Contributions to the theory of games (Vol. II), Princeton: Princeton University Press, 307–317.
Shorrocks, A. F. 2013. Decomposition procedures for distributional analysis: a unified framework based on the Shapley value. Journal of Economic Inequality 11: 1-28. https://doi.org/10.1007/s10888-011-9214-z