Summary of PDP, ALE, H-statistic Experiments


  1. All experiment data are subsets of the cslg data set, which contains 92938 papers labeled by arXiv with the cs.LG category.
  2. All experiments use the same query string Machine Learnin .
  3. Experiments that use the same number of papers use the same data set. For instance, the 500-paper data set used in the PDP experiments is the same one used in the ALE experiments. The subsets are:
    1. 100 papers;
    2. 200 papers;
    3. 500 papers;
    4. 1000 papers;
    5. 5000 papers;
    6. 92938 papers.


Partial Dependence Evaluation

The partial function is estimated as:

$$ \hat{f}_S(x_S) = \frac{1}{N}\sum_{i=1}^{N}\hat{f}(x_S, x_C^{(i)}) $$

| Notation | Explanation |
| --- | --- |
| $\hat{f}_S$ | the partial function |
| $S$ | the features that we are interested in |
| $C$ | the features that are not correlated with the features in $S$ |
| $x_S$ | the set of feature values for which the partial dependence function should be plotted |
| $x_C^{(i)}$ | the set of the other features' values |
| $N$ | the number of instances |

If $x_S$ contains a single feature, the result is a 1-way PDP.
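The averaging behind a 1-way PDP can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the experiments: `partial_dependence_1way`, `toy_score`, and the toy rows are hypothetical, and the model is assumed to be any function that maps a feature dictionary to a score.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def partial_dependence_1way(
    predict: Callable[[Row], float], rows: List[Row], feature: str
) -> Dict[Any, float]:
    """Return {value: average prediction} for every unique value of `feature`.

    For each unique value, the feature is overwritten in every row while the
    other features keep their observed values, and the predictions are averaged.
    """
    pd_values: Dict[Any, float] = {}
    for value in sorted({row[feature] for row in rows}):
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay untouched
            modified[feature] = value  # force the feature of interest
            total += predict(modified)
        pd_values[value] = total / len(rows)
    return pd_values


# Hypothetical toy model standing in for the real ranking model.
def toy_score(row: Row) -> float:
    return 2.0 * row["year"] + row["n_citations"]


rows = [
    {"year": 1.0, "n_citations": 10.0},
    {"year": 2.0, "n_citations": 20.0},
    {"year": 3.0, "n_citations": 30.0},
]
pd_year = partial_dependence_1way(toy_score, rows, "year")
```

Each unique value costs one model call per row, so a 1-way PDP needs roughly the number of rows times the number of unique values in model calls.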

PDP-based Feature Importance Evaluation

The basic motivation is that a flat PDP indicates that the feature is not important, and the more the PDP varies, the more important the feature is:

$$ I(x_S) = \sqrt{\frac{1}{K-1}\sum_{k=1}^K\left(\hat{f}_S(x^{(k)}_S) - \frac{1}{K}\sum_{k=1}^K \hat{f}_S(x^{(k)}_S)\right)^2} $$

Note that $x_S^{(k)}$, $k = 1, \dots, K$, are the $K$ unique values of the feature $x_S$.
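The importance measure is the sample standard deviation (with the $K-1$ denominator) of the partial dependence values. A minimal sketch, assuming the partial dependence values for the $K$ unique feature values are already available as a list; `pdp_importance` is a hypothetical helper name:

```python
import statistics
from typing import Sequence


def pdp_importance(pd_values: Sequence[float]) -> float:
    """Sample standard deviation of the PD values over the K unique feature values."""
    if len(pd_values) < 2:
        return 0.0  # a single value cannot vary, so treat the PDP as flat
    return statistics.stdev(pd_values)  # uses the K-1 denominator


flat = pdp_importance([5.0, 5.0, 5.0])       # flat PDP
varying = pdp_importance([22.0, 24.0, 26.0])  # PDP that changes with the feature
```

A perfectly flat PDP gets importance 0, which matches the all-zero values reported for authors in the smaller data sets.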

PDP-based Feature Interaction Evaluation

When we compute the partial dependence function for two features, we get a matrix with one entry for every possible pair of values of the two features.


Figure 1. 2-way PDP matrix for feature i and feature j

For each unique value $x_j^{(k)}$ of $x_j$, we can compute the importance of $x_i$, denoted $I(x_i|x_j)$, and then take the standard deviation of those importance values. The same can be done for $x_j$, and the two results are averaged. Large values indicate a possibly strong interaction.

$$ Std_{X_i} = std(I(x_i|x_j)) $$

$$ Std_{X_j} = std(I(x_j|x_i)) $$

$$ Interaction(x_i, x_j) = \frac{Std_{X_i} + Std_{X_j}}{2} $$

This is why we need to estimate the 2-way partial dependence values for the feature pairs.
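The interaction measure can be sketched directly on a 2-way PD matrix. The sketch assumes `matrix[a][b]` holds the partial dependence for the a-th unique value of $x_i$ and the b-th unique value of $x_j$, and it assumes the sample standard deviation for both steps (the text does not say which variant was used). The function names are hypothetical.

```python
import statistics
from typing import List

Matrix = List[List[float]]


def conditional_importances(matrix: Matrix) -> List[float]:
    """Importance of the row feature for each fixed column value (std of each column)."""
    n_cols = len(matrix[0])
    return [statistics.stdev([row[c] for row in matrix]) for c in range(n_cols)]


def pdp_interaction(matrix: Matrix) -> float:
    """Average of std(I(x_i|x_j)) and std(I(x_j|x_i)) for a 2-way PD matrix."""
    transposed = [list(col) for col in zip(*matrix)]
    std_i = statistics.stdev(conditional_importances(matrix))      # I(x_i | x_j)
    std_j = statistics.stdev(conditional_importances(transposed))  # I(x_j | x_i)
    return (std_i + std_j) / 2


values = [1.0, 2.0, 3.0]
additive = [[a + b for b in values] for a in values]        # no interaction
multiplicative = [[a * b for b in values] for a in values]  # pure interaction

no_interaction = pdp_interaction(additive)
with_interaction = pdp_interaction(multiplicative)
```

For the additive matrix every column has the same spread, so the conditional importances do not vary and the measure is 0. For the multiplicative matrix the spread of $x_i$ depends on the value of $x_j$, so the measure is positive.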

Estimating the partial dependence values for two features has time complexity: $$ O(N\times UNI(x_i)\times UNI(x_j)) $$ where $UNI(x_i)$ denotes the number of unique values of feature $x_i$. In our case, for the categorical features title, abstract, and authors, the number of unique values equals the number of samples. Hence, for the 2-way PDP of any two of those three features, the time complexity is $O(N^3)$.

As a result, calculating 2-way PDPs even for 300 papers is computationally heavy. For now I have only performed the experiment for 100 papers and 200 papers.
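To make the cost concrete, here is a rough count of model predictions for one pair of these categorical features. It assumes, as stated above, that every paper has a unique value for each feature, and additionally assumes one model prediction per instance for every cell of the matrix (an assumption about the implementation; constant factors are ignored):

$$ N = 100:\quad 100 \times 100 \times 100 = 10^{6} $$

$$ N = 200:\quad 200 \times 200 \times 200 = 8 \times 10^{6} $$

$$ N = 300:\quad 300 \times 300 \times 300 = 2.7 \times 10^{7} $$

Under these assumptions, doubling the number of papers multiplies the work for a single feature pair by 8.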

1-way PDP Experiment For Feature Importance

100 Papers

The following picture shows the partial dependence plots of the categorical features for 100 papers.


Figure 2. PDPs of the categorical features for the 100-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the sequence number of the feature values, sorted by predicted score.

A total of 100 papers are used, so there are 100 feature values for each categorical feature.

The following picture shows the partial dependence plots of the numerical features for 100 papers.


Figure 3. PDPs of the numerical features for the 100-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the feature values in ascending order.

200 Papers


Figure 4. PDPs of the categorical features for the 200-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the sequence number of the feature values, sorted by predicted score.


Figure 5. PDPs of the numerical features for the 200-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the feature values in ascending order.

500 Papers


Figure 6. PDPs of the categorical features for the 500-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the sequence number of the feature values, sorted by predicted score.


Figure 7. PDPs of the numerical features for the 500-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the feature values in ascending order.

1000 Papers


Figure 8. PDPs of the categorical features for the 1000-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the sequence number of the feature values, sorted by predicted score.


Figure 9. PDPs of the numerical features for the 1000-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the feature values in ascending order.

5000 Papers

Because this experiment uses the largest data set among the 1-way PDP experiments, it is considered the most representative one. Some plots therefore include a zoomed sub-plot for better illustration.



Figure 10. PDPs of the categorical features for the 5000-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the sequence number of the feature values, sorted by predicted score.


Figure 11. PDPs of the numerical features for the 5000-paper data set.
The Y-axis shows the average partial dependence score of each feature value.
The X-axis shows the feature values in ascending order.


Putting the computed PDP-based importance values together, we have:

| Feature name | 100p. PDP-based Imp. | 200p. PDP-based Imp. | 500p. PDP-based Imp. | 1000p. PDP-based Imp. | 5000p. PDP-based Imp. |
| --- | --- | --- | --- | --- | --- |
| abstract | 5.683929 | 6.195750 | 6.238987 | 6.366324 | 6.155994 |
| title | 3.660768 | 3.486766 | 4.274033 | 3.769771 | 3.788171 |
| venue | 1.663986 | 0.892033 | 1.340470 | 1.218480 | 3.788171 |
| year | 0.454161 | 0.445873 | 0.474727 | 0.478728 | 0.470416 |
| n_citations | 0.190772 | 0.191021 | 0.184865 | 0.199331 | 0.193581 |
| authors | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

Table 1. PDP-based feature importance for all data sets.

All five PDP experiments show the same feature importance order: $$ abstract > title > venue > year > n\_citations > authors $$

2-way PDP Experiment For Feature Interaction

Because the two experiments produce too many plots to show here, please refer to the GitHub repository for the plots:


The joint result of the feature interaction estimation is shown below:

| Feature Pairs | 100p. PDP-based Int. | 200p. PDP-based Int. |
| --- | --- | --- |
| $title \times abstract$ | 1.937379 | 2.016726 |
| $title\times venue$ | 0.309480 | 0.207367 |
| $title\times year$ | 0.048401 | 0.045000 |
| $title\times n\_citations$ | 0.014476 | 0.019629 |
| $abstract\times venue$ | 0.603521 | 0.438959 |
| $abstract\times year$ | 0.029938 | 0.030917 |
| $abstract\times n\_citations$ | 0.007853 | 0.008880 |
| $venue\times year$ | 0.038115 | 0.005358 |
| $venue\times n\_citations$ | 0.004152 | 0.036003 |
| $year\times n\_citations$ | 0.056796 | 0.056069 |

Table 2. PDP-based feature interaction strength for the 100-paper and 200-paper data sets.

Note that pairs containing the feature authors are not shown in the table because their values are all zero.


From the results, we can conclude that the top 3 interactions are:

  1. $title \times abstract$
  2. $abstract\times venue$
  3. $title\times venue$

and the first interaction is stronger than all the others combined.

Heat-map and Network Graph

Using the idea from, we can draw a heat-map and a network graph that show the feature importance and the feature interaction strength at the same time.


Figure 12. PDP-based heat-map showing feature importance and feature interaction strength.
Green cells indicate feature importance values. A deeper color indicates a larger value.
Purple cells indicate feature interaction strength. A deeper color indicates a larger value.
The cells are ordered by the sum of the two metrics. The top-left cells contribute more than the others.

And the graph:


Figure 13. PDP-based network graph showing feature importance and feature interaction strength.
Green nodes indicate feature importance values. A deeper color and a bigger size indicate a larger value.
Purple edges indicate feature interaction strength. A deeper color and a wider edge indicate a larger value.


Accumulated Local Effect Evaluation

The 1-way ALE is estimated as:

$$ \hat{f}_{j,ALE}(x) = \sum_{k=1}^{k_j(x)}\frac{1}{n_j(k)}\sum_{i:\,x_j^{(i)}\in N_j(k)}\left[\hat{f}(z_{k,j}, x_{\setminus j}^{(i)}) - \hat{f}(z_{k-1,j}, x_{\setminus j}^{(i)})\right] - c $$

| Notation | Explanation |
| --- | --- |
| $\hat{f}_{j,ALE}$ | the accumulated local effect function |
| $\{N_j(k)\}_{k=1}^{K}$ | the sufficiently fine intervals of the feature $x_j$ |
| $N_j(k)$ | the interval $(z_{k-1,j}, z_{k,j}]$ |
| $n_j(k)$ | the number of points falling into the interval $N_j(k)$ |
| $k_j(x)$ | the index of the interval which $x$ falls into |
| $\hat{f}(z_{k,j}, x_{\setminus j}^{(i)})$ | for instance $x^{(i)}$, we replace $x_j^{(i)}$ with the value of the right interval end-point $z_{k,j}$ |
| $\hat{f}(z_{k-1,j}, x_{\setminus j}^{(i)})$ | for instance $x^{(i)}$, we replace $x_j^{(i)}$ with the value of the left interval end-point $z_{k-1,j}$ |
| $c$ | the mean of all the uncentered ALE values of the intervals |
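The steps described by the notation (split the feature range into intervals, average the prediction difference between the right and left end-points inside each interval, accumulate, then center) can be sketched as follows. This is a sketch for a numerical feature only, under stated assumptions: the interval edges are taken from the sorted data values, the first interval also contains its left end-point, and `ale_1way` and `toy_score` are hypothetical names rather than the code used in the experiments.

```python
from bisect import bisect_left
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def ale_1way(
    predict: Callable[[Row], float], rows: List[Row], feature: str, n_intervals: int
) -> List[Tuple[float, float]]:
    """Return [(right end-point, centered ALE)] for each interval of `feature`."""
    values = sorted(row[feature] for row in rows)
    n = len(values)
    # Quantile-like edges picked from the data; duplicates collapse into one edge.
    edges = sorted({values[(k * (n - 1)) // n_intervals] for k in range(n_intervals + 1)})
    if len(edges) < 2:
        return []  # constant feature: no interval to accumulate over

    # members[k - 1] holds the rows whose value falls into interval k.
    members: List[List[Row]] = [[] for _ in range(len(edges) - 1)]
    for row in rows:
        k = max(1, bisect_left(edges, row[feature]))
        members[k - 1].append(row)

    uncentered: List[float] = []
    running = 0.0
    for k, group in enumerate(members, start=1):
        # Local effect: prediction at the right end-point minus the left end-point.
        diffs = [
            predict({**row, feature: edges[k]}) - predict({**row, feature: edges[k - 1]})
            for row in group
        ]
        running += sum(diffs) / len(diffs)
        uncentered.append(running)

    # Center so that the ALE averaged over all instances is zero.
    constant = sum(uncentered[i] * len(group) for i, group in enumerate(members)) / n
    return [(edges[i + 1], uncentered[i] - constant) for i in range(len(members))]


# Hypothetical toy model: the effect of `year` is linear with slope 2.
def toy_score(row: Row) -> float:
    return 2.0 * row["year"] + row["n_citations"]


rows = [{"year": float(y), "n_citations": 5.0} for y in (1, 2, 3, 4)]
ale_year = ale_1way(toy_score, rows, "year", n_intervals=3)
```

For the linear toy model the ALE rises by 2 per unit of `year`, which is the slope of the model. Unlike the PDP, each row is only evaluated at the two end-points of its own interval.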

ALE-based Feature Importance Evaluation

The same idea as in the PDP-based evaluation is used: feature importance is measured by how far the ALE plot deviates from a flat line.

ALE-based Feature Interaction Evaluation

The same idea as in the PDP-based evaluation is used: feature interaction strength is measured by the standard deviation of the 2-way ALE matrix.

$$ Interact(x_i, x_j) = Std(\widehat{ALE}(x_i,x_j)) $$

1-way ALE Experiment

1000 Papers

For the 1000-paper data set, the interval size for every feature is 1% of the 1000 papers.

This means that every ALE plot uses a total of 100 intervals for ALE estimation.

The following plots show the ALE of the categorical features.


Figure 14. ALE plots of the categorical features for the 1000-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the sequence numbers of the intervals.


Figure 15. ALE plots of the numerical features for the 1000-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the maximum value of each interval.

5000 Papers

For the 5000-paper data set, the interval size for every feature is 1% of the 5000 papers.

This means that every ALE plot uses a total of 100 intervals for ALE estimation.

The following plots show the ALE of the categorical features.


Figure 16. ALE plots of the categorical features for the 5000-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the sequence numbers of the intervals.


Figure 17. ALE plots of the numerical features for the 5000-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the maximum value of each interval.

All cslg Papers

For the 92938-paper data set, the interval size for every feature is 1000 papers.

This means that every ALE plot uses a total of 93 intervals for ALE estimation.
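The interval counts quoted for the three data sets can be checked with a small calculation. The sketch assumes the last interval may hold fewer papers than the others (hence the ceiling), which is consistent with the 93 intervals quoted for 92938 papers; `interval_count` is a hypothetical helper.

```python
def interval_count(n_papers: int, interval_size: int) -> int:
    """Number of intervals when each holds `interval_size` papers (last one may be smaller)."""
    return -(-n_papers // interval_size)  # integer ceiling division


count_1000 = interval_count(1000, 1000 // 100)  # 1% of 1000 papers per interval
count_5000 = interval_count(5000, 5000 // 100)  # 1% of 5000 papers per interval
count_full = interval_count(92938, 1000)        # fixed size of 1000 papers
```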

The following plots show the ALE of the categorical features. Again, because this is the largest data set, zoomed sub-plots are used for better illustration.


Figure 18. ALE plots of the categorical features for the 92938-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the sequence numbers of the intervals.


Figure 19. ALE plots of the numerical features for the 92938-paper data set.
The Y-axis shows the accumulated local effect values of the intervals.
The X-axis shows the maximum value of each interval.


The first two experiments largely reproduce the importance order found in the PDP experiments. The one exception is the 1000-paper data set, where title (4.289228) scores marginally higher than abstract (4.243614).

But when we use the biggest data set we have, the feature authors contributes slightly more than venue does. This might be because, in a small data set, none of the authors values can contribute to the predictions. The positions of the other features remain the same.

| Feature name | 1000p. ALE-based Imp. | 5000p. ALE-based Imp. | 92938p. ALE-based Imp. |
| --- | --- | --- | --- |
| abstract | 4.243614 | 5.085414 | 5.664805 |
| title | 4.289228 | 3.992325 | 2.382722 |
| venue | 0.829445 | 1.340829 | 1.316654 |
| authors | 0.000000 | 0.000000 | 1.327196 |
| year | 0.460942 | 0.449724 | 0.433338 |
| n_citations | 0.217293 | 0.238238 | 0.236456 |

Table 3. ALE-based feature importance for the 1000-paper, 5000-paper and 92938-paper data sets.

From the last result, we get the following feature importance order: $$ abstract > title > authors > venue > year > n\_citations $$

2-way ALE Experiment

Because these experiments produce too many plots to show here, please refer to the GitHub repository for the plots:


The joint result of the feature interaction estimation is shown below:

| Feature Pairs | 1000p. ALE-based Int. | 5000p. ALE-based Int. | 92938p. ALE-based Int. |
| --- | --- | --- | --- |
| $title \times abstract$ | 3.852272 | 4.310072 | 5.933907 |
| $title\times venue$ | 0.000000 | 0.407969 | 0.262588 |
| $title\times authors$ | 0.000000 | 0.000000 | 0.211917 |
| $title\times year$ | 0.024064 | 0.069945 | 0.146681 |
| $title\times n\_citations$ | 0.017638 | 0.633668 | 0.056150 |
| $abstract\times venue$ | 0.000000 | 0.438959 | 0.630956 |
| $abstract\times authors$ | 0.000000 | 0.000000 | 0.188118 |
| $abstract\times year$ | 0.006473 | 0.024601 | 0.088027 |
| $abstract\times n\_citations$ | 0.006053 | 0.015580 | 0.029473 |
| $venue\times authors$ | 0.000000 | 0.000000 | 0.130406 |
| $venue\times year$ | 0.000659 | 0.005818 | 0.003074 |
| $venue\times n\_citations$ | 0.000554 | 0.002192 | 0.006128 |
| $authors\times year$ | 0.000000 | 0.000000 | 0.005345 |
| $authors\times n\_citations$ | 0.000000 | 0.000000 | 0.011496 |
| $year\times n\_citations$ | 0.044821 | 0.108069 | 0.180502 |

Table 4. ALE-based feature interaction strength for the 1000-paper, 5000-paper and 92938-paper data sets.


From the result on the full 92938-paper data set, we can conclude that the top 3 interactions are:

  1. $title \times abstract$
  2. $abstract\times venue$
  3. $title\times venue$

and the first interaction is stronger than all the others combined.

Heat-map and Network Graph

Since the results of the first two ALE experiments are unstable, we only show the last experiment here.


Figure 20. ALE-based heat-map showing feature importance and feature interaction strength.
Green cells indicate feature importance values. A deeper color indicates a larger value.
Purple cells indicate feature interaction strength. A deeper color indicates a larger value.
The cells are ordered by the sum of the two metrics. The top-left cells contribute more than the others.


Figure 21. ALE-based network graph showing feature importance
and feature interaction strength.
Green nodes indicate feature importance values.
A deeper color and a bigger size indicate a larger value.
Purple edges indicate feature interaction strength.
A deeper color and a wider edge indicate a larger value.


Figure 22. ALE-based network graph showing feature importance
and feature interaction strength.
This graph is altered by removing the abstract node in order to show the metrics for the rest of the features.

H-statistic Feature Interaction

This technique is based on the partial dependence function. Its formula is: $$ H^2_{jk} = \frac{\sum_{i=1}^n\left[PD_{jk}(x_{j}^{(i)},x_k^{(i)})-PD_j(x_j^{(i)}) - PD_k(x_{k}^{(i)})\right]^2}{\sum_{i=1}^n{PD}^2_{jk}(x_j^{(i)},x_k^{(i)})} $$ It estimates the interaction between features $j$ and $k$.
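A minimal sketch of how $H^2_{jk}$ can be computed from partial dependence values evaluated at the observed data points. It assumes the partial dependence functions are mean-centered before they are combined, which the formula above does not state explicitly, and it treats the model as any function from a feature dictionary to a score. `h_statistic_squared` and the toy data are hypothetical, not the code used in the experiments.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def h_statistic_squared(
    predict: Callable[[Row], float], rows: List[Row], feat_j: str, feat_k: str
) -> float:
    """Squared H-statistic for features j and k, evaluated at the data points."""
    n = len(rows)

    def pd(overrides: Row) -> float:
        # Partial dependence: average prediction with some features forced.
        return sum(predict({**row, **overrides}) for row in rows) / n

    def center(xs: List[float]) -> List[float]:
        mean = sum(xs) / len(xs)
        return [x - mean for x in xs]

    pd_j = center([pd({feat_j: r[feat_j]}) for r in rows])
    pd_k = center([pd({feat_k: r[feat_k]}) for r in rows])
    pd_jk = center([pd({feat_j: r[feat_j], feat_k: r[feat_k]}) for r in rows])

    numerator = sum((jk - j - k) ** 2 for jk, j, k in zip(pd_jk, pd_j, pd_k))
    denominator = sum(jk ** 2 for jk in pd_jk)
    return numerator / denominator if denominator > 0 else 0.0


# Toy data: every combination of a, b in {-1, 1}.
grid = [{"a": a, "b": b} for a in (-1.0, 1.0) for b in (-1.0, 1.0)]

h2_additive = h_statistic_squared(lambda r: 2.0 * r["a"] + r["b"], grid, "a", "b")
h2_product = h_statistic_squared(lambda r: r["a"] * r["b"], grid, "a", "b")
```

The additive toy model gives 0 (no interaction) and the pure product gives 1 (the whole effect comes through the interaction). The computation needs on the order of $n^2$ model calls per feature pair.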

The experiment results are presented in the following table:

| Features | H-statistic for 500p. | H-statistic for 1000p. | H-statistic for 5000p. |
| --- | --- | --- | --- |
| $title \times abstract$ | 0.2174814336 | 0.2664149087 | 0.2510489798 |
| $title \times venue$ | 0.9459569320 | 0.9906408142 | 1.0314859797 |
| $title \times authors$ | 0.9691326399 | 1.0153224425 | 1.0475620936 |
| $title \times year$ | 0.9624149158 | 1.0030108805 | 1.0306418632 |
| $title \times n\_citations$ | 0.9608109213 | 1.0028524626 | 1.0363504646 |
| $abstract \times venue$ | 0.6530887459 | 0.6480989307 | 0.6929262084 |
| $abstract \times authors$ | 0.6731584914 | 0.6620560612 | 0.7052068789 |
| $abstract \times year$ | 0.6706629075 | 0.6604602498 | 0.6956860299 |
| $abstract \times n\_citations$ | 0.6677419869 | 0.6571619591 | 0.7020495081 |
| $venue \times authors$ | 0.9688473428 | 0.9769444729 | 0.9874531648 |
| $venue \times year$ | 0.9567614749 | 0.9649247107 | 0.9764694790 |
| $venue \times n\_citations$ | 0.9589898713 | 0.9679587055 | 0.9792065804 |
| $authors \times year$ | 0.9843642076 | 0.9860892220 | 0.9887802334 |
| $authors \times n\_citations$ | 0.9900262401 | 0.9904366160 | 0.9915394058 |
| $year \times n\_citations$ | 1.0115519498 | 1.0136085292 | 1.0139521223 |

Table 5. H-statistic feature interaction strength for the 500-paper, 1000-paper and 5000-paper data sets.

Although the results are stable, they are not promising. The heat-map helps build the intuition.


Figure 23. H-statistic heat-map showing feature interaction strength.
Purple cells indicate feature interaction strength. A deeper color indicates a larger value.
The cells are ordered by the sum of the two metrics.
The top-left cells contribute more than the others.


Figure 24. H-statistic network graph showing feature interaction strength.
Purple edges indicate feature interaction strength.
A deeper color and a wider edge indicate a larger value.

According to the book Interpretable Machine Learning:

"The statistic is 0 if there is no interaction at all and 1 if all of the variance of the $PD_{{jk}}$ or $f$ is explained by the sum of the partial dependence functions. An interaction statistic of 1 between two features means that each single PD function is constant and the effect on the prediction only comes through the interaction. The H-statistic can also be larger than 1, which is more difficult to interpret."

The H-statistic results show that $title \times authors$ has the strongest interaction in the 1000-paper and 5000-paper experiments, and that $title \times abstract$ has the weakest interaction in all three. This is quite different from what we expected and from common sense.

Also, the paper A Simple and Effective Model-Based Variable Importance Measure says: "To our surprise, the H-statistic did not seem to catch the true interaction between x1 and x2.". This suggests that the results of the H-statistic can be misleading to some extent.

So we turn to the PDP-based feature interaction detection and compare it with the ALE-based interaction detection.