R でロジスティック回帰分析の変数選択

Rを使って、多重ロジスティック回帰分析でBICを使って、簡単に変数選択ができる。

＞＞もう統計で悩むのを終わりにしませんか？

↑1万人以上の医療従事者が購読中

変数選択の関数の前に BIC とは
ロジスティック回帰でbestglm()を使う準備とサンプルデータ
ロジスティック回帰でbestglm()の使い方
まとめ

変数選択の関数の前に BIC とは

BICは、 Bayesian Information Criterionの頭文字語。

統計モデルへのあてはまりを検討するときに、変数が多すぎると評価が下がる規準になっている。

変数が多ければ多いほど、統計モデルへのあてはまりはよくなるが、新たなデータでの予測には向かなくなるし、そもそも複雑より単純なモデルで記述できたほうがいい。

AIC (赤池情報量規準) と似たような概念だ。

ロジスティック回帰でbestglm()を使う準備とサンプルデータ

変数選択には、R の bestglm() 関数を使用する。

まずbestglmをインストールしておく。

install.packages("bestglm")

bestglmを呼び出す。

library(bestglm)

SAheartというデータを使う。

"A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa."

SAheartの構造を確認する。

Endpointのchd（2値データ）が最後の列にあることを確認する。

それがbestglmの特殊なところ。

最後のカラムに従属変数。

これが必須。

> str(SAheart)
'data.frame':   462 obs. of  10 variables:
 $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num  25.3 28.9 29.1 32 26 ...
 $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
 $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

＞＞もう統計で悩むのを終わりにしませんか？

↑1万人以上の医療従事者が購読中

ロジスティック回帰でbestglm()の使い方

data frameのSAheartを指定して、 familyにbinomialを指定すればOK.

res1 <- bestglm(SAheart, family=binomial)

結果を表示する。

残った変数はタバコとLDL、家族歴、TypeA性格、年齢。

> res1 <- bestglm(SAheart, family=binomial)
Morgan-Tatar search since family is non-gaussian.
> res1
BIC
BICq equivalent for q in (0.190525988534159, 0.901583162187443)
Best Model:
                  Estimate Std. Error   z value     Pr(>|z|)
(Intercept)    -6.44644451 0.92087165 -7.000372 2.552830e-12
tobacco         0.08037533 0.02587968  3.105731 1.898095e-03
ldl             0.16199164 0.05496893  2.946967 3.209074e-03
famhistPresent  0.90817526 0.22575844  4.022774 5.751659e-05
typea           0.03711521 0.01216676  3.050542 2.284290e-03
age             0.05046038 0.01020606  4.944159 7.647325e-07

検討した変数の組み合わせ一覧結果を見るなら、res1$Subsets

> res1$Subsets
   Intercept   sbp tobacco   ldl adiposity famhist typea obesity alcohol   age logLikelihood      BIC
0       TRUE FALSE   FALSE FALSE     FALSE   FALSE FALSE   FALSE   FALSE FALSE     -298.0542 596.1084
1       TRUE FALSE   FALSE FALSE     FALSE   FALSE FALSE   FALSE   FALSE  TRUE     -262.7812 531.6979
2       TRUE FALSE   FALSE FALSE     FALSE    TRUE FALSE   FALSE   FALSE  TRUE     -253.3291 518.9293
3       TRUE FALSE    TRUE FALSE     FALSE    TRUE FALSE   FALSE   FALSE  TRUE     -247.6927 513.7921
4       TRUE FALSE    TRUE FALSE     FALSE    TRUE  TRUE   FALSE   FALSE  TRUE     -242.3572 509.2566
5*      TRUE FALSE    TRUE  TRUE     FALSE    TRUE  TRUE   FALSE   FALSE  TRUE     -237.8428 506.3634
6       TRUE FALSE    TRUE  TRUE     FALSE    TRUE  TRUE    TRUE   FALSE  TRUE     -236.9899 510.7933
7       TRUE  TRUE    TRUE  TRUE     FALSE    TRUE  TRUE    TRUE   FALSE  TRUE     -236.2745 515.4979
8       TRUE  TRUE    TRUE  TRUE      TRUE    TRUE  TRUE    TRUE   FALSE  TRUE     -236.0704 521.2253
9       TRUE  TRUE    TRUE  TRUE      TRUE    TRUE  TRUE    TRUE    TRUE  TRUE     -236.0700 527.3601

Best modelから5番目までの表示させるときは、res1$BestModels

> res1$BestModels
    sbp tobacco   ldl adiposity famhist typea obesity alcohol  age Criterion
1 FALSE    TRUE  TRUE     FALSE    TRUE  TRUE   FALSE   FALSE TRUE  506.3634
2 FALSE    TRUE FALSE     FALSE    TRUE  TRUE   FALSE   FALSE TRUE  509.2566
3 FALSE    TRUE  TRUE     FALSE    TRUE FALSE   FALSE   FALSE TRUE  509.9861
4 FALSE   FALSE  TRUE     FALSE    TRUE  TRUE   FALSE   FALSE TRUE  510.5745
5 FALSE    TRUE  TRUE     FALSE    TRUE  TRUE    TRUE   FALSE TRUE  510.7933

Best modelのestimates (coefficients)を表示させるときは、res1$BestModel

> res1$BestModel

Call:  glm(formula = y ~ ., family = family, data = Xi, weights = weights)

Coefficients:
   (Intercept)         tobacco             ldl  famhistPresent           typea             age  
      -6.44644         0.08038         0.16199         0.90818         0.03712         0.05046  

Degrees of Freedom: 461 Total (i.e. Null);  456 Residual
Null Deviance:      596.1 
Residual Deviance: 475.7        AIC: 487.7