统计（九）_置换检验

对于正态分布或其他已知分布的数据，有相应的假设检验与置信区间的计算方法，但是当数据抽样自未知或混合分布、样本量过小、存在离群点、基于理论分布设计合适的统计检验过于复杂且数学上难以处理等情况，就需要使用基于随机化和重抽样的统计方法。

本次推文主要介绍置换检验，下次推文主要介绍自助法。

置换检验

置换检验也称随机检验或重随机化检验

步骤：（引自R语言实战，此时数据分为A和B两组，每组有5个得分）

（1）与参数方法类似，计算观测数据的t统计量，称为t0;

（2）将10个得分放在一个组中；

（3）随机分配五个得分到A处理中，并分配五个得分到B处理中；

（4）计算并记录新观测的t统计量；

（5）对每一种可能随机分配重复（3）~（4）步，此处有252种可能的分配组合；

（6）将252个统计量按升级序排列，这便是基于样本数据的经验分布；

（7）若t0落在经验分布中间95%的外面，则在0.05的显著性水平下，拒绝两个处理组的总体均值相等的零假设。

此处并非将统计量与理论分布进行比较，而是将其与转换观测数据后获得的经验分布进行比较，根据统计量值的极端性判断是否有理由拒绝零假设。因此，下面可以比较一下参数法与置换检验结果是否有差异。

当样本量非常大时，可以使用蒙特卡洛模拟，得到近似的检验，降低运算时间。

使用R包coin和lmPerm可实现置换检验。

coin独立性问题

coin包有一个参数：若distribution。若distribution=“exact”，则在零假设条件下，分布的计算是精确的（即依据所有可能的排列组合）；也可以根据它的渐进分布（distribution=“asymptotic”）或蒙特卡洛重抽样（distribution=”approximate(B=#)”）来做近似计算，其中#指所需重复的次数。distribution=”exact”当前仅可用于两样本问题。

与t检验比较

set.seed(1234)
#coin包与传统t检验比较

#t检验
library(coin)
## Loading required package: survival
## Warning: package 'survival' was built under R version 3.6.3
score<-c(40,57,45,55,58,57,64,55,62,65)
treatment<-factor(c(rep("A",5),rep("B",5)))
mydata<-data.frame(treatment,score)
t.test(score~treatment,data=mydata,var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  score by treatment
## t = -2.345, df = 8, p-value = 0.04705
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.0405455  -0.1594545
## sample estimates:
## mean in group A mean in group B 
##            51.0            60.6
##精确随机置换
oneway_test(score~treatment,data=mydata,distribution="exact")
## 
##  Exact Two-Sample Fisher-Pitman Permutation Test
## 
## data:  score by treatment (A, B)
## Z = -1.9147, p-value = 0.07143
## alternative hypothesis: true mu is not equal to 0

传统t检验p<0.05，而精确置换检验却p>0.05，由于样本量较小，更倾向于相信置换检验，即结果没有统计学差异，可以尝试增大样本量。

与Wilcox秩和检验比较

set.seed(1234)
#coin包与Wilcox秩和检验比较

library(MASS)
UScrime<-transform(UScrime,So=factor(So)) #So转换为因子类型

##Wilcox秩和检验
wilcox.test(Prob~So,data=UScrime) 
## 
##  Wilcoxon rank sum test
## 
## data:  Prob by So
## W = 81, p-value = 8.488e-05
## alternative hypothesis: true location shift is not equal to 0
##精确随机置换
wilcox_test(Prob~So,data=UScrime,distribution="exact") 
## 
##  Exact Wilcoxon-Mann-Whitney Test
## 
## data:  Prob by So (0, 1)
## Z = -3.7493, p-value = 8.488e-05
## alternative hypothesis: true mu is not equal to 0

可以看到二者结果一致，因为二者本身就是同一种统计方法。

K样本检验与方差分析比较

library(multcomp)
## Loading required package: mvtnorm
## Loading required package: TH.data
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
set.seed(1234)
#K样本检验与方差分析比较

##方差分析
fit<-aov(response~trt,data=cholesterol)
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## trt          4 1351.4   337.8   32.43 9.82e-13 ***
## Residuals   45  468.8    10.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##K样本检验

oneway_test(response~trt,data=cholesterol,distribution=approximate(9999))
## 
##  Approximative K-Sample Fisher-Pitman Permutation Test
## 
## data:  response by
##   trt (1time, 2times, 4times, drugD, drugE)
## chi-squared = 36.381, p-value < 1e-04

二者均有统计学差异。

置换检验判断两类别型变量的独立性，与卡方检验比较

set.seed(1234)
library(coin)
library(vcd)
## Loading required package: grid
Arthritis<-transform(Arthritis,Improved=as.factor(as.numeric(Improved)))
#K样本检验与方差分析比较

##卡方检验
mytable<-xtabs(~Treatment+Improved,data=Arthritis)
chisq.test(mytable)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 13.055, df = 2, p-value = 0.001463
##置换检验
chisq_test(Treatment~Improved,data=Arthritis,distribution=approximate(B=9999))
## Warning in approximate(B = 9999): 'B' is deprecated; use 'nresample' instead
## 
##  Approximative Pearson Chi-Squared Test
## 
## data:  Treatment by Improved (1, 2, 3)
## chi-squared = 13.055, p-value = 0.0018

二者结果类似。

置换检验数值变量间的相关性(独立性）与斯皮尔曼相关比较

states<-as.data.frame(state.x77)
set.seed(1234)

##斯皮尔曼相关
cor.test(states$Illiteracy,states$Murder,method = "spearman")
## Warning in cor.test.default(states$Illiteracy, states$Murder, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  states$Illiteracy and states$Murder
## S = 6823.1, p-value = 8.932e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6723592
##置换检验
spearman_test(Illiteracy~Murder,data=states,distribution=approximate(B=9999))
## Warning in approximate(B = 9999): 'B' is deprecated; use 'nresample' instead
## 
##  Approximative Spearman Correlation Test
## 
## data:  Illiteracy by Murder
## Z = 4.7065, p-value < 1e-04
## alternative hypothesis: true rho is not equal to 0

两样本和K样本相关性检验

对于两配对组的置换检验，可使用wilcoxsign_test（）函数；

多于两组时，使用friedman_test（）函数

library(coin)
library(MASS)
wilcoxsign_test(U1~U2,data=UScrime,distribution="exact")
## 
##  Exact Wilcoxon-Pratt Signed-Rank Test
## 
## data:  y by x (pos, neg) 
##   stratified by block
## Z = 5.9691, p-value = 1.421e-14
## alternative hypothesis: true mu is not equal to 0

lmPerm线性回归和方差分析

lmPerm包可作线性回归和方差分析的置换检验，lmp（）和aovp（）函数

perm=选项的可选值为“Exact”、“Prob”和“SPR”

Exact：根据所有可能的排列组合生成精确检验；

Prob：从所有可能的排列中不断抽样，直至估计的标准差在估计的p值0.1之下，判停准则由可选的Ca参数控制。

SPR：使用贯序概率比检验来判断何时停止抽样。

若观测数大于10，perm=“Exact”将自动默认转换为perm=”Prob”，因为精确检验只适用于小样本问题。

简单线性回归的置换检验

library(lmPerm)
## Warning: package 'lmPerm' was built under R version 3.6.3
set.seed(1234)
fit<-lmp(weight~height,data=women,perm="Prob")
## [1] "Settings:  unique SS : numeric variables centered"
summary(fit)
## 
## Call:
## lmp(formula = weight ~ height, data = women, perm = "Prob")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##        Estimate Iter Pr(Prob)    
## height     3.45 5000   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-Squared: 0.991,   Adjusted R-squared: 0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14

多元线性回归的置换检验

library(lmPerm)
set.seed(1234)
states<-as.data.frame(state.x77)
fit<-lmp(Murder~Population+Illiteracy+Income+Frost,data=states,perm="Prob")
## [1] "Settings:  unique SS : numeric variables centered"
summary(fit)
## 
## Call:
## lmp(formula = Murder ~ Population + Illiteracy + Income + Frost, 
##     data = states, perm = "Prob")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.79597 -1.64946 -0.08112  1.48150  7.62104 
## 
## Coefficients:
##             Estimate Iter Pr(Prob)    
## Population 2.237e-04   51   1.0000    
## Illiteracy 4.143e+00 5000   0.0004 ***
## Income     6.442e-05   51   1.0000    
## Frost      5.813e-04   51   0.8627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-Squared: 0.567,   Adjusted R-squared: 0.5285 
## F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

单因素方差分析的置换检验

library(lmPerm)
library(multcomp)
set.seed(1234)
fit<-aovp(response~trt,data=cholesterol,perm="Prob")
## [1] "Settings:  unique SS "
summary(fit)
## Component 1 :
##             Df R Sum Sq R Mean Sq Iter  Pr(Prob)    
## trt          4  1351.37    337.84 5000 < 2.2e-16 ***
## Residuals   45   468.75     10.42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

单因素协方差分析的置换检验

library(lmPerm)
set.seed(1234)
fit<-aovp(weight~gesttime+dose,data=litter,perm="Prob")
## [1] "Settings:  unique SS : numeric variables centered"
summary(fit)
## Component 1 :
##             Df R Sum Sq R Mean Sq Iter Pr(Prob)    
## gesttime     1   161.49   161.493 5000   0.0006 ***
## dose         3   137.12    45.708 5000   0.0392 *  
## Residuals   69  1151.27    16.685                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

双因素方差分析的置换检验

library(lmPerm)
set.seed(1234)
fit<-aovp(len~supp*dose,data=ToothGrowth,perm="Prob")
## [1] "Settings:  unique SS : numeric variables centered"
summary(fit)
## Component 1 :
##             Df R Sum Sq R Mean Sq Iter Pr(Prob)    
## supp         1   205.35    205.35 5000  < 2e-16 ***
## dose         1  2224.30   2224.30 5000  < 2e-16 ***
## supp:dose    1    88.92     88.92 2032  0.04724 *  
## Residuals   56   933.63     16.67                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

可以看到，置换检验主要用于假设检验，如果我们想要获得置信区间，可以使用自助法。

好了，今天的R语言实现统计方法系列推文暂时告一段落，我们下次再见吧！小伙伴们如果有什么统计上的问题，或者如果想要学习什么方面的生物信息内容，可以在微信群或者知识星球提问，没准哪天的推文就是专门解答你的问题哦！