Task5 Cohen的效应量d中重叠部分的概率计算

使用R进行统计相关问题的解决

Posted by Jiawen Wu on December 25, 2018

前言

这个…统计软件课程作业的发布贴——Task5 使用R进行统计相关问题的解决。 我还是要在这里羞耻地写上我的大名吴嘉雯,还有我的学号15307130412。


问题的来源

在Cohen的Statistical Power Analysis for the Behavioral Sciences(1988, 2nd ed.)的第21-22页,他写道:

“If we maintain the assumption that the populations being compared are normal and with equal variability, and conceive them further as equally numerous, it is possible to define measures of nonoverlap (U) associated with d which are intuitively compelling and meaningful”

“When d = 0.1, for instance, U1 here equals 7.7%, i.e., 7.7% of the area covered by both populations combined is not overlapped and when d = 2, U1 then equals 81.1%, the amount of combined area not shared by the two population distributions.”

然而他说了as equally numerous,却使用的是面积来进行计算,这和他之前的基于频率来计算重叠部分的表达方式矛盾。

因而本post就通过可视化的方式来看一下问题出在哪里。

Statistical Power Analysis for the Behavioral Sciences(1988, 2nd ed.):点击下载

环境设置

需要使用到的R package

  • library(dplyr)
  • library(ggplot2)
  • library(tidyr)
  • library(ggthemes)

ggplot的主题参数设置

bg <- "#ECF0F2" # 背景颜色
# 把画图时用到的一些底色都调成背景颜色
theme_set(theme_economist() + theme(panel.background = element_rect(fill = bg),
                                    plot.background = element_rect(fill = bg),
                                    strip.background = element_rect(fill = bg),
                                    legend.background  = element_rect(fill = bg),
                                    legend.key = element_rect(fill = bg)))
# 给会用到的几种点设置一下颜色
scale_color_rpsy <- scale_color_manual(values = c("experiment" = "#E8948E", 
                                                  "control" = "#3E91BA", 
                                                  "control_overlap" = "#82D9CB", 
                                                  "experiment_overlap" = "#FEF3AC"))

再次推荐一下超好用的科研作图调色盘:Adobe Color CC

题外话:介绍一下管道函数 %>% 的用法

  • 符号%>% 管道操作,其意思是将 %>% 左边的对象传递给右边的函数
  • 说明:%>% 来自dplyr包的管道函数,其作用是将前一步的结果直接传参给下一步的函数,从而省略了中间的赋值步骤,可以大量减少内存中的对象,节省内存
  • x %>% f(y) 等同于 f(x, y)
  • y %>% f(x, ., z) 等同于 f(x, y, z )
# 例1 下面的两段代码是一样的
plot(data.frame(x = rnorm(n = 200,100,15),
                control = rnorm(n = 200,110,16),
                experiment = rnorm(n = 200,130,13)))

data.frame(x = rnorm(n = 200,100,15),
                control = rnorm(n = 200,110,16),
                experiment = rnorm(n = 200,130,13)) %>% plot()

# 例2 下面的两段代码是一样的
plot(table(mtcars$gear,mtcars$am))

mtcars$gear %>%
                table(.,mtcars$am) %>%
                plot()

做数据

做两个效应量相差为Cohen‘s d = 2 的分布

set.seed(4443451)
cohend <- 2
# 产生随机的观测值(实验组和控制组)
n <- 10000
d <- data.frame(x = runif(n, -5, 5.5),
                control = runif(n, min = 0, max = 0.4),
                experiment = runif(n, min = 0, max = 0.4)) # 从同一个分布中抽取样本
str(d)

# 把这两个分布弄成效应量为2的两个分布
d <- d %>% 
        mutate(control = ifelse(control <= dnorm(x, 0, 1), control, NA),
               experiment = ifelse(experiment <= dnorm(x, cohend, 1), experiment, NA))

# d <- mutate(d,control = ifelse(control <= dnorm(x, 0, 1), control, NA),
#               experiment = ifelse(experiment <= dnorm(x, 0.5, 1), experiment, NA))

str(d)

蒙特•卡罗积分来得到重叠部分的观测值

# 写一个计算overlap的函数方便后续调用
overlap <- function(x) {
    pmin(dnorm(x, 0, 1), dnorm(x, cohend, 1))
}

d_long <- d %>%
    gather(dist, y, -x) %>% 
    mutate(overlap = ifelse(y <= overlap(x), paste(dist, "_overlap", sep = ""), dist),
           overlap = factor(overlap)) %>% 
    filter(!is.na(y))

画图演示

画出我们自己做出来的两个分布

d_long %>% 
    ggplot(aes(x, y, color = dist)) + 
    geom_point() +
    facet_wrap(~ dist, ncol = 1) +
    labs(title = "Cohen's d = 0.5时控制组和实验组的结果分布", 
         subtile = "每个点代表了1个观测值") +
  theme(text = element_text(family = "STHeiti")) +
  scale_color_rpsy

画出重叠图

d_long %>% 
    ggplot(aes(x, y, color = dist)) +
    geom_point(alpha = 0.5, size = 1.3) +
    labs(title = "两个分布的重叠部分", subtitle = "Cohen's d = 0.5") +
    theme(text = element_text(family = "STHeiti")) +
     scale_color_rpsy

对不同的部分上不同的颜色方便查看

# 再次介绍一下使用了管道函数的代码组织方式
mutate(summarise(group_by(d_long,overlap),n = n()),
          prop = n/sum(n),
          prop = paste(round(prop, 1)*100, "%", sep = ""),
          x = c(1.5, 0.25, -1, 0.25),
          y = c(0.2, 0.25, 0.2, 0.2))

# 上面那条就等价于下面这个
labels <- d_long %>% 
    group_by(overlap) %>% 
    summarise(n = n()) %>% 
    mutate(prop = n/sum(n),
           prop = paste(round(prop, 1)*100, "%", sep = ""),
           x = c(-1, 1, 3, 1),
           y = c(0.1, 0.15, 0.1, 0.1))

d_long %>% 
    ggplot(aes(x, y, color = overlap)) +
    geom_point(alpha = 0.6) +
    geom_label(data = labels, 
               aes(x=x, y = y, label = prop, color = overlap), 
               vjust = "center", show.legend = FALSE, size = 5) +
    labs(title = "各个部分所占的百分比", subtitle = "重叠部分的概率解释") +
    theme(text = element_text(family = "STHeiti")) +
    scale_color_rpsy

  • 这里计算百分比的办法:观测值的个数/总数
  • 所以就是20 + 20 = 40%

Cohen计算重叠部分使用的分布

面积计算而非频率计算,所以我们需要把重叠部分的点弄掉一半

labels <- d_long %>% 
    filter(overlap != "experiment_overlap") %>% 
    group_by(overlap) %>% 
    summarise(n=n()) %>% 
    mutate(prop = n/sum(n),
           prop = paste(round(prop, 2) * 100, "%", sep = ""),
           x = c(-0.5, 1, 2.5))

d_long %>% 
    filter(overlap != "experiment_overlap") %>% 
    ggplot(aes(x, y, color = overlap)) +
    geom_point(alpha = 0.5, size = 1.3) +
    geom_label(data = labels, 
               aes(x=x, y = 0.15, label = prop, color = overlap), 
               vjust = "center", show.legend = FALSE, size = 5) +
    labs(title = "被两个群体分布都cover了的比例",
         subtitle = "Cohen对于overlap的比例解释") +
    theme(text = element_text(family = "STHeiti")) +
    scale_color_rpsy

Reference

Grice & Barrett (2014) 在他们的文章 “A note on Cohen’s overlapping proportions of normal distributions” 里就指出了Cohen犯的这个错误。 原文链接


参考文件下载: