1 本資料について

心理学分野の研究・教育に携わっている筆者（杉本海里：https://kairi-pf.com）による、自身と所属ラボメンバーのための備忘録です。

架空のデータを作成し、Tidyverse関数群で前処理を行ったうえで、基本的な統計的仮説検定を行い、さらにggplot2で作図をしています。

心理統計に限らず統計解析全般に活用できますが、誤りも多くありそうですので、ご利用の際は参考程度になさってください。

チートシート（早見表）につき、第三者への解説は目的としておりませんので、言葉足らずな箇所が多くあるかと思いますがあらかじめご了承ください。

本資料に関してお気づきの点やご感想などがありましたら、こちらのGoogleフォームからご連絡をお願いいたします。

2 最初の準備

2.1 本記事の動作環境

R.version.string

## [1] "R version 4.4.0 (2024-04-24)"

2.2 パッケージ読み込み

まだインストールしていないパッケージは、install.packages("パッケージ名")でインストールが必要。

分散分析用パッケージである「anovakun_489.txt」は、井関-ANOVA君よりダウンロードする。

library(ppcor) #偏相関の検定、tidyverseよりも先に読み込まないと一部関数が衝突する。
library(tidyverse) #dplyr, tidyr, ggplot2を含む
library(patchwork) #ggplot2の複数の図を並べる
library(scales) #グラフの軸範囲外のデータを計算に含める（oob_keep関数）
library(ggpattern) #ggplot2のテクスチャパッケージ
library(psych) #相関行列の検定
library(plotrix) #標準誤差計算（std.error関数）
library(openxlsx) #Excel読み込み
library(effsize) #効果量算出
library(rcompanion) #効果量算出
library(rstatix) #さまざまな統計解析
library(TOSTER) #Brunner-Munzel検定
library(car) #回帰分析における多重共線性の確認
library(viridis) #色覚異常カラーパレット
source("anovakun_489.txt") #分散分析用パッケージ

2.3 データ読み込み

df_original = read.xlsx("data.xlsx")

2.4 架空データ作成

100名分の架空のデータフレーム「df_original」を用意した。なお、データは乱数生成に基づく。

列名	意味	補足
id	個人識別番号	1〜100
sex	性別	m：50名、f：40名、other：10名
time	実験に要した時間	-
session1-4	心理課題のセッションごとの得点	-
tipij1-10	BigFive特性を測定するTIPI-J質問紙のスコア	-

2.5 データ構造の確認

head関数で、データフレームの冒頭のみを確認できる。また、質的データの内訳はcount関数、量的データの要約統計量はsummary関数、不偏標準偏差はsd関数で確認する。

head(df_original)

##   id sex age time session1 session2 session3 session4 tipij1 tipij2 tipij3
## 1  1   f  46  990        0        8        8        3      1      2      2
## 2  2   m  28  258        5        8        3        5      1      1      6
## 3  3   m  48  587        3        7        4        2      5      7      5
## 4  4   m  25  445        4        0        0        5      3      5      1
## 5  5   f  20  708        2        6        6        4      3      5      4
## 6  6   m  43  342        9        6        9        0      2      7      4
##   tipij4 tipij5 tipij6 tipij7 tipij8 tipij9 tipij10
## 1      1      2      3      7      6      6       4
## 2      4      3      4      6      4      6       2
## 3      5      4      2      4      7      5       4
## 4      5      5      2      5      7      4       1
## 5      1      6      7      6      3      3       4
## 6      3      5      4      7      3      6       1

count(df_original, sex)

##     sex  n
## 1     f 40
## 2     m 50
## 3 other 10

summary(df_original$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   29.75   36.00   37.54   46.00   60.00

sd(df_original$age)

## [1] 12.35373

3 データハンドリング

3.1 よく使う関数の意味

関数名	意味
mutate	新しい列の作成、既存の列を変更、ファクターレベルの変更、などをする。
select	どのデータを列として抽出するかを選択する。もとのデータフレームに新たに追加するだけなら不要。
pull	selectと似ているが、データフレームではなくベクトルとして1つの列を抽出する。
if_else	条件式に基づいて処理を行う。if_else(条件式, 真, 偽）
case_when	条件式に基づいて処理を行う。複雑な条件の場合はif_elseよりも可読性が高い。
pivot_longer	ワイド型データからロング型データに変換する。
filter	該当する行のみ抽出する。
group_by	グルーピングする。summarize関数とセットで使うことが多い。
summarize	平均値や標準誤差などを算出する。

3.2 列同士の計算

次のコードでは下記操作を行っている。

ここで作成するデータフレーム「df」を、今後の分析・作図における基本データとする。

性別が「m」か「f」のデータのみを取り出す。
性別のファクターレベルを「f→m」から「m→f」に変更する。
session1からsession4までの平均値（score_mean）と合計得点（score_sum）を算出する。
TIPI-J得点に基づいて、Big Five性格特性の5因子の得点をそれぞれ算出する。
使用する列のみを抽出して、データフレーム「df」に格納する。

df = df_original %>%
  filter(sex=="m" | sex=="f") %>%
  mutate(
    sex                   = factor(sex, levels=c("m","f")),
    score_mean            = rowMeans(select(., session1:session4)),
    score_sum             = rowSums(select(., session1:session4)),
    bf1_extraversion      = tipij1 + 8 - tipij6,
    bf2_agreeableness     = 8 - tipij2 + tipij7,
    bf3_conscientiousness = tipij3 + 8 - tipij8,
    bf4_neuroticism       = tipij4 + 8 - tipij9,
    bf5_openness          = tipij5 + 8 - tipij10
  ) %>%
  select(sex, age, time, score_mean, score_sum, session1:session4, bf1_extraversion:bf5_openness)
head(df)

##   sex age time score_mean score_sum session1 session2 session3 session4
## 1   f  46  990       4.75        19        0        8        8        3
## 2   m  28  258       5.25        21        5        8        3        5
## 3   m  48  587       4.00        16        3        7        4        2
## 4   m  25  445       2.25         9        4        0        0        5
## 5   f  20  708       4.50        18        2        6        6        4
## 6   m  43  342       6.00        24        9        6        9        0
##   bf1_extraversion bf2_agreeableness bf3_conscientiousness bf4_neuroticism
## 1                6                13                     4               3
## 2                5                13                    10               6
## 3               11                 5                     6               8
## 4                9                 8                     2               9
## 5                4                 9                     9               6
## 6                6                 8                     9               5
##   bf5_openness
## 1            6
## 2            9
## 3            8
## 4           12
## 5           10
## 6           12

3.3 特定の列の値を変更

所要時間が平均未満のグループと平均以上のグループで分けるための識別値を追加する。

df_ifelse = df %>%
  mutate(time_group = if_else(time < mean(df$time), 0, 1)) %>%
  select(time_group)
head(df_ifelse)

##   time_group
## 1          1
## 2          0
## 3          1
## 4          0
## 5          1
## 6          0

複雑な条件の場合はcase_when関数を使用する。TRUEはその他を意味する。

df_casewhen = df %>%
  mutate(age_group = case_when(age < 30 ~ "age1",
                               age < 50 ~ "age2",
                               TRUE ~ "age3")
         ) %>%
  select(age_group)
head(df_casewhen)

##   age_group
## 1      age2
## 2      age1
## 3      age2
## 4      age1
## 5      age1
## 6      age2

3.4 ファクターレベルの変更

ファクターレベルは内部の処理順序のことであり、指定しない場合はアルファベット順となる。

levels関数を使えば現在のファクターレベルを確認できる。ファクターレベルは、解析と作図のどちらにおいても重要である。

また、同時に列名の変更もできるため、作図の前には列ラベルをグラフにおける表示名に変更しておくとよい。

df_factor = df %>%
  mutate(sex = factor(sex, levels=c("m","f"), labels=c("男性","女性"))) %>%
  select(sex)
levels(df_factor$sex)

## [1] "男性" "女性"

3.5 ロング型データへの変換

session1〜4のワイド型データをロング型データに変換する。

df_longer = df %>% 
  pivot_longer(
    cols = c(session1, session2, session3, session4),
    names_to = "session",
    values_to = "score"
  ) %>%
  select(age, sex, session, score)
head(df_longer)

## # A tibble: 6 × 4
##     age sex   session  score
##   <dbl> <fct> <chr>    <dbl>
## 1    46 f     session1     0
## 2    46 f     session2     8
## 3    46 f     session3     8
## 4    46 f     session4     3
## 5    28 m     session1     5
## 6    28 m     session2     8

3.6 条件に当てはまる行の抽出

男女のデータのみを取り出す。

df_filter1 = df %>%
  filter(sex=="m" | sex=="f") %>%
  select(sex, score_mean)
head(df_filter1)

##   sex score_mean
## 1   f       4.75
## 2   m       5.25
## 3   m       4.00
## 4   m       2.25
## 5   f       4.50
## 6   m       6.00

性別が「その他」のデータ以外を取り出す。

df_filter2 = df %>%
  filter(sex!="other") %>%
  select(sex, session1:session4)
head(df_filter2)

##   sex session1 session2 session3 session4
## 1   f        0        8        8        3
## 2   m        5        8        3        5
## 3   m        3        7        4        2
## 4   m        4        0        0        5
## 5   f        2        6        6        4
## 6   m        9        6        9        0

年齢が30〜40歳のデータのみを取り出す。 pull関数では、1つの列をベクトルとして取り出せる。

df_filter3 = df %>%
  filter(age>=30 & age<=40) %>%
  pull(score_mean)
df_filter3

##  [1] 3.00 3.75 4.00 4.75 2.00 3.50 4.75 4.50 4.25 5.25 6.50 4.75 5.25 5.75 4.50
## [16] 4.25 3.75 5.50 5.50 4.00 6.00 5.50 4.00 4.25 2.75 5.00 4.50 3.25 4.25 5.75
## [31] 4.50 4.50 4.75 2.50

3.7 グループ集計

性別ごとに、score_meanの平均値、不偏標準偏差、不偏標準誤差を集計する。

df_groupby = df %>%
  group_by(sex) %>%
  summarize(mean = mean(score_mean),
            sd   = sd(score_mean),
            se   = std.error(score_mean))
head(df_groupby)

## # A tibble: 2 × 4
##   sex    mean    sd    se
##   <fct> <dbl> <dbl> <dbl>
## 1 m      4.58  1.06 0.150
## 2 f      4.48  1.10 0.173

4 統計的仮説検定

4.1 効果量について

基本的に、r族の効果量を算出する方針である。

t値が出力されるもの \[ r = \sqrt{\frac{t^2}{t^2 - df}} \] Z値が出力されるもの \[ r = \frac{Z}{\sqrt{n}} \]

4.2 t検定

4.2.1 t検定の手法の選び方

2つの平均値のt検定の手法選択は、下記の論理で理解している。

対応のないt検定
1. Studentのt検定は等分散性を仮定するが、等分散性を満たすデータは非常に限られているため、等分散性を仮定しないwelchのt検定を使うのが一般的であり、rのt.testも標準でwelchのt検定を行うようになっている。
2. welchのt検定は正規分布を仮定するパラメトリック検定のため、ノンパラメトリック検定としては「順位平均の検定」であるWMW検定（Wilcoxonの順位和検定、Mann-WhitneyのU検定）を使うことが知られている。
3. WMW検定も無条件で使えるわけではなく等分布性（≒等分散性）を仮定するため、これらを仮定しない「確率的等質性（stochastic equality）、相対効果の検定」であるBrunner-Munzel検定の汎用性が高い。あるいは、正規性が多少弱くてもサンプルサイズが大きければwelchのt検定が有効に使える。
4. サンプルサイズが非常に小さい場合（目安としては10以下あるいは15以下程度）は、並べ替えBrunner-Munzel検定を行う。
対応のあるt検定
1. 対応のあるt検定は正規分布を仮定するパラメトリック検定のため、ノンパラメトリック検定としては「順位平均の検定」であるウィルコクソンの符号順位検定を使うことが知られている。
2. ウィルコクソンの符号順位検定も無条件で使えるわけではなく等分布性（≒等分散性）を仮定するため、これらを仮定しない「確率的等質性（stochastic equality）、相対効果の検定」である対応のあるBrunner-Munzel検定の汎用性が高い。
正規性や等分散性の検定について
- t検定の前に正規性や等分散性の検定を行うという考え方もあるが、αエラー（帰無仮説を誤って棄却する恐れ）を増大させる一方で、これらの検定を行うメリットが少ないことから、正規性や等分散性の検定は不要である。

参考文献

4.2.2 1標本のt検定

result = t.test(df$age, mu=20)
print(result)

## 
##  One Sample t-test
## 
## data:  df$age
## t = 12.896, df = 89, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
##  34.49350 39.77317
## sample estimates:
## mean of x 
##  37.13333

cohen.d(df$age, NA, mu=20)

## 
## Cohen's d (single sample)
## 
## d estimate: 1.359367 (large)
## Reference mu: 20
## 95 percent confidence interval:
##     lower     upper 
## 0.8946075 1.8241266

sqrt(result$statistic^2 / (result$statistic^2 + result$parameter)) #効果量r

##         t 
## 0.8070954

4.2.3 対応のない2標本のt検定

標準でwelchのt検定が行われる。スチューデントのt検定を使うことはない。

group1 = df %>% filter(sex=="m") %>% pull(score_mean)
group2 = df %>% filter(sex=="f") %>% pull(score_mean)
result = t.test(group1, group2)
print(result)

## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = 0.43152, df = 82.387, p-value = 0.6672
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3564564  0.5539564
## sample estimates:
## mean of x mean of y 
##   4.58000   4.48125

cohen.d(group1, group2)

## 
## Cohen's d
## 
## d estimate: 0.09190185 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.3298857  0.5136894

sqrt(result$statistic^2 / (result$statistic^2 + result$parameter)) #効果量r

##          t 
## 0.04748791

4.2.4 対応のあるt検定

result = t.test(df$session1, df$session2, paired=T)
print(result)

## 
##  Paired t-test
## 
## data:  df$session1 and df$session2
## t = -0.12799, df = 89, p-value = 0.8984
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.7344070  0.6455182
## sample estimates:
## mean difference 
##     -0.04444444

cohen.d(df$session1, df$session2, paired=T)

## 
## Cohen's d
## 
## d estimate: -0.01936216 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.3179143  0.2791900

sqrt(result$statistic^2 / (result$statistic^2 + result$parameter)) #効果量r

##          t 
## 0.01356595

4.2.5 Brunner-Munzel検定

条件ごとのサンプルサイズが15未満の際はperm=TRUEをオプションに追記する（TOSTERパッケージの説明に基づく）。

対応のある場合は、paired=TRUEをオプションに追記する。

論文には、検定統計量をBM（tかも？）、効果量をstochastic superiority estimateとして記載？（文献によって表記はバラバラで統一的見解はない。表記法を提案している論文もあるが、浸透していない様子。）

#対応のない2標本の検定の場合
group1 = df %>% filter(sex=="m") %>% pull(score_mean)
group2 = df %>% filter(sex=="f") %>% pull(score_mean)
brunner_munzel(group1, group2)

## 
##  two-sample Brunner-Munzel test
## 
## data:  group1 and group2
## t = 0.38483, df = 84.388, p-value = 0.7013
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
##  0.40103 0.64647
## sample estimates:
## p(X>Y) + .5*P(X=Y) 
##            0.52375

#対応のある検定の場合
brunner_munzel(df$session1, df$session2, paired=TRUE)

## 
##  exact paired Brunner-Munzel test
## 
## data:  df$session1 and df$session2
## t = -0.15184, df = 89, p-value = 0.8797
## alternative hypothesis: true relative effect is not equal to 0.5
## 95 percent confidence interval:
##  0.4060906 0.5805760
## sample estimates:
## p(X>Y) + .5*P(X=Y) 
##          0.4933333

4.2.6 t検定の繰り返し

このような方法はあまり行わないが、参考のために記載しておく。

検定を繰り返す際は、必要に応じて適宜多重比較補正を行う（水本, 2009, 複数の項目やテストにおける検定の多重性: モンテカルロ・シミュレーションによる検証）。

df_t = df %>% select(session1:bf5_openness)
t = data.frame()
dof = data.frame()
p = data.frame()
p_adj = data.frame()
effect_d = data.frame()
effect_r = data.frame()
lcl = data.frame()
ucl = data.frame()
count_a = 4 #1つの目の変数の種類の数
count_b = 5 #2つの目の変数の種類の数
for (i in 1:count_a){
  for (j in 1:count_b){
    output = t.test(df_t[,(i)], df_t[,(count_a+j)], paired=TRUE)
    output2 = cohen.d(df_t[,(i)], df_t[,(count_a+j)], paired=TRUE)
    t[j,i] = output$statistic
    dof[j,i] = output$parameter
    p[j,i] = output$p.value
    p_adj[j,i] = min(1, output$p.value*count_a*count_b) #Bonferroni法による調整済みp値
    effect_d[j,i] = output2$estimate
    effect_r[j,i] = sqrt(output$statistic^2 / (output$statistic^2 + output$parameter))
    lcl[j,i] = output$conf.int[1]
    ucl[j,i] = output$conf.int[2]
  }
}
result = cbind(t,dof,p,p_adj,effect_d,effect_r,lcl,ucl) #左から、t値、自由度、p値、調整済みp値、効果量r、下限信頼限界、上限信頼限界
#write.csv(result,"result.csv")
#write.xlsx(result,"result.xlsx")
result

##           V1         V2         V3         V4 V1 V2 V3 V4           V1
## 1  -8.608565  -8.017310  -9.319801  -8.636150 89 89 89 89 2.424112e-13
## 2  -8.291858  -9.032349 -10.824630  -8.903196 89 89 89 89 1.091453e-12
## 3 -10.223270  -8.873498 -10.058581  -9.647493 89 89 89 89 1.108781e-16
## 4  -9.872009 -10.127302 -10.032836 -10.024584 89 89 89 89 5.885732e-16
## 5  -8.825142  -7.226974  -8.303780  -6.895399 89 89 89 89 8.644009e-14
##             V2           V3           V4           V1           V2           V3
## 1 4.003611e-12 8.175274e-15 2.125926e-13 4.848223e-12 8.007221e-11 1.635055e-13
## 2 3.219639e-14 6.452832e-18 5.959181e-14 2.182905e-11 6.439278e-13 1.290566e-16
## 3 6.865164e-14 2.423732e-16 1.714323e-15 2.217562e-15 1.373033e-12 4.847463e-15
## 4 1.748639e-16 2.739180e-16 2.848761e-16 1.177146e-14 3.497279e-15 5.478360e-15
## 5 1.623531e-10 1.031448e-12 7.489428e-10 1.728802e-12 3.247063e-09 2.062895e-11
##             V4        V1        V2        V3        V4        V1        V2
## 1 4.251851e-12 -1.278953 -1.235309 -1.393154 -1.252864 0.6740528 0.6475746
## 2 1.191836e-12 -1.366906 -1.320644 -1.462164 -1.338455 0.6601770 0.6915641
## 3 3.428645e-14 -1.355405 -1.322619 -1.469246 -1.336372 0.7349068 0.6851377
## 4 5.697521e-15 -1.522678 -1.473676 -1.623290 -1.490989 0.7229641 0.7317097
## 5 1.497886e-08 -1.115755 -1.088929 -1.242396 -1.108078 0.6831491 0.6081271
##          V3        V4        V1        V2        V3        V4        V1
## 1 0.7027886 0.6752288 -3.842875 -3.840562 -4.354039 -3.854241 -2.401569
## 2 0.7538719 0.6863515 -4.600405 -4.473277 -4.944654 -4.552933 -2.821817
## 3 0.7293902 0.7149742 -4.312961 -4.365325 -4.883305 -4.368248 -2.909261
## 4 0.7285148 0.7282334 -4.524799 -4.452523 -5.071735 -4.526573 -3.008535
## 5 0.6607120 0.5900911 -3.648224 -3.739822 -4.268652 -3.850167 -2.307332
##          V2        V3        V4
## 1 -2.314993 -2.823739 -2.412426
## 2 -2.860056 -3.410901 -2.891512
## 3 -2.768009 -3.272251 -2.876196
## 4 -2.991921 -3.394932 -3.028982
## 5 -2.126845 -2.620237 -2.127610

4.2.7 ウィルコクソンの検定

ノンパラメトリック検定。

参考のために記載したが、前述したBrunner-Munzel検定のほうが良いだろう。

#対応のないデータの検定
group1 = df %>% filter(sex=="m") %>% pull(score_mean)
group2 = df %>% filter(sex=="f") %>% pull(score_mean)
wilcox.test(group1, group2)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  group1 and group2
## W = 1047.5, p-value = 0.7019
## alternative hypothesis: true location shift is not equal to 0

#対応のあるデータの検定
wilcox.test(df$session1, df$session2, paired=TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  df$session1 and df$session2
## V = 1439, p-value = 0.9028
## alternative hypothesis: true location shift is not equal to 0

4.3 分散分析

分散分析には井関-ANOVA君のanovakunパッケージを使用する。

4.3.1 参加者内要因分散分析

反復測定データを含む場合は、以下の通りに行う方針である。

多重比較時はHolm法による補正を用いる。Shaffer法よりも仮定条件が緩く、一般的に使用できる。
球面性検定の結果によらず、すべての場合においてChi-Mullerのεによる調整を行う（Howell, 2012）。調整時は、εが1を超えた場合は1に修正され、εが下限値を下回った場合は下限値に修正される。Greenhouse-Geisser法やHuynh-Feldt法の仮定条件が、Chi-Mullerでは緩和されている。
効果量には一般化イータ二乗（generalized eta squared）を使用する。偏イータ二乗（partial eta squared）と比較して一般化されているため、実験デザインによらず他の研究と相互に比較できるためである。
信頼区間には、Baguley（2012）の「差分調整型の正規化に基づく方法」を用いる。このオプションでは、参加者間要因に対しては伝統的な算出法が適用される。

df_anova = df %>% select(session1:session4)
anovakun(df_anova, "sA", 4, holm=T, cm=T, geta=T, cind=T)

## 
## [ sA-Type Design ]
## 
## This output was generated by anovakun 4.8.9 under R version 4.4.0.
## It was executed on Sat Jun 15 02:30:39 2024.
## 
##  
## << DESCRIPTIVE STATISTICS >>
## 
## == Cousineau-Morey-Baguley's Difference-Adjusted Normalized Confidence Intervals ==
## == 95% confidence intervals are calculated. ==
## 
## ------------------------------------------
##   A   n    Mean    S.D.  CIND-L  CIND-U 
## ------------------------------------------
##  a1  90  4.6444  2.2401  4.2919  4.9969 
##  a2  90  4.6889  2.3495  4.3406  5.0371 
##  a3  90  4.1778  2.5244  3.8113  4.5442 
##  a4  90  4.6333  2.3676  4.2527  5.0139 
## ------------------------------------------
## 
## 
## << SPHERICITY INDICES >>
## 
## == Mendoza's Multisample Sphericity Test and Epsilons ==
## 
## -------------------------------------------------------------------------
##  Effect  Lambda  approx.Chi  df      p         LB     GG     HF     CM 
## -------------------------------------------------------------------------
##       A  0.6107      0.9722   5 0.9648 ns  0.3333 0.9926 1.0307 1.0292 
## -------------------------------------------------------------------------
##                               LB = lower.bound, GG = Greenhouse-Geisser
##                              HF = Huynh-Feldt-Lecoutre, CM = Chi-Muller
## 
## 
## << ANOVA TABLE >>
## 
## == Adjusted by Chi-Muller's Epsilon ==
## 
## --------------------------------------------------------------
##  Source        SS  df     MS  F-ratio  p-value      G.eta^2 
## --------------------------------------------------------------
##       s  407.2806  89 4.5762                                
## --------------------------------------------------------------
##       A   15.5639   3 5.1880   0.8675   0.4584 ns    0.0077 
##   s x A 1596.6861 267 5.9801                                
## --------------------------------------------------------------
##   Total 2019.5306 359 5.6254                                
##                   +p < .10, *p < .05, **p < .01, ***p < .001
## 
## 
## output is over --------------------///

4.3.2 参加者間要因分散分析

反復測定データを含まない場合は、以下の通りに行う方針である。

多重比較時はHolm法による補正を用いる。Shaffer法よりも仮定条件が緩く、一般的に使用できる。
効果量には一般化イータ二乗（generalized eta squared）を使用する。偏イータ二乗（partial eta squared）と比較して一般化されているため、実験デザインによらず他の研究と相互に比較できるためである。
信頼区間には、Baguley（2012）の「差分調整型の正規化に基づく方法」を用いる。このオプションでは、参加者間要因に対しては伝統的な算出法が適用される。

df_anova = df %>% 
  mutate(age_group = if_else(age > mean(df$age), "old", "young")) %>% 
  select(sex, age_group, score_mean)
anovakun(df_anova, "ABs", 2, 2, holm=T, geta=T, cind=T)

## 
## [ ABs-Type Design ]
## 
## This output was generated by anovakun 4.8.9 under R version 4.4.0.
## It was executed on Sat Jun 15 02:30:39 2024.
## 
##  
## << DESCRIPTIVE STATISTICS >>
## 
## == Difference-Adjusted Confidence Intervals for Independent Means ==
## == 95% confidence intervals are calculated. ==
## 
## ----------------------------------------------
##   A   B   n    Mean    S.D.  CIND-L  CIND-U 
## ----------------------------------------------
##  a1  b1  23  4.5978  0.9675  4.3020  4.8937 
##  a1  b2  17  4.3235  1.2617  3.8648  4.7822 
##  a2  b1  17  4.7500  1.1180  4.3435  5.1565 
##  a2  b2  33  4.4924  1.0317  4.2337  4.7511 
## ----------------------------------------------
## 
## 
## << ANOVA TABLE >>
## 
## == This data is UNBALANCED!! ==
## == Type III SS is applied. ==
## 
## -------------------------------------------------------------
##  Source       SS  df     MS  F-ratio  p-value      G.eta^2 
## -------------------------------------------------------------
##       A   0.5385   1 0.5385   0.4625   0.4983 ns    0.0053 
##       B   1.4778   1 1.4778   1.2693   0.2630 ns    0.0145 
##   A x B   0.0015   1 0.0015   0.0013   0.9718 ns    0.0000 
##   Error 100.1236  86 1.1642                                
## -------------------------------------------------------------
##   Total 101.8201  89 1.1440                                
##                  +p < .10, *p < .05, **p < .01, ***p < .001
## 
## 
## output is over --------------------///

4.4 相関の検定

4.4.1 無相関検定

cor.test(df$age, df$score_mean, method="pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  df$age and df$score_mean
## t = 0.32518, df = 88, p-value = 0.7458
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1736933  0.2400132
## sample estimates:
##        cor 
## 0.03464402

cor.test(df$age, df$score_mean, method="spearman")

## Warning in cor.test.default(df$age, df$score_mean, method = "spearman"): Cannot
## compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  df$age and df$score_mean
## S = 117928, p-value = 0.7842
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.02927557

cor.testは、spearmanをオプションとすると信頼区間を出力しない。そこで、methodはpearsonとし、rank関数を用いた順位化データを入力することで、spearmanの相関におけるrho値、p値、信頼区間を出力することができる。なお、rank関数は順位タイデータをaverage処理する。

cor.test(rank(df$age), rank(df$score_mean))

## 
##  Pearson's product-moment correlation
## 
## data:  rank(df$age) and rank(df$score_mean)
## t = 0.27475, df = 88, p-value = 0.7842
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1789002  0.2349423
## sample estimates:
##        cor 
## 0.02927557

4.4.2 無相関検定の繰り返し

methodをspearmanにした場合は、信頼区間を出力しないためこのままだとエラーがでる。

df_cor = df %>% select(age, score_mean, bf1_extraversion:bf5_openness)
r = data.frame()
dof = data.frame()
p = data.frame()
p_adj = data.frame()
lcl = data.frame()
ucl = data.frame()
count_a = 2 #1つの目の変数の種類の数
count_b = 5 #2つの目の変数の種類の数
for (i in 1:count_a){
  for (j in 1:count_b){
    output = cor.test(df_t[,(i)], df_t[,(count_a+j)], paired=TRUE)
    r[j,i] = output$estimate
    dof[j,i] = output$parameter
    p[j,i] = output$p.value
    p_adj[j,i] = min(1, output$p.value*count_a*count_b) #Bonferroni法による調整済みp値
    lcl[j,i] = output$conf.int[1]
    ucl[j,i] = output$conf.int[2]
  }
}
result = cbind(r,dof,p,p_adj,lcl,ucl) #左から、相関係数、自由度、p値、調整済みp値、下限信頼限界、上限信頼限界
#write.csv(result,"result.csv")
#write.xlsx(result,"result.xlsx")
result

##             V1            V2 V1 V2         V1        V2        V1 V2
## 1 -0.024461100  0.0340577883 88 88 0.81898645 0.7499718 1.0000000  1
## 2 -0.168912972 -0.0954728205 88 88 0.11149530 0.3707225 1.0000000  1
## 3  0.006746754 -0.0683327842 88 88 0.94967784 0.5222060 1.0000000  1
## 4 -0.222886257  0.0379837361 88 88 0.03472356 0.7222625 0.3472356  1
## 5  0.209009609  0.0002476431 88 88 0.04803984 0.9981517 0.4803984  1
##             V1         V2          V1        V2
## 1 -0.230385159 -0.1742624  0.18355987 0.2394600
## 2 -0.363295862 -0.2966978  0.03956204 0.1138697
## 3 -0.200624672 -0.2715807  0.21353951 0.1407501
## 4 -0.411006243 -0.1704481 -0.01655949 0.2431621
## 5  0.002005291 -0.2068541  0.39883743 0.2073281

4.4.3 偏相関の検定

剰余変数を統制した偏相関係数を出力することができる。

下記事例では、性別をダミー変数に変換し、性別の影響を統制したscore_meanとbf1_extraversionの相関を調べる。

統制する変数が量的変数か質的変数かを問わず、相関の対象となる2変数が正規分布に従う量的変数であれば、pearsonの方法を使うことができる。

df_pcor = df %>%
  mutate(sex_dummy = if_else(sex=="m", 0, 1)) %>%
  select(score_mean, bf1_extraversion, sex_dummy)
pcor.test(df_pcor$score_mean, df_pcor$bf1_extraversion, df_pcor$sex_dummy, method="pearson")

##       estimate   p.value   statistic  n gp  Method
## 1 -0.006658322 0.9506208 -0.06210607 90  1 pearson

4.4.4 相関行列の検定

corr.testは、出力結果のうち対角右上側に多重比較補正後のp値を出力する。

補正法には、bonferroni、holm、BH（Benjamini-Hochberg法）などが使える。たとえばBonferroni法を用いると、出力された結果の個数分がp値に乗算される。

df_corr = df %>% select(bf1_extraversion:bf5_openness)
result = corr.test(df_corr, method="pearson", adjust="bonferroni")
result_rp = cbind(result$r, result$p)
result_rp

##                       bf1_extraversion bf2_agreeableness bf3_conscientiousness
## bf1_extraversion           1.000000000       -0.04973575           -0.05073949
## bf2_agreeableness         -0.049735749        1.00000000            0.05812000
## bf3_conscientiousness     -0.050739488        0.05812000            1.00000000
## bf4_neuroticism           -0.008535193       -0.05871006           -0.05348063
## bf5_openness               0.068527681       -0.04041872            0.02454550
##                       bf4_neuroticism bf5_openness bf1_extraversion
## bf1_extraversion         -0.008535193   0.06852768        0.0000000
## bf2_agreeableness        -0.058710060  -0.04041872        0.6415538
## bf3_conscientiousness    -0.053480632   0.02454550        0.6348336
## bf4_neuroticism           1.000000000  -0.08636090        0.9363633
## bf5_openness             -0.086360900   1.00000000        0.5210172
##                       bf2_agreeableness bf3_conscientiousness bf4_neuroticism
## bf1_extraversion              1.0000000             1.0000000       1.0000000
## bf2_agreeableness             0.0000000             1.0000000       1.0000000
## bf3_conscientiousness         0.5863536             0.0000000       1.0000000
## bf4_neuroticism               0.5825516             0.6166328       0.0000000
## bf5_openness                  0.7052521             0.8183726       0.4183123
##                       bf5_openness
## bf1_extraversion                 1
## bf2_agreeableness                1
## bf3_conscientiousness            1
## bf4_neuroticism                  1
## bf5_openness                     0

psych::pairs.panels(df_corr, method="pearson")

4.5 重回帰分析（執筆中）

情報が大きく不足していますので利用はご注意ください。

波線の左が目的変数、波線の右が説明変数。

family=~~~オプションによって、分布とリンク関数を指定できる。

family=gaussian(link="identity")：デフォルト。正規分布の連続変数。
family=binomial(link="logit")：ロジスティック回帰分析。二項分布。
family=poisson(link = "log")：対数線形モデル

result_glm = glm(score_mean ~ bf1_extraversion+bf2_agreeableness+bf3_conscientiousness+bf4_neuroticism+bf5_openness, data=df)
summary(result_glm)

## 
## Call:
## glm(formula = score_mean ~ bf1_extraversion + bf2_agreeableness + 
##     bf3_conscientiousness + bf4_neuroticism + bf5_openness, data = df)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.2549226  0.7559221   5.629 2.34e-07 ***
## bf1_extraversion      -0.0004703  0.0433926  -0.011   0.9914    
## bf2_agreeableness     -0.0055966  0.0362990  -0.154   0.8778    
## bf3_conscientiousness  0.0724519  0.0379339   1.910   0.0596 .  
## bf4_neuroticism       -0.0403217  0.0424712  -0.949   0.3451    
## bf5_openness           0.0095278  0.0380880   0.250   0.8031    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.145381)
## 
##     Null deviance: 101.820  on 89  degrees of freedom
## Residual deviance:  96.212  on 84  degrees of freedom
## AIC: 275.42
## 
## Number of Fisher Scoring iterations: 2

#信頼区間
confint(result_glm)

## Waiting for profiling to be done...

##                              2.5 %     97.5 %
## (Intercept)            2.773342480 5.73650281
## bf1_extraversion      -0.085518231 0.08457770
## bf2_agreeableness     -0.076741452 0.06554818
## bf3_conscientiousness -0.001897188 0.14680102
## bf4_neuroticism       -0.123563634 0.04292029
## bf5_openness          -0.065123223 0.08417884

#VIFによる多重共線性の確認。VIFがないことの目安は5あるいは10未満。
vif(result_glm)

##      bf1_extraversion     bf2_agreeableness bf3_conscientiousness 
##              1.009533              1.010747              1.008990 
##       bf4_neuroticism          bf5_openness 
##              1.013869              1.014711

#残差の正規性の確認
plot(result_glm, 2)

AIC（赤池情報量規準）に基づく予測式の最適化をする場合は、step関数を使う。

ただし、AICなどのモデル選択は、検定ではなく予測式の最適化のために使用するものなので、検定を目的する際はフルモデルをベースとした手動選択で良いだろう。

step(result_glm)

## Start:  AIC=275.42
## score_mean ~ bf1_extraversion + bf2_agreeableness + bf3_conscientiousness + 
##     bf4_neuroticism + bf5_openness
## 
##                         Df Deviance    AIC
## - bf1_extraversion       1   96.212 273.42
## - bf2_agreeableness      1   96.239 273.44
## - bf5_openness           1   96.284 273.48
## - bf4_neuroticism        1   97.244 274.38
## <none>                       96.212 275.42
## - bf3_conscientiousness  1  100.390 277.24
## 
## Step:  AIC=273.42
## score_mean ~ bf2_agreeableness + bf3_conscientiousness + bf4_neuroticism + 
##     bf5_openness
## 
##                         Df Deviance    AIC
## - bf2_agreeableness      1   96.239 271.44
## - bf5_openness           1   96.284 271.48
## - bf4_neuroticism        1   97.244 272.38
## <none>                       96.212 273.42
## - bf3_conscientiousness  1  100.403 275.25
## 
## Step:  AIC=271.44
## score_mean ~ bf3_conscientiousness + bf4_neuroticism + bf5_openness
## 
##                         Df Deviance    AIC
## - bf5_openness           1   96.315 269.51
## - bf4_neuroticism        1   97.255 270.39
## <none>                       96.239 271.44
## - bf3_conscientiousness  1  100.406 273.26
## 
## Step:  AIC=269.51
## score_mean ~ bf3_conscientiousness + bf4_neuroticism
## 
##                         Df Deviance    AIC
## - bf4_neuroticism        1   97.387 268.51
## <none>                       96.315 269.51
## - bf3_conscientiousness  1  100.506 271.35
## 
## Step:  AIC=268.51
## score_mean ~ bf3_conscientiousness
## 
##                         Df Deviance    AIC
## <none>                       97.387 268.51
## - bf3_conscientiousness  1  101.820 270.51

## 
## Call:  glm(formula = score_mean ~ bf3_conscientiousness, data = df)
## 
## Coefficients:
##           (Intercept)  bf3_conscientiousness  
##                3.9227                 0.0743  
## 
## Degrees of Freedom: 89 Total (i.e. Null);  88 Residual
## Null Deviance:       101.8 
## Residual Deviance: 97.39     AIC: 268.5

4.6 カイ二乗検定

更新予定

5 ggplot2による作図

5.1 Myテーマ作成

ggplot2のテーマは、事前に自作しておいてggplot関数内で呼び出すことができる。また、使用する色も事前に設定しておくとよい。

#テーマ作成
theme_kairi = theme_test(base_family="HiraKakuPro-W3")+
  theme(text               = element_text(colour="black", size=9), #全体（テキスト）
        axis.text          = element_text(colour="black", size=9), #目盛り（テキスト）
        legend.text        = element_text(colour="black", size=9, margin=margin(l=2)), #凡例（テキスト）
        strip.text         = element_text(colour="black", size=9), #ファセット（テキスト）
        plot.tag           = element_text(colour="black", size=12), #タグ
        rect               = element_rect(colour="black", linewidth=0.6), #全体（塗りと枠線）
        panel.border       = element_rect(colour="black", linewidth=0.6), #グラフ領域（塗りと枠線）
        strip.background   = element_rect(colour="black", linewidth=0.6), #ファセット（塗りと枠線）
        line               = element_line(colour="black", linewidth=0.3), #目盛り（線）
        axis.line          = element_blank(), #x軸・y軸（線）
        legend.title       = element_blank(), #凡例（見出し）
        legend.margin      = margin(1.5,1.5,1.5,1.5,"mm"), #凡例（box内余白）
        legend.box.spacing = unit(0,"mm"), #凡例（boxとplot領域の間の余白）
        legend.background  = element_blank() #凡例（塗りと枠線）
        )

cols = c("#777777","#dddddd","#0068b7","#f39800")

5.2 作図のいろいろ

5.2.1 標準偏差と標準誤差

標準偏差と標準誤差は、標準ではなく不偏を使う方針である（奥村）。

sd(x) #不偏標準偏差
std_error(x) #不偏標準誤差

5.2.2 図の保存

図を生成した後にggsave関数によって図を画像として保存できる。

ggsave("fig.png", width=86, height=50, units="mm", dpi=300)

5.2.3 軸目盛りのスケール

設定ごとの意味

expand：プロットの軸の周りの余白のこと。軸の両端に余白を追加しない場合はexpand=c(0,0)と追記する。
limits：プロットに表示をする目盛りの範囲の設定。
breaks：目盛りの付け方に関する設定。たとえ0から10までに目盛りをつけたとしても、limitsで目盛りの表示範囲を5〜10にしていたら、0〜4の目盛りは表示されない。基本的には、limitsとbreaksは同じ範囲でよい。
oob=oob_keep：描画範囲外のデータもstat_summaryや回帰直線の描画時に計算に含まれるようになるため、忘れないように必ず毎回つけておく。

scale_y_continuous(expand=c(0,0), limits=c(0,15), breaks=seq(0,15,1), oob=oob_keep)+

5.2.4 stat_summaryは要注意

stat_summaryは、plot表示範囲のデータのみに基づいて計算を行うため、表示範囲を手動で狭めると値が変わる。

plot表示範囲を狭める場合は、前述したようにoob=oob_keepのオプションをつけなければならない。

5.3 凡例の位置

凡例の位置は、ggplot2のバージョン3.5.0から仕組みが変わり、下記の通りとなった（詳細はこちら）。

guides関数によって、まずは凡例をtop、bottom、right、left、insideのどの位置に配置するかを設定する。その後、各位置をtheme関数内でそれぞれ指定して、位置ごとに詳細な配置方法を決める。
top、bottom、right、leftの位置に配置した場合：legend.justification.{位置}の指定をrightやleftなどと概念的に行い、plot領域との距離はlegend.box.spacing=unit(~,"mm")などとスペースをとることで設定する。
insideの位置に配置した場合：legend.justification.insideの指定はc(~,~)などと数値で行い、plot領域との距離はlegend.margin=margin(~,~,~,~,"mm")などとマージンをとることで設定する。

以下に、自分がよく使うものをまとめておく。

#初期設定の場合
df_test = expand.grid(X1=3:7, X2=3:7)
df_test$value = df_test$X1 * df_test$X2
p = ggplot(df_test, aes(X1, X2))+
  geom_tile(aes(fill=value))+
  scale_x_continuous(limits=c(0,10), breaks=seq(0,10,1))+
  scale_y_continuous(limits=c(0,10), breaks=seq(0,10,1))+
  scale_fill_continuous(guide = guide_legend())+
  theme_kairi
p

#プロット外側の上（右端揃え）
p + guides(fill = guide_legend(position="top"))+
  theme(legend.justification.top="right")

#プロット内側の右上
p + guides(fill = guide_legend(position="inside"))+
  theme(legend.justification.inside=c(1,1))

#プロット内側の右下
p + guides(fill = guide_legend(position="inside"))+
  theme(legend.justification.inside=c(1,0))

#プロット内側の左上（水平並び）
p + guides(fill = guide_legend(position="inside"))+
  theme(legend.direction="horizontal",
        legend.justification.inside=c(0,1))

5.4 patchwork

patchworkパッケージでは、ggplot2の複数の図を並べることができる。縦線|は横方向に並べ、スラッシュ/は縦方向に並べる。丸括弧で括ればグルーピングできる。plot_annotation(tag_levels="〜")オプションを付け加えることで、図にタグを自動追加できる。

df_plot = df
p = ggplot(df_plot, aes(x=age, y=score_mean))+
  geom_point()+
  theme_kairi

(p | p) / p + plot_annotation(tag_levels='A')

5.5 ヒストグラム＋縦横ライン

labs関数内でtagを設定すると、Aなどのタグを差し込める。

df_plot = df
ggplot(df_plot, aes(x=age))+
  geom_histogram(binwidth = 1, fill="gray", colour="black", linewidth=0.3)+
  geom_vline(xintercept=mean(df_plot$age))+
  geom_hline(yintercept=5)+
  scale_y_continuous(expand=c(0,0), limits=c(0,10), breaks=seq(0,10,1), oob=oob_keep)+
  labs(x="年齢", y="人数", tag="A")+
  theme_kairi

任意の範囲でビンを設定できる。また、積み上げにすることもできる。

ggplot(df_plot, aes(x=age, fill=sex))+
  geom_histogram(breaks=c(18,20,30,40,50,60,70), colour="black", linewidth=0.3, position="stack")+
  labs(x="年齢", y="人数", tag="A")+
  theme_kairi+
  scale_fill_viridis(discrete=TRUE)+
  scale_colour_viridis(discrete=TRUE)

5.6 棒グラフ

棒グラフよりも、後述する「ひげ付きドットプロット」や「箱ひげ図」を使ったほうが良い。

.groups = "drop"は、summarise後にグループ化を解除するという意味。

df_plot = df %>%
  pivot_longer(
    cols = c(bf1_extraversion, bf2_agreeableness, bf3_conscientiousness, bf4_neuroticism, bf5_openness),
    names_to = "bigfive",
    values_to = "score"
  ) %>%
  group_by(sex, bigfive) %>%
  summarize(mean = mean(score),
            se = std.error(score),
            .groups = "drop")
head(df_plot)

## # A tibble: 6 × 4
##   sex   bigfive                mean    se
##   <fct> <chr>                 <dbl> <dbl>
## 1 m     bf1_extraversion       7.7  0.385
## 2 m     bf2_agreeableness      8.66 0.450
## 3 m     bf3_conscientiousness  7.56 0.407
## 4 m     bf4_neuroticism        8.58 0.376
## 5 m     bf5_openness           7.66 0.421
## 6 f     bf1_extraversion       7.85 0.401

ggplot(df_plot, aes(x=bigfive, y=mean, fill=sex))+
  geom_col(position="dodge", colour="black", linewidth=0.3, width=0.6)+
  geom_errorbar(aes(ymin=mean-se,ymax=mean+se), position=position_dodge(0.6), width=0.1, linewidth=0.3)+
  scale_x_discrete(labels=c("外向性","協調性","勤勉性","神経症傾向","開放性"),)+
  scale_fill_manual(labels=c("男性","女性"), values=c("#777777","#dddddd"))+
  scale_y_continuous(expand=c(0,0), limits=c(0,15), breaks=seq(0,15,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  guides(fill=guide_legend(position="inside"))+
  theme(legend.direction="horizontal",
        legend.justification.inside=c(0,1))

5.7 ひげ付きドットプロット

5.7.1 集計あり

df_plot = df %>%
  pivot_longer(
    cols = c(bf1_extraversion, bf2_agreeableness, bf3_conscientiousness, bf4_neuroticism, bf5_openness),
    names_to = "bigfive",
    values_to = "score"
  ) %>%
  group_by(sex, bigfive) %>%
  summarize(mean = mean(score),
            se = std.error(score),
            .groups = "drop") %>%
  mutate(bigfive = factor(bigfive, 
                          levels=c("bf1_extraversion","bf2_agreeableness","bf3_conscientiousness","bf4_neuroticism", "bf5_openness"), 
                          labels=c("外向性","協調性","勤勉性","神経症傾向","開放性")))
head(df_plot)

## # A tibble: 6 × 4
##   sex   bigfive     mean    se
##   <fct> <fct>      <dbl> <dbl>
## 1 m     外向性      7.7  0.385
## 2 m     協調性      8.66 0.450
## 3 m     勤勉性      7.56 0.407
## 4 m     神経症傾向  8.58 0.376
## 5 m     開放性      7.66 0.421
## 6 f     外向性      7.85 0.401

ggplot(df_plot, aes(x=bigfive, y=mean, shape=sex))+
  geom_errorbar(aes(ymin=mean-se,ymax=mean+se), position=position_dodge(0.5), width=0.3, linewidth=0.3)+
  geom_point(size=1.6, position=position_dodge(0.5))+
  scale_shape_manual(values=c(15,5), labels=c("男性","女性"))+
  scale_y_continuous(limits=c(6.5,10), breaks=seq(6.5,10,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  guides(shape = guide_legend(position="top"))+
  theme(legend.justification.top="right")

5.7.2 集計あり（軸反転）

軸反転をする際は、事前にファクターレベルをrev関数で反転させておき、さらにposition=position_dodge(~~~)の値を負の数にすることで実現できる。

df_plot = df_plot %>%
  mutate(bigfive = factor(bigfive, levels=rev(levels(df_plot$bigfive))))
ggplot(df_plot, aes(x=bigfive, y=mean, shape=sex))+
  coord_flip()+
  geom_errorbar(aes(ymin=mean-se,ymax=mean+se), position=position_dodge(-0.5), width=0.3, linewidth=0.3)+
  geom_point(size=1.6, position=position_dodge(-0.5))+
  scale_shape_manual(values=c(15,5), labels=c("男性","女性"))+
  scale_y_continuous(limits=c(6.5,10), breaks=seq(6.5,10,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  guides(shape = guide_legend(position="top"))+
  theme(legend.justification.top="right")

## Warning: `position_dodge()` requires non-overlapping x intervals.

5.7.3 ファセット

ファセット機能を使うときに、ファセットラベル名を任意の名称にしたい場合は、ggplot内ではなく事前にデータフレームを作る際に、labelsを設定しなければならない。

df_plot = df %>%
  pivot_longer(
    cols = c(bf1_extraversion, bf2_agreeableness, bf3_conscientiousness, bf4_neuroticism, bf5_openness),
    names_to = "bigfive",
    values_to = "score"
  ) %>%
  group_by(sex, bigfive) %>%
  summarize(mean = mean(score),
            se = std.error(score),
            .groups = "drop") %>%
  mutate(bigfive = factor(bigfive, 
                          levels=c("bf1_extraversion","bf2_agreeableness","bf3_conscientiousness","bf4_neuroticism", "bf5_openness"), 
                          labels=c("外向性","協調性","勤勉性","神経症傾向","開放性")))
head(df_plot)

## # A tibble: 6 × 4
##   sex   bigfive     mean    se
##   <fct> <fct>      <dbl> <dbl>
## 1 m     外向性      7.7  0.385
## 2 m     協調性      8.66 0.450
## 3 m     勤勉性      7.56 0.407
## 4 m     神経症傾向  8.58 0.376
## 5 m     開放性      7.66 0.421
## 6 f     外向性      7.85 0.401

ggplot(df_plot, aes(x=sex, y=mean))+
  geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=0.3, linewidth=0.3)+
  geom_point(size=1.6)+
  scale_x_discrete(labels=c("男性","女性"))+
  scale_y_continuous(limits=c(6.5,10), breaks=seq(6.5,10,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  facet_grid(. ~ bigfive)

5.7.4 ファセット（軸反転）

df_plot = df_plot %>%
  mutate(sex = factor(sex, levels=rev(levels(df_plot$sex))))
ggplot(df_plot, aes(x=sex, y=mean)) +
  coord_flip()+
  geom_errorbar(aes(ymin=mean-se,ymax=mean+se), width=0.3, linewidth=0.3)+
  geom_point(size=1.6)+
  scale_x_discrete(labels=rev(c("男性","女性")))+
  scale_y_continuous(limits=c(6.5,10), breaks=seq(6.5,10,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  facet_grid(bigfive ~ .)+
  theme(strip.text.y = element_text(angle=0))

5.7.5 集計なし

集計せずに作図する場合はstat_summaryを使用する。

df_plot = df %>%
  pivot_longer(
    cols = c(bf1_extraversion, bf2_agreeableness, bf3_conscientiousness, bf4_neuroticism, bf5_openness),
    names_to = "bigfive",
    values_to = "score") %>%
  mutate(bigfive = factor(bigfive, 
                          levels=c("bf1_extraversion","bf2_agreeableness","bf3_conscientiousness","bf4_neuroticism", "bf5_openness"), 
                          labels=c("外向性","協調性","勤勉性","神経症傾向","開放性")))

ggplot(df_plot, aes(x=sex, y=score)) +
  stat_summary(fun.data="mean_se", geom="errorbar", width=0.2, linewidth=0.3, color="black")+
  stat_summary(fun="mean", geom="point", shape=18, size=4, fill="black")+
  scale_x_discrete(labels=c("男性","女性"))+
  scale_y_continuous(limits=c(6.5,10), breaks=seq(6.5,10,1), oob=oob_keep)+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  facet_grid(. ~ bigfive)

5.8 箱ひげ図

df_plot = df
ggplot(df_plot, aes(x=sex, y=score_mean))+
  geom_boxplot(fill="gray", width=0.5, outlier.colour=NA, linewidth=0.3)+
  stat_summary(fun.data="mean_se", geom="errorbar", width=0.1, linewidth=0.3, color="red", position=position_nudge(0.05))+
  stat_summary(fun="mean", geom="point", shape=23, size=3, fill="red", position=position_nudge(0.05))+
  labs(x="X軸の名称", y="Y軸の名称")+
  scale_x_discrete(labels=c("男性","女性"))+
  scale_shape_manual(values=c(0,24))+
  theme_kairi

5.9 バイオリンプロット

df_plot = df
ggplot(df_plot, aes(x=sex, y=score_mean))+
  geom_violin(linewidth=0.3, fill="gray")+
  geom_boxplot(fill="black", width=0.1, outlier.colour=NA, linewidth=0.3)+
  stat_summary(fun="median", geom="point", shape=21, size=3, fill="white")+
  stat_summary(fun.data="mean_se", geom="errorbar", width=0.1, linewidth=0.3, color="red", position=position_nudge(0.15))+
  stat_summary(fun="mean", geom="point", shape=23, size=3, fill="red", position=position_nudge(0.15))+
  labs(x="X軸の名称", y="Y軸の名称")+
  scale_x_discrete(labels=c("男性","女性"))+
  scale_shape_manual(values=c(0,24))+
  theme_kairi

5.10 折れ線グラフ

df_plot = df %>%
  pivot_longer(
    cols = c(session1, session2, session3, session4),
    names_to = "session",
    values_to = "score"
  ) %>%
  group_by(sex, session) %>%
  summarize(mean = mean(score),
            se = std.error(score),
            .groups = "drop")
head(df_plot)

## # A tibble: 6 × 4
##   sex   session   mean    se
##   <fct> <chr>    <dbl> <dbl>
## 1 m     session1  4.74 0.295
## 2 m     session2  4.7  0.330
## 3 m     session3  4.3  0.348
## 4 m     session4  4.58 0.378
## 5 f     session1  4.53 0.386
## 6 f     session2  4.68 0.380

pd = position_dodge(0.3)
ggplot(df_plot, aes(x=session, y=mean, colour=sex, group=sex, shape=sex))+
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=0.2, linewidth=0.3, colour="black", position=pd)+
  geom_line(linewidth=1, position=pd)+
  geom_point(size=3, position=pd)+
  labs(x="X軸の名称", y="Y軸の名称")+
  scale_y_continuous(limits=c(3,5.5), breaks=seq(3,5.5,0.5), oob=oob_keep)+
  scale_colour_manual(values=c("m"=cols[3], "f"=cols[4]),
                      labels=c("男性", "女性"))+
  scale_shape_manual(values=c("m"=15, "f"=16),
                     labels=c("男性", "女性"))+
  theme_kairi+
  guides(colour=guide_legend(position="inside"))+
  theme(legend.justification.inside=c(1,0))

5.11 散布図

5.11.1 カラーが使える場合

df_plot = df
ggplot(df_plot, aes(x=bf1_extraversion, y=score_mean, shape=sex, colour=sex))+
  geom_point(size=1.5)+
  geom_smooth(method="lm", linewidth=0.8, se=FALSE)+
  labs(x="X軸の名称", y="Y軸の名称")+
  scale_shape_manual(values=c(0,2),
                     labels=c("男性","女性"))+
  scale_colour_manual(values=c(cols[3],cols[4]),
                      labels=c("男性","女性"))+
  theme_kairi+
  guides(shape=guide_legend(position="inside"))+
  theme(legend.justification.inside=c(1,0))

## `geom_smooth()` using formula = 'y ~ x'

5.11.2 カラーが使えない場合

ggplot(df_plot, aes(x=bf1_extraversion, y=score_mean, shape=sex, fill=sex, linetype=sex, group=sex))+
  geom_point(size=1.5, colour="black")+
  geom_smooth(method="lm", colour="black", linewidth=0.8, se=FALSE)+
  labs(x="X軸の名称", y="Y軸の名称")+
  scale_shape_manual(values=c(22,21),
                     labels=c("男性","女性"))+
  scale_fill_manual(values=c("white","black"),
                      labels=c("男性","女性"))+
  scale_linetype_manual(values=c("solid","dashed"),
                        labels=c("男性","女性"))+
  theme_kairi+
  guides(shape=guide_legend(position="inside"))+
  theme(legend.justification.inside=c(1,0))

## `geom_smooth()` using formula = 'y ~ x'

5.12 順位化散布図（Spearman）

ノンパラメトリック検定などを行った場合は、通常の散布図では相関関係を示すのに不適切であることから、順位データに変換してから散布図を描くとよい。

たとえばspearmanの相関係数では、順位タイデータが存在する際は平均順位となるため、rank関数を用いた順位変換の際にはオプションにties.method="average"を入れる（奥村）。初期状態がaverageのようなので、オプションは不要。

5.12.1 重なりを透過で表現

df_plot = df %>%
  mutate(x_rank = rank(score_mean),
         y_rank = rank(bf1_extraversion)
  ) %>%
  arrange(x_rank, y_rank) %>%
  select(x_rank, y_rank)
head(df_plot)

##   x_rank y_rank
## 1    1.0   34.0
## 2    2.0   63.5
## 3    3.5    1.5
## 4    3.5   49.0
## 5    6.5   14.5
## 6    6.5   23.0

ggplot(df_plot, aes(x=x_rank, y=y_rank))+
  geom_point(shape=21, size=1.5, fill="black", colour="black", alpha=0.2)+
  geom_smooth(method="lm")+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi

## `geom_smooth()` using formula = 'y ~ x'

5.12.2 重なりを点の大きさで表現

df_plot = df %>% 
  mutate(x_rank = rank(score_mean),
         y_rank = rank(bf1_extraversion)
  ) %>%
  group_by(x_rank, y_rank) %>%
  summarise(count=n(), .groups='drop')
head(df_plot)

## # A tibble: 6 × 3
##   x_rank y_rank count
##    <dbl>  <dbl> <int>
## 1    1     34       1
## 2    2     63.5     1
## 3    3.5    1.5     1
## 4    3.5   49       1
## 5    6.5   14.5     1
## 6    6.5   23       1

ggplot(df_plot, aes(x=x_rank, y=y_rank))+
  geom_point(aes(size=count), shape=21, fill="black", colour="black", alpha=0.6)+
  scale_size_continuous(range=c(1,5)) +
  geom_smooth(method="lm")+
  labs(x="X軸の名称", y="Y軸の名称")+
  theme_kairi+
  theme(legend.position="none")

## `geom_smooth()` using formula = 'y ~ x'

心理統計のためのRチートシート

〜Tidyverse・仮説検定・ggplot2の備忘録〜

杉本海里 / Kairi Sugimoto

公開開始: 2024/05/13、更新: 2024-06-15