The across() function in dplyr 1.0.0 turned out to be handy - Blog of a Data Scientist Working at the World Bank

"dplyr 1.0.0 has been released and can now be installed from CRAN," tweeted the guru himself (Hadley Wickham).

dplyr 1.0.0 out now: https://t.co/NDMJmxwllZ. This is the culmination of months of work and we're very excited that it's now available to the world! #rstats
- Hadley Wickham (@hadleywickham), June 1, 2020

The main changes are summarized on the official Tidyverse website.

I updated to dplyr 1.0.0 right away and tried out across(), one of the headline changes, so this post summarizes how to use it.

What is the across function?

After installing dplyr 1.0.0, help(across) shows the following description.

Description: across() makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside in summarise() and mutate(). across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().

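In short: across() lets you use select()-style column selection inside summarise() and mutate(), and it replaces the older _at, _if, and _all variants. Let's look at concrete examples.

The conventional approaches vs. the across function

To compare the conventional approaches with across(), consider this task: compute the mean of the four measurement variables in iris (Sepal.Length and so on) for each Species. I'll show three conventional ways of writing it first.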

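Conventional pattern 1: spelling out every column inside summarise

This is the worst option: every column name has to be typed twice, and the code grows with each column you add.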

library(dplyr)
library(tidyr)

> iris %>%
+   group_by(Species) %>%
+   summarise(Sepal.Length = mean(Sepal.Length),
+             Sepal.Width  = mean(Sepal.Width),
+             Petal.Length = mean(Petal.Length),
+             Petal.Width  = mean(Petal.Width))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03

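Conventional pattern 2: reshape with tidyr's pivot_longer, then pivot_wider (the successors to gather and spread)

Depending on the task this can be the easier route, but for this one it is a poor choice.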

iris %>%
  pivot_longer(names_to = "feature", values_to = "value", cols = -Species) %>%
  group_by(Species, feature) %>%
  summarise(value = mean(value)) %>%
  pivot_wider(names_from = feature, values_from = value)

Conventional pattern 3: summarise_if or summarise_at

Before across(), this was the best way to write the task.

# summarise_if
iris %>%
  group_by(Species) %>%
  summarise_if(is.numeric, mean)

# summarise_at
iris %>%
  group_by(Species) %>%
  summarise_at(vars(Sepal.Length:Petal.Width), mean)

Writing it with across

The signature of across() is as follows.

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

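The first argument takes the columns and the second takes the function to apply. The key point is that the first argument accepts the same selection syntax as select(), so helpers such as starts_with() and where() can be used.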
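
As a small sketch of helper-based selection (this is not from the original post, and the object names sepal_means and petal_stats are made up for illustration), starts_with() can pick columns by prefix. Passing a named list of functions as the second argument should give one output column per column and function pair, named with across()'s default column_function pattern.

```r
library(dplyr)

# starts_with() picks the two Sepal.* columns, just as it would inside select()
sepal_means <- iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), mean), .groups = "drop")

# A named list applies several functions to each selected column;
# the default output names combine the column name and the list name
petal_stats <- iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Petal"), list(mean = mean, sd = sd)),
            .groups = "drop")

names(sepal_means)
names(petal_stats)
```

The .groups = "drop" argument is only there to silence the regrouping message that summarise() prints in dplyr 1.0.0; it is not needed for across() itself.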

Here is the iris task from above done with (1) a colon range, (2) the where() helper, and (3) the everything() helper.

# (1) colon
iris %>%
  group_by(Species) %>%
  summarise(across(Sepal.Length:Petal.Width, mean))

# (2) where()
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

# (3) everything()
iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))
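
One way to confirm that these spellings compute the same thing as the conventional summarise_if is to compare the results directly. The following is a minimal sketch rather than part of the original post: the object names and the same_result() helper are hypothetical, and as.data.frame() is used only so that base R's all.equal() does the comparison. Note that everything() works in (3) because across() does not select grouping columns, so Species is left out.

```r
library(dplyr)

by_species <- iris %>% group_by(Species)

# Conventional scoped variant
old_if <- by_species %>% summarise_if(is.numeric, mean)

# The three across() spellings
new_colon <- by_species %>%
  summarise(across(Sepal.Length:Petal.Width, mean), .groups = "drop")
new_where <- by_species %>%
  summarise(across(where(is.numeric), mean), .groups = "drop")
new_everything <- by_species %>%
  summarise(across(everything(), mean), .groups = "drop")

# Hypothetical helper: TRUE when two results hold the same columns and values
same_result <- function(a, b) {
  isTRUE(all.equal(as.data.frame(a), as.data.frame(b)))
}

# Each comparison should come back TRUE
same_result(old_if, new_colon)
same_result(new_colon, new_where)
same_result(new_where, new_everything)
```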

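Some readers may think, "Isn't that the same as the old summarise_if and summarise_at?" However, there is one point where across() is far more convenient.

What makes across so great!!!

With across(), you can write things that the old _if and _at variants could not express in a single call. For example, with iris you can

compute the mean of the four variables (Sepal and Petal)
output the number of levels of Species
output the number of rows

all in one go.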

> iris %>%
+   summarise(across(where(is.numeric), mean),
+             across(where(is.factor), nlevels),
+             count = n())

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species count
1     5.843333    3.057333        3.758    1.199333       3   150
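
The help text quoted earlier says across() also works inside mutate(). As a minimal sketch of that (not from the original post; the name iris_mm and the unit conversion are just an illustration), the following multiplies every numeric column by 10, converting the iris measurements from centimeters to millimeters, while leaving the Species column untouched.

```r
library(dplyr)

# Apply the same transformation to every numeric column in place;
# ~ .x * 10 is a purrr-style lambda, where .x stands for the current column
iris_mm <- iris %>%
  mutate(across(where(is.numeric), ~ .x * 10))

head(iris_mm, 3)
```

Because no .names argument is given, the transformed columns keep their original names and overwrite the old values.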