{dplyr} Extracting the rows that hold the maximum value, or a value at or above the mean, within each group


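A quick memo, since this tripped me up quite a bit.

A. `group_by` the column you want to group on, then `filter`.

On a grouped tibble, `filter` behaves a little differently than usual. The details are in the `filter` reference. A partial quote follows: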

In the ungrouped version, filter() compares the value of mass in each row to the global average (taken over the whole data set), keeping only the rows with mass greater than this global average. In contrast, the grouped version calculates the average mass separately for each gender group, and keeps rows with mass greater than the relevant within-gender average.

From https://dplyr.tidyverse.org/reference/filter.html#grouped-tibbles

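In other words, when you filter a grouped tibble with a condition such as `== max(<column>)` or `>= mean(<column>)`, each row is compared against the maximum or mean of its own group, not against the maximum or mean of the whole data set. Once you know this, extracting these rows is easy.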
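As a minimal sketch of that difference, the same `filter` condition can be run with and without `group_by` on the built-in `iris` data. The object names `global_max` and `group_max` are only illustrative.

```r
library(dplyr)

# Ungrouped: max() is taken over all 150 rows,
# so only rows equal to the global maximum are kept
global_max <- iris %>%
  filter(Sepal.Length == max(Sepal.Length))

# Grouped: max() is evaluated separately within each Species,
# so every species keeps its own maximum row(s)
group_max <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length == max(Sepal.Length))

nrow(global_max)  # 1 (the single virginica row with 7.9)
nrow(group_max)   # 3 (one row per species)
```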

Examples

Extract the rows holding the maximum value within each group

library(dplyr)

# max() is evaluated within each Species group
iris %>%
  group_by(Species) %>%
  filter(Sepal.Length == max(Sepal.Length))

## # A tibble: 3 x 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5.8         4            1.2         0.2 setosa    
## 2          7           3.2          4.7         1.4 versicolor
## 3          7.9         3.8          6.4         2   virginica
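One point worth checking, which the original memo does not cover: if several rows tie for the group maximum, `filter(x == max(x))` keeps all of them. The sketch below uses a small hypothetical data frame (`df`, with columns `g` and `x`, invented for illustration) and compares it with `slice_max()`, which is available in dplyr 1.0.0 and later and is assumed here as one possible alternative.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied at its maximum
df <- data.frame(g = c("a", "a", "a", "b", "b"),
                 x = c(1, 3, 3, 2, 5))

# filter() keeps every row equal to the group maximum, ties included
by_filter <- df %>% group_by(g) %>% filter(x == max(x))

# slice_max() also keeps ties by default (with_ties = TRUE)
by_slice <- df %>% group_by(g) %>% slice_max(x, n = 1)

# with_ties = FALSE returns exactly one row per group
one_each <- df %>% group_by(g) %>% slice_max(x, n = 1, with_ties = FALSE)

nrow(by_filter)  # 3
nrow(by_slice)   # 3
nrow(one_each)   # 2
```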

Extract the rows with a value at or above the mean within each group

iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Length >= mean(Sepal.Length))

## # A tibble: 68 x 5
## # Groups:   Species [3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          5.4         3.9          1.7         0.4 setosa 
##  3          5.4         3.7          1.5         0.2 setosa 
##  4          5.8         4            1.2         0.2 setosa 
##  5          5.7         4.4          1.5         0.4 setosa 
##  6          5.4         3.9          1.3         0.4 setosa 
##  7          5.1         3.5          1.4         0.3 setosa 
##  8          5.7         3.8          1.7         0.3 setosa 
##  9          5.1         3.8          1.5         0.3 setosa 
## 10          5.4         3.4          1.7         0.2 setosa 
## # ... with 58 more rows

Note: the result is still grouped, so remember to `ungroup` it when you need to.
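A small sketch of why this matters: the filtered result keeps its grouping, so a later `summarise` still works per group until `ungroup` is called. The object names below are only illustrative.

```r
library(dplyr)

above_mean <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length >= mean(Sepal.Length))

# Still grouped: summarise() returns one row per Species
still_grouped <- above_mean %>%
  summarise(n = n())

# After ungroup(): summarise() returns a single row for the whole table
ungrouped <- above_mean %>%
  ungroup() %>%
  summarise(n = n())

nrow(still_grouped)  # 3
nrow(ungrouped)      # 1
```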

Author And Source

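The original article, "{dplyr} Extracting the rows that hold the maximum value, or a value at or above the mean, within each group", is available at https://qiita.com/ocean_f/items/cf4b5594efd4685bc75a

Attribution: information about the original author can be found at the original URL. Copyright belongs to the original author.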
