Comparing the processing time of pandas.value_counts() and collections.Counter

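Pandas is convenient, but I often hear complaints about its processing speed.

I compared the processing time of value_counts(), which counts the occurrences of each unique element in pandas, against collections.Counter from the Python standard library.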

import collections
import pandas as pd
import random

As sample data, create a list of 200 million random integers.

hoge_list = [random.randint(0, 9999) for i in range(200000000)]

Creating the list took 92.8 seconds.
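
The measurement code itself is not shown. As a minimal sketch, assuming `time.perf_counter` as the timer and a far smaller sample (the helper `timed`, the function `make_list`, and the size `N` are hypothetical names chosen for illustration), an elapsed time like the one above could be collected as follows:

```python
import random
import time


def timed(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def make_list(n):
    # Same construction as hoge_list, with n elements instead of 200 million.
    return [random.randint(0, 9999) for _ in range(n)]


N = 100_000  # hypothetical size, small enough to run quickly
small_list, seconds = timed(make_list, N)
print(f"list creation: {seconds:.3f} s for {N} elements")
```

A single run like this is noisy, so repeating the measurement a few times gives a steadier figure.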

hoge_df = pd.DataFrame(hoge_list)

Converting the list to a DataFrame took 19.0 seconds.

c2 = hoge_df.value_counts()

print(c2.iloc[:3])
print(c2.iloc[-3:])

Output:
9427 20501
6629 20482
5215 20475
dtype: int64
3637 19523
4647 19505
3036 19420
dtype: int64

c1 = collections.Counter(hoge_list)

print(c1.most_common()[:3])
print(c1.most_common()[-3:])

Output:
[(9427, 20501), (6629, 20482), (5215, 20475)]
[(3637, 19523), (4647, 19505), (3036, 19420)]
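
The two outputs above agree on the top three and bottom three values. A minimal sketch of checking the full result programmatically, assuming a small sample and using `pd.Series.value_counts()` rather than the DataFrame call above so that the index holds plain integers (the names `sample`, `counter_result`, and `pandas_result` are illustrative):

```python
import collections
import random

import pandas as pd

random.seed(0)  # fixed seed so repeated runs use the same sample
sample = [random.randint(0, 9999) for _ in range(100_000)]

counter_result = collections.Counter(sample)
pandas_result = pd.Series(sample).value_counts()

# Both approaches should report the same count for every distinct value.
assert pandas_result.to_dict() == dict(counter_result)

# Ties may be ordered differently, so compare the top counts, not the keys.
top_counter = [count for _, count in counter_result.most_common(3)]
top_pandas = pandas_result.iloc[:3].tolist()
print(top_counter == top_pandas)
```

Comparing the dictionaries avoids depending on how either library orders values that share the same count.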

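Results

collections.Counter on the list: 9.1 seconds

value_counts on the already-converted DataFrame: 1.7 seconds

Note, however, that the list -> DataFrame conversion took 19.0 seconds.

Looking only at the step that counts the unique elements, value_counts appears to be faster. Including the conversion, though, the pandas route takes about 20.7 seconds (19.0 + 1.7) against 9.1 seconds for collections.Counter, so Counter is quicker end to end when the data starts out as a Python list.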
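
Because the conversion accounts for most of the pandas time, it helps to time the conversion and the count separately as well as together. A rough sketch on a small sample, assuming `time.perf_counter` as the timer and a hypothetical size `N`; absolute numbers will differ by machine and pandas version:

```python
import collections
import random
import time

import pandas as pd

N = 200_000  # hypothetical size; the article uses 200 million
data = [random.randint(0, 9999) for _ in range(N)]

t0 = time.perf_counter()
counts_counter = collections.Counter(data)
t1 = time.perf_counter()
frame = pd.DataFrame(data)            # list -> DataFrame conversion
t2 = time.perf_counter()
counts_pandas = frame.value_counts()  # counting only
t3 = time.perf_counter()

print(f"Counter on list:        {t1 - t0:.4f} s")
print(f"list -> DataFrame:      {t2 - t1:.4f} s")
print(f"DataFrame.value_counts: {t3 - t2:.4f} s")
print(f"pandas end to end:      {t3 - t1:.4f} s")
```

Reporting the conversion on its own line makes it clear which part of the pandas route the time is spent in.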

That is all. Thank you for reading to the end.

Original article: https://zenn.dev/megane_otoko/articles/087_value_counts

Author information is available at the original URL. Copyright belongs to the original author.
