Fixing garbled Japanese text in pandas-profiling on Jupyter Notebook


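Introduction

When starting a data analysis, the first step is usually to understand what the data looks like. pandas-profiling is very handy here because it runs the whole exploratory data analysis (EDA) in one go. When I tried it in Jupyter Notebook, however, Japanese text such as column names was rendered as empty boxes (□□□, often called "tofu"), so this post summarizes how to fix that.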

Environment

Python 3.8
Docker
Jupyter Notebook

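Cause

The garbled text in pandas-profiling comes from matplotlib and seaborn, which draw its plots and are not configured with a font that can render Japanese. Once matplotlib and seaborn are set up for Japanese, pandas-profiling, which relies on them, displays Japanese correctly as well.

This post therefore walks through the steps for enabling Japanese in matplotlib and seaborn.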
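
Before changing anything, it can help to confirm the diagnosis by asking matplotlib which fonts it has registered. The following is a minimal sketch, not part of the original steps: `has_font` and `registered_font_names` are hypothetical helper names, and it assumes matplotlib is importable in the notebook kernel.

```python
def has_font(font_names, wanted="IPAexGothic"):
    """Return True if `wanted` appears in an iterable of font family names."""
    return wanted in set(font_names)


def registered_font_names():
    """Return the font family names matplotlib currently knows about."""
    # Imported lazily so has_font() can be used without matplotlib installed.
    from matplotlib import font_manager
    return sorted({font.name for font in font_manager.fontManager.ttflist})


# In a notebook cell:
# has_font(registered_font_names())  # False means IPAexGothic is not registered yet
```

If the check returns False, matplotlib has no IPAexGothic font to draw with, which is consistent with the tofu boxes described above.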

Enabling Japanese in matplotlib and seaborn

This post covers a Jupyter Notebook environment running in Docker.
There may well be a more efficient way; if you know one, I would appreciate a comment.

1. Download a Japanese font

Download ipaexg00401.zip (4.0 MB), the IPAex Gothic font archive, from its download site and unzip it.
Move ipaexg.ttf from the ipaexg00401 folder into the directory that contains your Dockerfile.
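
If you prefer to script this step, the extraction can be sketched with Python's standard `zipfile` module. This is only an illustration: `extract_font` is a hypothetical helper, and it assumes the archive has already been downloaded next to the Dockerfile.

```python
import zipfile
from pathlib import Path


def extract_font(zip_path, dest_dir, font_name="ipaexg.ttf"):
    """Copy `font_name` out of the archive into `dest_dir`, dropping subfolders."""
    dest = Path(dest_dir) / font_name
    with zipfile.ZipFile(zip_path) as archive:
        # The font sits inside a folder such as ipaexg00401/, so match on the file name.
        matches = [n for n in archive.namelist() if Path(n).name == font_name]
        if not matches:
            raise FileNotFoundError(f"{font_name} not found in {zip_path}")
        dest.write_bytes(archive.read(matches[0]))
    return dest


# Example, run from the Dockerfile directory:
# extract_font("ipaexg00401.zip", ".")
```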

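2. Copy a file from the container to the host for seaborn's Japanese support

seaborn needs an edited rcmod.py to display Japanese. In this step we copy rcmod.py from the container to the host and edit it there; the Dockerfile then overwrites the container's rcmod.py with the edited copy when the image is built, so there is no need to edit rcmod.py by hand after every docker-compose up.

(Ideally I would patch the file in the container directly from the Dockerfile, but I could not work out how.)

Run docker-compose up with the image that does not yet support Japanese.
Open another terminal and look up the container ID.

# Check the container ID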
$ docker ps

Next, copy rcmod.py from the container to the host (your local machine).

$ docker cp [container ID]:/opt/conda/lib/python3.8/site-packages/seaborn/rcmod.py [destination (e.g. C:\Users\....)]

Finally, copy the saved rcmod.py into the directory that contains your Dockerfile.
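
The path in the docker cp command hard-codes python3.8, so it will differ if your image ships another Python version. As a rough sketch, the location can be looked up from a Python session inside the container; `module_file` is a hypothetical helper built on the standard `importlib.util.find_spec`.

```python
import importlib.util


def module_file(module_name):
    """Return the file path of an importable module, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when the parent package (e.g. seaborn) is missing.
        return None
    return spec.origin if spec is not None else None


# Inside the container:
# module_file("seaborn.rcmod")
```

The returned path is the one to pass to docker cp and, later, to the COPY line in the Dockerfile.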

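3. Edit rcmod.py

Open rcmod.py and make the following changes.

On lines 86-87, in def set_theme(context="notebook", ...), change the font argument to font="IPAexGothic". (Line numbers may differ between seaborn versions.)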

def set_theme(context="notebook", style="darkgrid", palette="deep",
              font="IPAexGothic", font_scale=1, color_codes=True, rc=None):

Next, on line 205, change "font.family": ["sans-serif"] to the following:

"font.family": ["IPAexGothic"]

That completes the rcmod.py changes needed for seaborn's Japanese support.
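
The same two edits can be applied with plain string replacement instead of by hand. This is a sketch under assumptions: `patch_rcmod` is a hypothetical helper, and it assumes the unmodified file spells the default argument as font="sans-serif"; it raises an error rather than guessing if either expected string is missing.

```python
def patch_rcmod(source, font="IPAexGothic"):
    """Return rcmod.py source with the default font swapped for `font`."""
    replacements = {
        # Default argument of set_theme() (assumed spelling in the unmodified file).
        'font="sans-serif"': f'font="{font}"',
        # Default font family in the style dictionary.
        '"font.family": ["sans-serif"]': f'"font.family": ["{font}"]',
    }
    for old, new in replacements.items():
        if old not in source:
            raise ValueError(f"expected text not found: {old}")
        source = source.replace(old, new)
    return source


# Example, run next to the copied file:
# from pathlib import Path
# path = Path("rcmod.py")
# path.write_text(patch_rcmod(path.read_text(encoding="utf-8")), encoding="utf-8")
```

Failing loudly when a pattern is absent is deliberate: a seaborn version with different defaults should be inspected manually rather than patched silently.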

4. Add the following to the Dockerfile

# Japanese support for matplotlib and seaborn
# Copy the Japanese font into matplotlib's font directory
COPY ipaexg.ttf /opt/conda/lib/python3.8/site-packages/matplotlib/mpl-data/fonts/ttf/ipaexg.ttf
# Overwrite the container's rcmod.py with the edited rcmod.py
COPY ./rcmod.py /opt/conda/lib/python3.8/site-packages/seaborn/rcmod.py
# Append "font.family : IPAexGothic" to the end of matplotlib's config file
RUN echo "font.family : IPAexGothic" >> /opt/conda/lib/python3.8/site-packages/matplotlib/mpl-data/matplotlibrc
# Delete the cache so matplotlib rebuilds its font list (-f avoids an error if it does not exist)
RUN rm -rf ./.cache
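
The final RUN line matters because matplotlib stores the list of fonts it has scanned in fontlist-*.json files under its cache directory, and a stale list will not include the newly copied font. If you need to clear that list from a running notebook rather than at build time, a narrower alternative can be sketched as follows; `clear_font_cache` is a hypothetical helper that removes only the font list files, not the whole cache directory.

```python
from pathlib import Path


def clear_font_cache(cache_dir):
    """Delete matplotlib's fontlist-*.json cache files and return the removed paths."""
    removed = []
    for cache_file in sorted(Path(cache_dir).glob("fontlist-*.json")):
        cache_file.unlink()
        removed.append(cache_file)
    return removed


# In the notebook (assumes matplotlib is installed):
# import matplotlib
# clear_font_cache(matplotlib.get_cachedir())  # then restart the kernel
```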

matplotlib and seaborn are now set up for Japanese.
You can check that nothing is garbled with the code below.

# Check that matplotlib can display Japanese
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.xlabel('日本語化')
plt.ylabel('matplotlibの')
plt.show()

# Check that seaborn can display Japanese
import seaborn as sns
sns.set(style="whitegrid")

# Load the example Titanic dataset
titanic = sns.load_dataset("titanic")

# Draw a nested barplot to show survival for class and sex
g = sns.catplot(x="class", y="survived", hue="sex", data=titanic,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("seabornの日本語化")

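The axis labels contain Japanese, so if they are not rendered as boxes (tofu), the setup worked.

Once Japanese displays correctly in matplotlib and seaborn, it should display correctly in pandas-profiling as well.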
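
To confirm the result in pandas-profiling itself, a report can be generated from a small DataFrame with Japanese column names. This is a rough sketch: the rows are made-up sample data, `sample_rows` and `build_report` are hypothetical helpers, and it assumes pandas and pandas-profiling are installed in the kernel.

```python
def sample_rows():
    """Small made-up dataset whose column names and values are Japanese."""
    return [
        {"名前": "佐藤", "年齢": 28, "部署": "営業"},
        {"名前": "鈴木", "年齢": 35, "部署": "開発"},
        {"名前": "高橋", "年齢": 42, "部署": "開発"},
    ]


def build_report(rows, title="日本語カラムの確認"):
    """Build a pandas-profiling report; imports are lazy so they are only needed when called."""
    import pandas as pd
    from pandas_profiling import ProfileReport
    return ProfileReport(pd.DataFrame(rows), title=title)


# In a notebook cell:
# build_report(sample_rows()).to_notebook_iframe()
```

Any DataFrame with Japanese column names would serve equally well; the point is only to check that the plots in the report show the names instead of boxes.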

Conclusion

I have a feeling that running pandas-profiling first, before any other EDA, is going to become my standard routine.

Source

Original article: https://qiita.com/Sicut_study/items/440877e38485886bde27
