Machine Learning: Wine Data Analysis



Wine Data Analysis

1. Overview of the wine data

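Wine: for classification problems this dataset is not as famous as the Iris flower data, but it is widely used.
Wine is said to be the oldest alcoholic drink in human history.

Traces dating to around 7000 BC have been found in the Caucasus region (Georgia, Armenia, northeastern Turkey).

Plato: "Among the gifts granted by the gods, none has greater value than wine."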

Wine flavor classification: link

Data: link

Wine quality data (download): link

White wine quality data (download): link

# Read the data (the files are semicolon-separated)
import pandas as pd

red_wine = pd.read_csv('winequality-red.csv', sep=';')
white_wine = pd.read_csv('winequality-white.csv', sep=';')

Confirm that the two datasets have the same structure.
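One way to make that structure check explicit is sketched below. The helper `same_structure` and the two small stand-in frames are assumptions for illustration, not part of the original post; on the real data you would pass `red_wine` and `white_wine` instead.

```python
import pandas as pd


def same_structure(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames share column names (in order) and dtypes."""
    return list(a.columns) == list(b.columns) and list(a.dtypes) == list(b.dtypes)


# Tiny stand-in frames with made-up values (not the real wine data)
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
white_demo = pd.DataFrame({"alcohol": [8.8, 10.1], "quality": [6, 6]})

print(same_structure(red_demo, white_demo))
```

If the check fails, comparing `a.columns` and `b.columns` directly usually shows which column is missing or misnamed.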

Columns

fixed acidity

volatile acidity

citric acid

residual sugar

chlorides

free sulfur dioxide

total sulfur dioxide

density

pH

sulphates

alcohol

quality: 0 to 10 (higher is better)

# Combine the red and white wine data
# Add a color column to tell them apart (1 = red, 0 = white)
red_wine['color'] = 1.
white_wine['color'] = 0.

wine = pd.concat([red_wine, white_wine])
wine.reset_index(drop=True, inplace=True)
wine.info()

# Inspect the contents
wine['quality'].unique()

wine['quality'].value_counts()

# Histogram of quality
import plotly.express as px

fig = px.histogram(wine, x='quality')
fig.show()

The per-color histogram below confirms that white wines outnumber red wines, and that the red wine grades sit slightly lower.

# Histogram of grades, split by red/white
fig = px.histogram(wine, x='quality', color='color')
fig.show()

2. Classifying red and white wine

The task: predict whether a wine is red or white.

# Separate the label from the features
x = wine.drop(['color'], axis=1)
y = wine['color']

# Split into training and test sets
from sklearn.model_selection import train_test_split
import numpy as np

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=13)
# Check the label counts in the training set
np.unique(y_train, return_counts=True)
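Because white wines outnumber red ones, a plain random split does not guarantee the same class ratio in both sets. The sketch below shows the `stratify` argument of `train_test_split`, which the original post does not use; the toy labels (80 of one class, 20 of the other) are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 samples of class 0 and 20 of class 1 (illustrative only)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0.0] * 80 + [1.0] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

# With stratify, each split keeps the 80/20 class ratio
train_counts = dict(zip(*np.unique(y_tr, return_counts=True)))
test_counts = dict(zip(*np.unique(y_te, return_counts=True)))
print(train_counts, test_counts)
```

On the wine data the equivalent call would pass `stratify=y`, assuming `y` is the color label defined above.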

# Check how similarly the training and test sets are distributed (here, by quality)
import plotly.graph_objects as go

fig = go.Figure()
# Add one trace per histogram when drawing several on the same figure
fig.add_trace(go.Histogram(x=x_train['quality'], name='Train'))
fig.add_trace(go.Histogram(x=x_test['quality'], name='Test'))

fig.update_layout(barmode='overlay')
fig.update_traces(opacity=0.75)
fig.show()

The training and test accuracies are nearly the same.

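# Train a decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

wine_tree = DecisionTreeClassifier(max_depth = 2, random_state=13)
wine_tree.fit(x_train, y_train)

y_pred_tr = wine_tree.predict(x_train)
y_pred_test = wine_tree.predict(x_test)

# Report accuracy on both sets
print('Train Acc : {}'.format(accuracy_score(y_train, y_pred_tr)))
print('Test Acc : {}'.format(accuracy_score(y_test, y_pred_test)))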

The features the tree relies on are total sulfur dioxide and chlorides.

# Plot the decision tree
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
# feature_names labels each split with its column name
plot_tree(wine_tree, feature_names=x.columns)
plt.show()

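3. Data preprocessing

# Draw box plots for a few columns of the wine data
# When column scales differ widely, training may not work properly.
# A scaler can bring the columns onto a comparable scale.
#   => For a decision tree this preprocessing makes little difference.
#   => It mainly helps when a cost function is being optimized.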
fig = go.Figure()
fig.add_trace(go.Box(y=x['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=x['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=x['quality'], name='quality'))
fig.show()

# MinMaxScaler, StandardScaler
from sklearn.preprocessing import MinMaxScaler, StandardScaler

MMS = MinMaxScaler()
SS = StandardScaler()

MMS.fit(x)
SS.fit(x)

x_ss = SS.transform(x)
x_mms = MMS.transform(x)

x_ss_pd = pd.DataFrame(x_ss, columns=x.columns)
x_mms_pd = pd.DataFrame(x_mms, columns=x.columns)
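The code above fits both scalers on the full feature table before splitting. A common alternative, sketched here as an assumption rather than the post's method, is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no test information enters preprocessing. The synthetic features stand in for wine columns on different scales.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
# Synthetic features on very different scales (stand-ins for the wine columns)
X_demo = np.column_stack([rng.normal(8.0, 1.5, 200), rng.normal(0.05, 0.02, 200)])
y_demo = rng.integers(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

The training columns end up with mean 0 and standard deviation 1; the test columns will be close to that but not exact, which is expected.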

# x_mms_pd
fig = go.Figure()
fig.add_trace(go.Box(y=x_mms_pd['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=x_mms_pd['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=x_mms_pd['quality'], name='quality'))
fig.show()

# x_ss_pd
fig = go.Figure()
fig.add_trace(go.Box(y=x_ss_pd['fixed acidity'], name='fixed acidity'))
fig.add_trace(go.Box(y=x_ss_pd['chlorides'], name='chlorides'))
fig.add_trace(go.Box(y=x_ss_pd['quality'], name='quality'))
fig.show()

Train on the MinMaxScaler output.

It has almost no effect on a decision tree.
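Why scaling barely matters for a tree can be checked directly: a tree splits on thresholds, and min-max scaling preserves the order of values within each column. The sketch below uses synthetic data from `make_classification` (an assumption, not the wine data) and compares predictions from a tree trained on raw features with one trained on scaled features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=13)

tree_raw = DecisionTreeClassifier(max_depth=2, random_state=13).fit(X_demo, y_demo)

X_scaled = MinMaxScaler().fit_transform(X_demo)
tree_scaled = DecisionTreeClassifier(max_depth=2, random_state=13).fit(X_scaled, y_demo)

# Fraction of rows where both trees predict the same class
agreement = np.mean(tree_raw.predict(X_demo) == tree_scaled.predict(X_scaled))
print(f"Share of identical predictions: {agreement:.3f}")
```

Models that optimize a cost function over the raw feature values, such as logistic regression or k-nearest neighbours, would not be expected to show the same invariance.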

# MinMaxScaler
x_train, x_test, y_train, y_test = train_test_split(x_mms_pd, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth = 2, random_state=13)
wine_tree.fit(x_train, y_train)

y_pred_tr = wine_tree.predict(x_train)
y_pred_test = wine_tree.predict(x_test)

print('Train Acc : {}'.format(accuracy_score(y_train, y_pred_tr)))
print('Test Acc : {}'.format(accuracy_score(y_test, y_pred_test)))

Train on the StandardScaler output.

It has almost no effect on a decision tree.

# StandardScaler
x_train, x_test, y_train, y_test = train_test_split(x_ss_pd, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth = 2, random_state=13)
wine_tree.fit(x_train, y_train)

y_pred_tr = wine_tree.predict(x_train)
y_pred_test = wine_tree.predict(x_test)

print('Train Acc : {}'.format(accuracy_score(y_train, y_pred_tr)))
print('Test Acc : {}'.format(accuracy_score(y_test, y_pred_test)))

Increasing max_depth changes these numbers.
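A small sweep makes the effect of `max_depth` visible. This is a sketch on synthetic data (an assumption; swap in the wine split to reproduce the post's setting), and the depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)

results = {}
for depth in (1, 2, 4, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=13).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth
    results[depth] = (
        accuracy_score(y_tr, tree.predict(X_tr)),
        accuracy_score(y_te, tree.predict(X_te)),
    )

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Training accuracy typically keeps rising with depth while test accuracy levels off or drops, which is the usual sign of overfitting.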

# Features that matter most for telling red wine from white wine
dict(zip(x_train.columns, wine_tree.feature_importances_))
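The importance dictionary is easier to read when sorted. The sketch below wraps `feature_importances_` in a pandas Series; the two-column toy frame (`signal`, `noise`) is hypothetical and only there to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(13)
# Hypothetical toy frame: the label depends on "signal" only
demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
label = (demo["signal"] > 0).astype(float)

tree = DecisionTreeClassifier(max_depth=2, random_state=13).fit(demo, label)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(tree.feature_importances_, index=demo.columns).sort_values(ascending=False)
print(importances)
```

For the wine tree the same pattern would be `pd.Series(wine_tree.feature_importances_, index=x_train.columns)`.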

4. Binary classification of taste

Binarize the quality column:

Greater than 5: 1 (tasty)

5 or below: 0 (not tasty)

# Add the taste column
wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]
wine.info()
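The list comprehension above can also be written as a vectorized comparison, which is the more common pandas idiom. A minimal sketch with made-up grades:

```python
import pandas as pd

quality_demo = pd.Series([3, 5, 6, 8], name="quality")  # made-up grades

# Boolean comparison, then cast to float to match the 1./0. labels
taste_demo = (quality_demo > 5).astype(float)
print(taste_demo.tolist())  # [0.0, 0.0, 1.0, 1.0]
```

On the real frame this would read `wine['taste'] = (wine['quality'] > 5).astype(float)`.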

# Same procedure as the red/white classification
X = wine.drop(['taste'], axis=1)
y = wine['taste']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)

# Create the decision tree
wine_tree = DecisionTreeClassifier(max_depth=2, random_state = 13)
# Fit on the training data
wine_tree.fit(X_train, y_train)

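The accuracy is 1.

Why does it come out as 1?

The quality column alone splits the data perfectly.

The taste column was built from the quality column, but quality was left in the features, so the accuracy is 100%.

Remove quality, which was used to create taste, and train again.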
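This kind of label leakage is easy to reproduce in isolation. In the sketch below, all data is synthetic and the helper `holdout_accuracy` is a hypothetical name: the label is derived from a `quality` column, and the tree is trained once with that column and once without it.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(13)
demo = pd.DataFrame({
    "noise": rng.normal(size=400),             # unrelated to the label
    "quality": rng.integers(3, 10, size=400),  # integer grades from 3 to 9
})
demo["taste"] = (demo["quality"] > 5).astype(float)  # label derived from quality


def holdout_accuracy(feature_cols):
    """Train a depth-2 tree on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_cols], demo["taste"], test_size=0.2, random_state=13
    )
    tree = DecisionTreeClassifier(max_depth=2, random_state=13).fit(X_tr, y_tr)
    return accuracy_score(y_te, tree.predict(X_te))


acc_leaky = holdout_accuracy(["noise", "quality"])  # source column still present
acc_clean = holdout_accuracy(["noise"])             # source column removed
print(acc_leaky, acc_clean)
```

With the source column present the tree only needs one split at 5.5 to separate the classes, so a perfect score here signals leakage rather than a good model.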

# Check the accuracy
y_pred_tr = wine_tree.predict(X_train)
y_pred_Test = wine_tree.predict(X_test)

print('Train Acc : {}'.format(accuracy_score(y_train, y_pred_tr)))
print('Test Acc : {}'.format(accuracy_score(y_test, y_pred_Test)))

# Inspect the decision tree
plt.figure(figsize=(12, 8))
plot_tree(wine_tree, feature_names=X.columns)
plt.show()

Train again after removing quality.

# Same procedure as the red/white classification, with quality dropped as well
X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)

# Create the decision tree
wine_tree = DecisionTreeClassifier(max_depth=2, random_state = 13)

# Fit on the training data
wine_tree.fit(X_train, y_train)

# Check the accuracy
y_pred_tr = wine_tree.predict(X_train)
y_pred_Test = wine_tree.predict(X_test)

print('Train Acc : {}'.format(accuracy_score(y_train, y_pred_tr)))
print('Test Acc : {}'.format(accuracy_score(y_test, y_pred_Test)))

The tree uses alcohol, volatile acidity, and free sulfur dioxide.

Does a higher alcohol content mean a better-tasting wine?

# Decision Tree
plt.figure(figsize=(12, 8))
plot_tree(wine_tree, feature_names=X.columns)
plt.show()
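One simple way to probe the alcohol question is to compare alcohol levels between the two taste groups. The sketch below shows the grouping pattern only; the four rows are made up for illustration and say nothing about real wines.

```python
import pandas as pd

# Made-up rows for illustration only; on the real data use the `wine` frame
demo = pd.DataFrame({
    "alcohol": [9.0, 9.5, 11.0, 12.0],
    "taste": [0.0, 0.0, 1.0, 1.0],
})

# Summary statistics of alcohol within each taste group
alcohol_by_taste = demo.groupby("taste")["alcohol"].agg(["mean", "median", "count"])
print(alcohol_by_taste)
```

On the real data the call would be `wine.groupby('taste')['alcohol'].agg(['mean', 'median', 'count'])`. A difference between the groups would show an association, not that alcohol causes a higher grade.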

Reference

Original post for this topic (Machine Learning: Wine Data Analysis): https://velog.io/@skarb4788/머신-러닝-와인-데이터-분석
