Introducing Scikit-Learn


About Scikit-Learn

Scikit-Learn is one of the most widely used open-source machine learning libraries for Python. It provides a variety of supervised and unsupervised learning algorithms that many data scientists rely on.

Install Scikit-Learn

conda install scikit-learn

pip install scikit-learn

import sklearn

print(sklearn.__version__)

Output

0.21.3

Predict Types of Irises

We will classify iris species based on the four features in the imported dataset (i.e., sepal length, sepal width, petal length, and petal width).
Classification
A supervised-learning problem in which a class label is predicted for a given example of input data (e.g., diagnosing COVID-19 or detecting spam mail).

from sklearn.datasets import load_iris 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

import pandas as pd

# load iris dataset
iris = load_iris()

# iris.data holds the feature data as a NumPy array
iris_data = iris.data

# iris.target holds the label data as a NumPy array
iris_label = iris.target
print('Iris Target Values : \n', iris_label)
print('Iris Target Names : \n', iris.target_names)

# convert data-set to DataFrame
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df.head(3)

Output

Iris Target Values : 
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Iris Target Names : 
 ['setosa' 'versicolor' 'virginica']
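To make the printed labels above easier to read, the following minimal sketch inspects the shape of the dataset and counts the samples per class. The variable name class_counts is introduced here for illustration only and is not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# feature matrix: one row per flower, one column per feature
print('Feature shape :', iris.data.shape)
print('Feature names :', iris.feature_names)

# label vector: one integer label (0, 1 or 2) per flower
print('Label shape :', iris.target.shape)

# number of samples for each label, in label order 0, 1, 2
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, ':', count)
```

The dataset is balanced: each of the three species contributes the same number of samples.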

Split to Train & Test Data

The data must be split into training and test sets so that the trained model can be evaluated on data it has not seen during training. Scikit-Learn provides the train_test_split() function to split a dataset easily.

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, test_size=0.2, random_state=11)

# create the Decision Tree Classifier object
dt_clf = DecisionTreeClassifier(random_state=11)

# train the model

# fit() takes the training feature data and the training label data
dt_clf.fit(X_train, y_train)
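To see what the split above produces, the sketch below repeats the same train_test_split() call and checks the size of each part. The second call uses the stratify parameter, which the original post does not use; it is shown only as an optional variation that keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# same call as in the post: 20% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11
)
print('Train shape :', X_train.shape)
print('Test shape  :', X_test.shape)

# optional variation (not in the original post): stratify on the labels
# so that every class is equally represented in the test set
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11, stratify=iris.target
)
print('Test labels per class (stratified) :', np.bincount(ys_test))
```

Fixing random_state makes the split reproducible, so repeated runs evaluate the model on the same test rows.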

At this point, the DecisionTreeClassifier has been trained on the training dataset. Prediction should be performed on a separate dataset (the test dataset) by calling predict().

# predict labels for the test dataset with the trained dt_clf

pred = dt_clf.predict(X_test)
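predict() returns one integer label per test row. As a minimal sketch, the integers can be mapped back to species names by indexing iris.target_names with the prediction array. The variable name pred_names is an assumption introduced for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11
)

dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)

# translate integer labels (0, 1, 2) into species names
pred_names = iris.target_names[pred]

# show the first three test rows with their predicted species
for features, name in zip(X_test[:3], pred_names[:3]):
    print(features, '->', name)
```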

Now import accuracy_score to evaluate the performance of the model.

from sklearn.metrics import accuracy_score

print('Accuracy Score : {0:4f}'.format(accuracy_score(y_test, pred)))

Output

Accuracy Score : 0.933333

The trained decision tree classifier achieves 93.33% accuracy on the test dataset, which corresponds to 28 of the 30 test samples being classified correctly.
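The accuracy score is the fraction of test samples whose predicted label matches the true label. The sketch below computes that fraction by hand and compares it with accuracy_score(). It also prints a confusion matrix, which the original post does not use; it is included only as one possible way to see which classes are confused with each other.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11
)
dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)

# accuracy by hand: share of predictions equal to the true labels
manual_accuracy = np.mean(pred == y_test)
print('Manual accuracy  : {0:.4f}'.format(manual_accuracy))
print('accuracy_score() : {0:.4f}'.format(accuracy_score(y_test, pred)))

# rows are true classes, columns are predicted classes;
# correct predictions lie on the diagonal
cm = confusion_matrix(y_test, pred)
print(cm)
```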

To Summarize

Split Dataset : split the data into training and test datasets

Train Model : train the model by applying an ML algorithm to the training dataset

Perform Prediction : predict class labels for the test dataset with the trained ML model

Evaluation : evaluate the accuracy of the predictions by comparing them with the test labels
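The four steps above can be sketched as a single function. The helper split_train_predict_evaluate() is hypothetical and written only for this illustration; it is not part of Scikit-Learn or of the original post. It accepts any estimator that offers fit() and predict().

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def split_train_predict_evaluate(estimator, X, y, test_size=0.2, random_state=11):
    """Hypothetical helper: run the four summarized steps, return test accuracy."""
    # 1. split the data into training and test datasets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # 2. train the model on the training dataset
    estimator.fit(X_train, y_train)
    # 3. predict labels for the test dataset
    pred = estimator.predict(X_test)
    # 4. compare the predictions with the test labels
    return accuracy_score(y_test, pred)


iris = load_iris()
accuracy = split_train_predict_evaluate(
    DecisionTreeClassifier(random_state=11), iris.data, iris.target
)
print('Accuracy Score : {0:.4f}'.format(accuracy))
```

Passing the estimator as an argument keeps the four steps unchanged when a different classifier is tried.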

Reference

More information on this topic (Introducing Scikit-Learn) can be found here: https://velog.io/@jiselectric/Machine-Learning-with-Scikit-Learn-01

