Training a linear regression model to predict house prices with a TensorFlow Estimator

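Google's machine learning crash course includes an exercise that uses linear regression to predict house prices. The dataset has 8 features, but only one of them is used here, purely as practice. A single feature alone cannot produce a good model.
Dataset link: https://storage.googleapis.com/mledu-datasets/california_housing_train.csv
First attempt
Code: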

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

tf.logging.set_verbosity(tf.logging.ERROR)  # DEBUG INFO WARN ERROR FATAL

pd.options.display.max_rows = 10
pd.options.display.max_columns = 9
# pd.set_option('max_columns', 9)
pd.options.display.float_format = '{:.1f}'.format

# Load the dataset (the remote read is kept for reference; a local copy is used below)
# california_housing_dataframe = pd.read_csv(
#     "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=',')
california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=',')

# Shuffle the rows
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))

# Scale median_house_value to units of thousands
california_housing_dataframe["median_house_value"] /= 1000.0

# Inspect the data
# print('First rows:')
# print(california_housing_dataframe.head())
# print('Summary statistics:')
# describe = california_housing_dataframe.describe()
# print(describe)

# Build the model

# 1. Define the feature
my_feature = california_housing_dataframe[['total_rooms']]  # double brackets return a DataFrame
# my_feature_series = california_housing_dataframe['total_rooms']  # single brackets return a Series
# print('Types:')
# print(type(my_feature))
# print(type(my_feature_series))
feature_columns = [tf.feature_column.numeric_column('total_rooms')]  # numeric feature column
# print(feature_columns)


# 2. Define the target (label)
targets = california_housing_dataframe['median_house_value']

# 3. Configure the linear regressor
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns, optimizer=my_optimizer)


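# 4. Define the input function
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """
    Feed data to the linear regressor.
    :param features: DataFrame of input features
    :param targets: Series of target values
    :param batch_size: number of examples per batch
    :param shuffle: whether to shuffle the data
    :param num_epochs: number of passes over the data; None repeats indefinitely
    :return: a (features, labels) tuple for the next batch
    """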
    features = {key: np.array(value) for key, value in dict(features).items()}

    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)

    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels


# 5. Train
_ = linear_regressor.train(input_fn=lambda: my_input_fn(my_feature, targets), steps=100)

# 6. Predict
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

predictions = np.array([item['predictions'][0] for item in predictions])
# print(predictions)

# 7. Evaluate
mean_squared_error = metrics.mean_squared_error(targets, predictions)
root_mean_squared_error = math.sqrt(mean_squared_error)

min_house_value = california_housing_dataframe['median_house_value'].min()
max_house_value = california_housing_dataframe['median_house_value'].max()
max_min_difference = max_house_value - min_house_value

# print('Mean squared error(on train set): %.3f' % mean_squared_error)
print('Root mean squared error(on train set): %.3f' % root_mean_squared_error)
print('Max. median house value(on train set): %.3f' % max_house_value)
print('Min. median house value(on train set): %.3f' % min_house_value)
print('Difference between Min. and Max.(on train set): %.3f' % max_min_difference)

# Sample output:
# Root mean squared error(on train set): 237.417
# Max. median house value(on train set): 500.001
# Min. median house value(on train set): 14.999
# Difference between Min. and Max.(on train set): 485.002
calibration_data = pd.DataFrame()
calibration_data['prediction'] = pd.Series(predictions)
calibration_data['targets'] = pd.Series(targets)
print(calibration_data.describe())
#        prediction  targets
# count     17000.0  17000.0
# mean          0.1    207.3
# std           0.1    116.0
# min           0.0     15.0
# 25%           0.1    119.4
# 50%           0.1    180.4
# 75%           0.2    265.0
# max           1.9    500.0

# Plot the learned line over a sample of the data
sample = california_housing_dataframe.sample(n=300)
x_0 = sample['total_rooms'].min()
x_1 = sample['total_rooms'].max()
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias
plt.plot([x_0,x_1],[y_0, y_1], c='r')
plt.xlabel('total_rooms')
plt.ylabel('median_house_value')
plt.scatter(sample['total_rooms'], sample['median_house_value'])
plt.show()

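Result after 100 training steps:
The plot shows 300 points sampled from the dataset, with the red line as the model's prediction. The fit is poor: the RMSE is 237.417 against a target range of 485.002, and the mean prediction is 0.1 while the mean target is 207.3.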
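
The RMSE printed above is the square root of the mean squared difference between targets and predictions. As a minimal sketch (the `rmse` helper and the toy numbers are assumptions made up for illustration, not values from the dataset), the following shows why predictions close to zero against targets in the hundreds give an RMSE of the same order as the targets themselves:

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values in thousands of dollars, chosen only for illustration.
toy_targets = [100.0, 200.0, 300.0]
toy_predictions = [0.1, 0.1, 0.2]  # a barely trained model predicts almost zero

print('RMSE: %.3f' % rmse(toy_targets, toy_predictions))
```

This is the same quantity the post computes with `math.sqrt(metrics.mean_squared_error(targets, predictions))`.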
Encapsulating the code and displaying the training process
Idea
Every few steps, display the results of the training so far.
Hyperparameters

steps: the total number of training iterations. Each iteration is one step: the loss is computed on one batch of examples, and that loss is used to adjust the model's weights once.

batch size: the number of examples (chosen at random) used in a single step.

$$\text{total number of trained examples} = \text{batch size} \times \text{steps}$$

periods: controls the granularity of reporting. If periods is 7 and steps is 70, the loss is printed every 10 steps.

$$\text{number of training examples in each period} = \frac{\text{batch size} \times \text{steps}}{\text{periods}}$$
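
To make the arithmetic concrete, here is a small sketch. The helper name `training_budget` is hypothetical; the inputs are the ones this post passes to `train_model(learning_rate=0.00002, steps=500, batch_size=5)`, with `periods = 10` as hard-coded inside `train_model`.

```python
def training_budget(steps, batch_size, periods):
    """Return (total examples, steps per period, examples per period)."""
    total_examples = batch_size * steps
    steps_per_period = steps // periods
    examples_per_period = batch_size * steps_per_period
    return total_examples, steps_per_period, examples_per_period


# Values used by the train_model call in this post.
total, per_period_steps, per_period_examples = training_budget(
    steps=500, batch_size=5, periods=10)
print(total, per_period_steps, per_period_examples)  # 2500 50 250
```

With 17000 rows in the training set, 2500 examples means the model sees only a fraction of the data in this run.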

Code:

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import os
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

tf.logging.set_verbosity(tf.logging.ERROR)  # DEBUG INFO WARN ERROR FATAL

pd.options.display.max_rows = 10
pd.options.display.max_columns = 9
pd.options.display.float_format = '{:.1f}'.format

# Load the dataset
california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=',')
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))  # shuffle the rows
california_housing_dataframe["median_house_value"] /= 1000.0  # scale median_house_value to units of thousands


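# Define the input function
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """
    Feed data to the linear regressor.
    :param features: DataFrame of input features
    :param targets: Series of target values
    :param batch_size: number of examples per batch
    :param shuffle: whether to shuffle the data
    :param num_epochs: number of passes over the data; None repeats indefinitely
    :return: a (features, labels) tuple for the next batch
    """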
    features = {key: np.array(value) for key, value in dict(features).items()}

    ds = Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)

    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels


def train_model(learning_rate, steps, batch_size, input_feature='total_rooms'):
    periods = 10  # number of reporting periods
    steps_per_periods = steps // periods  # training steps per period (kept as an integer)

    my_feature = input_feature
    my_feature_data = california_housing_dataframe[[my_feature]]  # feature data
    my_label = 'median_house_value'
    targets = california_housing_dataframe[my_label]  # target values

    # Create the feature columns
    feature_columns = [tf.feature_column.numeric_column(my_feature)]

    # Create the input functions
    training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
    prediction_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)

    # Create the optimizer
    my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  # gradient clipping

    # Create the model
    linear_regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns, optimizer=my_optimizer)

    # Set up the plot of the learned line for each period
    plt.figure(figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.title('Learned line by period')
    plt.ylabel(my_label)
    plt.xlabel(my_feature)
    sample = california_housing_dataframe.sample(n=300)
    plt.scatter(sample[my_feature], sample[my_label])
    colors = np.linspace(0, 1, periods)
    cmap = cm.get_cmap('hsv')

    print('Training model ...')
    print('RMSE(on training set):')
    predictions_buffer = None
    root_mean_squared_errors = []
    for period in range(0, periods):
        linear_regressor.train(input_fn=training_input_fn, steps=steps_per_periods)
        predictions = linear_regressor.predict(input_fn=prediction_input_fn)
        predictions = np.array([item['predictions'][0] for item in predictions])
        # each item looks like: {'predictions': array([0.015675], dtype=float32)}
        # print(predictions)
        predictions_buffer = predictions

        root_mean_squared_error = math.sqrt(metrics.mean_squared_error(targets, predictions))
        print('period %02d : %.2f' % (period, root_mean_squared_error))
        root_mean_squared_errors.append(root_mean_squared_error)

        y_extents = np.array([0, sample[my_label].max()])
        weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]
        bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')
        x_extents = (y_extents - bias) / weight
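        # 1. Clip the line to the sampled x range; 2. recompute y so the endpoints stay on the line
        x_extents = np.maximum(np.minimum(x_extents, sample[my_feature].max()),  # cap at the largest sampled x
                               sample[my_feature].min())  # floor at the smallest sampled x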
        y_extents = weight * x_extents + bias
        plt.plot(x_extents, y_extents, color=cmap(colors[period]), label='period:{}'.format(period))
    print('Model training finished.')

    plt.legend(loc='best')

    plt.subplot(1, 2, 2)
    plt.ylabel('RMSE')
    plt.xlabel('Periods')
    plt.title('Root mean squared error vs. periods')
    plt.tight_layout()
    plt.plot(root_mean_squared_errors)
    plt.show()

    calibration_data = pd.DataFrame()
    calibration_data['prediction'] = pd.Series(predictions_buffer)
    calibration_data['targets'] = pd.Series(targets)
    display.display(calibration_data.describe())

    print("final RMSE（on training data): % .2f" % root_mean_squared_errors[-1])


train_model(learning_rate=0.00002, steps=500, batch_size=5)

Result:

Training model ...
RMSE(on training set):
period 00 : 225.63
period 01 : 214.42
period 02 : 204.44
period 03 : 195.69
period 04 : 188.50
period 05 : 181.34
period 06 : 176.10
period 07 : 172.26
period 08 : 169.46
period 09 : 167.89
Model training finished.
       prediction  targets
count     17000.0  17000.0
mean        113.1    207.3
std          93.3    116.0
min           0.1     15.0
25%          62.6    119.4
50%          91.0    180.4
75%         134.9    265.0
max        1623.7    500.0
final RMSE（on training data):  167.89

Process finished with exit code 0

Figure:
As the number of training steps grows, the error falls, i.e. the model becomes more accurate.
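Rules of thumb
There is no standard way to tune hyperparameters; their effect depends on the data. Treat the following rules as guidance only.
- The training error should decrease steadily: steeply at first, then levelling off as it converges.
- If the training error has not converged, try running for longer.
- If the training error decreases too slowly, try raising the learning rate to see whether that speeds it up.
- If the learning rate is too high, the opposite can happen: convergence slows down or training diverges.
- If the training error oscillates, try lowering the learning rate.
- A lower learning rate combined with more steps and a larger batch size usually works fairly well.
- A batch size that is too small can cause divergence, so start with a large value and decrease it to find the smallest value that does not hurt performance.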
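
The learning-rate rules can be seen on a toy problem. The sketch below is an illustration only, not the estimator used in this post: it runs plain gradient descent on the one-parameter loss `(w - 3)**2`, and the function name, starting point and learning rates are all assumptions chosen for the demonstration.

```python
def gradient_descent(learning_rate, steps, w_start=0.0, w_best=3.0):
    """Minimise the toy loss (w - w_best)**2 and return the loss after each step."""
    w = w_start
    losses = []
    for _ in range(steps):
        gradient = 2.0 * (w - w_best)  # derivative of (w - w_best)**2
        w -= learning_rate * gradient
        losses.append((w - w_best) ** 2)
    return losses


# Too low, reasonable, and too high learning rates for this particular loss.
for lr in (0.01, 0.4, 1.1):
    losses = gradient_descent(learning_rate=lr, steps=20)
    print('learning_rate=%.2f  first loss=%.4f  last loss=%.4f' % (lr, losses[0], losses[-1]))
```

On this loss each step multiplies the distance to the optimum by `1 - 2 * learning_rate`, so 0.01 shrinks it slowly, 0.4 shrinks it quickly, and 1.1 makes it grow with alternating sign, which is the oscillation and divergence the rules warn about.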
Synthesizing a feature and removing outliers

california_housing_dataframe['rooms_per_person'] = (
        california_housing_dataframe['total_rooms'] / california_housing_dataframe['population'])
california_housing_dataframe['rooms_per_person'] = california_housing_dataframe['rooms_per_person'].apply(
    lambda x: min(x, 5))
train_model(learning_rate=0.05, steps=500, batch_size=5, input_feature='rooms_per_person')
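
The `.apply(lambda x: min(x, 5))` call above caps every value of the synthetic feature at 5. As a sketch on made-up numbers (the Series below is hypothetical, not taken from the dataset), the same capping can also be written with pandas' vectorised `Series.clip`:

```python
import pandas as pd

# Hypothetical ratios; the last two stand in for outliers.
rooms_per_person = pd.Series([0.8, 1.9, 2.3, 18.0, 55.0])

clipped_with_apply = rooms_per_person.apply(lambda x: min(x, 5))  # approach used in this post
clipped_with_clip = rooms_per_person.clip(upper=5)                # vectorised alternative

print(clipped_with_clip.tolist())
```

Both forms leave values at or below 5 unchanged and replace anything larger with 5, so the outliers no longer stretch the feature's range.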

Add at the end of the training function:

    plt.figure('02', figsize=(15, 6))
    plt.subplot(1, 2, 1)
    plt.scatter(calibration_data['prediction'], calibration_data['targets'])

    plt.subplot(1, 2, 2)
    california_housing_dataframe['rooms_per_person'].hist()

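With the outliers (noisy data) still present: a scatter plot of predictions against targets for the model trained on this feature, which should ideally fall on a straight line, and a histogram of the feature's distribution.
After the outliers are removed: the same scatter plot of predictions against targets, which should again ideally fall on a straight line, and the histogram of the feature's distribution.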
