"Machine Learning Foundations" Homework 1


The blog has moved to Marcovaldo's blog.
Having fallen down the machine learning rabbit hole, I am determined to keep walking. The book "Statistical Learning Methods" introduces ten algorithms; it is not very difficult, but I will have to reread it to study its derivations carefully. "Machine Learning in Action" gives concrete examples of each algorithm implemented in Python, which suits beginners who want to understand how the algorithms are applied in practice. On Coursera I also enrolled in Stanford's "Machine Learning" by Andrew Ng and in the two NTU courses by Hsuan-Tien Lin, "Machine Learning Foundations" and "Machine Learning Techniques". Andrew's course is easy and omits much of the mathematical derivation and proof, but it is comprehensive, with many summaries of and comparisons between machine learning algorithms. Hsuan-Tien Lin's courses contain a great deal of mathematical proof and are difficult, so they require careful study. This repository mainly records my notes and the programming assignments from working through "Machine Learning Foundations".
I have just finished the first homework of "Machine Learning Foundations", which consists of 20 multiple-choice questions. The first 14 examine understanding of the lectures; the last 6 require code. Below I excerpt those 6 programming questions, implement them in Python, and save the code as MLFex1.py.
Problems
Question 15 For Questions 15-20, you will play with PLA and pocket algorithm. First, we use an artificial data set to study PLA. The data set is in
https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat
Each line of the data set contains one (xn, yn) with xn ∈ R4. The first 4 numbers of the line are the components of xn in order; the last number is yn. Please initialize your algorithm with w = 0 and take sign(0) as −1. Implement a version of PLA by visiting examples in the naive cycle using the order of examples in the data set. Run the algorithm on the data set. What is the number of updates before the algorithm halts? Options: ≥201 updates / 51-200 updates / <10 updates / 31-50 updates / 11-30 updates
Question 16 Implement a version of PLA by visiting examples in fixed, pre-determined random cycles throughout the algorithm. Run the algorithm on the data set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average number of updates before the algorithm halts? Options: ≥201 updates / 11-30 updates / 51-200 updates / 31-50 updates / <10 updates
Question 17 Implement a version of PLA by visiting examples in fixed, pre-determined random cycles throughout the algorithm, while changing the update rule to be
w_{t+1} ← w_t + η y_{n(t)} x_{n(t)}
with η = 0.5. Note that your PLA in the previous Question corresponds to η = 1. Please repeat your experiment for 2000 times, each with a different random seed. What is the average number of updates before the algorithm halts? Options: 51-200 updates / <10 updates / 31-50 updates / 11-30 updates / ≥201 updates
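Since w is initialized to 0, every weight vector the algorithm produces with η = 0.5 is exactly half of the one produced with η = 1, so all sign predictions, and hence the sequence of mistakes and the update count, are identical. A minimal self-contained check on made-up toy data (the function and data names are my own, not from the assignment):

```python
import numpy as np

def pla_updates(X, Y, eta):
    """Run full-pass PLA in fixed order; return updates until a mistake-free pass."""
    w = np.zeros(X.shape[1])
    updates = 0
    while True:
        mistake = False
        for x, y in zip(X, Y):
            # sign(0) is taken as -1, per the problem statement
            pred = 1 if x.dot(w) > 0 else -1
            if pred != y:
                w += eta * y * x
                updates += 1
                mistake = True
        if not mistake:
            return updates

# Tiny linearly separable toy set (first coordinate is the constant-1 bias)
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]])
Y = np.array([1, -1, 1, -1])
print(pla_updates(X, Y, 1.0) == pla_updates(X, Y, 0.5))  # True: same halting behavior
```

This is why Question 17's expected answer matches Question 16's: rescaling η cannot change anything when the initial weights are zero.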
Question 18 Next, we play with the pocket algorithm. Modify your PLA in Question 16 to visit examples purely randomly, and then add the ‘pocket’ steps to the algorithm. We will use
https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_train.dat
as the training data set D, and
https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_18_test.dat
as the test set for "verifying" the g returned by your algorithm (see Lecture 4 about verifying). The sets are of the same format as the previous one. Run the pocket algorithm with a total of 50 updates on D, and verify the performance of wPOCKET using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set? Options: 0.6-0.8 / <0.2 / 0.4-0.6 / ≥0.8 / 0.2-0.4
Question 19 Modify your algorithm in Question 18 to return w50 (the PLA vector after 50 updates) instead of ŵ (the pocket vector) after 50 updates. Run the modified algorithm on D, and verify the performance using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set? Options: <0.2 / ≥0.8 / 0.4-0.6 / 0.6-0.8 / 0.2-0.4
Question 20 Modify your algorithm in Question 18 to run for 100 updates instead of 50, and verify the performance of wPOCKET using the test set. Please repeat your experiment for 2000 times, each with a different random seed. What is the average error rate on the test set? Options: <0.2 / 0.2-0.4 / 0.6-0.8 / ≥0.8 / 0.4-0.6
Code Explanation
import urllib2
import random
import numpy as np

# Download the raw data for this exercise from the given url and save it locally,
# as MLFex1_15_train.dat, MLFex1_18_train.dat, or MLFex1_18_test.dat
def getRawDataSet(url):
    dataSet = urllib2.urlopen(url)
    filename = 'MLFex1_' + url.split('_')[1] + '_' + url.split('_')[2]
    with open(filename, 'w') as fr:
        fr.write(dataSet.read())
    return filename
# Read the data set from the saved file and return it as arrays X, y
def getDataSet(filename):
    with open(filename, 'r') as f:
        dataSet = f.readlines()   # read every line of the file into a list
    num = len(dataSet)  # number of examples
    # initialize the feature matrix X (with a bias column) and label vector Y
    X = np.zeros((num, 5))
    Y = np.zeros((num, 1))
    for i in range(num):
        data = dataSet[i].strip().split()
        X[i, 0] = 1.0
        X[i, 1] = float(data[0])
        X[i, 2] = float(data[1])
        X[i, 3] = float(data[2])
        X[i, 4] = float(data[3])
        Y[i, 0] = int(data[4])
    return X, Y
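Each line of the .dat files holds four feature values followed by the label, and getDataSet prepends the constant 1 as the bias coordinate x0. A quick illustration of the parsing on a single made-up line (the values here are invented, not from the real data):

```python
import numpy as np

line = "0.97681 0.10723 0.64385 0.29556 1"   # hypothetical sample row
data = line.strip().split()
x = np.array([1.0] + [float(v) for v in data[:4]])  # x0 = 1 for the bias term
y = int(data[4])
print(x.shape, y)  # (5,) 1
```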
# Sign function; per the problem statement, sign(0) is taken as -1
def sign(x, w):
    if np.dot(x, w)[0] > 0:
        return 1
    else:
        return -1
# Naive-cycle PLA: visit the examples in the order given in the data set
# X, Y: feature matrix and label vector, with shapes m*(n+1) and m*1
# w: initial weight vector, shape (n+1)*1
# eta: learning rate
# updates: maximum number of passes to run
# Returns (flag, iterations, w): flag is True if a mistake-free pass was reached
# (i.e. w fits the training data); iterations is the pass count at that point
def trainPLA_Naive(X, Y, w, eta, updates):
    iterations = 0  # number of passes before halting
    num = len(X)    # number of training examples
    flag = True
    for i in range(updates):
        flag = True
        for j in range(num):
            if sign(X[j], w) != Y[j, 0]:
                flag = False
                w += eta * Y[j, 0] * np.matrix(X[j]).T
                break
            else:
                continue
        if flag == True:
            iterations = i
            break
    return flag, iterations, w
# Random-cycle PLA: visit the examples in a freshly shuffled order on each pass
# Returns the same (flag, iterations, w) as trainPLA_Naive
def trainPLA_Fixed(X, Y, w, eta, updates):
    iterations = 0  # number of passes before halting
    num = len(X)    # number of training examples
    flag = True
    for i in range(updates):
        flag = True
        rand_sort = range(len(X))
        rand_sort = random.sample(rand_sort, len(X))
        for j in range(num):
            if sign(X[rand_sort[j]], w) != Y[rand_sort[j], 0]:
                flag = False
                w += eta * Y[rand_sort[j], 0] * np.matrix(X[rand_sort[j]]).T
                break
            else:
                continue
        if flag == True:
            iterations = i
            break
    return flag, iterations, w
# Pocket PLA: a greedy variant that accepts an update only when it lowers
# the training error; returns the resulting (pocket) weight vector
def pocketPLA(X, Y, w, eta, updates):
    num = len(X)    # number of training examples
    for i in range(updates):
        rand_sort = range(len(X))
        rand_sort = random.sample(rand_sort, len(X))
        for j in range(num):
            if (sign(X[rand_sort[j]], w) != Y[rand_sort[j], 0]):
                wt = w + eta * Y[rand_sort[j], 0] * np.matrix(X[rand_sort[j]]).T
                errate0 = errorTest(X, Y, w)
                errate1 = errorTest(X, Y, wt)
                if errate1 < errate0:
                    w = wt
                break

    return w
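For comparison, the textbook pocket algorithm always takes the PLA update on a running weight vector and separately remembers the best vector seen so far, whereas pocketPLA above only accepts an update when it improves the training error. A self-contained sketch of the standard variant on toy data with one noisy label (all names here are my own):

```python
import random
import numpy as np

def predict(x, w):
    # sign with sign(0) taken as -1, matching the convention above
    return 1 if x.dot(w) > 0 else -1

def error_rate(X, Y, w):
    return np.mean([predict(x, w) != y for x, y in zip(X, Y)])

def pocket(X, Y, updates, seed=0):
    """Standard pocket: always apply PLA updates to w, but keep the best w seen."""
    rng = random.Random(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), error_rate(X, Y, w)
    for _ in range(updates):
        i = rng.randrange(len(X))          # purely random visiting order
        if predict(X[i], w) != Y[i]:
            w = w + Y[i] * X[i]            # the PLA update is always taken...
            err = error_rate(X, Y, w)
            if err < best_err:             # ...but the pocket keeps only the best
                best_w, best_err = w.copy(), err
    return best_w, best_err

# Noisy toy data: mostly separable, with one flipped label at the end
X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0], [1.0, 2.5]])
Y = np.array([1, 1, -1, -1, -1])           # last label is "noise"
w_hat, e_in = pocket(X, Y, updates=50)
print(e_in <= error_rate(X, Y, np.zeros(2)))  # pocket never worsens training error
```

Because the pocket weight is replaced only on strict improvement, its training error is non-increasing over the run; the greedy pocketPLA above shares that property but can stall on updates it refuses to take.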
# For Question 19: plain random-cycle PLA that returns the weight vector after
# the given number of update attempts (w50), instead of the pocket vector
def trainPLA(X, Y, w, eta, updates):
    num = len(X)
    for i in range(updates):
        rand_sort = range(len(X))
        rand_sort = random.sample(rand_sort, len(X))
        for j in range(num):
            if (sign(X[rand_sort[j]], w) != Y[rand_sort[j], 0]):
                w += eta * Y[rand_sort[j], 0] * np.matrix(X[rand_sort[j]]).T
                break
    return w
# Compute the error rate of hypothesis w on the data set (X, y)
def errorTest(X, y, w):
    error = 0.0   # count of examples the hypothesis misclassifies
    for i in range(len(X)):
        if sign(X[i], w) != y[i, 0]:
            error += 1.0
    return error / len(X)
# Driver for Question 15
def question15():
    url = 'https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat'
    filename = getRawDataSet(url)
    X, y = getDataSet(filename)
    w0 = np.zeros((5, 1))
    eta = 1
    updates = 80
    flag, iterations, w = trainPLA_Naive(X, y, w0, eta, updates)
    print flag
    print iterations
    print w
# Driver for Question 16
def question16():
    # url = 'https://d396qusza40orc.cloudfront.net/ntumlone%2Fhw1%2Fhw1_15_train.dat'
    filename = 'MLFex1_15_train.dat'         # getRawDataSet(url)
    X, y = getDataSet(filename)
    w0 = np.zeros((5, 1))
    eta = 1      # Question 16 uses eta = 1 (eta = 0.5 would be Question 17)
    updates = 200
    times = []
    for i in range(2000):
        w0 = np.zeros((5, 1))
        flag, iterations, w = trainPLA_Fixed(X, y, w0, eta, updates)
        if flag == True:
            times.append(iterations)
    print times
    print len(times)
    return float(sum(times)) / len(times)   # float division for the true average
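Questions 16-20 all share the same driver pattern: repeat the experiment 2000 times with different random seeds and report the average. A generic harness for that pattern (the stand-in trial below is hypothetical; note the explicit float division, which guards against Python 2 integer truncation):

```python
def average_over_trials(trial, n_trials=2000):
    """Run trial(seed) once per seed and return the mean as a float."""
    results = [trial(seed) for seed in range(n_trials)]
    return sum(results) / float(len(results))  # float division, not integer division

# Hypothetical stand-in trial: any of the training routines above could be
# wrapped as a function of the seed and averaged this way
print(average_over_trials(lambda seed: seed % 2, n_trials=100))  # 0.5
```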

Author: [@Marcovaldo], March 24, 2016