Kaggle Classic: Titanic Survival Prediction, Machine Learning Experiment 02

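Contents

- Kaggle Classic: Titanic Survival Prediction, Machine Learning Experiment 02
  - I. Introduction
  - II. The Problem
  - III. Problem Analysis
  - IV. Step by Step
    - 1. Reading and preprocessing the data
    - 2. Splitting labels and features, and initializing the parameters
    - 3. Running the linear regression
    - 4. Testing and scoring the model
    - 5. Saving the results
  - V. Full Code
  - VI. The Core Code of This Algorithm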

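I. Introduction

The RMS Titanic was an Olympic-class ocean liner operated by the British White Star Line, with a gross tonnage of roughly 46,000. She was the largest and most luxuriously fitted passenger ship of her day and was famously called "unsinkable".

Sadly, disaster struck on her maiden voyage from Southampton, England, to New York. At about 23:40 on 14 April 1912 the Titanic struck an iceberg, which tore open the starboard hull from the bow toward midships and flooded five watertight compartments. At about 02:20 on 15 April the hull broke in two and sank to the floor of the Atlantic, about 3,700 meters down. Of the 2,224 crew and passengers on board, 1,517 died, and only 333 bodies were recovered. The sinking remains one of the deadliest peacetime maritime disasters. The wreck was not located until 1985 and is now protected by UNESCO.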

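II. The Problem

So here is the question: what did it take to survive the Titanic disaster?

Put differently, if we know everything about a person in advance, how do we judge whether that person will perish?

This is a job for machine learning.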

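III. Problem Analysis

Clearly, each person on the Titanic either perished or survived, so this is really a binary classification problem, the kind logistic regression is designed for. Since this is still an early experiment, though, we will hold off on logistic regression and handle the data with linear regression instead, treating a prediction of 0.5 or above as "survived". We will redo this experiment later with logistic regression; this time, linear regression it is.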
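In symbols, the approach described above amounts to fitting a linear hypothesis to the 0/1 label and then thresholding it. The notation here ($\theta_j$, $x_j$, $\hat{y}$) is ours, not the post's; the six inputs are the features used later (Pclass, Sex, Age, SibSp, Parch, Fare), and the 0.5 cut-off is the one applied in the scoring code.

```latex
h_\theta(x) = \theta_0 + \sum_{j=1}^{6} \theta_j x_j,
\qquad
\hat{y} =
\begin{cases}
1 & \text{if } h_\theta(x) \ge 0.5 \\
0 & \text{if } h_\theta(x) < 0.5
\end{cases}
```

Nothing constrains $h_\theta(x)$ to stay between 0 and 1, which is one reason a linear fit is only a stopgap for a yes/no outcome.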

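IV. Step by Step

1. Reading and preprocessing the data

First we read the CSV file, then drop the columns we do not need and make a few adjustments, which gives the result below: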



import pandas


def read_data_of_csv(file_name):
    """
    Read a csv file containing the Titanic passenger data.
    :param file_name: path of the csv file
    :return: df -> DataFrame with the contents of the file
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    """
    main
    """

    # no need to split the data into train and test sets here

    """
    1. load the data and preprocess it before the machine learning step
    """
    df_train = read_data_of_csv("titanic/train.csv")
    # print(df_train)
    # clean the data before training
    df_train.drop("Embarked", axis=1, inplace=True)
    # Embarked does not look useful to me, so drop the column
    df_train.drop("Cabin", axis=1, inplace=True)
    # drop the Cabin column
    df_train.drop("Ticket", axis=1, inplace=True)
    # drop the Ticket column
    df_train.drop("Name", axis=1, inplace=True)
    # drop the Name column
    df_train.drop("PassengerId", axis=1, inplace=True)
    # drop the PassengerId column

    for int_number_of_len in range(len(df_train)):
        if df_train.loc[int_number_of_len, "Sex"] == "male":
            df_train.loc[int_number_of_len, "Sex"] = 1
            # male -> 1
        else:
            df_train.loc[int_number_of_len, "Sex"] = 0
            # female -> 0
    # Sex is now encoded as the integers 1 / 0 instead of strings
    df_train.dropna(axis=0, how="any", inplace=True)
    # drop every row that still contains a NaN
    print(df_train)
    # show the result

The final df_train:

     Survived  Pclass Sex   Age  SibSp  Parch     Fare
0           0       3   1  22.0      1      0   7.2500
1           1       1   0  38.0      1      0  71.2833
2           1       3   0  26.0      0      0   7.9250
3           1       1   0  35.0      1      0  53.1000
4           0       3   1  35.0      0      0   8.0500
..        ...     ...  ..   ...    ...    ...      ...
885         0       3   0  39.0      0      5  29.1250
886         0       2   1  27.0      0      0  13.0000
887         1       1   0  19.0      0      0  30.0000
889         1       1   1  26.0      0      0  30.0000
890         0       3   1  32.0      0      0   7.7500

[714 rows x 7 columns]

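The preprocessing above can also be written without an explicit row loop. The following is a minimal sketch, not the code used in this post: `preprocess` and `DROPPED_COLUMNS` are hypothetical names, and the tiny frame is made-up data standing in for titanic/train.csv.

```python
import pandas as pd

# Columns dropped in this post, collected into one list (hypothetical name).
DROPPED_COLUMNS = ["Embarked", "Cabin", "Ticket", "Name", "PassengerId"]


def preprocess(df):
    """Return a cleaned copy: drop unused columns, encode Sex, drop NaN rows."""
    out = df.drop(columns=DROPPED_COLUMNS)
    # 1 for "male", 0 for anything else, matching the loop in the post.
    out["Sex"] = (out["Sex"] == "male").astype(int)
    return out.dropna(axis=0, how="any")


# Tiny made-up frame with the same columns as the Kaggle training file.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.2833, 7.925],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})
clean = preprocess(toy)
print(clean)
```

Because the comparison produces a real integer column, the later arithmetic on Sex does not depend on how pandas handles mixed-type columns. The same function could be reused for the test file.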

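2. Splitting labels and features, and initializing the parameters

The features and the label have to be separated so they can be handled individually.

After that, the parameters need to be initialized.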


    """
    2.split the label and the features
    """
    
    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train

    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    """
    3. set the initial params of the linear regression
    """
    alpha = float(input("input the alpha:\n"))
    # learning rate

    list_theta = [0, 0, 0, 0, 0, 0, 0]
    # 7 params: 1 intercept + 6 feature weights
    # (an empty list cannot be assigned to by index, so pre-fill it)
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\n"))
        # input the theta
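Typing seven thetas by hand on every run gets tedious. Below is a minimal sketch of a non-interactive alternative, assuming all-zero starting values are acceptable (the post does not say which starting values were actually entered). `init_params` is a hypothetical helper; the alpha value is the one recorded in result.txt further down.

```python
def init_params(n_features, theta_init=0.0):
    """Return n_features + 1 thetas; the extra one is the intercept."""
    return [float(theta_init)] * (n_features + 1)


alpha = 0.0001                 # learning rate recorded in result.txt
list_theta = init_params(6)    # 6 features: Pclass ... Fare
print(alpha, list_theta)
```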

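3. Running the linear regression

Here is the code that trains the model.

One thing must be stressed!!

The choice of parameters (the learning rate alpha, the initial thetas, and the number of iterations) matters a great deal!!!!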



    """
    4. run the gradient-descent iterations of the linear regression
    """
    iter_of_regression = int(input("input the number of iter times:\n"))

    for num_of_iter_of_regression in range(iter_of_regression):
        # make iter_of_regression times of the regression
        h_x = list_theta[0] + \
              list_theta[1] * X_train.loc[:, "Pclass"] + \
              list_theta[2] * X_train.loc[:, "Sex"] + \
              list_theta[3] * X_train.loc[:, "Age"] + \
              list_theta[4] * X_train.loc[:, "SibSp"] + \
              list_theta[5] * X_train.loc[:, "Parch"] + \
              list_theta[6] * X_train.loc[:, "Fare"]
        # 7 theta
        # 6 x of the feature

        loss = \
            y_train - h_x
        # calculate the loss

        # loss ^ 2
        print(loss.T.dot(loss))


        # the sum of loss:
        sum_loss = 0
        # plus the loss
        for o in range(len(loss)):
            # print(loss.iloc[o])
            sum_loss += loss.iloc[o]  # float
        # list_theta[0] = list_theta[0] + alpha * sum_loss / len(loss)
        # print(loss)
        list_theta[0] += \
            alpha * sum_loss / len(loss)
        # update the list_theta[0]

        # print(list(X_train.index))
        # 0's index
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
        # update the list theta of the params
        # X_train.loc[:, c - 1].T.dot(loss)
        # T transfer, dot dot_multiply

        # do all the thetas !!

        # continue

        print(list_theta)
        # theta
        # print(loss.T.dot(loss))
        # print(loss.T.dot(loss))
        # loss ^ 2

        continue
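The same training loop can be expressed with matrix operations. This is a sketch under our own assumptions rather than the post's implementation: `gradient_descent` is a hypothetical name, the thetas start at zero, and the data is synthetic so that the correct answer is known in advance.

```python
import numpy as np


def gradient_descent(X, y, alpha, n_iter):
    """Batch gradient descent for linear regression.

    X: (m, n) feature matrix, y: (m,) targets.
    Returns theta of length n + 1, with theta[0] as the intercept.
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])   # column of ones pairs with theta[0]
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        residual = y - Xb @ theta          # the quantity the post calls "loss"
        theta += alpha * (Xb.T @ residual) / m
    return theta


# Synthetic data generated from a known line: y = 1 + 2*x1 - 3*x2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]
theta_demo = gradient_descent(X_demo, y_demo, alpha=0.1, n_iter=2000)
print(theta_demo)
```

The demo can afford alpha=0.1 because its features are on a unit scale. The raw Titanic columns (Age and Fare in particular) are on much larger scales, which is consistent with the far smaller alpha recorded in this post; standardizing the features first usually permits larger steps.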

4. Testing and scoring the model


    """
    5.do the test of this project
    """

    # the same operation
    df_test = read_data_of_csv("titanic/test.csv")
    df_test.drop("Embarked", axis=1, inplace=True)
    df_test.drop("Cabin", axis=1, inplace=True)
    df_test.drop("Ticket", axis=1, inplace=True)
    df_test.drop("Name", axis=1, inplace=True)
    df_test.drop("PassengerId", axis=1, inplace=True)
    # delete

    for int_number_of_len in range(len(df_test)):
        if df_test.loc[int_number_of_len, "Sex"] == "male":
            df_test.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex 1
        else:
            df_test.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex 0
    # Sex is now encoded as the integers 1 / 0 instead of strings

    # delete the NaNs
    df_test.dropna(axis=0, how="any", inplace=True)
    # delete the NaN data

    print(df_test)
    # show the result

    y_test = df_test.loc[:, "Survived"]
    # note: this needs a "Survived" column in test.csv; the stock Kaggle test.csv has none

    X_test = df_test.loc[:, "Pclass":"Fare"]

    print(y_test)
    print(X_test)

    # predictions must come from the test features (X_test, not X_train)
    test_h_x = list_theta[0] + \
        list_theta[1] * X_test.loc[:, "Pclass"] + \
        list_theta[2] * X_test.loc[:, "Sex"] + \
        list_theta[3] * X_test.loc[:, "Age"] + \
        list_theta[4] * X_test.loc[:, "SibSp"] + \
        list_theta[5] * X_test.loc[:, "Parch"] + \
        list_theta[6] * X_test.loc[:, "Fare"]
    # the final function

    """
    6.score the model
    """

    N = len(y_test)
    # the total of the test number
    num_of_win_the_prediction_of_the_model = 0
    # predict the result

    # as long as we are right, whether the person is alive or dead, it does not matter
    for win_number_of_the_test_of_each in range(N):
        if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
            test_h_x.iloc[win_number_of_the_test_of_each] = 0
            # prediction < 0.5 => 0
            if y_test.iloc[win_number_of_the_test_of_each] == 0:
                num_of_win_the_prediction_of_the_model += 1
                # right!
                # right, so we num of win ++
            else:
                pass
                # wrong

        else:
            test_h_x.iloc[win_number_of_the_test_of_each] = 1
            # prediction >= 0.5 => 1
            if y_test.iloc[win_number_of_the_test_of_each] == 1:
                num_of_win_the_prediction_of_the_model += 1
                # right
            else:
                pass
                # wrong
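The scoring loop above boils down to "threshold at 0.5, then count matches". A compact sketch of that idea follows; `accuracy_at_threshold` is a hypothetical helper, and the scores and labels are made up purely to exercise it.

```python
import numpy as np


def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score equals the 0/1 label."""
    predictions = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(predictions == np.asarray(labels)))


# Made-up values: two predictions end up right, two end up wrong.
demo_scores = [0.10, 0.70, 0.50, 0.49]
demo_labels = [0, 1, 0, 1]
demo_accuracy = accuracy_at_threshold(demo_scores, demo_labels)
print(demo_accuracy)
```

As in the loop, a score of exactly 0.5 counts as a prediction of 1.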

5. Saving the results

We open a txt file and save the results in it.


    """
    7.save the model
    """

    with open("result.txt", "w") as f:
        f.write("result record:\n")
        # result

        f.write("the alpha:\n")
        f.write(f"{alpha}")
        f.write("\n")
        # 1.alpha

        f.write("the thetas of the model:\n")
        recording_number_position = 1
        for theta_of_the_last in list_theta:
            f.write(f"{recording_number_position}. ")
            f.write(f"{theta_of_the_last}")
            f.write("\n")
            recording_number_position += 1
            continue
        # 2.write the thetas
        f.write("\n")

        f.write("features:\n")
        r_n_p = 1
        for data_of_feature in list(X_train):
            f.write(f"{r_n_p}. ")
            f.write(f"{data_of_feature}")
            f.write("\n")
            r_n_p += 1
            continue
        f.write("\n")
        # 3.features

        f.write("score:\n")
        f.write(f"{num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        f.write(f"  or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        # 4.score

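        # no explicit f.close() needed: the with-block closes the file

    print("bye bye!")
    sys.exit(0)
    # sys.exit("bye bye!") would print to stderr and exit with status 1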
"""
END
"""

The final output looks like this:

result record:
the alpha:
0.0001
the thetas of the model:
1. 1.2486050331704237
2. -0.16522774554911307
3. -0.48151480883845743
4. -0.00541835063447703
5. -0.04947303449597998
6. -0.011060136497625706
7. 0.0005749482239116638

features:
1. Pclass
2. Sex
3. Age
4. SibSp
5. Parch
6. Fare

score:
0.4954682779456193
  or the 100 :49.546827794561935
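A free-form text record like this is easy to read but awkward to load back into a program. As an alternative, here is a sketch that stores the same four items as JSON. This swaps in a different file format from the one used in this post; `save_result` is a hypothetical helper and the values passed to it are placeholders, not results from this experiment.

```python
import json
import os
import tempfile


def save_result(path, alpha, thetas, features, score):
    """Write the run's settings and score to a JSON file."""
    record = {
        "alpha": alpha,
        "thetas": list(thetas),
        "features": list(features),
        "score": score,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)


# Placeholder values written to a temporary location, then read back.
demo_path = os.path.join(tempfile.gettempdir(), "result_demo.json")
save_result(demo_path, 0.01, [0.0, 1.0], ["x1"], 0.75)
with open(demo_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)
```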

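As the numbers above show, this model is not very good; it does not even reach a passing grade, wwww~~

One caveat: this score comes from a run in which the predictions were computed from X_train rather than X_test, so a value this close to 50% probably reflects that mix-up more than the model itself.

No matter, though: we will redo this case later with logistic regression, which should do noticeably better.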

V. Full Code

"""

the titanic survival prediction of machine learning

by
author: Hu Yu Xuan

at
time: 2021/8/9

using
method: linear regression

"""


import numpy
import pandas
import sys


def read_data_of_csv(file_name):
    """
    read the csv files to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    """
    main
    """

    # no need to split the data into train and test sets here

    """
    1. load the data and preprocess it before the machine learning step
    """

    df_train = read_data_of_csv("titanic/train.csv")
    # print(df_train)
    # clean the data before training
    df_train.drop("Embarked", axis=1, inplace=True)
    # Embarked does not look useful to me, so drop the column
    df_train.drop("Cabin", axis=1, inplace=True)
    # drop the Cabin column
    df_train.drop("Ticket", axis=1, inplace=True)
    # drop the Ticket column
    df_train.drop("Name", axis=1, inplace=True)
    # drop the Name column
    df_train.drop("PassengerId", axis=1, inplace=True)
    # drop the PassengerId column

    for int_number_of_len in range(len(df_train)):
        if df_train.loc[int_number_of_len, "Sex"] == "male":
            df_train.loc[int_number_of_len, "Sex"] = 1
            # male -> 1
        else:
            df_train.loc[int_number_of_len, "Sex"] = 0
            # female -> 0
    # Sex is now encoded as the integers 1 / 0 instead of strings
    df_train.dropna(axis=0, how="any", inplace=True)
    # drop every row that still contains a NaN
    print(df_train)
    # show the result

    """
    2.split the label and the features
    """

    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train

    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    """
    3. set the initial params of the linear regression
    """

    alpha = float(input("input the alpha:\n"))
    # alpha
    list_theta = [0, 0, 0, 0, 0, 0, 0]
    # 7 params
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\n"))
        # input the theta

    """
    4. run the gradient-descent iterations of the linear regression
    """

    iter_of_regression = int(input("input the number of iter times:\n"))

    for num_of_iter_of_regression in range(iter_of_regression):
        # make iter_of_regression times of the regression
        h_x = list_theta[0] + \
              list_theta[1] * X_train.loc[:, "Pclass"] + \
              list_theta[2] * X_train.loc[:, "Sex"] + \
              list_theta[3] * X_train.loc[:, "Age"] + \
              list_theta[4] * X_train.loc[:, "SibSp"] + \
              list_theta[5] * X_train.loc[:, "Parch"] + \
              list_theta[6] * X_train.loc[:, "Fare"]
        # 7 theta
        # 6 x of the feature

        loss = \
            y_train - h_x
        # calculate the loss

        # loss ^ 2
        print(loss.T.dot(loss))


        # the sum of loss:
        sum_loss = 0
        # plus the loss
        for o in range(len(loss)):
            # print(loss.iloc[o])
            sum_loss += loss.iloc[o]  # float
        # list_theta[0] = list_theta[0] + alpha * sum_loss / len(loss)
        # print(loss)
        list_theta[0] += \
            alpha * sum_loss / len(loss)
        # update the list_theta[0]

        # print(list(X_train.index))
        # 0's index
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
        # update the list theta of the params
        # X_train.loc[:, c - 1].T.dot(loss)
        # T transfer, dot dot_multiply

        # do all the thetas !!

        # continue

        print(list_theta)
        # theta
        # print(loss.T.dot(loss))
        # print(loss.T.dot(loss))
        # loss ^ 2

        continue
    # sample output from one iteration (sum of squared residuals, then the thetas):
    # 103.57666895852172
    # [1.2448209211390562, -0.164299592246984, -0.48128889342546577,
    #  -0.005382484123675659, -0.04934860676751743, -0.01102572287455978,
    #  0.0005836091240344807]

    """
    5.do the test of this project
    """

    # the same operation
    df_test = read_data_of_csv("titanic/test.csv")
    df_test.drop("Embarked", axis=1, inplace=True)
    df_test.drop("Cabin", axis=1, inplace=True)
    df_test.drop("Ticket", axis=1, inplace=True)
    df_test.drop("Name", axis=1, inplace=True)
    df_test.drop("PassengerId", axis=1, inplace=True)
    # delete

    for int_number_of_len in range(len(df_test)):
        if df_test.loc[int_number_of_len, "Sex"] == "male":
            df_test.loc[int_number_of_len, "Sex"] = 1
            # if male then set the sex 1
        else:
            df_test.loc[int_number_of_len, "Sex"] = 0
            # if female then set the sex 0
    # Sex is now encoded as the integers 1 / 0 instead of strings

    # delete the NaNs
    df_test.dropna(axis=0, how="any", inplace=True)
    # delete the NaN data

    print(df_test)
    # show the result

    y_test = df_test.loc[:, "Survived"]
    # note: this needs a "Survived" column in test.csv; the stock Kaggle test.csv has none

    X_test = df_test.loc[:, "Pclass":"Fare"]

    print(y_test)
    print(X_test)

    # predictions must come from the test features (X_test, not X_train)
    test_h_x = list_theta[0] + \
        list_theta[1] * X_test.loc[:, "Pclass"] + \
        list_theta[2] * X_test.loc[:, "Sex"] + \
        list_theta[3] * X_test.loc[:, "Age"] + \
        list_theta[4] * X_test.loc[:, "SibSp"] + \
        list_theta[5] * X_test.loc[:, "Parch"] + \
        list_theta[6] * X_test.loc[:, "Fare"]
    # the final function

    """
    6.score the model
    """

    N = len(y_test)
    # the total of the test number
    num_of_win_the_prediction_of_the_model = 0
    # predict the result

    # as long as we are right, whether the person is alive or dead, it does not matter
    for win_number_of_the_test_of_each in range(N):
        if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
            test_h_x.iloc[win_number_of_the_test_of_each] = 0
            # prediction < 0.5 => 0
            if y_test.iloc[win_number_of_the_test_of_each] == 0:
                num_of_win_the_prediction_of_the_model += 1
                # right!
                # right, so we num of win ++
            else:
                pass
                # wrong

        else:
            test_h_x.iloc[win_number_of_the_test_of_each] = 1
            # prediction >= 0.5 => 1
            if y_test.iloc[win_number_of_the_test_of_each] == 1:
                num_of_win_the_prediction_of_the_model += 1
                # right
            else:
                pass
                # wrong

    """
    7.save the model
    """

    with open("result.txt", "w") as f:
        f.write("result record:\n")
        # result

        f.write("the alpha:\n")
        f.write(f"{alpha}")
        f.write("\n")
        # 1.alpha

        f.write("the thetas of the model:\n")
        recording_number_position = 1
        for theta_of_the_last in list_theta:
            f.write(f"{recording_number_position}. ")
            f.write(f"{theta_of_the_last}")
            f.write("\n")
            recording_number_position += 1
            continue
        # 2.write the thetas
        f.write("\n")

        f.write("features:\n")
        r_n_p = 1
        for data_of_feature in list(X_train):
            f.write(f"{r_n_p}. ")
            f.write(f"{data_of_feature}")
            f.write("\n")
            r_n_p += 1
            continue
        f.write("\n")
        # 3.features

        f.write("score:\n")
        f.write(f"{num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        f.write(f"  or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        # 4.score

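        # no explicit f.close() needed: the with-block closes the file

    print("bye bye!")
    sys.exit(0)
    # sys.exit("bye bye!") would print to stderr and exit with status 1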
"""
END
"""

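At first, I imagined this experiment would turn out like this:

(image not preserved)

Once it was done, the result looked like this:
(a complete wreck, a failing grade)
(image not preserved)
Fortunately this is only my second experiment. I am short on experience and the model I picked was not a good fit, which led to this gap. Later I will do a logistic regression, which should turn out much better.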

VI. The Core Code of This Algorithm

Updating the parameter values:
(pay close attention to the exact code of the update!!)

# make iter_of_regression times of the regression
        h_x = list_theta[0] + \
              list_theta[1] * X_train.loc[:, "Pclass"] + \
              list_theta[2] * X_train.loc[:, "Sex"] + \
              list_theta[3] * X_train.loc[:, "Age"] + \
              list_theta[4] * X_train.loc[:, "SibSp"] + \
              list_theta[5] * X_train.loc[:, "Parch"] + \
              list_theta[6] * X_train.loc[:, "Fare"]
        # 7 theta
        # 6 x of the feature

        loss = \
            y_train - h_x
        # calculate the loss

        # loss ^ 2
        print(loss.T.dot(loss))


        # the sum of loss:
        sum_loss = 0
        # plus the loss
        for o in range(len(loss)):
            # print(loss.iloc[o])
            sum_loss += loss.iloc[o]  # float
        # list_theta[0] = list_theta[0] + alpha * sum_loss / len(loss)
        # print(loss)
        list_theta[0] += \
            alpha * sum_loss / len(loss)
        # update the list_theta[0]

        # print(list(X_train.index))
        # 0's index
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
        # update the list theta of the params
        # X_train.loc[:, c - 1].T.dot(loss)
        # T transfer, dot dot_multiply

        # do all the thetas !!

        # continue

        print(list_theta)
        # theta
        # print(loss.T.dot(loss))
        # print(loss.T.dot(loss))
        # loss ^ 2
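Written as equations, each pass of the loop above performs the update below. The notation is ours, not the post's: $m$ is the number of training rows (`len(loss)`), $\alpha$ is `alpha`, $r^{(i)}$ is entry $i$ of `loss`, and $x_j^{(i)}$ is feature $j$ of row $i$.

```latex
\begin{aligned}
r^{(i)} &= y^{(i)} - h_\theta\big(x^{(i)}\big) \\
\theta_0 &\leftarrow \theta_0 + \frac{\alpha}{m} \sum_{i=1}^{m} r^{(i)} \\
\theta_j &\leftarrow \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} r^{(i)} x_j^{(i)},
\qquad j = 1, \dots, 6
\end{aligned}
```

This is a gradient-descent step on $J(\theta) = \frac{1}{2m} \sum_{i} \big(r^{(i)}\big)^2$. Because `loss` is computed once before any theta changes, all seven parameters are updated simultaneously from the same residuals.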

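That is all for now.

Come back later for my logistic regression write-up, which should do better.

Thanks for reading.

If you found this interesting, a like would be much appreciated!!!!

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/292855.html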
