Kaggle classic challenge, Titanic survival prediction, machine learning experiment----02
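Table of contents
- Kaggle classic challenge, Titanic survival prediction, machine learning experiment----02
- I. Introduction
- II. The problem
- III. Problem analysis
- IV. Step-by-step procedure
- 1. Read the data and preprocess it
- 2. Split the labels and features, and initialize the parameters
- 3. Run the linear regression
- 4. Test the model and score it
- 5. Save the results
- V. Complete code
- VI. Core code of this algorithm: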
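I. Introduction
RMS Titanic was an Olympic-class ocean liner operated by the British White Star Line, with a gross tonnage of about 46,000 tons. At the time she was the largest and most luxuriously fitted passenger ship in the world, and she had a reputation as "unsinkable".
Unfortunately, she met disaster on her maiden voyage from Southampton, England, to New York. At about 23:40 on 14 April 1912, Titanic struck an iceberg, which tore open the starboard side from the bow toward midships and flooded five watertight compartments. At about 02:20 on 15 April, the hull broke in two and sank to the floor of the Atlantic, about 3,700 m down. Of the 2,224 crew and passengers, 1,517 died, and only 333 bodies were recovered. The sinking is one of the deadliest peacetime maritime disasters. The wreck was not found until 1985 and is now protected by UNESCO.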
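II. The problem
So here is the question: what did it take to survive the Titanic disaster?
In other words, if we know everything about a person in advance, how can we judge whether that person will perish?
This is where machine learning comes in.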
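III. Problem analysis
Clearly, each person in the disaster either perished or survived, so this is really a binary classification problem, the kind of task logistic regression is meant for. But since this is an early experiment, we set logistic regression aside for now and still use linear regression to process and analyze the data. We will repeat the experiment later with logistic regression; this time, linear regression it is.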
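To make the difference concrete, here is a minimal sketch (not part of the original experiment) that compares a raw linear hypothesis with the sigmoid-squashed value that logistic regression would use. The weights and feature values are made up for illustration only.

```python
import math


def linear_hypothesis(theta, x):
    """theta[0] is the intercept; theta[1:] pair up with the features in x."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and feature rows, chosen only for illustration.
theta = [1.5, -0.5, 0.02]
rows = [[1.0, 40.0], [3.0, 5.0], [6.0, 10.0]]

for x in rows:
    z = linear_hypothesis(theta, x)
    # The raw linear output can leave [0, 1]; the sigmoid output cannot.
    print(f"linear={z:.2f}  sigmoid={sigmoid(z):.2f}")
```

The linear output is only turned into a class by thresholding it at 0.5, whereas the sigmoid output can be read as a probability, which is why logistic regression is the more natural fit for a survived/perished label.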
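IV. Step-by-step procedure
1. Read the data and preprocess it
First read the file (csv), then delete the columns that are not needed, then make a few adjustments, which gives the result below: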
import pandas


def read_data_of_csv(file_name):
    """
    read the csv files to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    # 1. get the data and preprocess it before the machine learning
    df_train = read_data_of_csv("titanic/train.csv")
    # drop the columns that this experiment does not use as features
    df_train.drop(["Embarked", "Cabin", "Ticket", "Name", "PassengerId"],
                  axis=1, inplace=True)
    # change Sex from a string to an int: male -> 1, female -> 0
    df_train["Sex"] = (df_train["Sex"] == "male").astype(int)
    # delete every row that still contains a NaN
    df_train.dropna(axis=0, how="any", inplace=True)
    # show the result
    print(df_train)
The final df_train:
     Survived  Pclass  Sex   Age  SibSp  Parch     Fare
0           0       3    1  22.0      1      0   7.2500
1           1       1    0  38.0      1      0  71.2833
2           1       3    0  26.0      0      0   7.9250
3           1       1    0  35.0      1      0  53.1000
4           0       3    1  35.0      0      0   8.0500
..        ...     ...   ..   ...    ...    ...      ...
885         0       3    0  39.0      0      5  29.1250
886         0       2    1  27.0      0      0  13.0000
887         1       1    0  19.0      0      0  30.0000
889         1       1    1  26.0      0      0  30.0000
890         0       3    1  32.0      0      0   7.7500

[714 rows x 7 columns]
Process finished with exit code 0
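For readers who want to see what the three preprocessing steps do without installing anything, here is a dependency-free sketch using only the standard library. The four-row sample is made up (it only mimics the column layout of train.csv), and `preprocess` and `KEPT` are hypothetical names, not part of the original code.

```python
import csv
import io

# A made-up four-row sample in the column layout of the Kaggle train.csv.
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,A,male,22,1,0,T1,7.25,,S
2,1,1,B,female,38,1,0,T2,71.2833,C85,C
3,1,3,C,female,,0,0,T3,7.925,,S
4,0,2,D,male,54,0,0,T4,13,,S
"""

# The seven columns that survive the drop step.
KEPT = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]


def preprocess(rows):
    """Keep the seven used columns, encode Sex, skip rows with a missing value."""
    cleaned = []
    for row in rows:
        if any(row[name] == "" for name in KEPT):
            continue  # same effect as dropna(how="any") on the kept columns
        record = {name: float(row[name]) for name in KEPT if name != "Sex"}
        record["Sex"] = 1 if row["Sex"] == "male" else 0
        cleaned.append(record)
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = preprocess(rows)
print(len(cleaned))  # 3: the row with the missing Age is skipped
```

Note that the empty Cabin values do not cause a row to be skipped, because Cabin is dropped before the missing-value check.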
2. Split the labels and features, and initialize the parameters
The features and the label in the data need to be separated before processing.
After that, the parameters need to be initialized.
    # 2. split the label and the features
    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train
    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    # 3. set the initial params of the linear regression
    alpha = float(input("input the alpha:\n"))
    # 7 params: the intercept theta 0 plus one theta per feature;
    # the list must be pre-filled before its items can be assigned by index
    list_theta = [0.0] * 7
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\n"))
3. Run the linear regression
Here is the code that trains the model.
One thing must be stressed:
the choice of parameters (alpha, the initial thetas, and the number of iterations) is very important!
    # 4. run the gradient descent of the linear regression
    iter_of_regression = int(input("input the number of iter times:\n"))
    for num_of_iter_of_regression in range(iter_of_regression):
        # hypothesis: 7 thetas (intercept included) for the 6 features
        h_x = list_theta[0] + \
            list_theta[1] * X_train.loc[:, "Pclass"] + \
            list_theta[2] * X_train.loc[:, "Sex"] + \
            list_theta[3] * X_train.loc[:, "Age"] + \
            list_theta[4] * X_train.loc[:, "SibSp"] + \
            list_theta[5] * X_train.loc[:, "Parch"] + \
            list_theta[6] * X_train.loc[:, "Fare"]
        # residual between the labels and the current predictions
        loss = y_train - h_x
        # print the sum of squared residuals (loss ^ 2) to watch convergence
        print(loss.T.dot(loss))
        # update list_theta[0] (the intercept) with the mean residual
        sum_loss = loss.sum()
        list_theta[0] += alpha * sum_loss / len(loss)
        # update the other thetas; loss was computed before any update,
        # so all thetas are updated simultaneously
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
        print(list_theta)
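The point about parameter choice can be seen on a toy problem. The sketch below is not the original experiment: it fits a single-feature line with the same kind of update rule on made-up data, and `gradient_descent` is a hypothetical helper. With 200 iterations, alpha = 0.1 gets close to the true line, alpha = 0.01 has not converged yet, and alpha = 0.5 diverges.

```python
def gradient_descent(xs, ys, alpha, iterations):
    """Fit y ~ theta0 + theta1 * x with batch gradient descent.

    Returns both thetas and the final sum of squared residuals.
    """
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Residuals are computed once per iteration, before any update,
        # so both thetas are updated simultaneously.
        loss = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
        theta0 += alpha * sum(loss) / m
        theta1 += alpha * sum(x * l for x, l in zip(xs, loss)) / m
    loss = [y - (theta0 + theta1 * x) for x, y in zip(xs, ys)]
    return theta0, theta1, sum(l * l for l in loss)


# Toy data generated from y = 1 + 2x (made up for this illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

for alpha in (0.01, 0.1, 0.5):
    theta0, theta1, sse = gradient_descent(xs, ys, alpha, iterations=200)
    print(f"alpha={alpha}: theta0={theta0:.3f} theta1={theta1:.3f} sse={sse:.3e}")
```

The usable range of alpha depends on the scale of the features, so a value that works on this toy data says nothing about the right value for the Titanic columns.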
4. Test the model and score it
    # 5. test the model
    # the same preprocessing as for the training data
    # note: this assumes titanic/test.csv contains a Survived column;
    # the test.csv distributed by Kaggle does not include one
    df_test = read_data_of_csv("titanic/test.csv")
    df_test.drop(["Embarked", "Cabin", "Ticket", "Name", "PassengerId"],
                 axis=1, inplace=True)
    # change Sex from a string to an int: male -> 1, female -> 0
    df_test["Sex"] = (df_test["Sex"] == "male").astype(int)
    # delete the NaN data
    df_test.dropna(axis=0, how="any", inplace=True)
    print(df_test)
    y_test = df_test.loc[:, "Survived"]
    X_test = df_test.loc[:, "Pclass":"Fare"]
    print(y_test)
    print(X_test)
    # predict from the test features; predictions built from X_train
    # would not line up with the rows of y_test
    test_h_x = list_theta[0] + \
        list_theta[1] * X_test.loc[:, "Pclass"] + \
        list_theta[2] * X_test.loc[:, "Sex"] + \
        list_theta[3] * X_test.loc[:, "Age"] + \
        list_theta[4] * X_test.loc[:, "SibSp"] + \
        list_theta[5] * X_test.loc[:, "Parch"] + \
        list_theta[6] * X_test.loc[:, "Fare"]
    # 6. score the model
    N = len(y_test)
    # the total number of test rows
    num_of_win_the_prediction_of_the_model = 0
    # a prediction counts as right whenever it matches the label,
    # whether the person survived or not
    for win_number_of_the_test_of_each in range(N):
        if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
            # prediction < 0.5 => 0
            test_h_x.iloc[win_number_of_the_test_of_each] = 0
            if y_test.iloc[win_number_of_the_test_of_each] == 0:
                num_of_win_the_prediction_of_the_model += 1
        else:
            # prediction >= 0.5 => 1
            test_h_x.iloc[win_number_of_the_test_of_each] = 1
            if y_test.iloc[win_number_of_the_test_of_each] == 1:
                num_of_win_the_prediction_of_the_model += 1
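The scoring rule used here (threshold the raw output at 0.5, then count matches) can be sketched as a small standalone function. `accuracy` is a hypothetical helper and the numbers are made up; the length check is a cheap guard against scoring predictions that were computed from a different table than the labels.

```python
def accuracy(predictions, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    hits = sum(
        1 for p, y in zip(predictions, labels)
        if (1 if p >= threshold else 0) == y
    )
    return hits / len(labels)


# Made-up raw outputs of a linear model and the matching true labels.
raw_outputs = [0.1, 0.7, 0.5, 1.3, -0.2]
true_labels = [0, 1, 0, 1, 1]
print(accuracy(raw_outputs, true_labels))  # 3 of 5 match -> 0.6
```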
5. Save the results
We open a txt file and save the data in it.
    # 7. save the model
    with open("result.txt", "w") as f:
        f.write("result record:\n")
        # 1. alpha
        f.write("the alpha:\n")
        f.write(f"{alpha}")
        f.write("\n")
        # 2. the thetas
        f.write("the thetas of the model:\n")
        recording_number_position = 1
        for theta_of_the_last in list_theta:
            f.write(f"{recording_number_position}. ")
            f.write(f"{theta_of_the_last}")
            f.write("\n")
            recording_number_position += 1
        f.write("\n")
        # 3. the features
        f.write("features:\n")
        r_n_p = 1
        for data_of_feature in list(X_train):
            f.write(f"{r_n_p}. ")
            f.write(f"{data_of_feature}")
            f.write("\n")
            r_n_p += 1
        f.write("\n")
        # 4. the score
        f.write("score:\n")
        f.write(f"{num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        f.write(f" or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
    # the with statement closes the file, so no explicit f.close() is needed
    sys.exit("bye bye!")
The final data, as written to result.txt:
result record:
the alpha:
0.0001
the thetas of the model:
1. 1.2486050331704237
2. -0.16522774554911307
3. -0.48151480883845743
4. -0.00541835063447703
5. -0.04947303449597998
6. -0.011060136497625706
7. 0.0005749482239116638
features:
1. Pclass
2. Sex
3. Age
4. SibSp
5. Parch
6. Fare
score:
0.4954682779456193
or the 100 :49.546827794561935
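The plain-text record is easy to read but awkward to load back into a program. As an alternative sketch (an assumption, not what the original code does), the same fields could be written as JSON with the standard library. `save_result` and `load_result` are hypothetical helpers, and the values passed in are placeholders rather than results of the experiment.

```python
import json
import os
import tempfile


def save_result(path, alpha, thetas, features, score):
    """Write the run's settings and score as JSON so they can be reloaded."""
    record = {
        "alpha": alpha,
        "thetas": list(thetas),
        "features": list(features),
        "score": score,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)


def load_result(path):
    """Read a record written by save_result back into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Placeholder values, not results from the experiment.
path = os.path.join(tempfile.mkdtemp(), "result.json")
save_result(path, 0.0001, [0.0, 0.1], ["Pclass"], 0.5)
print(load_result(path)["thetas"])
```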
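The score recorded in result.txt shows that this model is not very good: it does not even reach a passing grade. An accuracy this close to 50% is what coin flipping would achieve on a binary label. The run that produced this file computed test_h_x from X_train rather than X_test, so each prediction was scored against the label of a different passenger, and the figure therefore says little about the model itself.
No matter, though: later we will redo this case with logistic regression, which should clearly do better.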
V. Complete code
"""
the titanic survival prediction of machine learning
by
author: Hu Yu Xuan
at
time: 2021/8/9
using
method: linear regression
"""
import sys

import pandas


def read_data_of_csv(file_name):
    """
    read the csv files to get the data of the titanic accident
    :param file_name: the name of the file
    :return: df -> the data in the file that is opened above
    """
    df = pandas.read_csv(file_name)
    return df


if __name__ == '__main__':
    # 1. get the data and preprocess it before the machine learning
    df_train = read_data_of_csv("titanic/train.csv")
    # drop the columns that this experiment does not use as features
    df_train.drop(["Embarked", "Cabin", "Ticket", "Name", "PassengerId"],
                  axis=1, inplace=True)
    # change Sex from a string to an int: male -> 1, female -> 0
    df_train["Sex"] = (df_train["Sex"] == "male").astype(int)
    # delete every row that still contains a NaN
    df_train.dropna(axis=0, how="any", inplace=True)
    print(df_train)

    # 2. split the label and the features
    y_train = df_train.loc[:, "Survived"]
    print(y_train)
    # y of the train
    X_train = df_train.loc[:, "Pclass": "Fare"]
    print(X_train)
    # X of the train

    # 3. set the initial params of the linear regression
    alpha = float(input("input the alpha:\n"))
    # 7 params: the intercept theta 0 plus one theta per feature
    list_theta = [0.0] * 7
    for number_of_the_total_thetas_list in range(6 + 1):
        list_theta[number_of_the_total_thetas_list] = float(
            input(f"input the theta {number_of_the_total_thetas_list}:\n"))

    # 4. run the gradient descent of the linear regression
    iter_of_regression = int(input("input the number of iter times:\n"))
    for num_of_iter_of_regression in range(iter_of_regression):
        # hypothesis: 7 thetas (intercept included) for the 6 features
        h_x = list_theta[0] + \
            list_theta[1] * X_train.loc[:, "Pclass"] + \
            list_theta[2] * X_train.loc[:, "Sex"] + \
            list_theta[3] * X_train.loc[:, "Age"] + \
            list_theta[4] * X_train.loc[:, "SibSp"] + \
            list_theta[5] * X_train.loc[:, "Parch"] + \
            list_theta[6] * X_train.loc[:, "Fare"]
        # residual between the labels and the current predictions
        loss = y_train - h_x
        # print the sum of squared residuals (loss ^ 2) to watch convergence
        print(loss.T.dot(loss))
        # update list_theta[0] (the intercept) with the mean residual
        sum_loss = loss.sum()
        list_theta[0] += alpha * sum_loss / len(loss)
        # update the other thetas; loss was computed before any update,
        # so all thetas are updated simultaneously
        for c in [1, 2, 3, 4, 5, 6]:
            list_theta[c] += \
                alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
        print(list_theta)
    # output of one run: the sum of squared residuals, then the thetas
    # 103.57666895852172
    # [1.2448209211390562, -0.164299592246984, -0.48128889342546577, -0.005382484123675659, -0.04934860676751743, -0.01102572287455978, 0.0005836091240344807]

    # 5. test the model
    # the same preprocessing as for the training data
    # note: this assumes titanic/test.csv contains a Survived column;
    # the test.csv distributed by Kaggle does not include one
    df_test = read_data_of_csv("titanic/test.csv")
    df_test.drop(["Embarked", "Cabin", "Ticket", "Name", "PassengerId"],
                 axis=1, inplace=True)
    # change Sex from a string to an int: male -> 1, female -> 0
    df_test["Sex"] = (df_test["Sex"] == "male").astype(int)
    # delete the NaN data
    df_test.dropna(axis=0, how="any", inplace=True)
    print(df_test)
    y_test = df_test.loc[:, "Survived"]
    X_test = df_test.loc[:, "Pclass":"Fare"]
    print(y_test)
    print(X_test)
    # predict from the test features; predictions built from X_train
    # would not line up with the rows of y_test
    test_h_x = list_theta[0] + \
        list_theta[1] * X_test.loc[:, "Pclass"] + \
        list_theta[2] * X_test.loc[:, "Sex"] + \
        list_theta[3] * X_test.loc[:, "Age"] + \
        list_theta[4] * X_test.loc[:, "SibSp"] + \
        list_theta[5] * X_test.loc[:, "Parch"] + \
        list_theta[6] * X_test.loc[:, "Fare"]

    # 6. score the model
    N = len(y_test)
    # the total number of test rows
    num_of_win_the_prediction_of_the_model = 0
    # a prediction counts as right whenever it matches the label,
    # whether the person survived or not
    for win_number_of_the_test_of_each in range(N):
        if test_h_x.iloc[win_number_of_the_test_of_each] < 0.5:
            # prediction < 0.5 => 0
            test_h_x.iloc[win_number_of_the_test_of_each] = 0
            if y_test.iloc[win_number_of_the_test_of_each] == 0:
                num_of_win_the_prediction_of_the_model += 1
        else:
            # prediction >= 0.5 => 1
            test_h_x.iloc[win_number_of_the_test_of_each] = 1
            if y_test.iloc[win_number_of_the_test_of_each] == 1:
                num_of_win_the_prediction_of_the_model += 1

    # 7. save the model
    with open("result.txt", "w") as f:
        f.write("result record:\n")
        # 1. alpha
        f.write("the alpha:\n")
        f.write(f"{alpha}")
        f.write("\n")
        # 2. the thetas
        f.write("the thetas of the model:\n")
        recording_number_position = 1
        for theta_of_the_last in list_theta:
            f.write(f"{recording_number_position}. ")
            f.write(f"{theta_of_the_last}")
            f.write("\n")
            recording_number_position += 1
        f.write("\n")
        # 3. the features
        f.write("features:\n")
        r_n_p = 1
        for data_of_feature in list(X_train):
            f.write(f"{r_n_p}. ")
            f.write(f"{data_of_feature}")
            f.write("\n")
            r_n_p += 1
        f.write("\n")
        # 4. the score
        f.write("score:\n")
        f.write(f"{num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
        f.write(f" or the 100 :{100 * num_of_win_the_prediction_of_the_model / N}")
        f.write("\n")
    # the with statement closes the file, so no explicit f.close() is needed
    sys.exit("bye bye!")
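At first, I thought this experiment would turn out like this:

Once it was done, the result looked like this:
(a total wreck, not even a passing grade)

Fortunately, this is only my second experiment. I lack experience, and the model I chose was not a good fit, which led to this gap. Next I will do a logistic regression, which should be much better.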
VI. Core code of this algorithm:
Updating the parameter values:
(Pay attention to the exact code of the update!)
for num_of_iter_of_regression in range(iter_of_regression):
    # hypothesis: 7 thetas (intercept included) for the 6 features
    h_x = list_theta[0] + \
        list_theta[1] * X_train.loc[:, "Pclass"] + \
        list_theta[2] * X_train.loc[:, "Sex"] + \
        list_theta[3] * X_train.loc[:, "Age"] + \
        list_theta[4] * X_train.loc[:, "SibSp"] + \
        list_theta[5] * X_train.loc[:, "Parch"] + \
        list_theta[6] * X_train.loc[:, "Fare"]
    # residual between the labels and the current predictions
    loss = y_train - h_x
    # print the sum of squared residuals (loss ^ 2) to watch convergence
    print(loss.T.dot(loss))
    # update list_theta[0] (the intercept) with the mean residual
    sum_loss = loss.sum()
    list_theta[0] += alpha * sum_loss / len(loss)
    # update the other thetas; loss was computed before any update,
    # so all thetas are updated simultaneously
    for c in [1, 2, 3, 4, 5, 6]:
        list_theta[c] += \
            alpha * (X_train.loc[:, list(X_train)[(c - 1)]].T.dot(loss)) / len(loss)
    print(list_theta)
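Written out, the update performed on each iteration is the batch gradient descent step for the mean squared error. The notation below is an assumption, since the original gives only code: m is the number of training rows, x_j^{(i)} is feature j of row i, and x_0^{(i)} = 1 for the intercept.

```latex
\begin{aligned}
h_\theta\bigl(x^{(i)}\bigr) &= \theta_0 + \sum_{j=1}^{6} \theta_j \, x_j^{(i)} \\
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \Bigl( y^{(i)} - h_\theta\bigl(x^{(i)}\bigr) \Bigr)^{2} \\
\theta_j &\leftarrow \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m} \Bigl( y^{(i)} - h_\theta\bigl(x^{(i)}\bigr) \Bigr) \, x_j^{(i)}, \qquad j = 0, 1, \dots, 6
\end{aligned}
```

The plus sign in the update follows from the residual being defined as label minus prediction, which matches loss = y_train - h_x in the code; for j = 0 the sum reduces to the plain sum of residuals used for list_theta[0].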
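That is all for now.
Later you can come back and take a look at my logistic regression, which should do better.
Thanks for reading.
If you found this interesting, please leave a like to show your support!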
Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/292855.html
