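Hello everyone, I'm back, and as promised this update covers recurrent neural networks.
Things seem to have quieted down lately, though I'm still not in great shape. The freshman classes are about to give their showcase, and I hope the folks in Class 51 place well. This post uses an LSTM/RNN to analyze movie reviews. The network is fairly complex and took a long time to train, so the benefit of a GPU was really noticeable.
I'm also planning to start a machine learning column. I'm not sure what you all think; there is a poll at the end, so please vote. Thank you!
The next update will cover transfer learning. It is already in preparation and coming soon!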
import tensorflow as tf
tf.__version__
'2.6.0'
tf.test.is_gpu_available()
WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
True
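Introduction to recurrent neural networks (RNN)
Many problems are sequential in nature: natural language processing, video and image processing, stock trading data, and so on.
For example:
- jupyter successfully joined the Party in 2021, won the National Scholarship in 2021, and secured a postgraduate recommendation in 2022. Hahaha, let me dream a little first.
Notice that only the first clause has the subject jupyter, yet as human readers we understand that jupyter did everything that follows. That is what sequential dependence means.

Multi-layer fully connected networks and convolutional networks process only the current input and keep no memory of earlier inputs, so they handle sequential problems poorly.
(As an aside, we have already covered fully connected and convolutional networks.)
An RNN has a special structure: the input at each step includes the hidden state produced at the previous step, so information is passed forward along the sequence.
A plain RNN, however, suffers from vanishing and exploding gradients. Backpropagation through time multiplies the gradient at every step by the recurrent weight and by the derivative of the tanh activation (which is at most 1), so the product tends to shrink toward zero or blow up.
- When the sequence is too long, the gradient produced at time t vanishes after propagating back only a few steps along the time axis, so it cannot influence the distant past.
As a result, an RNN forgets information from long ago and remembers only recent information, which makes it hard to process long text effectively.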
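To make the vanishing effect concrete, here is a minimal sketch using a scalar recurrence h_t = tanh(w * h_{t-1} + x). The function name `gradient_through_time` and the values w = 0.9, h0 = 0.5 are illustrative assumptions, not something taken from the model in this post. It multiplies the per-step derivatives, which is what backpropagation through time does along the sequence.

```python
import math

def gradient_through_time(w, steps, h0=0.5, x=0.0):
    """Return |d h_T / d h_0| for the scalar recurrence h_t = tanh(w * h_{t-1} + x)."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule for one step: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)

# The gradient with respect to the initial state shrinks as the sequence grows.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

Because every factor here is below 0.9, the product after 50 steps is bounded by 0.9^50, which is already less than 0.01. With a linear recurrence and |w| > 1 the same product would grow instead, which is the exploding case.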

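Introduction to long short-term memory networks (LSTM)

Problems with RNNs:
- Exploding gradients
- Vanishing gradients
Solutions:
Exploding gradients can usually be handled by an optimizer that clips gradients, i.e. gradient clipping: if the norm of the gradient exceeds a given threshold, the gradient is scaled down proportionally.
Vanishing gradients are mitigated by improving the RNN structure with LSTM.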
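Gradient clipping by norm is simple enough to sketch in a few lines. The helper `clip_by_norm` below is a hypothetical pure-Python illustration of the rule described above, not the routine Keras uses internally.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small enough: leave unchanged
    scale = max_norm / norm         # same direction, shorter length
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
print(clipped)
print(math.sqrt(sum(g * g for g in clipped)))
```

In Keras the same idea is available through the `clipnorm` argument of the optimizers, for example `tf.keras.optimizers.Adam(clipnorm=1.0)`; the model in this post does not use it.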
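In a traditional RNN, the hidden unit at each step performs just a simple tanh or ReLU operation.
The basic structure of an LSTM is similar to that of an RNN. The main difference is that the LSTM improves the hidden layer: each unit in an LSTM acts as a memory cell.
Advantages of LSTM over RNN:
- Mitigates the vanishing gradient problem
- Uses gate structures to cope with long-range dependencies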
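The gate structure can be sketched with a scalar LSTM cell. This is a simplified illustration under stated assumptions: the function `lstm_step`, the parameter dictionary layout, and all the numeric values are made up for the demo, and real layers use weight matrices rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the new candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell state
    c = f * c_prev + i * g            # additive update of the memory cell
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate almost fully open, input gate almost closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(x=0.3, h_prev=h, c_prev=c, p=params)
print(c)   # still close to the initial cell state of 1.0 after 100 steps
```

The point of the demo is the line `c = f * c_prev + i * g`: when the forget gate stays near 1, the cell state is carried forward almost unchanged, which is how the cell can keep information over many steps.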
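I. Building your own dataset
- This approach is closer to what you would do in practice
- Basic steps:
  - Obtain the data and settle on a data format
  - Tokenize the text; English can be split on spaces, and for Chinese see jieba
  - Build a word index that assigns each word a numeric ID
  - Convert each passage into a vector of word indices
  - Convert each passage into a word-embedding matrix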
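The tokenize, index, and convert steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the helpers `build_word_index` and `texts_to_ids` are hypothetical names, and they split on whitespace only, whereas the Keras `Tokenizer` used later in this post also strips punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown words are dropped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]   # keep only the most frequent words
        sequences.append(ids)
    return sequences

docs = ["the movie was good", "the movie was bad bad"]
word_index = build_word_index(docs)
print(word_index)
print(texts_to_ids(["the bad movie was great"], word_index))
```

The word "great" never appeared in `docs`, so it has no index and is silently dropped from the sequence. Passing `num_words` trims the vocabulary further by frequency rank.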
import os
import tarfile
import urllib.request
import numpy as np
import re
from random import randint
# Dataset URL
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# Local path for the downloaded archive
file_path = 'data/aclImdb_v1.tar.gz'
if not os.path.exists('data'):
    os.mkdir('data')
if not os.path.isfile(file_path):
    print('downloading')
    result = urllib.request.urlretrieve(url,filename=file_path)
    print('ok',result)
else:
    print(file_path,'already exists')
downloading
ok ('data/aclImdb_v1.tar.gz', <http.client.HTTPMessage object at 0x7f1599fc17d0>)
# Extract the archive
if not os.path.exists('data/aclImdb'):
    tfile = tarfile.open(file_path,'r:gz')
    print('extracting…')
    result = tfile.extractall('data/') # extractall('data/') unpacks the archive into the data directory and returns None
    print('ok',result)
else:
    print('data/aclImdb already exists')
extracting…
ok None
# Read the dataset (aside: I'm not fluent with re yet and need to brush up)
# Strip unwanted markup from the text, such as the HTML tag <br />
def remove_tags(text):
    # re.compile builds a Pattern object; [^>]+ matches everything up to the closing '>'.
    # Without the ^ the pattern matches nothing and tags such as <br /> stay in the text,
    # which is why they still appear in the sample outputs in this post.
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('',text) # replace every match with the empty string
# Wrap dataset loading in a function
def read_files(file_type):
    # 1) Collect every file path in file_list and count the positive and negative samples
    path = 'data/aclImdb/'
    file_list = []
    positive_file_path = path+file_type+'/pos/'
    for f in os.listdir(positive_file_path):
        file_list.append(positive_file_path+f)
    positive_num = len(file_list)
    negitave_file_path = path+file_type+'/neg/'
    for f in os.listdir(negitave_file_path):
        file_list.append(negitave_file_path+f)
    negitave_num = len(file_list) - positive_num
    print('read',file_type,':',len(file_list))
    print('positive_num',positive_num)
    print('negitave_num',negitave_num)
    # 2) Build the labels ourselves: the folder name (pos/neg) is each sample's label
    labels = [[1,0]]*positive_num + [[0,1]]*negitave_num # adding lists concatenates them; list * n repeats the contents
    # 3) Read all the texts
    features = []
    for fi in file_list:
        with open(fi,'rt',encoding='utf8') as f:
            features+=[remove_tags(''.join(f.readlines()))]
    return features,labels
train_x,train_y = read_files('train')
test_x,test_y = read_files('test')
test_y = np.array(test_y)
train_y = np.array(train_y)
read train : 25000
positive_num 12500
negitave_num 12500
read test : 25000
positive_num 12500
negitave_num 12500
train_x[0] # features
'It started out slow after an excellent animated intro, as the director had a bunch of characters and school setting to develop. Once the bet is on, though, the movie picks up the pace as it\'s a race against time to see if a certain number of worms can be eaten by 7 pm. We had a good opportunity on the way home to discuss some things with our son: bullies, helping others, mind over matter when you don\'t want to do something.<br /><br />Of special note is the girl who played Erica (Erk): Hallie Kate Eisenberg. The director kinda sneaks her in unexpectedly, and when she is on-screen she is captivating. She\'s one of those "Hey, she looks familiar" faces, and then I remembered that she was the little girl that Pepsi featured about 8 years ago. She was also in "Paulie", that movie about the parrot who tries to find his way home.<br /><br />Ms. Eisenberg made many TV and movie appearances in \'99-00, but then was not seen much for the next few years. She\'s now 14 and is growing up to be a beautiful woman. Her smile really warms up the screen. If she can get some more good roles she could have as good a career (or better?) than Haley Joel Osment, another three named kid actor, but hopefully without some of the problems that Osment has been in lately.<br /><br />Anywhozitz, according to my 8 y.o. son, who just finished reading the story, the film did not seem to follow the book all that well, but was entertaining none the less. The ending of the film seemed like a big setup for some sequels (How to Eat Boiled Slugs? Escargot Kid\'s Style?), which might not be such a bad thing. It was nice to take the family to a movie and not have to worry about language, violence or sex scenes.<br /><br />One other good aspect of the movie was the respect/fear engendered by the principal Mr. Burdock (Boilerplate). Movies nowadays tend to show adult authority figures as buffoons. While he has one particular goofy scene, he ruled the school with a firm hand. 
It was also nice to see Andrea Martin getting some work.'
train_y[0] # a positive review
array([1, 0])
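The sample review above still contains `<br />` tags, which is what you get when the tag pattern matches nothing. A quick check on a made-up sentence (the `sample` string is an assumption for the demo) shows the difference between a character class with and without the `^`:

```python
import re

sample = "A good film.<br /><br />Worth seeing."

no_match_pattern = re.compile(r'<[>]+>')   # '<', then one or more '>', then '>'
tag_pattern = re.compile(r'<[^>]+>')       # '<', then anything except '>', then '>'

print(no_match_pattern.sub('', sample))    # text comes back unchanged
print(tag_pattern.sub('', sample))         # tags are removed
```

Replacing the tags with a space instead of the empty string may be preferable in practice, since removing them outright glues the neighboring sentences together ("film.Worth").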
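II. Data processing
1. Build the vocabulary
token = tf.keras.preprocessing.text.Tokenizer(num_words=4000) # num_words=4000: only the most frequent words (indices below 4000) are kept when converting text
token.fit_on_texts(train_x) # build the vocabulary from train_x
2. Convert text to lists of integers (word-index sequences)
train_sequences = token.texts_to_sequences(train_x) # map each word to its index in the vocabulary, i.e. its frequency rank
test_sequences = token.texts_to_sequences(test_x)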
train_x[0]
'It started out slow after an excellent animated intro, as the director had a bunch of characters and school setting to develop. Once the bet is on, though, the movie picks up the pace as it\'s a race against time to see if a certain number of worms can be eaten by 7 pm. We had a good opportunity on the way home to discuss some things with our son: bullies, helping others, mind over matter when you don\'t want to do something.<br /><br />Of special note is the girl who played Erica (Erk): Hallie Kate Eisenberg. The director kinda sneaks her in unexpectedly, and when she is on-screen she is captivating. She\'s one of those "Hey, she looks familiar" faces, and then I remembered that she was the little girl that Pepsi featured about 8 years ago. She was also in "Paulie", that movie about the parrot who tries to find his way home.<br /><br />Ms. Eisenberg made many TV and movie appearances in \'99-00, but then was not seen much for the next few years. She\'s now 14 and is growing up to be a beautiful woman. Her smile really warms up the screen. If she can get some more good roles she could have as good a career (or better?) than Haley Joel Osment, another three named kid actor, but hopefully without some of the problems that Osment has been in lately.<br /><br />Anywhozitz, according to my 8 y.o. son, who just finished reading the story, the film did not seem to follow the book all that well, but was entertaining none the less. The ending of the film seemed like a big setup for some sequels (How to Eat Boiled Slugs? Escargot Kid\'s Style?), which might not be such a bad thing. It was nice to take the family to a movie and not have to worry about language, violence or sex scenes.<br /><br />One other good aspect of the movie was the respect/fear engendered by the principal Mr. Burdock (Boilerplate). Movies nowadays tend to show adult authority figures as buffoons. While he has one particular goofy scene, he ruled the school with a firm hand. 
It was also nice to see Andrea Martin getting some work.'
type(train_sequences[0])
list
3. Make the converted integer lists the same length
'''
tf.keras.preprocessing.sequence.pad_sequences(train_sequences,  # list of lists of ints or floats
    padding='post',     # 'pre' or 'post': pad with zeros at the start or at the end
    truncating='post',  # 'pre' or 'post': truncate from the start or from the end
    maxlen=400)         # None or int: maximum length; longer sequences are truncated, shorter ones are zero-padded
'''
train_x = tf.keras.preprocessing.sequence.pad_sequences(train_sequences,
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=400)
test_x = tf.keras.preprocessing.sequence.pad_sequences(test_sequences,
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=400)
train_x[0]
array([ 9, 642, 43, 547, 100, 32, 318, 1121, 14, 1, 164,
66, 3, 758, 4, 102, 2, 392, 953, 5, 2058, 277,
1, 2130, 6, 20, 148, 1, 17, 2847, 53, 1, 1059,
14, 42, 3, 1519, 426, 55, 5, 64, 44, 3, 810,
608, 4, 67, 27, 31, 690, 72, 66, 3, 49, 1429,
20, 1, 93, 341, 5, 46, 180, 16, 260, 489, 2753,
405, 327, 117, 548, 51, 22, 89, 178, 5, 78, 139,
7, 7, 4, 315, 851, 6, 1, 247, 34, 253, 1861,
1, 164, 1927, 38, 8, 2, 51, 56, 6, 20, 265,
56, 6, 3712, 438, 28, 4, 145, 1395, 56, 269, 1076,
1586, 2, 92, 10, 2024, 12, 56, 13, 1, 114, 247,
12, 2553, 41, 705, 150, 593, 56, 13, 79, 8, 12,
17, 41, 1, 34, 494, 5, 166, 24, 93, 341, 7,
7, 1559, 90, 108, 245, 2, 17, 3309, 8, 18, 92,
13, 21, 107, 73, 15, 1, 372, 168, 150, 438, 147,
2425, 2, 6, 1784, 53, 5, 27, 3, 304, 252, 38,
1822, 63, 53, 1, 265, 44, 56, 67, 76, 46, 50,
49, 552, 56, 97, 25, 14, 49, 3, 609, 39, 125,
71, 157, 286, 769, 550, 281, 18, 2353, 206, 46, 4,
1, 709, 12, 45, 74, 8, 7, 7, 1789, 5, 58,
705, 1600, 489, 34, 40, 1763, 883, 1, 62, 1, 19,
119, 21, 303, 5, 790, 1, 271, 29, 12, 70, 18,
13, 439, 597, 1, 326, 1, 274, 4, 1, 19, 465,
37, 3, 191, 15, 46, 2278, 86, 5, 1893, 402, 60,
235, 21, 27, 138, 3, 75, 151, 9, 13, 324, 5,
190, 1, 220, 5, 3, 17, 2, 21, 25, 5, 3230,
41, 1098, 564, 39, 380, 136, 7, 7, 28, 82, 49,
1247, 4, 1, 17, 13, 1, 1158, 1088, 31, 1, 440,
99, 2876, 2345, 5, 120, 1155, 2576, 14, 134, 26, 45,
28, 840, 2962, 133, 26, 1, 392, 16, 3, 505, 9,
13, 79, 324, 5, 64, 1588, 394, 46, 154, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0], dtype=int32)
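The trailing zeros in the array above come from `padding='post'`. As a rough sketch of what padding and truncating at the end mean, here is a pure-Python equivalent for that one setting; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to maxlen from the end, then fill up with `value` at the end."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                             # truncating='post': drop the tail
        padded.append(seq + [value] * (maxlen - len(seq)))   # padding='post': zeros at the end
    return padded

print(pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
# → [[5, 3, 0, 0], [1, 2, 3, 4]]
```

Every row ends up with exactly `maxlen` entries, which is what lets the sequences be stacked into one 2-D array for the model.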
III. Building the model with an LSTM
model = tf.keras.models.Sequential()
# Embedding layer, which also serves as the input layer here
'''
model.add(tf.keras.layers.Embedding(output_dim=32,      # dimension of the output word vectors
                                    input_dim=4000,     # vocabulary size, i.e. largest word index + 1
                                    input_length=400))  # length of each input sequence
'''
model.add(tf.keras.layers.Embedding(output_dim=32,
                                    input_dim=4000,
                                    input_length=400))
# Recurrent layer (the commented-out lines are alternatives)
# model.add(tf.keras.layers.SimpleRNN(units=16)) # RNN
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=32))) # bidirectional LSTM
# model.add(tf.keras.layers.GlobalAveragePooling1D())
# model.add(tf.keras.layers.Flatten())
# Fully connected layer
model.add(tf.keras.layers.Dense(units=256,activation='relu'))
# Dropout layer to reduce overfitting
model.add(tf.keras.layers.Dropout(0.3))
# Output layer
model.add(tf.keras.layers.Dense(units=2,activation='softmax'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 400, 32) 128000
_________________________________________________________________
bidirectional (Bidirectional (None, 64) 16640
_________________________________________________________________
dense (Dense) (None, 256) 16640
_________________________________________________________________
dropout (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 2) 514
=================================================================
Total params: 161,794
Trainable params: 161,794
Non-trainable params: 0
_________________________________________________________________
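The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias; the variable names are my own.

```python
vocab_size, embed_dim = 4000, 32
lstm_units, dense_units, num_classes = 32, 256, 2

embedding_params = vocab_size * embed_dim
# One LSTM direction: 4 gates x (input weights + recurrent weights + bias)
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_one_direction
# The bidirectional layer concatenates both directions, so the dense layer sees 64 inputs
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total = embedding_params + bidirectional_params + dense_params + output_params
print(embedding_params, bidirectional_params, dense_params, output_params, total)
# → 128000 16640 16640 514 161794
```

It is a coincidence of the chosen sizes that the bidirectional layer and the first dense layer both come to 16,640 parameters.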
IV. Training
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(train_x,train_y,validation_split=0.2,epochs=10,batch_size=128,verbose=1)
Epoch 1/10
157/157 [==============================] - 32s 154ms/step - loss: 0.5359 - accuracy: 0.7192 - val_loss: 0.5319 - val_accuracy: 0.7444
Epoch 2/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2882 - accuracy: 0.8847 - val_loss: 0.5372 - val_accuracy: 0.7904
Epoch 3/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2302 - accuracy: 0.9119 - val_loss: 0.3840 - val_accuracy: 0.8646
Epoch 4/10
157/157 [==============================] - 23s 149ms/step - loss: 0.2008 - accuracy: 0.9280 - val_loss: 0.4596 - val_accuracy: 0.8344
Epoch 5/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1862 - accuracy: 0.9327 - val_loss: 0.5627 - val_accuracy: 0.7946
Epoch 6/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1749 - accuracy: 0.9380 - val_loss: 0.5431 - val_accuracy: 0.8148
Epoch 7/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1443 - accuracy: 0.9491 - val_loss: 0.4799 - val_accuracy: 0.8632
Epoch 8/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1283 - accuracy: 0.9553 - val_loss: 0.6568 - val_accuracy: 0.8078
Epoch 9/10
157/157 [==============================] - 23s 149ms/step - loss: 0.1087 - accuracy: 0.9632 - val_loss: 0.6196 - val_accuracy: 0.8314
Epoch 10/10
157/157 [==============================] - 23s 149ms/step - loss: 0.0960 - accuracy: 0.9688 - val_loss: 0.4496 - val_accuracy: 0.8698
import matplotlib.pyplot as plt
# Plot one training metric and its validation counterpart per epoch
def show_train_history(train_history,train_metrics,val_metrics):
    plt.plot(train_history[train_metrics])
    plt.plot(train_history[val_metrics])
    plt.title('Train History')
    plt.ylabel(train_metrics)
    plt.xlabel('epoch')
    plt.legend(['train','validation'],loc='upper left')
    plt.show()
show_train_history(history.history,'loss','val_loss')
![Training and validation loss per epoch](https://img.uj5u.com/2021/10/19/275515190826014.png)
show_train_history(history.history,'accuracy','val_accuracy')
![Training and validation accuracy per epoch](https://img.uj5u.com/2021/10/19/275515190826015.png)
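The validation accuracy and loss keep fluctuating while the training accuracy keeps climbing and the training loss keeps falling, which is a fairly clear sign that the model is overfitting somewhat.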
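One way to put a number on this is to look for the epoch with the lowest validation loss and at the gap between training and validation loss at the end. The lists below are copied from the training log above; the rest is a small sketch of my own.

```python
# Values copied from the training log above (epochs 1-10).
train_loss = [0.5359, 0.2882, 0.2302, 0.2008, 0.1862, 0.1749, 0.1443, 0.1283, 0.1087, 0.0960]
val_loss = [0.5319, 0.5372, 0.3840, 0.4596, 0.5627, 0.5431, 0.4799, 0.6568, 0.6196, 0.4496]

# Epoch numbers are 1-based, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
final_gap = val_loss[-1] - train_loss[-1]

print("best epoch by val_loss:", best_epoch)
print("final val/train loss gap:", round(final_gap, 4))
```

The validation loss bottoms out at epoch 3 while the training loss keeps dropping for seven more epochs. If you wanted to stop there automatically, Keras provides `tf.keras.callbacks.EarlyStopping(monitor='val_loss', restore_best_weights=True)`; the run in this post did not use it.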
V. Evaluation and prediction
model.evaluate(test_x,test_y,verbose=1) # verbose: 0 = silent, 1 = progress bar, 2 = one line per epoch
782/782 [==============================] - 40s 51ms/step - loss: 0.5644 - accuracy: 0.8374
[0.5644006133079529, 0.8374000191688538]
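The step counts in the progress bars (157 per epoch during training, 782 during evaluation) follow from the sample counts and batch sizes. The sketch below assumes that `evaluate` used the Keras default batch size of 32, since no `batch_size` was passed; the variable names are my own.

```python
import math

train_samples = 25000
test_samples = 25000
validation_percent = 20    # validation_split=0.2
fit_batch_size = 128
eval_batch_size = 32       # assumption: Keras default when batch_size is not given

# Samples actually used for gradient updates after the validation split
fit_samples = train_samples - train_samples * validation_percent // 100

fit_steps = math.ceil(fit_samples / fit_batch_size)
eval_steps = math.ceil(test_samples / eval_batch_size)
print(fit_samples, fit_steps, eval_steps)
# → 20000 157 782
```

The last batch in each case is a partial one (32 samples in training, 8 in evaluation), which is why the division is rounded up.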
pre = model.predict(test_x)
pre[0],test_y[0]
(array([9.996530e-01, 3.470438e-04], dtype=float32), array([1, 0]))
# Applying the model to a sentence I wrote myself
x = ["This is really a junk movie. Jupyter doesn't like it. Thank you! It's really bad"]
x = token.texts_to_sequences(x)
x = tf.keras.preprocessing.sequence.pad_sequences(x,
padding='post',
truncating='post',
maxlen=400)
x
array([[ 11, 6, 63, 3, 2579, 17, 149, 37, 9, 1289, 22,
42, 63, 75, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]], dtype=int32)
y = model.predict(x)
y
array([[0.12796064, 0.8720394 ]], dtype=float32)
state = {0:'pos',1:'neg'}
state[np.argmax(y)]
'neg'
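The lookup above works because the position of the largest softmax output matches the label layout `[1, 0]` for positive and `[0, 1]` for negative. A small helper that also returns the confidence could look like this; `decode` is a hypothetical name, and the probabilities are the ones printed earlier in this post.

```python
def decode(probabilities, labels=("pos", "neg")):
    """Return the label with the highest probability together with that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

print(decode([0.12796064, 0.8720394]))       # the hand-written negative review
print(decode([9.996530e-01, 3.470438e-04]))  # the first test sample
```

Returning the probability alongside the label makes it easier to spot borderline predictions, such as anything close to 0.5.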
Please credit the source when reposting. Article link: https://www.uj5u.com/qita/323305.html
