The data I am working with is stored in a CSV file:
Sentence #   Word          POS   Tag
Sentence1    YASHAWANTHA   NNP   B-PER
Sentence1    K             NNP   I-PER
Sentence1    S             NNP   I-PER
Sentence1    Mobile        NNP   O
Sentence1    :             :     O
Sentence1    -7353555773   JJ    O
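I am trying to take the dataset with the columns Sentence #, Word, POS and Tag, and convert every entry in the Word column to a Word2Vec vector.
Here I read the dataset and split it into sentences: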
from gensim.models import Word2Vec
import pandas as pd

data = pd.read_csv(path_to_csv)

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        # Collapse each sentence's rows into a list of (word, POS, tag) tuples.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        try:
            # Keys follow the "Sentence #" values in the CSV, e.g. "Sentence1".
            s = self.grouped["Sentence{}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(data)
sentences = getter.sentences
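For readers who want to see what `sentences` ends up looking like, here is a minimal dependency-free sketch of the same grouping. It swaps the pandas `groupby` for a plain `defaultdict`, and the `rows` list is just the sample data from above typed in by hand, so treat it as an illustration rather than the code from the question.

```python
from collections import defaultdict

# Sample rows copied from the CSV excerpt: (Sentence #, Word, POS, Tag).
rows = [
    ("Sentence1", "YASHAWANTHA", "NNP", "B-PER"),
    ("Sentence1", "K", "NNP", "I-PER"),
    ("Sentence1", "S", "NNP", "I-PER"),
    ("Sentence1", "Mobile", "NNP", "O"),
    ("Sentence1", ":", ":", "O"),
    ("Sentence1", "-7353555773", "JJ", "O"),
]

# Group the rows by sentence id, keeping (word, POS, tag) tuples in file order.
grouped = defaultdict(list)
for sent_id, word, pos, tag in rows:
    grouped[sent_id].append((word, pos, tag))

# One list of tuples per sentence, like getter.sentences in the question.
sentences = list(grouped.values())
print(sentences[0][:2])
```

Each element of `sentences` is one sentence, and `w[0]` / `w[2]` in the later list comprehensions pick out the word and the tag of each tuple.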
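Now I convert all words to their corresponding Word2Vec vectors, where word2idx is a dictionary whose keys are the word strings and whose values are the corresponding Word2Vec vectors: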
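# words is not defined in the question; presumably the list of words from the Word column.
vec_words = [[i] for i in words]
# size= and wv.vocab are gensim 3.x names; gensim 4 renamed them to vector_size= and wv.key_to_index.
vec_model = Word2Vec(vec_words, min_count=1, size=30)
word2idx = {}
for idx, key in enumerate(vec_model.wv.vocab):
    word2idx[key] = vec_model.wv[key]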
Then, for the tag column, I use a simple enumeration:
tag2idx = {t: i for i, t in enumerate(tags)}
Then I pad the words and the tags:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y = [to_categorical(i, num_classes=num_tags) for i in y]
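A side note on this padding step: to the best of my knowledge `pad_sequences` defaults to `dtype="int32"`, so float Word2Vec vectors would be truncated to integers unless `dtype="float32"` is passed. The sketch below shows the effect and a hand-rolled alternative in NumPy. The helper name `pad_vectors` is hypothetical, and unlike the Keras default it truncates over-long sentences at the end.

```python
import numpy as np

# Casting floats to int32 truncates toward zero, which wipes out small embedding values.
print(np.array([0.42, -0.87]).astype("int32"))

def pad_vectors(seqs, max_len, dim, value=0.0):
    """Post-pad a list of (length, dim) vector sequences to (n, max_len, dim) float32."""
    out = np.full((len(seqs), max_len, dim), value, dtype="float32")
    for i, seq in enumerate(seqs):
        # reshape(-1, dim) also copes with an empty sentence.
        arr = np.asarray(seq, dtype="float32").reshape(-1, dim)[:max_len]
        out[i, :len(arr)] = arr
    return out

# Two toy "sentences" with 3-dimensional word vectors.
toy = [[[0.5] * 3, [0.25] * 3], [[1.5] * 3]]
X_toy = pad_vectors(toy, max_len=4, dim=3)
print(X_toy.shape, X_toy.dtype)
```

Padding with zeros rather than `num_words-1` is an assumption here; a constant such as `num_words-1` is meaningful for integer indices but not obviously so for embedding vectors.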
Then I define the model:
from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
Then I fit the model:
import numpy as np

history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32,
    epochs=1,
    verbose=1,
)
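This fitting step produces the following error, and I am not sure how to fix it:
Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)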
Answer:
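The shape of X before padding,
import numpy as np
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)
is (3, 6, 30) for the 3 sentences in the CSV file, and (3, 60, 30) after padding, where 30 is the Word2Vec vector size. The model, however, expects input of shape (3, 60): the Embedding layer appends its own output dimension (60 here) to whatever it receives, which is where the 4-D shape (None, 60, 30, 60) in the error comes from.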
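The shape mismatch can be reproduced with plain NumPy indexing, which is a reasonable stand-in for what an embedding lookup does: each integer in the input is replaced by a row of the table. This is only a sketch; `num_words = 100` is an arbitrary illustrative value and the arrays are filled with zeros.

```python
import numpy as np

max_len, vec_size = 60, 30
num_words = 100  # illustrative value, not taken from the question

# Stand-in for Embedding(input_dim=num_words, output_dim=max_len): one row per word id.
embedding_table = np.zeros((num_words, max_len))

# What the Embedding layer is designed for: integer word ids of shape (batch, max_len).
int_ids = np.zeros((3, max_len), dtype=int)
# What was actually fed in: one 30-dimensional vector per word.
vec_input = np.zeros((3, max_len, vec_size), dtype=int)

shape_from_ids = embedding_table[int_ids].shape
shape_from_vectors = embedding_table[vec_input].shape
print(shape_from_ids)      # (3, 60, 60)
print(shape_from_vectors)  # (3, 60, 30, 60)
```

The second shape matches the (None, 60, 30, 60) reported in the error, with the batch size of 3 in place of None.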
Without changing the rest of the code, you can modify the network to take the Word2Vec vectors directly:
wrd2vec_size = 30
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)
model = Model(input_word, out)
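The other direction would be to keep the Embedding layer from the question and feed it integer word ids of shape (batch, max_len) instead of vectors. A minimal sketch of that encoding follows; `word2int`, `PAD` and `encode` are hypothetical names introduced for illustration, and reserving index 0 for padding is an assumption rather than something the question specifies.

```python
# Words of the sample sentence from the CSV excerpt.
sentences = [["YASHAWANTHA", "K", "S", "Mobile", ":", "-7353555773"]]
max_len = 60

PAD = 0  # assumed padding index
vocab = sorted({w for s in sentences for w in s})
# Real words start at 1 so that 0 stays free for padding.
word2int = {w: i + 1 for i, w in enumerate(vocab)}

def encode(sentence, max_len):
    """Map words to integer ids, then post-truncate and post-pad to max_len."""
    ids = [word2int[w] for w in sentence][:max_len]
    return ids + [PAD] * (max_len - len(ids))

X_ids = [encode(s, max_len) for s in sentences]
print(len(X_ids), len(X_ids[0]))
```

With this encoding the Embedding layer's input_dim would need to be len(vocab) + 1, and the pretrained Word2Vec vectors could in principle be copied into its weight matrix so that row i holds the vector of word i.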
