Word2Vec dimensions are incorrect

The data I am using is stored in a csv file:

Sentence #  Word    POS Tag
Sentence1   YASHAWANTHA NNP B-PER
Sentence1   K   NNP I-PER
Sentence1   S   NNP I-PER
Sentence1   Mobile  NNP O
Sentence1   :   :   O
Sentence1   -7353555773 JJ  O

I am trying to load the dataset with the columns Sentence #, Word, POS and Tag, and to convert every entry in the Word column to a Word2Vec vector.

Here I read the dataset and split it into sentences:

from gensim.models import Word2Vec
import numpy as np  # used later for np.array(...)
import pandas as pd

data = pd.read_csv(path_to_csv)

class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data

        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),s["POS"].values.tolist(), s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None

getter = SentenceGetter(data)
sentences = getter.sentences
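
To make the structure of `sentences` concrete, here is a minimal sketch of the same grouping without the helper class. The first three rows come from the csv excerpt above; the second sentence is made up purely for illustration.

```python
import pandas as pd

# First rows of the csv excerpt, plus one hypothetical row for a second sentence.
rows = [
    ("Sentence1", "YASHAWANTHA", "NNP", "B-PER"),
    ("Sentence1", "K", "NNP", "I-PER"),
    ("Sentence1", "S", "NNP", "I-PER"),
    ("Sentence2", "Mobile", "NNP", "O"),  # hypothetical second sentence
]
data = pd.DataFrame(rows, columns=["Sentence #", "Word", "POS", "Tag"])

# One list of (word, POS, tag) triples per sentence, in file order.
sentences = [
    list(zip(group["Word"], group["POS"], group["Tag"]))
    for _, group in data.groupby("Sentence #", sort=False)
]

print(len(sentences))
print(sentences[0][0])
```

Each element of `sentences` is therefore a list of triples, which is why the later snippets index `w[0]` for the word and `w[2]` for the tag.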

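Now I convert all the words to their corresponding Word2Vec vectors, where word2idx is a dictionary whose keys are the word strings and whose values are the corresponding Word2Vec vectors: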

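# Note: gensim >= 4.0 renames size to vector_size and wv.vocab to wv.key_to_index.
vec_words = [[i] for i in words]  # each word becomes its own one-word "sentence"
vec_model = Word2Vec(vec_words, min_count=1, size=30)
word2idx = {}  # despite the name, maps word -> 30-dimensional vector
for key in vec_model.wv.vocab:
    word2idx[key] = vec_model.wv[key]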

Then, for the tag column, I use a simple enumeration:

tag2idx = {t: i for i, t in enumerate(tags)}
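
The snippets use `words`, `tags`, `num_words` and `num_tags` without showing where they come from. The sketch below is one plausible way to derive them from the sentence triples; it is an assumption, not the original definitions, and the toy sentence stands in for `getter.sentences`.

```python
# Toy stand-in for getter.sentences: lists of (word, POS, tag) triples.
sentences = [
    [("YASHAWANTHA", "NNP", "B-PER"), ("K", "NNP", "I-PER"), ("Mobile", "NNP", "O")],
]

# Assumed definitions: the sorted sets of distinct words and tags.
words = sorted({w for s in sentences for w, _, _ in s})
tags = sorted({t for s in sentences for _, _, t in s})
num_words = len(words)
num_tags = len(tags)

tag2idx = {t: i for i, t in enumerate(tags)}
print(tag2idx)
```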

Then I pad the words and tags:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

max_len = 60
X = [[word2idx[w[0]] for w in s] for s in sentences]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[tag2idx[w[2]] for w in s] for s in sentences]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag2idx["O"])
y= [to_categorical(i, num_classes = num_tags) for i in y]
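
Because each word is already a 30-dimensional vector at this point, padding produces a 3-D array rather than a 2-D array of indices. The NumPy sketch below mimics post-padding of vector sequences to show the resulting shape; `pad_vectors` is a hypothetical helper, not part of Keras. As far as I know, `pad_sequences` defaults to `dtype="int32"`, which would also truncate float vectors unless a float dtype is passed.

```python
import numpy as np

def pad_vectors(seqs, max_len, dim):
    """Post-pad (and cut to the first max_len steps) sequences of dim-sized vectors."""
    out = np.zeros((len(seqs), max_len, dim), dtype=np.float32)
    for i, seq in enumerate(seqs):
        seq = np.asarray(seq, dtype=np.float32).reshape(-1, dim)[:max_len]
        out[i, :len(seq)] = seq
    return out

# Two toy "sentences" of 6 and 2 words, each word a 30-dimensional vector of ones.
seqs = [np.ones((6, 30)), np.ones((2, 30))]
X = pad_vectors(seqs, max_len=60, dim=30)
print(X.shape)
```

Note that this helper keeps the first `max_len` steps of an over-long sequence, whereas `pad_sequences` by default truncates from the front.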

Then I define the model:

from sklearn.model_selection import train_test_split
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense
from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D, Bidirectional

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
input_word = Input(shape=(max_len,))
model = Embedding(input_dim=num_words, output_dim=max_len, input_length=max_len)(input_word)
model = SpatialDropout1D(0.1)(model)
model = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(model)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(model)
model = Model(input_word, out)

model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

Then I fit the model:

history = model.fit(
    x_train, np.array(y_train),
    validation_split=0.2,
    batch_size=32, 
    epochs=1,
    verbose=1,    
)

This fitting step raises the following error, and I am not sure how to fix it:

Input 0 of layer "spatial_dropout1d_2" is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 60, 30, 60)

A uj5u.com user replied:

The shape before padding:

X = [[word2idx[w[0]] for w in s] for s in sentences]
X = np.array(X)
print(X.shape)

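This prints (3, 6, 30) for the 3 sentences in the csv file, and the shape becomes (3, 60, 30) after padding, where 30 is the Word2Vec vector size. The model, however, expects input of shape (3, 60): the Embedding layer takes integer word indices and appends an axis of size output_dim, so feeding it (3, 60, 30) produces the 4-D tensor (None, 60, 30, 60) reported in the error.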
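
The extra axis can be reproduced with plain NumPy indexing, which is roughly what an embedding lookup does. This is only an illustration of the shape arithmetic, not Keras code, and the `num_words` value is arbitrary.

```python
import numpy as np

batch, max_len, w2v_size, num_words = 3, 60, 30, 100  # num_words chosen arbitrarily

# Stand-in for Embedding(input_dim=num_words, output_dim=max_len): one row per word index.
embedding_table = np.zeros((num_words, max_len))

idx_input = np.zeros((batch, max_len), dtype=int)            # what Embedding expects
vec_input = np.zeros((batch, max_len, w2v_size), dtype=int)  # what was actually passed

shape_ok = embedding_table[idx_input].shape
shape_bad = embedding_table[vec_input].shape
print(shape_ok)   # (3, 60, 60)
print(shape_bad)  # (3, 60, 30, 60)
```

The lookup output always has the input shape plus one trailing axis, so a 3-D input necessarily yields the 4-D tensor that SpatialDropout1D rejects.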

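Without changing the rest of the code, you can modify the network by dropping the Embedding layer and feeding the vectors in directly: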

wrd2vec_size = 30
input_word = Input(shape=(max_len, wrd2vec_size))
x = SpatialDropout1D(0.1)(input_word)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(num_tags, activation="softmax"))(x)

model = Model(input_word, out)
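
For comparison, the other route is to keep the Embedding layer and feed it integer indices, using the Word2Vec vectors as its weights. The sketch below only builds the index mapping and weight matrix in NumPy; the random vectors and the names `word2vec`, `word_to_index` and `embedding_matrix` are hypothetical stand-ins, and this is not what the answer above does.

```python
import numpy as np

w2v_size = 30
max_len = 60

# Hypothetical stand-in for the question's word -> vector dictionary.
rng = np.random.default_rng(0)
word2vec = {w: rng.normal(size=w2v_size).astype(np.float32)
            for w in ["YASHAWANTHA", "K", "S"]}

# Reserve index 0 for padding; real words start at 1.
word_to_index = {w: i + 1 for i, w in enumerate(word2vec)}
embedding_matrix = np.zeros((len(word2vec) + 1, w2v_size), dtype=np.float32)
for w, i in word_to_index.items():
    embedding_matrix[i] = word2vec[w]

# Encode one sentence as post-padded integer indices of length max_len.
sentence = ["YASHAWANTHA", "K", "S"]
ids = np.zeros(max_len, dtype=np.int32)
ids[:len(sentence)] = [word_to_index[w] for w in sentence]

print(ids.shape)
print(embedding_matrix[ids].shape)
```

With this layout the network input would stay at shape (max_len,), and the matrix could presumably be handed to the Embedding layer as its initial weights, for example through a constant initializer, with output_dim set to 30 rather than max_len.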

Please credit the source when reposting. Link to this article: https://www.uj5u.com/net/445650.html

Tags: Python TensorFlow word2vec
