How do I create a sliding window after applying text vectorization to a tf dataset?

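I am working with TensorFlow's TextLineDataset. I want to tokenize the dataset, create a sliding window over it, and split the tokenized text into two parts: inputs and labels. Suppose the text file contains the following text: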

Lorem ipsum dolor sit amet...

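I then want to create sequences of a specified length, pre-padded with 0. I want to iterate over the text, using every token except the last as the input and the last token as the label. So my goal is to first tokenize the text like this: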

Lorem: 1,
ipsum: 2,
dolor: 3,
sit: 4,
amet: 5,
...

and then create sequences of length 5 to train the model:

X_train = [[0, 0, 0, 0, 1], [0, 0, 0, 1, 2], [0, 0, 1, 2, 3], ...]
y_train = [2, 3, 4, ...] # next word of the sequence in X_train
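
To pin down the target layout, here is a minimal plain-Python sketch of the pairs described above. It assumes the tokens are already integer ids and that 0 is the padding value; `make_pairs` is a hypothetical helper name, and the sketch only illustrates the desired output, not an approach that scales to a large dataset.

```python
def make_pairs(token_ids, window_size):
    """Yield (input_window, label) pairs, left-padding the inputs with 0."""
    # Prepend window_size - 1 zeros so the first window ends on the first token.
    padded = [0] * (window_size - 1) + list(token_ids)
    # Each window of window_size tokens is followed by exactly one label token.
    for end in range(window_size, len(padded)):
        yield padded[end - window_size:end], padded[end]


pairs = list(make_pairs([1, 2, 3, 4, 5], window_size=5))
X_train = [x for x, _ in pairs]
y_train = [y for _, y in pairs]
```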

I am using TextVectorization for tokenization, but I cannot find an efficient way to create the inputs and labels for a large dataset.

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int',
                                                    max_tokens=MAX_WORDS,
                                                    output_sequence_length=MAX_SEQUENCE_LENGTH)
vectorize_layer.adapt(train_data)
train_data = train_data.map(vectorize_layer)

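Iterating over the dataset with a for loop makes the device run out of memory while it tries to allocate a large amount of it. What is the best way to do this?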

Answer:

You can use the sliding window function from tensorflow-text; however, the TextVectorization layer seems to support only post-padding:

import tensorflow as tf
import tensorflow_text as tft

with open('data.txt', 'w') as f:
  f.write('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam efficitur viverra lacus?\n')

train_data = tf.data.TextLineDataset(['data.txt'])  # same path as the file written above

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int', max_tokens=50, pad_to_max_tokens=True)
vectorize_layer.adapt(train_data)

window_size = 5

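def sliding_window(x):
  # Tokenize one line of text into a 1-D tensor of integer ids.
  encoded = vectorize_layer(x)
  # Inputs: every window of `window_size` consecutive tokens.
  x = tft.sliding_window(encoded, width=window_size, axis=0)
  # Labels: the last token of every window of `window_size + 1` tokens.
  y = tft.sliding_window(encoded, width=window_size + 1, axis=0)[:, -1]
  # Drop the trailing input window that has no following token.
  return x[:tf.shape(y)[0],:], y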

train_data = train_data.map(sliding_window)


vocab = tf.constant(vectorize_layer.get_vocabulary())
keys = tf.cast(tf.range(vocab.shape[0]), tf.int64)
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, vocab),
    default_value="")

train_data = tf.data.Dataset.zip((train_data.map(lambda x, y: x).flat_map(tf.data.Dataset.from_tensor_slices),
                                 train_data.map(lambda x, y: y).flat_map(tf.data.Dataset.from_tensor_slices)))

for x, y in train_data:
  print('x -->', x, 'y -->', y)
  print('x -->', table.lookup(x), 'y -->', table.lookup(y), '\n')

x --> tf.Tensor([ 4  6  9  3 11], shape=(5,), dtype=int64) y --> tf.Tensor(10, shape=(), dtype=int64)
x --> tf.Tensor([b'lorem' b'ipsum' b'dolor' b'sit' b'amet'], shape=(5,), dtype=string) y --> tf.Tensor(b'consectetur', shape=(), dtype=string) 

x --> tf.Tensor([ 6  9  3 11 10], shape=(5,), dtype=int64) y --> tf.Tensor(13, shape=(), dtype=int64)
x --> tf.Tensor([b'ipsum' b'dolor' b'sit' b'amet' b'consectetur'], shape=(5,), dtype=string) y --> tf.Tensor(b'adipiscing', shape=(), dtype=string) 

x --> tf.Tensor([ 9  3 11 10 13], shape=(5,), dtype=int64) y --> tf.Tensor(7, shape=(), dtype=int64)
x --> tf.Tensor([b'dolor' b'sit' b'amet' b'consectetur' b'adipiscing'], shape=(5,), dtype=string) y --> tf.Tensor(b'elit', shape=(), dtype=string) 

x --> tf.Tensor([ 3 11 10 13  7], shape=(5,), dtype=int64) y --> tf.Tensor(12, shape=(), dtype=int64)
x --> tf.Tensor([b'sit' b'amet' b'consectetur' b'adipiscing' b'elit'], shape=(5,), dtype=string) y --> tf.Tensor(b'aliquam', shape=(), dtype=string) 

x --> tf.Tensor([11 10 13  7 12], shape=(5,), dtype=int64) y --> tf.Tensor(8, shape=(), dtype=int64)
x --> tf.Tensor([b'amet' b'consectetur' b'adipiscing' b'elit' b'aliquam'], shape=(5,), dtype=string) y --> tf.Tensor(b'efficitur', shape=(), dtype=string) 

x --> tf.Tensor([10 13  7 12  8], shape=(5,), dtype=int64) y --> tf.Tensor(2, shape=(), dtype=int64)
x --> tf.Tensor([b'consectetur' b'adipiscing' b'elit' b'aliquam' b'efficitur'], shape=(5,), dtype=string) y --> tf.Tensor(b'viverra', shape=(), dtype=string) 

x --> tf.Tensor([13  7 12  8  2], shape=(5,), dtype=int64) y --> tf.Tensor(5, shape=(), dtype=int64)
x --> tf.Tensor([b'adipiscing' b'elit' b'aliquam' b'efficitur' b'viverra'], shape=(5,), dtype=string) y --> tf.Tensor(b'lacus', shape=(), dtype=string)

Note that sequences that do not have a corresponding label are discarded with the line x[:tf.shape(y)[0],:]. Also, the lookup table is only for demonstration purposes and not needed to achieve what you want. You can look at tft.pad_along_dimension if you want to apply pre-padding.
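
If pre-padding is needed, one possible sketch is shown below. It deliberately swaps in core TensorFlow ops (`tf.pad` and `tf.gather`) rather than `tft.pad_along_dimension`, and uses `Dataset.unbatch()` to flatten the windows. The function name `prepadded_windows` and the hard-coded token ids are assumptions for illustration; in practice the input would come from `vectorize_layer(x)`.

```python
import tensorflow as tf

window_size = 5  # same window length as in the answer above


def prepadded_windows(encoded):
    # Prepend window_size - 1 zeros so the first window ends on the first token.
    padded = tf.pad(encoded, [[window_size - 1, 0]])
    # Number of (input, label) pairs; never negative for very short lines.
    n = tf.maximum(tf.shape(padded)[0] - window_size, 0)
    # Row i of idx selects positions i .. i + window_size of the padded sequence.
    idx = tf.range(n)[:, None] + tf.range(window_size + 1)[None, :]
    frames = tf.gather(padded, idx)
    # The first window_size columns are the input, the last column is the label.
    return frames[:, :-1], frames[:, -1]


# Hypothetical already-tokenized line standing in for vectorize_layer(x).
encoded_lines = tf.data.Dataset.from_tensors(
    tf.constant([1, 2, 3, 4, 5], dtype=tf.int64))

# unbatch() splits each (windows, labels) element into individual pairs.
windows = encoded_lines.map(prepadded_windows).unbatch()

for x, y in windows:
    print('x -->', x.numpy(), 'y -->', y.numpy())
```

For the token ids 1 to 5 this should reproduce the layout from the question, starting with `[0, 0, 0, 0, 1]` as input and `2` as label. Because everything stays inside `tf.data`, the pairs are produced lazily instead of being materialized in memory all at once.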
