How do I apply tf.keras.preprocessing.text.Tokenizer to a tf.data.TextLineDataset?

I am loading a TextLineDataset and I want to apply a tokenizer trained on the file:

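import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# filename: path to the text file, one sample per line
data = tf.data.TextLineDataset(filename)

MAX_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_WORDS)
# Iterating the dataset eagerly yields byte-string tensors; decode them to str
tokenizer.fit_on_texts([x.numpy().decode('utf-8') for x in data])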

Now I want to apply this tokenizer to data so that each word is replaced by its encoded value. I tried data.map(lambda x: tokenizer.texts_to_sequences(x)), which gives OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

Following that advice, when I write the code as:

@tf.function
def fun(x):
    return tokenizer.texts_to_sequences(x)
data.map(lambda x: fun(x))

I get: OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

So how can I tokenize data?

Answer:

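The problem is that tf.keras.preprocessing.text.Tokenizer is not meant to be used in graph mode. According to the docs, both fit_on_texts and texts_to_sequences expect a list of strings, not tensors. I recommend using tf.keras.layers.TextVectorization, but if you really want to use the Tokenizer approach, try the following: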

import tensorflow as tf

# Write a small sample file, one sentence per line
with open('data.txt', 'w') as f:
  f.write('this is a very important sentence \n')
  f.write('where is my cat actually?\n')
  f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['data.txt'])

# Fit the tokenizer eagerly on decoded Python strings
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([n.numpy().decode("utf-8") for n in dataset])

def tokenize(x):
  # Runs eagerly inside tf.py_function, so x.numpy() is available here
  return tokenizer.texts_to_sequences([x.numpy().decode("utf-8")])

# Tout is a list, so tf.py_function returns a list of tensors; take the first
dataset = dataset.map(lambda x: tf.py_function(tokenize, [x], Tout=[tf.int32])[0])

for d in dataset:
  print(d)

tf.Tensor([2 1 3 4 5 6], shape=(6,), dtype=int32)
tf.Tensor([ 7  1  8  9 10], shape=(5,), dtype=int32)
tf.Tensor([11 12 13], shape=(3,), dtype=int32)
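
The tf.py_function wrapper is what makes the Tokenizer usable here: a function passed directly to Dataset.map is traced into a graph, where tensors are symbolic and cannot be iterated or decoded, whereas tf.py_function runs its function eagerly. A minimal sketch that records tf.executing_eagerly() in both cases (the names modes, traced and eager are only illustrative, and the one-line dataset stands in for the file):

```python
import tensorflow as tf

modes = {}

def traced(x):
    # Called once while tf.data traces the map function into a graph
    modes["map"] = tf.executing_eagerly()
    return x

def eager(x):
    # Called per element by tf.py_function, in eager mode
    modes["py_function"] = tf.executing_eagerly()
    return x

ds = tf.data.Dataset.from_tensor_slices(["where is my cat actually?"])
ds = ds.map(traced)
ds = ds.map(lambda x: tf.py_function(eager, [x], Tout=tf.string))

for _ in ds:  # iterate so the py_function body actually runs
    pass

print(modes)  # {'map': False, 'py_function': True}
```

The trade-off is that the wrapped function runs as ordinary Python, so it is not serialized with the graph and is usually slower than native TensorFlow ops.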

Using a TextVectorization layer would look like this:

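import tensorflow as tf

# Write the same sample file, one sentence per line
with open('data.txt', 'w') as f:
  f.write('this is a very important sentence \n')
  f.write('where is my cat actually?\n')
  f.write('fish are everywhere!\n')

dataset = tf.data.TextLineDataset(['data.txt'])

# Learn the vocabulary from the dataset; the layer works in graph mode
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
vectorize_layer.adapt(dataset)

# No tf.py_function needed: the layer can be mapped directly
dataset = dataset.map(vectorize_layer)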
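
To see which ids TextVectorization assigns, the learned vocabulary can be inspected with get_vocabulary(). The sketch below is an illustration rather than part of the original answer: it assumes the same three sentences held in an in-memory tensor (lines) instead of the file. Note that the ids will not match the Tokenizer output above, because in 'int' mode the layer reserves 0 for padding and 1 for out-of-vocabulary tokens.

```python
import tensorflow as tf

# Same sentences as in the sample file, kept in memory for brevity
lines = tf.constant([
    "this is a very important sentence ",
    "where is my cat actually?",
    "fish are everywhere!",
])

vectorize_layer = tf.keras.layers.TextVectorization(output_mode="int")
vectorize_layer.adapt(lines)

vocab = vectorize_layer.get_vocabulary()
print(vocab[:3])   # ['', '[UNK]', 'is']: padding, OOV, then the most frequent word
print(len(vocab))  # 15: two reserved entries plus 13 distinct words

# Calling the layer on a batch pads shorter sentences with 0
ids = vectorize_layer(lines)
print(ids.shape)   # (3, 6)
```

Words that occur equally often may be ordered differently across versions, so it is safer to look ids up through the vocabulary than to hard-code them.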
