Extracting data consistently from a TensorFlow dataset

I want to extract the data from a TensorFlow dataset consistently into NumPy arrays/tensors. I am loading images with

data = keras.preprocessing.image_dataset_from_directory(
  './data', 
  labels='inferred', 
  label_mode='binary', 
  validation_split=0.2, 
  subset="training", 
  image_size=(img_height, img_width), 
  batch_size=sz_batch, 
  crop_to_aspect_ratio=True
)

I was given the hint to use the following lines:

xdata = np.concatenate([x for x, y in data], axis=0)
ydata = np.concatenate([y for x, y in data], axis=0)

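However, the problem is that the extracted xdata and ydata are inconsistent: the labels in ydata do not match the samples in xdata (I checked this by extracting the data in a simple loop).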

My second idea was to extract the data in a standard for loop:

xdata = np.empty([sz1, sz2, 3])[np.newaxis,...]
ydata = np.array([0])
for images, labels in val_ds:
    xdata = np.concatenate((xdata, images), axis=0)
    ydata = np.concatenate((ydata, labels), axis=0)

# Drop the dummy first entry that was only used to initialise the arrays
xdata = xdata[1:]
ydata = ydata[1:]

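Although the data appear to be consistent with this approach, I find it rather cumbersome and poorly written (and probably inefficient too); the last two lines bother me in particular. But I cannot come up with a simpler way to extract the data and stack it into NumPy arrays/tensors.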

I would appreciate help with solving this properly in Python.

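In any case, I wonder why working with TensorFlow datasets is so cumbersome, at least as it seems to me. First, I need to solve the problem above in order to use the data in other routines outside TensorFlow. Second, even when I use the data anywhere other than TensorFlow training, my options are not straightforward. For example, if I want to compare the labels predicted by the NN with the true labels from the dataset, I cannot easily extract consistent labels for that dataset. I have to predict each sample separately in a for loop.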

Note: I will not/cannot use tfds.

Reply from a uj5u.com user:

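Regarding the order of your dataset when converting it to NumPy arrays: make sure you set shuffle=False in image_dataset_from_directory if you want to see the same results on every pass. With the default shuffle=True, the dataset is reshuffled every time it is iterated, so the two list comprehensions (one pass for x, one for y) see different orders. For example: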

import tensorflow as tf
import matplotlib.pyplot as plt
import pathlib

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

batch_size = 32

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(180, 180),
  batch_size=batch_size,
  shuffle=False)

normalization_layer = tf.keras.layers.Rescaling(1./255)
train_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
images, labels = next(iter(train_ds.take(1)))
image = images[0]
plt.title('label :: ' + str(labels[0]))
plt.imshow(image.numpy())
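
To make the effect of shuffling concrete, here is a minimal sketch on a toy dataset. It is an illustration only, not the asker's data: the names toy_ds, x_two, y_two, x_one and y_one are made up for this example, and the "images" are just integers whose label is ten times their value, so correct pairing is easy to check.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for an image dataset (assumption): the "image" is the number i
# and its "label" is 10 * i, so a correctly paired sample satisfies y == 10 * x.
x = np.arange(100, dtype=np.int64)
toy_ds = tf.data.Dataset.from_tensor_slices((x, 10 * x))
# shuffle() reshuffles on every new iteration unless told otherwise
toy_ds = toy_ds.shuffle(buffer_size=100, seed=0).batch(8)

# Two separate passes, as in the two list comprehensions from the question
x_two = np.concatenate([xb.numpy() for xb, _ in toy_ds], axis=0)
y_two = np.concatenate([yb.numpy() for _, yb in toy_ds], axis=0)
two_pass_paired = bool(np.all(y_two == 10 * x_two))

# A single pass: each batch delivers its images and labels together
batches = list(toy_ds.as_numpy_iterator())
x_one = np.concatenate([xb for xb, _ in batches], axis=0)
y_one = np.concatenate([yb for _, yb in batches], axis=0)
one_pass_paired = bool(np.all(y_one == 10 * x_one))

print("two passes paired:", two_pass_paired)
print("one pass paired:  ", one_pass_paired)
```

With shuffling enabled, the two-pass check is expected to fail while the single-pass check holds; without the shuffle() call both should hold.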

After that, you can try several ways to convert the dataset into a list or an array-like structure:

Option 1:

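# Split the batches so that each element is one (image, label) pair
train_ds = train_ds.unbatch()
# Materialise all pairs in a single pass over the dataset
data = list(train_ds.map(lambda x, y: (x, y)))
# Transpose [(x0, y0), (x1, y1), ...] into [[x0, x1, ...], [y0, y1, ...]]
data = list(map(list, zip(*data)))
images, labels = data[0], data[1]

image = images[0]
plt.title('label :: ' + str(labels[0]))
plt.imshow(image.numpy())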

Option 2:

import numpy as np

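# Split the batches so that each element is one (image, label) pair
train_ds = train_ds.unbatch()
# Two passes over the dataset; the orders match only because shuffle=False
images = np.asarray(list(train_ds.map(lambda x, y: x)))
labels = np.asarray(list(train_ds.map(lambda x, y: y)))
image = images[0]
plt.title('label :: ' + str(labels[0]))
plt.imshow(image)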

Option 3:

import numpy as np

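# No unbatching: concatenate the batches along the first axis.
# Two passes over the dataset; the orders match only because shuffle=False
images = np.concatenate(list(train_ds.map(lambda x, y: x)))
labels = np.concatenate(list(train_ds.map(lambda x, y: y)))

image = images[0]
plt.title('label :: ' + str(labels[0]))
plt.imshow(image)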

All options preserve the order of the data.
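
Options 2 and 3 iterate the dataset twice, which is why they depend on shuffle=False. A single-pass variant is sketched below in case that helps. It assumes the dataset is still batched and is iterated eagerly; dataset_to_numpy is a hypothetical helper name, not a TensorFlow function, and the toy batches are NumPy stand-ins so the snippet runs on its own.

```python
import numpy as np

def dataset_to_numpy(batches):
    """Collect (images, labels) batches into two NumPy arrays in one pass.

    `batches` is any iterable yielding (x, y) pairs whose first axis is the
    batch axis, for example a batched tf.data.Dataset iterated eagerly.
    Hypothetical helper, not part of TensorFlow.
    """
    xs, ys = [], []
    for x, y in batches:
        # np.asarray converts eager tensors as well as plain arrays
        xs.append(np.asarray(x))
        ys.append(np.asarray(y))
    # A single concatenate at the end avoids re-copying the data per batch
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)

# Stand-in batches: each label is the pixel sum of its image,
# so the pairing can be verified after extraction
rng = np.random.default_rng(0)
toy_batches = []
for n in (4, 4, 2):  # the last batch is smaller, as in a real dataset
    x = rng.integers(0, 256, size=(n, 2, 2, 3))
    toy_batches.append((x, x.sum(axis=(1, 2, 3))))

xdata, ydata = dataset_to_numpy(toy_batches)
print(xdata.shape, ydata.shape)
```

Because images and labels are read from the same batch, the pairing should survive shuffling, and no dummy first row has to be sliced off afterwards.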

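Update 1: You can also try using tf.TensorArray, which lets you keep shuffle=True. Each image and its label are written in the same loop iteration, so the pairs stay together: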

images = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
labels = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)

for x, y in train_ds.unbatch():
  images = images.write(images.size(), x)
  labels = labels.write(labels.size(), y)

# stack() returns one tensor with a leading sample dimension
images = images.stack()
labels = labels.stack()
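
The question also mentions comparing predicted labels with the true labels. The same single-pass idea might be sketched as follows. This is an assumption-laden illustration: predict_with_labels and toy_predict are hypothetical names, and the toy "model" is a plain NumPy function; for a Keras model, model.predict_on_batch would be a natural candidate for predict_fn.

```python
import numpy as np

def predict_with_labels(batches, predict_fn):
    """Return (predictions, true_labels) gathered in a single pass.

    `predict_fn` maps a batch of inputs to a batch of outputs.
    Hypothetical helper, not part of TensorFlow or Keras.
    """
    preds, trues = [], []
    for x, y in batches:
        # Prediction and label come from the same batch, so they stay aligned
        preds.append(np.asarray(predict_fn(x)))
        trues.append(np.asarray(y))
    return np.concatenate(preds, axis=0), np.concatenate(trues, axis=0)

def toy_predict(x):
    # Stand-in "model": predicts 1 when the mean pixel value exceeds 0.5
    return (np.asarray(x).mean(axis=(1, 2, 3)) > 0.5).astype(np.int64)

# Stand-in batches whose true labels follow the same rule as the toy model
rng = np.random.default_rng(1)
toy_batches = []
for n in (3, 3, 2):
    x = rng.random((n, 4, 4, 3))
    y = (x.mean(axis=(1, 2, 3)) > 0.5).astype(np.int64)
    toy_batches.append((x, y))

y_pred, y_true = predict_with_labels(toy_batches, toy_predict)
accuracy = float(np.mean(y_pred == y_true))
print(accuracy)
```

This predicts batch by batch instead of sample by sample, and it does not depend on the iteration order being reproducible.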

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qiye/406138.html
