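[Translated from: A Hands-on Tutorial to Learn Attention Mechanism For Image Caption Generation in Python]

[Note: I really like the articles on Analytics Vidhya, so in my spare time I translate some of them and work through them as practice. This is a record of that work; I hope it helps anyone who needs it.]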

Overview

Understand the attention mechanism for image caption generation
Implement the attention mechanism to generate captions in Python

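Introduction

Attention is a complex cognitive ability that humans possess. When people receive information, they can consciously focus on the most important parts and ignore the rest.

This capacity for selection is called attention. An attention mechanism gives a neural network the same ability: it focuses on a subset of its inputs in order to pick out specific features.

In recent years, neural networks have driven great progress in image captioning. Researchers are looking for more challenging applications of computer vision and sequence-to-sequence modeling, and try to describe the world in human terms. In the previous article, we saw how to caption images with the Merge architecture. Today we will explore a more sophisticated design for the same problem.

The attention mechanism has become a go-to method for practitioners in the deep learning community. It was originally designed for neural machine translation with Seq2Seq models, but today we will look at its implementation in image captioning.

Instead of compressing the whole image into a static representation, the attention mechanism lets salient features dynamically come to the forefront when they are needed. This is especially important when the image contains a lot of clutter.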

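Let's take an example to understand this better:

Our goal is to generate a caption such as "two white dogs are running on the snow". To do so, we will see how to implement a specific type of attention mechanism called Bahdanau attention, which the original article also refers to as local attention.

This way, we can see which parts of the image the model focuses on as it generates the caption. This implementation requires a strong background in deep learning.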

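1. Approach to the problem statement
2. Understanding the dataset
3. Implementation
   1. Importing the required libraries
   2. Data loading and preprocessing
   3. Model definition
   4. Model training
   5. Greedy search and BLEU evaluation
4. What's next?
5. End notes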

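Approach to the Problem Statement

An encoder-decoder image captioning system encodes the image with a pretrained convolutional neural network, which produces a hidden state. It then uses an LSTM to decode this hidden state and generate the caption.

For each element of the sequence, the output from the previous elements is combined with the new sequence data and used as input. This gives the RNN a kind of memory, which can make the captions more informative and context-aware.

However, RNNs tend to be computationally expensive to train and evaluate, so in practice the memory is limited to just a few elements. Attention models help with this problem by selecting the most relevant elements from the input image. With an attention mechanism, the image is first divided into n parts, and we compute a representation of each part. When the RNN generates a new word, the attention mechanism focuses on the relevant part of the image, so the decoder uses only specific parts of the image.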

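In Bahdanau or local attention, attention is placed on only a few source positions. Because global attention attends to all source-side words for every target word, it is computationally very expensive. To overcome this drawback, local attention chooses to focus on only a small subset of the encoder's hidden states for each target word. (Strictly speaking, the local variant comes from Luong et al.; Bahdanau attention is the additive form that scores every source position, which is what the decoder in this article computes over all 49 image regions.)

Local attention first finds an alignment position, then computes attention weights within a window to the left and right of that position, and finally weights the context vector. Its main advantage is that it reduces the cost of computing attention.

During computation, local attention does not consider all words on the source side. Instead, a prediction function predicts the source position to align with at the current decoding step, and then only the words inside a context window around that position are considered.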
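
To make the windowing idea concrete, here is a minimal NumPy sketch of local attention under simplifying assumptions: the aligned position `p` and the half-width `D` are given rather than predicted, and the scores are plain dot products. The function name `local_attention` and all shapes are illustrative and do not come from the article.

```python
import numpy as np

def local_attention(query, states, p, D):
    """Attend only to states[p-D : p+D+1], clipped to the valid range.

    query:  (d,)   decoder state
    states: (n, d) encoder states
    Returns (context, weights); weights has length n and is zero
    outside the window.
    """
    n = states.shape[0]
    lo, hi = max(0, p - D), min(n, p + D + 1)
    scores = states[lo:hi] @ query             # dot-product scores inside the window
    scores = scores - scores.max()             # subtract the max for numerical stability
    window_weights = np.exp(scores) / np.exp(scores).sum()
    weights = np.zeros(n)
    weights[lo:hi] = window_weights
    context = weights @ states                 # weighted sum over the windowed states
    return context, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(10, 4))
query = rng.normal(size=4)
context, weights = local_attention(query, states, p=5, D=2)
```

Only the five positions 3 to 7 receive non-zero weight here, so the cost of scoring depends on the window size rather than on the full source length.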

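Design of Bahdanau Attention

All hidden states of the encoder and the decoder are used to generate the context vector. The attention mechanism aligns the input and output sequences using an alignment score parameterized by a feed-forward network, which helps it attend to the most relevant information in the source sequence. The model predicts a target word based on the context vector associated with the source positions and the previously generated target words.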
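
The scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single example, assuming randomly initialized matrices `U`, `W` and vector `v` in place of learned parameters and omitting bias terms; the name `additive_attention` is hypothetical. The shapes (49 regions, 256 features, 512 hidden units) mirror the ones used later in the article.

```python
import numpy as np

def additive_attention(features, hidden, U, W, v):
    """Bahdanau-style additive attention for one example.

    features: (n, d_f) encoder outputs, one row per image region
    hidden:   (d_h,)   previous decoder hidden state
    U: (d_f, k), W: (d_h, k), v: (k,) stand in for learned parameters
    """
    # e_j = v^T tanh(U^T h_j + W^T s): one score per region
    scores = np.tanh(features @ U + hidden @ W) @ v      # (n,)
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over regions
    context = weights @ features                         # (d_f,) weighted sum
    return context, weights

rng = np.random.default_rng(0)
n, d_f, d_h, k = 49, 256, 512, 512
features = rng.normal(size=(n, d_f))
hidden = np.zeros(d_h)                                   # initial decoder state
U = rng.normal(size=(d_f, k)) * 0.01
W = rng.normal(size=(d_h, k)) * 0.01
v = rng.normal(size=k) * 0.01
context, weights = additive_attention(features, hidden, U, W, v)
```

The weights form a probability distribution over the 49 regions, and the context vector has the same dimensionality as a single region's features.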

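To evaluate generated captions against the reference captions, we use a metric called BLEU. It is the most widely used evaluation metric, and it measures the n-gram overlap between the sentence being evaluated and the reference sentences.

In this article, multiple images play the role of multiple source-language sentences in translation. The advantage of BLEU is that it takes longer matches into account, because its unit of comparison is the n-gram rather than the single word. Its drawback is that every matching n-gram is treated the same, no matter which n-gram it is.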
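
As a small illustration of the n-gram matching that BLEU builds on, the sketch below computes the clipped (modified) n-gram precision of a candidate against a single reference. It is only one ingredient: full BLEU combines these precisions for several values of n with a brevity penalty, which is omitted here. The helper names and the example sentences are assumptions made for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = "two white dogs are playing in the snow".split()
candidate = "two white dogs are running in the snow".split()
p1 = modified_precision(reference, candidate, 1)   # 7 of 8 unigrams match
p2 = modified_precision(reference, candidate, 2)   # 5 of 7 bigrams match
```

A single wrong word costs one unigram but two bigrams, which is how longer n-grams reward fluent, correctly ordered output.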

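I hope this gives you an idea of how we approach this problem statement. Let's dive into the implementation!

Understanding the Dataset

I used the Flickr8k dataset, in which each image is associated with five different captions that describe the entities and events depicted in the image.

Flickr8k is small and can easily be trained on a low-end laptop or desktop using a CPU, which makes it a good starter dataset.

Our dataset is structured as follows:

Let's implement the attention mechanism for caption generation!

Step 1: Import the required libraries

Here we will use TensorFlow to create and train the model. Most of the code is credited to the TensorFlow tutorials. If you want a GPU for training, you can use Google Colab or Kaggle notebooks.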

import string
import numpy as np
import pandas as pd
from numpy import array
from pickle import load

from PIL import Image
import pickle
from collections import Counter
import matplotlib.pyplot as plt

import sys, time, os, warnings
warnings.filterwarnings("ignore")
import re

import tensorflow as tf
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu

# Keras is imported through TensorFlow so that every layer and utility
# comes from the same tf.keras implementation used by the model code below
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import add
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

Step 2: Data loading and preprocessing

Define the image and caption paths, and check how many images the dataset contains in total.

image_path = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset"
dir_Flickr_text = "/content/gdrive/My Drive/FLICKR8K/Flickr8k_text/Flickr8k.token.txt"
jpgs = os.listdir(image_path)

print("Total Images in Dataset = {}".format(len(jpgs)))

The output is as follows:

We create a DataFrame to store the image IDs and captions so that they are easy to work with.

file = open(dir_Flickr_text,'r')
text = file.read()
file.close()

datatxt = []
for line in text.split('\n'):
   col = line.split('\t')
   if len(col) == 1:
       continue
   w = col[0].split("#")
   datatxt.append(w + [col[1].lower()])

data = pd.DataFrame(datatxt,columns=["filename","index","caption"])
data = data.reindex(columns =['index','filename','caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)

data.head()

The output is as follows:

Next, let's visualize a few images and their five captions:

npic = 5
npix = 224
target_size = (npix,npix,3)
count = 1

fig = plt.figure(figsize=(10,20))
for jpgfnm in uni_filenames[10:14]:
   filename = image_path + '/' + jpgfnm
   captions = list(data["caption"].loc[data["filename"]==jpgfnm].values)
   image_load = load_img(filename, target_size=target_size)
   ax = fig.add_subplot(npic,2,count,xticks=[],yticks=[])
   ax.imshow(image_load)
   count += 1

   ax = fig.add_subplot(npic,2,count)
   plt.axis('off')
   ax.plot()
   ax.set_xlim(0,1)
   ax.set_ylim(0,len(captions))
   for i, caption in enumerate(captions):
       ax.text(0,i,caption,fontsize=20)
   count += 1
plt.show()

The output is as follows:

Next, let's see how large our current vocabulary is:

vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

The output is as follows:

Vocabulary Size

Next, we perform some text cleaning, such as removing punctuation, single characters, and numeric values:

def remove_punctuation(text_original):
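   # str.translate needs a translation table; the third argument of
   # str.maketrans lists the characters to delete
   text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))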
   return(text_no_punctuation)

def remove_single_character(text):
   text_len_more_than1 = ""
   for word in text.split():
       if len(word) > 1:
           text_len_more_than1 += " " + word
   return(text_len_more_than1)

def remove_numeric(text):
   text_no_numeric = ""
   for word in text.split():
       isalpha = word.isalpha()
       if isalpha:
           text_no_numeric += " " + word
   return(text_no_numeric)

def text_clean(text_original):
   text = remove_punctuation(text_original)
   text = remove_single_character(text)
   text = remove_numeric(text)
   return(text)

# apply the cleaning to every caption; assigning the whole column avoids
# the pitfalls of chained assignment through .iloc
data["caption"] = data["caption"].apply(text_clean)
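
A quick way to sanity-check the punctuation step on a single string is shown below. It is a small standalone sketch, with `strip_punctuation` and the sample sentence chosen for illustration. It also shows that `str.translate` only deletes characters when it is given a table built by `str.maketrans`; passing the punctuation string itself leaves the text unchanged.

```python
import string

def strip_punctuation(text):
    # the third argument of str.maketrans lists the characters to delete
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

sample = "a dog, running fast!"
cleaned = strip_punctuation(sample)

# a plain string is not a deletion table, so nothing is removed here
unchanged = sample.translate(string.punctuation)
```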

Now let's look at the size of the vocabulary after cleaning:

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))

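The output is as follows:

Next, we save all captions and image paths in two lists so that we can load the images directly from the list of paths. We also add "<start>" and "<end>" tags to each caption so that the model knows where each caption begins and ends.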

PATH = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/"
all_captions = []
for caption  in data["caption"].astype(str):
   caption = '<start> ' + caption+ ' <end>'
   all_captions.append(caption)

all_captions[:10]

The output is as follows:

all_img_name_vector = []
for annot in data["filename"]:
   full_image_path = PATH + annot
   all_img_name_vector.append(full_image_path)

all_img_name_vector[:10]

The output is as follows:

Now you can see that we have 40455 image paths and captions.

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

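The output is as follows:

We will keep only 40000 of them so that the data divides evenly into batches: with a batch size of 64, 40000 examples give exactly 625 batches. To do this, we define a function that limits the dataset to 40000 images and captions.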

def data_limiter(num,total_captions,all_img_name_vector):
 train_captions, img_name_vector = shuffle(total_captions,all_img_name_vector,random_state=1)
 train_captions = train_captions[:num]
 img_name_vector = img_name_vector[:num]
 return train_captions,img_name_vector

train_captions,img_name_vector = data_limiter(40000,all_captions,all_img_name_vector)
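
The batch arithmetic behind the choice of 40000 can be checked directly. The sketch below uses integer arithmetic only; the variable names are illustrative, and the 20 percent validation share anticipates the 80-20 split used later in the article.

```python
num_examples = 40000
batch_size = 64
validation_percent = 20

total_batches = num_examples // batch_size            # full batches in the limited dataset
leftover = num_examples % batch_size                  # examples that would not fill a batch

val_examples = num_examples * validation_percent // 100
train_examples = num_examples - val_examples
steps_per_epoch = train_examples // batch_size        # batches per training epoch
train_leftover = train_examples % batch_size
```

With these numbers there are 625 full batches overall and 500 training steps per epoch, with no partial batch in either case.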

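Step 3: Model definition

Let's define the image feature extraction model using VGG16. Keep in mind that we do not need to classify the images here; we only need to extract a feature representation of each image. We therefore drop the classification layers at the top of the model, including the softmax layer (include_top=False). All images must first be preprocessed to the same size, 224×224, before they are fed into the model.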

def load_image(image_path):
   img = tf.io.read_file(image_path)
   img = tf.image.decode_jpeg(img, channels=3)
   img = tf.image.resize(img, (224, 224))
   img = preprocess_input(img)
   return img, image_path

image_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

image_features_extract_model.summary()

The output is as follows:

Next, let's map each image name to the function that loads the image:

encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

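We extract the features and store each of them in its own .npy file, and later pass those features through the encoder. A .npy file stores all the information needed to reconstruct the array on any machine, including its dtype and shape.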

%%time
for img, path in tqdm(image_dataset):
 batch_features = image_features_extract_model(img)
 batch_features = tf.reshape(batch_features,
                             (batch_features.shape[0], -1, batch_features.shape[3]))

 for bf, p in zip(batch_features, path):
   path_of_feature = p.numpy().decode("utf-8")
   np.save(path_of_feature, bf.numpy())
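
The reshape and the .npy round trip can be tried without running VGG16. The sketch below uses a synthetic array as a stand-in for one batch of convolutional features, assuming the (batch, 7, 7, 512) output shape implied by the 49 locations and 512 channels used later in the article, and it saves to an in-memory buffer instead of a file.

```python
import io
import numpy as np

# stand-in for one batch of VGG16 conv features: (batch, 7, 7, 512)
batch_features = np.arange(2 * 7 * 7 * 512, dtype=np.float32).reshape(2, 7, 7, 512)

# flatten the 7x7 grid into 49 locations, keeping the 512 channels
flat = batch_features.reshape(batch_features.shape[0], -1, batch_features.shape[3])

# round-trip one example through the .npy format (in memory here)
buffer = io.BytesIO()
np.save(buffer, flat[0])
buffer.seek(0)
restored = np.load(buffer)
```

The flattening is row-major, so location 8 in the flattened array is row 1, column 1 of the original grid, and the restored array comes back with the same shape and dtype.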

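Next, we tokenize the captions and build a vocabulary of all unique words in the data. We also limit the vocabulary to the top 5000 words to save memory, and replace out-of-vocabulary words with the token <unk>.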

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                 oov_token="<unk>",
                                                 filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')

tokenizer.fit_on_texts(train_captions)
# reserve index 0 for padding
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# convert the captions to integer sequences and pad them to a common length
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
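
To see what tokenizing and post-padding produce, here is a simplified pure-Python sketch of the same idea. It is not the Keras implementation: the functions `build_vocab` and `encode` are hypothetical, there is no lowercasing or character filtering, and the index assignment is only an approximation of how a top-k vocabulary with an unknown token behaves.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Map the most frequent words to integer ids; 0 is padding, 1 is <unk>."""
    counts = Counter(word for caption in captions for word in caption.split())
    word_index = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        word_index[word] = len(word_index)
    return word_index

def encode(captions, word_index):
    """Convert captions to id sequences and pad them at the end with zeros."""
    unk = word_index["<unk>"]
    seqs = [[word_index.get(w, unk) for w in c.split()] for c in captions]
    max_len = max(len(s) for s in seqs)
    # post-padding: append zeros so every sequence has the same length
    return [s + [0] * (max_len - len(s)) for s in seqs]

captions = ["<start> dog runs <end>", "<start> dog jumps over log <end>"]
word_index = build_vocab(captions, top_k=3)
padded = encode(captions, word_index)
```

With `top_k=3`, only the three most frequent words keep their own ids; every other word is mapped to the `<unk>` id, and the shorter caption is padded with zeros at the end.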

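Let's look at the training captions and their tokenized vectors:

train_captions[:3]

The output is as follows:

train_seqs[:3]

The output is as follows:

Next, we can compute the maximum and minimum length of all captions: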

def calc_max_length(tensor):
   return max(len(t) for t in tensor)
max_length = calc_max_length(train_seqs)

def calc_min_length(tensor):
   return min(len(t) for t in tensor)
min_length = calc_min_length(train_seqs)

print('Max Length of any caption : Min Length of any caption = '+ str(max_length) +" : "+str(min_length))

The output is as follows:

Next, create the training and validation sets with an 80-20 split:

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector, test_size=0.2, random_state=0)

Define the training parameters:

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
features_shape = 512
attention_features_shape = 49



def map_func(img_name, cap):
 img_tensor = np.load(img_name.decode('utf-8')+'.npy')
 return img_tensor, cap
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
        map_func, [item1, item2], [tf.float32, tf.int32]),
         num_parallel_calls=tf.data.experimental.AUTOTUNE)

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

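Next, let's focus on defining the encoder-decoder architecture. The architecture defined in this article is similar to the one described in the paper "Show and Tell: A Neural Image Caption Generator":

The VGG-16 encoder is defined as follows: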

class VGG16_Encoder(tf.keras.Model):
   # This encoder passes the features through a Fully connected layer
   def __init__(self, embedding_dim):
       super(VGG16_Encoder, self).__init__()
       # shape after fc == (batch_size, 49, embedding_dim)
       self.fc = tf.keras.layers.Dense(embedding_dim)
       self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)

   def call(self, x):
       #x= self.dropout(x)
       x = self.fc(x)
       x = tf.nn.relu(x)
       return x

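We define the RNN based on GPU/CPU availability:

def rnn_type(units):
   # tf.test.is_gpu_available() is deprecated; an empty device list means no GPU
   if tf.config.list_physical_devices('GPU'):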
       return tf.compat.v1.keras.layers.CuDNNLSTM(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
   else:
       return tf.keras.layers.GRU(units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_activation='sigmoid',
                                  recurrent_initializer='glorot_uniform')

Next, define the RNN decoder with Bahdanau attention:

'''The encoder output(i.e. 'features'), hidden state(initialized to 0)(i.e. 'hidden') and
the decoder input (which is the start token)(i.e. 'x') is passed to the decoder.'''

class Rnn_Local_Decoder(tf.keras.Model):
 def __init__(self, embedding_dim, units, vocab_size):
   super(Rnn_Local_Decoder, self).__init__()
   self.units = units
   self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
   self.gru = tf.keras.layers.GRU(self.units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_initializer='glorot_uniform')
  
   self.fc1 = tf.keras.layers.Dense(self.units)

   self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
   self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)

   self.fc2 = tf.keras.layers.Dense(vocab_size)

   # Implementing Attention Mechanism
   self.Uattn = tf.keras.layers.Dense(units)
   self.Wattn = tf.keras.layers.Dense(units)
   self.Vattn = tf.keras.layers.Dense(1)

 def call(self, x, features, hidden):
   # features shape ==> (64,49,256) ==> Output from ENCODER
   # hidden shape == (batch_size, hidden_size) ==>(64,512)
   # hidden_with_time_axis shape == (batch_size, 1, hidden_size) ==> (64,1,512)

   hidden_with_time_axis = tf.expand_dims(hidden, 1)

   # score shape == (64, 49, 1)
   # Attention Function
   '''e(ij) = f(s(t-1),h(j))'''
   ''' e(ij) = Vattn(T)*tanh(Uattn * h(j) + Wattn * s(t-1))'''

   score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))

   # self.Uattn(features) : (64,49,512)
   # self.Wattn(hidden_with_time_axis) : (64,1,512)
   # tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)) : (64,49,512)
   # self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis))) : (64,49,1) ==> score

   # you get 1 at the last axis because you are applying score to self.Vattn
   # Then find Probability using Softmax
   '''attention_weights(alpha(ij)) = softmax(e(ij))'''

   attention_weights = tf.nn.softmax(score, axis=1)

   # attention_weights shape == (64, 49, 1)
   # Give weights to the different pixels in the image
   ''' C(t) = Summation(j=1 to T) (attention_weights * VGG-16 features) '''

   context_vector = attention_weights * features
   context_vector = tf.reduce_sum(context_vector, axis=1)

   # Context Vector(64,256) = AttentionWeights(64,49,1) * features(64,49,256)
   # context_vector shape after sum == (64, 256)
   # x shape after passing through embedding == (64, 1, 256)

   x = self.embedding(x)
   # x shape after concatenation == (64, 1,  512)

   x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
   # passing the concatenated vector to the GRU

   output, state = self.gru(x)
   # shape == (batch_size, max_length, hidden_size)

   x = self.fc1(output)
   # x shape == (batch_size * max_length, hidden_size)

   x = tf.reshape(x, (-1, x.shape[2]))

   # Adding Dropout and BatchNorm Layers
   x= self.dropout(x)
   x= self.batchnormalization(x)

   # output shape == (64 * 512)
   x = self.fc2(x)

   # shape : (64 * 8329(vocab))
   return x, state, attention_weights

 def reset_state(self, batch_size):
   return tf.zeros((batch_size, self.units))


encoder = VGG16_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

Next, we define the loss function and the optimizer:

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
   from_logits=True, reduction='none')

def loss_function(real, pred):
 mask = tf.math.logical_not(tf.math.equal(real, 0))
 loss_ = loss_object(real, pred)
 mask = tf.cast(mask, dtype=loss_.dtype)
 loss_ *= mask

 return tf.reduce_mean(loss_)
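
The effect of the padding mask can be reproduced in NumPy. The sketch below is an illustrative reimplementation for one time step, not the TensorFlow code itself; the function name `masked_sparse_ce` and the toy inputs are assumptions.

```python
import numpy as np

def masked_sparse_ce(real, logits):
    """Cross-entropy from logits where positions with label 0 (<pad>) count as 0."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    per_example = -log_probs[np.arange(len(real)), real]
    mask = (real != 0).astype(per_example.dtype)
    # like the reduce_mean in the article, this divides by the full batch size,
    # so padded positions lower the average rather than being excluded from it
    return (per_example * mask).mean()

real = np.array([2, 0, 1])        # the middle label is padding
logits = np.zeros((3, 4))         # uniform prediction over 4 classes
loss = masked_sparse_ce(real, logits)
```

With uniform logits each unmasked example contributes log(4), so the result is 2·log(4)/3: the padded position adds nothing to the sum but still counts in the denominator.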

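Step 4: Model training

Next, let's define the training step. We use a technique called teacher forcing, which passes the target word to the decoder as its next input. This technique helps the model learn the correct sequence, or the correct statistical properties of the sequence, quickly.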

loss_plot = []

@tf.function
def train_step(img_tensor, target):
 loss = 0
 # initializing the hidden state for each batch
 # because the captions are not related from image to image

 hidden = decoder.reset_state(batch_size=target.shape[0])
 # use the actual batch size so the start tokens always match the hidden state
 dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

 with tf.GradientTape() as tape:
     features = encoder(img_tensor)
     for i in range(1, target.shape[1]):
         # passing the features through the decoder
         predictions, hidden, _ = decoder(dec_input, features, hidden)
         loss += loss_function(target[:, i], predictions)

         # using teacher forcing
         dec_input = tf.expand_dims(target[:, i], 1)

 total_loss = (loss / int(target.shape[1]))
 trainable_variables = encoder.trainable_variables + decoder.trainable_variables
 gradients = tape.gradient(loss, trainable_variables)
 optimizer.apply_gradients(zip(gradients, trainable_variables))

 return loss, total_loss
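
Teacher forcing is easy to see on a single caption. The sketch below lists, for each decoding step, the token fed to the decoder and the token it is expected to predict; the helper `teacher_forcing_pairs` and the example caption are illustrative only.

```python
def teacher_forcing_pairs(target):
    """Return (decoder_input, expected_output) for each decoding step."""
    # the input at step i is the ground-truth token from step i-1,
    # regardless of what the model actually predicted
    return [(target[i - 1], target[i]) for i in range(1, len(target))]

caption = ["<start>", "two", "dogs", "run", "<end>"]
pairs = teacher_forcing_pairs(caption)
```

This matches the loop in the training step: decoding starts from `<start>`, the loss at step i compares the prediction with token i, and token i then becomes the next input.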

Next, we train the model:

EPOCHS = 20
start_epoch = 0  # train from scratch; no checkpoint is restored in this article
for epoch in range(start_epoch, EPOCHS):
   start = time.time()
   total_loss = 0

   for (batch, (img_tensor, target)) in enumerate(dataset):
       batch_loss, t_loss = train_step(img_tensor, target)
       total_loss += t_loss

       if batch % 100 == 0:
           print ('Epoch {} Batch {} Loss {:.4f}'.format(
             epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
   # storing the epoch end loss value to plot later
   loss_plot.append(total_loss / num_steps)

   print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                        total_loss/num_steps))

   print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Let's plot the loss curve:

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

The output is as follows:

Step 5: Greedy search and BLEU evaluation

Let's define the greedy method for generating captions:

def evaluate(image):
   attention_plot = np.zeros((max_length, attention_features_shape))

   hidden = decoder.reset_state(batch_size=1)
   temp_input = tf.expand_dims(load_image(image)[0], 0)
   img_tensor_val = image_features_extract_model(temp_input)
   img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

   features = encoder(img_tensor_val)
   dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
   result = []

   for i in range(max_length):
       predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
       attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()
       predicted_id = tf.argmax(predictions[0]).numpy()
       result.append(tokenizer.index_word[predicted_id])

       if tokenizer.index_word[predicted_id] == '<end>':
           return result, attention_plot

       dec_input = tf.expand_dims([predicted_id], 0)
   attention_plot = attention_plot[:len(result), :]

   return result, attention_plot
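
Stripped of the model, greedy search is a short loop: take the highest-scoring token, feed it back in, and stop at the end token or the length limit. The sketch below shows this with a hypothetical lookup table of next-token scores standing in for the decoder; `greedy_decode` and `toy_model` are names invented for this illustration.

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Pick the highest-scoring token at every step until end_token or max_length."""
    token, result = start_token, []
    for _ in range(max_length):
        scores = step_fn(token)                  # dict: next token -> score
        token = max(scores, key=scores.get)      # argmax, with no look-ahead
        result.append(token)
        if token == end_token:
            break
    return result

# hypothetical next-token scores standing in for the decoder
toy_model = {
    "<start>": {"two": 0.6, "a": 0.4},
    "two": {"dogs": 0.7, "cats": 0.3},
    "dogs": {"run": 0.5, "<end>": 0.45, "two": 0.05},
    "run": {"<end>": 0.9, "two": 0.1},
}
caption = greedy_decode(toy_model.__getitem__, "<start>", "<end>", max_length=10)
```

Because each choice is final, greedy search is fast but can miss a caption whose first word scores slightly lower and whose continuation scores much higher; beam search is the usual alternative when that matters.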

We also define a function that plots the attention map for each generated word, as we saw in the introduction:

def plot_attention(image, result, attention_plot):
   temp_image = np.array(Image.open(image))
   fig = plt.figure(figsize=(10, 10))
   len_result = len(result)
   for l in range(len_result):
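       # the 49 attention weights correspond to VGG16's 7x7 feature grid
       temp_att = np.resize(attention_plot[l], (7, 7))
       # a square grid large enough for every word, at least 2x2
       grid_size = max(int(np.ceil(len_result / 2)), 2)
       ax = fig.add_subplot(grid_size, grid_size, l+1)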
       ax.set_title(result[l])
       img = ax.imshow(temp_image)
       ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

   plt.tight_layout()
   plt.show()
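
Each decoding step produces 49 attention weights, one per location of the 7×7 feature grid, so the natural way to view them as an image is a 7×7 reshape. The short NumPy sketch below, with a made-up weight vector, shows the mapping and why the target shape matters: `np.resize` to a larger shape repeats values instead of raising an error.

```python
import numpy as np

attention_features_shape = 49                    # 7 x 7 locations
weights = np.zeros(attention_features_shape)
weights[1 * 7 + 2] = 1.0                         # all attention on row 1, column 2

grid = weights.reshape(7, 7)                     # row-major, same order as the flattening
row, col = np.unravel_index(grid.argmax(), grid.shape)

# resizing 49 values into 64 slots wraps around and repeats the start of the data
padded = np.resize(weights, (8, 8))
```

In the 7×7 grid the peak sits at (1, 2) as intended, while the 8×8 version places it at (1, 1) and shows a second, spurious peak from the repeated values.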

Finally, let's generate a caption for the image from the beginning of the article and see what the attention mechanism focuses on as it generates each word:

# caption for the image shown at the start of the article
image = '/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/2319175397_3e586cfaf8.jpg'
real_caption = 'Two white dogs are playing in the snow'

result, attention_plot = evaluate(image)

#remove "<unk>" in result (build a new list rather than removing while iterating)
result = [word for word in result if word != "<unk>"]

#remove <end> from result
result_join = ' '.join(result)
result_final = result_join.rsplit(' ', 1)[0]

# sentence_bleu takes a list of tokenized references and a tokenized candidate;
# the reference is lowercased because the training captions were lowercased
reference = [real_caption.lower().split()]
candidate = result_final.split()

score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")

print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
plot_attention(image, result, attention_plot)

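The output is as follows:

You can see that we were able to generate a caption that matches the real caption. Let's try some other images from the test set.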

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
start = time.time()
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]

#remove "<unk>" in result (build a new list rather than removing while iterating)
result = [word for word in result if word != "<unk>"]

#remove <end> from result
result_join = ' '.join(result)
result_final = result_join.rsplit(' ', 1)[0]

# tokenized reference and candidate, in the form sentence_bleu expects
reference = [real_caption.split()]
candidate = result_final.split()

print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)

plot_attention(image, result, attention_plot)
print(f"time took to Predict: {round(time.time()-start)} sec")

Image.open(img_name_val[rid])

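The output is as follows:

You can see that even though our caption is quite different from the real caption, it is still very accurate. It was able to identify the woman's yellow shirt and her hands in her pockets.

Let's look at another one: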

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]

real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

# remove <start> and <end> from the real_caption
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]

#remove "<unk>" in result and in the real caption
result = [word for word in result if word != "<unk>"]
real_caption = ' '.join(word for word in real_caption.split() if word != "<unk>")

#remove <end> from result
result_join = ' '.join(result)
result_final = result_join.rsplit(' ', 1)[0]

# sentence_bleu takes a list of tokenized references and a tokenized candidate
reference = [real_caption.split()]
candidate = result_final.split()

score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")

print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)

plot_attention(image, result, attention_plot)

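Here we can see that our caption describes the image better than one of the real captions.

And there you have it! We have successfully implemented the attention mechanism for generating image captions.

What's next?

Attention mechanisms have been used heavily in recent years, and this is only the beginning of more advanced systems. Things you can implement to improve the model:

Use a larger dataset, especially the MS COCO dataset or the Stock3M dataset, which is 26 times larger than MS COCO
Implement different attention mechanisms, such as Adaptive Attention with Visual Sentinel, and Semantic Attention
Implement a Transformer-based model, which should perform much better than an LSTM
Implement a better architecture for image feature extraction, such as Inception, Xception, or EfficientNet

End Notes

This was a very interesting look at the attention mechanism and how it applies to deep learning applications. A great deal of research has gone into attention mechanisms and into achieving state-of-the-art results. Be sure to try some of my suggestions to improve the performance of the caption generator, and share your results with me! Did you find this article helpful? Please share your valuable feedback in the comments section below, and feel free to share your complete code notebooks, which will be helpful to our community members.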

Please credit the source when reposting. Link to this article: https://www.uj5u.com/qita/258136.html
