Re-thinking Co-Salient Object Detection

Original document: https://www.yuque.com/lart/papers/feumut

A recent survey of CoSOD. It reviews the methods in the field, proposes a dataset, and, building on the CVPR version, further proposes a new method.

CoSOD

What it is

As an extension of this, co-salient object detection (CoSOD) emerged recently to employ a set of images.

The goal of CoSOD is to extract the salient object(s) that are common

within a single image (e.g., red-clothed football players in Fig. 1 (b))

or across multiple images (e.g., the blue-clothed gymnast in Fig. 1 (c)).

Two important characteristics of co-salient objects are local saliency and global similarity.

Applications

collection-aware crops
- Cosaliency: Where people look when comparing images
co-segmentation
- Higher-order image co-segmentation
- Object-Based Multiple Foreground Video Co-Segmentation via Multi-State Selection Graph
weakly supervised learning
- Capsal: Leveraging captioning to boost semantics for salient object detection
image retrieval
- A model of visual attention for natural image retrieval
- Salientshape: group saliency in image collections
video foreground detection
- Cluster-based co-saliency detection

Existing datasets

MSRC [Object categorization by learned universal visual dictionary] and Image Pair [A co-saliency model of image pairs] are two of the earliest ones.
1. MSRC was designed for recognizing object classes from images and has spurred many interesting ideas over the past several years. This dataset includes 8 image groups and 240 images in total, with manually annotated pixel-level ground-truth data.
2. Image Pair, introduced by Li et al. [29], was specifically designed for image pairs and contains 210 images (105 groups) in total.
The iCoSeg [icoseg: Interactive co-segmentation with intelligent scribble guidance] dataset was released in 2010. It is a relatively larger dataset consisting of 38 categories with 643 images in total.
1. Each image group in this dataset contains 4 to 42 images, rather than only 2 images as in the Image Pair dataset.
The THUR15K [Salientshape: group saliency in image collections] and CoSal2015 [Co-saliency detection via looking deep and wide] are two large-scale publicly available datasets, with CoSal2015 widely used for assessing CoSOD algorithms.
Different from the above-mentioned datasets, the WICOS [Co-saliency detection within a single image] dataset aims to detect co-salient objects from a single image, where each image can be viewed as one group.

Existing problems

Although the aforementioned datasets have advanced the CoSOD task to various degrees, they are severely limited in variety, with only dozens of groups. On such small-scale datasets, the scalability of methods cannot be fully evaluated.
Moreover, these datasets only provide object-level labels. **None of them provide rich annotations such as bounding boxes, instances, etc.,** which are important for progressing many vision tasks and multi-task modeling, especially in the current deep learning era, where models are often data-hungry.
Most CoSOD datasets tend to focus on the appearance-similarity between objects to identify the co-salient object across multiple images. However, this leads to data selection bias [Salient objects in clutter: Bringing salient object detection to the foreground], [Unbiased look at dataset bias] and is not always appropriate, since, in real-world applications, the salient objects in a group of images often vary in terms of texture, scene, and background, even if they belong to the same category.

Evaluation of CoSOD

Limitations of existing evaluation practice

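Completeness: it is recommended to introduce more metrics, such as S-measure and E-measure.
Fairness: F-measure requires binarized predictions, and different binarization strategies lead to different results, so a shared benchmark codebase is needed for evaluation.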
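
To make the fairness point concrete, here is a small sketch of how two binarization strategies can give different F-measure values for the same prediction. It is not the benchmark's code. The function name `f_measure` and the toy data are hypothetical; `beta2 = 0.3` and the adaptive threshold of twice the mean prediction value follow the convention commonly used in SOD evaluation and are assumptions here.

```python
import numpy as np

def f_measure(pred, gt, threshold, beta2=0.3):
    """F-measure of a [0, 1] saliency map binarized at `threshold`."""
    binary = pred >= threshold
    tp = np.logical_and(binary, gt).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta2) * precision * recall / denom

# Hypothetical toy example: a square object and a noisy prediction of it.
rng = np.random.default_rng(0)
gt = np.zeros((32, 32), dtype=bool)
gt[8:24, 8:24] = True
pred = np.clip(0.6 * gt + 0.4 * rng.random((32, 32)), 0.0, 1.0)

# Strategy 1: a fixed threshold. Strategy 2: an adaptive threshold (assumed
# here to be twice the mean prediction value, capped at 1).
fixed = f_measure(pred, gt, threshold=0.5)
adaptive = f_measure(pred, gt, threshold=min(2 * pred.mean(), 1.0))
print(fixed, adaptive)
```

The same prediction map gets two different scores, which is why numbers reported with different binarization choices are not directly comparable.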

To address the aforementioned limitations, we argue that integrating various publicly available CoSOD algorithms, datasets, and metrics, and then providing a complete, unified benchmark, is highly desired.

Differences between CoSOD and SOD evaluation

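CoSOD involves grouping: the metrics are computed per group (the objects that commonly appear across the images in a group are usually the co-salient objects). One detail deserves attention here: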

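For metrics that directly yield a single value (such as MAE, S-measure, weighted F-measure, adaptive F-measure, and adaptive E-measure), the values are averaged within each group, and the group results are then averaged once more over all groups.
For metrics computed by sweeping a threshold (such as max F-measure, mean F-measure, max E-measure, and mean E-measure), the results within each group are averaged into a length-256 sequence, and these sequences are then averaged over all groups. Taking the maximum or the mean of the final length-256 sequence gives the corresponding metric value.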
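
The two aggregation rules above can be sketched as follows. This is an illustration under assumptions, not the code from any of the toolboxes mentioned in this note: the function names, the group names, and the toy numbers are all hypothetical, and the per-image scores and 256-point curves are assumed to have been computed already.

```python
import numpy as np

def aggregate_scalar(per_group_scores):
    """Mean over the images of each group, then mean over the groups."""
    group_means = [np.mean(scores) for scores in per_group_scores.values()]
    return float(np.mean(group_means))

def aggregate_curve(per_group_curves):
    """Average the 256-point curves within each group, then across groups."""
    group_curves = [np.mean(np.asarray(c), axis=0) for c in per_group_curves.values()]
    return np.mean(group_curves, axis=0)  # shape (256,)

# Hypothetical toy data: two groups of different sizes.
mae = {"banana": [0.10, 0.20, 0.30], "car": [0.40]}
overall_mae = aggregate_scalar(mae)  # mean of the group means 0.2 and 0.4

rng = np.random.default_rng(0)
f_curves = {"banana": rng.random((3, 256)), "car": rng.random((1, 256))}
curve = aggregate_curve(f_curves)
# Max and mean are taken only at the very end, on the group-averaged curve.
max_f, mean_f = float(curve.max()), float(curve.mean())
```

With groups of unequal size the result differs from a plain per-image average: here the group-wise MAE is 0.3, while averaging all four images directly would give 0.25.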

For the precise definition of each metric, see my Python code or the MATLAB code provided by Fan.

https://github.com/lartpang/PySODMetrics

https://github.com/DengPingFan/CODToolbox

Note that the links above point to metric code for the SOD or COD tasks.

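Because CoSOD metrics are computed per group, this code needs to be adapted. See the separate CoSOD evaluation code provided by Fan for details. Its set of metrics is not comprehensive and the code still contains some errors (the same errors pointed out here: https://github.com/DengPingFan/CODToolbox/issues), but the computation logic is a useful reference: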

http://dpfan.net/wp-content/uploads/CoSalBenchmark-EvaluationTools.zip

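I have recently put together a Python implementation, not yet public, that covers more metrics (judging from this paper, the SOD metrics can all be applied to CoSOD) and runs faster.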

My thoughts on speeding up the E-measure computation can be found in the following two posts:

How I sped up the computation by 25.6x: https://www.yuque.com/lart/blog/aemqfz

How I sped up the computation by more than 150x: https://www.yuque.com/lart/blog/lwgt38

Contributions of the paper

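Proposes the CoSOD3k dataset, which contains 13 super-classes, 160 groups, and 3,316 images.
Surveys 34 related works, evaluates 16 models, and provides a set of evaluation code.
Proposes a simple and effective CoSOD framework that handles CoSOD effectively by building on existing SOD methods.
Analyzes the results and offers some suggestions for future work.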

CoSOD3k

Figures and tables are more direct than reading a textual analysis.

Statistics of data attributes across datasets; the dataset proposed in the paper provides rich annotation types

Statistics of object attributes across datasets

Category statistics of CoSOD3k

The overall dataset mask (the right of Fig. 7) tends to appear as a center-biased map without shape bias. As is well-known, humans are usually inclined to pay more attention to the center of a scene when taking a photo. Thus, it is easy for a SOD model to achieve a high score when employing a Gaussian function in its algorithm.

CoEG-Net

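The paper proposes a two-branch framework that captures concurrent dependencies and salient foregrounds separately, in a multiply independent fashion. The final co-saliency prediction is produced by element-wise multiplication of the co-attention maps from the top branch and the saliency prior maps from the bottom branch.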

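The bottom saliency branch is fairly simple: it directly uses EGNet trained on DUTS to collect multi-scale saliency priors. This helps identify salient regions in an image without using cross-image information.
The top branch generates the co-attention map in an unsupervised manner. This part deserves a closer look.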
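
The element-wise fusion of the two branches can be illustrated with a toy sketch. The function name `fuse` and the 2x2 maps are hypothetical, and both maps are assumed to be already resized to the same shape with values in [0, 1]; nothing here reproduces the authors' code.

```python
import numpy as np

def fuse(co_attention, saliency_prior):
    """Element-wise product of two maps of the same shape."""
    if co_attention.shape != saliency_prior.shape:
        raise ValueError("maps must have the same spatial size")
    return co_attention * saliency_prior

# Top row: regions shared across the group. Left column: salient regions.
co_attention = np.array([[1.0, 1.0], [0.0, 0.0]])
saliency_prior = np.array([[1.0, 0.0], [1.0, 0.0]])
co_saliency = fuse(co_attention, saliency_prior)
```

Only the top-left pixel survives, since it is the only one that is both common to the group and salient in the image. This matches the two characteristics named earlier: global similarity and local saliency.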

Co-attention Projection for Co-saliency Learning

The design here is inspired by CAM [Learning deep features for discriminative localization]:

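Given an input image \(\mathbf{I}^n\) whose image category (keyword label) is \(c\)
Obtain the feature activation maps \(\mathbf{X}^n\) from the last convolutional layer of VGG
Through class supervision (for example, from the parameters of the fully connected layer of the classification task), obtain for \(c\) the weights \(\omega^c\) over the channels of the convolutional feature activations
The final class-specific attention map is then \(\mathbf{M}^n_c=\sum^K_{k=1}\omega^c_k\mathbf{X}^n_k\), where \(\mathbf{X}^n_k\) is the \(k\)-th channel of \(\mathbf{X}^n\)
For each location on the feature map \(\mathbf{X}^n\), this can be written more concretely as \(\mathbf{M}^n_c(i, j)=(\omega^c)^\top \cdot \mathbf{x}^n(i, j)\)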

CAM therefore amounts to a linear transformation from the feature \(\mathbf{x}^n(i, j)\) to the class-specific activation map \(\mathbf{M}^n_c(i, j)\).
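
The CAM computation above is just a weighted sum over channels, which a few lines of NumPy can show. This is an illustrative sketch: the function name `class_activation_map`, the feature shape (512, 14, 14), and the random weights are assumptions made for the example, not values from the paper.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """features: (K, H, W) conv activations; class_weights: (K,) weights for class c."""
    # M_c(i, j) = sum_k w_k^c * X_k(i, j) = (w^c)^T x(i, j)
    return np.tensordot(class_weights, features, axes=([0], [0]))

rng = np.random.default_rng(0)
X = rng.random((512, 14, 14))    # hypothetical last-layer activations
w_c = rng.standard_normal(512)   # hypothetical classifier weights for class c
M_c = class_activation_map(X, w_c)

# The per-location form gives the same value as the channel-wise sum.
per_location = w_c @ X[:, 3, 5]
```

Both forms agree: `M_c[3, 5]` equals the dot product of `w_c` with the descriptor at location (3, 5), which is the linear projection described above.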

The paper follows this idea and, since it has no class labels, explores a further unsupervised variant.

The authors give their own analysis:

Ideally, the unknown common object category among a group of associated images \(\{\mathbf{I}^n\}^N_{n=1}\) should correspond to a linear projection that results in high class activation scores in the common object regions, while having low class activation scores in other image regions.

From another point of view, the common object category should correspond to the linear transformation that generates the highest variance (most informative) in the resulting class activation maps.

Following the idea in the coarse localization task [Unsupervised object discovery and co-localization by deep descriptor transformation], we achieve this goal by exploring the classical principal component analysis (PCA) method [LIII. On lines and planes of closest fit to systems of points in space], which is the simplest way of revealing the internal structure of the data in a way that best explains the variance in the data.

I find this explanation a bit of a stretch. The logic does not seem fully coherent: high class activation scores =?> the highest variance (most informative)

Next comes a review of PCA:

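Given \(\{\mathbf{I}^n\}\), we can obtain \(\{\mathbf{X}^n\}\)
The aim is a transformation that maps \(\{\mathbf{X}^n\}\) to co-attention maps \(\{\mathbf{A}^n\}\) with maximum variance. Note that this is a set of results. The transformation is obtained by analyzing the covariance matrix of the feature descriptors \(\{\mathbf{x}^n(i, j)\}\)
Compute the mean: \(\bar{\mathbf{x}} = \frac{1}{Z}\sum_n\sum_{i, j}\mathbf{x}^n(i, j)\), where \(Z = N \times H \times W\) is the total number of descriptors
Subtract the mean from \(\mathbf{x}^n(i, j)\) to obtain the zero-mean descriptors \(\hat{\mathbf{x}}^n(i, j)\)
Then obtain the covariance matrix: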

(This is how the paper gives it, but why subtract the mean again?)

The linear transformation is obtained from the eigenvector of Cov associated with its largest eigenvalue:

Here \(\xi^*\) denotes that eigenvector
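
The PCA steps can be sketched in NumPy as follows. This is an illustration with labeled assumptions, not the authors' implementation: the function name `pca_co_attention` and the toy feature shape are hypothetical, the descriptors are centered once, and each map is assumed to be the projection of the centered descriptors onto the leading eigenvector.

```python
import numpy as np

def pca_co_attention(features):
    """features: (N, K, H, W) conv activations for a group of N images.

    Returns (N, H, W) maps: the projection of every zero-mean descriptor
    onto the first principal component of the whole group.
    """
    n, k, h, w = features.shape
    # One K-dimensional descriptor per location: (Z, K) with Z = N * H * W.
    descriptors = features.transpose(0, 2, 3, 1).reshape(-1, k)
    centered = descriptors - descriptors.mean(axis=0)
    cov = centered.T @ centered / centered.shape[0]   # (K, K)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    xi = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue
    return (centered @ xi).reshape(n, h, w)

rng = np.random.default_rng(0)
feats = rng.random((4, 16, 7, 7))   # hypothetical group of 4 images
maps = pca_co_attention(feats)
print(maps.shape)
```

The sign of an eigenvector is arbitrary, so the map may come out inverted. A real pipeline needs some convention to decide which sign marks the common object, and this sketch does not address that.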

Visualization results

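Note that the resulting attention maps are grayscale and highly ambiguous. To integrate them with the saliency prior map already produced by EGNet, they must first be processed; the paper uses DenseCRF and manifold ranking for further refinement.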

Experimental results

Experiments based on other SOD methods were also tried

Discussion and suggestions

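Strong performance by SOD methods does not necessarily mean that the current datasets are not complex enough, or that applying SOD methods directly yields good performance: From the evaluation, we observe that, in most cases, the current SOD methods can obtain very competitive or even better performances than the CoSOD methods. However, this does not necessarily mean that the current datasets are not complex enough or using the SOD methods directly can obtain the good performances; the performances of the SOD methods on the CoSOD datasets are actually lower than those on the SOD datasets.
Research on CoSOD still has open problems: Consequently, the evaluation results reveal that many problems in CoSOD are still under-studied and this makes the existing CoSOD models less effective.
- Scalability: Existing methods struggle to process larger groups at once. Reducing the computational cost caused by the number of images in a group is a key issue for practical applications.
- Stability: Some methods depend on the order of the samples within an input group, which hurts the stability of model performance (performance may change if the order changes or the group is split into different sub-groups). This limits practical use.
- Compatibility: The paper shows that introducing SOD methods into a CoSOD framework is effective, but how to achieve more time-efficient, end-to-end trainable detection remains worth studying.
- Metrics: Existing metrics mainly evaluate predictions on single images and do not consider evaluating predictions across images.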
