schmmd/611f091db2785f855f181d22886f2c7f4c4afeba.json

## 611f091db2785f855f181d22886f2c7f4c4afeba.json
{
  "sections": [
    {
      "heading": "1 Introduction",
      "text": "Most vertebrates, including humans, can move their eyes. They use this ability to sample in detail the most relevant features of a scene, while spending only limited processing resources elsewhere. The ability to predict, given an image (or video), where a human might xate in a xed-time freeviewing scenario has long been of interest in the vision community. Besides the purely scienti c goal of understanding this remarkable behavior of humans, and animals in general, to consistently xate on \"important\" information, there is tremendous engineering application, e.g. in compression and recognition [13]. The standard approaches (e.g., [2], [9]) are based on biologically motivated feature selection, followed by center-surround operations which highlight local gradients, and nally a combination step leading to a \"master map\". Recently, Bruce [5] and others [4] have hypothesized that fundamental quantities such as \"self-information\" and \"surprise\" are at the heart of saliency/attention. However, ultimately, Bruce computes a function which is additive in feature maps, with the main contribution materializing as a method of operating on a feature map in such a way to get an activation, or saliency, map. Itti and Baldi de ne \"surprise\" in general, but ultimately compute a saliency map in the classical [2] sense for each of a number of feature channels, then operate on these maps using another function aimed at highlighting local variation. By organizing the topology of these varied approaches, we can compare them more rigorously: i.e., not just endto-end, but also piecewise, removing some uncertainty about the origin of observed performance differences.\nThus, the leading models of visual saliency may be organized into the these three stages:\n(s1) extraction: extract feature vectors at locations over the image plane\n(s2) activation: form an \"activation map\" (or maps) using the feature vectors\n(s3) normalization/combination: normalize the activation map (or maps, followed by a combination of the maps into a single map)\nIn this light, [5] is a contribution to step (s2), whereas [4] is a contribution to step (s3). In the classic algorithms, step (s1) is done using biologically inspired lters, step (s2) is accomplished by subtracting feature maps at different scales (henceforth, \"c-s\" for \"center\" - \"surround\"), and step (s3) is accomplished in one of three ways: 1. a normalization scheme based on local maxima\n[2] ( \"max-ave\"), 2. an iterative scheme based on convolution with a difference-of-gaussians lter (\"DoG\"), and 3. a nonlinear interactions (\"NL\") approach which divides local feature values by weighted averages of surrounding values in a way that is modelled to t psychophysics data [11].\nWe take a different approach, exploiting the computational power, topographical structure, and parallel nature of graph algorithms to achieve natural and ef cient saliency computations. We de ne Markov chains over various image maps, and treat the equilibrium distribution over map locations as activation and saliency values. This idea is not completely new: Brockmann and Geisel [8] suggest that scanpaths might be predicted by properly de ned Levy ights over saliency elds, and more recently Boccignone and Ferraro [7] do the same. Importantly, they assume that a saliency map is already available, and offer an alternative to the winner-takes-all approach of mapping this object to a set of xation locations. In an unpublished pre-print, L.F. Costa [6] notes similar ideas, however offers only sketchy details on how to apply this to real images, and in fact includes no experiments involving xations. Here, we take a uni ed approach to steps (s2) and (s3) of saliency computation, by using dissimilarity and saliency to de ne edge weights on graphs which are interpreted as Markov chains. Unlike previous authors, we do not attempt to connect features only to those which are somehow similar. We also directly compare our method to others, using power to predict human xations as a performance metric.\nThe contributions of this paper are as follows:\n(1) A complete bottom-up saliency model based on graph computations, GBVS, including a framework for \"activation\" and \"normalization/combination\".\n(2) A comparison of GBVS against existing benchmarks on a data set of grayscale images of natural environments (viz., foliage) with the eye-movement xation data of seven human subjects, from a recent study by Einhäuser et. al. [1]."
    },
    {
      "heading": "2 The Proposed Method: Graph-Based Saliency (GBVS)",
      "text": "Given an image I , we wish to ultimately highlight a handful of `signi cant' locations where the image is `informative' according to some criterion, e.g. human xation. As previously explained, this process is conditioned on rst computing feature maps (s1), e.g. by linear ltering followed by some elementary nonlinearity [15]. \"Activation\" (s2), \"normalization and combination\" (s3) steps follow as described below."
    },
    {
      "heading": "2.1 Forming an Activation Map (s2)",
      "text": "Suppose we are given a feature map1 M : [n]2 ! R. Our goal is to compute an activation map A : [n]2 ! R, such that, intuitively, locations (i; j) 2 [n]2 where I , or as a proxy, M(i; j); is somehow unusual in its neighborhood will correspond to high values of activation A."
    },
    {
      "heading": "2.1.1 Existing Schemes",
      "text": "Of course \"unusual\" does not constrain us suf ciently, and so one can choose several operating de nitions. \"Improbable\" would lead one to the formulation of Bruce [5], where a histogram of M(i; j) values is computed in some region around (i; j), subsequently normalized and treated as a probability distribution, so that A(i; j) = log(p(i; j)) is clearly de ned with p(i; j) = PrfM(i; j)jneighborhoodg: Another approach compares local \"center\" distributions to broader \"surround\" distributions and calls the Kullback-Leibler tension between the two \"surprise\" [4].\n1in the context of a mathematical formulation, let [n] , f1; 2; :::; ng. Also, the maps M , and later A, are presented as square (n n) only for expository simplicity. Nothing in this paper will depend critically on the square assumtion, and, in practice, rectangular maps are used instead."
    },
    {
      "heading": "2.1.2 A Markovian Approach",
      "text": "We propose a more organic (see below) approach. Let us de ne the dissimilarity of M(i; j) and M(p; q) as\nd((i; j)jj(p; q)) , log M(i; j)M(p; q)\n: This is a natural de nition of dissimilarity: simply the distance between one and the ratio of two quantities, measured on a logarithmic scale. For some of our experiments, we use jM(i; j) M(p; q)j instead, and we have found that both work well. Consider now the fully-connected directed graph GA, obtained by connecting every node of the lattice M , labelled with two indices (i; j) 2 [n]2, with all other n 1 nodes. The directed edge from node (i; j) to node (p; q) will be assigned a weight\nw1((i; j); (p; q)) , d((i; j)jj(p; q)) F (i p; j q), where\nF (a; b) , exp a\n2 + b2\n2 2\n:\nis a free parameter of our algorithm2. Thus, the weight of the edge from node (i; j) to node (p; q) is proportional to their dissimilarity and to their closeness in the domain of M . Note that the edge in the opposite direction has exactly the same weight. We may now de ne a Markov chain on GA by normalizing the weights of the outbound edges of each node to 1, and drawing an equivalence between nodes & states, and edges weights & transition probabilities . The equilibrium distribution of this chain, re ecting the fraction of time a random walker would spend at each node/state if he were to walk forever, would naturally accumulate mass at nodes that have high dissimilarity with their surrounding nodes, since transitions into such subgraphs is likely, and unlikely if nodes have similarM values. The result is an activation measure which is derived from pairwise contrast.\nWe call this approach \"organic\" because, biologically, individual nodes (neurons) exist in a connected, retinotopically organized, network (the visual cortex), and communicate with each other (synaptic ring) in a way which gives rise to emergent behavior, including fast decisions about which areas of a scene require additional processing. Similarly, our approach exposes connected (via F ) regions of dissimilarity (via w), in a way which can in principle be computed in a completely parallel fashion. Computations can be carried out independently at each node: in a synchronous environment, at each time step, each node simply sums incoming mass, then passes along measured partitions of this mass to its neighbors according to outbound edge weights. The same simple process happening at all nodes simultaneously gives rise to an equilibrium distribution of mass.\nTechnical Notes The equilibrium distribution of this chain exists and is unique because the chain is ergodic, a property which emerges from the fact that our underlying graph GA is by construction strongly connected. In practice, the equilibrium distribution is computed using repeated multiplication of the Markov matrix with an initially uniform vector. The process yields the principal eigenvector of the matrix. The computational complexity is thus O(n4K) where K n2 is some small number of iterations required to meet equilibrium3."
    },
    {
      "heading": "2.2 \"Normalizing\" an Activation Map (s3)",
      "text": "The aim of the \"normalization\" step of the algorithm is much less clear than that of the activation step. It is, however, critical and a rich area of study. Earlier, three separate approaches were mentioned as existing benchmarks, and also the recent work of Itti on surprise [4] comes into the saliency computation at this stage of the process (although it can also be applied to s2 as mentioned above). We shall state the goal of this step as: concentrating mass on activation maps. If mass is not concentrated on individual activation maps prior to additive combination, then the resulting master map may be too nearly uniform and hence uninformative. Although this may seem trivial, it is on some level the very soul of any saliency algorithm: concentrating activation into a few key locations.\n2In our experiments, this parameter was set to approximately one tenth to one fth of the map width. Results were not very sensitive to perturbations around these values.\n3Our implementation, not optimized for speed, converges on a single map of size 25 37 in fractions of a second on a 2.4 GHz Pentium.\nArmed with the mass-concentration de nition, we propose another Markovian algorithm as follows: This time, we begin with an activation map4 A : [n]2 ! R, which we wish to \"normalize\". We construct a graph GN with n2 nodes labelled with indices from [n]2. For each node (i; j) and every node (p; q) (including (i; j)) to which it is connected, we introduce an edge from (i; j) to (p; q) with weight:\nw2((i; j); (p; q)) , A(p; q) F (i p; j q): Again, normalizing the weights of the outbound edges of each node to unity and treating the resulting graph as a Markov chain gives us the opportunity to compute the equilibrium distribution over the nodes5. Mass will ow preferentially to those nodes with high activation. It is a mass concentration algorithm by construction, and also one which is parallelizable, as before, having the same natural advantages. Experimentally, it seems to behave very favorably compared to the standard approaches such as \"DoG\" and \"NL\"."
    },
    {
      "heading": "3 Experimental Results",
      "text": ""
    },
    {
      "heading": "3.1 Preliminaries and paradigm",
      "text": "We perform saliency computations on real images of the natural world, and compare the power of the resulting maps to predict human xations. The experimental paradigm we pursue is the following: for each of a set of images, we compute a set of feature maps using standard techniques. Then, we proccess each of these feature maps using some activation algorithm, and then some normalization algorithm, and then simply sum over the feature channels. The resulting master saliency map is scored (using an ROC area metric described below) relative to xation data collected for the corresponding image, and labelled according to the activation and normalization algorithms used to obtain it. We then pool over a corpus of images, and the resulting set of scored and labelled master saliency maps is analyzed in various ways presented below. Some notes follow:\nAlgorithm Labels: Hereafter, \"graph (i)\" and \"graph (ii)\" refer to the activation algorithm described in section 2.1.2. The difference is that in graph (i), the parameter = 2:5, whereas in graph (ii), = 5. \"graph (iii)\" and \"graph (iv)\" refer to the an iterated repitition of the normalization algorithm described in section 2.2. The difference is the termination rule associated with the iterative process: for graph (iii), a complicated termination rule is used which looks for a local maximum in the number of matrix multiplications required to achieve a stable equilibrium distribution6, and for graph (iv), the termination rule is simply \"stop after 4 iterations\". The normalization algorithm referred to as \"I\" corresponds to \"Identity\", with the most naive normalization rule: it does nothing, leaving activations unchanged prior to subsequent combination. The algorithm \"max-ave\" and \"DoG\" were run using the publicly available \"saliency toolbox\"7. The parameters of this were checked against the literature [2] and [3], and were found to be almost identical, with a few slight alterations that actually improved performance relative to the published parameters. The parameters of \"NL\" were set according to the better of the two sets of parameters provided in [11].\nPerformance metric: We wish to give a reward quantity to a saliency map, given some target locations, e.g., in the case of natural images, a set of locations at which human observers xated. For any one threshold saliency value, one can treat the saliency map as a classi er, with all points above threshold indicated as \"target\" and all points below threshold as \"background\". For any particular value of the threshold, there is some fraction of the actual target points which are labelled as such (true positive rate), and some fraction of points which were not target but labelled as such anyway (false positive rate). Varying over all such thresholds yields an ROC curve [14] and the area beneath it is generally regarded as an indication of the classifying power of the detector. This is the performance metric we use to measure how well a saliency map predicts xation locations on a given image.\n4To be clear, if A is the result of the eigenvector computation described in 2.1, i.e., if the graph-based activation step is concatenated with the graph-based normalization step, we will call the resulting algorithm GBVS. However, A may be computed using other techniques.\n5We note that this normalization step of GBS can be iterated times to improve performance. In practice, we use 2 f2; 3; 4g. Performance does not vary signi cantly in this regime with respect to .\n6with the intuition being that competition among competing saliency regions can settle, at which point it is wise to terminate\n7http://www.saliencytoolbox.net"
    },
    {
      "heading": "3.2 Human Eye-Movement Data on Images of Nature",
      "text": "In a study by Einhäuser et al. [1], human and primate xation data was collected on 108 images, each modi ed8 in nine ways. Figure 2 shows an example image from this collection, together with \"x\"s marking the xation points of three human subjects on this particular picture. In the present study, 749 unique modi cations of the 108 original images, and 24149 human xations from [1] were used. Only pictures for which xation data from three human subjects were available were used. Each image was cropped to 600 400 pixels and was presented to subjects so that it took up 76 55 of their visual eld. In order to facilitate a fair comparison of algorithms, the rst step of the saliency algorithm, feature extraction (s1), was the same for every experiment. Two spatial scales\n1 2 ; 1 4 were used, and for each of these, four orientation maps corresponding to orientations\n= f0 ; 45 ; 90 ; 135 g were computed using Gabor lters, one contrast map was computed using luminance variance in a local neighborhood of size 80 80, and the last map was simply a luminance map (the grayscale values). Each of these 12 maps was nally downsampled to a 25 37 raw feature map.\n\"c-s\" (center-surround) activation maps were computed by subtracting, from each raw feature map, a feature map on the same channel originally computed at a scale 4 binary orders of magnitude smaller in overall resolution and then resized smoothly to size 25 37. In [2], this overall scheme would be labelled c = f2; 3g, for 12 and 1 4 , and = f4g, corresponding to a scale change of 4 orders. The other activation procedures are described in section 2.1.2 and 2.1.1. The normalization procedures are all earlier described and named. Figure 2 shows an actual image with the resulting saliency maps from two different (activation, normalization) schemes.\nFinally, we show the performance of this algorithm on the corpus of images. For each image, a mean inter-subject ROC area was computed as follows: for each of the three subjects who viewed an image, the xation points of the remaining two subjects were convolved with a circular, decaying kernel with decay constant matched to the decaying cone density in the retina. This was treated as a saliency map derived directly from human xations, and with the target points being set to the\n8Modi cations were made to change the luminance contrast either up or down in selected circular regions. Both modi ed and unmodi ed stimuli were used in these experiments. Please refer to [1], [12].\nxations of the rst subject, an ROC area was computed for a single subject. The mean over the three is termed \"inter-subject ROC value\" in the following gures. For each range of this quantity, a mean performance metric was computed for various activation and normalization schemes. For any particular scheme, an ROC area was computed using the resulting saliency map together with the xations from all 3 human subjects as target points to detect. The results are shown below.\nIn both Figures 3 and 4, The boundary lines above and below show a rough upper9 and strict lower bounds on performance (based on a human control and chance performance). Figure 3(a) and Figure 3(b) clearly demonstrate the tremendous predictive power of the graph-based algorithms over standard approaches. Figure 4 demonstrates the especially effective performance of combining the best graph-based activation and normalization schemes, contrasted against the standard Itti & Koch approaches, and also the \"self-information\" approach which includes no mention of a normalization step (hence, set here to \"I\").\nm ea\nn R\nO C\nv al\nue fo\nr a lg\nor ith\nm\nThe combination of a few possible pairs of activation schemes together with normalization schemes is summarized in Table 1, with notes indicating where certain combinations correspond to established benchmarks. Performance is shown as a fraction of the inter-subject ROC area. Overall, we nd an median ROC area of 0.55 for the Itti & Koch saliency algorithms [2] on these images. In [1]\n9To form a true upper bound, one would need the xation data of many more than three humans on each image.\nthe mean is reported as 0.57, which is remarkably close and plausible if you assume slightly more sophisticated feature maps (for instance, at more scales)."
    },
    {
      "heading": "4 Discussion and Conclusion",
      "text": "Although a novel, simple approach to an old problem is always welcome, we must also seek to answer the scienti c question of how it is possible that, given access to the same feature information, GBVS predicts human xations more reliably than the standard algorithms. We nd experimentally that there are at least two reasons for this observed difference. The rst observation is that, because nodes are on average closer to a few center nodes than to any particular point along the image periphery, it is an emergent property that GBVS promotes higher saliency values in the center of the image plane. We hypothesize that this \"center bias\" is favorable with respect to predicting xations due to human experience both with photographs, which are typically taken with a central subject, and with everyday life in which head motion often results in gazing straight ahead. Notably, the images of foliage used in the present study had no central subject. One can quantify the GBVS-induced center bias by activating, then normalizing, a uniform image using our algorithms. However, if we introduce this center bias to the output of the standard algorithms' master maps (via pointwise multiplication), we nd that the standard algorithms predict xations better, but still worse than GBVS. In some cases (e.g., \"DoG\"), introducing this center bias only explains 20% of the performance gap to GBVS in the best case (viz., \"max-ave\"), it explains 90% of the difference. We conjecture that the other reason for the performance difference stems from the robustness of our algorithm with respect to differences in the sizes of salient regions. Experimentally, we nd that the \"c-s\" algorithm has trouble activating salient regions distant from object borders, even if one varies over many choices of scale differences and combinations thereof. Since most of the standard algorithms have \"c-s\" as a rst step, they are weakened ab initio. Similarly, the \"self-info\" algorithm suffers the same weakness, even if one varies over the neighborhood size parameter. On the other hand, GBVS robustly highlights salient regions, even far away from object borders.\nWe note here that what lacks from GBVS described as above is any notion of a multiresolution representation of map data. Therefore, because multiresolution representations are so basic, one may extend both the graph-based activation and normalization steps to a multiresolution version as follows: We begin with, instead of a single map A : [n]2 ! R, a collection of maps fAig, with each Ai : [ni]2 ! R representing the same underlying information but at different resolutions. Proceeding as we did before, we instantiate a node for every point on every map, introducing edges again between every pair of nodes, with weights computed same as before with one caveat: the\n10performance here is measured by the ratio of (ROC area using the given algorithm for xation detection) to (ROC area using a saliency map formed from the xations of other subjects on a single picture)\ndistance penalty function F (a; b) accepts two arguments each of which is a distance between two nodes along a particular dimension. In order to compute F in this case, one must de ne a distance over points taken from different underlying domains. The authors suggest a de nition whereby: (1) each point in each map is assigned a set of locations, (2) this set corresponds to the spatial support of this point in the highest resolution map, and (3) the distance between two sets of locations is given as the mean of the set of pairwise distances. The equilibrium distribution can be computed as before. We nd that this extension (say, GBVS Multiresolution, or GBVSM) improves performance with little added computation.\nTherefore, we have presented a method of computing bottom-up saliency maps which shows a remarkable consistency with the attentional deployment of human subjects. The method uses a novel application of ideas from graph theory to concentrate mass on activation maps, and to form activation maps from raw features. We compared our method with established models and found that ours performed favorably, for both of the key steps in our organization of saliency computations. Our model is extensible to multiresolutions for better performance, and it is biologically plausible to the extent that a parallel implementation of the power-law algorithm for Markov chains is trivially accomplished in hardware."
    },
    {
      "heading": "Acknowledgments",
      "text": "The authors express sincere gratitude to Wolfgang Einhäuser for his offering of natural images, and the xation data associated with them from a study with seven human subjects. We also acknowledge NSF, NIH, DARPA, and ONR for their generous support of our research."
    }
  ],
  "year": 2006,
  "references": [
    {
      "title": "Differences of Monkey and Human Overt Attention under Natural Conditions",
      "authors": [
        "W. Einhäuser",
        "W. Kruse",
        "K.P. Hoffmann",
        "P. König"
      ],
      "venue": "Vision Research",
      "year": 2006
    },
    {
      "title": "A model of saliency based visual attention for rapid scene analysis",
      "authors": [
        "L. Itti",
        "C. Koch",
        "E. Niebur"
      ],
      "venue": "IEEE Transactions on Pattern Analysis and Machine",
      "year": 1998
    },
    {
      "title": "A saliency-based search mechanism for overt and covert shifts of visual attention",
      "authors": [
        "L. Itti",
        "C. Koch"
      ],
      "venue": "Vision Research,",
      "year": 2000
    },
    {
      "title": "Visual Saliency and Attention as Random Walks on Complex Networks",
      "authors": [
        "L.F. Costa"
      ],
      "year": 2006
    },
    {
      "title": "Modelling gaze shift as a constrained random walk",
      "authors": [
        "G. Boccignone",
        "M. Ferraro"
      ],
      "venue": "Physica A 331,",
      "year": 2004
    },
    {
      "title": "Are human scanpaths Levy ights?",
      "authors": [
        "D. Brockmann",
        "T. Geisel"
      ],
      "venue": "ICANN",
      "year": 1999
    },
    {
      "title": "Modeling the role of salience in the allocation of overt visual attention",
      "authors": [
        "D. Parkhurst",
        "K. Law",
        "E. Niebur"
      ],
      "venue": "Vision Research,",
      "year": 2002
    },
    {
      "title": "Attention activates winner-take-all competition among visual features",
      "authors": [
        "D.K. Lee",
        "L. Itti",
        "C. Koch",
        "J. Braun"
      ],
      "venue": "Nature Neuroscience,",
      "year": 1999
    },
    {
      "title": "Does luminance-contrast contribute to saliency map for overt visual attention?",
      "authors": [
        "W. Einhäuser",
        "P. König"
      ],
      "venue": "Eur. J. Neurosci",
      "year": 2003
    },
    {
      "title": "Is bottom-up attention useful for object recognition?",
      "authors": [
        "U. Rutishauser",
        "D. Walther",
        "C. Koch",
        "P. Perona"
      ],
      "year": 2004
    },
    {
      "title": "Visual correlates of xation selection: Effects of scale and time.",
      "authors": [
        "B.W. Tatler",
        "R.J. Baddeley",
        "I.D. Gilchrist"
      ],
      "venue": "Vision Research",
      "year": 2005
    },
    {
      "title": "Preattentive texture discrimination with early vision mechanisms",
      "authors": [
        "J. Malik",
        "P. Perona"
      ],
      "venue": "Journal of the Optical Society of America A",
      "year": 1990
    }
  ],
  "id": "SP:7e93dd044471fa743903e3d599c177d7518bc5cb",
  "authors": [
    {
      "name": "Jonathan Harel",
      "affiliations": []
    }
  ],
  "abstractText": "A new bottom-up visual saliency model, Graph-Based Visual Saliency (GBVS), is proposed. It consists of two steps: rst forming activation maps on certain feature channels, and then normalizing them in a way which highlights conspicuity and admits combination with other maps. The model is simple, and biologically plausible insofar as it is naturally parallelized. This model powerfully predicts human xations on 749 variations of 108 natural images, achieving 98% of the ROC area of a human-based control, whereas the classical algorithms of Itti & Koch ([2], [3], [4]) achieve only 84%.",
  "title": "Graph-Based Visual Saliency"
}