Convolutional neural networks (CNNs) have computation that scales linearly with the number of pixels in the input. What if a model instead looked only at a sequence of small regions (patches) within the input image? The amount of computation would then be independent of the size of the image, depending only on the size and number of patches extracted. This also reduces task complexity, since the model can focus on the object of interest and ignore surrounding clutter. This work is related to three branches of research: reducing computation in computer vision, "saliency detectors", and computer vision as a sequential decision task.
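To make the cost argument concrete, here is a minimal sketch (assuming NumPy) of extracting a fixed-size patch around a given location; `extract_patch` is a hypothetical helper, not code from the original work. The point is that the per-glimpse cost depends only on the patch size, not on the image size:

```python
import numpy as np

def extract_patch(image, center, size):
    """Return a size x size patch centered at `center` (row, col).

    Zero-pads the image so patches near the border stay valid.
    (Hypothetical helper for illustration.)
    """
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half)), mode="constant")
    r, c = center[0] + half, center[1] + half
    return padded[r - half:r - half + size, c - half:c - half + size]

# Per-glimpse work is the same for a 64x64 and a 1024x1024 image:
small = np.random.rand(64, 64)
large = np.random.rand(1024, 1024)
for img in (small, large):
    patch = extract_patch(img, (10, 10), 8)
    assert patch.shape == (8, 8)
```

With `k` glimpses of side `s`, the pixels processed total `k * s * s` regardless of the input resolution, which is the independence the paragraph above describes.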
Biological inspiration: humans do not perceive a whole scene at once; instead, they focus attention on parts of the visual space to acquire information and then combine it to build an internal representation of the scene. The locations at which humans fixate have been shown to be task specific.