It is well known that neurons in the ventral visual stream support the perception of faces and objects 1. Decades of extracellular single neuron recordings have defined their canonical coding principles at different stages of the processing hierarchy, such as the sensitivity of early visual neurons to oriented contours and of more anterior ventral stream neurons to complex objects and faces 2, 3. A sub-network of the inferotemporal (IT) cortex specialised for face processing is particularly well studied 3, 4, 5. Faces appear to be represented within such patches using low-dimensional neural codes, where each neuron encodes an orthogonal axis of variation in the face space 3. An important yet unanswered question is how such representations may arise through learning from the statistics of the visual input.

The most successful computational model of face processing, the active appearance model (AAM) 6, is a largely handcrafted framework which cannot help answer this question. Recently, deep neural networks have emerged as popular models of computation in the primate ventral stream 7, 8. Such contemporary deep networks are trained with high-density teaching signals on multiway object recognition tasks 9, and in doing so form high-dimensional representations that, at the population level, closely resemble those in biological systems 10, 11, 12. Unlike AAM, these models are not limited to the domain of faces, and they develop their tuning distributions through data-driven learning. Such deep classifiers, however, currently do not explain the responses of single neurons in the primate face patch better than AAM 6. Furthermore, deep classifiers and AAM differ in their representational form: while deep classifiers develop high-dimensional representations in which information is multiplexed over many simulated neurons, AAM has a low-dimensional code in which single dimensions encode orthogonal information. Can we find a general learning principle that could match AAM in terms of its explanatory power, while having the potential to generalise beyond faces? The low-dimensional, orthogonal nature of the face code points at disentangling as a plausible learning objective for the visual brain.
In order to better understand how the brain perceives faces, it is important to know what objective drives learning in the ventral visual stream. To answer this question, we model neural responses to faces in the macaque inferotemporal (IT) cortex with a deep self-supervised generative model, β-VAE, which disentangles sensory data into interpretable latent factors, such as gender or age. Our results demonstrate a strong correspondence between the generative factors discovered by β-VAE and those coded by single IT neurons, beyond that found for the baselines, including the handcrafted state-of-the-art model of face perception, the Active Appearance Model, and deep classifiers. Moreover, β-VAE is able to reconstruct novel face images using signals from just a handful of cells. Together, our results imply that optimising the disentangling objective leads to representations that closely resemble those in IT at the single-unit level.
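To make the disentangling objective concrete: the standard β-VAE loss augments the variational autoencoder ELBO with a coefficient β > 1 on the KL term, pressuring the approximate posterior toward a factorised isotropic Gaussian prior and thereby encouraging independent, interpretable latents. The sketch below is a minimal numpy illustration of that loss, not the implementation used in this work; the choice of a Gaussian reconstruction term (squared error) and β = 4 are illustrative assumptions, and the encoder/decoder networks that would produce `mu`, `logvar`, and `x_recon` are omitted.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """beta-VAE objective: reconstruction error + beta-weighted KL.

    With beta > 1 the posterior is pushed harder toward the factorised
    prior, which encourages disentangled latent dimensions. beta = 1
    recovers the ordinary VAE ELBO (up to sign and constants).
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # Gaussian likelihood, up to constants
    return recon + beta * kl_diag_gaussian(mu, logvar)

# Sanity check: at the prior (mu = 0, logvar = 0) with perfect
# reconstruction, the loss is exactly zero.
mu, logvar = np.zeros((1, 10)), np.zeros((1, 10))
x = x_recon = np.ones((1, 5))
print(beta_vae_loss(x, x_recon, mu, logvar))  # -> [0.]
```

Note that the KL term is computed in closed form, which is why β-VAE encoders output a mean and log-variance per latent dimension rather than samples alone.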