Publications

2021

    •

      Li, W.-H., Liu, X., & Bilen, H. (2021). Universal Representation Learning from Multiple Domains for Few-shot Classification. International Conference on Computer Vision (ICCV).

      In this paper, we look at the problem of few-shot classification, which aims to learn a classifier for previously unseen classes and domains from a few labeled samples. Recent methods use adaptation networks to align their features to new domains or select relevant features from multiple domain-specific feature extractors. In this work, we propose to learn a single set of universal deep representations by distilling the knowledge of multiple separately trained networks after co-aligning their features with the help of adapters and centered kernel alignment. We show that the universal representations can be further refined for previously unseen domains by an efficient adaptation step, in a similar spirit to distance learning methods. We rigorously evaluate our model on the recent Meta-Dataset benchmark and demonstrate that it significantly outperforms the previous methods while being more efficient. (The alignment measure is sketched in code after the entry below.)
      @inproceedings{Li21, title = {Universal Representation Learning from Multiple Domains for Few-shot Classification}, author = {Li, Wei-Hong and Liu, Xialei and Bilen, Hakan}, booktitle = {International Conference on Computer Vision (ICCV)}, year = {2021} }
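
      The co-alignment step above relies on centered kernel alignment (CKA). Below is a minimal PyTorch sketch of the standard linear-CKA similarity between the features of two networks; the function name and toy usage are illustrative, not taken from the authors' code.

      import torch

      def linear_cka(x, y):
          # x: (n, d1), y: (n, d2) activations for the same n samples.
          x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
          y = y - y.mean(dim=0, keepdim=True)
          cross = torch.linalg.norm(y.t() @ x) ** 2           # ||Y^T X||_F^2
          return cross / (torch.linalg.norm(x.t() @ x) *      # ||X^T X||_F
                          torch.linalg.norm(y.t() @ y))       # ||Y^T Y||_F

      # Toy usage: similarity of two networks' features on a shared batch.
      print(linear_cka(torch.randn(128, 512), torch.randn(128, 256)))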
    •

      Mariotti, O., Mac Aodha, O., & Bilen, H. (2021). ViewNet: Unsupervised Viewpoint Estimation From Conditional Generation. International Conference on Computer Vision (ICCV).

      Understanding the 3D world without supervision is currently a major challenge in computer vision, as the annotations required to supervise deep networks for tasks in this domain are expensive to obtain on a large scale. In this paper, we address the problem of unsupervised viewpoint estimation. We formulate this as a self-supervised learning task, where image reconstruction from raw images provides the supervision needed to predict camera viewpoint. Specifically, we make use of pairs of images of the same object at training time, taken from unknown viewpoints, to self-supervise training by combining the viewpoint information from one image with the appearance information from the other. We demonstrate that using a perspective spatial transformer allows efficient viewpoint learning, outperforming existing unsupervised approaches on synthetic data and obtaining competitive results on the challenging PASCAL3D+ dataset. (The cross-reconstruction objective is sketched after the entry below.)
      @inproceedings{Mariotti21, title = {ViewNet: Unsupervised Viewpoint Estimation From Conditional Generation}, author = {Mariotti, Octave and Mac~Aodha, Oisin and Bilen, Hakan}, booktitle = {International Conference on Computer Vision (ICCV)}, year = {2021} }
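
      A schematic of the cross-reconstruction objective described in the abstract, assuming hypothetical appearance_enc, viewpoint_enc and decoder modules; it illustrates the training signal only, not the authors' perspective-spatial-transformer architecture.

      import torch.nn.functional as F

      def cross_reconstruction_loss(img_a, img_b,
                                    appearance_enc, viewpoint_enc, decoder):
          # img_a, img_b: the same object under two unknown viewpoints.
          appearance = appearance_enc(img_a)   # viewpoint-free appearance code
          viewpoint = viewpoint_enc(img_b)     # predicted camera viewpoint
          recon = decoder(appearance, viewpoint)
          # Reconstructing img_b is the only supervision; no pose labels used.
          return F.mse_loss(recon, img_b)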
    •

      Chen, Y., Fernando, B., Bilen, H., Mensink, T., & Gavves, E. (2021). Shape Transformation with Deep Implicit Functions by Matching Implicit Features. International Conference on Machine Learning (ICML).

      Recently, neural implicit functions have achieved impressive results for encoding 3D shapes. Conditioning on low-dimensional latent codes generalises a single implicit function to a shared representation space over a variety of shapes, with the advantage of smooth interpolation. Although this global latent space does not provide explicit point-level correspondences, we propose to track continuous point trajectories by matching implicit features while the latent code is interpolated between shapes. This analysis corroborates the hierarchical functionality of deep implicit functions, where early layers map the latent code to a coarse shape structure and deeper layers refine the shape details. Furthermore, the structured representation space of implicit functions makes it possible to apply feature matching to shape deformation, with the benefit of handling topological and semantic inconsistencies, such as deforming an armchair into a chair with no arms, without explicit flow functions or manual annotations. (The point-tracking idea is sketched after the entry below.)
      @inproceedings{Chen21, title = {Shape Transformation with Deep Implicit Functions by Matching Implicit Features}, author = {Chen, Yunlu and Fernando, Basura and Bilen, Hakan and Mensink, Thomas and Gavves, Efstratios}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }
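
      A sketch of the point-tracking idea under stated assumptions: feat_fn(p, z) is a hypothetical handle to an intermediate-layer feature of the implicit network at point p with latent code z, and the point is updated by gradient descent so its feature stays matched across small latent interpolation steps. The paper's exact matching procedure may differ.

      import torch

      def track_point(point, feat_fn, z_src, z_tgt, n_interp=8, steps=10, lr=1e-2):
          p = point.clone()
          for t in range(n_interp):
              z_a = torch.lerp(z_src, z_tgt, t / n_interp)
              z_b = torch.lerp(z_src, z_tgt, (t + 1) / n_interp)
              target = feat_fn(p, z_a).detach()    # feature to be preserved
              p = p.detach().requires_grad_(True)
              opt = torch.optim.SGD([p], lr=lr)
              for _ in range(steps):               # nudge p so its feature
                  opt.zero_grad()                  # under z_b matches target
                  loss = (feat_fn(p, z_b) - target).pow(2).sum()
                  loss.backward()
                  opt.step()
          return p.detach()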
    •

      Deecke, L., Ruff, L., Vandermeulen, R. A., & Bilen, H. (2021). Transfer Based Semantic Anomaly Detection. International Conference on Machine Learning (ICML).

      Detecting semantic anomalies is challenging due to the countless ways in which they may appear in real-world data. While enhancing the robustness of networks may be sufficient for modeling simplistic anomalies, there is no known way of preparing models for all the unforeseen anomalies that may occur, such as the appearance of new object classes. In this paper, we show that a previously overlooked strategy for anomaly detection (AD) is to introduce an explicit inductive bias toward representations transferred over from some large and varied semantic task. We rigorously verify our hypothesis in controlled trials that utilize intervention, and show that it gives rise to surprisingly effective auxiliary objectives that outperform previous AD paradigms. (A minimal transfer-based scorer in this spirit follows the entry below.)
      @inproceedings{Deecke21, title = {Transfer Based Semantic Anomaly Detection}, author = {Deecke, Lucas and Ruff, Lukas and Vandermeulen, Robert~A. and Bilen, Hakan}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021}, xcode = {https://github.com/VICO-UoE/TransferAD} }
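
      A minimal transfer-based anomaly scorer in the spirit of the abstract: features from an ImageNet-pretrained backbone, scored by distance to the centroid of the normal training data. This is a common baseline of this kind, not necessarily the paper's exact objective; see the linked repository for the real implementation.

      import torch
      from torchvision import models

      # Transferred inductive bias: a backbone pretrained on a large, varied
      # semantic task (ImageNet) provides the representation.
      backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
      backbone.fc = torch.nn.Identity()     # expose penultimate features
      backbone.eval()

      @torch.no_grad()
      def fit_center(normal_images):        # (N, 3, H, W) normal training data
          return backbone(normal_images).mean(dim=0)

      @torch.no_grad()
      def anomaly_score(images, center):    # higher = more anomalous
          return (backbone(images) - center).norm(dim=1)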
    •

      Zhao, B., & Bilen, H. (2021). Dataset Condensation with Differentiable Siamese Augmentation. International Conference on Machine Learning (ICML).

      In many machine learning problems, large-scale datasets have become the de-facto standard for training state-of-the-art deep networks, at the price of a heavy computational load. In this paper, we focus on condensing large training sets into significantly smaller synthetic sets which can be used to train deep neural networks from scratch with a minimal drop in performance. Inspired by recent training set synthesis methods, we propose Differentiable Siamese Augmentation, which enables the effective use of data augmentation to synthesize more informative synthetic images and thus achieves better performance when training networks with augmentations. Experiments on multiple image classification benchmarks demonstrate that the proposed method obtains substantial gains over the state of the art, with 7% improvements on the CIFAR10 and CIFAR100 datasets. We show that, with less than 1% of the original data, our method achieves 99.6%, 94.9%, 88.5%, and 71.5% relative performance on MNIST, FashionMNIST, SVHN, and CIFAR10, respectively. We also explore the use of our method in continual learning and neural architecture search, and show promising results. (The siamese-augmentation step is sketched after the entry below.)
      @inproceedings{Zhao21a, title = {Dataset Condensation with Differentiable Siamese Augmentation}, author = {Zhao, Bo and Bilen, Hakan}, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021}, xcode = {https://github.com/vico-uoe/DatasetCondensation} }
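
      A sketch of one matching step with siamese augmentation, under the assumption that augment(x, params) is a differentiable, parameterized augmentation (names here are illustrative); the key point is that the same randomly drawn parameters are applied to both the real and the synthetic batch. A simple squared-error gradient distance stands in for the paper's layer-wise matching distance.

      import torch

      def dsa_matching_loss(net, loss_fn, real_x, real_y, syn_x, syn_y, augment):
          # Draw augmentation parameters ONCE and reuse them for both batches,
          # so real and synthetic gradients stay directly comparable.
          params = {'flip': torch.rand(1).item() < 0.5,
                    'shift': torch.randint(-4, 5, (2,))}
          g_real = torch.autograd.grad(
              loss_fn(net(augment(real_x, params)), real_y), net.parameters())
          g_syn = torch.autograd.grad(
              loss_fn(net(augment(syn_x, params)), syn_y), net.parameters(),
              create_graph=True)        # keep graph: loss must reach syn_x
          return sum((a.detach() - b).pow(2).sum()
                     for a, b in zip(g_real, g_syn))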
    •

      Zhao, B., Mopuri, K. R., & Bilen, H. (2021). Dataset Condensation with Gradient Matching. International Conference on Learning Representations (ICLR).

      As state-of-the-art machine learning methods in many fields come to rely on larger datasets, storing datasets and training models on them become significantly more expensive. This paper proposes a training set synthesis technique for data-efficient learning, called Dataset Condensation, that learns to condense a large dataset into a small set of informative synthetic samples for training deep neural networks from scratch. We formulate this goal as a gradient matching problem between the gradients of deep neural network weights that are trained on the original and our synthetic data. We rigorously evaluate its performance on several computer vision benchmarks and demonstrate that it significantly outperforms the state-of-the-art methods. Finally, we explore the use of our method in continual learning and neural architecture search and report promising gains when limited memory and computation are available. (The gradient-matching loop is sketched after the entry below.)
      @inproceedings{Zhao21, title = {Dataset Condensation with Gradient Matching}, author = {Zhao, Bo and Mopuri, Konda~Reddy and Bilen, Hakan}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2021}, xcode = {https://github.com/vico-uoe/DatasetCondensation} }
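
      A simplified sketch of the outer gradient-matching loop: the synthetic images are learnable tensors, and the matching loss between real and synthetic gradients is backpropagated into them. The actual method (see the repository) re-initializes networks, interleaves inner model updates, and uses a layer-wise, cosine-based distance; the squared error here is only for illustration.

      import torch

      def condense(real_loader, syn_x, syn_y, make_net, loss_fn, steps=1000):
          # syn_x: learnable synthetic images, created with requires_grad=True.
          opt_img = torch.optim.SGD([syn_x], lr=0.1, momentum=0.5)
          for _, (real_x, real_y) in zip(range(steps), real_loader):
              net = make_net()          # fresh randomly initialized network
              g_real = torch.autograd.grad(
                  loss_fn(net(real_x), real_y), net.parameters())
              g_syn = torch.autograd.grad(
                  loss_fn(net(syn_x), syn_y), net.parameters(),
                  create_graph=True)
              match = sum((a.detach() - b).pow(2).sum()
                          for a, b in zip(g_real, g_syn))
              opt_img.zero_grad()
              match.backward()          # gradients flow into the images
              opt_img.step()
          return syn_x.detach()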

2020

    •

      Mariotti, O., & Bilen, H. (2020). Semi-supervised Viewpoint Estimation with Geometry-aware Conditional Generation. European Conference on Computer Vision (ECCV) Workshop.

      There is a growing interest in developing computer vision methods that can learn from limited supervision. In this paper, we consider the problem of learning to predict camera viewpoints from a limited number of labeled images, where obtaining ground-truth annotations is expensive and requires special equipment. We propose a semi-supervised viewpoint estimation method that can learn to infer viewpoint information from unlabeled image pairs, where the two images differ by a viewpoint change. In particular, our method learns to synthesize the second image by combining the appearance from the first one and the viewpoint from the second one. We demonstrate that our method significantly improves over supervised techniques, especially in the low-label regime, and outperforms the state-of-the-art semi-supervised methods.
      @inproceedings{Mariotti20, title = {Semi-supervised Viewpoint Estimation with Geometry-aware Conditional Generation}, author = {Mariotti, Octave and Bilen, Hakan}, booktitle = {European Conference on Computer Vision (ECCV) Workshop}, year = {2020}, xcode = {https://github.com/VICO-UoE/SemiSupViewNet} }
    •

      Li, W.-H., & Bilen, H. (2020). Knowledge Distillation for Multi-task Learning. European Conference on Computer Vision (ECCV) Workshop.

      Multi-task learning (MTL) aims to learn a single model that performs multiple tasks, achieving good performance on all tasks at a lower computational cost. Learning such a model requires jointly optimizing the losses of a set of tasks with different difficulty levels, magnitudes, and characteristics (e.g. cross-entropy, Euclidean loss), leading to an imbalance problem in multi-task learning. To address this imbalance, we propose a knowledge-distillation-based method in this work. We first learn a task-specific model for each task. We then train the multi-task model to minimize the task-specific losses and to produce the same features as the task-specific models. As each task-specific network encodes different features, we introduce small task-specific adaptors to project the multi-task features to the task-specific features. In this way, the adaptors align the task-specific and multi-task features, which enables balanced parameter sharing across tasks. Extensive experimental results demonstrate that our method optimizes a multi-task learning model in a more balanced way and achieves better overall performance. (The combined objective is sketched after the entry below.)
      @inproceedings{Li20, title = {Knowledge Distillation for Multi-task Learning}, author = {Li, Wei-Hong and Bilen, Hakan}, booktitle = {European Conference on Computer Vision (ECCV) Workshop}, year = {2020}, xcode = {https://github.com/VICO-UoE/KD4MTL} }
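
      A sketch of the combined objective in the abstract, with hypothetical per-task adaptors projecting the shared student feature into each frozen teacher's feature space; the paper's exact loss (e.g. any feature normalization) may differ, see the linked repository.

      import torch.nn.functional as F

      def mtl_distill_loss(student_feat, teacher_feats, adaptors, task_losses):
          # task_losses: dict task -> supervised loss, already computed.
          # teacher_feats: dict task -> feature from the frozen single-task
          # model on the same batch.
          # adaptors: dict task -> small module mapping the shared student
          # feature into that task's feature space.
          loss = sum(task_losses.values())
          for task, t_feat in teacher_feats.items():
              projected = adaptors[task](student_feat)
              loss = loss + F.mse_loss(projected, t_feat.detach())
          return loss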
    •

      Goel, A., Fernando, B., Nguyen, T.-S., & Bilen, H. (2020). Injecting Prior Knowledge into Image Captioning. European Conference on Computer Vision (ECCV) Workshop.

      Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. State-of-the-art methods in image captioning struggle to approach human-level performance, especially when data is limited. In this paper, we propose to improve the performance of state-of-the-art image captioning models by incorporating two sources of prior knowledge: (i) a conditional latent topic attention that uses a set of latent variables (topics) as an anchor to generate highly probable words, and (ii) a regularization technique that exploits the inductive biases in the syntactic and semantic structure of captions and improves the generalization of image captioning models. Our experiments validate that our method produces more human-interpretable captions and also leads to significant improvements on the MSCOCO dataset in both the full and low data regimes.
      @inproceedings{Goel20, title = {Injecting Prior Knowledge into Image Captioning}, author = {Goel, Arushi and Fernando, Basura and Nguyen, Thanh-Son and Bilen, Hakan}, booktitle = {European Conference on Computer Vision (ECCV) Workshop}, year = {2020} }
    •

      Jakab, T., Gupta, A., Bilen, H., & Vedaldi, A. (2020). Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

      We propose a new method for recognizing the pose of objects from a single image that uses only unlabelled videos and a weak empirical prior on the object poses for learning. Video frames differ primarily in the pose of the objects they contain, so our method distils the pose information by analyzing the differences between frames. The distillation uses a new dual representation of the geometry of objects: as a set of 2D keypoints, and as a pictorial representation, i.e. a skeleton image. This has three benefits: (1) it provides a tight ‘geometric bottleneck’ which disentangles pose from appearance, (2) it can leverage powerful image-to-image translation networks to map between photometry and geometry, and (3) it allows empirical pose priors to be incorporated in the learning process. The pose priors are obtained from unpaired data, e.g. from a different dataset or a different modality such as mocap, such that no annotated image is ever used in learning the pose recognition network. On standard pose recognition benchmarks for humans and faces, our method achieves state-of-the-art performance among methods that do not require any labelled images for training. (The bottleneck reconstruction is sketched after the entry below.)
      @inproceedings{Jakab20, title = {Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos}, author = {Jakab, T. and Gupta, A. and Bilen, H. and Vedaldi, A.}, booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2020}, xcode = {https://github.com/tomasjakab/keypointgan} }
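
      A schematic of the geometric-bottleneck reconstruction: detector, renderer and translator are hypothetical stand-ins for the paper's networks, and the adversarial prior on skeleton images learned from unpaired mocap data is omitted.

      import torch.nn.functional as F

      def bottleneck_reconstruction_loss(src_frame, tgt_frame,
                                         detector, renderer, translator):
          keypoints = detector(tgt_frame)          # 2D keypoints: the tight
          skeleton = renderer(keypoints)           # bottleneck, drawn as an image
          recon = translator(skeleton, src_frame)  # appearance from another frame
          # MSE stands in for the paper's perceptual reconstruction loss.
          return F.mse_loss(recon, tgt_frame)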
    •

      Fernando, B., Tan, C., & Bilen, H. (2020). Weakly Supervised Gaussian Networks for Action Detection. IEEE Winter Conference on Applications of Computer Vision (WACV).

      Detecting the temporal extents of human actions in videos is a challenging computer vision problem that requires detailed manual supervision, including frame-level labels. This expensive annotation process restricts deploying action detectors to a small number of categories. We propose a novel method, called WSGN, that learns to detect actions from weak supervision, using only video-level labels. WSGN learns to exploit both video-specific and dataset-wide statistics to predict the relevance of each frame to an action category. This strategy leads to significant gains in action detection on two standard benchmarks, THUMOS14 and Charades. Our method obtains excellent results compared to state-of-the-art methods that use similar features and loss functions on the THUMOS14 dataset. Similarly, our weakly supervised method is only 0.3% mAP behind a state-of-the-art fully supervised method on the challenging Charades dataset for action localization. (The weakly supervised aggregation is sketched after the entry below.)
      @inproceedings{Fernando20, title = {Weakly Supervised Gaussian Networks for Action Detection}, author = {Fernando, B. and Tan, C. and Bilen, H.}, booktitle = {IEEE Winter Conference on Applications of Computer Vision (WACV)}, year = {2020} }
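
      A sketch of weakly supervised aggregation consistent with the abstract: per-frame class logits, weighted by a learned relevance over time, yield video-level scores trainable with video labels only. WSGN's separate video-specific and dataset-wide streams are collapsed into a single hypothetical relevance module here for brevity.

      import torch

      def video_level_scores(frame_feats, classifier, relevance):
          # frame_feats: (T, D) features for the T frames of one video.
          logits = classifier(frame_feats)                        # (T, C)
          weights = torch.softmax(relevance(frame_feats), dim=0)  # (T, 1)
          return (weights * logits).sum(dim=0)  # (C,) scores; train with BCE
                                                # on video-level labels only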

2019

    •

      Thewlis, J., Albanie, S., Bilen, H., & Vedaldi, A. (2019). Unsupervised Learning of Landmarks by Descriptor Vector Exchange. International Conference on Computer Vision (ICCV).

      @inproceedings{Thewlis19, title = {Unsupervised Learning of Landmarks by Descriptor Vector Exchange}, author = {Thewlis, J. and Albanie, S. and Bilen, H. and Vedaldi, A.}, booktitle = {International Conference on Computer Vision (ICCV)}, year = {2019}, xcode = {http://www.robots.ox.ac.uk/~vgg/research/DVE/} }
    •

      Deecke, L., Murray, I., & Bilen, H. (2019). Mode Normalization. International Conference on Learning Representations (ICLR).

      Normalization methods are a central building block in the deep learning toolbox. They accelerate and stabilize training while decreasing the dependence on manually tuned learning rate schedules. When learning from multi-modal distributions, the effectiveness of batch normalization (BN), arguably the most prominent normalization method, is reduced. As a remedy, we propose a more flexible approach: by extending the normalization to more than a single mean and variance, we detect modes of data on-the-fly, jointly normalizing samples that share common features. We demonstrate that our method outperforms BN and other widely used normalization techniques in several experiments, including single and multi-task datasets. (A simplified module is sketched after the entry below.)
      @inproceedings{Deecke19, title = {Mode Normalization}, author = {Deecke, L. and Murray, I. and Bilen, H.}, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2019}, xcode = {https://github.com/ldeecke/mn-torch} }
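
      A simplified module illustrating the idea: a gate softly assigns samples to K modes, and each sample is normalized by a gate-weighted mixture of per-mode statistics. Running averages for inference, affine parameters, and other details of the paper and the linked mn-torch repository are omitted.

      import torch
      import torch.nn as nn

      class ModeNorm2d(nn.Module):
          def __init__(self, channels, k=2, eps=1e-5):
              super().__init__()
              self.k, self.eps = k, eps
              self.gate = nn.Linear(channels, k)    # soft mode assignment

          def forward(self, x):                     # x: (N, C, H, W)
              n, _, h, w = x.shape
              g = torch.softmax(self.gate(x.mean(dim=(2, 3))), dim=1)  # (N, K)
              out = torch.zeros_like(x)
              for k in range(self.k):
                  wk = g[:, k].view(n, 1, 1, 1)     # this mode's sample weights
                  denom = wk.sum() * h * w + self.eps
                  mu = (wk * x).sum(dim=(0, 2, 3), keepdim=True) / denom
                  var = (wk * (x - mu) ** 2).sum(dim=(0, 2, 3), keepdim=True) / denom
                  out = out + wk * (x - mu) / torch.sqrt(var + self.eps)
              return out

      # Toy usage:
      y = ModeNorm2d(64, k=2)(torch.randn(8, 64, 16, 16))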