Yusuf Aytar
Post-Doctoral Research Associate
Computer Science and Artificial Intelligence Laboratory
Dept. of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
yusuf@csail.mit.edu
Office: 32-D475B,
32 Vassar Street, Cambridge, MA
Research Interests
Computer Vision, Machine Learning
deep learning, object detection, image/video retrieval,
scene understanding, transfer learning, cross-modal learning
Short Bio
Yusuf is a post-doctoral research associate working with Prof. Antonio Torralba at CSAIL, MIT.
He obtained his PhD (2014) degree from the Visual Geometry Group at the University of Oxford
under the supervision of Prof. Andrew Zisserman.
As a Fulbright scholar, he obtained his MSc (2008) degree from the Computer Vision Lab at the
University of Central Florida under the supervision of Prof. Mubarak Shah,
and his B.E. degree in Computer Engineering (2005) from Ege University.
He held internships at Siemens Corporate Research in Princeton, NJ, and at Google Research in Mountain View, CA.
His research has been published in major computer vision conferences such as BMVC, CVPR, and ICCV.
Curriculum Vitae
Publications
Y. Aytar, C. Vondrick, A. Torralba
arXiv, 2017
@InProceedings{Aytar17,
author = "Yusuf Aytar, Carl Vondrick, Antonio Torralba",
title = "See, Hear, and Read: Deep Aligned Representations",
booktitle = "arXiv:1706.00932",
year = "2017",
}
We capitalize on large amounts of readily-available, synchronous data to learn a deep discriminative representations shared across three major natural modalities: vision, sound and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.
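As a rough illustration of what an aligned representation means in practice, the sketch below implements a generic margin-based ranking loss that pulls paired embeddings from two modalities together and pushes mismatched pairs apart. This is for intuition only and is not the training objective used in the paper; all names, dimensions, and data are hypothetical.
import numpy as np

def l2_normalize(x, eps=1e-8):
    # scale each row to unit length so dot products become cosine similarities
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_modal_ranking_loss(emb_a, emb_b, margin=0.5):
    # emb_a, emb_b: (batch, dim) embeddings from two modalities; row i of
    # emb_a is assumed to be paired with row i of emb_b (placeholder data)
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    sim = a @ b.T                          # (batch, batch) cosine similarities
    pos = np.diag(sim)                     # similarities of the true pairs
    loss_ab = np.maximum(0.0, margin - pos[:, None] + sim)   # a vs. wrong b
    loss_ba = np.maximum(0.0, margin - pos[None, :] + sim)   # b vs. wrong a
    np.fill_diagonal(loss_ab, 0.0)         # do not penalize the true pairs
    np.fill_diagonal(loss_ba, 0.0)
    return (loss_ab.sum() + loss_ba.sum()) / a.shape[0]

rng = np.random.default_rng(0)
print(cross_modal_ranking_loss(rng.normal(size=(8, 128)), rng.normal(size=(8, 128))))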
A. Salvador*, N. Hynes*, Y. Aytar, J. Marin, F. Ofli, I. Weber, A. Torralba * denotes equal contribution
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
@InProceedings{Salvador17,
author = "A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, A. Torralba",
title = "Learning Cross-modal Embeddings for Cooking Recipes and Food Images",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
year = "2017",
}
In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to find a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Additionally, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general.
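For intuition, here is a minimal sketch of the image-to-recipe retrieval evaluation, assuming image and recipe embeddings in a shared space are already available: rank all recipes for each image by cosine similarity and report the median rank of the true match. The embeddings and sizes below are random placeholders, not Recipe1M data.
import numpy as np

def median_rank(image_emb, recipe_emb):
    # assumes row i of image_emb is the true match of row i of recipe_emb
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rec = recipe_emb / np.linalg.norm(recipe_emb, axis=1, keepdims=True)
    sims = img @ rec.T                       # (n_images, n_recipes) cosine scores
    order = np.argsort(-sims, axis=1)        # best-scoring recipe first
    ranks = [int(np.where(order[i] == i)[0][0]) + 1 for i in range(len(order))]
    return float(np.median(ranks))

rng = np.random.default_rng(1)
print(median_rank(rng.normal(size=(10, 64)), rng.normal(size=(10, 64))))   # random embeddings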
F. Ofli, Y. Aytar, I. Weber, R. al-Hammouri, A. Torralba
International Conference on World Wide Web (WWW), 2017
@inproceedings{ofli2017saki,
title = {Is Saki #delicious? The Food Perception Gap on Instagram and Its Relation to Health},
author = {Ofli, F. and Aytar, Y. and Weber, I. and Hammouri, R. and Torralba, A.},
booktitle = {Proceedings of the 26th International Conference on World Wide Web},
year = {2017},
}
Food is an integral part of our life and what and how much we eat crucially affects our health. Our food choices largely depend on how we perceive certain characteristics of food, such as whether it is healthy, delicious or if it qualifies as a salad. But these perceptions differ from person to person and one person's "single lettuce leaf" might be another person's "side salad". Studying how food is perceived in relation to what it actually is typically involves a laboratory setup. Here we propose to use recent advances in image recognition to tackle this problem. Concretely, we use data for 1.9 million images from Instagram from the US to look at systematic differences in how a machine would objectively label an image compared to how a human subjectively does. We show that this difference, which we call the "perception gap", relates to a number of health outcomes observed at the county level. To the best of our knowledge, this is the first time that image recognition is being used to study the "misalignment" of how people describe food images vs. what they actually depict.
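A toy sketch of the kind of comparison involved, not the paper's pipeline: for each region, compare how often a machine tagger assigns a label with how often users themselves mention it, and take the difference as a crude "perception gap". All posts, labels, and field names below are invented.
posts = [  # invented example posts: machine labels vs. what the user wrote
    {"county": "A", "machine_labels": ["salad", "cheese"], "user_tags": ["healthy", "salad"]},
    {"county": "A", "machine_labels": ["burger"],          "user_tags": ["healthy"]},
    {"county": "B", "machine_labels": ["burger", "fries"], "user_tags": ["burger", "delicious"]},
]

def tag_rate(posts, field, tag):
    # fraction of posts whose `field` list contains `tag`
    return sum(tag in p[field] for p in posts) / len(posts)

def perception_gap(posts, tag):
    # per-county difference between human mentions and machine assignments
    by_county = {}
    for p in posts:
        by_county.setdefault(p["county"], []).append(p)
    return {c: tag_rate(ps, "user_tags", tag) - tag_rate(ps, "machine_labels", tag)
            for c, ps in by_county.items()}

print(perception_gap(posts, "healthy"))   # {'A': 1.0, 'B': 0.0}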
Y. Aytar*, C. Vondrick*, A. Torralba * denotes equal contribution
Neural Information Processing Systems (NIPS), 2016
@InProceedings{Aytar16,
author = "Yusuf Aytar, Carl Vondrick, Antonio Torralba",
title = "Learning Sound Representations from Unlabeled Video",
booktitle = "Neural Information Processing Systems (NIPS)",
year = "2016",
}
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
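The student-teacher idea can be sketched as follows, assuming class posteriors from a pretrained vision network (the teacher) on video frames and raw outputs from a sound network (the student) on the corresponding audio are already available: the student is trained to match the teacher through a KL-divergence loss. This is a simplified illustration; shapes, names, and data are hypothetical.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_teacher_student(teacher_probs, student_logits, eps=1e-8):
    # mean KL(teacher || student) over a batch of video clips
    student_probs = softmax(student_logits)
    kl = np.sum(teacher_probs * (np.log(teacher_probs + eps)
                                 - np.log(student_probs + eps)), axis=1)
    return kl.mean()

rng = np.random.default_rng(2)
teacher = softmax(rng.normal(size=(4, 1000)))   # posteriors from a vision network (placeholder)
student = rng.normal(size=(4, 1000))            # raw outputs of a sound network (placeholder)
print(kl_teacher_student(teacher, student))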
L. Castrejón*, Y. Aytar*, C. Vondrick, H. Pirsiavash, A. Torralba * denotes equal contribution
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
@InProceedings{Castrejon16,
author = "Lluís Castrejón, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba",
title = "Learning Aligned Cross-Modal Representations from Weakly Aligned Data",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
year = "2016",
}
People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize cross-modal scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality. Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval. Moreover, our visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
Y. Aytar, A. Zisserman
Computer Vision and Image Understanding (CVIU), 2015
@Article{Aytar15,
author = "Aytar, Y. and Zisserman, A.",
title = "Part level transfer regularization for enhancing exemplar SVMs",
journal = "Computer Vision and Image Understanding",
volume = "138",
pages = "114 - 123",
year = "2015",
}
Exemplar SVMs (E-SVMs, Malisiewicz et al., ICCV 2011), where an SVM is trained with only a single positive sample, have found applications in the areas of object detection and content-based image retrieval (CBIR), amongst others.
In this paper we introduce a method of part based transfer regularization that boosts the performance of E-SVMs, with a negligible additional cost. This enhanced E-SVM (EE-SVM) improves the generalization ability of E-SVMs by softly forcing it to be constructed from existing classifier parts cropped from previously learned classifiers. In CBIR applications, where the aim is to retrieve instances of the same object class in a similar pose, the EE-SVM is able to tolerate increased levels of intra-class variation, including occlusions and truncations, over E-SVM, and thereby increases precision and recall.
In addition to transferring parts, we introduce a method for transferring the statistics between the parts and also show that there is an equivalence between transfer regularization and feature augmentation for this problem and others, with the consequence that the new objective function can be optimized using standard libraries.
EE-SVM is evaluated both quantitatively and qualitatively on the PASCAL VOC 2007 and ImageNet datasets for pose specific object retrieval. It achieves a significant performance improvement over E-SVMs, with greater suppression of negative detections and increased recall, whilst maintaining the same ease of training and testing.
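A toy sketch of the flavour of part level transfer regularization, not the EE-SVM objective itself: the new weight vector is softly encouraged to lie near a linear combination of previously learned part filters while fitting a hinge loss, optimized here by plain subgradient descent. Dimensions, step sizes, and data are made up.
import numpy as np

def train_transfer_svm(X, y, parts, lam=1.0, C=1.0, lr=0.01, iters=500):
    # X: (n, d) features, y: (n,) labels in {-1, +1},
    # parts: (k, d) previously learned part filters (all placeholders)
    d = X.shape[1]
    w = np.zeros(d)
    alpha = np.zeros(parts.shape[0])
    for _ in range(iters):
        margins = y * (X @ w)
        viol = margins < 1.0                             # margin-violating samples
        grad_hinge = -C * (y[viol, None] * X[viol]).sum(axis=0)
        resid = w - parts.T @ alpha                      # distance of w from the part mixture
        w -= lr * (lam * resid + grad_hinge)             # fit the data, stay near the parts
        alpha -= lr * (-lam * (parts @ resid))           # move the part mixture toward w
    return w, alpha

rng = np.random.default_rng(3)
w, alpha = train_transfer_svm(rng.normal(size=(20, 5)),
                              rng.choice([-1.0, 1.0], size=20),
                              rng.normal(size=(3, 5)))
print(w, alpha)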
Y. Aytar, A. Zisserman
European Conference on Computer Vision Workshops (ECCVW), 2014
@InProceedings{Aytar14a,
author = "Aytar, Y. and Zisserman, A.",
title = "Multi-Task Multi-Sample Learning",
booktitle = "European Conference on Computer Vision Workshops",
year = "2014",
}
In the exemplar SVM (E-SVM) approach of Malisiewicz et al., ICCV 2011, an ensemble of SVMs is learnt, with each SVM trained independently using only a single positive sample and all negative samples for the class. In this paper we develop a multi-sample learning (MSL) model which enables joint regularization of the E-SVMs without any additional cost over the original ensemble learning. The advantage of the MSL model is that the degree of sharing between positive samples can be controlled, such that the classification performance of either an ensemble of E-SVMs (sample independence) or a standard SVM (all positive samples used) is reproduced. However, between these two limits the model can exceed the performance of either. This MSL framework is inspired by multi-task learning approaches.
We also introduce a multi-task extension to MSL and develop a multi-task multi-sample learning (MTMSL) model that encourages both sharing between classes and sharing between sample specific classifiers within each class. Both MSL and MTMSL have convex objective functions. The MSL and MTMSL models are evaluated on standard benchmarks including the MNIST, ‘Animals with attributes’ and the PASCAL VOC 2007 datasets. They achieve a significant performance improvement over both a standard SVM and an ensemble of E-SVMs.
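For intuition about the "degree of sharing" mentioned above, one common multi-task style regularizer with the same limiting behaviour penalizes each exemplar weight vector's distance to the mean weight vector: with zero weight the exemplars stay independent, and with a very large weight they collapse to a single shared classifier. This is illustrative only, not the MSL objective from the paper.
import numpy as np

def sharing_penalty(W, mu):
    # W: (m, d) stack of per-exemplar weight vectors (placeholder shapes);
    # mu = 0 leaves the exemplars independent, large mu pushes them to agree
    w_mean = W.mean(axis=0, keepdims=True)
    return mu * np.sum((W - w_mean) ** 2)

rng = np.random.default_rng(4)
print(sharing_penalty(rng.normal(size=(6, 10)), mu=0.5))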
Y. Aytar, A. Zisserman
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
@InProceedings{Aytar14,
author = "Aytar, Y. and Zisserman, A.",
title = "Immediate, scalable object category detection",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
year = "2014",
}
The objective of this work is object category detection in large scale image datasets in the manner of Video Google - an object category is specified by a HOG classifier template, and retrieval is immediate at run time.
We make the following three contributions: (i) a new image representation based on mid-level discriminative patches, that is designed to be suited to immediate object category detection and inverted file indexing; (ii) a sparse representation of a HOG classifier using a set of mid-level discriminative classifier patches; and (iii) a fast method for spatial reranking of images on their detections.
We evaluate the detection method on the standard PASCAL VOC 2007 dataset, together with a 100K image subset of ImageNet, and demonstrate near state of the art detection performance at low ranks whilst maintaining immediate retrieval speeds. Applications are also demonstrated using an exemplar-SVM for pose matched retrieval.
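A hypothetical sketch of how immediate retrieval with an inverted file can work: images are indexed by which mid-level patch "words" they contain, a detector is encoded as a sparse set of weighted patch words, and scoring only touches the posting lists of those words. The vocabulary, weights, and image IDs below are made up.
from collections import defaultdict

# posting lists: patch word id -> list of (image id, activation strength); all invented
index = defaultdict(list)
index[3] = [("img_1", 0.9), ("img_7", 0.4)]
index[17] = [("img_7", 0.8), ("img_2", 0.5)]

# a detector encoded as a sparse set of (patch word id, weight); also invented
sparse_detector = {3: 1.2, 17: 0.7, 42: 0.3}

def retrieve(sparse_detector, index, top_k=5):
    # accumulate scores only for images that appear in the relevant posting lists
    scores = defaultdict(float)
    for word, weight in sparse_detector.items():
        for image_id, activation in index.get(word, []):
            scores[image_id] += weight * activation
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

print(retrieve(sparse_detector, index))   # ranked: img_1, img_7, img_2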
Y. Aytar, A. Zisserman
British Machine Vision Conference (BMVC), 2012
@InProceedings{Aytar12,
author = "Aytar, Y. and Zisserman, A.",
title = "Enhancing Exemplar SVMs using Part Level Transfer Regularization",
booktitle = "British Machine Vision Conference",
year = "2012",
}
Exemplar SVMs (E-SVMs, Malisiewicz et al., ICCV 2011), where an SVM is trained with only a single positive sample, have found applications in the areas of object detection and Content-Based Image Retrieval (CBIR), amongst others.
In this paper we introduce a method of part based transfer regularization that boosts the performance of E-SVMs, with a negligible additional cost. This Enhanced E-SVM (EE-SVM) improves the generalization ability of E-SVMs by softly forcing it to be constructed from existing classifier parts cropped from previously learned classifiers. In CBIR applications, where the aim is to retrieve instances of the same object class in a similar pose, the EE-SVM is able to tolerate increased levels of intra-class variation and deformation over E-SVM, and thereby increases recall.
We make the following contributions: (a) introduce the EE-SVM objective function; (b) demonstrate the improvement in performance of EE-SVM over E-SVM for CBIR; and, (c) show that there is an equivalence between transfer regularization and feature augmentation for this problem and others, with the consequence that the new objective function can be optimized using standard libraries.
EE-SVM is evaluated both quantitatively and qualitatively on the PASCAL VOC 2007 and ImageNet datasets for pose specific object retrieval. It achieves a significant performance improvement over E-SVMs, with greater suppression of negative detections and increased recall, whilst maintaining the same ease of training and testing.
Y. Aytar, A. Zisserman
IEEE International Conference on Computer Vision (ICCV), 2011
@InProceedings{Aytar11,
author = "Aytar, Y. and Zisserman, A.",
title = "Tabula Rasa: Model Transfer for Object Category Detection ",
booktitle = "IEEE International Conference on Computer Vision",
year = "2011",
}
Our objective is transfer training of a discriminatively trained object category detector, in order to reduce the number of training images required. To this end we propose three transfer learning formulations where a template learnt previously for other categories is used to regularize the training of a new category. All the formulations result in convex optimization problems.
Experiments (on PASCAL VOC) demonstrate significant performance gains by transfer learning from one class to another (e.g. motorbike to bicycle), including one-shot learning, specialization from class to a subordinate class (e.g. from quadruped to horse) and transfer using multiple components. In the case of multiple training samples it is shown that a detection performance approaching that of the state of the art can be achieved with substantially fewer training samples.
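One of the simplest forms such a transfer regularizer can take, shown here only as an illustration (the paper's three formulations are more general), is to penalize the distance between the new template w and a previously learned template w_src while fitting the usual hinge loss:
\min_{w}\; \frac{\lambda}{2}\,\lVert w - w_{\mathrm{src}} \rVert^{2} \;+\; C \sum_{i} \max\bigl(0,\; 1 - y_{i}\, w^{\top} x_{i}\bigr)
With w_src = 0 this reduces to a standard SVM, while a large lambda keeps the new detector close to the source-category template, which is what makes learning from very few (even single) positive samples feasible.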
Y. Aytar, M. Shah, J. Luo
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008
@InProceedings{Aytar08,
author = "Aytar, Y. and Shah, M. and Luo, J.",
title = "Utilizing Semantic Word Similarity Measures for Video Retrieval",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition",
year = "2008",
}
This is a high level computer vision paper, which employs concepts from Natural Language Understanding in solving the video retrieval problem. Our main contribution is the utilization of semantic word similarity measures (Lin and PMI-IR similarities) for video retrieval. In our approach, we use trained concept detectors, and the visual co-occurrence relations between such concepts. We propose two methods for content-based retrieval of videos: (1) A method for retrieving a new concept (a concept which is not known to the system, and for which no annotation is available) using semantic word similarity and visual co-occurrence. (2) A method for retrieval of videos based on their relevance to a user defined text query using the semantic word similarity and visual content of videos. For evaluation purposes, we have mainly used the automatic search and the high level feature extraction test sets of the TRECVID'06 benchmark, and the automatic search test set of TRECVID'07. These two data sets consist of 250 hours of multilingual news video captured from American, Arabic, German and Chinese TV channels. Although our method for retrieving a new concept is an unsupervised method, it outperforms the trained concept detectors (which are supervised) on 7 out of 20 test concepts, and overall it performs very close to the trained detectors. On the other hand, our visual content based semantic retrieval method performs 81% better than the text-based retrieval method. This shows that using visual content alone we can obtain significantly good retrieval results.
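As a rough sketch of the first method (retrieving a new concept), the score for an un-annotated concept can be built by weighting each known concept detector's score with the semantic word similarity between the new concept and that detector's concept. The toy similarity values and detector scores below are invented; the paper uses Lin and PMI-IR similarities and TRECVID concept detectors.
def score_new_concept(detector_scores, word_similarity, new_concept):
    # detector_scores: {known concept: detector score for one video shot};
    # word_similarity: function (concept_a, concept_b) -> similarity in [0, 1]
    total = sum(word_similarity(new_concept, c) * s
                for c, s in detector_scores.items())
    norm = sum(word_similarity(new_concept, c) for c in detector_scores)
    return total / norm if norm > 0 else 0.0

# invented similarities and detector outputs for a single shot
toy_similarity = {("boat", "water"): 0.8, ("boat", "car"): 0.3, ("boat", "office"): 0.1}
sim = lambda a, b: toy_similarity.get((a, b), 0.0)
shot_scores = {"water": 0.9, "car": 0.2, "office": 0.7}
print(score_new_concept(shot_scores, sim, "boat"))   # roughly 0.71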
Y. Aytar
MSc. Thesis, University of Central Florida, Orlando, FL, US, May 2008
@MastersThesis{aytar08a,
author = "Aytar, Y.",
title = "Semantic Video Retrieval Using High-Level Context",
school = "University of Central Florida",
year = "2008",
}
Video retrieval - searching and retrieving videos relevant to a user defined query - is one of the most popular topics in both real life applications and multimedia research. This thesis employs concepts from Natural Language Understanding in solving the video retrieval problem. Our main contribution is the utilization of semantic word similarity measures for video retrieval through the trained concept detectors, and the visual co-occurrence relations between such concepts. We propose two methods for content-based retrieval of videos: (1) A method for retrieving a new concept (a concept which is not known to the system, and for which no annotation is available) using semantic word similarity and visual co-occurrence, which is an unsupervised method. (2) A method for retrieval of videos based on their relevance to a user defined text query using the semantic word similarity and visual content of videos. For evaluation purposes, we mainly used the automatic search and the high level feature extraction test sets of the TRECVID'06 and TRECVID'07 benchmarks. These two data sets consist of 250 hours of multilingual news video captured from American, Arabic, German and Chinese TV channels. Although our method for retrieving a new concept is an unsupervised method, it outperforms the trained concept detectors (which are supervised) on 7 out of 20 test concepts, and overall it performs very close to the trained detectors. On the other hand, our visual content based semantic retrieval method performs more than 100% better than the text-based retrieval method. This shows that using visual content alone we can obtain significantly good retrieval results.
Y. Aytar, O. B. Orhan and M. Shah
International Conference on Multimedia & Expo (ICME), 2007
@InProceedings{Aytar07,
author = "Aytar, Y. and O. B. Orhan and Shah, M.",
title = "Improving Semantic Concept Detection and Retrieval Using Contextual Estimates",
booktitle = "International Conference on Multimedia & Expo",
year = "2007",
}
In this paper we introduce a novel contextual fusion method to improve the detection scores of semantic concepts in images and videos. Our method consists of three phases. For each individual concept, the prior probability of the concept is incorporated with the detection score of an individual SVM detector. Then probabilistic estimates of the target concept are computed using all of the individual SVM detectors. Finally, these estimates are linearly combined using weights learned from the training set. This procedure is applied to each target concept individually. We show significant improvements to our detection scores on the TRECVID 2005 development set and LSCOM-Lite annotation set. We achieved on average +3.9% improvement in 29 out of 39 concepts.
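An illustrative sketch of the final fusion step, under the assumption that per-sample estimates from all detectors are already available: fit linear fusion weights on a training set (ordinary least squares here purely for illustration) and apply them to new estimates. The data, detector count, and choice of least squares are assumptions of this sketch, not details from the paper.
import numpy as np

rng = np.random.default_rng(5)
train_estimates = rng.random(size=(200, 39))   # placeholder estimates from 39 detectors
train_labels = rng.integers(0, 2, size=200)    # placeholder presence labels for the target concept

# learn linear fusion weights (plus a bias term) by ordinary least squares
A = np.hstack([train_estimates, np.ones((200, 1))])
weights, *_ = np.linalg.lstsq(A, train_labels, rcond=None)

def fuse(estimates, weights):
    # apply the learned linear combination to new per-detector estimates
    return np.hstack([estimates, np.ones((len(estimates), 1))]) @ weights

print(fuse(rng.random(size=(5, 39)), weights))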