Associating the visual and tactile properties of fabrics
The mechanical properties of objects largely determine how we perceive them, yet those properties can be hard to name or measure explicitly. In this project, we asked the question: is it possible to learn a representation of objects that relates to multiple useful properties, even when the names and values of those properties are unknown? We took fabric as an example and designed an architecture that learns property representations from both vision and touch, based on the fact that observations of the same piece of fabric may differ with the sensory modality or viewing angle, but they are all generated from the same underlying set of properties.
Examples of the visual and tactile data for two different fabrics are shown below. For the visual data, we draped the fabrics on a cylindrical post and took images (both color and depth images); for the tactile data, we pressed the GelSight sensor on the fabrics, either on the flat folds or at random positions on the fabrics.
We collected 118 fabrics for the experiment and asked human subjects to evaluate them on different properties. Based on these evaluations, we divided the fabrics into 8 clusters.
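The clustering step itself is standard k-means on the human property ratings. Below is a minimal sketch of that step, assuming the ratings are stored as a fabrics-by-properties matrix; the number of properties, the property names, and the rating scale in the sketch are placeholders, not the ones used in the actual study.

```python
# Minimal sketch: cluster 118 fabrics into 8 groups from human property ratings.
# The rating data here is random placeholder data; the real ratings come from
# human subjects evaluating each fabric on several (unnamed) properties.
import numpy as np
from sklearn.cluster import KMeans

num_fabrics = 118
num_properties = 4                      # hypothetical number of rated properties
rng = np.random.default_rng(0)
human_ratings = rng.uniform(0, 1, size=(num_fabrics, num_properties))  # placeholder

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(human_ratings)   # one of 8 cluster ids per fabric
```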
We used a joint CNN architecture to learn an invariant representation of the fabrics, and used the resulting representation vectors to compare how similar the source fabrics are. We took depth images of the fabrics as the visual input, and GelSight data from pressing on the flat folds of the fabrics as the tactile input. The joint neural network architecture is shown in the left figure: we used AlexNet as the CNN and took the fc7 layer of the network as the embedding that represents the fabric. We also designed two more complicated networks, shown in the right figure below, which showed better performance. The clusters in these networks are the k-means clusters based on the human subjects' evaluation of the fabrics.
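As a rough illustration of the basic joint-embedding idea, the sketch below sets up one AlexNet branch per modality in PyTorch and takes the fc7 activations as the embedding. The contrastive loss shown is a common stand-in and an assumption for illustration; it is not necessarily the exact objective, weight-sharing scheme, or auxiliary-cluster setup used in our networks.

```python
# Sketch of a two-branch joint embedding network: one AlexNet branch for depth
# images, one for GelSight images, each producing a 4096-d "fc7" embedding.
import torch
import torch.nn as nn
import torchvision.models as models

class ModalityBranch(nn.Module):
    def __init__(self):
        super().__init__()
        alexnet = models.alexnet(weights=None)
        self.features = alexnet.features
        self.avgpool = alexnet.avgpool
        # keep the classifier up to and including the ReLU after fc7
        self.fc = nn.Sequential(*list(alexnet.classifier.children())[:6])

    def forward(self, x):
        x = self.features(x)            # x: 3-channel image tensor
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.fc(x)               # 4096-d embedding

vision_branch = ModalityBranch()        # takes depth images
touch_branch = ModalityBranch()         # takes GelSight images

def contrastive_loss(v, t, same_fabric, margin=1.0):
    # pull embeddings of the same fabric together, push different fabrics apart
    # (illustrative objective; same_fabric is a 0/1 float tensor per pair)
    d = torch.norm(v - t, dim=1)
    return torch.mean(same_fabric * d ** 2 +
                      (1 - same_fabric) * torch.clamp(margin - d, min=0) ** 2)
```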
We tested the network by asking it to rank the likelihood that each of 10 candidate depth images matches a given GelSight input image. The following figure shows an example of how the test works. For the multiple-input network with the auxiliary cluster prediction, the chance of ranking the correct image first (top-1) is 0.485, and within the top 3 is 0.842, compared to random chances of 0.1 and 0.3.
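At test time this reduces to ranking candidates by distance in the shared embedding space. A small sketch of that evaluation is below, assuming Euclidean distance between embeddings is used as the matching score.

```python
# Sketch of the ranking test: given one tactile embedding and 10 candidate visual
# embeddings, rank candidates by distance and measure top-k accuracy.
import torch

def rank_candidates(touch_emb, candidate_vision_embs):
    # touch_emb: (D,) tensor; candidate_vision_embs: (10, D) tensor
    dists = torch.norm(candidate_vision_embs - touch_emb, dim=1)
    return torch.argsort(dists)          # candidate indices, best match first

def top_k_accuracy(rankings, true_indices, k):
    # rankings: list of ranked index tensors; true_indices: correct candidate per trial
    hits = sum(int(true in ranking[:k]) for ranking, true in zip(rankings, true_indices))
    return hits / len(rankings)
```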
Here we show the confusion matrix obtained by comparing the embedding vectors of different fabrics. For better visualization, we ordered the fabrics by cluster, so that nearby fabrics appear similar to humans. The confusion matrix shows that most of the confusion occurs near the diagonal: although the network sometimes mismatches fabrics, it is mostly confused by similar ones.
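One way such a matrix can be assembled is sketched below, under the simplifying assumption that each fabric's tactile embedding is matched to its nearest visual embedding and that rows and columns are then reordered by cluster label; this is an illustration, not the exact procedure used to produce the figure.

```python
# Sketch: build a fabric-vs-fabric confusion matrix from nearest-neighbor matches
# in the embedding space, then reorder rows/columns so same-cluster fabrics are adjacent.
import numpy as np

def confusion_matrix(touch_embs, vision_embs, cluster_labels):
    # touch_embs, vision_embs: (num_fabrics, D) arrays, row i belongs to fabric i
    n = touch_embs.shape[0]
    conf = np.zeros((n, n))
    for i in range(n):
        dists = np.linalg.norm(vision_embs - touch_embs[i], axis=1)
        conf[i, np.argmin(dists)] += 1   # count which fabric the match went to
    order = np.argsort(cluster_labels)   # group fabrics of the same cluster together
    return conf[np.ix_(order, order)]
```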
We also found that learning with multiple modalities improves performance on matching tasks within a single modality. For example, for the task of matching a depth image to a given depth image, we could either train a Siamese network on depth images alone, or train a cross-modal network on depth images and tactile data and use only the vision branch at test time. Results show that the network trained with two modalities performs much better than the Siamese network.
This work was selected for an oral presentation at CVPR 2017.