Recent Research Projects




In this work, we propose CrossModal Active Contrastive Coding that builds an actively sampled dictionary with diverse and informative samples, which improves the quality of negative samples and achieves substantially improved results on tasks where incomplete representations are a major challenge.




In this work, we propose an adversarial cycle consistency training scheme with paired and unpaired triplets to ensure the use of information from all style dimensions. We use this method to transfer emotion from a dataset containing four emotions to a dataset with only a single emotion.


Characterizing Bias in Classifiers using Generative Models


In this work, we incorporate an efficient search procedure to identify failure cases and then show how this approach can be used to identify biases in commercial facial classification systems. 


Skip-Modal Generative Networks for Image-to-Speech Synthesis​

In this work, we introduce the problem of translating instances from one modality to another without paired data. Specifically, we perform image-to-speech synthesis for  demonstration.

Click to play audio


Neural TTS Stylization with Adversarial and Collaborative Games

In this work, we introduce an end-to-end TTS stylization model that offers enhanced content-style disentanglement ability and controllability.  Given a text and a reference audio as input, our model can generate human fidelity speech that satisfies the desired style conditions. 


M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention

We present a unified model, M3D-GAN, that can translate across a wide range of modalities (e.g., text, image, and speech) and domains (e.g., attributes in images or emotions in speech). We introduce a universal attention module that is jointly trained with the whole network and learns to encode a large range of domain information into a highly structured latent space. We use this to control synthesis in novel ways, such as producing diverse realistic pictures from a sketch or varying the emotion of synthesized speech. We evaluate our approach on extensive benchmark tasks, including imageto-image, text-to-image, image captioning, text-to-speech, speech recognition, and machine translation. Our results show state-of-the-art performance on some of the tasks.


DA-GAN: Instance-level Image Translation by Deep Attention Generative Adversarial Networks

In this work, we propose a novel framework for instance-level image translation by Deep Attention GAN (DA-GAN). Such a design enables DA-GAN to decompose the task of translating samples from two sets into translating instances in a highly-structured latent space.


Automatic Generation of Personalized Visual Summaries from Unstructured Social Media feeds for Mobile Browsing

This work propose a framework that is capable of (1) summarizing information overloading unstructured social media feeds (pictures, texts and videos) with the consideration of user interests, and (2) capturing intelligently the semantics of the generated digest and converting into visually appealing representations for display on constrained mobile devices.


D-Sempre: Learning Deep Semantic-Preserving Embeddings for User interests-Social Contents Modeling

In this work, we propose a novel deep learning framework for user interests-social contents modeling that can effectively integrates the multimodal data from heterogeneous social media feeds and captures the hidden semantic correlations between users’ interests and social contents.

Pose Maker: A Pose Recommendation System for Person in the Landscape Photographing

In this work, we proposed pose recommendation system. Given a user-provided clothing color and gender, this system shall not only offer some suitable poses, but also assist users to take high visual quality photos by generating the visual effect of person in the landscape pictures.