Few-Shot Unsupervised Image-to-Image Translation #58

howardyclo opened this issue Jun 13, 2019 · 1 comment

Abstract

  • Current unsupervised/unpaired image-to-image translation (UIT) methods (see ref) typically require many images in both the source and target classes, which greatly limits their applicability.
  • This paper proposes a novel framework that needs only a few target-class examples (few-shot) and can handle unseen target classes.
  • The proposed framework can also be applied to few-shot image classification, where it outperforms a SoTA method based on feature hallucination.

Method Overview

  • Motivation: Humans can imagine unseen target classes (e.g., seeing a standing tiger for the first time and imagining it lying down) from past visual experience (e.g., having seen other animals standing and lying down before).
    • Past visual experience: learn from images of many different source classes.
    • Imagine unseen classes: translate images from a source class to a target class given only a few examples of the target class.
  • Data: many source classes, each containing many images (e.g., different animal species).
  • Training: use the source-class images to train a multi-class UIT model (the target class is still drawn from the source classes).
  • Inference: only a few images of the seen/unseen target class are available, and only at inference time.

Model

  • x̄ = G(x, {y_1, ..., y_K}): A conditional few-shot image generator (translator) that takes a content image x and 1-way (class) K-shot class images {y_1, ..., y_K} as input and generates the output image x̄ (a minimal code sketch of this structure follows the list).
    • z_x = E_x(x): A content encoder maps the content image x to a content latent code z_x.
    • z_y = E_y({y_1, ..., y_K}): A class (style) encoder maps each of {y_1, ..., y_K} to a latent vector individually and averages them into a class latent code z_y.
    • x̄ = F_x(z_x, z_y): A decoder consisting of several adaptive instance normalization (AdaIN) residual blocks followed by several upscaling convolutional layers.
    • By feeding z_y to the decoder via the AdaIN layers, the class images control the global look (style) while the content image maintains the local structure (content).
    • The generalization capability depends on the number of source classes seen during training (more is better).
  • D: A multi-task adversarial discriminator.
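
A minimal PyTorch sketch of the translator structure described above. This is only an illustration under assumed layer counts, channel sizes, and AdaIN parameterization; the class names (`ContentEncoder`, `ClassEncoder`, `AdaINResBlock`, `FewShotTranslator`) are placeholders, not the paper's architecture or code.

```python
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """E_x: maps a content image x to a spatial content code z_x."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 7, 1, 3), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.InstanceNorm2d(ch * 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.InstanceNorm2d(ch * 4), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class ClassEncoder(nn.Module):
    """E_y: encodes each class image y_k to a vector, then averages over the K shots."""
    def __init__(self, in_ch=3, ch=64, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 7, 1, 3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(ch * 4, z_dim)

    def forward(self, ys):                                   # ys: (B, K, C, H, W)
        b, k = ys.shape[:2]
        feats = self.net(ys.flatten(0, 1)).flatten(1)        # (B*K, ch*4)
        z = self.fc(feats).view(b, k, -1)
        return z.mean(dim=1)                                 # average K codes -> z_y


class AdaINResBlock(nn.Module):
    """Residual block whose normalization scale/shift are predicted from z_y (AdaIN)."""
    def __init__(self, ch, z_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, 1, 1)
        self.conv2 = nn.Conv2d(ch, ch, 3, 1, 1)
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.affine = nn.Linear(z_dim, 4 * ch)               # scale/shift for both convs
        self.act = nn.ReLU(inplace=True)

    def _adain(self, h, scale, shift):
        return self.norm(h) * (1 + scale[..., None, None]) + shift[..., None, None]

    def forward(self, h, z_y):
        s1, b1, s2, b2 = self.affine(z_y).chunk(4, dim=1)
        out = self.act(self._adain(self.conv1(h), s1, b1))
        out = self._adain(self.conv2(out), s2, b2)
        return h + out


class FewShotTranslator(nn.Module):
    """x_bar = G(x, {y_1..y_K}) = decoder(E_x(x), E_y({y_k}))."""
    def __init__(self, ch=64, z_dim=64):
        super().__init__()
        self.e_x = ContentEncoder(ch=ch)
        self.e_y = ClassEncoder(ch=ch, z_dim=z_dim)
        self.res = nn.ModuleList([AdaINResBlock(ch * 4, z_dim) for _ in range(2)])
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(ch * 4, ch * 2, 5, 1, 2), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ch * 2, ch, 5, 1, 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 7, 1, 3), nn.Tanh(),
        )

    def forward(self, x, ys):
        z_x, z_y = self.e_x(x), self.e_y(ys)
        h = z_x
        for block in self.res:
            h = block(h, z_y)                                # z_y injected via AdaIN only
        return self.up(h)                                    # x_bar


# Shape check: one content image, K=5 class images.
# g = FewShotTranslator()
# x_bar = g(torch.randn(1, 3, 64, 64), torch.randn(1, 5, 3, 64, 64))
```

The key design point is that z_y enters only through the AdaIN statistics, so the class images can change the global appearance without disturbing the spatial content code.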

Training

  • |S|: Number of source classes.
  • For D, each task is to determine whether an input image is a real image of the corresponding source class. Since there are |S| source classes, D has |S| binary outputs (see the sketch after this list).
  • Given a real image x of source class c_x, penalize D if its c_x-th output says fake. There is no penalty for outputting fake on the other (|S|-1) source classes.
  • Given a fake (translated) image of source class c_x, penalize D if its c_x-th output says real; penalize G if that output says fake.
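
A minimal sketch of this per-class masking, assuming D returns per-class real/fake logits of shape (B, |S|). Plain binary cross-entropy is used here for readability; the paper's exact adversarial objective may differ.

```python
import torch
import torch.nn.functional as F


def discriminator_loss(real_logits, fake_logits, class_idx):
    """real_logits/fake_logits: (B, |S|); class_idx: (B,) long tensor with each image's class c_x."""
    real = real_logits.gather(1, class_idx[:, None]).squeeze(1)  # c_x-th output only
    fake = fake_logits.gather(1, class_idx[:, None]).squeeze(1)
    # D should say "real" for real images of class c_x and "fake" for translations into c_x.
    # The remaining |S|-1 outputs receive no penalty at all.
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))


def generator_loss(fake_logits, class_idx):
    """G is penalized when D's c_x-th output classifies its translation as fake."""
    fake = fake_logits.gather(1, class_idx[:, None]).squeeze(1)
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```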

Losses

  • Overall loss

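The equation itself was an image in the original note; schematically, and as a reconstruction of its form only (λ_R and λ_F denote the reconstruction and feature-matching weights), the objective combines the three terms below and is solved as a minimax game between G and D:

```latex
\mathcal{L}(D, G) \;=\;
\mathcal{L}_{\mathrm{GAN}}(D, G)
\;+\; \lambda_{R}\, \mathcal{L}_{R}(G)
\;+\; \lambda_{F}\, \mathcal{L}_{F}(G)
```
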
  • GAN loss: As described above.

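One standard way to write this conditional (per-class) adversarial term, as a sketch rather than the paper's exact notation: D^c(·) is D's c-th binary output, c_x the class of the real content image, c_y the class of the few-shot class images, and x̄ = G(x, {y_1, ..., y_K}).

```latex
\mathcal{L}_{\mathrm{GAN}}(D, G) =
\mathbb{E}_{x}\big[ -\log D^{c_x}(x) \big]
\;+\;
\mathbb{E}_{x,\{y_1,\dots,y_K\}}\big[ -\log\big( 1 - D^{c_y}(\bar{x}) \big) \big]
```
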
  • Reconstruction loss encourages the output's content to stay similar to the source-class input image.

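A plausible form (a reconstruction sketch, not copied from the paper) uses the content image itself as the single class image and asks G to reproduce it:

```latex
\mathcal{L}_{R}(G) =
\mathbb{E}_{x}\big[\, \lVert x - G(x, \{x\}) \rVert_{1} \,\big]
```
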
  • Feature matching loss encourages the output's style to be similar to that of the target-class images.
  • D_f is the feature extractor obtained by removing the last (prediction) layer of the discriminator D.
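
With D_f as defined above, a sketch of the feature-matching term compares the translation's discriminator features with the average features of the K class images:

```latex
\mathcal{L}_{F}(G) =
\mathbb{E}_{x,\{y_1,\dots,y_K\}}\Big[\,
\Big\lVert D_f(\bar{x}) \;-\; \tfrac{1}{K} \sum_{k=1}^{K} D_f(y_k) \Big\rVert_{1}
\,\Big]
```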

UIT Methods with Different Constraints (each enforces the translation to preserve certain properties)

Related work
