Radioactive data: tracing through training #5

Open
YeonwooSung opened this issue Aug 26, 2020 · 1 comment

@YeonwooSung
Owner

Abstract

  • Neural classifiers can improve their performance by training on more data
  • But given a trained classifier, it's difficult to tell what data it was trained on
  • This is especially relevant if you have proprietary or personal data and you want to make sure that other people don't use it to train their models
  • Experiment: train a CNN on vanilla data first, then train the CNN on the radioactive data (i.e. distorted images)

Basically, this paper introduces a method to mark a dataset with a hidden "radioactive" tag, such that any resulting classifier will clearly exhibit this tag, which can be detected.
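To make the marking step concrete, here is a minimal sketch of the idea, assuming a PyTorch-style feature extractor `marker_net` and a secret unit-norm `carrier` direction (these names, the step sizes, and the exact loss are my own illustrative choices, not the authors' code): each image receives a small, bounded perturbation that pushes its feature toward the carrier.

```python
# Minimal sketch (illustrative, not the authors' implementation):
# mark images so their features align with a secret "carrier" direction.
import torch
import torch.nn.functional as F

def mark_images(images, marker_net, carrier, steps=10, lr=0.01, eps=8 / 255):
    """Return perturbed copies of `images` whose features (as seen by
    `marker_net`) are pushed toward the unit-norm `carrier` direction."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        feats = marker_net(images + delta)          # (batch, d) feature vectors
        # Maximize cosine similarity between each feature and the carrier.
        loss = -F.cosine_similarity(feats, carrier.expand_as(feats), dim=1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()         # small signed-gradient step
            delta.clamp_(-eps, eps)                 # keep the distortion small
        delta.grad.zero_()
    return (images + delta).detach()
```

The actual method also penalizes pixel-space and feature-space distortion so the marked images stay visually unchanged; the sketch keeps only the core "push the feature toward the carrier" step.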

Details

[Figure 2 from the paper]

  • When you radioactively mark the data points, it simply adds a feature (an extra direction in the feature space)

    • Let's assume that there are 10 classes in our problem
      • We can imagine 10 vectors, one per class (each acts like a unique axis representing its class)
      • The classifier maps a data point into that space and picks the class whose vector it is most aligned with
    • As another example, suppose an image-classification problem with 4 classes
      • In this case, the classifier has 4 axis vectors (w, x, y, z)
      • Strictly speaking, these are not axes but vectors learned during training
      • To classify a data point, the classifier places it in this feature space and assigns it to the class of the closest (most aligned) of the 4 vectors
  • So, here, we are introducing a fake class vector

    • Clearly, this is cheating!
  • By using this method, we modify the training data only (not the model or the training procedure)

  • We give up a little bit of generalization capability, but this forces the model to pay attention to the fake (radioactive) features
    - This is something that you can detect

  • For testing, they consider random vectors in the feature space and look at the cosine similarity between the fake (carrier) vector and each random vector
    - The authors state that if you distort the data well, then theoretically the cosine between the carrier and random vectors should follow the distribution shown below, so a significant deviation from it reveals that the marked data was used (see the detection sketch after this list)

[Figure: cosine similarity distribution for random vectors]

  • The paper also presents a method for re-aligning feature spaces, so the test still works when the suspect model's feature space differs from the one used for marking
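To illustrate the detection test described above, here is a minimal NumPy sketch (illustrative names and a Monte-Carlo estimate, not the authors' implementation): compute the cosine between the secret carrier and the suspect classifier's weight vector for the marked class, then compare it against the cosine distribution of random directions in the same dimension.

```python
# Minimal sketch (illustrative, not the authors' code): compare the observed
# cosine(carrier, class weight) against the cosine distribution of random vectors.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detection_test(carrier, class_weight, n_samples=100_000, seed=0):
    """Return (empirical p-value, observed cosine): how often a random
    direction aligns with the carrier at least as well as the trained
    classifier's weight vector does."""
    rng = np.random.default_rng(seed)
    d = carrier.shape[0]
    observed = cosine(carrier, class_weight)
    # Under the null hypothesis (no radioactive data was used), the class
    # weight behaves like a random direction, so its cosine with the carrier
    # follows the cosine distribution of random vectors in dimension d.
    random_dirs = rng.standard_normal((n_samples, d))
    null_cos = random_dirs @ carrier / (
        np.linalg.norm(random_dirs, axis=1) * np.linalg.norm(carrier)
    )
    return float((null_cos >= observed).mean()), observed
```

The paper reports a closed-form p-value based on the distribution of cosines between random unit vectors; the sketch simply estimates it by sampling, which is enough to convey the idea.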

Personal Thoughts

Clearly, data is the modern gold. Neural classifiers improve by training on more data, but given a trained classifier it is difficult to tell what data it was trained on. That matters when you have proprietary or personal data and want to make sure other people don't use it to train their models. This paper's hidden "radioactive" tag makes such use detectable in the resulting classifier.

@YeonwooSung
Owner Author

As I mentioned above, the main aim of this paper is to provide a method for checking whether other people are training their models on your data.

I assume this might be related to data privacy issues?
