Allow for subsampling when inputting large training data #1

Open
inoueakimitsu opened this issue Oct 3, 2021 · 11 comments
Comments

@inoueakimitsu
Owner

When the total number of instances is large, the computation time of most scikit-learn trainers becomes unacceptably long.
In the early iterations of multiple instance learning, it may not be necessary to include all instances in the training data.
Therefore, we will add a feature that adjusts the number of subsampled instances used for training according to how far the multiple instance learning has converged.
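
As a rough illustration of the kind of schedule this could mean (the function name and numbers are placeholders, not existing milwrap API), the subsample ratio could start small and grow as the iterations converge:

def subsample_ratio(iteration, start=0.05, growth=2.0):
    # placeholder schedule: use 5% of the instances at iteration 0 and
    # double the fraction each iteration until all instances are used
    return min(1.0, start * growth ** iteration)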

inoueakimitsu added the enhancement (New feature or request) label on Oct 3, 2021
@inoueakimitsu
Owner Author

It might be better to let the user set the learning schedule freely by allowing a previously trained learner to be passed as an argument.

@inoueakimitsu
Owner Author

Should the subsampling be for bags or instances?

The advantage of subsampling bags is its simplicity, which also fits well with the idea of passing a trained model as an argument. However, if the number of bags is not large, accuracy may deteriorate.

Subsampling instances is more complicated because the values of lower_threshold and upper_threshold must be rescaled. For example, to subsample by a factor of 10, multiply lower_threshold and upper_threshold by 0.1. If this makes them too small, the target instances may not end up in any subsample, so the subsample ratio may need to be adjusted per bag, as in the sketch below.
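
A minimal sketch of that rescaling, assuming lower_threshold and upper_threshold are per-bag count arrays (as in the code later in this thread):

import numpy as np

subsample_ratio = 0.1  # keep 10% of the instances in each bag (placeholder value)
# rounding the lower bound down and the upper bound up keeps the constraints loose
lower_threshold_sub = np.floor(lower_threshold * subsample_ratio)
upper_threshold_sub = np.ceil(upper_threshold * subsample_ratio)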

@inoueakimitsu
Owner Author

Subsampling instances changes the lower and upper limits on the number of target instances. The actual number of target instances in the subsampled bag is then described by a probability distribution. As a prior, it would be assumed that the bag contains a number of target instances distributed uniformly between the specified lower and upper limits.

Rather than simply scaling the lower and upper limits by the subsampling ratio, it would be better to raise the upper limit and lower the lower limit to account for this uncertainty.

@inoueakimitsu
Owner Author

We can compute the posterior probability that the number of target instances falls short of the lower limit and, conversely, the posterior probability that it exceeds the upper limit. Let's use this.
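
One concrete way to compute these tail probabilities (a sketch, not existing library code): if a bag with n_bag instances truly contains k target instances, the number of targets that survive a subsample of n_sub instances drawn without replacement is hypergeometric, and the tail probability can be averaged over the uniform prior on k. The same construction with the survival function gives the probability of exceeding a scaled upper bound.

import numpy as np
from scipy.stats import hypergeom

def prob_below_lower(n_bag, n_sub, lower, upper, lower_sub):
    # P(target count in the subsample < lower_sub), averaged over a uniform
    # prior on the true target count k in [lower, upper]
    ks = np.arange(lower, upper + 1)
    # hypergeom(M=bag size, n=targets in bag, N=subsample size)
    tails = [hypergeom(n_bag, k, n_sub).cdf(lower_sub - 1) for k in ks]
    return float(np.mean(tails))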

@inoueakimitsu
Owner Author

In addition to subsampling, the following approaches could be taken:

  • online algorithms
  • distributed algorithms

One way to implement a distributed algorithm is as follows (see the sketch after this list):

  1. Randomly divide the bags into subsets.
  2. Train a classifier on each subset.
  3. Combine them with sklearn.ensemble.VotingClassifier.
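
A sketch of this split-train-vote scheme. Here fit_mil_on_bags is a hypothetical stand-in for whatever routine trains an instance-level classifier from a subset of bags and their count thresholds (lower_threshold and upper_threshold are assumed per-bag arrays), and the final hard vote is done with numpy; sklearn.ensemble.VotingClassifier would require the per-subset learners to follow the usual fit(X, y) estimator protocol.

import numpy as np

rng = np.random.default_rng(0)
n_subsets = 4
subset_ids = rng.integers(0, n_subsets, size=len(bags))

# train one instance-level classifier per random subset of bags (hypothetical helper)
classifiers = []
for s in range(n_subsets):
    keep = subset_ids == s
    subset_bags = [b for b, in_s in zip(bags, keep) if in_s]
    classifiers.append(
        fit_mil_on_bags(subset_bags, lower_threshold[keep], upper_threshold[keep]))

def vote(instances):
    # hard-vote the instance-level predictions of the per-subset classifiers
    preds = np.stack([clf.predict(instances) for clf in classifiers])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)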

@inoueakimitsu
Owner Author

When the instances of a bag are subsampled, the rarest classes in that bag are very likely to disappear entirely.
If the class imbalance is strong, instance-level subsampling is therefore not practical.

@inoueakimitsu
Owner Author

I came up with a fast way to find initial labels without subsampling.
The proposed method uses clustering and bag-wise regression.
The specific steps are as follows:

  1. Flatten all the bags.
  2. Cluster all instances.
  3. Express the clustering results as one-hot vectors.
  4. Compute the within-bag sum of the one-hot vectors of the instances in each bag.
  5. Regress the result of step 4 onto the bag's label counts (the lower or upper limit on the number of instances per class). A constraint on the L0 norm may be necessary. The intercept term is fixed to 0:

n_class1 ~ 0 + a1 * n_cluster1 + a2 * n_cluster2 + ...    (Eq. 1)

  6. In Eq. (1), a1 represents the fraction of the instances in cluster 1 that belong to class 1.
     Based on these coefficients, determine the initial labels of the instances.

@inoueakimitsu
Owner Author


Minimal code for the steps above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# `bags` is the list of per-bag instance arrays; `lower_threshold` holds the
# per-bag lower limits on the number of instances of each class.
flatten_features = np.vstack(bags)

n_clusters = 15
cluster_generator = KMeans(n_clusters=n_clusters, random_state=0)
cluster_generator.fit(flatten_features)

# one-hot encode the cluster index of every instance
cluster_encoder = OneHotEncoder(sparse=False)  # sparse_output=False on scikit-learn >= 1.2
cluster_encoder.fit(np.arange(n_clusters).reshape((-1, 1)))

one_hot_encoded = [
  cluster_encoder.transform(
    cluster_generator.predict(b).reshape((-1, 1))) for b in bags]

# per-bag cluster histogram: rows are bags, columns are clusters
bag_cluster_table = np.array([np.sum(x, axis=0) for x in one_hot_encoded])
print(bag_cluster_table[:8])
print("---")
print(lower_threshold[:8])

# lower_threshold [bag * class] * W_lower[class * cluster] <= bag_cluster_table [bag * cluster]
# upper_threshold [bag * class] * W_upper[class * cluster] >= bag_cluster_table [bag * cluster]
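
The regression of Eq. (1) is not in the snippet above. A minimal sketch of that step, using nonnegative least squares with no intercept as a stand-in for the L0-constrained regression, and assuming lower_threshold is an (n_bags, n_classes) array of per-bag lower counts:

import numpy as np
from scipy.optimize import nnls

n_classes = lower_threshold.shape[1]
# cluster_class_ratio[k, j] approximates the fraction of cluster j's instances in class k
cluster_class_ratio = np.zeros((n_classes, n_clusters))
for k in range(n_classes):
    cluster_class_ratio[k], _ = nnls(bag_cluster_table, lower_threshold[:, k])

# initial instance labels: assign each cluster to the class with the largest coefficient
cluster_to_class = cluster_class_ratio.argmax(axis=0)
initial_labels = [cluster_to_class[cluster_generator.predict(b)] for b in bags]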

@inoueakimitsu
Owner Author

I have implemented this idea: clustermil.

It required some non-trivial improvements, but I basically followed the approach above. The accuracy of this method may not be great, but it converges quickly, so it is a good candidate for generating initial labels.

@inoueakimitsu
Owner Author

inoueakimitsu commented Oct 31, 2021

How about kernel approximation plus an SGDClassifier-based linear SVM?
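
A minimal sketch of that pairing with standard scikit-learn pieces; the hyperparameters are placeholders:

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# approximate an RBF-kernel SVM with an explicit random feature map followed by
# a linear SVM trained with SGD; cost grows linearly in the number of instances
approx_kernel_svm = make_pipeline(
    RBFSampler(gamma=1.0, n_components=300, random_state=0),
    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3),
)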

@inoueakimitsu
Owner Author

inoueakimitsu commented Oct 31, 2021

The following policy seems good:

  • Add support for skorch and sklearn.SGDClassifier.
  • Add an online argument to the fit method.
  • When the online argument is true, allow bags to be a tf.data.Dataset or a torch DataLoader instead of a list of np.ndarray.
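
For the sklearn.SGDClassifier part, the online path would presumably rely on partial_fit. A rough sketch, where bag_stream and the way tentative labels reach the instance level are assumptions, not existing milwrap API:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")
n_classes = 3  # placeholder: number of instance-level classes
classes = np.arange(n_classes)  # every class label must be declared on the first call

# bag_stream stands for any iterable of (instances, tentative_instance_labels)
# mini-batches, e.g. built from a torch DataLoader or a tf.data.Dataset
for instances, labels in bag_stream:
    clf.partial_fit(instances, labels, classes=classes)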

inoueakimitsu self-assigned this on Aug 29, 2022