Online hard-example mining/examining under multi-GPU dp mode #1170
-
Background
Hi, I'm trying to track the prediction of each individual sample during training_step/validation_step. The main purpose is to do online hard-example mining/examining. One way I found to do this is to make the input to training_step/validation_step carry the sample-id information, for example the file name, so I made the input a dictionary. Example Code
The input dict works on a single GPU but fails under multi-GPU dp
It also took me some time to realize that every value in the input dictionary must be a torch.Tensor, not a list of strings; otherwise, when training under multi-GPU dp mode, the list objects are not split across devices properly. With that change, the input dict works on both a single GPU and multi-GPU dp
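A minimal pure-Python sketch of the workaround described above (all names are illustrative, and the scatter function is only a stand-in for what dp does internally): instead of putting filename strings in the batch dict, pass integer sample ids, which in real code would be a torch.Tensor that dp can split along dim 0, and map them back to filenames after the step.

```python
# Hypothetical sketch: dp scatters torch.Tensor values along dim 0,
# but leaves Python lists of strings intact, so we pass integer ids
# instead of filenames and look the names up again afterwards.
filenames = ["img_000.png", "img_001.png", "img_002.png", "img_003.png"]

def make_batch(indices):
    # In real code this would be e.g. {"x": images, "idx": torch.tensor(indices)};
    # only tensor values are split correctly across GPUs under dp.
    return {"idx": list(indices)}

def scatter(batch, n_gpus):
    # Stand-in for how dp splits a tensor along dim 0 across devices.
    ids = batch["idx"]
    chunk = (len(ids) + n_gpus - 1) // n_gpus
    return [{"idx": ids[i * chunk:(i + 1) * chunk]} for i in range(n_gpus)]

def recover_filenames(sub_batch):
    # Map the scattered integer ids back to sample file names.
    return [filenames[i] for i in sub_batch["idx"]]
```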
Currently, I still have some doubts about this approach...
Replies: 3 comments
-
@neggert @jeffling @jeremyjordan pls ^^
-
have you considered using a library such as pytorch-metric-learning? in general, it would look something like
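The original code example was not preserved in this thread. As a minimal pure-Python stand-in for what a within-batch miner does (in pytorch-metric-learning this role is played by its miner classes, used together with a loss function), with all names illustrative:

```python
def mine_hard_examples(per_sample_losses, sample_ids, k=2):
    # Rank the samples in this batch by loss and keep the k hardest.
    ranked = sorted(zip(per_sample_losses, sample_ids), reverse=True)
    return [sid for _, sid in ranked[:k]]
```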
this does mining within each batch that you pass in. i'm not sure where you're doing the mining currently, but it seems suspicious to be appending data to a class attribute (…)
-
Thanks for the reply! My original purpose is to pick out and record the hard samples during training/validation after every epoch, which is why I append the results to the lightning-module instance. Thanks for pointing out that this design would fail on multi-GPU ddp mode. I didn't know about pytorch-metric-learning before; it seems to be exactly the kind of library I should look at. Really appreciate it!
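One way to avoid appending to a class attribute is to return per-sample results from each step and aggregate them once at epoch end (in PyTorch Lightning, the step-output list is handed to the epoch-end hook). A pure-Python sketch of that pattern, with hypothetical names and plain lists standing in for tensors:

```python
def validation_step(batch_losses, batch_ids):
    # Return per-sample results instead of mutating self.* state,
    # which is unsafe under ddp where each process has its own copy.
    return {"loss": batch_losses, "ids": batch_ids}

def validation_epoch_end(outputs, k=2):
    # Aggregate across all batches, then pick the k hardest samples.
    all_pairs = [(l, i) for out in outputs
                 for l, i in zip(out["loss"], out["ids"])]
    all_pairs.sort(reverse=True)
    return [i for _, i in all_pairs[:k]]
```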