
Read through the code and give questions #6

Closed · 7 tasks done
DavidCarlyn opened this issue Jun 11, 2024 · 5 comments
Labels: good first issue (Good for newcomers) · research (Research new technology / approach)

DavidCarlyn commented Jun 11, 2024

DavidCarlyn added the good first issue and research labels on Jun 11, 2024
DavidCarlyn added this to the ML Summer Project milestone on Jun 11, 2024

liu9756 commented Jun 14, 2024

Training: For the training section in src/gtp/train_whole_genome.py, I have a couple of questions. First, in the step where you initialize the variables that accumulate the Root Mean Square Error (RMSE) for the current epoch, I see that you use Pearson's correlation coefficient to measure accuracy, and I'd like to ask how you deal with the potential influence of outliers in the data. Are we pre-filtering the genetic data before training the model, or do we keep all the genes in training to preserve the integrity of the data?
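To make sure I'm reading the loop correctly, here is a minimal sketch of what I understand the epoch-level bookkeeping to be doing; the names and shapes are my own placeholders, not the actual code in train_whole_genome.py:

```python
import torch

def evaluate_epoch(model, loader):
    """Accumulate squared error over an epoch, then report RMSE and Pearson r."""
    model.eval()
    sq_err_sum, n = 0.0, 0
    preds, targets = [], []
    with torch.no_grad():
        for x, y in loader:
            out = model(x).squeeze(-1)
            sq_err_sum += torch.sum((out - y) ** 2).item()
            n += y.numel()
            preds.append(out)
            targets.append(y)
    rmse = (sq_err_sum / n) ** 0.5
    p, t = torch.cat(preds), torch.cat(targets)
    # Pearson correlation between predictions and ground-truth phenotypes
    pearson = torch.corrcoef(torch.stack([p, t]))[0, 1].item()
    return rmse, pearson
```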

Evaluation: In the evaluation section, I noticed that three functions are used: get_shapley_sampling_attr, get_guided_gradcam_attr, and get_saliency_attr. Based on my understanding, get_shapley_sampling_attr uses Shapley Value Sampling to calculate the impact of each feature in the input data on the model: it iterates through each batch in the DataLoader, calculates the Shapley value of each feature, and sums them to get the total impact. The get_guided_gradcam_attr function uses the Guided Grad-CAM method to calculate the model's sensitivity to each pixel in the input data: it iterates through each batch in the DataLoader, calculates gradient information for each pixel, and sums it to get the total impact. The get_saliency_attr function uses the Saliency method to evaluate pixel impact as well. My question: since get_guided_gradcam_attr and get_saliency_attr are both used to evaluate pixels, do they have different evaluation dimensions? Do their evaluation results show some kind of linear correlation?
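For concreteness, here is how I understand the three Captum methods to be invoked; the toy model, layer, and shapes below are placeholders of my own, not the repository's:

```python
import torch
import torch.nn as nn
from captum.attr import GuidedGradCam, Saliency, ShapleyValueSampling

# Toy single-output regressor standing in for the real model (hypothetical shapes).
model = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4 * 32, 1),
)
model.eval()
inputs = torch.randn(8, 1, 32)  # a batch of 8 samples, 32 positions each

# Saliency: gradient of the output w.r.t. every input position.
saliency_attr = Saliency(model).attribute(inputs, target=0)

# Guided Grad-CAM: also needs the conv layer whose activations it uses.
gradcam_attr = GuidedGradCam(model, model[0]).attribute(inputs, target=0)

# Shapley Value Sampling: perturbation-based and far slower;
# n_samples trades runtime against estimate variance.
shapley_attr = ShapleyValueSampling(model).attribute(inputs, target=0, n_samples=5)

# Summing over the batch gives the per-position "total impact" described above.
total_impact = saliency_attr.sum(dim=0)
```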

Process: Is the convert_bytes(num) function applied to standardize the format? And what is the purpose of the data stored in futures? It seems to me that futures stores the results of the tasks submitted by the ThreadPoolExecutor, so is it the case that data not stored in futures has no results to be processed?
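If I'm guessing right, convert_bytes follows the conventional human-readable byte formatter pattern; here is a sketch of that convention (my own guess, not the repository's implementation):

```python
def convert_bytes(num: float) -> str:
    """Render a raw byte count as a human-readable string, e.g. 1536 -> '1.5 KB'."""
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if num < 1024.0:
            return f"{num:.1f} {unit}"
        num /= 1024.0
    return f"{num:.1f} PB"
```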

kanishkkov commented

[training]: A question I have about the training code concerns model saving. When does the model save on loss, and when would it save on Pearson correlation? Also, do you think it would be helpful to learn about the structure of the SoyBeanNet model and how it works?

[evaluation]: The whole idea of the evaluation code is something I am not experienced with, so I would like to know if my interpretation of the code is correct. The get_attribution_points function uses occlusion to compute attributions: certain parts of the input data are masked to see how the model's output changes, which reveals the most important parts of the input (I've sketched my understanding below). The get_shapley_sampling_attr function uses Shapley value sampling; I am unfamiliar with this and would like to know how the sampling works. The Guided Grad-CAM method is used in both get_guided_gradcam_attr and get_guided_gradcam_attr_test; what is the difference between these two functions? The get_saliency_attr function computes attributions using the saliency method; again, I am not very familiar with how this method works. Also, what does the attribution graph look like?
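To check that interpretation, here is a toy sketch of how I understand Captum's Occlusion to work; the model and shapes are my own placeholders, not the repository's get_attribution_points:

```python
import torch
import torch.nn as nn
from captum.attr import Occlusion

# Toy stand-in for the trained regressor (hypothetical shapes).
model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 32, 1))
model.eval()
inputs = torch.randn(4, 1, 32)

attr = Occlusion(model).attribute(
    inputs,
    sliding_window_shapes=(1, 8),  # mask 8 consecutive positions at a time
    strides=(1, 4),                # slide the mask 4 positions per step
    baselines=0,                   # masked regions are replaced with zeros
    target=0,                      # the single regression output
)
# Positions whose masking changes the prediction most get the largest |attribution|.
print(attr.shape)  # same shape as inputs: (4, 1, 32)
```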

[preprocessing]: I do not have too many questions, but am I right to assume that the code in run_pipeline.ipynb processes phenotype data while run_pipeline.ipynb processes genotype data?

@DavidCarlyn


DavidCarlyn commented Jun 19, 2024

> Quoting liu9756's questions from Jun 14 (see above).

Great questions!

  1. All data is kept; there is no filtering of the data. It may be something to discuss in the future, but currently we don't filter.
  2. As for the difference between the saliency methods, they vary in where in the model they capture the signal and how they aggregate it. I encourage you to read more about them here: https://captum.ai/api/attribution.html
  3. Due to the size of the data, I implemented a multithreading approach to preprocessing. Futures are a way to say "I will return a value eventually": since I'm launching multiple instances of the same code, they won't all be ready when I initially launch them (see the sketch after this list). I may not have understood your question, so feel free to ping me if you would like more information.
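To illustrate the pattern in point 3 (the worker name and inputs here are hypothetical, not our actual preprocessing functions):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def preprocess_file(path):
    """Hypothetical worker: stands in for parsing/converting one genotype file."""
    return path, len(path)

paths = ["chr1.vcf", "chr2.vcf", "chr3.vcf"]  # placeholder inputs

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future immediately: "I will return a value eventually."
    futures = [executor.submit(preprocess_file, p) for p in paths]
    # as_completed() yields each Future as its task finishes, in completion order.
    for future in as_completed(futures):
        path, result = future.result()  # blocks only until this task's value is ready
        print(path, result)
```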

DavidCarlyn commented

> Quoting kanishkkov's questions (see above).

Great questions!

  1. The model was originally saved via the lowest loss, but I switched to the Pearson correlation coefficient since I believed it was a better signal than the loss (a sketch of that checkpointing logic follows this list). More about SoyBeanNet can be seen in the code, or via this paper: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.01091/full
  2. Many of the differences can be found at the Captum API link: https://captum.ai/api/attribution.html. Some of my differences are in how I aggregate the values across samples, plus other small variations such as taking the mean, median, max, or min across samples. Attribution is commonly done either by perturbation (masking, adding noise, shuffling, etc.) while observing the change in the model output/loss, or by saliency-based methods, which look at the model activations or gradients when the input is passed through the model.
  3. I had intended run_pipeline.ipynb to do all the preprocessing before training and evaluation.
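A sketch of that checkpointing logic, with hypothetical helper names (evaluate_epoch could be the sketch posted earlier in this thread):

```python
import torch

def fit(model, train_one_epoch, evaluate_epoch, num_epochs, ckpt="best_model.pt"):
    """Train, checkpointing whenever validation Pearson r improves (sketch only)."""
    best_pearson = float("-inf")
    for epoch in range(num_epochs):
        train_one_epoch(model)                 # hypothetical per-epoch training step
        rmse, pearson = evaluate_epoch(model)  # returns validation RMSE and Pearson r
        if pearson > best_pearson:             # save on best correlation, not lowest loss
            best_pearson = pearson
            torch.save(model.state_dict(), ckpt)
            print(f"epoch {epoch}: new best r={pearson:.3f} (rmse={rmse:.3f})")
```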
