FloatingPointError in select_significant_rules() #73

Open
Myrtle-bio opened this issue Nov 30, 2023 · 1 comment

Comments

@Myrtle-bio

Hi!

Whenever I run C.select_significant_rules(), with either my own data or the example data you provide, I get the following error messages:

C.select_significant_rules()
INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
ERROR: Exception (FloatingPointError) occurred while fitting data to 'lognorm' distribution; skipping this distribution. Error message was: invalid value encountered in log 
ERROR: Exception (FloatingPointError) occurred while fitting data to 'truncnorm' distribution; skipping this distribution. Error message was: underflow encountered in exp 
INFO: Creating subset of TFBS and rules using thresholds
<CombObj: 83705 TFBS (86 unique names) | Market basket analysis: 282 rules>

However, the function still produces a result plot similar to the one in your example. I would like to understand why this happens, and I would be very grateful for your help.

image

@vheger
Collaborator

vheger commented Nov 30, 2023

Hi @Myrtle-bio,

thank you for your interest in TF-COMB and your question.

Since you did not pass a fixed threshold, .select_significant_rules() by default tries to find an optimal threshold with .get_threshold(), as indicated in the output:

INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold

The threshold is estimated by fitting different distributions to the data. We use the following distributions (to learn more about them, please check the scipy.stats documentation):

scipy.stats.norm
scipy.stats.lognorm
scipy.stats.laplace
scipy.stats.expon
scipy.stats.truncnorm
scipy.stats.truncexpon
scipy.stats.wald
scipy.stats.weibull_min

After fitting all distributions, we take the best-fitting one and use its percent point function (.ppf(), the inverse of the cumulative distribution function) to get the threshold value.
[Note: the default is the upper 5%, meaning that 5% of the data lies above the selected threshold under the selected distribution. You can change the percentage with the parameters x_threshold_percent and y_threshold_percent, respectively.]
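
To make the "upper 5%" rule concrete, here is a minimal sketch that uses only scipy.stats.norm. It is not the TF-COMB implementation: the function name `upper_tail_threshold` and its `percent` argument are hypothetical and merely mirror the role of x_threshold_percent / y_threshold_percent, and a normal distribution is assumed instead of the best-fitting one.

```python
import numpy as np
from scipy import stats


def upper_tail_threshold(values, percent=0.05):
    """Return the value that a fitted normal distribution places `percent` of its mass above.

    Hypothetical helper for illustration only; TF-COMB picks the
    best-fitting distribution instead of always using the normal one.
    """
    values = np.asarray(values, dtype=float)
    # Maximum-likelihood estimate of the normal parameters (mean and standard deviation).
    loc, scale = stats.norm.fit(values)
    # ppf is the inverse CDF: ppf(0.95) is the point with 5% of the mass above it.
    return stats.norm.ppf(1.0 - percent, loc=loc, scale=scale)


# Usage on synthetic data (the numbers are made up for the example).
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=10_000)
threshold = upper_tail_threshold(data, percent=0.05)
fraction_above = float(np.mean(data > threshold))
```

If the fitted distribution describes the data well, `fraction_above` should be close to the requested percentage.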

However, errors may occur during the fitting step. They can be caused by the data itself or by an old scipy version.
The first error message you encountered:

ERROR: Exception (FloatingPointError) occurred while fitting data to 'lognorm' distribution; skipping this distribution. Error message was: invalid value encountered in log 

means that log() was applied to invalid values (for example, negative numbers or NaN) while fitting the lognorm distribution.
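
The message itself comes from NumPy. By default NumPy only warns about such operations; a FloatingPointError is raised when floating-point error handling is set to "raise". The sketch below reproduces the message under that assumption; `log_error_message` is a hypothetical helper, not part of TF-COMB or scipy.

```python
import numpy as np


def log_error_message(values):
    """Return NumPy's error message for log(values), or None if log() succeeds.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(values, dtype=float)
    # Turn floating-point warnings into exceptions for this block only.
    with np.errstate(all="raise"):
        try:
            np.log(data)
        except FloatingPointError as err:
            return str(err)
    return None


negative_case = log_error_message([-1.0, 2.0])  # log of a negative number is invalid
positive_case = log_error_message([1.0, 2.0])   # strictly positive input is fine
```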

The second message:

ERROR: Exception (FloatingPointError) occurred while fitting data to 'truncnorm' distribution; skipping this distribution. Error message was: underflow encountered in exp 

indicates that an intermediate result of exp() became too small to be represented as a floating-point number (it underflowed) during the fit of the truncnorm distribution.
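
A small sketch of what underflow looks like, assuming NumPy is the library doing the arithmetic: exp(-1000) is far below the smallest representable double, so the result collapses to 0.0. With underflow handling set to "raise", NumPy may turn this into a FloatingPointError; whether the underflow is actually reported can depend on the NumPy build and platform, so the code allows for both outcomes. `exp_underflow_report` is a hypothetical helper.

```python
import numpy as np


def exp_underflow_report(x):
    """Return (result with underflow ignored, error message or None when underflow raises).

    Hypothetical helper for illustration only.
    """
    values = np.asarray([x], dtype=float)
    # With underflow ignored, the result is silently rounded to zero.
    with np.errstate(under="ignore"):
        silent = float(np.exp(values)[0])
    # With underflow set to "raise", the same call may raise FloatingPointError.
    try:
        with np.errstate(under="raise"):
            np.exp(values)
        message = None
    except FloatingPointError as err:
        message = str(err)
    return silent, message


silent_result, error_message = exp_underflow_report(-1000.0)
```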

In your case, only the two reported distributions (lognorm and truncnorm) are affected, so the other six distributions were still tested for the best fit.
Hence, the function was able to infer the thresholds from the best-fitting of the remaining six distributions, resulting in a cosine threshold slightly above 0.3 and a Z-score threshold of around 10, if I read your screenshot correctly.
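
The skip-and-continue behavior can be sketched as follows. This is an illustration under stated assumptions, not the TF-COMB code: `fit_best` and `_AlwaysFails` are hypothetical names, only three scipy distributions are tried, and the Kolmogorov-Smirnov statistic is used as the goodness-of-fit measure, which may differ from the criterion TF-COMB applies.

```python
import numpy as np
from scipy import stats


class _AlwaysFails:
    """Hypothetical stand-in for a distribution whose fit raises an error."""

    def fit(self, data):
        raise FloatingPointError("invalid value encountered in log")


def fit_best(values, candidates):
    """Fit each candidate, skip the ones that fail, and return the best of the rest.

    `candidates` maps a name to an object with scipy-style fit() and cdf() methods.
    Returns (best name, fitted parameters, list of skipped names).
    """
    values = np.asarray(values, dtype=float)
    fitted, skipped = {}, []
    for name, dist in candidates.items():
        try:
            params = dist.fit(values)
            # Smaller KS statistic = fitted CDF is closer to the empirical CDF.
            score = stats.kstest(values, dist.cdf, args=params).statistic
        except Exception:
            skipped.append(name)  # report and move on to the next distribution
            continue
        fitted[name] = (score, params)
    if not fitted:
        raise RuntimeError("no distribution could be fitted")
    best = min(fitted, key=lambda n: fitted[n][0])
    return best, fitted[best][1], skipped


rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)
candidates = {
    "always_fails": _AlwaysFails(),
    "norm": stats.norm,
    "laplace": stats.laplace,
    "expon": stats.expon,
}
best_name, best_params, skipped_names = fit_best(sample, candidates)
```

A failed fit only removes that one candidate from the comparison; a threshold can still be derived as long as at least one distribution was fitted.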

I hope this answers your question. If you have further questions, please let me know and I will be happy to assist.
