FloatingPointError in select_significant_rules() #73

Myrtle-bio · 2023-11-30T03:00:51Z

Hi!

I've noticed that when I use either my own data or the example data provided by you, every time I execute C.select_significant_rules(), I encounter the following error messages:

C.select_significant_rules()
INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold
ERROR: Exception (FloatingPointError) occurred while fitting data to 'lognorm' distribution; skipping this distribution. Error message was: invalid value encountered in log 
ERROR: Exception (FloatingPointError) occurred while fitting data to 'truncnorm' distribution; skipping this distribution. Error message was: underflow encountered in exp 
INFO: Creating subset of TFBS and rules using thresholds
<CombObj: 83705 TFBS (86 unique names) | Market basket analysis: 282 rules>

However, the output still generates a result graph similar to the example provided by you. I am curious to know the reason for this. I would be very grateful if you could help me understand this issue.

vheger · 2023-11-30T09:13:38Z

Hi @Myrtle-bio,

thank you for your interest in TF-COMB and your question.

Since you did not set fixed thresholds, .select_significant_rules() falls back to its default behavior and tries to find optimal thresholds with .get_threshold(), as indicated in the output:

INFO: x_threshold is None; trying to calculate optimal threshold
INFO: y_threshold is None; trying to calculate optimal threshold

The threshold is estimated by fitting different distributions to the data. We use the following distributions (to learn more about them, please check the scipy.stats documentation):

scipy.stats.norm, 
scipy.stats.lognorm, 
scipy.stats.laplace,
scipy.stats.expon, 
scipy.stats.truncnorm, 
scipy.stats.truncexpon,
scipy.stats.wald, 
scipy.stats.weibull_min

After fitting all distributions, we take the best-fitting one and use its percent point function (.ppf()) to get the threshold value.
[Note: the default is the upper 5%, meaning that 5% of the data lies above the selected threshold under the selected distribution. You can change the percentage with the parameters x_threshold_percent and y_threshold_percent, respectively.]
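
To illustrate the idea described above, here is a minimal sketch, not the actual TF-COMB code. The helper name estimate_threshold, the reduced list of distributions, and the use of the Kolmogorov-Smirnov statistic to pick the "best" fit are assumptions made for this example; the criterion used inside .get_threshold() may differ.

```python
import numpy as np
from scipy import stats

def estimate_threshold(values, percent=0.05, names=("norm", "laplace", "expon")):
    """Hypothetical helper: fit several distributions, keep the best one,
    and return the value that `percent` of that distribution lies above."""
    best_name, best_stat, best_params = None, np.inf, None
    for name in names:
        dist = getattr(stats, name)
        try:
            params = dist.fit(values)  # maximum-likelihood fit
            # Assumption: a smaller KS statistic means a better fit
            stat = stats.kstest(values, dist.cdf, args=params).statistic
        except Exception:
            continue  # skip distributions that fail to fit
        if stat < best_stat:
            best_name, best_stat, best_params = name, stat, params
    if best_name is None:
        raise ValueError("no distribution could be fitted")
    # Upper 5% corresponds to the 95th percentile of the fitted distribution
    threshold = getattr(stats, best_name).ppf(1 - percent, *best_params)
    return best_name, float(threshold)

# Synthetic example data, only for demonstration
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=2000)
name, threshold = estimate_threshold(data)
print(name, threshold, np.mean(data > threshold))
```

In this sketch, a distribution that raises an exception during fitting is simply skipped, and the threshold is taken from the best of the remaining ones.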

However, errors may occur during the fitting step. They can be caused by the data itself or by an old scipy version.
The first error message you encountered:

ERROR: Exception (FloatingPointError) occurred while fitting data to 'lognorm' distribution; skipping this distribution. Error message was: invalid value encountered in log

means that log() was evaluated on an invalid value (for example, a negative number) while fitting the lognorm distribution.

The second message:

ERROR: Exception (FloatingPointError) occurred while fitting data to 'truncnorm' distribution; skipping this distribution. Error message was: underflow encountered in exp

indicates that exp() produced a value too close to zero to be represented as a floating-point number (an underflow) during the fit of the truncnorm distribution.
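
Both messages can be reproduced with plain NumPy, independent of TF-COMB. The following sketch assumes that floating-point warnings are configured to raise exceptions (here via np.errstate); whether TF-COMB sets this up in exactly the same way is an assumption. The helper name describe_failure is made up for this example.

```python
import numpy as np

def describe_failure(func, value):
    """Hypothetical helper: call func(value) with floating-point warnings
    turned into exceptions and return the error message, if any."""
    with np.errstate(all="raise"):
        try:
            func(np.float64(value))
        except FloatingPointError as err:
            return str(err)
    return None

# log() of a negative number is undefined for real numbers
log_msg = describe_failure(np.log, -1.0)
# exp() of a very negative number is too small to represent as a float
exp_msg = describe_failure(np.exp, -1000.0)
print(log_msg)
print(exp_msg)
```

With NumPy's default settings, the same operations would only emit a warning (or pass silently) and return nan or 0.0 instead of raising.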

In your case, only the two reported distributions (lognorm and truncnorm) are affected, so the other 6 distributions were still tested for the best fit.
Hence, the function was able to infer the thresholds from the best-fitting distribution among the remaining 6, resulting in a cosine threshold slightly above 0.3 and a Z-score threshold around 10, if I read your screenshot correctly.

I hope this answers your question. If you have further questions, please let me know and I'm happy to assist.
