
the core ontology tree browser is slow #77

Open
bradfordcondon opened this issue Jan 29, 2019 · 6 comments

bradfordcondon commented Jan 29, 2019

This is really a core issue; we'll get there eventually.

I was reviewing a PR that used the old Tripal 2 tripal_cv ontology browser: tripal/tripal_analysis_go#12

It was fast. Really fast.

Our module is slow, but it's slow because the new Tripal 3 core tree browser is slow.

Is there any way we can speed it up?

almasaeed2010 commented Jan 30, 2019

I've been inspecting the code, and here are my observations so far on what could be slowing us down:

  • Use of chado_generate_var('cvterm', $match); for every term in tripal_get_vocabulary_root_terms
  • In tripal_cv_xray_lookup_entities_for_terms_count, we get the counts per term by running the query once per term, when we could use a GROUP BY and run the query only once for all terms
  • After the term counts are obtained as outlined above, we then call tripal_chado_vocab_get_term_children, which again hits the DB twice for every term.

@almasaeed2010

OK, after testing each function individually, it looks like tripal_cv_xray_lookup_entities_for_terms_count takes the longest to return results.

@almasaeed2010

A few trials changing how the query is structured:

Original

This is run once for every accession. In this case, multiply by 3 since there are three root GO terms. Effectively we are looking at 9,379.218 ms for only 3 terms!

SELECT COUNT(TCEL.entity_id)
FROM tripal_cvterm_entity_linker TCEL
INNER JOIN chado_bio_data_7 CB ON CB.entity_id = TCEL.entity_id
INNER JOIN chado.feature CF ON CF.feature_id = CB.record_id
WHERE CF.organism_id = 46 AND TCEL.database = 'GO' AND accession = '0003674';
 count 
-------
 15156
(1 row)

Time: 3126.406 ms
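
As an aside (not something tried in this thread), one way to see where the time goes inside a query like this is PostgreSQL's EXPLAIN ANALYZE, which executes the query and annotates each plan node with actual timings:

```sql
-- Runs the query for real and reports actual row counts and time per plan
-- node; useful for spotting missing indexes on the join/filter columns.
EXPLAIN ANALYZE
SELECT COUNT(TCEL.entity_id)
FROM tripal_cvterm_entity_linker TCEL
INNER JOIN chado_bio_data_7 CB ON CB.entity_id = TCEL.entity_id
INNER JOIN chado.feature CF ON CF.feature_id = CB.record_id
WHERE CF.organism_id = 46 AND TCEL.database = 'GO' AND accession = '0003674';
```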

The fastest

This is run once for all given accessions so the time shown below is final.

SELECT TCEL.database, TCEL.accession, COUNT(TCEL.entity_id)
FROM tripal_cvterm_entity_linker TCEL
INNER JOIN chado_bio_data_7 CB ON CB.entity_id = TCEL.entity_id
INNER JOIN chado.feature CF ON CF.feature_id = CB.record_id
WHERE CF.organism_id = 46 AND TCEL.database = 'GO'
      AND accession IN ('0008150', '0003674', '0005575')
GROUP BY TCEL.database, TCEL.accession;
 database | accession | count 
----------+-----------+-------
 GO       | 0003674   | 15156
 GO       | 0008150   | 13532
 GO       | 0005575   |  4337
(3 rows)

Time: 2969.810 ms

This means we manage to be about 3 times faster if we switch to eager loading.

The Drawback

In-memory usage increases heavily with this approach, so we need to be careful about how many accessions we process at any given time. That said, we are already on the safe side since we never process over 25 accessions at a time.

@almasaeed2010

I am going to implement this change and see how that affects our dev site.

@almasaeed2010

Dev Stats

Stats are obtained for Fraxinus excelsior

Pre-eager loading

16.46 s


Post-eager loading

12.97 s


Well, that's disappointing 😞 Not nearly enough speedup, but it's some progress.

almasaeed2010 commented Jan 30, 2019

OK, I think the answer lies in an mview (materialized view) that acts as a cache of counts for each entity:

The mview should look something like this

 entity_id | database | accession | count 
-----------+----------+-----------+-------
        46 | GO       | 0003674   | 15156
        46 | GO       | 0008150   | 13532
        46 | GO       | 0005575   |  4337

If we do implement this mview, we need to populate it at the end of every indexing job, which should be easy enough to do programmatically.
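
A rough sketch of what that mview could look like in PostgreSQL DDL (the view name is hypothetical; the table names come from the queries earlier in the thread, and chado_bio_data_7 is a single bundle table, so a real implementation would need to cover every bundle; the 46 in the example rows above matches organism_id = 46 in the earlier queries, so this sketch groups by organism):

```sql
-- Sketch only: tripal_cv_xray_term_counts is an assumed name, and a real
-- implementation would need to union in every chado_bio_data_N bundle table.
CREATE MATERIALIZED VIEW tripal_cv_xray_term_counts AS
SELECT CF.organism_id,
       TCEL.database,
       TCEL.accession,
       COUNT(TCEL.entity_id) AS count
FROM tripal_cvterm_entity_linker TCEL
INNER JOIN chado_bio_data_7 CB ON CB.entity_id = TCEL.entity_id
INNER JOIN chado.feature CF ON CF.feature_id = CB.record_id
GROUP BY CF.organism_id, TCEL.database, TCEL.accession;

-- Repopulate at the end of every indexing job:
REFRESH MATERIALIZED VIEW tripal_cv_xray_term_counts;
```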
