Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simultaneous query of multiple nodes #55

Open
zietzm opened this issue Jul 6, 2017 · 5 comments
Open

Simultaneous query of multiple nodes #55

zietzm opened this issue Jul 6, 2017 · 5 comments

Comments

@zietzm
Copy link
Collaborator

zietzm commented Jul 6, 2017

Add functionality in hetmech to query a set of nodes in order to return a ranked list of connection predictions along with the corresponding metapaths.

For example, if a set of genes were queried, we would want output of the form:

Predictions

Rank End Node Metapath DWPC p-DWPC r-DWPC
1 Compound A CbGpPpG ... ... ...
2 Disease B DtCcCcCbG ... ... ...
3 Anatomy C AlDtC<rG ... ... ...
... ... ... ... ... ...

In order to reduce computation time, it may be useful to cache DWPC matrices so that a set of query nodes (given as a vector) can be queried almost instantly (order ~ 100 microseconds).

It should be noted that after work done in #54 and #43 to add sparse matrix functionality, the computation time for DWPC over all 752 compatible Rephetio metapaths has been reduced from a total of 6.5 hours to 48 minutes! In fact, the longest computation time for DWPC over a single metapath is now around 35 seconds, while the average time is about 3.9 seconds (see below).

Caching the DWPC matrices for the 752 Rephetio metapaths using scipy.io.savemat saves 752 sparse matrices as a .mat file. The file size for all these matrices is 461 MB.


The histogram below shows the distribution of DWPC times over the 752 metapaths.
sparse_rephetio_v2

@zietzm
Copy link
Collaborator Author

zietzm commented Jul 7, 2017

Numpy has built-in functionality for using multiple processor cores when performing the multiplications involved in DWPC. Scipy, however, does not natively support multi-processing for sparse matrix multiplication.

Following a suggestion by @dhimmel, I used a multi-core approach with sparse matrices to re-run the 752 hetmech-compatible Rephetio metapaths.

This cut the computation time down even more, to a total of just under 20 minutes.

The individual metapath computation times did not significantly change, and the maximum path computation time was 31 seconds (see below).


multicore_dwpc_times


Following yet another suggestion by @dhimmel, I used the concurrent.futures.ProcessPoolExecutor to run all paths in around 14 minutes. This method did show an increase in the maximum DWPC time to 70 seconds, but the distribution looks very promising (see below). Of the 752 paths, only 9 took longer than 32 seconds to run, and the vast majority (644/752) took less than five seconds.

I also attempted to use the concurrent.futures.ThreadPoolExecutor as a way to conserve memory while running, but this method actually at times used just as much memory, and was considerably slower than ProcessPoolExecutor.

concurrent_futures

@kkloste
Copy link
Collaborator

kkloste commented Jul 10, 2017

Nice work!
Good to see our "efficient algorithm prototype" tested in a more high throughput setting with some positive results.

I still think we can bring down the runtime on the longest-running instances by using optimal matrix-chain-ordering.

@zietzm
Copy link
Collaborator Author

zietzm commented May 9, 2018

DWPC P/SD-DWPC Z-DWPC Information
0 0 NaN No information either way
Nonzero 0 Inf Some evidence of a connection. Need more info
0 Nonzero 0 Some evidence of no connection
Nonzero Nonzero Nonzero Evidence depends on Z-DWPC

@dhimmel
Copy link
Collaborator

dhimmel commented May 10, 2018

We can probably set Z-DWPC in the first row (where all DPWCs are zero) to 0. The Inf situations will be harder to deal with because they are potentially of interest, but it will be difficult to rank them.

@zietzm
Copy link
Collaborator Author

zietzm commented May 16, 2018

p-dwpc-heatmap

Regarding the interpolation of remaining Z-DWPC infinities, I'm not sure we could interpolate based on nearby source and target degrees. We could use the maximum P-DWPC, and maybe the average SD-DWPC, in which case we may be overly conservative. Alternatively, we could do mean P-DWPC as well as mean SD-DWPC, in which case the results seem less conservative but still within the range of normal possibility.

For reference, here are the DWPC values for those paths having Z-DWPC = Infinity:
image

Here are resulting Z-DWPCs for those paths having Z-DWPC = Infinity originally:

  • P-DWPC mean, SD-DWPC mean
    image

  • P-DWPC max, SD-DWPC mean
    image

  • P-DWPC max, SD-DWPC max
    image

  • P-DWPC max, SD-DWPC median
    image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants