Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change np.hstack to np.concatenate #207

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

EllaNguyen1711
Copy link

This change allows loading trajectory assignment arrays of different shapes. Previously an error would be thrown by np.hstack if assignment arrays of different shapes were used, which is common when using folding@home data sets (for example).

@justinrporter
Copy link
Collaborator

justinrporter commented Sep 15, 2021

Hi! Thanks for taking the time to make a contribution!

We use enspara on F@H datasets all the time, so I'm surprised you ran in to trouble with this function.

I'm also surprised that changing it from np.hstack to np.concatenate(..., axis=1) fixed it, since (I thought, at least) those are the same!

Could you post a minimal example of the problem you were having?

@EllaNguyen1711
Copy link
Author

EllaNguyen1711 commented Sep 15, 2021 via email

@justinrporter
Copy link
Collaborator

justinrporter commented Sep 15, 2021

Sorry, I'm still having some trouble figuring out your situation. Because you responded with email, your image didn't make it into your post and GH censored the F@H dataset as "@.***" (thinking that it was an email address, I guess?).

I had to delete a part of the code used for checking whether the input data had the shape 1d

This change is not in your PR.

More importantly, I find it very strange that your data triggered the condition if len(assigns.shape) == 1. The only time you assignments could reasonably be 1d (i.e. shape == (n_frames,) rather than shape == (n_trjs, n_frames)) is if you have only one trajectory.

I strongly suspect there is some issue with how you've loaded the data. Perhaps you have loaded your data concatenated, so that each trajectory is appended end-to-end with the last? If so, this is unlikely to be what you want, since it will create artifactual transitions between the last frames of trajectory n and the first few frames of trajectory n+1.

Cheers!

@EllaNguyen1711
Copy link
Author

Thanks again for your reply. Maybe posting the error will help clarify the issue:

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
100%|█████████████████████████████████████████████████████████████████████████████| 1301/1301 [00:02<00:00, 620.61it/s]
MSM_building_05.py:85: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  cluster_assig = np.array(cluster_assig)
MSM_building_05.py:89: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  assig = np.array(assig)
(25500, 1)
MSM is being built ...
Traceback (most recent call last):
  File "MSM_building_05.py", line 92, in <module>
    tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
  File "MSM_building_05.py", line 61, in assigns_to_counts1
    'The given assignments array has 1-dimensional shape %s. '
enspara.exception.DataInvalid: The given assignments array has 1-dimensional shape %s. Two dimensional shapes = (n_trj, n_frames) are expected. If this is really what you want, try using assignments.reshape(1, -1) to create a single-row 2d array.

I believe this happens due to the length of each trajectory in the array having a different number of frames. Perhaps you are right that I am loading the data in a way the program was not expecting, but, as far as I can tell, each trajectory has different lengths and that this should be the correct way to load it (for the reason you stated about having artifactual transitions). The numpy array, therefore, shows a shape of (n_frames, ) due to the second dimension having different shapes for each item.

Thanks for looking into this.

@justinrporter
Copy link
Collaborator

What code are you running that reaches this error? Can you post the contents of MSM_building_05.py as well?

@EllaNguyen1711
Copy link
Author

EllaNguyen1711 commented Sep 15, 2021

import mdtraj as md
import pyemma.coordinates as coor
import numpy as np
import pickle
import pyemma
import numbers
import os
import enspara
import h5py
import pandas as pd
import numpy as np
import logging
import scipy
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse.csgraph import connected_components
from enspara import exception
from enspara.msm import MSM, builders, transition_matrices
from enspara.msm.transition_matrices import _transitions_helper
import pyemma.plots as pyemma_plots
from pyemma.util.contexts import settings

def assigns_to_counts1(
        assigns, lag_time, max_n_states=None, sliding_window=True):
    """Count transitions between states in a single trajectory.
    Parameters
    ----------
    assigns : array, shape=(traj_len, )
        A 2-D array where each row is a trajectory consisting of a
        sequence of state indices.
    lag_time : int
        The lag time (i.e. observation interval) for counting
        transitions.
    max_n_states : int, default=None
        The number of states. This is useful for controlling the
        dimensions of the transition count matrix in cases where the
        input trajectory does not necessarily visit every state.
    sliding_window : bool, default=True
        Whether to use a sliding window for counting transitions or to
        take every lag_time'th state.
    Returns
    -------
    C :  array, shape=(n_states, n_states)
        A transition count matrix.
    """

    if not isinstance(lag_time, numbers.Integral):
        raise exception.DataInvalid(
            "The lag time must be an integer. Got %s type %s." %
            lag_time, type(lag_time))
    if lag_time < 1:
        raise exception.DataInvalid(
            "Lag times must be be strictly greater than 0. Got '%s'." %
            lag_time)       
    # if it's 1d, later stuff will fail
    if len(assigns.shape) == 1:
        raise exception.DataInvalid(
            'The given assignments array has 1-dimensional shape %s. '
            'Two dimensional shapes = (n_trj, n_frames) are expected. '
            'If this is really what you want, try using '
            'assignments.reshape(1, -1) to create a single-row 2d array.')
    assigns = np.array([a[np.where(a != -1)] for a in assigns])

    if max_n_states is None:
        max_n_states = np.concatenate(assigns).max() + 1

    transitions = [
        _transitions_helper(
            assign, lag_time=lag_time, sliding_window=sliding_window)
        for assign in assigns]
    # generate sparse matrix
    mat_coords = np.hstack(transitions)
    mat_data = np.ones(mat_coords.shape[1], dtype=int)
    C = scipy.sparse.coo_matrix(
        (mat_data, mat_coords), shape=(max_n_states, max_n_states))
    return C
#Building MSM
msm_lag = 8 
cluster_numbers = 5000
cluster_assig = coor.load('clustering/chi1_2_5000_trajs_n_stride_5.h5')
cluster_assig = np.array(cluster_assig)
assig = []
for frame in cluster_assig:
	assig.append(frame.astype(int))
assig = np.array(assig)
print (assig[0].shape)
print ('MSM is being built ...')
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
prior_counts = 1/tcounts.shape[0]
tcounts = builders._apply_prior_counts(tcounts, prior_counts)
probs = builders._row_normalize(tcounts)
eq_probs_ = builders.eq_probs(probs)

print ('transition maxtrix: ', tcounts)
print ('transition probabilities: ', probs)
print ('equilibrium probabilities: ', eq_probs_)

np.save (f'MSM/tcounts_{cluster_numbers}_{msm_lag}.npy', tcounts)
np.save (f'MSM/tprobs_{cluster_numbers}_{msm_lag}.npy', probs)
np.save (f'MSM/populations_{cluster_numbers}_{msm_lag}.npy', eq_probs_)

print('MSM was built successfully')

Here you are!

@justinrporter
Copy link
Collaborator

Without sitting down with this code and it's inputs and debugging more carefully, I'm not sure what's going on.

Needless to say, this isn't how we intended the code to be used. It's just a convention, but mostly you shouldn't need to use the methods with names beginning with _.

Building an MSM from your assignments should be as easy as shown here!

@justinrporter
Copy link
Collaborator

justinrporter commented Sep 16, 2021

Oh, also, if you are looking to do prior counts, check out the example here! It uses the built-in "partial" method to wrap normalize with the pertinent prior_counts value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants