Change np.hstack to np.concatenate #207

EllaNguyen1711 · 2021-09-14T16:25:52Z

This change allows loading trajectory assignment arrays of different shapes. Previously an error would be thrown by np.hstack if assignment arrays of different shapes were used, which is common when using folding@home data sets (for example).


justinrporter · 2021-09-15T14:13:12Z

Hi! Thanks for taking the time to make a contribution!

We use enspara on F@H datasets all the time, so I'm surprised you ran into trouble with this function.

I'm also surprised that changing it from np.hstack to np.concatenate(..., axis=1) fixed it, since (I thought, at least) those are the same!
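For reference, a minimal sketch of that equivalence, using made-up arrays rather than anything from enspara: for inputs with two or more dimensions, np.hstack joins along axis 1, exactly like np.concatenate(..., axis=1). Only for 1-D inputs does np.hstack fall back to joining along axis 0.

```python
import numpy as np

# Two 2-D arrays with the same number of rows but different widths.
a = np.arange(6).reshape(2, 3)
b = np.arange(6, 10).reshape(2, 2)

# For 2-D inputs the two calls produce identical results.
stacked = np.hstack([a, b])
joined = np.concatenate([a, b], axis=1)
same_2d = np.array_equal(stacked, joined)

# For 1-D inputs, hstack joins along the only axis available (axis 0).
u = np.array([0, 1, 2])
v = np.array([3, 4])
flat = np.hstack([u, v])

print(same_2d, stacked.shape, flat)
```

So for 2-D inputs, swapping one call for the other should not change behaviour.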

Could you post a minimal example of the problem you were having?

EllaNguyen1711 · 2021-09-15T15:42:07Z

Actually, changing np.hstack to np.concatenate(..., axis=1) was not enough; I also had to delete the part of the code that checks whether the input data is 1-D (lines 149 to 154). After that, I could build an MSM for the ***@***.*** input data that I had. [image: image.png] When the dataset does not have shape (n_traj, n_frames) but only (n_traj,), because each trajectory has a different number of frames, this error is raised. Sorry, I have since found that your code already converts from 2-D to 1-D before calling np.hstack. However, it still raises an error when the input trajectories have different numbers of frames if the code on lines 149 to 154 is kept.

justinrporter · 2021-09-15T15:53:37Z

Sorry, I'm still having some trouble figuring out your situation. Because you responded with email, your image didn't make it into your post and GH censored the F@H dataset as "@.***" (thinking that it was an email address, I guess?).

I had to delete a part of the code used for checking whether the input data had the shape 1d

This change is not in your PR.

More importantly, I find it very strange that your data triggered the condition if len(assigns.shape) == 1. The only time your assignments could reasonably be 1d (i.e. shape == (n_frames,) rather than shape == (n_trjs, n_frames)) is if you have only one trajectory.

I strongly suspect there is some issue with how you've loaded the data. Perhaps you have loaded your data concatenated, so that each trajectory is appended end-to-end with the last? If so, this is unlikely to be what you want, since it will create artifactual transitions between the last frames of trajectory n and the first few frames of trajectory n+1.
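To illustrate the artifactual-transition concern, here is a small sketch with toy data. The helper count_transitions is hypothetical (it is not enspara's implementation); it simply counts sliding-window transitions at a given lag. Counting per trajectory and then summing gives a different result from counting on the end-to-end concatenation, because the latter adds a transition across the trajectory boundary.

```python
import numpy as np

def count_transitions(traj, lag_time, n_states):
    """Hypothetical helper: sliding-window transition counts for one trajectory."""
    traj = np.asarray(traj)
    counts = np.zeros((n_states, n_states), dtype=int)
    # Pair each frame with the frame lag_time steps later and tally the pairs.
    np.add.at(counts, (traj[:-lag_time], traj[lag_time:]), 1)
    return counts

n_states = 3
trajs = [np.array([0, 0, 1]), np.array([2, 2])]

# Correct: count within each trajectory, then sum the count matrices.
per_traj = sum(count_transitions(t, 1, n_states) for t in trajs)

# Problematic: count on the trajectories appended end-to-end.
end_to_end = count_transitions(np.concatenate(trajs), 1, n_states)

# The difference is the spurious 1 -> 2 transition across the boundary.
spurious = end_to_end - per_traj
print(spurious)
```

With longer lag times, more frames on either side of each boundary contribute spurious counts.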

Cheers!

EllaNguyen1711 · 2021-09-15T16:30:33Z

Thanks again for your reply. Maybe posting the error will help clarify the issue:

INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
100%|█████████████████████████████████████████████████████████████████████████████| 1301/1301 [00:02<00:00, 620.61it/s]
MSM_building_05.py:85: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  cluster_assig = np.array(cluster_assig)
MSM_building_05.py:89: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  assig = np.array(assig)
(25500, 1)
MSM is being built ...
Traceback (most recent call last):
  File "MSM_building_05.py", line 92, in <module>
    tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
  File "MSM_building_05.py", line 61, in assigns_to_counts1
    'The given assignments array has 1-dimensional shape %s. '
enspara.exception.DataInvalid: The given assignments array has 1-dimensional shape %s. Two dimensional shapes = (n_trj, n_frames) are expected. If this is really what you want, try using assignments.reshape(1, -1) to create a single-row 2d array.

I believe this happens because the trajectories in the array have different numbers of frames. Perhaps you are right that I am loading the data in a way the program was not expecting, but, as far as I can tell, the trajectories have different lengths, and this should be the correct way to load them (for the reason you stated about artifactual transitions). The numpy array therefore shows a shape of (n_frames, ) because the items along the second dimension have different lengths.
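A minimal sketch of the behaviour described here, using toy trajectories: when rows have different lengths, NumPy cannot build a rectangular 2-D array, so it produces a 1-D object array with one element per trajectory. The dtype=object argument is passed explicitly because, as the deprecation warning above says, it is required for ragged input.

```python
import numpy as np

# Three toy trajectories with different numbers of frames.
trajs = [np.array([0, 1, 2, 1]), np.array([2, 2]), np.array([1, 0, 0])]

# Ragged input cannot form a (n_trjs, n_frames) array; the result is a
# 1-D object array holding one sub-array per trajectory.
ragged = np.array(trajs, dtype=object)

# This is the condition that the 1-D shape check in the traceback tests.
is_1d = len(ragged.shape) == 1
print(ragged.shape, is_1d)
```

The length of that single axis is the number of trajectories, not the number of frames.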

Thanks for looking into this.

justinrporter · 2021-09-15T16:53:42Z

What code are you running that reaches this error? Can you post the contents of MSM_building_05.py as well?

EllaNguyen1711 · 2021-09-15T17:50:46Z

import mdtraj as md
import pyemma.coordinates as coor
import numpy as np
import pickle
import pyemma
import numbers
import os
import enspara
import h5py
import pandas as pd
import numpy as np
import logging
import scipy
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse.csgraph import connected_components
from enspara import exception
from enspara.msm import MSM, builders, transition_matrices
from enspara.msm.transition_matrices import _transitions_helper
import pyemma.plots as pyemma_plots
from pyemma.util.contexts import settings

def assigns_to_counts1(
        assigns, lag_time, max_n_states=None, sliding_window=True):
    """Count transitions between states in a single trajectory.
    Parameters
    ----------
    assigns : array, shape=(traj_len, )
        A 2-D array where each row is a trajectory consisting of a
        sequence of state indices.
    lag_time : int
        The lag time (i.e. observation interval) for counting
        transitions.
    max_n_states : int, default=None
        The number of states. This is useful for controlling the
        dimensions of the transition count matrix in cases where the
        input trajectory does not necessarily visit every state.
    sliding_window : bool, default=True
        Whether to use a sliding window for counting transitions or to
        take every lag_time'th state.
    Returns
    -------
    C :  array, shape=(n_states, n_states)
        A transition count matrix.
    """

    if not isinstance(lag_time, numbers.Integral):
        # Both format arguments must be passed together as one tuple.
        raise exception.DataInvalid(
            "The lag time must be an integer. Got %s type %s." %
            (lag_time, type(lag_time)))
    if lag_time < 1:
        raise exception.DataInvalid(
            "Lag times must be strictly greater than 0. Got '%s'." %
            lag_time)
    # if it's 1d, later stuff will fail
    if len(assigns.shape) == 1:
        raise exception.DataInvalid(
            'The given assignments array has 1-dimensional shape %s. '
            'Two dimensional shapes = (n_trj, n_frames) are expected. '
            'If this is really what you want, try using '
            'assignments.reshape(1, -1) to create a single-row 2d array.')
    assigns = np.array([a[np.where(a != -1)] for a in assigns])

    if max_n_states is None:
        max_n_states = np.concatenate(assigns).max() + 1

    transitions = [
        _transitions_helper(
            assign, lag_time=lag_time, sliding_window=sliding_window)
        for assign in assigns]
    # generate sparse matrix
    mat_coords = np.hstack(transitions)
    mat_data = np.ones(mat_coords.shape[1], dtype=int)
    C = scipy.sparse.coo_matrix(
        (mat_data, mat_coords), shape=(max_n_states, max_n_states))
    return C
#Building MSM
msm_lag = 8 
cluster_numbers = 5000
cluster_assig = coor.load('clustering/chi1_2_5000_trajs_n_stride_5.h5')
cluster_assig = np.array(cluster_assig)
assig = []
for frame in cluster_assig:
	assig.append(frame.astype(int))
assig = np.array(assig)
print (assig[0].shape)
print ('MSM is being built ...')
tcounts = assigns_to_counts1(assig, lag_time = msm_lag)
prior_counts = 1/tcounts.shape[0]
tcounts = builders._apply_prior_counts(tcounts, prior_counts)
probs = builders._row_normalize(tcounts)
eq_probs_ = builders.eq_probs(probs)

print ('transition matrix: ', tcounts)
print ('transition probabilities: ', probs)
print ('equilibrium probabilities: ', eq_probs_)

np.save (f'MSM/tcounts_{cluster_numbers}_{msm_lag}.npy', tcounts)
np.save (f'MSM/tprobs_{cluster_numbers}_{msm_lag}.npy', probs)
np.save (f'MSM/populations_{cluster_numbers}_{msm_lag}.npy', eq_probs_)

print('MSM was built successfully')

Here you are!
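One possible workaround, sketched under an assumption: the posted function filters out entries equal to -1 (a[np.where(a != -1)]), which suggests that -1 is treated as padding. If that holds, trajectories of different lengths could be padded into a rectangular (n_trj, n_frames) array so that the 1-D shape check passes. The helper pad_assignments below is hypothetical and not part of enspara.

```python
import numpy as np

def pad_assignments(trajs, fill_value=-1):
    """Hypothetical helper: pad ragged trajectories into a 2-D integer array."""
    n_trjs = len(trajs)
    max_len = max(len(t) for t in trajs)
    # Start from an array full of the padding value, then copy each row in.
    padded = np.full((n_trjs, max_len), fill_value, dtype=int)
    for i, t in enumerate(trajs):
        padded[i, :len(t)] = t
    return padded

trajs = [np.array([0, 1, 2, 1]), np.array([2, 2])]
padded = pad_assignments(trajs)

# Stripping the padding, as the posted function does, recovers the originals.
recovered = [row[row != -1] for row in padded]
print(padded.shape, recovered)
```

Whether this matches how enspara expects assignments to be loaded is not confirmed anywhere in this thread.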

justinrporter · 2021-09-16T22:33:15Z

Without sitting down with this code and its inputs and debugging more carefully, I'm not sure what's going on.

Needless to say, this isn't how we intended the code to be used. It's just a convention, but mostly you shouldn't need to use the methods with names beginning with _.

Building an MSM from your assignments should be as easy as shown here!

justinrporter · 2021-09-16T22:35:39Z

Oh, also, if you are looking to do prior counts, check out the example here! It uses the built-in "partial" method to wrap normalize with the pertinent prior_counts value.
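The linked example is not reproduced in this thread, so the following is only a sketch of the general pattern. functools.partial is the standard-library tool for fixing a keyword argument ahead of time; normalize_with_prior is a hypothetical stand-in for a normalization function that accepts a prior_counts argument, not enspara's actual function.

```python
from functools import partial

import numpy as np

def normalize_with_prior(counts, prior_counts=0.0):
    """Hypothetical stand-in: add prior counts, then row-normalize."""
    counts = np.asarray(counts, dtype=float) + prior_counts
    return counts / counts.sum(axis=1, keepdims=True)

counts = np.array([[3, 1], [0, 2]])

# Same prior as in the posted script: 1 / n_states.
prior = 1 / counts.shape[0]

# partial returns a new callable with prior_counts already filled in,
# so it can be passed anywhere a one-argument method is expected.
method = partial(normalize_with_prior, prior_counts=prior)
probs = method(counts)
print(probs)
```

The wrapped callable takes only the count matrix, which is why it can stand in for a plain normalization method.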

Change np.hstack to np.concatenate to allow loading trajectories of d…

366ae17

…ifferent shapes.
