What did you intend to do with drug_name.split("+")? #132

Open
bhomass opened this issue Aug 14, 2023 · 8 comments

Comments

@bhomass

bhomass commented Aug 14, 2023

In data.py, there is this statement:

for d in self.drugs_names:
    [drugs_names_unique.add(i) for i in d.split("+")]

It caused the code to fail at the following statement:

    name_to_smiles_map = {
        drug: canonicalize_smiles(smiles)
        for drug, smiles in dataset.obs.groupby(
            [perturbation_key, smiles_key]
        ).groups.keys()
    }

Looking at the drug names, only four of them contain '+':
(+)-3-(1-propyl-piperidin-3-yl)-phenol
(+|-)-7-hydroxy-2-(N,N-di-n-propylamino)tetralin
flurbiprofen-(+|-)
atenolol-(+|-)

I assume the sensible fix would be to strip the (+) or (+|-) marker together with the adjacent leading or trailing '-'.
But [drugs_names_unique.add(i) for i in d.split("+")] does not do that. It simply leaves fragments such as '(' behind as supposed drug names.
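To make the problem concrete, here is a small sketch that applies the same plain split to the four names listed above. It only mirrors the loop quoted from data.py; the variable `names` is introduced for illustration.

```python
# The four drug names from the dataset that contain '+'.
names = [
    "(+)-3-(1-propyl-piperidin-3-yl)-phenol",
    "(+|-)-7-hydroxy-2-(N,N-di-n-propylamino)tetralin",
    "flurbiprofen-(+|-)",
    "atenolol-(+|-)",
]

# Same effect as the list comprehension in data.py: every piece
# produced by splitting on '+' is stored as if it were a drug name.
drugs_names_unique = set()
for d in names:
    drugs_names_unique.update(d.split("+"))

# None of the original names survive; only fragments such as '('
# and '|-)' end up in the set.
for fragment in sorted(drugs_names_unique):
    print(repr(fragment))
```

Since none of these fragments is a real drug name, any later step that expects every entry to be a real name can fail.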

Could someone confirm whether my interpretation is correct?

The comment on drug_names_to_once_canon_smiles() says
#This function will need to be rewritten to handle datasets with combinations
but I don't understand what is meant by "combinations". Are there some drugs that use '+' to combine multiple formulas, and is that why you are calling split('+')? If so, the (+) cases should be exempted from the split, but I don't see any such mechanism in place.

@bhomass
Author

bhomass commented Aug 26, 2023

I switched to re.split with
plus_pattern = r'(?<!\()\+'

which splits on '+' only when it is not immediately preceded by '(', and it worked.
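A minimal sketch of how that pattern behaves, assuming it is passed to Python's `re.split`. The helper `split_combination` and the names `drugA` and `drugB` are hypothetical and not part of chemCPA.

```python
import re

# Split on '+' unless the character immediately before it is '('.
plus_pattern = r'(?<!\()\+'


def split_combination(drug_name):
    """Split a (possibly combined) drug name into its parts."""
    return re.split(plus_pattern, drug_name)


# A hypothetical combination is still split into its two parts.
print(split_combination("drugA+drugB"))

# The stereo-marked names from the dataset are left untouched.
print(split_combination("(+)-3-(1-propyl-piperidin-3-yl)-phenol"))
print(split_combination("atenolol-(+|-)"))

# A combination that includes a stereo-marked name splits only at the
# '+' that is not preceded by '('.
print(split_combination("(+)-3-(1-propyl-piperidin-3-yl)-phenol+drugB"))
```

This relies on the separator never directly following an opening parenthesis, which holds for the four names above but is an assumption about any other dataset.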

@MxMstrmn
Collaborator

The "+" sign is meant to indicate drug combinations, which chemCPA supports, as demonstrated here. I will make this clearer in a future PR.

@bhomass
Author

bhomass commented Sep 13, 2023

Yes, I do understand the intent of the '+' in the code. My point is that splitting on '+' will fail for this dataset, because some drug names already contain '+', as in (+)-3-(1-propyl-piperidin-3-yl)-phenol.

@sepidism

Were you able to run manual_seml_sweep.py? I keep getting random errors regarding the data. I'm trying with sciplex_complete_middle_subset.h5ad and slincs_full_smiles_sciplex_genes.h5ad.

@bhomass
Author

bhomass commented Jan 11, 2024 via email

@bhomass
Author

bhomass commented Feb 7, 2024

Apparently the drug names containing the unusual '+' are eliminated during preprocessing, provided you are able to run the code in the preprocessing folder. For us, that is not possible because some of the input files have not been posted.
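For anyone in the same position, a rough sketch of stripping the markers by hand, following the suggestion at the top of this thread. This is not the code from the preprocessing folder. The pattern `stereo_marker` and the helper `strip_stereo_marker` are hypothetical and only cover the four names listed earlier.

```python
import re

# Hypothetical pattern: matches "(+)" or "(+|-)" together with one
# hyphen directly before and/or after it, if present.
stereo_marker = re.compile(r"-?\(\+(?:\|-)?\)-?")


def strip_stereo_marker(drug_name):
    """Remove a (+) or (+|-) marker and its adjacent hyphen(s)."""
    return stereo_marker.sub("", drug_name)


for name in [
    "(+)-3-(1-propyl-piperidin-3-yl)-phenol",
    "(+|-)-7-hydroxy-2-(N,N-di-n-propylamino)tetralin",
    "flurbiprofen-(+|-)",
    "atenolol-(+|-)",
]:
    print(name, "->", strip_stereo_marker(name))
```

Note that this discards the stereochemistry information carried by the marker, and a marker in the middle of a name would lose the hyphens on both sides, so the result should be checked against the SMILES column before relying on it.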

@MxMstrmn
Collaborator

MxMstrmn commented Mar 4, 2024

Hi @bhomass,

Why were you not able to remove those '+' signs from the drug names? What do you mean by unposted input files?

@bhomass
Author

bhomass commented Mar 5, 2024 via email
