Phenotype data does not use pandas dtype inference #74

Open
hardingnj opened this issue Sep 22, 2021 · 1 comment

@hardingnj

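Because the phenotype data is built without going through pandas' read_csv function, we lose its detection of NaN values and its dtype inference, so numeric columns that contain missing-value markers are coded as objects.

For example: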

import GEOparse

geo = GEOparse.get_GEO("GSE112676")

geo.phenotype_data["characteristics_ch1.3.age_onset"]

gives

GSM3076582    72.69
GSM3076584    66.97
GSM3076586    73.73
GSM3076588       NA
GSM3076590       NA
              ...  
GSM3078502    74.88
GSM3078503    73.57
GSM3078505    71.29
GSM3078507    61.84
GSM3078510    74.49
Name: characteristics_ch1.3.age_onset, Length: 741, dtype: object

So the "NA" entries are kept as plain strings rather than being recognised as missing values, and the column stays as object instead of being converted to float.
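
The same behaviour can be reproduced without downloading anything. This is only a minimal sketch: the toy frame and the variable names are made up for illustration (the sample IDs and values are copied from the output above), and it assumes nothing more than that the values arrive as strings.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for geo.phenotype_data (hypothetical data): the values arrive
# as strings, as they would when a table is assembled without read_csv.
toy = pd.DataFrame(
    {"age_onset": ["72.69", "66.97", "NA"]},
    index=["GSM3076582", "GSM3076584", "GSM3076588"],
)
built_is_numeric = pd.api.types.is_numeric_dtype(toy["age_onset"])

# The same data parsed by read_csv: "NA" is one of its default NaN markers,
# so the column is inferred as a float column with one missing value.
csv_text = "sample,age_onset\nGSM3076582,72.69\nGSM3076584,66.97\nGSM3076588,NA\n"
parsed = pd.read_csv(StringIO(csv_text), index_col=0)
parsed_dtype = parsed["age_onset"].dtype

print(built_is_numeric, parsed_dtype)
```

The first column is not numeric because nothing ever tried to parse its strings, while the second is numeric because read_csv did.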

My fix is something like this:

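from io import StringIO
import pandas as pd

# pheno is the phenotype DataFrame, e.g. geo.phenotype_data
# Round-trip through CSV so that read_csv re-infers dtypes and parses "NA" as NaN
out = StringIO()
pheno.to_csv(out)
pheno = pd.read_csv(StringIO(out.getvalue()), index_col=0)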
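
To make the round trip easier to test, it could be wrapped in a small helper. This is a sketch, not GEOparse code: the function name `reinfer_dtypes`, the toy frame, and the `tissue` column are all hypothetical and exist only for illustration.

```python
from io import StringIO

import pandas as pd


def reinfer_dtypes(frame: pd.DataFrame) -> pd.DataFrame:
    """Round-trip a frame through CSV so read_csv re-infers column dtypes.

    Hypothetical helper, not part of GEOparse.
    """
    buffer = StringIO()
    frame.to_csv(buffer)
    buffer.seek(0)  # rewind so read_csv starts at the beginning
    return pd.read_csv(buffer, index_col=0)


# Made-up stand-in for the phenotype table: every value is a string.
pheno_toy = pd.DataFrame(
    {"age_onset": ["72.69", "NA", "74.88"], "tissue": ["blood", "blood", "blood"]},
    index=["GSM3076582", "GSM3076588", "GSM3078502"],
)
fixed = reinfer_dtypes(pheno_toy)
print(fixed.dtypes)
```

Note that the round trip re-parses every column, so a text column that legitimately contains one of read_csv's default NaN markers (such as "NA") would also end up with missing values.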

I can put in a quick PR. Round-tripping through CSV feels a little crude, but I haven't been able to find a more elegant way.

@guma44
Owner

guma44 commented Oct 19, 2021

Thanks for reporting. Let me think about how to do this. Maybe a PR would be good, so that we can test it.
