Suggestion for an improvement of the GEOparse.utils.smart_open() function #76

Open
abysslover opened this issue Sep 23, 2021 · 0 comments


I found that some GEO files contain carriage return characters in their metadata, which raises a GEOparse.GEOTypes.DataIncompatibilityException. To reproduce the error, load the "GPL10740" dataset as follows:

import GEOparse
gpl = GEOparse.get_GEO(geo="GPL10740", silent=True, include_data=True, destdir=".")

(<class 'GEOparse.GEOTypes.DataIncompatibilityException'>, DataIncompatibilityException('\nData columns do not match columns description index in GSM1530106\nColumns in table are: )\nIndex in columns are: ID_REF, VALUE, DETECTION P-VALUE\n',), <traceback object at 0x7f1fee64be48>)

The columns variable returned by GEOparse.parse_columns(soft) is:

Index(['ID_REF', 'VALUE', 'DETECTION P-VALUE'], dtype='object')

The table_data.columns variable returned by GEOparse.parse_table_data(soft) is:
Index([')'], dtype='object')

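This is caused by a metadata line that contains a carriage return (shown as ^M below). Python's default universal-newlines mode treats it as a line break, so the trailing ")" lands on a line of its own and is read as the table header: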

!Sample_relation = Alternative to: GSM1530054 (gene-level analysis^M)
!Sample_series_id = GSE62617
!Sample_series_id = GSE70707
#ID_REF =
#VALUE = RMA normalized signal intensity
#DETECTION P-VALUE =
!sample_table_begin
ID_REF	VALUE	DETECTION P-VALUE
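To illustrate the splitting behaviour described above, here is a minimal sketch using only the standard library. The two-line sample is a shortened, hypothetical stand-in for the real SOFT file, with "\r" playing the role of the ^M character; it is not taken from the GPL10740 download itself.

```python
import os
import tempfile

# Hypothetical, shortened stand-in for the SOFT excerpt; "\r" is the stray ^M.
sample = b"!Sample_relation = Alternative to: GSM1530054 (gene-level analysis\r)\n" \
         b"!sample_table_begin\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.soft")
    # Write in binary mode so no newline translation happens on the way in.
    with open(path, "wb") as fh:
        fh.write(sample)

    # Default text mode (newline=None): "\r" is translated to "\n",
    # so the metadata line is split in two.
    with open(path, "r") as fh:
        default_lines = fh.readlines()

    # newline="\n": only "\n" terminates a line; "\r" stays inside the line.
    with open(path, "r", newline="\n") as fh:
        strict_lines = fh.readlines()

print(len(default_lines), default_lines[1] == ")\n")
print(len(strict_lines), "\r" in strict_lines[0])
```

With the default settings the lone ")" becomes a line that starts with none of the SOFT prefix characters, which would explain why it is picked up as table content.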

I suggest a small modification to the GEOparse.utils.smart_open() function so that it can handle such datasets:

import gzip
import sys
from contextlib import contextmanager

@contextmanager
def smart_open(filepath, **open_kwargs):
    """Open file intelligently depending on the source and python version.

    Args:
        filepath (:obj:`str`): Path to the file.

    Yields:
        Context manager for file handle.

    """
    if "errors" not in open_kwargs:
        open_kwargs["errors"] = "ignore"
    if filepath[-2:] == "gz":
        open_kwargs["mode"] = "rt"
        fopen = gzip.open
    else:
        open_kwargs["mode"] = "r"
        fopen = open
    open_kwargs["newline"] = "\n"
    # I do not know why there is an "if" statement here, since both branches call fopen with the same parameters.
    if sys.version_info[0] < 3:
        fh = fopen(filepath, **open_kwargs)
    else:
        fh = fopen(filepath, **open_kwargs)
    try:
        yield fh
    except IOError:
        fh.close()
    finally:
        fh.close()
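As a rough check that the same keyword arguments behave identically for plain and gzip-compressed files, the sketch below uses a simplified stand-in for the proposed function. The name open_text and the sample payload are hypothetical and exist only for this illustration; the real smart_open() lives in GEOparse.utils.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_text(filepath, **open_kwargs):
    """Hypothetical, simplified stand-in for the proposed smart_open()."""
    open_kwargs.setdefault("errors", "ignore")
    open_kwargs["newline"] = "\n"  # only "\n" ends a line; "\r" is left alone
    fopen = gzip.open if filepath.endswith("gz") else open
    fh = fopen(filepath, mode="rt", **open_kwargs)
    try:
        yield fh
    finally:
        fh.close()


# Hypothetical payload with a stray carriage return inside a metadata line.
payload = b"!Sample_relation = (gene-level analysis\r)\n!sample_table_begin\n"

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sample.soft")
    gz_path = os.path.join(tmp, "sample.soft.gz")
    with open(plain_path, "wb") as fh:
        fh.write(payload)
    with gzip.open(gz_path, "wb") as fh:
        fh.write(payload)

    with open_text(plain_path) as fh:
        plain_lines = fh.readlines()
    with open_text(gz_path) as fh:
        gz_lines = fh.readlines()

print(plain_lines == gz_lines, len(plain_lines))
```

Note that the carriage return is still present inside the metadata value after reading, so downstream code may want to strip it if it matters.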
