Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KN Picard Metrics Aggregation Script #364

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

krithikanathamuni
Copy link

No description provided.

@asmirnov239 asmirnov239 self-requested a review June 29, 2022 21:28
Copy link

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krithikanathamuni Thank you for putting this together. I made a few general comments on code structure and formatting, as well as more specific suggestions on code refactoring. Let me know if you have any questions!

@@ -0,0 +1,276 @@
def read_in(file_name, label):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can be more descriptive with the function name here. You could also consider adding a docstring with a description of the function and potentially inputs/outputs, where you can describe the format of the file_name file, and what does the label represent.

Versions of python > 3.5 also support typing (https://docs.python.org/3/library/typing.html) - you can specify types of the inputs, and function output. This really helps with readability and especially if you are reading someone else's code.

How detailed you want to be depends on whether this script will be used or extended again by you or other people in the future.

Comment on lines 8 to 9
intable = line.startswith(label)
if intable == True:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace these two lines with

Suggested change
intable = line.startswith(label)
if intable == True:
if line.startswith(label):

newlist = []
wgsfile = file_name

with tfio.gfile.GFile(file_name, "r") as inp:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for using TensorFlow's GFile here? Can this be replace with Python's built-in open method?

def read_in(file_name, label):
started = False
newlist = []
wgsfile = file_name

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is not used.

newlist.append(line.rstrip('\n'))
if started and line == '\n':
return newlist
break

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This break statement is not reachable.


def sequencing_artifact_metrics(table):
sadfone = pd.DataFrame(table.loc[1]).T.reset_index(drop = True)
sadf1 = sadfone.add_suffix("_1").drop(['SAMPLE_ALIAS_1', 'LIBRARY_1', 'WORST_CXT_1',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be refactored to get rid of the duplicate code, but also to store the strings SAMPLE_ALIAS, SAMPLE_ALIAS_1 strings into variables. This way if the name of the Picard attribute changes you only need to change one variable - but also there is less chance of having a typo somewhere that will break the code.

return SAMresult

def windows(table, window):
if len(table.columns) == 2:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine to use in the script, but in general having the output of the function be conditioned on the hardcoded variable like this is a sign that the code should be refactored.

One way to refactor it is to rewrite this code in an object-oriented manner. Again, I don't think it is necessary here but for code that is more than few hundred lines long it's a good idea, and it will provide more readability and robustness.

table_name = "sample"
samples = pd.read_csv(io.StringIO(fiss.fapi.get_entities_tsv(project, workspace, 'sample').text), sep='\t')
samples.rename(columns = {'entity:sample_id':'sample'}, inplace = True)
# specificcolumns = samples[['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove old code here.

# dropemptycolumns = dropemptyrows.dropna(axis = 1)
# files = ! ls

Dict = {}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the variable name more descriptive and lower case. dict is also name for a built-in type in Python and typing library has a class Dict so it's best to avoid using either.

result = pd.concat([result, newrow], ignore_index = True)
return result


Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra spaces here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants