KN Picard Metrics Aggregation Script #364

Draft

krithikanathamuni wants to merge 3 commits into main from kn_picard_aggregation_script

krithikanathamuni commented Jun 29, 2022

No description provided.


Add picard metrics aggregation script

cddd4b3

asmirnov239 self-requested a review

June 29, 2022 21:28


Add picard metrics aggregation script

asmirnov239 reviewed

asmirnov239 left a comment

@krithikanathamuni Thank you for putting this together. I made a few general comments on code structure and formatting, as well as more specific suggestions on code refactoring. Let me know if you have any questions!

src/sv-pipeline/scripts/aggregate_picard.py Outdated

		@@ -0,0 +1,276 @@
		def read_in(file_name, label):

asmirnov239 Jun 30, 2022

You can be more descriptive with the function name here. You could also consider adding a docstring that describes the function and, potentially, its inputs and outputs. There you can describe the format of the file_name file and what the label represents.

Python 3.5 and later also support type hints (https://docs.python.org/3/library/typing.html): you can specify the types of the inputs and of the function's return value. This really helps with readability, especially when you are reading someone else's code.

How detailed you want to be depends on whether this script will be used or extended again by you or other people in the future.
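As a rough sketch of what a more descriptive name, a docstring, and type hints could look like together: the name read_metrics_table, the parameter names, and the use of the built-in open are assumptions made for illustration, not the script's actual code. The logic follows the excerpts in this review (start at the line beginning with the label, stop at the first blank line).

```python
from typing import List


def read_metrics_table(file_path: str, header_label: str) -> List[str]:
    """Return the lines of one table from a Picard-style metrics text file.

    Args:
        file_path: Path to a plain-text metrics file (hypothetical name).
        header_label: Prefix that marks the first line of the wanted table.

    Returns:
        The table lines, header first, without trailing newlines.
        The list is empty if no line starts with header_label.
    """
    table_lines: List[str] = []
    started = False
    with open(file_path, "r") as handle:
        for line in handle:
            if line.startswith(header_label):
                started = True
            if started:
                # A blank line marks the end of the table.
                if line.strip() == "":
                    break
                table_lines.append(line.rstrip("\n"))
    return table_lines
```

With the signature annotated this way, a reader can tell what goes in and what comes out without opening the function body.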

src/sv-pipeline/scripts/aggregate_picard.py Outdated

Comment on lines 8 to 9

		intable = line.startswith(label)
		if intable == True:

asmirnov239 Jun 30, 2022

You can replace these two lines with:

Suggested change

- intable = line.startswith(label)
- if intable == True:
+ if line.startswith(label):

src/sv-pipeline/scripts/aggregate_picard.py Outdated

+ newlist = []
+ wgsfile = file_name
+ with tfio.gfile.GFile(file_name, "r") as inp:

asmirnov239 Jun 30, 2022

Is there a reason for using TensorFlow's GFile here? Can this be replaced with Python's built-in open function?

src/sv-pipeline/scripts/aggregate_picard.py Outdated

+def read_in(file_name, label):
+ started = False
+ newlist = []
+ wgsfile = file_name

asmirnov239 Jun 30, 2022

This variable is not used.

src/sv-pipeline/scripts/aggregate_picard.py Outdated

+ newlist.append(line.rstrip('\n'))
+ if started and line == '\n':
+ return newlist
+ break

asmirnov239 Jun 30, 2022

This break statement is not reachable.
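A minimal sketch of the same loop shape without the dead statement. The function name collect_until_blank and its list-of-lines input are assumptions for illustration; the point is only that return already leaves both the loop and the function, so nothing placed after it on the same branch can run.

```python
from typing import Iterable, List


def collect_until_blank(lines: Iterable[str]) -> List[str]:
    """Collect lines (without trailing newlines) up to the first blank line."""
    collected: List[str] = []
    for line in lines:
        if line == "\n":
            # return exits the loop and the function at once,
            # so a break after it would never execute.
            return collected
        collected.append(line.rstrip("\n"))
    # Reached only when the input has no blank line.
    return collected
```

For example, collect_until_blank(["a\n", "b\n", "\n", "c\n"]) stops at the blank line and does not include "c".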

src/sv-pipeline/scripts/aggregate_picard.py Outdated

+def sequencing_artifact_metrics(table):
+ sadfone = pd.DataFrame(table.loc[1]).T.reset_index(drop = True)
+ sadf1 = sadfone.add_suffix("_1").drop(['SAMPLE_ALIAS_1', 'LIBRARY_1', 'WORST_CXT_1',

asmirnov239 Jul 1, 2022

This can also be refactored to get rid of the duplicate code, and to store strings such as SAMPLE_ALIAS and SAMPLE_ALIAS_1 in variables. This way, if the name of a Picard attribute changes, you only need to change one variable, and there is less chance of a typo somewhere breaking the code.
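One possible shape for this, as a sketch only: the constant names and the helper suffixed are hypothetical, and ID_COLUMNS lists just the three column names visible in the excerpt above, since the rest of the original list is cut off.

```python
from typing import List

# Picard attribute names kept in one place (only those visible in the excerpt).
SAMPLE_ALIAS = "SAMPLE_ALIAS"
LIBRARY = "LIBRARY"
WORST_CXT = "WORST_CXT"
ID_COLUMNS: List[str] = [SAMPLE_ALIAS, LIBRARY, WORST_CXT]


def suffixed(columns: List[str], row_index: int) -> List[str]:
    """Return the column names with a '_<row_index>' suffix appended."""
    return [f"{name}_{row_index}" for name in columns]
```

The drop call for each row could then reuse the same list, for instance sadfone.add_suffix("_1").drop(columns=suffixed(ID_COLUMNS, 1)), so a renamed attribute is edited once and the per-row blocks no longer repeat string literals.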

src/sv-pipeline/scripts/aggregate_picard.py

+ return SAMresult
+def windows(table, window):
+ if len(table.columns) == 2:

asmirnov239 Jul 1, 2022

I think this is fine to use in the script, but in general, having a function's output depend on a hardcoded value like this is a sign that the code should be refactored.

One way to refactor it is to rewrite this code in an object-oriented manner. Again, I don't think it is necessary here, but for code that is more than a few hundred lines long it's a good idea, and it will improve readability and robustness.
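A minimal sketch of what an object-oriented version might look like. Everything here is hypothetical: the class names, the summarize method, and the two table layouts are invented to illustrate the idea and do not reflect what windows actually computes. The column-count check happens once, in build_table, instead of inside the processing function.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MetricsTable(ABC):
    """Base class: each table layout knows how to summarize itself."""

    def __init__(self, rows: List[List[str]]) -> None:
        self.rows = rows

    @abstractmethod
    def summarize(self) -> Dict[str, str]:
        """Return the table as a flat name-to-value mapping."""


class TwoColumnTable(MetricsTable):
    """Layout where every row is a (name, value) pair."""

    def summarize(self) -> Dict[str, str]:
        return {name: value for name, value in self.rows}


class WideTable(MetricsTable):
    """Layout with a header row followed by one row of values."""

    def summarize(self) -> Dict[str, str]:
        header, values = self.rows[0], self.rows[1]
        return dict(zip(header, values))


def build_table(rows: List[List[str]]) -> MetricsTable:
    """Pick the layout class once, based on the number of columns."""
    return TwoColumnTable(rows) if len(rows[0]) == 2 else WideTable(rows)
```

Callers then write build_table(rows).summarize() and never branch on the column count themselves; supporting a new layout means adding a class rather than another condition.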

src/sv-pipeline/scripts/aggregate_picard.py Outdated

+ table_name = "sample"
+ samples = pd.read_csv(io.StringIO(fiss.fapi.get_entities_tsv(project, workspace, 'sample').text), sep='\t')
+ samples.rename(columns = {'entity:sample_id':'sample'}, inplace = True)
+# specificcolumns = samples[['alignment_summary_metrics', 'base_distribution_by_cycle_table', 'gc_bias_summary_metrics',

asmirnov239 Jul 1, 2022

Remove old code here.

src/sv-pipeline/scripts/aggregate_picard.py

+# dropemptycolumns = dropemptyrows.dropna(axis = 1)
+# files = ! ls
+ Dict = {}

asmirnov239 Jul 1, 2022

Make the variable name more descriptive and lowercase. dict is also the name of a built-in type in Python, and the typing library has a class named Dict, so it's best to avoid using either.
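A short illustration of both points. The name metrics_by_sample and the sample key are assumptions, since the excerpt does not show what the dictionary holds; shadowing_demo only shows what goes wrong when a variable reuses the built-in name.

```python
from typing import Dict, List

# Descriptive, lowercase name; the annotation documents what is stored.
metrics_by_sample: Dict[str, List[str]] = {}
metrics_by_sample["sample_1"] = ["CATEGORY\tTOTAL_READS"]


def shadowing_demo() -> bool:
    """Return True if rebinding the name dict breaks calls to the built-in."""
    dict = {}  # shadows the built-in type inside this function
    try:
        dict(a=1)  # the name now refers to an instance, which is not callable
    except TypeError:
        return True
    return False
```

Naming a variable Dict causes the same kind of collision with typing.Dict whenever that class is imported in the module.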

src/sv-pipeline/scripts/aggregate_picard.py

		result = pd.concat([result, newrow], ignore_index = True)
		return result

asmirnov239 Jul 1, 2022

Extra spaces here.


updated read_in, newtable, addrows

190a808