HWE output issues: need to accommodate longer colon-delimited allele names #19

sjmack · 2017-07-12T22:14:50Z

:pypop sjmack$ ./bin/pypop.py -c WS_BDCtrl_Test_HW.ini BIGDAWG_SynthControl_Data.pop is generating the following output for the Hardy Weinberg common genotypes test:

------------------------------------------------------------------------------
Common genotypes
:01:01+01:01:01:01           7        6.88        0.00        0.9621      
:01:01:01+02:05:01          11       11.76        0.05        0.8241      
:01:01+03:01:01:01          12       12.01        0.00        0.9975      
:01:01:01+03:01:03           9        8.95        0.00        0.9856      
:01:01+24:02:01:01          13       13.00        0.00        0.9989      
:01:01:01+26:01:01           8        7.95        0.00        0.9864      
 01:01:01:01+26:08           9        9.19        0.00        0.9488      
:01:01+29:02:01:02           9        8.95        0.00        0.9856      
:01:01+31:01:02:01           8        7.95        0.00        0.9864      
 01:01:01:01+32:02          16       16.82        0.04        0.8424

As with Issue #18 the increased size of the colon-delimited allele names is larger than the hard-coded 18 characters allowed for a + delimited allele pair.

With digit delimited allele names, the maximum length for a pair would have been 20 characters [but in retrospect, I think that PyPop is stripping expression variant suffixes like N and L, as those are optional)], so the maximum length would have been 17 characters (01010101+01010101).

As with Issue #18 the size of the Common genotypes allele pair field needs to be increased to at least 25 characters (e.g., 104:01:01:01+06:127:01:01), and may possibly need to be made flexible to accommodate longer allele names.

The text was updated successfully, but these errors were encountered:

sjmack · 2017-07-12T22:18:51Z

This is actually a more pressing issue than Issue #18, as the first allele is left-truncated, so that the allele-family and protein fields are being truncated.

sjmack · 2017-07-12T22:33:13Z

Also, allele-lengths appear to be properly accommodated by the Chen/Diff test results:

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0037** 0.0037**
05:03:01:01+03:02:12 (0/18.105788) 0.0001**** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0000***** 0.0000*****
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000***** 0.0000*****
06:05:01+06:05:01 (18/8.818363) 0.0019** 0.0019**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03:02:12+03:02:12 (18/8.818363) 0.0017** 0.0017**
05:03:01:01+03:02:12 (0/18.105788) 0.0000***** 0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****
06:02:01+05:03:01:01 (0/23.499002) 0.0000***** 0.0000*****
06:02:01+06:02:01 (27/14.854291) 0.0007*** 0.0007***
06:03:01+04:02:01 (8/0.287425) 0.0000***** 0.0000*****
06:05:01+05:03:01:01 (0/18.105788) 0.0000**** 0.0000****
06:05:01+06:05:01 (18/8.818363) 0.0016** 0.0016**

Although these Chen/diff results would be easier to read if they were structured more like the other HW results:

03:02:12+03:02:12       (18/8.818363) 0.0037**    0.0037**
05:03:01:01+03:02:12    (0/18.105788) 0.0001****  0.0000*****
05:03:01:01+05:03:01:01 (39/9.293663) 0.0000***** 0.0000*****

However, this latter would be a "nice to have".

alexlancaster · 2017-07-13T03:17:38Z

Got it, so we'll make this dynamic too. How often is the plain text output used these days? The TSV/CSV output obviously doesn't have this problem since there is no attempt to format things. Everything is always generated from the complete "output of record" (which is the XML), i.e. both the TSV/CSV files and the plain text are derived from the XML, and so can be regenerated at any time without having to redo the analysis.

In case you want to try this, you just need the xsltproc tool to be installed (part of libxml2) and you can run the following to regenerate the text output:

 xsltproc xslt/text.xsl BIGDAWG_SynthControl_Data-out.xml > BIGDAWG_SynthControl_Data-out.txt

…use this as the column width (#19)

alexlancaster · 2017-07-13T05:14:21Z

git pull again. I changed this to use the maximum possible genotype length across the whole input XML and use this as the initial column width for the common genotype (rather than be locus-specific). Currently it is globally hardcoded anyway. Try it on a few files.

sjmack · 2017-07-13T06:41:20Z

Common Genotypes look good, but as I commented in Issue #18, the common genotypes are right aligned, which looks weird.

I can't really say how frequently PyPop users are relying on the .txt output as opposed to the .xml output.

Personally, I always look at the .txt output first, because I want to know if I need to rerun the data because I set something wrong, or need to change a setting. I use popmeta to generate .dat files from the .xml output, but then I have to look through each .dat file to see what's what.

I do know that I get questions from the community asking how to get the d'ij values out of PyPop; I direct the questioners to the .xml and popmeta, so that suggests to me that many others are relying primarily on the .txt outputs.

alexlancaster · 2017-07-13T16:03:26Z

Great, if this is fixed from your POV, please close. I opened up a new issue on the Chen/diff stat alignment issue in #20.

There is also the --generate-tsv command line option to pypop.py that generates the .dat files and the .txt and .xml files all in one hit. Probably should document this a bit more (feel free to open up an issue on this too).

sjmack · 2017-07-13T16:44:43Z

Okay. I'll close. I hadn't been using --generate-tsv, but it is good to be reminded of it.

… the labels even if the allele names are short (#19).

alexlancaster · 2017-07-18T17:49:30Z

Hi @sjmack , in testing the new code with dynamic columns based on the allele lengths, I noticed that if the allele names were short it could sometimes cutoff the labels (using the USAFEL example). Try running the USAFEL example before you update to see what I mean. (I'm going to move this to the test suite now so we catch this).

In any case, I just committed a fix to master that should address these, try running again. You might want to look at the output to make sure it looks OK. I had to hardcode the dashed row separator as being 90 characters. it could be made dynamic, but it doesn't seem to me to be a big priority.

sjmack · 2017-07-18T19:44:13Z

Okay. I see.

./bin/pypop.py -c data/samples/minimal.ini data/samples/USAFEL-UchiTelle.pop

Generated results like this before a pull:

               Observed    Expected  Chi-square   DoF   p-value   
------------------------------------------------------------------------------
     Common Too many parameters for chi-square test.

------------------------------------------------------------------------------
d genotypes Value not calculated.

------------------------------------------------------------------------------
on + lumped Value not calculated.

------------------------------------------------------------------------------
homozygotes          17        9.82        5.25     1  0.0219*     
------------------------------------------------------------------------------
terozygotes          30       37.18        1.39     1  0.2389      
------------------------------------------------------------------------------
Common heterozygotes by allele
0202                 14       15.74        0.19        0.6602      
0301                 11        9.71        0.17        0.6796      
03032                14       19.66        1.63        0.2018      
0304                  4        7.32        1.51        0.2199      
05031                 6        5.62        0.03        0.8716      
06011                11       16.31        1.73        0.1887      

------------------------------------------------------------------------------
Common genotypes
0202+03032            4        5.96        0.64        0.4226      
03032+06011           3        6.26        1.69        0.1931      
      Total           7       12.21
------------------------------------------------------------------------------

and ./bin/pypop.py -c tests/data/minimal.ini data/samples/USAFEL-UchiTelle.pop

generated these results with the current master branch:

                          Observed         Expected       Chi-square   DoF   p-value        
------------------------------------------------------------------------------------------
           Common Too many parameters for chi-square test.

------------------------------------------------------------------------------------------
 Lumped genotypes Value not calculated.

------------------------------------------------------------------------------------------
  Common + lumped Value not calculated.

------------------------------------------------------------------------------------------
  All homozygotes               17             9.82             5.25     1  0.0219*          
------------------------------------------------------------------------------------------
All heterozygotes               30            37.18             1.39     1  0.2389           
------------------------------------------------------------------------------------------
Common heterozygotes by allele
0202                            14            15.74             0.19        0.6602           
0301                            11             9.71             0.17        0.6796           
03032                           14            19.66             1.63        0.2018           
0304                             4             7.32             1.51        0.2199           
05031                            6             5.62             0.03        0.8716           
06011                           11            16.31             1.73        0.1887           

------------------------------------------------------------------------------------------
Common genotypes
0202+03032                       4             5.96             0.64        0.4226           
03032+06011                      3             6.26             1.69        0.1931           
            Total                7            12.21
------------------------------------------------------------------------------------------

I think that those fixes generate acceptable presentations. Nice catch!

alexlancaster · 2017-07-18T19:59:42Z

Great.

sjmack mentioned this issue Jul 12, 2017

HWP output issues: missing common genotypes and incorrect Guo & Thompson stats #17

Closed

alexlancaster added a commit that referenced this issue Jul 13, 2017

find the maximum possible genotype length across the whole input XML …

4be6e47

…use this as the column width (#19)

alexlancaster added the bug Something isn't working, should not be used for new features: use "enhancement" for those label Jul 13, 2017

alexlancaster added this to the High milestone Jul 13, 2017

alexlancaster mentioned this issue Jul 13, 2017

Format Chen/diff statistics in columns #20

Open

sjmack closed this as completed Jul 13, 2017

alexlancaster added a commit that referenced this issue Jul 18, 2017

adjust HW table column length so that it is long enough to accomodate…

15426a6

… the labels even if the allele names are short (#19).

alexlancaster modified the milestones: High, 1.0.0 Release Jul 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HWE output issues: need to accommodate longer colon-delimited allele names #19

HWE output issues: need to accommodate longer colon-delimited allele names #19

sjmack commented Jul 12, 2017

sjmack commented Jul 12, 2017 •

edited

Loading

sjmack commented Jul 12, 2017

alexlancaster commented Jul 13, 2017

alexlancaster commented Jul 13, 2017 •

edited

Loading

sjmack commented Jul 13, 2017

alexlancaster commented Jul 13, 2017

sjmack commented Jul 13, 2017

alexlancaster commented Jul 18, 2017

sjmack commented Jul 18, 2017

alexlancaster commented Jul 18, 2017

HWE output issues: need to accommodate longer colon-delimited allele names #19

HWE output issues: need to accommodate longer colon-delimited allele names #19

Comments

sjmack commented Jul 12, 2017

sjmack commented Jul 12, 2017 • edited Loading

sjmack commented Jul 12, 2017

alexlancaster commented Jul 13, 2017

alexlancaster commented Jul 13, 2017 • edited Loading

sjmack commented Jul 13, 2017

alexlancaster commented Jul 13, 2017

sjmack commented Jul 13, 2017

alexlancaster commented Jul 18, 2017

sjmack commented Jul 18, 2017

alexlancaster commented Jul 18, 2017

sjmack commented Jul 12, 2017 •

edited

Loading

alexlancaster commented Jul 13, 2017 •

edited

Loading