Updated documentation

Sanitization of input file
iquasere · Nov 7, 2023 · 0eda356
1 parent 47f3464
commit 0eda356
Showing 4 changed files with 36 additions and 50 deletions.
diff --git a/README.md b/README.md
@@ -5,7 +5,6 @@ A tool for representing genomic potential and transcriptomic expression into KEG
 * [Features](https://github.com/iquasere/KEGGCharter#features)
 * [Installation](https://github.com/iquasere/KEGGCharter#installation)
 * [Running KEGGCharter](https://github.com/iquasere/KEGGCharter#running-keggcharter)
-* [Testing KEGGCharter](https://github.com/iquasere/KEGGCharter#testing-keggcharter)
 * [Outputs](https://github.com/iquasere/KEGGCharter#outputs)
 * [Arguments for KEGGCharter](https://github.com/iquasere/KEGGCharter#arguments-for-keggcharter)
 * [Referencing KEGGCharter](https://github.com/iquasere/KEGGCharter#referencing-keggcharter)
@@ -27,32 +26,24 @@ conda install -c conda-forge -c bioconda keggcharter
 
 ## Running KEGGCharter
 
-To run KEGGCharter, an input file must be supplied - see ```Testing KEGGCharter``` section - and columns 
-with genomic and/or transcriptomic information, as well as one column with either KEGG IDs, KOs or EC numbers, must be 
-present in the file and specified through the command line.
-```
-keggcharter.py -f input_file.tsv -o output_folder -mgc mg_column1,mg_column2 -mtc mt_column1,mt_column2 ...
-```
-
-## Testing KEGGCharter
+To run KEGGCharter, an input file must be supplied. This file only needs to contain one column with either KEGG IDs, KOs or EC numbers. Beyond that:
+* to obtain distinct taxonomic identifications in the maps, specify a column with taxonomic identification using the `-tc` parameter. If no such column exists, run KEGGCharter with the `-it` parameter.
+* to obtain maps with differential expression, specify at least one column with genomic and/or transcriptomic quantification using the `-qcol` parameter. If no such column exists, run KEGGCharter with the `-iq` parameter.
 
-An example input file is available [here](https://github.com/iquasere/KEGGCharter/blob/master/MOSCA_Entry_Report.xlsx). 
-This is one output of [MOSCA](https://github.com/iquasere/MOSCA), which can be directly inputted to KEGGCharter to obtain
-metabolic representations by running:
+An example input file is available [here](https://github.com/iquasere/KEGGCharter/blob/master/cicd/keggcharter_input.tsv). 
+It contains all fields referenced above, and should be used as guidance for building inputs for KEGGCharter.
+To obtain metabolic representations for "Methane Metabolism" (map 00680) and "Fatty Acid Degradation" (map 00071) from this input file, run:
 ```
-keggcharter -f MOSCA_Entry_Report.xlsx -gcol mg -tcol mt_0.01a_normalized,mt_1a_normalized,mt_100a_normalized,mt_0.01b_normalized,mt_1b_normalized,mt_100b_normalized,mt_0.01c_normalized,mt_1c_normalized,mt_100c_normalized -keggc "Cross-reference (KEGG)" -o test_keggcharter -tc "Taxonomic lineage (GENUS)"
+keggcharter -f keggcharter_input.tsv -o test_keggcharter -qcol mt_0.01a,mt_1a,mt_100a,mt_0.01b,mt_1b,mt_100b,mt_0.01c,mt_1c,mt_100c -keggc "KEGG ID" -tc "Species" -mm 00680,00071
 ```
-Just make sure ```MOSCA_Entry_Report.xlsx``` is in the present folder, or indicate the path to it. This command will create
-representations for all 252 default maps of KEGGCharter. If you want to represent for less or more, run with the ```--metabolic-maps``` 
-parameter to indicate to KEGGCharter what maps to run (comma separated).
 
 ### First time KEGGCharter runs it will take a long time
 
 KEGGCharter needs KGMLs and EC-number-to-box relations, which it retrieves automatically for every requested map. 
 This may take some time, but it only needs to happen once. 
 
-Default directory for storing these files is the folder containing the ```keggcharter.py``` script, but it can be customized
-with the ```--resources-directory``` parameter.
+The default directory for storing these files is the folder containing the `keggcharter` script, but it can be changed
+with the `-rd` parameter.
 
 ## Outputs
 

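Note on the README change above: the new text describes the input as a table with one identifier column plus optional taxonomy and quantification columns. The following is a minimal sketch of building such a table with Python's standard library. It assumes the column names used in the README's example command (`KEGG ID`, `Species`, `mt_*`); the row values and the `write_input_table` helper are made up for illustration and are not part of KEGGCharter.

```python
import csv
import io

# Column names mirror the README's example command; only three of the
# nine quantification columns are shown to keep the sketch short.
QUANT_COLUMNS = ["mt_0.01a", "mt_1a", "mt_100a"]
HEADER = ["KEGG ID", "Species"] + QUANT_COLUMNS

# Placeholder rows: identifiers, species names and values are invented.
ROWS = [
    {"KEGG ID": "abc:GENE_0001", "Species": "Example species A",
     "mt_0.01a": 12.0, "mt_1a": 30.5, "mt_100a": 7.25},
    {"KEGG ID": "abc:GENE_0002", "Species": "Example species B",
     "mt_0.01a": 0.0, "mt_1a": 4.0, "mt_100a": 18.0},
]


def write_input_table(handle, rows):
    """Write rows as tab-separated text with a header line (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=HEADER, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; pass an open file handle instead to produce a .tsv file.
buffer = io.StringIO()
write_input_table(buffer, ROWS)
table_text = buffer.getvalue()
print(table_text)
```

The quantification column names would then be passed, comma separated, to `-qcol`, as in the README command.
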
diff --git a/cicd/meta.yaml b/cicd/meta.yaml
@@ -1,6 +1,6 @@
 {% set name = "keggcharter" %}
-{% set version = "0.2.3" %}
-{% set sha256 = "905268158c74058228faca2b34d44eae8f84b23bfa5dee49285635cf59684d7f" %}
+{% set version = "0.7.0" %}
+{% set sha256 = "8e79e92eef943d0b7cfbd035b4d3d9ef5f6f51a6f4d43f8d58989fd849625696" %}
 
 package:
  name: {{ name|lower }}
@@ -13,13 +13,14 @@ source:
 build:
  noarch: generic
  number: 0
- script: |
- dir="${PREFIX}/share/KEGGCharter"
- mkdir -p "${dir}"
- cp *.py resources/KEGGCharter_prokaryotic_maps.txt "${dir}/"
- mkdir -p "${PREFIX}/bin"
- chmod +x "${dir}/keggcharter.py"
- ln -s "${dir}/keggcharter.py" "${PREFIX}/bin/keggcharter"
+ run_exports:
+ - {{ pin_subpackage(name, max_pin="x.x") }}
+ script: >
+ mkdir -p $PREFIX/bin && 
+ mkdir -p $PREFIX/share && 
+ cp *.py resources/KEGGCharter_prokaryotic_maps.txt $PREFIX/share &&
+ chmod +x $PREFIX/share/keggcharter.py &&
+ ln -s $PREFIX/share/keggcharter.py $PREFIX/bin/keggcharter
 
 requirements:
  run:
@@ -34,7 +35,7 @@ requirements:
 
 test:
  commands:
- - keggcharter.py -v
+ - keggcharter -v
 
 about:
  home: https://github.com/iquasere/KEGGCharter

diff --git a/keggcharter.py b/keggcharter.py
@@ -18,7 +18,7 @@
 
 from keggpathway_map import KEGGPathwayMap, expand_by_list_column
 
-__version__ = "0.7.0"
+__version__ = "0.7.1"
 
 
 def get_arguments():
@@ -40,9 +40,8 @@ def get_arguments():
  parser.add_argument("-keggc", "--kegg-column", help="Column with KEGG IDs.")
  parser.add_argument("-koc", "--ko-column", help="Column with KOs.")
  parser.add_argument("-ecc", "--ec-column", help="Column with EC numbers.")
- # TODO - test this argument without UniProt shenanigans
  parser.add_argument(
- "-tc", "--taxa-column", default='Taxonomic lineage (GENUS)',
+ "-tc", "--taxa-column", default=None,
  help="Column with the taxa designations to represent with KEGGCharter."
  " NOTE - for valid taxonomies, check: https://www.genome.jp/kegg/catalog/org_list.html")
  parser.add_argument(
@@ -74,7 +73,8 @@ def get_arguments():
  help="Outputs KEGG maps IDs and descriptions to the console (so you may pick the ones you want!)")
 
  args = parser.parse_args()
-
+ if not os.path.isfile(args.file):
+ exit("Input file doesn't exist! Exiting...")
  args.output = args.output.rstrip('/')
  for directory in [args.output] + [f'{args.resources_directory}/{folder}' for folder in ['', 'kc_kgmls', 'kc_csvs']]:
  if not os.path.isdir(directory):
@@ -449,7 +449,7 @@ def chart_map(
  kegg_pathway_map = KEGGPathwayMap(pathway=mmap, ec_list=ec_list)
  kegg_pathway_map.differential_expression_sample(
  data, quantification_columns, ko_column, mmaps2taxa=mmaps2taxa, taxa_column=taxa_column,
- output_basename=f'{output}/differential', log=False)
+ output_basename=f'{output}/differential')
  plt.close()
 
 
@@ -496,6 +496,11 @@ def read_input():
  args.taxa_list = args.input_taxonomy
  args.metabolic_maps = args.metabolic_maps.split(',')
  args.quantification_columns = args.quantification_columns.split(',')
+ # check if all columns supposed to be in the input data are in the input data
+ for col in [args.taxa_column, args.kegg_column, args.ko_column, args.ec_column] + args.quantification_columns:
+ if col is not None:
+ if col not in data.columns:
+ exit(f'"{col}" column not in input file! Exiting...')
  timed_message('Arguments valid.')
  return args, data
 

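Note on the column check added to `read_input()`: the loop exits at the first requested column that is absent from the input table. A possible variation, sketched here as a standalone function, collects every missing column so that all of them can be reported at once. The name `missing_columns` and the sample lists are hypothetical; this is not the code in the commit.

```python
def missing_columns(requested, available):
    """Return the requested column names absent from the table, in request order.

    Entries that are None stand for optional arguments that were not supplied
    and are skipped, as in the check added by this commit.
    """
    available = set(available)
    return [col for col in requested if col is not None and col not in available]


# Sample values for illustration only.
requested = ["Species", "KEGG ID", None, None, "mt_1a", "mt_9z"]
available = ["KEGG ID", "Species", "mt_0.01a", "mt_1a"]

missing = missing_columns(requested, available)
if missing:
    # The real script exits here; the sketch only prints the message.
    print("Columns not in input file: " + ", ".join(f'"{col}"' for col in missing))
```

Reporting all missing names in one message saves the user from fixing one column per run, at the cost of a slightly longer error line.
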
diff --git a/keggpathway_map.py b/keggpathway_map.py
@@ -282,27 +282,20 @@ def pathway_box_list(self, taxa_in_box, dic_colors, maxshared=10):
  if self.orthologs[boxidx].graphics[0].width is not None:
  create_tile_box(self.orthologs[boxidx])
 
- def pathway_boxes_differential(self, dataframe, log=True, colormap="coolwarm"):
+ def pathway_boxes_differential(self, dataframe, colormap="coolwarm"):
  """
  Represents expression values present in a dataframe in the
  pathway map
  :param dataframe: pandas DataFrame with each column representing a sample
  and index corresponding to int list index of the ortholog element in the
  pathway
- :param log: bol providing the option for a log transformation of data
 :param colormap: str representing a custom matplotlib colormap to be used
  """
- if log:
- norm = cm.colors.LogNorm(vmin=dataframe.min().min(), vmax=dataframe.max().max())
- else:
- norm = cm.colors.Normalize(vmin=dataframe.min().min(), vmax=dataframe.max().max())
-
+ norm = cm.colors.Normalize(vmin=dataframe.min().min(), vmax=dataframe.max().max())
  colormap = cm.get_cmap(colormap)
- dataframe = dataframe.apply(conv_value_rgb, args=(colormap, norm)) # TODO - Doesn't work if using log
+ dataframe = dataframe.apply(conv_value_rgb, args=(colormap, norm))
  dataframe = dataframe.apply(conv_rgb_hex)
-
  dataframe = dataframe[dataframe.columns.tolist()]
-
  nrboxes = len(dataframe.columns.tolist()) # number of samples
 
  for box in dataframe.index.tolist():
@@ -423,8 +416,6 @@ def genomic_potential_taxa(
  name_pdf = f'{output_basename}_{name}.pdf'
  self.to_pdf(name_pdf)
 
- # TODO - legend should be ajusted for the maps - otherwise, makes no sense to have one legend for each map -
- # they all become the same, except for "Other taxa"
  self.create_potential_legend(dic_colors.values(), dic_colors.keys(), name_pdf.replace('.pdf', '_legend.png'))
 
  self.add_legend(
@@ -441,18 +432,16 @@ def differential_colorbar(self, dataframe, filename):
 
  def differential_expression_sample(
  self, data, samples, ko_column, mmaps2taxa, taxa_column='Taxonomic lineage (GENUS)',
- output_basename=None, log=True):
+ output_basename=None):
  """
  Represents in small heatmaps the expression levels of each sample on the
- dataset present in the given pathway map. The values can be transford to
- a log10 scale
+ dataset present in the given pathway map.
  :param data: pandas.DataFrame with data already processed by KEGGPathway
  :param samples: list - column names of the dataset corresponding to expression values
  :param ko_column: str - column with KOs to represent
  :param mmaps2taxa: dict - of taxa to color
  :param taxa_column: str - column with taxonomic classification
  :param output_basename: string - basename of outputs
- :param log: bol - convert the expression values to logarithmic scale?
  """
  if mmaps2taxa is not None:
  data = data[data[taxa_column].isin(mmaps2taxa[self.name.split('ko')[1]])]
@@ -465,7 +454,7 @@ def differential_expression_sample(
  return 1
  df = df.groupby('Boxes')[samples].sum()
 
- self.pathway_boxes_differential(df, log)
+ self.pathway_boxes_differential(df)
 
  name = self.name.split(':')[-1]
  name_pdf = f'{output_basename}_{name}.pdf'
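
Note on `pathway_boxes_differential`: after this commit only the linear normalization path remains, followed by conversion to colors and then to hex strings. The sketch below illustrates that value-to-hex pipeline in plain Python, without matplotlib. It is an assumption-laden stand-in: `linear_norm`, `blue_white_red`, `rgb_to_hex` and `value_to_hex` are hypothetical names, the three-point blue/white/red gradient only approximates a diverging colormap such as `coolwarm`, and the sketch clips out-of-range values, which is a choice made here rather than behavior taken from the project.

```python
def linear_norm(value, vmin, vmax):
    """Map value linearly onto [0, 1], clipping values outside [vmin, vmax]."""
    if vmax == vmin:
        # Degenerate range: every value maps to the low end of the scale.
        return 0.0
    return min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))


def blue_white_red(t):
    """Interpolate blue -> white -> red for t in [0, 1]; returns RGB floats in [0, 1]."""
    if t < 0.5:
        s = t / 0.5
        return (s, s, 1.0)
    s = (t - 0.5) / 0.5
    return (1.0, 1.0 - s, 1.0 - s)


def rgb_to_hex(rgb):
    """Convert an RGB tuple of floats in [0, 1] to a '#rrggbb' string."""
    return "#" + "".join(f"{round(channel * 255):02x}" for channel in rgb)


def value_to_hex(value, vmin, vmax):
    """Chain normalization, color lookup and hex conversion for one value."""
    return rgb_to_hex(blue_white_red(linear_norm(value, vmin, vmax)))


# Invented expression values for three samples in one box.
values = [0.0, 5.0, 10.0]
colors = [value_to_hex(v, min(values), max(values)) for v in values]
print(colors)
```
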