Updated documentation
Sanitization of input file
iquasere committed Nov 7, 2023
1 parent 47f3464 commit 0eda356
Showing 4 changed files with 36 additions and 50 deletions.
27 changes: 9 additions & 18 deletions README.md
@@ -5,7 +5,6 @@ A tool for representing genomic potential and transcriptomic expression into KEG
 * [Features](https://github.com/iquasere/KEGGCharter#features)
 * [Installation](https://github.com/iquasere/KEGGCharter#installation)
 * [Running KEGGCharter](https://github.com/iquasere/KEGGCharter#running-keggcharter)
-* [Testing KEGGCharter](https://github.com/iquasere/KEGGCharter#testing-keggcharter)
 * [Outputs](https://github.com/iquasere/KEGGCharter#outputs)
 * [Arguments for KEGGCharter](https://github.com/iquasere/KEGGCharter#arguments-for-keggcharter)
 * [Referencing KEGGCharter](https://github.com/iquasere/KEGGCharter#referencing-keggcharter)
@@ -27,32 +26,24 @@ conda install -c conda-forge -c bioconda keggcharter

 ## Running KEGGCharter

-To run KEGGCharter, an input file must be supplied - see ```Testing KEGGCharter``` section - and columns
-with genomic and/or transcriptomic information, as well as one column with either KEGG IDs, KOs or EC numbers, must be
-present in the file and specified through the command line.
-```
-keggcharter.py -f input_file.tsv -o output_folder -mgc mg_column1,mg_column2 -mtc mt_column1,mt_column2 ...
-```

-## Testing KEGGCharter
+To run KEGGCharter, an input file must be supplied. This file only needs to contain one column with either KEGG IDs, KOs or EC numbers. Beyond that:
+* to obtain distinct taxonomic identifications in the maps, a column with taxonomic identification must be specified with the `-tc` parameter. If no such column exists, KEGGCharter must be run with the `-it` parameter.
+* to obtain maps with differential expression, at least one column with genomic and/or transcriptomic quantification must be specified with the `-qcol` parameter. If no such column exists, KEGGCharter must be run with the `-iq` parameter.

-An example input file is available [here](https://github.com/iquasere/KEGGCharter/blob/master/MOSCA_Entry_Report.xlsx).
-This is one output of [MOSCA](https://github.com/iquasere/MOSCA), which can be directly inputted to KEGGCharter to obtain
-metabolic representations by running:
+An example input file is available [here](https://github.com/iquasere/KEGGCharter/blob/master/cicd/keggcharter_input.tsv).
+It contains all fields referenced above, and should be used as guidance for building inputs for KEGGCharter.
+To obtain metabolic representations for "Methane Metabolism" and "Fatty Acid Degradation" with KEGGCharter, for this input file, KEGGCharter can be run with the following command:
 ```
-keggcharter -f MOSCA_Entry_Report.xlsx -gcol mg -tcol mt_0.01a_normalized,mt_1a_normalized,mt_100a_normalized,mt_0.01b_normalized,mt_1b_normalized,mt_100b_normalized,mt_0.01c_normalized,mt_1c_normalized,mt_100c_normalized -keggc "Cross-reference (KEGG)" -o test_keggcharter -tc "Taxonomic lineage (GENUS)"
+keggcharter -f keggcharter_input.tsv -o test_keggcharter -qcol mt_0.01a,mt_1a,mt_100a,mt_0.01b,mt_1b,mt_100b,mt_0.01c,mt_1c,mt_100c -keggc "KEGG ID" -tc "Species" -mm 00680,00071
 ```
-Just make sure ```MOSCA_Entry_Report.xlsx``` is in the present folder, or indicate the path to it. This command will create
-representations for all 252 default maps of KEGGCharter. If you want to represent for less or more, run with the ```--metabolic-maps```
-parameter to indicate to KEGGCharter what maps to run (comma separated).

 ### First time KEGGCharter runs it will take a long time

 KEGGCharter needs KGMLs and EC numbers to boxes relations, which it will automatically retrieve for every map inputted.
 This might take some time, but you only need to run it once.

-Default directory for storing these files is the folder containing the ```keggcharter.py``` script, but it can be customized
-with the ```--resources-directory``` parameter.
+Default directory for storing these files is the folder containing the `keggcharter` script, but it can be customized
+with the `-rd` parameter.

 ## Outputs
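
Note on the README hunk above: as a rough illustration of the input requirements it describes, the sketch below writes a minimal tab-separated file with one KO column, one taxonomy column and two quantification columns. The column names (`KO`, `Species`, `mt_a`, `mt_b`), the helper `write_minimal_input` and all row values are made up for illustration. Only the flags in the final comment (`-f`, `-o`, `-koc`, `-tc`, `-qcol`, `-mm`) are taken from this commit.

```python
import csv

# Hypothetical rows: one KO per row, a taxon, and two quantification columns.
ROWS = [
    {"KO": "K00399", "Species": "Methanosarcina barkeri", "mt_a": 12.0, "mt_b": 30.5},
    {"KO": "K00401", "Species": "Methanosarcina barkeri", "mt_a": 7.5, "mt_b": 0.0},
]


def write_minimal_input(path, rows=ROWS):
    """Write rows to a tab-separated file and return the header that was written."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames


if __name__ == "__main__":
    print(write_minimal_input("minimal_input.tsv"))
    # A matching call might then look like this (flags as they appear in this commit):
    # keggcharter -f minimal_input.tsv -o out -koc KO -tc Species -qcol mt_a,mt_b -mm 00680
```

Whether the taxon names are accepted depends on the KEGG organism list referenced in the `--taxa-column` help text, so the values above should be treated as placeholders.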

21 changes: 11 additions & 10 deletions cicd/meta.yaml
@@ -1,6 +1,6 @@
 {% set name = "keggcharter" %}
-{% set version = "0.2.3" %}
-{% set sha256 = "905268158c74058228faca2b34d44eae8f84b23bfa5dee49285635cf59684d7f" %}
+{% set version = "0.7.0" %}
+{% set sha256 = "8e79e92eef943d0b7cfbd035b4d3d9ef5f6f51a6f4d43f8d58989fd849625696" %}

 package:
   name: {{ name|lower }}
@@ -13,13 +13,14 @@ source:
 build:
   noarch: generic
   number: 0
-  script: |
-    dir="${PREFIX}/share/KEGGCharter"
-    mkdir -p "${dir}"
-    cp *.py resources/KEGGCharter_prokaryotic_maps.txt "${dir}/"
-    mkdir -p "${PREFIX}/bin"
-    chmod +x "${dir}/keggcharter.py"
-    ln -s "${dir}/keggcharter.py" "${PREFIX}/bin/keggcharter"
+  run_exports:
+    - {{ pin_subpackage(name, max_pin="x.x") }}
+  script: >
+    mkdir -p $PREFIX/bin &&
+    mkdir -p $PREFIX/share &&
+    cp *.py resources/KEGGCharter_prokaryotic_maps.txt $PREFIX/share &&
+    chmod +x $PREFIX/share/keggcharter.py &&
+    ln -s $PREFIX/share/keggcharter.py $PREFIX/bin/keggcharter
 requirements:
   run:
@@ -34,7 +35,7 @@ requirements:

 test:
   commands:
-    - keggcharter.py -v
+    - keggcharter -v

 about:
   home: https://github.com/iquasere/KEGGCharter
15 changes: 10 additions & 5 deletions keggcharter.py
@@ -18,7 +18,7 @@

 from keggpathway_map import KEGGPathwayMap, expand_by_list_column

-__version__ = "0.7.0"
+__version__ = "0.7.1"


 def get_arguments():
@@ -40,9 +40,8 @@ def get_arguments():
     parser.add_argument("-keggc", "--kegg-column", help="Column with KEGG IDs.")
     parser.add_argument("-koc", "--ko-column", help="Column with KOs.")
     parser.add_argument("-ecc", "--ec-column", help="Column with EC numbers.")
-    # TODO - test this argument without UniProt shenanigans
     parser.add_argument(
-        "-tc", "--taxa-column", default='Taxonomic lineage (GENUS)',
+        "-tc", "--taxa-column", default=None,
         help="Column with the taxa designations to represent with KEGGCharter."
         " NOTE - for valid taxonomies, check: https://www.genome.jp/kegg/catalog/org_list.html")
     parser.add_argument(
@@ -74,7 +73,8 @@ def get_arguments():
         help="Outputs KEGG maps IDs and descriptions to the console (so you may pick the ones you want!)")

     args = parser.parse_args()

+    if not os.path.isfile(args.file):
+        exit("Input file doesn't exist! Exiting...")
     args.output = args.output.rstrip('/')
     for directory in [args.output] + [f'{args.resources_directory}/{folder}' for folder in ['', 'kc_kgmls', 'kc_csvs']]:
         if not os.path.isdir(directory):
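
Note on the hunk above: the added existence check calls `exit()` after parsing. One alternative, shown here only as a sketch and not as what KEGGCharter does, is to let `argparse` validate the path through a `type=` callable, so that a missing file is reported as an ordinary usage error. The function `existing_file` and the reduced parser in `build_parser` are hypothetical.

```python
import argparse
import os
import tempfile


def existing_file(path):
    """argparse `type=` callable: accept the path only if it is a regular file."""
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(f"input file '{path}' doesn't exist")
    return path


def build_parser():
    # Reduced, hypothetical parser containing only the input-file option.
    parser = argparse.ArgumentParser(description="file-check sketch")
    parser.add_argument("-f", "--file", required=True, type=existing_file, help="Input file.")
    return parser


if __name__ == "__main__":
    # Demonstrate with a temporary file that is guaranteed to exist.
    with tempfile.NamedTemporaryFile(suffix=".tsv") as tmp:
        args = build_parser().parse_args(["-f", tmp.name])
        print(args.file == tmp.name)
```

With this approach a bad path makes `argparse` print the usage line together with the error message and exit with status 2, the same way it handles any other invalid argument.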
@@ -449,7 +449,7 @@ def chart_map(
         kegg_pathway_map = KEGGPathwayMap(pathway=mmap, ec_list=ec_list)
         kegg_pathway_map.differential_expression_sample(
             data, quantification_columns, ko_column, mmaps2taxa=mmaps2taxa, taxa_column=taxa_column,
-            output_basename=f'{output}/differential', log=False)
+            output_basename=f'{output}/differential')
     plt.close()


@@ -496,6 +496,11 @@ def read_input():
         args.taxa_list = args.input_taxonomy
     args.metabolic_maps = args.metabolic_maps.split(',')
     args.quantification_columns = args.quantification_columns.split(',')
+    # check if all columns supposed to be in the input data are in the input data
+    for col in [args.taxa_column, args.kegg_column, args.ko_column, args.ec_column] + args.quantification_columns:
+        if col is not None:
+            if col not in data.columns:
+                exit(f'"{col}" column not in input file! Exiting...')
     timed_message('Arguments valid.')
     return args, data
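
Note on the hunk above: the added check stops at the first missing column. A small variation, sketched here on the assumption that reporting every missing column at once is useful, collects them before exiting. The helper `missing_columns` is hypothetical and works on plain sequences of column names, so it can be tried without pandas. With a DataFrame one would pass `data.columns` as the second argument.

```python
def missing_columns(requested, available):
    """Return the requested column names (ignoring None) that are absent from available, in order."""
    available = set(available)
    return [col for col in requested if col is not None and col not in available]


if __name__ == "__main__":
    # Made-up header and request list; None stands for an option that was not set.
    header = ["KEGG ID", "Species", "mt_1a"]
    requested = ["Species", "KEGG ID", None, None, "mt_1a", "mt_1b"]
    missing = missing_columns(requested, header)
    if missing:
        print(f"Columns not in input file: {', '.join(missing)}")
```

Filtering out `None` inside the helper mirrors the `if col is not None` guard in the commit, since unset column options default to `None`.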

23 changes: 6 additions & 17 deletions keggpathway_map.py
@@ -282,27 +282,20 @@ def pathway_box_list(self, taxa_in_box, dic_colors, maxshared=10):
             if self.orthologs[boxidx].graphics[0].width is not None:
                 create_tile_box(self.orthologs[boxidx])

-    def pathway_boxes_differential(self, dataframe, log=True, colormap="coolwarm"):
+    def pathway_boxes_differential(self, dataframe, colormap="coolwarm"):
         """
         Represents expression values present in a dataframe in the
         pathway map
         :param dataframe: pandas DataFrame with each column representing a sample
         and index corresponding to int list index of the ortholog element in the
         pathway
-        :param log: bool providing the option for a log transformation of data
         :param colormap: str representing a custom matplotlib colormap to be used
         """
-        if log:
-            norm = cm.colors.LogNorm(vmin=dataframe.min().min(), vmax=dataframe.max().max())
-        else:
-            norm = cm.colors.Normalize(vmin=dataframe.min().min(), vmax=dataframe.max().max())

+        norm = cm.colors.Normalize(vmin=dataframe.min().min(), vmax=dataframe.max().max())
         colormap = cm.get_cmap(colormap)
-        dataframe = dataframe.apply(conv_value_rgb, args=(colormap, norm))  # TODO - Doesn't work if using log
+        dataframe = dataframe.apply(conv_value_rgb, args=(colormap, norm))
         dataframe = dataframe.apply(conv_rgb_hex)

-        dataframe = dataframe[dataframe.columns.tolist()]

         nrboxes = len(dataframe.columns.tolist())  # number of samples

         for box in dataframe.index.tolist():
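
Note on the hunk above: after this change only linear normalization remains. The pure-Python sketch below shows the arithmetic behind a linear and a log10 mapping onto [0, 1]. The helpers `linear_norm`, `log_norm` and `to_hex` are hypothetical and not part of KEGGCharter, and the blue-to-red ramp merely stands in for a real matplotlib colormap. The commit does not say why the log option "doesn't work"; the behaviour for zero values shown here is one possible reason, stated as an assumption.

```python
import math


def linear_norm(value, vmin, vmax):
    """Map value linearly onto [0, 1]; a constant range maps to 0.0."""
    if vmax == vmin:
        return 0.0
    return (value - vmin) / (vmax - vmin)


def log_norm(value, vmin, vmax):
    """Map value onto [0, 1] on a log10 scale; only defined for strictly positive inputs."""
    if min(value, vmin, vmax) <= 0:
        raise ValueError("log scale needs strictly positive values")
    if vmax == vmin:
        return 0.0
    return (math.log10(value) - math.log10(vmin)) / (math.log10(vmax) - math.log10(vmin))


def to_hex(fraction):
    """Blue (0.0) to red (1.0) ramp as a '#rrggbb' string; a stand-in for a colormap."""
    red = round(255 * fraction)
    blue = 255 - red
    return f"#{red:02x}00{blue:02x}"


if __name__ == "__main__":
    # Made-up summed expression values for one box across three samples.
    values = [0.0, 5.0, 10.0]
    print([to_hex(linear_norm(v, min(values), max(values))) for v in values])
```

In this sketch a sample with a summed value of 0 is handled by the linear mapping but makes `log_norm` raise, which is why a log scale would need a pseudocount or masking before it could be applied to such data.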
@@ -423,8 +416,6 @@ def genomic_potential_taxa(
         name_pdf = f'{output_basename}_{name}.pdf'
         self.to_pdf(name_pdf)

-        # TODO - legend should be adjusted for the maps - otherwise, makes no sense to have one legend for each map -
-        # they all become the same, except for "Other taxa"
         self.create_potential_legend(dic_colors.values(), dic_colors.keys(), name_pdf.replace('.pdf', '_legend.png'))

         self.add_legend(
@@ -441,18 +432,16 @@ def differential_colorbar(self, dataframe, filename):

     def differential_expression_sample(
             self, data, samples, ko_column, mmaps2taxa, taxa_column='Taxonomic lineage (GENUS)',
-            output_basename=None, log=True):
+            output_basename=None):
         """
         Represents in small heatmaps the expression levels of each sample on the
-        dataset present in the given pathway map. The values can be transformed to
-        a log10 scale
+        dataset present in the given pathway map.
         :param data: pandas.DataFrame with data already processed by KEGGPathway
         :param samples: list - column names of the dataset corresponding to expression values
         :param ko_column: str - column with KOs to represent
         :param mmaps2taxa: dict - of taxa to color
         :param taxa_column: str - column with taxonomic classification
         :param output_basename: string - basename of outputs
-        :param log: bool - convert the expression values to logarithmic scale?
         """
         if mmaps2taxa is not None:
             data = data[data[taxa_column].isin(mmaps2taxa[self.name.split('ko')[1]])]
@@ -465,7 +454,7 @@ def differential_expression_sample(
             return 1
         df = df.groupby('Boxes')[samples].sum()

-        self.pathway_boxes_differential(df, log)
+        self.pathway_boxes_differential(df)

         name = self.name.split(':')[-1]
         name_pdf = f'{output_basename}_{name}.pdf'
