Releases: iquasere/KEGGCharter

"--resume" now evaluates the files already produced

11 Jan 17:03

For both data_for_charting.tsv and taxon_to_mmap_to_orthologs.json:

  • if the --resume parameter was used and the file is found, KEGGCharter won't generate it again.
  • otherwise, KEGGCharter generates the data again and overwrites the file if it exists.
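
The --resume behaviour described above can be sketched as follows. This is a minimal illustration rather than KEGGCharter's actual code: load_or_generate, its arguments and the generate callback are hypothetical names.

```python
import tempfile
from pathlib import Path


def load_or_generate(path, resume, generate):
    """Return (content, regenerated) for one intermediate file.

    Hypothetical helper: `generate` stands in for whatever builds the data.
    """
    path = Path(path)
    if resume and path.is_file():
        # --resume was used and the file is found: do not generate it again.
        return path.read_text(), False
    # Otherwise generate the data again, overwriting the file if it exists.
    content = generate()
    path.write_text(content)
    return content, True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data_for_charting.tsv"
    first = load_or_generate(target, resume=True, generate=lambda: "run 1")
    second = load_or_generate(target, resume=True, generate=lambda: "run 2")
    third = load_or_generate(target, resume=False, generate=lambda: "run 3")
```

In the demo, the second call reuses the file written by the first, while the third call (no resume) overwrites it.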

Also, an important fix

When retrieving KEGG taxa prefixes, KEGGCharter now checks type(taxa) == str instead of taxa != np.nan. The old comparison was always true, because NaN is not equal to any value, itself included.
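
A short sketch of why the old check could not filter anything. It uses a plain float NaN (np.nan is an ordinary float NaN too), and "eco" is only an example prefix; has_prefix is a hypothetical name.

```python
nan = float("nan")  # np.nan is an ordinary float NaN like this one

# NaN is unequal to every value, itself included, so `taxa != nan`
# is True for missing and present values alike.
old_check_missing = nan != nan
old_check_present = "eco" != nan


def has_prefix(taxa):
    # Same intent as the new check: only strings are usable prefixes.
    return isinstance(taxa, str)


new_check_missing = has_prefix(nan)
new_check_present = has_prefix("eco")
```

Both old checks evaluate to True, so missing values slipped through; the type check tells them apart.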

New needs, new regexes

05 Jan 14:06

Changed regex for EC numbers to account for provisional ECs

Changed ^(\d+)(\.(\d+|-)){3}$ to ^(\d+)(\.(\d+|-)){2}(\.(.*))?$, which accepts provisional EC numbers (e.g., 1.1.1.n1).
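
The effect of the change can be checked directly with Python's re module. The two patterns are copied from the note above; the list of example values is illustrative.

```python
import re

OLD_EC = re.compile(r"^(\d+)(\.(\d+|-)){3}$")
NEW_EC = re.compile(r"^(\d+)(\.(\d+|-)){2}(\.(.*))?$")

ec_examples = ["1.1.1.1", "2.7.1.-", "1.1.1.n1", "1.1.1", "1.1"]

# Map each example to (accepted by old pattern, accepted by new pattern).
ec_results = {
    ec: (bool(OLD_EC.match(ec)), bool(NEW_EC.match(ec))) for ec in ec_examples
}
```

Note that the new pattern is more permissive in two ways: the fourth field is optional (so 1.1.1 is accepted) and, when present, it may contain any text.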

Changed regex for KEGG IDs to account for other taxonomy codes

Changed ^[A-Za-z]{3}:.+$ to ^[A-Za-z]+:.+$ to accept taxonomy codes with fewer or more than three characters (e.g., pall:UYA_22060).
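
The same kind of check for the KEGG ID patterns. Both patterns come from the note above; the example IDs other than pall:UYA_22060 are illustrative.

```python
import re

OLD_KEGG = re.compile(r"^[A-Za-z]{3}:.+$")
NEW_KEGG = re.compile(r"^[A-Za-z]+:.+$")

kegg_examples = ["eco:b0001", "pall:UYA_22060", "pall:", ":b0001"]

# Map each example to (accepted by old pattern, accepted by new pattern).
kegg_results = {
    kid: (bool(OLD_KEGG.match(kid)), bool(NEW_KEGG.match(kid)))
    for kid in kegg_examples
}
```

Both patterns still require at least one letter before the colon and at least one character after it.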

Also, some bug fixes

  • One of the weirdest bugs ever - pandas.DataFrame.groupby has a maximum number of columns (20).
  • Fix on saving box2taxon when it is empty
  • Also removed some code left over from when only one functional column was considered at a time

Important fixes on ID cross-referencing, validation of functional ID columns and colormap picking

03 Jan 18:43

Validation of input data columns implemented

Four regexes will check if values in columns are valid.

  • KEGG ID: ^[A-Za-z]{3}:.+$
  • KO: ^K\d{5}$
  • EC number: ^(\d+)(\.(\d+|-)){3}$
  • COG: ^COG\d{4}$

A cell can hold several comma-separated values, but each value between commas must match the corresponding regex.
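
A minimal sketch of this validation, using the four patterns listed above as they stood in this release. The function name cell_is_valid and the dictionary layout are assumptions made for illustration, not KEGGCharter's own code.

```python
import re

PATTERNS = {
    "KEGG ID": re.compile(r"^[A-Za-z]{3}:.+$"),
    "KO": re.compile(r"^K\d{5}$"),
    "EC number": re.compile(r"^(\d+)(\.(\d+|-)){3}$"),
    "COG": re.compile(r"^COG\d{4}$"),
}


def cell_is_valid(cell, kind):
    """True if every comma-separated value in the cell matches the pattern."""
    return all(PATTERNS[kind].match(value) for value in cell.split(","))


checks = {
    "two KOs": cell_is_valid("K00001,K00002", "KO"),
    "space after comma": cell_is_valid("K00001, K00002", "KO"),
    "semicolon": cell_is_valid("K00001;K00002", "KO"),
    "two ECs": cell_is_valid("1.1.1.1,2.7.1.-", "EC number"),
    "one COG": cell_is_valid("COG0001", "COG"),
}
```

A space after the comma or a semicolon delimiter makes the cell invalid, since the resulting values no longer match the pattern.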

Also, several fixes

Fix on adding new ids from API

Merging new IDs with old ones misaligned the old and new columns: the new IDs ended up in new columns disconnected from the rest of the dataframe. This is now fixed.

Differential colormap starts at 0

Before, the colormap was generated between the minimum and maximum values of the dataframe. Now it runs from 0 to the maximum of the dataframe.
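
To see what the change means for the colours, here is a small sketch that assumes values are mapped linearly onto the colormap. The scale function and the quantification values are made up for illustration.

```python
def scale(value, vmin, vmax):
    """Linearly map value from [vmin, vmax] onto [0, 1]."""
    return (value - vmin) / (vmax - vmin)


values = [10.0, 20.0, 40.0]  # made-up quantification values

# Before: the range ran from the minimum to the maximum of the data.
before = [scale(v, min(values), max(values)) for v in values]
# Now: the range runs from 0 to the maximum of the data.
now = [scale(v, 0.0, max(values)) for v in values]
```

Under the old range the smallest value always lands at the very bottom of the colormap, whatever its magnitude; starting at 0 keeps positions proportional to the values.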

Implemented new parameter for choosing colormap of differential maps

--differential-colormap allows choosing a colormap other than the default (summer). Valid values are listed in the matplotlib documentation.

Also, KEGGCharter now only creates output dirs when it passes input file validation

Fix on having cog2ko available

27 Dec 10:04

This must be updated in the meta. The line changes from:

cp *.py resources/KEGGCharter_prokaryotic_maps.txt resources/cog2ko_keggcharter.tsv $PREFIX/share &&

to

cp *.py *.txt *.tsv $PREFIX/share &&

Fix on checking for columns of functional IDs

22 Dec 18:20

KEGGCharter was only looking for KEGG IDs, KOs and EC numbers columns to check if some functional IDs column was inputted.

This would make it exit with error if only a column with COG IDs was inputted.

Now it also looks for COG columns, and accepts input with only a COG IDs column.

Also, I am trying to understand why it doesn't find cog2ko.tsv.

KEGGCharter as a proper tool of science

20 Dec 18:58

Implemented COG2KO

This idea belongs to Lovro Grum. For each KO, COGs are extracted from its KEGG HTML page. This mapping is then reversed to give a COG to KO conversion.
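
The reversal step can be sketched as a dictionary inversion. The function name and the KO/COG pairs below are placeholders, not real associations and not KEGGCharter's own code.

```python
from collections import defaultdict


def invert_ko_to_cogs(ko_to_cogs):
    """Turn {KO: [COG, ...]} into {COG: [KO, ...]}."""
    cog_to_kos = defaultdict(set)
    for ko, cogs in ko_to_cogs.items():
        for cog in cogs:
            cog_to_kos[cog].add(ko)
    # Sort for a stable, readable result.
    return {cog: sorted(kos) for cog, kos in cog_to_kos.items()}


# Placeholder data standing in for what is scraped from the KO pages.
scraped = {"K00001": ["COG0001", "COG0002"], "K00002": ["COG0002"]}
cog2ko = invert_ko_to_cogs(scraped)
```

A COG listed on several KO pages ends up mapped to all of those KOs.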

New database, making KEGGCharter far more powerful! Makes for a great synergy with reCOGnizer.

Because this is web scraping, 403 (Forbidden) errors and timeouts may occur often.

KEGGCharter waits between failed tries and, at the end, checks for any KOs whose HTML pages were not retrieved and tries to retrieve those again.
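
A rough sketch of such a retry scheme, with a fake fetcher so it runs offline. Everything here is an assumption for illustration: fetch_all, its parameters, the exception types standing in for 403 and timeout errors, and the number of tries are not taken from KEGGCharter.

```python
import time


def fetch_all(kos, fetch, max_tries=3, wait_seconds=0.0):
    """Fetch one page per KO, retrying failures; return (pages, missing)."""
    pages, failed = {}, []
    for ko in kos:
        for _ in range(max_tries):
            try:
                pages[ko] = fetch(ko)
                break
            except (PermissionError, TimeoutError):
                # Stand-ins for "403 - Forbidden" and timeouts: wait, then retry.
                time.sleep(wait_seconds)
        else:
            failed.append(ko)
    # At the end, try once more for any KO whose page was not retrieved.
    missing = []
    for ko in failed:
        try:
            pages[ko] = fetch(ko)
        except (PermissionError, TimeoutError):
            missing.append(ko)
    return pages, missing


calls = {}


def flaky_fetch(ko):
    # Fake fetcher for the demo: K00002 times out on its first two calls.
    calls[ko] = calls.get(ko, 0) + 1
    if ko == "K00002" and calls[ko] < 3:
        raise TimeoutError("simulated timeout")
    return f"<html>{ko}</html>"


pages, missing = fetch_all(["K00001", "K00002"], flaky_fetch, max_tries=2)
```

In the demo, K00002 fails both regular tries and is only recovered by the final pass.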

Sanitization of input file

Checks if:

  • inputted columns exist in the input file
  • the --kegg-column, --ko-column, --ec-column and --cog-column columns are free of invalid values / bad characters (" " and ";").

Added parameter for dividing quantification of each enzyme by the KOs assigned to it

When set, the --distribute-quantification parameter will instruct KEGGCharter to split the quantification of each enzyme by all the KOs that were assigned to it.

This information is outputted in data_for_charting.tsv.
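
As an illustration of the idea, the sketch below splits an enzyme's quantification equally among its KOs. The equal split, the function name and the numbers are assumptions; the release note does not spell out the exact formula.

```python
def distribute_quantification(quantification, kos):
    """Split one enzyme's quantification equally among its assigned KOs."""
    share = quantification / len(kos)
    return {ko: share for ko in kos}


# Made-up example: one enzyme quantified at 90.0, assigned to three KOs.
shares = distribute_quantification(90.0, ["K00001", "K00002", "K00003"])
```

Each KO receives 30.0 here, so the total over the KOs stays equal to the enzyme's original quantification.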

New tests for several different parameters' combinations

show-available-maps for --show-available-maps parameter.

input-quantification-and-taxonomy for --input-taxonomy and --input-quantification parameters.

include-missing-genomes for --include-missing-genomes parameter.

map-all for --map-all parameter.

New output folders and writing of JSON information

KEGGCharter now stores metabolic maps representations in a maps folder. No brainer.

KEGGCharter additionally stores the information concerning the maps into a json folder. This folder will contain the dictionaries used for generating both the potential and differential maps.

"Potential" JSONs come in the form {box_id: [tax1, tax2, ...]}.

"Differential" JSONs come in the form {box_id: [col1, col2, ...]}. In the future, these should include the quantification value instead.

Also added lxml as dependency.
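
The two JSON shapes described in this release can be pictured with Python's json module. The box IDs, taxa and column names below are placeholders.

```python
import json

# "Potential" form: box ID -> taxa; "differential" form: box ID -> columns.
potential = {12: ["tax1", "tax2"], 15: ["tax2"]}
differential = {12: ["col1", "col2"]}

potential_text = json.dumps(potential)
reloaded = json.loads(potential_text)
```

One detail worth knowing when reading these files back: JSON object keys are always strings, so integer box IDs come back as "12" and "15".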

Sanitization of input file

07 Nov 18:52

Forces input file to have the columns specified through the command line.

Applies to taxa-column, kegg-column, ko-column, ec-column and columns specified through --quantification-columns.

Information from "kegg-column", "ko-column" and "ec-column" is now all combined

20 Sep 16:21

Multiple new columns are now outputted, depending on the source of information, e.g., KO (kegg-column) contains the KOs obtained from the IDs on the column specified with -keggc.
All KOs obtained are grouped into the KO (KEGGCharter) column, now the only one used by the charting functions.

Multiple IDs in the same cell now accepted and considered properly

Comma , is the only delimiter accepted for parsing multiple IDs inside the same cell.
Multiple KEGG IDs were accepted before, if separated by semicolons (;). This is now deprecated, and they must come comma-separated.
"Data" dataframe extends and compresses with each cycle of ID conversion.

Simplified input of quantification columns

No more --genomic-columns nor --transcriptomic-columns, only --quantification-columns (-tcols) now.
All maps ("potential" and "differential") are produced for those columns.

"gene" features now also mapped

KEGGCharter was only considering the orthologs attribute of the Pathway instances, but some boxes are present in the KGML as gene features. Now, KEGGCharter considers those as well.

Restructured the repo, simplified CICD, improved output to the command line, performance improvements

Maps inside resources folder, all yamls and CI files in cicd folder.
Much smaller keggcharter_input.tsv is still enough to build nice maps.
Had to specify version of libarchive (3.6.2=h039dbb9_1) in the Dockerfile.
More comprehensive messages.
Lighter progress bars.
The --map-all workflow was running the write_kgmls function for all taxa. It now runs only for ko and associates the information with all taxa. Much faster, less dumb.

New options for dealing with tax information

29 May 09:17

The original KEGGCharter workflow attempts to download taxon-specific KGMLs for organisms in KEGG Genomes (Fig. 1).

Fig. 1 - Original KEGGCharter workflow. Only arcticus had KOs attributed with TCA cycle functions that were also present in the TCA cycle KGML for that taxon.

This type of workflow uses both taxon-specific information and results from the datasets inputted. All functions represented are validated by KEGG (i.e., those functions are available for those organisms), but many identifications may be missing, since information at KEGG is often incomplete.

Setting "--include-missing-genomes" represents organisms that are not in KEGG Genomes

Organisms that are not identified in KEGG Genomes can still be represented, if running KEGGCharter with the option --include-missing-genomes. All functions for the KOs identified for that organism will be represented (Fig. 2).

Fig. 2 - KEGGCharter output expanded with --include-missing-genomes parameter. hydrocola is not present in KEGG Genomes, but all functions attributed to its KOs are still represented.

This setting still yields validated information for the taxa that are present in KEGG Genomes, while also representing organisms that are not. It should offer the best compromise between false positives and false negatives.

Setting "--map-all" ignores KEGG Genomes completely, and represents all functions identified

Functions that are not present in organism-specific KGMLs can still be represented by running KEGGCharter with the option --map-all. This bypasses all taxon-specific KGMLs and maps all functions for all KOs present in the input dataset (Fig. 3).

Fig. 3 - KEGGCharter output expanded with --map-all parameter. No functions for oleophylus and franklandus were simultaneously present in the KOs identified and available in their KGMLs. In this case, the requirement for presence in the KGMLs is bypassed, and all functions are represented for all taxa.

This setting represents the most information on the KEGG maps, and will produce the most colourful representations, but will likely return many false positives. Maps produced should be analyzed with caution. This setting may be required, however, if information for organisms in KEGG Genomes is very incomplete.
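
The three workflows can be summarised in a few lines of set logic. This is a conceptual sketch only: kos_to_represent and its arguments are hypothetical, and the real tool works on KGML files rather than plain sets.

```python
def kos_to_represent(identified, kgml, include_missing_genomes=False, map_all=False):
    """Pick the KOs to draw for one taxon.

    identified: KOs found for the taxon in the input dataset.
    kgml: KOs in the taxon-specific KGML, or None if the taxon
          is not in KEGG Genomes.
    """
    identified = set(identified)
    if map_all:
        # --map-all: bypass taxon-specific KGMLs entirely.
        return identified
    if kgml is None:
        # Taxon absent from KEGG Genomes: shown only with --include-missing-genomes.
        return identified if include_missing_genomes else set()
    # Original workflow: keep only KOs also present in the taxon's KGML.
    return identified & set(kgml)


found = ["K00001", "K00002"]  # placeholder KOs
original = kos_to_represent(found, kgml=["K00002", "K00003"])
absent_default = kos_to_represent(found, kgml=None)
absent_included = kos_to_represent(found, kgml=None, include_missing_genomes=True)
everything = kos_to_represent(found, kgml=["K00002", "K00003"], map_all=True)
```

Read from top to bottom, the options trade fewer false negatives for more false positives.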

Fixed mapping boxes' IDs and submitting too many IDs to KEGG

26 May 14:52

Major fix in mapping boxes IDs and positions in orthologs array

The bug lay in the difference between mapping by box.id and mapping by the index in the pathway.orthologs array.

Also changed default "step" to 40

KEGG's API reports fewer ID mappings if many IDs are submitted in the same request.
Submitting fewer IDs per request takes much longer, but all information is obtained.
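
Submitting IDs in groups of "step" can be sketched as simple batching. The function name and the placeholder IDs are assumptions; only the default of 40 comes from this release.

```python
def batches(ids, step=40):
    """Yield consecutive groups of at most `step` IDs."""
    for start in range(0, len(ids), step):
        yield ids[start:start + step]


ids = [f"id{i}" for i in range(100)]  # placeholder IDs
sizes = [len(batch) for batch in batches(ids)]
```

With 100 IDs and a step of 40, three requests are needed, the last one carrying the remaining 20 IDs.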