docs + minor fixes + bubblecaller
dubssieg committed Sep 25, 2023
1 parent 5fc55ae commit 032fe2f
Showing 3 changed files with 258 additions and 5 deletions.
155 changes: 154 additions & 1 deletion README.md
@@ -9,6 +9,9 @@

A command-line tool that implements a range of operations on GFA-like graphs, such as extracting a subgraph from a pangenome graph or adding offsets to it. It can compare the topology of graphs that contain the same set of sequences, and it renders pangenome graphs as interactive HTML files.
Uses the [gfagraphs library](https://pypi.org/project/gfagraphs/) to load and manipulate pangenome graphs.
Details about the implementation can be [found here](https://hal.science/hal-04213245) (in French only, sorry).

![](https://media.discordapp.net/attachments/878301351753429072/1154788148577058886/Screenshot_from_2023-09-22_16-35-22.png)

## Installation

@@ -29,8 +32,158 @@ Other tools are in the `scripts` folder.
The following tools are available through `pangraphs`:

- **grapher** creates interactive graph representation from a GFA file
- **stats** gathers basic stats from the input GFA file
- **reconstruct** recreates the linear sequences from the graph
- **offset** adds relative position information as a tag in GFA file
- **isolate** extracts a subgraph from positions in the paths
- **neighborhood** extracts a subgraph from a set of nodes around a node
- **edit** computes an edit distance between variation graphs

## Render interactive HTML view

With this command, you can create an interactive HTML view of your graph, with the sequences in the nodes (S-lines) and the nodes connected by edges (L-lines). If additional information is given (such as W-lines or P-lines), supplementary edges are drawn to show the path that each genome follows in the graph.

```bash
pangraphs grapher [-h] [-b BOUNDARIES [BOUNDARIES ...]] file output

positional arguments:
  file                  Path to a gfa-like file
  output                Output path for the html graph file.

options:
  -h, --help            show this help message and exit
  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
                        and one for nodes in range 2001-inf bp).
```

When using this command, please only work with graphs of fewer than 10k nodes. To get there, you may flatten the graph or extract subgraphs (for instance with **pangraphs neighborhood** or **pangraphs isolate**).

The `-b`/`--boundaries` option lets you choose the size classes to differentiate. Each class gets its own color, and the number of nodes in each class is computed separately.

The `output` argument may be a path to a folder (existing or not) or a path to a file (with or without the `.html` extension).

## Compute stats on your graph
With this command, you can output basic stats on your graph.
```bash
pangraphs stats [-h] [-b BOUNDARIES [BOUNDARIES ...]] file

positional arguments:
  file                  Path to a gfa-like file

options:
  -h, --help            show this help message and exit
  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
                        and one for nodes in range 2001-inf bp).
```

This program prints the stats to the command line (stdout). You may redirect the output to a file, for instance when running on a cluster: `pangraphs stats graph.gfa > out.txt`.

The `-b`/`--boundaries` option lets you choose the size classes to differentiate. The number of nodes in each class is computed separately.
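
As an illustration of how `-b` values could translate into size classes, the sketch below turns a list of boundaries into inclusive `(low, high)` intervals, following the example given in the help text (`-b 50 2000` giving 0-50, 51-2000 and 2001-inf). The helper name `length_classes` is hypothetical and is not part of the `pangraphs` API.

```python
from math import inf


def length_classes(boundaries: list[int]) -> list[tuple[float, float]]:
    """Turn boundaries into inclusive (low, high) size classes.

    Hypothetical helper: each boundary closes one class, and the next
    class starts one base pair after it. The last class is open-ended.
    """
    classes = []
    low = 0
    for bound in sorted(boundaries):
        classes.append((low, bound))
        low = bound + 1
    classes.append((low, inf))
    return classes


# Same boundaries as in the help text example
print(length_classes([50, 2000]))
```

With the default of a single boundary at 50, this would give two classes: nodes up to 50 bp and nodes above 50 bp.
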
## Extract sequences from the graph
With this command, you can reconstruct linear sequences from the graph.
```bash
pangraphs reconstruct [-h] -r REFERENCE [--start START] [--stop STOP] [-s] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (without extension)

options:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Tells the reference sequence we seek start and stop into
  --start START         To specifiy a starting node on reference to create a subgraph
  --stop STOP           To specifiy a ending node on reference to create a subgraph
  -s, --split           Tells to split in different files
```
For this function, the `-r`/`--reference` option is needed only if you specify starting and ending points.
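
To make the idea of reconstruction concrete, here is a minimal sketch that spells a linear sequence by walking a path through the graph. It assumes segments are stored in a plain dictionary, that a path is a list of `(node, orientation)` pairs with `"+"` or `"-"` orientations, and that consecutive segments do not overlap. The names `reconstruct_path` and `reverse_complement` are illustrative and do not come from `pangraphs` or `gfagraphs`.

```python
# Translation table used to complement nucleotides
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def reconstruct_path(segments: dict[str, str], walk: list[tuple[str, str]]) -> str:
    """Concatenate segment sequences along a walk.

    Assumption: segments read in "-" orientation are reverse-complemented,
    and links carry no overlap between consecutive segments.
    """
    parts = []
    for node, orientation in walk:
        sequence = segments[node]
        parts.append(sequence if orientation == "+" else reverse_complement(sequence))
    return "".join(parts)


# Toy graph with three segments and one path
segments = {"1": "ACG", "2": "TT", "3": "GGA"}
walk = [("1", "+"), ("2", "-"), ("3", "+")]
print(reconstruct_path(segments, walk))
```
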
## Adding coordinate system
With this command, you can add a GFA-compatible JSON string to each S-line of the graph (each node). This field contains the starting position, ending position and orientation of the node for each path in the graph.
```bash
pangraphs offset [-h] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (with extension)

options:
  -h, --help            show this help message and exit
```
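
The following sketch shows one way such per-node coordinates could be computed and serialized. It is not the tool's implementation: the tag name `PO`, the 0-based half-open coordinates, and the function `path_offsets` are assumptions made for illustration, and the sketch assumes each node is visited at most once per path.

```python
import json


def path_offsets(
    segments: dict[str, str],
    paths: dict[str, list[tuple[str, str]]],
) -> dict[str, dict[str, list]]:
    """Map each node to {path name: [start, end, orientation]}.

    Assumptions: positions are 0-based with an exclusive end, and a node
    appears at most once in a given path.
    """
    offsets: dict[str, dict[str, list]] = {node: {} for node in segments}
    for path_name, walk in paths.items():
        position = 0
        for node, orientation in walk:
            length = len(segments[node])
            offsets[node][path_name] = [position, position + length, orientation]
            position += length
    return offsets


segments = {"1": "ACG", "2": "TT", "3": "GGA"}
paths = {
    "A": [("1", "+"), ("2", "+"), ("3", "+")],
    "B": [("1", "+"), ("3", "+")],
}
offsets = path_offsets(segments, paths)

# Hypothetical JSON-typed optional field appended to each S-line
for node, sequence in segments.items():
    print(f"S\t{node}\t{sequence}\tPO:J:{json.dumps(offsets[node])}")
```

GFA optional fields accept a JSON value type (`J`), which is why a JSON string can be carried on an S-line without breaking the format.
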
## Isolate a subgraph
### By neighbors around a node
With this function, you can extract the *n* closest nodes around a node, keeping the topology and the information attached to the selected nodes, and thereby creating a subgraph.
```bash
pangraphs neighborhood [-h] [-s START_NODE [START_NODE ...]] [-c COUNT] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (with extension)

options:
  -h, --help            show this help message and exit
  -s START_NODE [START_NODE ...], --start_node START_NODE [START_NODE ...]
                        To specifiy a starting node on reference to create a subgraph
  -c COUNT, --count COUNT
                        Number of nodes around each starting point
```
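
A neighborhood extraction of this kind can be sketched as a breadth-first search that stops once enough nodes have been collected. The sketch below works on a plain edge list rather than on a GFA file, ignores edge direction when exploring, and treats `count` as the number of nodes collected in addition to the starting node. These choices, and the function name `neighborhood`, are assumptions rather than the behaviour of `pangraphs neighborhood`.

```python
from collections import deque


def neighborhood(
    edges: list[tuple[str, str]], start: str, count: int
) -> tuple[list[str], list[tuple[str, str]]]:
    """Collect up to `count` nodes closest to `start`, plus the induced edges."""
    # Undirected adjacency built from the edge list
    adjacency: dict[str, set[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)

    selected = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(selected) < count + 1:
        node = queue.popleft()
        # Sorted for a deterministic visiting order
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                selected.append(neighbor)
                queue.append(neighbor)
                if len(selected) == count + 1:
                    break

    # Keep only the edges whose two ends were both selected
    kept = set(selected)
    sub_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return selected, sub_edges


edges = [("1", "2"), ("1", "3"), ("2", "4"), ("3", "4"), ("4", "5")]
nodes, sub_edges = neighborhood(edges, start="1", count=3)
print(nodes, sub_edges)
```
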
### By starting and ending position
With this function, you need to have coordinates in your input GFA, meaning you need to use `pangraphs offset` beforehand.
```bash
pangraphs isolate [-h] [-s START] [-e END] [-r REFERENCE] file out

positional arguments:
  file                  Path to a gfa-like file
  out                   Output path (with extension)

options:
  -h, --help            show this help message and exit
  -s START, --start START
                        To specifiy a starting point (in bp) to create a subgraph
  -e END, --end END     To specifiy a end point (in bp) to create a subgraph
  -r REFERENCE, --reference REFERENCE
                        To specifiy the path to follow
```
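
Once every node carries its coordinates on each path, selecting a region amounts to keeping the nodes whose interval on the chosen path overlaps the requested range. The sketch below illustrates that selection on a plain dictionary. The data layout (node name mapped to per-path `(start, end)` pairs, 0-based with an exclusive end) and the function name `nodes_in_interval` are assumptions, not the format written by `pangraphs offset`.

```python
def nodes_in_interval(
    offsets: dict[str, dict[str, tuple[int, int]]],
    reference: str,
    start: int,
    end: int,
) -> list[str]:
    """Return nodes whose span on `reference` overlaps [start, end)."""
    selected = []
    for node, per_path in offsets.items():
        if reference not in per_path:
            continue  # node is not traversed by the reference path
        node_start, node_end = per_path[reference]
        if node_start < end and node_end > start:
            selected.append(node)
    # Order the result along the reference path
    return sorted(selected, key=lambda node: offsets[node][reference][0])


offsets = {
    "1": {"A": (0, 3), "B": (0, 3)},
    "2": {"A": (3, 5)},
    "3": {"A": (5, 8), "B": (3, 6)},
}
print(nodes_in_interval(offsets, reference="A", start=4, end=6))
```
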
## Compute edit distance between graphs
To be compared, two graphs need to:
+ have the same sequence content
+ have the same number and names of paths
+ yield the same sequences when their paths are reconstructed
If those criteria are met, you may compare your graphs.
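
The criteria above can be checked before running the comparison. The sketch below assumes each graph has already been reduced to a mapping from path name to reconstructed sequence, and reports every mismatch it finds. The function name `comparability_issues` is hypothetical and not part of `pangraphs`.

```python
def comparability_issues(paths_a: dict[str, str], paths_b: dict[str, str]) -> list[str]:
    """List the reasons why two graphs cannot be compared (empty if none).

    Each argument maps a path name to the sequence spelled by that path.
    """
    issues = []
    # Path names present in only one of the two graphs
    unmatched = set(paths_a) ^ set(paths_b)
    if unmatched:
        issues.append(f"path names differ: {sorted(unmatched)}")
    # Shared paths must spell the same sequence in both graphs
    for name in sorted(set(paths_a) & set(paths_b)):
        if paths_a[name] != paths_b[name]:
            issues.append(f"path {name!r} spells different sequences")
    return issues


graph_a = {"A": "ACGTTGGA", "B": "ACGGGA"}
graph_b = {"A": "ACGTTGGA", "B": "ACGGGA"}
graph_c = {"A": "ACGTTGGA", "C": "ACGGGA"}
print(comparability_issues(graph_a, graph_b))
print(comparability_issues(graph_a, graph_c))
```
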
```bash
pangraphs edit [-h] -o OUTPUT_FOLDER [-p] file [file ...]

positional arguments:
  file                  Path(s) to two or more gfa-like file(s).

options:
  -h, --help            show this help message and exit
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Path to a folder for results.
  -p, --perform_edition
                        Asks to perform edition on graph and outputs it.
```
The `-p`/`--perform_edition` option applies the identified merge/split operations to the second loaded graph, so that it ends up with the same segmentation as the first one.
90 changes: 90 additions & 0 deletions workspace/bubble_seeker.py
@@ -0,0 +1,90 @@
from typing import Generator
from gfagraphs import Graph


def grouper(iterable, n=2, m=1):
    """Collect data into overlapping fixed-length chunks or blocks"""
    return [iterable[i:i+n] for i in range(0, len(iterable)-1, n-m)]


def common_members(elements: list[set]):
    first_path, other_paths = elements[0], elements[1:]
    return sorted(list(first_path.intersection(*other_paths)))


def bubble_caller(gfa_graph: Graph) -> list[dict]:
    """Calls out the bubbles in the graph.
    A bubble can be defined as having a starting and an ending node
    with a in and out node with degree equal to the number of paths
    for superbubble level we don't have to watch the order, as
    Args:
        gfa_graph (Graph): a loaded gfa graph
    Returns:
        list[dict]: a list of mappings between paths names and the subchain in the bubble
        one element per bubble
    """
    gfa_paths: list = gfa_graph.get_path_list()

    # Ordered list of node names for each path, keyed by path name
    all_sets = {
        path.datas['name']:
        [
            node_name for node_name, _ in path.datas['path']
        ]
        for path in gfa_paths
    }

    # Nodes shared by every path are the bubble endpoints
    # (node names must be convertible to int for the sort)
    bubbles_endpoints: list = sorted(common_members(
        list(
            set(x) for x in all_sets.values()
        )
    ), key=int)
    bubbles: list[dict] = [{}
                           for _ in range(len(bubbles_endpoints)-1)]
    for path in gfa_paths:
        # Computing endpoint positions in list for each path
        endpoints_indexes: list = grouper(
            [
                all_sets[
                    path.datas['name']
                ].index(
                    endpoint
                ) for endpoint in bubbles_endpoints
            ],
            2
        )
        print(endpoints_indexes)
        # Getting bubble chains
        for i, (start, end) in enumerate(endpoints_indexes):
            bubbles[i][path.datas['name']
                       ] = all_sets[path.datas['name']][start:end+1]
    return bubbles


def call_variants(gfa_file: str, gfa_type: str, reference_name: str) -> Generator:
    """Given a GFA file and a path name, calls rank 1 variants against it
    Args:
        gfa_file (str): path to a gfa file
        gfa_type (str): subformat
        reference_name (str): a path name in the gfa file
    """
    gfa_graph: Graph = Graph(
        gfa_file=gfa_file,
        gfa_type=gfa_type,
        with_sequence=True)
    bubbles: list[dict] = bubble_caller(gfa_graph=gfa_graph)
    print(bubbles)
    for bubble in bubbles:
        yield {path_name: ''.join([gfa_graph.get_segment(node=node).datas['seq'] for node in path_chain]) for path_name, path_chain in bubble.items()}


"""
def flatten_graph(gfa_file: str, gfa_type: str) -> None:
    gfa_graph: Graph = Graph(
        gfa_file=gfa_file,
        gfa_type=gfa_type,
        with_sequence=True)
    bubbles: list[dict] = bubble_caller(gfa_graph=gfa_graph)
"""
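
The bubble-calling idea in `bubble_seeker.py` can be tried out without a GFA file. The sketch below is a simplified variant working on plain lists of node names: nodes shared by all paths act as anchors, and each pair of consecutive anchors delimits one bubble. Unlike `bubble_caller`, which sorts node names as integers, this sketch orders anchors by their position on the first path, and it assumes that no path visits a node twice. The function name `call_bubbles` is hypothetical.

```python
def call_bubbles(paths: dict[str, list[str]]) -> list[dict[str, list[str]]]:
    """Split paths into bubbles delimited by nodes shared by every path."""
    walks = list(paths.values())
    # Anchors: nodes present in every path, in the order of the first path
    shared = set(walks[0]).intersection(*walks[1:])
    anchors = [node for node in walks[0] if node in shared]

    bubbles = []
    for start, end in zip(anchors, anchors[1:]):
        # For each path, keep the chain of nodes from one anchor to the next
        bubble = {
            name: walk[walk.index(start):walk.index(end) + 1]
            for name, walk in paths.items()
        }
        bubbles.append(bubble)
    return bubbles


paths = {
    "A": ["1", "2", "4", "5", "7"],
    "B": ["1", "3", "4", "6", "7"],
}
for bubble in call_bubbles(paths):
    print(bubble)
```
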
18 changes: 14 additions & 4 deletions workspace/main.py
@@ -96,14 +96,16 @@
 parser_grapher.add_argument("output", type=str,
                             help="Output path for the html graph file.")
 parser_grapher.add_argument(
-    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+')
+    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+', default=[50])

 ## Subparser for stats ##

 parser_stats: ArgumentParser = subparsers.add_parser(
     'stats', help="Retrieves basic stats on a pangenome graph.")

 parser_stats.add_argument("file", type=str, help="Path to a gfa-like file")
+parser_stats.add_argument(
+    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+', default=[50])

 ## Subparser for reconstruct_sequences ##

@@ -235,9 +237,17 @@ def main() -> None:
         paths_step(args.file, output, nodes,
                    gfa_version_info, gfa_version_info)
     elif args.subcommands == 'stats':
-        pangenome_graph: MultiDiGraph = (pgraph := pGraph(
-            args.file, gfa_version_info, with_sequence=True)).compute_networkx()
-        graph_stats = compute_stats(pgraph)
+        pgraph: pGraph = pGraph(
+            args.file, gfa_version_info, with_sequence=True)
+        bounds: list = []
+        boundaries = [
+            0] + [bound+x for bound in args.boundaries for x in [0, 1]] + [float('inf')]
+        for i in range(0, len(boundaries), 2):
+            x = i
+            bounds.append([boundaries[x], boundaries[x+1]])
+
+        graph_stats = compute_stats(pgraph, length_classes=tuple(bounds))
+
         for key, value in graph_stats.items():
             print(f"{key}: {value}")
     elif args.subcommands == 'grapher':