docs + minor fixes + bubblecaller

dubssieg · Sep 25, 2023 · 032fe2f
1 parent 5fc55ae
commit 032fe2f

Showing 3 changed files with 258 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -9,6 +9,9 @@
 
 Implementations of many functions for performing various actions on GFA-like graphs in a command-line tool, such as extracting or offsetting a pangenome graph. It can compare the topology of graphs that contain the same set of sequences, and it renders pangenome graphs as interactive HTML files.
 Uses the [gfagraphs library](https://pypi.org/project/gfagraphs/) to load and manipulate pangenome graphs.
+Details about the implementation can be [found here](https://hal.science/hal-04213245) (in French only, sorry).
+
+![](https://media.discordapp.net/attachments/878301351753429072/1154788148577058886/Screenshot_from_2023-09-22_16-35-22.png)
 
 ## Installation
 
@@ -29,8 +32,158 @@ Other tools are in the `scripts` folder.
 Are available through `pangraphs`:
 
 - **grapher** creates interactive graph representation from a GFA file
+- **stats** gathers basic stats from the input GFA file
 - **reconstruct** recreates the linear sequences from the graph
 - **offset** adds relative position information as a tag in GFA file
 - **isolate** extracts a subgraph from positions in the paths
 - **neighborhood** extracts a subgraph from a set of nodes around a node
-- **edit** computes a edit distance between variation graphs
+- **edit** computes an edit distance between variation graphs
+
+## Render interactive html view
+
+With this command, you can create an interactive HTML view of your graph, with sequences in the nodes (S-lines) and nodes connected by edges (L-lines). If additional information is given (such as W-lines or P-lines), supplementary edges are drawn to show the paths that the genomes follow in the graph.
+
+```bash
+pangraphs grapher [-h] [-b BOUNDARIES [BOUNDARIES ...]] file output
+
+positional arguments:
+  file                  Path to a gfa-like file
+  output                Output path for the html graph file.
+
+options:
+  -h, --help            show this help message and exit
+  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
+                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
+                        and one for nodes in range 2001-inf bp).
+```
+
+When using this command, please only work with graphs of fewer than 10k nodes. To reduce a larger graph, you may flatten it or extract subgraphs (for instance with **pangraphs neighborhood** or **pangraphs isolate**).
+
+The `-b`/`--boundaries` option lets you choose the size classes to differentiate. Each class gets its own color, and its nodes are counted separately.
+
+The `output` argument may be a path to a folder (existing or not) or a path to a file (with or without an `.html` extension).
+
+## Compute stats on your graph
+
+With this command, you can output basic stats on your graph.
+
+```bash
+pangraphs stats [-h] [-b BOUNDARIES [BOUNDARIES ...]] file
+
+positional arguments:
+  file                  Path to a gfa-like file
+
+options:
+  -h, --help            show this help message and exit
+  -b BOUNDARIES [BOUNDARIES ...], --boundaries BOUNDARIES [BOUNDARIES ...]
+                        One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp
+                        and one for nodes in range 2001-inf bp).
+```
+
+This program prints stats to the command line (stdout). You may redirect them to a file if you want to use it on a cluster (`pangraphs stats graph.gfa > out.txt`).
+
+The `-b`/`--boundaries` option lets you choose the size classes to differentiate. The nodes of each class are counted separately.
+
+## Extract sequences from the graph
+
+With this command, you can reconstruct linear sequences from the graph.
+
+```bash
+pangraphs reconstruct [-h] -r REFERENCE [--start START] [--stop STOP] [-s] file out
+
+positional arguments:
+  file                  Path to a gfa-like file
+  out                   Output path (without extension)
+
+options:
+  -h, --help            show this help message and exit
+  -r REFERENCE, --reference REFERENCE
+                        Tells the reference sequence we seek start and stop into
+  --start START         To specifiy a starting node on reference to create a subgraph
+  --stop STOP           To specifiy a ending node on reference to create a subgraph
+  -s, --split           Tells to split in different files
+```
+
+For this function, the `-r`/`--reference` option is needed only if you specify starting and ending points.
+
+## Add a coordinate system
+
+With this command, you can add a GFA-compatible JSON string to each S-line of the graph (each node). This field contains the starting position, ending position and orientation of the node for each path in the graph.
+
+```bash
+pangraphs offset [-h] file out
+
+positional arguments:
+  file        Path to a gfa-like file
+  out         Output path (with extension)
+
+options:
+  -h, --help  show this help message and exit
+```
+
+## Isolate a subgraph
+
+### By neighbors around a node
+
+With this function, you can extract the *n* nodes closest to a given node into a subgraph, keeping the topology and information of the selected nodes.
+
+```bash
+pangraphs neighborhood [-h] [-s START_NODE [START_NODE ...]] [-c COUNT] file out
+
+positional arguments:
+  file                  Path to a gfa-like file
+  out                   Output path (with extension)
+
+options:
+  -h, --help            show this help message and exit
+  -s START_NODE [START_NODE ...], --start_node START_NODE [START_NODE ...]
+                        To specifiy a starting node on reference to create a subgraph
+  -c COUNT, --count COUNT
+                        Number of nodes around each starting point
+```
+
+### By starting and ending position
+
+With this function, you need to have coordinates in your input GFA, meaning you need to use `pangraphs offset` beforehand.
+
+```bash
+pangraphs isolate [-h] [-s START] [-e END] [-r REFERENCE] file out
+
+positional arguments:
+  file                  Path to a gfa-like file
+  out                   Output path (with extension)
+
+options:
+  -h, --help            show this help message and exit
+  -s START, --start START
+                        To specifiy a starting point (in bp) to create a subgraph
+  -e END, --end END     To specifiy a end point (in bp) to create a subgraph
+  -r REFERENCE, --reference REFERENCE
+                        To specifiy the path to follow
+```
+
+
+## Compute edits between graphs
+
+To be compared, two graphs must:
++ have the same sequence content
++ have the same number and names of paths
++ yield the same sequences when their paths are reconstructed
+
+If those criteria are met, you may compare your graphs.
+
+```bash
+pangraphs edit [-h] -o OUTPUT_FOLDER [-p] file [file ...]
+
+positional arguments:
+  file                  Path(s) to two or more gfa-like file(s).
+
+options:
+  -h, --help            show this help message and exit
+  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
+                        Path to a folder for results.
+  -p, --perform_edition
+                        Asks to perform edition on graph and outputs it.
+```
+
+The `-p`/`--perform_edition` option applies the identified merge/split operations to the second loaded graph, so that it ends up with the same segmentation as the first one.
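
The comparison criteria listed in the README hunk above can be illustrated with a small sketch. It is a toy model only: each graph is reduced to a plain dict mapping a path name to the list of node sequences it traverses, and the function name `comparable` is hypothetical. It does not call `pangraphs` or `gfagraphs`, whose actual checks may differ.

```python
def comparable(paths_a: dict[str, list[str]], paths_b: dict[str, list[str]]) -> bool:
    """Toy precondition check on two graphs given as {path name: [node sequences]}."""
    # Same number and names of paths in both graphs
    if paths_a.keys() != paths_b.keys():
        return False
    # Reconstructing each path must spell the same linear sequence in both graphs
    return all(
        "".join(paths_a[name]) == "".join(paths_b[name]) for name in paths_a
    )


# Same sequences, different segmentation: comparable
graph_a = {"ref": ["AC", "GT"], "alt": ["AC", "T"]}
graph_b = {"ref": ["A", "CGT"], "alt": ["A", "CT"]}
# The "ref" path spells a different sequence: not comparable
graph_c = {"ref": ["AC", "GA"], "alt": ["AC", "T"]}

print(comparable(graph_a, graph_b))
print(comparable(graph_a, graph_c))
```

In this model, `graph_a` and `graph_b` differ only in how the sequences are cut into nodes, which is the situation the merge/split operations are meant to describe.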
diff --git a/workspace/bubble_seeker.py b/workspace/bubble_seeker.py
@@ -0,0 +1,90 @@
+from typing import Generator
+from gfagraphs import Graph
+
+
+def grouper(iterable, n=2, m=1):
+    """Collect data into overlapping fixed-length chunks or blocks"""
+    return [iterable[i:i+n] for i in range(0, len(iterable)-1, n-m)]
+
+
+def common_members(elements: list[set]):
+    first_path, other_paths = elements[0], elements[1:]
+    return sorted(list(first_path.intersection(*other_paths)))
+
+
+def bubble_caller(gfa_graph: Graph) -> list[dict]:
+    """Calls out the bubbles in the graph.
+    A bubble can be defined as having a starting and an ending node
+    with a in and out node with degree equal to the number of paths
+    for superbubble level we don't have to watch the order, as 
+
+    Args:
+        gfa_graph (Graph): a loaded gfa graph
+
+    Returns:
+        list[dict]: a list of mappings between paths names and the subchain in the bubble
+                    one element per bubble
+    """
+    gfa_paths: list = gfa_graph.get_path_list()
+
+    all_sets = {
+        path.datas['name']:
+            [
+                node_name for node_name, _ in path.datas['path']
+        ]
+        for path in gfa_paths
+    }
+
+    bubbles_endpoints: list = sorted(common_members(
+        list(
+            set(x) for x in all_sets.values()
+        )
+    ), key=int)
+    bubbles: list[dict] = [{}
+                           for _ in range(len(bubbles_endpoints)-1)]
+    for path in gfa_paths:
+        # Computing endpoint positions in list for each path
+        endpoints_indexes: list = grouper(
+            [
+                all_sets[
+                    path.datas['name']
+                ].index(
+                    endpoint
+                ) for endpoint in bubbles_endpoints
+            ],
+            2
+        )
+        print(endpoints_indexes)
+        # Getting bubble chains
+        for i, (start, end) in enumerate(endpoints_indexes):
+            bubbles[i][path.datas['name']
+                       ] = all_sets[path.datas['name']][start:end+1]
+    return bubbles
+
+
+def call_variants(gfa_file: str, gfa_type: str, reference_name: str) -> Generator:
+    """Given a GFA file and a path name, calls rank 1 variants against it
+
+    Args:
+        gfa_file (str): path to a gfa file
+        gfa_type (str): subformat
+        reference_name (str): a path name in the gfa file
+    """
+    gfa_graph: Graph = Graph(
+        gfa_file=gfa_file,
+        gfa_type=gfa_type,
+        with_sequence=True)
+    bubbles: list[dict] = bubble_caller(gfa_graph=gfa_graph)
+    print(bubbles)
+    for bubble in bubbles:
+        yield {path_name: ''.join([gfa_graph.get_segment(node=node).datas['seq'] for node in path_chain]) for path_name, path_chain in bubble.items()}
+
+
+"""
+def flatten_graph(gfa_file: str, gfa_type: str) -> None:
+    gfa_graph: Graph = Graph(
+        gfa_file=gfa_file,
+        gfa_type=gfa_type,
+        with_sequence=True)
+    bubbles: list[dict] = bubble_caller(gfa_graph=gfa_graph)
+"""
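
To make the endpoint logic of `bubble_caller` easier to follow, here is a minimal sketch that reuses `grouper` and `common_members` from the file above on hand-written paths instead of a `Graph` object. The path names and node names are made up for illustration. Like the original, it assumes node names are numeric strings and that shared nodes appear in the same numeric order in every path.

```python
def grouper(iterable, n=2, m=1):
    """Collect data into overlapping fixed-length chunks."""
    return [iterable[i:i+n] for i in range(0, len(iterable)-1, n-m)]


def common_members(elements):
    """Return the sorted nodes shared by every set in `elements`."""
    first, others = elements[0], elements[1:]
    return sorted(first.intersection(*others))


# Toy paths: node names visited by each genome, in order
paths = {
    "ref": ["1", "2", "4", "5", "7"],
    "alt": ["1", "3", "4", "6", "7"],
}

# Nodes traversed by every path are candidate bubble endpoints
endpoints = sorted(common_members([set(p) for p in paths.values()]), key=int)

# One bubble between each pair of consecutive endpoints
bubbles = [{} for _ in range(len(endpoints) - 1)]
for name, nodes in paths.items():
    # Position of each endpoint in this path, grouped as (start, end) pairs
    indexes = grouper([nodes.index(e) for e in endpoints], 2)
    for i, (start, end) in enumerate(indexes):
        bubbles[i][name] = nodes[start:end + 1]

print(endpoints)
print(bubbles)
```

Here nodes `1`, `4` and `7` are shared, so two bubbles are reported, each mapping a path name to the chain of nodes it follows between the two endpoints.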
diff --git a/workspace/main.py b/workspace/main.py
@@ -96,14 +96,16 @@
 parser_grapher.add_argument("output", type=str,
                             help="Output path for the html graph file.")
 parser_grapher.add_argument(
-    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+')
+    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+', default=[50])
 
 ## Subparser for stats ##
 
 parser_stats: ArgumentParser = subparsers.add_parser(
     'stats', help="Retrieves basic stats on a pangenome graph.")
 
 parser_stats.add_argument("file", type=str, help="Path to a gfa-like file")
+parser_stats.add_argument(
+    "-b", "--boundaries", type=int, help="One or a list of ints to use as boundaries for display (ex : -b 50 2000 will set 3 colors : one for nodes in range 0-50bp, one for nodes in range 51-2000 bp and one for nodes in range 2001-inf bp).", nargs='+', default=[50])
 
 ## Subparser for reconstruct_sequences ##
 
@@ -235,9 +237,17 @@ def main() -> None:
             paths_step(args.file, output, nodes,
                        gfa_version_info, gfa_version_info)
     elif args.subcommands == 'stats':
-        pangenome_graph: MultiDiGraph = (pgraph := pGraph(
-            args.file, gfa_version_info, with_sequence=True)).compute_networkx()
-        graph_stats = compute_stats(pgraph)
+        pgraph: pGraph = pGraph(
+            args.file, gfa_version_info, with_sequence=True)
+        bounds: list = []
+        boundaries = [
+            0] + [bound+x for bound in args.boundaries for x in [0, 1]] + [float('inf')]
+        for i in range(0, len(boundaries), 2):
+            x = i
+            bounds.append([boundaries[x], boundaries[x+1]])
+
+        graph_stats = compute_stats(pgraph, length_classes=tuple(bounds))
+
         for key, value in graph_stats.items():
             print(f"{key}: {value}")
     elif args.subcommands == 'grapher':
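
The way the `stats` branch above turns the `-b` values into length classes can be checked in isolation. The sketch below mirrors the arithmetic of the hunk; the function name `length_classes` is hypothetical and only wraps that logic, it is not part of `main.py`.

```python
def length_classes(boundaries: list[int]) -> tuple[list, ...]:
    """Expand boundary values into inclusive [low, high] size classes."""
    # Each boundary b closes one class at b and opens the next at b+1
    flat = [0] + [b + x for b in boundaries for x in (0, 1)] + [float("inf")]
    # Consecutive pairs of the flat list are the class limits
    return tuple([flat[i], flat[i + 1]] for i in range(0, len(flat), 2))


print(length_classes([50, 2000]))  # three classes, as in the help text
print(length_classes([50]))        # the default value of -b
```

With `-b 50 2000` this yields the classes 0-50, 51-2000 and 2001-inf described in the help string, and the default `[50]` yields 0-50 and 51-inf.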