Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

start modularizing intervals + BP graph, and generate AA-esque summary file #24

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
9894198
start modularizing intervals and generating AA-esque summary file
suhas-r Dec 14, 2024
2ee9f9b
continue cleaning up breakpoint creation + interval methods
suhas-r Dec 24, 2024
626e388
unify breakpoint alignment logic / reduce redundancy
suhas-r Dec 25, 2024
a7fee46
modularize + improve readability/error-handling for seeding
suhas-r Jan 5, 2025
45d44a8
add CN plotting functionality + further breakpoint modularization
suhas-r Jan 5, 2025
7b93d8a
drastically clean up amplified interval search, add Node + Discordant…
suhas-r Jan 6, 2025
9482d90
everything but the adjacency matrix
suhas-r Jan 8, 2025
ab49f73
pending one last triage
suhas-r Jan 10, 2025
a98658b
bugs triaged, pending path constraint updates and indexing->dot notat…
suhas-r Jan 13, 2025
d6fcd0c
path constraint cleanup
suhas-r Jan 19, 2025
2f81982
modularize path constraint construction
suhas-r Jan 25, 2025
fab6883
finish path constraint refactor
suhas-r Jan 26, 2025
1520f98
Finalize generalized BP graph updates, address bugs post-integration
suhas-r Jan 27, 2025
d43e525
removing deprecated code, centralizing to/from file methods for BP graph
suhas-r Jan 29, 2025
111e7ea
cycle decomp decoupling of BP graph + metadata container
suhas-r Jan 29, 2025
17c0e7a
Fix cycle entrypoint, cleanup cycle decomp redundancy
suhas-r Jan 30, 2025
b37537b
Clean up BP clustering methods, finish up summary file
suhas-r Jan 31, 2025
cbbf672
Clean up file structure, summary file
suhas-r Jan 31, 2025
90b5e37
Fix cni access, indel breakpoint generation
suhas-r Feb 5, 2025
f888866
poetry build updates for py3.12 + cnvkit
suhas-r Feb 6, 2025
239aaee
Update README + wiki
suhas-r Feb 6, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,5 @@ configs/
*.log
*.lp
*.pdf
*.ipynb
*.code-workspace
95 changes: 86 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,12 @@
# CoRAL - Complete Reconstruction of Amplifications with Long reads
## Reference
CoRAL is a tool which utilizes aligned, single-molecule long-read data (.bam) as input, and identifies candidate ecDNA structures. A pre-print is available here:
https://www.biorxiv.org/content/10.1101/2024.02.15.580594v1
CoRAL is a tool which utilizes aligned, single-molecule long-read data (.bam) as input, and identifies candidate ecDNA structures. The original Genome Research '24 paper is available here: https://genome.cshlp.org/content/34/9/1344.

- **CoRAL only works on long-read
whole-genome sequencing data (PacBio, Oxford Nanopore, etc.) - not targeted
sequencing!**
- **We also only support hg38-aligned data currently. Support for other genomes
is [coming soon](https://github.com/AmpliconSuite/CoRAL/issues/33)!**

## Installation
CoRAL can be installed and run on most modern Unix-like operating systems (e.g. Ubuntu 18.04+, CentOS 7+, macOS).
Expand All @@ -17,19 +22,20 @@ CoRAL requires python>=3.12; we recommend using venv/conda for managing Python/p

2. Install packages

- **Option 1.** Install With `pip`.

`pip install -r requirements.txt`

Set `--extra-index-url https://download.pytorch.org/whl/cpu` to prevent inclusion of gigantic GPU packages.

- **Option 2.** Install with `poetry`.
- **Option 1.** Install with `poetry`.

```bash
pip install poetry
poetry install
```

- **Option 2.** Install With `pip`.

`pip install -r requirements.txt`

Set `--extra-index-url https://download.pytorch.org/whl/cpu` to prevent inclusion of gigantic GPU packages.



3. [Download a Gurobi optimizer license](https://support.gurobi.com/hc/en-us/articles/360040541251-How-do-I-obtain-a-free-academic-license) (free for academic use)
- Place the `gurobi.lic` file you download into `$HOME/`. This path is usually `/home/username/gurobi.lic`.
Expand Down Expand Up @@ -76,6 +82,11 @@ The modes are as follows:
3. `plot`: Create plots of decomposed cycles and/or breakpoint graph sashimi plot.
4. `hsr`: Identify candidate locations of chromosomal homogenously staining region (HSR) integration points for ecDNA.
5. `cycle2bed`: Convert the [AmpliconArchitect](https://github.com/AmpliconSuite/AmpliconArchitect) (AA) style `*_cycles.txt` file to a .bed format. The AA format is also used by CoRAL.
6. `cycle`: Run the cycle extraction algorithm on a previously generated
breakpoint graph. NOTE: This requires the breakpoint graph to be generated with
CoRAL v2.1.0 or later, as we require `path constraints` and `amplicon intervals`
to be included in the provided `*_graph.txt` file.


## 1. ```seed```
As the seed amplification intervals are required by the main script ```reconstruct``` mode, it is suggested the user first run ```seed``` mode to generate seed amplification intervals.
Expand Down Expand Up @@ -118,6 +129,12 @@ Usage:
CoRAL may identify and reconstruct a few distinct focal amplifications in the input ```*.BAM``` sample, each will be organized as an *amplicon*, which includes a connected component of amplified intervals and their connections by discordant edges. CoRAL writes the following files to the directory specified with ```--output_dir```.

* Graph file: For each amplicon, a tab-separated text file named ```output_dir/amplicon*_graph.txt``` describing the *sequence edges*, *concordant edges* and *discordant edges* in the graph and their predicted copy count. Note that the graph files outputted by CoRAL have the same format as those outputted by [AmpliconArchitect](https://github.com/AmpliconSuite/AmpliconArchitect) (and therefore the files can be used interchangeably with AmpliconArchitect). Here is an example graph file from GBM39, a cell line with *EGFR* amplified on ecDNA.
* As of version 2.1.0, CoRAL additionally includes `path constraints` and
`amplicon intervals` in the `*_graph.txt` file. This results in the graph
being fully self-contained and able to be passed to cycle extraction without
re-parsing the BAM file. For more information on how to interpret this
metadata, visit our [wiki](https://github.com/AmpliconSuite/CoRAL/wiki/Home/_edit#breakpoint-graphs).

```
SequenceEdge: StartPosition, EndPosition, PredictedCN, AverageCoverage, Size, NumberOfLongReads
sequence chr7:54659673- chr7:54763281+ 4.150534 45.907363 103609 576
Expand All @@ -137,6 +154,12 @@ concordant chr7:56049369+->chr7:56049370- 4.150534 45
discordant chr7:55610095-->chr7:55609190+ 86.642611 869
discordant chr7:56049369+->chr7:54763282- 85.189818 981
discordant chr7:55155021-->chr7:55127266+ 86.496697 978
...
PathConstraint: Path, Support
path_constraint e2+:1,c2-:1,e3+:1,c3-:1,e4+:1 6
path_constraint e4+:1,c4-:1,e5+:1,c5-:1,e6+:1 34
AmpliconIntervals: chr, start, end
interval chr7 54659673 56149664
```
* Cycles file:
For each amplicon, a tab-separated text file named ```output_dir_amplicon*_cycles.txt``` describing the list of cycles and paths returned from cycle extraction. Note that the cycles files output by CoRAL have mostly the same format as those output by [AmpliconArchitect](https://github.com/AmpliconSuite/AmpliconArchitect) (and therefore the files can be used interchangeably with AmpliconArchitect in most cases). Specifically a cycles file includes (i) the list of amplified intervals; (ii) the list of sequence edges; (iii) the list of cycles and paths, where an entry starts with ```0+``` and ends with ```0-``` in ```Segments``` indicates a path - these lines have the same format as AmpliconArchitect output. CoRAL's cycles files additionally include (iv) a list of longest (i.e., there are no paths that can form a sub/super-path to each other) path constraint indicated by long reads, and used in CoRAL's cycle extraction. Here is an example cycles file corresponding to the above graph file from GBM39.
Expand Down Expand Up @@ -236,3 +259,57 @@ chr7 55610095 56049369 + 1 True 82.346163
chr7 54763282 56049369 + 2 False 2.843655
```


## 5. ```cycle```
CoRAL provides an option to convert its cycles output in AmpliconArchitect format ```*_cycles.txt``` into ```*.bed``` format (similar to [Decoil](https://github.com/madagiurgiu25/decoil-pre)), which makes it easier for downstream analysis of these cycles.

Usage:
```coral cycle2bed <Required arguments> <Optional arguments>```

**5.1 Required arguments:**
* ```--cycle-file <FILE>``` - Input cycles file in AmpliconArchitect format.
* ```--output-file <FILE>``` - Output cycles file in ```*.bed``` format.

**5.2 Optional arguments:**
* ```--num-cycles <INT>``` - If specified, only convert the first NUM_CYCLES cycles.

Here is an example output of ```cycle2bed``` given by the above cycles file from GBM39.
```
#chr start end orientation cycle_id iscyclic weight
chr7 54763282 55127266 + 1 True 82.346163
chr7 55155021 55609190 + 1 True 82.346163
chr7 55610095 56049369 + 1 True 82.346163
chr7 54763282 56049369 + 2 False 2.843655
```


## 6. ```cycle```
Usage:
```coral cycle <Required arguments> <Optional arguments>```

**4.1 Required arguments:**

| Argument | Descripion |
|--------------------|---------------------------------------------------|
| `--graph <file>` | AA-formatted `_graph.txt` file |
| `--output-dir <file>` | Directory for output files |

**4.2 Optional arguments:**

| Argument | Default | Description |
|------------------------------|---------|--------------------------------------------------------------------|
| `--alpha <float>` | 0.01 | Parameter used to balance CN weight and path constraints in the objective function of greedy cycle extraction. Default value is 0.01, higher values favor the satisfaction of more path constraints. |
| `--time-limit <int>` | 7200 | Time limit for cycle extraction (in seconds) |
| `--threads <int>` | -1 | Number of threads for cycle extraction. If not specified, use all available cores. |
| `--solver <choice>` | gurobi | Solver for cycle extraction. Must be one of `[gurobi, scip]` |
| `--output-all-path-constraints` | False | If specified, output all path constraints given by long reads in `*_cycles.txt` file (see "Expected output" below). |
| `--postprocess-greedy-sol` | False | If specified, automatically postprocess the cycles/paths returned in greedy cycle extraction, by solving the full quadratic program to minimize the number of cycles/paths starting with the greedy cycle extraction solution (as an initial solution). |


## FAQs
- `call_cnvs.sh` didn't produce segmented CN calls in a .cns file?
- `cnvkit.py batch` contains multiple steps detailed in their
[documentation](https://cnvkit.readthedocs.io/en/stable/pipeline.html). The
errors from a particular stage don't always percolate up when running the
complete pipeline via `batch`, so try running each stage separately to
pinpoint the root cause.
81 changes: 81 additions & 0 deletions coral/bam_types.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
from __future__ import annotations

from typing import Generator, Protocol

import pysam

from coral.datatypes import Interval, Node


class BAMReadProtocol(Protocol):
pass


class BAMRead(BAMReadProtocol, pysam.AlignedRead): # type: ignore
pass


class SAMSegmentProtocol(Protocol):
# Abstract class used to add more informative typing to pysam BAM ops.
# Also, .pyx docstrings don't appear to work natively with VSCode/Pylance.
query_name: str # Force non-optional (unlike pysam.AlignedSegment)

def get_cigar_stats(self) -> tuple[list[int], list[int]]:
"""Summary of operations in cigar string.

The output order in the array is "MIDNSHP=X" followed by a
field for the NM tag. If the NM tag is not present, this
field will always be 0.

+-----+--------------+-----+
|M |BAM_CMATCH |0 |
+-----+--------------+-----+
|I |BAM_CINS |1 |
+-----+--------------+-----+
|D |BAM_CDEL |2 |
+-----+--------------+-----+
|N |BAM_CREF_SKIP |3 |
+-----+--------------+-----+
|S |BAM_CSOFT_CLIP|4 |
+-----+--------------+-----+
|H |BAM_CHARD_CLIP|5 |
+-----+--------------+-----+
|P |BAM_CPAD |6 |
+-----+--------------+-----+
|= |BAM_CEQUAL |7 |
+-----+--------------+-----+
|X |BAM_CDIFF |8 |
+-----+--------------+-----+
|B |BAM_CBACK |9 |
+-----+--------------+-----+
|NM |NM tag |10 |
+-----+--------------+-----+

If no cigar string is present, empty arrays will be returned.

Returns:
arrays :
two arrays. The first contains the nucleotide counts within
each cigar operation, the second contains the number of blocks
for each cigar operation.

"""
...


class SAMSegment(SAMSegmentProtocol, pysam.AlignedSegment): # type: ignore
pass


class BAMWrapper(pysam.AlignmentFile):
def fetch_interval(
self, interval: Interval
) -> Generator[SAMSegment, None, None]:
return self.fetch(interval.chr, interval.start, interval.end + 1) # type: ignore

def fetch_node(self, node: Node) -> Generator[SAMSegment, None, None]:
return self.fetch(node.chr, node.pos, node.pos + 1) # type: ignore


# class TypedBAM(pysam.AlignmentFile, BAMProtocol):
# pass
Loading