(Click the badge to go to the download page)
Fixed the annos column's JSON format by replacing doubled double-quotes ("") with single double-quotes (").
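For anyone still holding a copy with the doubled quotes, a minimal sketch of the same repair is shown below. The function name `fix_annos_quotes` and the sample string are hypothetical, and the naive replacement assumes no annotation holds a legitimately empty string value (which would also be written as `""`).

```python
import json


def fix_annos_quotes(annos_field: str) -> str:
    """Collapse doubled double-quotes into single double-quotes.

    Assumption: no value in the JSON is an empty string, since an
    empty string is also written as two adjacent double-quotes.
    """
    return annos_field.replace('""', '"')


# Hypothetical malformed field, shaped like the annos column
broken = '[{""chrom"": ""chr1"", ""start"": 100, ""end"": 120, ""repeat"": ""CA""}]'
fixed = fix_annos_quotes(broken)
annos = json.loads(fixed)  # parses cleanly once the quotes are repaired
```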
(Click the badge to go to the download page)
- Updated with two additional pathogenic repeats
  - FGF14 from https://www.nejm.org/doi/full/10.1056/NEJMoa2207406
  - THAP11 from https://pubmed.ncbi.nlm.nih.gov/37148549/
- Intersected with 118 phenotypic VNTRs
  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8549062/
The phenotypic VNTRs had no overlap with the pathogenic repeats. Therefore, we added each phenotypic VNTR's gene name to the pathogenic column of the catalog. This makes the v1.2 'patho' column more of a 'patho/pheno' column.
Column | Definition |
---|---|
chr | Chromosome of the region |
start | Start position of the region |
end | End position of the region |
ovl_flag | overlap categories of annotations inside the region |
up_buff | number of bases upstream of the first annotation's start that are non-TR sequence |
dn_buff | number of bases downstream of the last annotation's end that are non-TR sequence |
hom_span | number of bases of the region found to be homopolymer repeats |
n_filtered | number of annotations removed from the region |
n_annos | number of annotations remaining in the region |
n_subregions | number of subregions in the region |
mu_purity | average purity of annotations in region |
pct_annotated | percent of the region's range (minus buffer) annotated |
interspersed | name of interspersed repeat class found within region by RepeatMasker v4.1.4 |
patho/pheno | name of gene affected by a pathogenic or phenotypic tandem repeat in region |
codis | name of CODIS site contained in region |
gene_flag | gene features intersecting region (Ensembl v105) |
biotype | comma separated gene biotypes intersecting region (Ensembl v105) |
annos | JSON of TRF annotations in the region (list of dicts with keys: motif, entropy, ovl_flag, etc) |
(Click the badge to go to the download page)
- Updated pathogenic repeats. 54 pathogenic regions are unchanged, 2 have been changed, and 6 added.
changed: (is_now -> was)
NOTCH2NLA -> NOTCH2NLC
NOTCH2NLC -> NOTCH2NL
added:
EIF4A3, PRNP, TBX1, PRDM12, DMD, ZIC3
- Renamed annotations' "motif" key back to "repeat" for truvari anno trf compatibility.
- hom_span column normalized as hom_pct: percent of bases in the region annotated as homopolymers
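The two changes above can be illustrated with a short sketch. It assumes that the normalization divides by the region length `end - start` (bed-style coordinates); the function names `hom_span_to_pct` and `rename_motif_key` are hypothetical and are not part of truvari or the catalog's build code.

```python
def hom_span_to_pct(start: int, end: int, hom_span: int) -> float:
    """Convert a homopolymer base count into a percent of the region.

    Assumption: the denominator is the region length (end - start).
    """
    length = end - start
    if length <= 0:
        raise ValueError("region must have positive length")
    return 100.0 * hom_span / length


def rename_motif_key(anno: dict) -> dict:
    """Return a copy of an annotation with 'motif' renamed to 'repeat'."""
    out = dict(anno)
    if "motif" in out:
        out["repeat"] = out.pop("motif")
    return out


pct = hom_span_to_pct(100, 200, 25)  # 25 homopolymer bases in a 100 bp region
renamed = rename_motif_key({"motif": "CA", "period": 2})
```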
(Click the badge to go to the download page)
- Removed homopolymer annotations
- Simplified overlapping annotations
- Added new columns that describe properties of the region
Column | Definition |
---|---|
chr | Chromosome of the region |
start | Start position of the region |
end | End position of the region |
ovl_flag | overlap categories of annotations inside the region |
up_buff | number of bases upstream of the first annotation's start that are non-TR sequence |
dn_buff | number of bases downstream of the last annotation's end that are non-TR sequence |
hom_span | number of bases of the region found to be homopolymer repeats |
n_filtered | number of annotations removed from the region |
n_annos | number of annotations remaining in the region |
n_subregions | number of subregions in the region |
mu_purity | average purity of annotations in region |
pct_annotated | percent of the region's range (minus buffer) annotated |
interspersed | name of interspersed repeat class found within region by RepeatMasker v4.1.4 |
patho | name of gene affected by a pathogenic tandem repeat in region |
codis | name of CODIS site contained in region |
gene_flag | gene features intersecting region (Ensembl v105) |
biotype | comma separated gene biotypes intersecting region (Ensembl v105) |
annos | JSON of TRF annotations in the region (list of dicts with keys: motif, entropy, ovl_flag, etc) |
The annos JSON is a list of simple key:value records with the keys:
- chrom - chromosome
- start - start position of the repeat
- end - end position of the repeat
- period - period size of the repeat
- copies - number of copies of the repeat in the reference
- score - alignment score
- entropy - entropy measure based on percent composition
- motif - motif sequence of the repeat
- purity - sequence similarity of motif*copies against the annotation's reference span
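To make the purity definition concrete, here is a rough sketch of comparing motif*copies against a reference span. The similarity metric used by the catalog is not stated here, so Python's `difflib.SequenceMatcher` ratio is swapped in as a stand-in, and the percent scale is likewise an assumption. The names `expand_motif` and `approx_purity` are hypothetical.

```python
from difflib import SequenceMatcher


def expand_motif(motif: str, copies: float) -> str:
    """Build the idealized repeat: whole copies plus a partial trailing copy."""
    whole = int(copies)
    partial = round((copies - whole) * len(motif))
    return motif * whole + motif[:partial]


def approx_purity(motif: str, copies: float, ref_span: str) -> float:
    """Percent similarity between the idealized repeat and the reference span.

    Stand-in metric: difflib's ratio, not necessarily what the catalog uses.
    """
    expected = expand_motif(motif, copies)
    matcher = SequenceMatcher(None, expected, ref_span, autojunk=False)
    return 100.0 * matcher.ratio()


perfect = approx_purity("CA", 8.5, "CACACACACACACACAC")   # span is a pure CA repeat
impure = approx_purity("CA", 8.5, "CACACATACACACACAC")    # one substituted base
```

A span that is an exact expansion of the motif scores 100, and any interruption lowers the score.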
(Click the badge to go to the download page)
- Added new annotations sources from:
- See slides for details
- Same file structure as v0.2
(Click the badge to go to the download page)
Much simpler and much smaller data
- Using a pVCF created from 172 haplotype-resolved long-read assemblies, we removed any TR region which had no observed non-SNP variant. This removed 58% of the regions.
- With fewer regions, we simplified the data's structure and now store everything in a single tab-delimited bed-like file with columns:
- chrom - chromosome of TR region
- start - 0-based start position of TR region
- end - 0-based end position of TR region
- annos - JSON containing a list of TRF annotated repeats with structure:
  - chrom - chromosome
  - start - start position of the repeat
  - end - end position of the repeat
  - period - period size of the repeat
  - copies - number of copies of the repeat in the reference
  - score - alignment score
  - entropy - entropy measure based on percent composition
  - repeat - motif of the repeat
Example:
chr11 11605859 11605958 [{"chrom": "chr11", "start": 11605912, "end": 11605927, "period": 2.0, "copies": 8.5, "score": 41, "entropy": 0.99, "repeat": "CA"}, {"chrom": "chr11", "start": 11605930, "end": 11605941, "period": 6.0, "copies": 2.0, "score": 36, "entropy": 1.92, "repeat": "AGCTTC"}]
The single file of regions paired with their annotations allows much easier parsing/usage. A custom parser can easily be built by splitting each line on tabs and using a json parser on the 4th column. A parser has already been built into truvari and can be used via:
from truvari.annotations.trf import iter_tr_regions
# generator for every region
adotto_all_regions = iter_tr_regions("adotto_TRannotations_v0.2.bed.gz")
# fetch regions in a region
adotto_fetch_regions = iter_tr_regions("adotto_TRannotations_v0.2.bed.gz", region=("chr17", 10350000, 10360000))
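For readers who would rather not depend on truvari, the custom parser described above (split each line on tabs, then JSON-parse the 4th column) might look like the following minimal sketch. The names `parse_tr_line` and `iter_bed_regions` are hypothetical, and the sketch assumes the file has no header line.

```python
import gzip
import json


def parse_tr_line(line: str):
    """Split one bed-like line on tabs and JSON-decode the 4th column."""
    chrom, start, end, annos = line.rstrip("\n").split("\t")[:4]
    return chrom, int(start), int(end), json.loads(annos)


def iter_bed_regions(path: str):
    """Yield (chrom, start, end, annos) for every non-empty line in the file.

    bgzip-compressed files are readable with the standard gzip module.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_tr_line(line)


# The example line shown above, with tab separators restored
example_line = (
    "chr11\t11605859\t11605958\t"
    '[{"chrom": "chr11", "start": 11605912, "end": 11605927, "period": 2.0, '
    '"copies": 8.5, "score": 41, "entropy": 0.99, "repeat": "CA"}, '
    '{"chrom": "chr11", "start": 11605930, "end": 11605941, "period": 6.0, '
    '"copies": 2.0, "score": 36, "entropy": 1.92, "repeat": "AGCTTC"}]'
)
chrom, start, end, annos = parse_tr_line(example_line)
motifs = [anno["repeat"] for anno in annos]
```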
Additionally, the file can be queried with tabix
tabix adotto_TRannotations_v0.2.bed.gz chr11:11600000-11606000
(Click the badge to go to the download page)
data_list.txt contains the relative paths to the data generated by this step.
We bundle these files with tar so that they can be extracted in-place into a copy of the code for reanalysis, which also keeps things organized.
The tarball can be created with the command:
cat data_list.txt | tar czvf adotto_regions_data_<version>.tgz -T -
The tarball can be placed inside this directory and extracted in-place via:
tar xzvf adotto_regions_data_<version>.tgz
Note: data_list.txt is listed in .gitignore to help keep git status clean
- It exists
List of files and their descriptions:
data/tr_regions.bed.gz
- Final set of tandem-repeat regions for analysis

data/tr_annotated.bed.gz
- TandemRepeatFinder annotations over the tandem-repeat regions

data/unannotated_regions.bed.gz
- tr_regions.bed.gz which have no accompanying tr_annotated.bed.gz entries

data/merged.slop25.bed.gz
- Merged calls from the sources with 25bp of slop added to each end

data/giab/giab_concat_input.bed.gz
- Raw input provided by GIAB from ftp

data/giab/merged.bed.gz
- GIAB regions merged

data/baylor/grch38.simpleRepeat.truvari.bed.gz
- Raw input provided by baylor from UCSC Simple repeat annotations

data/baylor/merged.bed.gz
- Baylor regions merged

data/pacbio/repeat_catalog.hg38.bed.gz
- Raw input provided by pacbio from Illumina

data/pacbio/merged.bed.gz
- Pacbio regions merged

data/ucsd1/ensembleTR_loci_list.bed.gz
- Raw input provided by UCSD from ensemble

data/ucsd1/merged.bed.gz
- UCSD1 regions merged

data/ucsd2/GIAB_adVNTR_short_VNTR_regions.bed.gz
- Raw input provided by UCSD from ???

data/ucsd2/merged.bed.gz
- UCSD2 regions merged