panREPET
panREPET is described in this preprint available on BioRxiv
Description
Transposable elements (TEs) are mobile DNA elements that can invade genomes by transposition. Despite their reputation as parasitic sequences, they can enrich genomes with functional novelties that foster genome evolution. The impact of TEs in a genome can be explored by searching for their insertions. Individuals of the same species independently undergo TE insertions causing inter-individual genetic variability. This variability between individuals is the basis of the natural selection that leads to an increased adaptation of individuals to their environment. A way to search for the potential role of TEs in host adaptation is through a pangenomic approach. The TE pangenome can be described by (i) TE insertions present in all individuals of the species (core-genome), (ii) insertions present only among a subset of individuals (dispensable-genome) or (iii) ecogenome when the individuals share the same environment, and finally (iv) individual-specific insertions.
A majority of current pangenomic analysis methods are based on the alignment of reads from different genomes of a species to an assembled reference genome. But, the advent of the third-generation sequencing makes now possible to better approach this question using multiple de novo assembled genomes of the same species to avoid the bias introduced by a single reference genome. panREPET is a new pipeline to take in charge this type of data. It compares TE copies between each pair of genomes then identifies TE copies shared by a group of individuals. As it work on multiple assembled genomes, it allows accessing to exact genomic contexts of TE copies and their exact sequences.
Installation
Install Conda (>=4.12.0)
Install Snakemake via mamba (mamba >=0.22.1, snakemake >=7.3.8)
Install Singularity (>=3.8.7-1.el7)
Load the TEfinder2.31 Singularity Image File (SIF) containing the Matcher tool:
cd containers
conda activate snakemake
singularity pull --arch amd64 library://hquesneville/default/te_finder:2.31
For more details about TEfinder and Matcher tools: Quesneville, H., Nouaud, D. & Anxolabéhère, D. Detection of New Transposable Element Families in Drosophila melanogaster and Anopheles gambiae Genomes . J Mol Evol57 (Suppl 1), S50–S59 (2003). https://doi.org/10.1007/s00239-003-0007-2
Directory tree
.
├── README.md
├── Snakefile
├── run_Snakemake.sh
├── scripts
│ ├── extractCopies.py
│ ├── seqIdLenFasta.py
│ ├── fastaLength.py
│ ├── reformat.py
│ ├── launchMinimap2.py
│ ├── minimap2align.py
│ ├── bestHits.py
│ ├── cliques.py
│ ├── uniqueCopies.py
│ └── stats.py
├── config
├── containers
│ └── how_install\_te-finder.txt
├── data
└── envs
How configurate the configfile
Set up the configfile in config/example.yaml:
-
output_directory
: Output directory path, the path must be absolute. -
GFF
: Specify the path to TE annotation from your genomes of interest (GFF format, of the same shape as the GFF from panTEannot). -
Ref
: Specify the path to the reference sequence of your genomes. Each genome for GFF and Ref must have the same identifier. -
Consensus
: Specify the path to the TE consensus library used for TE annotation of genomes (fasta format).
params:
-
cov_consensus:
Select TE copies which cover their consensus betwen x% and (100+(100-x))% (95 by default meaning over 95-105%). The higher the coverage, the more complete copies are selected. -
copies_type
: Specify the complete TE copy type you want to extract. Choose between FLF or FLC (FLC by default). The copies are classified as Full-Length Copies (FLC) when they are aligned (potentially with large gaps if there are insertions) over x%-(100+(100-x))% of their reference sequence and as Full-Length Fragment copies (FLF), when they are unfragmented. -
extension_length:
Flanking regions length of TE copies in base pairs (500 bp by default). -
cov_flank:
Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage of the two flankers from its match (80% by default). This filter diminishes the number of false positive copies. -
cov_match:
Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage of its match (not useful, 0 by default).
params (which are optional):
-
select_region_bed:
Specify regions to select or to mask (bedfile). None by default. None by default means that all submitted genomes (params Ref) are considered. -
select_type:
Select or mask a part of genomes (e.g. you can mask the centromeres). Choose between mask or select (select by default). -
is_chr_level:
You can specify at which chromosomal level TE insertions should be compared (e.g. compare only same chromosomes between genomes). Caution 1: For assemblies that are not on a chromosomal scale, you must set false. Caution 2: If chromosome names are not the same between genomes, the comparison will not be possible.
Example:
output_directory = /absolute_path/my_output_directory
GFF:
genome1: genome1.gff
genome2: genome2.gff
Ref:
genome1: genome1.fa
genome2: genome2.fa
Consensus: my_TE_library.fa
params:
cov_consensus: 95
copies_type: FLC
extension_length: 500
cov_flank: 80
cov_match: 0.0
select_region_bed: centromeres.bed (optional)
select_type: mask (optional)
is_chr_level: False (optional)
The gff files must have this format:
chr1 matcher match 30000 31000 900 - . ID=1;Name=consensus_name_1;Target=consensus_name_1 1 1000;Note=e-value:0.0,identity:95
chr1 matcher match_part 30000 31000 900 - . ID=1.1;Parent=1;Name=consensus_name_1;Target=Bdis_TEdenovoGr-B-G6303-Map3_reversed 1 1000;Note=e-value:0.0,identity:95
centromeres.bed must have this format:
genome chr start end
genome1 1 1000 5000
genome1 2 700 1000
genome2 1 1000 5000
genome2 2 800 1100
During the "extractCopies" step, panREPET does not extract or extracts only the TE copies in the regions specified in select_region_bed params.
tools:
-
container :
Singularity container containing TEfinder 2.31 path (containers/te_finder_2.31.sif)
Build DAG in png
To visualize your DAG (Directed Acyclic Graph):
snakemake --forceall --dag --configfile config/example.yaml | dot -Tpng > dag.png
How to deal with outputs
.
├── benchmarks
├── log
├── extractCopies
├── extendedGFF
├── extendedFasta
├── minimap2
├── matcher
├── bestHits
├── clique
│ └── filter80
│ ├── cliques_copy_filter80.tsv
│ ├── cliques_stats_filter80.tsv
│ └── filter80.stats
├── uniqueCopies
│ └── filter80
│ └── concat_uniqueCopies_filter80.tsv
cliques_copy_filter80.tsv :
this output gives in each line a TE copy belonging to an accession and which has been cliqued (i.e. a copy shared with at least one other accession). The id column indicates the clique ID, i.e which TE copies are shared by the same accessions.
id accession chromosome start end copy_name consensus clique_size pangenomic_compartment
1 genome1 chr1 1 3000 {copy_id}_{consensus_name} consensus_name 2 core
1 genome2 chr1 100 3300 {copy_id}_{consensus_name} consensus_name 2 core
cliques_stats_filter80.tsv :
this output gives in each line a clique (i.e. a shared TE insertion).
id clique_size core/dispensable pangenomic_compartment accessions clique
1 2 core core ['genome1', 'genome2'] ['genome1/chr1/{copy_id}_{consensus_name}/{consensus_name}/1-3000', 'genome2/chr1/{copy_id}_{consensus_name}/{consensus_name}/100-3300']
filter80.stats :
Statistics about cliques (i.e. shared TE insertions).
concat_uniqueCopies_filter80.tsv :
gff file containing TE copies that could not be clicked on (i.e. unique TE insertions which are not shared). The last column gives the name of the accession to which the TE copy belongs.
Execution
You can execute an example based on data you can find in data folder composed of:
- 2 whole-genome assemblies of Brachypodium distachyon from Phytozome database (Gordon et al. 2017) : data/Bdis_TEdenovoGr.fa (Bd21 v3.2) and data/ABR2_337.fa.formated
- their TE annotation from panTEannot
- a TE library built denovo with TEdenovo pipeline from REPET v3.0 (https://urgi.versailles.inra.fr/Tools/REPET, Flutre et al. 2011) on Bd21 v3.2 genome (data/Bdis_TEannotGr_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthCopy.fa)
Please specify in run_Snakemake.sh file, the argument --singularity-args '--bind /home/myhome'
if necessary.
conda activate snakemake
nohup ./run_Snakemake.sh &> test.log &
Expected results are in results folder (results/clique/filter80/cliques_copy_filter80.tsv and results/clique/filter80/cliques_stats_filter80.tsv). The test execution took 45 minutes on 8 cores 16 RAM (specify --cores 8 --resources mem_gb=16). However, note that minimap2 rule is not parallelized per TE copy for a given accession, but is parallelized for the number of accessions compared in pairs.