Compare revisions: asm4pg/GenomAsm4pg
Changes are shown as if the source revision was being merged into the target revision.
Commits on Source (35), showing 794 additions and 372 deletions.
# absolute/relative path to your desired output path
root: .
# absolute path to your desired output path
root: /output/path
####################### optional prejob - data preparation #######################
# path to tar data
data: test_data
data: /path
# list of tar names
get_all_tar_filename: True
tarIDS: []
get_all_tar_filename: False
tarIDS: "tar_filename"
####################### job - workflow #######################
# number of threads used by pigz
pigz_threads: 4
### CONFIG
get_all_filenames: True
IDS: ["sd_0001.ccs", "sd_0002.ccs", "sd_0003.ccs"]
sd_0001.ccs:
run: run001
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
sd_0002.ccs:
run: run002
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
sd_0003.ccs:
run: run003
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
####################### workflow output directories #######################
# results directory
resdir: workflow_results
### PREJOB
# extracted raw data
rawdir: 00_raw_data
bamdir: 00_raw_data/bam_files
fastxdir: 00_raw_data/fastx_files
# extracted input data
rawdir: 00_input_data
bamdir: 00_input_data/bam_files
fastxdir: 00_input_data/fastx_files
### JOB
# QC
@@ -50,11 +50,13 @@
!*.yaml
!Snakefile
!*.smk
!slurm_logs/
!*.svg
# 3) add a pattern to track the file patterns of section2 even if they are in
# subdirectories
!*/
node_modules
node_modules/*
# 4) specific files or folder to TRACK (the '**' sign means 'any path')
......
# requiring the environment of NodeJS LTS
image: node:lts
# add 'node_modules' to cache for speeding up builds
cache:
paths:
- node_modules/ # Node modules and dependencies
before_script:
- npm init --yes
- npm install honkit --save-dev
test:
stage: test
script:
- npx honkit build . public # build to public path
only:
- branches # this job will affect every branch except 'main'
except:
- main
# the 'pages' job will deploy and build your site to the 'public' path
pages:
stage: deploy
script:
- npx honkit build . public # build to public path
- cp -r workflow/doc/fig public/workflow/doc/ # fix missing images asset not copied to public
artifacts:
paths:
- public
expire_in: 1 week
only:
- main # this job will affect only the 'main' branch
# <A HREF="https://forgemia.inra.fr/asm4pg/GenomAsm4pg"> asm4pg </A>
An automatic and reproducible genome assembly workflow for pangenomic applications using PacBio HiFi data.
This workflow uses [Snakemake](https://snakemake.readthedocs.io/en/stable/) to quickly assemble genomes, with an HTML report summarizing the obtained assembly stats.
A first script (`prejob.sh`) takes `.tar` file(s) as input, converts `.bam` files to `.fastq.gz`/`.fasta.gz`, and creates the `00_raw_data` folder with several subfolders (the detailed folder structure is described below). This step can be skipped if the user already has `fasta.gz`/`fastq.gz` files placed in folders with the same structure. `fastq.gz` is mandatory for the raw data QC steps, and `fasta.gz` is mandatory if QC is not required. For a single assembly run with `job.sh`, the user must combine multiple HiFi runs into a single input.
A first script (```prejob.sh```) prepares the data until *fasta.gz* files are obtained. A second script (```job.sh```) runs the genome assembly and stats.
A second script (`job.sh`) runs the genome assembly and stats.
doc: [Gitlab pages](https://asm4pg.pages.mia.inra.fr/genomasm4pg)
![workflow DAG](fig/rule_dag.svg)
![workflow DAG](workflow/doc/fig/rule_dag.svg)
## Table of contents
- [ asm4pg ](#-asm4pg-)
- [Table of contents](#table-of-contents)
- [Repo directory structure](#repo-directory-structure)
- [Requirements](#requirements)
- [Workflow steps, programs \& Docker images pulled by Snakemake](#workflow-steps-programs--docker-images-pulled-by-snakemake)
- [How to run the workflow](#how-to-run-the-workflow)
- [Profile setup](#profile-setup)
- [Workflow execution](#workflow-execution)
- [Running the prejob](#running-the-prejob)
- [Running the main workflow](#running-the-main-workflow)
- [Dry run](#dry-run)
- [Outputs](#outputs)
- [Known problems/errors](#known-problemserrors)
- [HPC](#hpc)
- [BUSCO](#busco)
- [HiFi assembly](#hifi-assembly)
- [Snakemake locked directory](#snakemake-locked-directory)
- [How to cite asm4pg?](#how-to-cite-asm4pg)
- [License](#license)
- [Contacts](#contacts)
[TOC]
## Repo directory structure
```
├── README.md
├── job.sh
├── prejob.sh
├── workflow
│ ├── rules
│ ├── modules
│ ├── scripts
│ ├── pre-job_snakefiles
| └── Snakefile
@@ -60,266 +39,20 @@ A second script (`job.sh`) runs the genome assembly and stats.
```
## Requirements
- snakemake >= 6.5.1
- slurm
- conda
- singularity
## Workflow steps, programs & Docker images pulled by Snakemake
All images here will be pulled automatically by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
**Pre-assembly**
- Conversion of PacBio bam to fasta & fastq
- **smrtlink** (https://www.pacb.com/support/software-downloads/)
- image version: 9.0.0.92188 ([link](https://hub.docker.com/r/bryce911/smrtlink/tags))
- Fastq to fasta conversion
- **seqtk** (https://github.com/lh3/seqtk)
- image version: 1.3--dc0d16b ([link](https://hub.docker.com/r/nanozoo/seqtk))
- Raw data quality control
- **fastqc** (https://github.com/s-andrews/FastQC)
- image version: v0.11.5_cv4 ([link](https://hub.docker.com/r/biocontainers/fastqc/tags))
- **LongQC** (https://github.com/yfukasawa/LongQC)
- image version: latest (April 2022) ([link](https://hub.docker.com/r/grpiccoli/longqc/tags))
- Metrics
- **genometools** (https://github.com/genometools/genometools)
- image version: v1.5.9ds-4-deb_cv1 ([link](https://hub.docker.com/r/biocontainers/genometools/tags))
- K-mer analysis
- **jellyfish** (https://github.com/gmarcais/Jellyfish)
- image version: 2.3.0--h9f5acd7_3 ([link](https://quay.io/repository/biocontainers/kmer-jellyfish?tab=tags))
- **genomescope** (https://github.com/tbenavi1/genomescope2.0)
- image version: 2.0 ([link](https://hub.docker.com/r/abner12/genomescope))
**Assembly**
- Assembly
- **hifiasm** (https://github.com/chhylp123/hifiasm)
- image version: 0.16.1--h5b5514e_1 ([link](https://quay.io/repository/biocontainers/hifiasm?tab=tags))
- Metrics
- **genometools** (same as Pre-assembly)
- Assembly quality control
- **busco** (https://gitlab.com/ezlab/busco)
- image version: v5.3.1_cv1 ([link](https://hub.docker.com/r/ezlabgva/busco/tags))
- **kat** (https://github.com/TGAC/KAT)
- image version: 2.4.1--py35h355e19c_3 ([link](https://quay.io/repository/biocontainers/kat))
- Error rate, QV & phasing
- **meryl** and **merqury** (https://github.com/marbl/meryl, https://github.com/marbl/merqury)
- image version: 1.3--hdfd78af_0 ([link](https://quay.io/repository/biocontainers/merqury?tab=tags))
- Detect assembled telomeres
- **FindTelomeres** (https://github.com/JanaSperschneider/FindTelomeres)
- **Biopython** image version: 1.75 ([link](https://quay.io/repository/biocontainers/biopython?tab=tags))
- Haplotigs and overlaps purging
- **purge_dups** (https://github.com/dfguan/purge_dups)
- image version: 1.2.5--h7132678_2 ([link](https://quay.io/repository/biocontainers/purge_dups?tab=tags))
- **matplotlib** image version: v0.11.5-5-deb-py3_cv1 ([link](https://hub.docker.com/r/biocontainers/matplotlib-venn/tags))
**Report**
- **R markdown**
- image version: 4.0.3 ([link](https://hub.docker.com/r/reslp/rmarkdown/tags))
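If your compute nodes have no direct internet access, you can pre-fetch an image manually with Singularity before the first run. A minimal sketch using the BUSCO image listed above; the registry path is inferred from its Docker Hub link and may differ if you pull from the project's container registry:
```bash
# pre-fetch one of the images listed above so the first workflow run
# does not have to download it (example: the BUSCO image)
singularity pull docker://ezlabgva/busco:v5.3.1_cv1
```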
## How to run the workflow
[wiki](https://forgemia.inra.fr/asm4pg/GenomAsm4pg/-/wikis/home)
### Profile setup
The current profile is made for SLURM. To run this workflow on another HPC, create another profile (https://github.com/Snakemake-Profiles) and add it in the `.config/snakemake_profile` directory. Change the `CLUSTER_CONFIG` and `PROFILE` variables in `job.sh` and `prejob.sh`.
If you are using the current SLURM setup, change line 13 of the `cluster_config.yml` file to your email address.
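For reference, the two variables as they ship in `job.sh` (the same pair exists in `prejob.sh`):
```bash
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
PROFILE=".config/snakemake_profile/slurm"
```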
## Workflow execution
Navigate into the `GenomAsm4pg` directory to run the bash scripts.
## Running the prejob
Create a test_data folder to hold the test data that will be used to run the pipeline.
```
$ mkdir -p test_data
```
Download the test data from `raw.github...` and place it in the `test_data` folder.
Modify the following variables in these files:
`.config/masterconfig.yaml`:
- `root`
- The path where you want the output to be. This can be relative or absolute
- Set this to be the repository folder, `.`.
- `data`
- The path where you want the input data (`.tar`) to be.
- Set this to `test_data`.
- Alternatively, you have the option of running only on user-specified files:
- Setting `get_all_tar_filename: True` will uncompress all tar files.
- If you want to choose the files to uncompress, set `get_all_tar_filename: False` and type out the filenames as a list in `tarIDS`.
`./prejob.sh`:
- Line 17, `#SBATCH --mail-user=`
- Set this to your email address.
- `Module Loading:`
- If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Once these variables have been set, run the following:
```bash
sbatch prejob.sh
```
This will create multiple directories to prepare the data for the workflow. You will end up with a `bam_files` directory containing all _bam_ files (renamed to the tar filename if your data was named "ccs.bam") and a `fastx_files` directory containing all _fasta_ and _fastq_ files. The `extract` directory contains all other files that were in the tarball.
```
workflow_results
└── 00_raw_data
├── bam_files
├── extract
└── fastx_files
```
## Running the main workflow
The `fastx_files` directory will be the starting point for the assembly workflow. You can add other datasets but the workflow needs a _fasta.gz_ file. If _bam_ files or _fastq.gz_ files are available, the workflow runs raw data quality control steps.
You will have to modify other variables in `.config/masterconfig.yaml`:
- Setting `get_all_filenames: True` will take all of the `.fasta.gz` files in the `fastx_files` directory and set them as a list in `IDS`.
- Alternatively, give the fasta filenames as a list in `IDS` to specify files you want to run the pipeline on.
Your config should also follow this template:
```yaml
# default assembly mode
sample_1_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
# trio assembly mode
sample_2_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: path/to/parent/1/reads
p2: path/to/parent/2/reads
# hi-c assembly mode
sample_3_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: path/to/r1/reads
r2: path/to/r2/reads
```
- Make sure to set `sample_1_file_name` to match the file names in the `fastx_files` directory. An example can be seen in the `masterconfig.yaml` file, which is configured to run on the provided test data.
- Choose your run name by setting `run`.
- Specify the organism ploidy with `ploidy`.
- Choose the BUSCO lineage with `busco_lineage`.
- There are 3 modes to run hifiasm. In all cases, the organism has to be sequenced with PacBio HiFi. To choose the mode, set the variable `mode` to one of:
- `default` for a HiFi-only assembly.
- `trio` if you have parental reads (either HiFi or short reads) in addition to the sequencing of the organism.
- Add a key corresponding to your filename and modify the variables `p1` and `p2` to be the parental reads. Supported filetypes are _fasta_, _fasta.gz_, _fastq_ and _fastq.gz_.
- `hi-c` if the organism has been sequenced in paired-end Hi-C as well.
- Add a key corresponding to your filename and modify the variables `r1` and `r2` to be the paired-end Hi-C reads. Supported filetypes are _fasta_, _fasta.gz_, _fastq_ and _fastq.gz_.
Modify the following variables in `./job.sh`:
- Line 17, `#SBATCH --mail-user=`
- Set this to your email address.
- `Module Loading`
- If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Once these variables have been set, run the following:
```bash
sbatch job.sh
```
All the SLURM output logs are in the `slurm_logs` directory. There are `.out` and `.err` files for the workflow (`snakemake.cortex*`) and for each rule (`rulename.cortex*`).
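For example, to follow a running job from the repository root (a sketch; the node name and job ID in the filenames will differ):
```bash
# list the logs produced so far
ls slurm_logs/
# follow the main workflow log as it is written
tail -f slurm_logs/snakemake.*.out
```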
### Dry run
To check if the workflow will run fine, you can do a dry run: uncomment line 56 in `job.sh` and comment line 59, then run
```bash
sbatch job.sh
```
Check the `snakemake.cortex*.out` file in the `slurm_logs` directory; you should see a summary of the workflow.
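For example, a quick way to look at the end of that log, where the summary is printed (a sketch; the exact layout of the summary depends on your Snakemake version):
```bash
# show the last lines of the dry-run log, which contain the job summary
tail -n 40 slurm_logs/snakemake.*.out
```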
### Outputs
These are the directories for the data produced by the workflow:
- An automatic report is generated in each `RUN` directory.
- `01_raw_data_QC` contains all quality control run on the reads. FastQC and LongQC create HTML reports on fastq and bam files respectively, read stats are given by Genometools, and predictions of genome size and heterozygosity are given by Genomescope (in directory `04_kmer`).
- `02_genome_assembly` contains two assemblies. The first, in `01_raw_assembly`, is the assembly obtained with hifiasm. The second, in `02_after_purge_dups_assembly`, is the hifiasm assembly after haplotig removal by purge_dups. Both assemblies have a `01_assembly_QC` directory containing assembly statistics from Genometools (`assembly_stats`), BUSCO analyses (`busco`), k-mer profiles with KAT (`katplot`), completeness and QV stats with Merqury (`merqury`), and assembled telomeres with FindTelomeres (`telomeres`).
```
workflow_results
├── 00_raw_data
└── FILENAME
└── RUN
├── 01_raw_data_QC
│ ├── 01_fastQC
│ ├── 02_longQC
│ ├── 03_genometools
| └── 04_kmer
| └── genomescope
└── 02_genome_assembly
├── 01_raw_assembly
│ ├── 00_assembly
| └── 01_assembly_QC
| ├── assembly_stats
| ├── busco
| ├── katplot
| ├── merqury
| └── telomeres
└── 02_after_purge_dups_assembly
├── 00_assembly
| ├── hap1
| └── hap2
└── 01_assembly_QC
├── assembly_stats
├── busco
├── katplot
├── merqury
└── telomeres
```
## Known problems/errors
### HPC
The workflow does not work on HPCs that do not allow a job to submit other jobs.
### BUSCO
The first time you run the workflow, if there are multiple samples, the BUSCO lineage might be downloaded multiple times. This can create a conflict between the jobs using BUSCO and may interrupt some of them. In that case, you only need to rerun the workflow once everything is done.
### HiFi assembly
If your pipeline fails at the hifiasm step, this may be a result of improper input data being provided. Please make sure that there are no 'N' or undefined bases in your raw data.
### Snakemake locked directory
When you try to rerun the workflow after cancelling a job, you may have to unlock the results directory. To do so, go in `.config/snakemake_profile/slurm` and uncomment line 14 of `config.yaml`. Run the workflow once to unlock the directory (it should only take a few seconds). Still in `config.yaml`, comment line 14. The workflow will be able to run and create outputs.
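Alternatively, the directory can be unlocked from the command line with Snakemake's `--unlock` flag. A minimal sketch, assuming you run it from the repository root with the shipped SLURM profile:
```bash
# one-off unlock of the results directory; --unlock does not execute any rule
snakemake --profile .config/snakemake_profile/slurm --unlock
```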
## How to cite asm4pg?
We are currently writing a publication about asm4pg. Meanwhile, if you use the pipeline, please cite it using the address of this repository.
## License
The content of this repository is licensed under <A HREF="https://choosealicense.com/licenses/gpl-3.0/">GNU GPLv3</A>.
## Contacts
For any troubleshooting, issue or feature suggestion, please use the issue tab of this repository.
For any other question, or if you want to help in developing asm4pg, please contact Ludovic Duvaux at ludovic.duvaux@inrae.fr.
# Summary
* [Introduction](README.md)
* [Documentation summary](workflow/documentation.md)
* [Requirements](workflow/documentation.md#asm4pg-requirements)
* [Tutorials](workflow/documentation.md#tutorials)
* [Quick start](workflow/doc/Quick-start.md)
* [Hi-C mode](workflow/doc/Assembly-Mode/Hi-C-tutorial.md)
* [Trio mode](workflow/doc/Assembly-Mode/Trio-tutorial.md)
* [Outputs](workflow/documentation.md#outputs)
* [Workflow output](workflow/doc/Outputs.md)
* [Optional data preparation](workflow/documentation.md#optional-data-preparation)
* [if your data is in a tarball archive](workflow/doc/Tar-data-preparation.md)
* [Going further](workflow/doc/Going-further.md)
* [Troubleshooting](workflow/documentation.md#known-errors)
* [known errors](workflow/doc/Known-errors.md)
* [Software Dependencies](workflow/documentation.md#programs)
* [Programs listing](workflow/doc/Programs.md)
* [Gitlab pages using honkit](honkit.md)
# HonKit
HonKit builds beautiful books using GitHub/Git and Markdown.
![HonKit Screenshot](./honkit.png)
## Documentation and Demo
HonKit documentation is built by HonKit!
- <https://honkit.netlify.app/>
## Quick Start
### Installation
- Requirement: [Node.js](https://nodejs.org) [LTS](https://nodejs.org/about/releases/) version
The best way to install HonKit is via **NPM** or **Yarn**.
```
$ npm init --yes
$ npm install honkit --save-dev
```
⚠️ Warning:
- If you have installed `honkit` globally, you must install each plugin globally as well
- If you have installed `honkit` locally, you must install each plugin locally as well
We recommend installing `honkit` locally.
### Create a book
HonKit can set up a boilerplate book:
```
$ npx honkit init
```
If you wish to create the book in a new directory, you can do so by running `honkit init ./directory`.
Preview and serve your book using:
```
$ npx honkit serve
```
Or build the static website using:
```
$ npx honkit build
```
You can start to write your book!
For more details, see [HonKit's documentation](https://honkit.netlify.app/).
## Docker support
HonKit provides a Docker image at [honkit/honkit](https://hub.docker.com/r/honkit/honkit).
This docker image includes built-in dependencies for PDF/epub.
```
docker pull honkit/honkit
docker run -v `pwd`:`pwd` -w `pwd` --rm -it honkit/honkit honkit build
docker run -v `pwd`:`pwd` -w `pwd` --rm -it honkit/honkit honkit pdf
```
For more details, see [docker/](./docker/).
## Usage examples
HonKit can be used to create a book, public documentation, enterprise manual, thesis, research papers, etc.
You can find a list of [real-world examples](https://honkit.netlify.app/examples.html) in the documentation.
## Features
* Write using [Markdown](https://honkit.netlify.app/syntax/markdown.html) or [AsciiDoc](https://honkit.netlify.app/syntax/asciidoc.html)
* Output as a website or [ebook (pdf, epub, mobi)](https://honkit.netlify.app/ebook.html)
* [Multi-Languages](https://honkit.netlify.app/languages.html)
* [Lexicon / Glossary](https://honkit.netlify.app/lexicon.html)
* [Cover](https://honkit.netlify.app/ebook.html)
* [Variables and Templating](https://honkit.netlify.app/templating/)
* [Content References](https://honkit.netlify.app/templating/conrefs.html)
* [Plugins](https://honkit.netlify.app/plugins/)
* [Beautiful default theme](./packages/@honkit/theme-default)
## Fork of GitBook
HonKit is a fork of [GitBook (Legacy)](https://github.com/GitbookIO/gitbook).
[GitBook (Legacy)](https://github.com/GitbookIO/gitbook) is [deprecated](https://github.com/GitbookIO/gitbook/commit/6c6ef7f4af32a2977e44dd23d3feb6ebf28970f4) and an inactive project.
HonKit aims to smooth the migration from GitBook (Legacy) to HonKit.
### Compatibility with GitBook
- Almost all plugins work without changes!
- Support `gitbook-plugin-*` packages
- You should install these plugins via npm or yarn
- `npm install gitbook-plugin-<example> --save-dev`
### Differences with GitBook
- Supports Node.js 14+
- Improve `build`/`serve` performance
- `honkit build`: use file cache by default
- `honkit serve`: 28.2s → 0.9s in [examples/benchmark](examples/benchmark)
- Also, support `--reload` flag for force refresh
- Improve plugin loading logic
- Reduce cost of finding `honkit-plugin-*` and `gitbook-plugin-*`
- Support `honkit-plugin-*` and `@scope/honkit-plugin-*` (GitBook does not support a scoped module)
- Remove `install` command
- Instead, just use `npm install` or `yarn install`
- Remove `global-npm` dependency
- You can use HonKit with another npm package manager like `yarn`
- Update dependencies
- Upgrade to nunjucks@2, highlight.js etc...
- It will reduce bugs
- TypeScript
- Rewritten in TypeScript
- Monorepo codebase
- Easy to maintain
- [Docker support](./docker)
### Migration from GitBook
Replace `gitbook-cli` with `honkit`.
```
npm uninstall gitbook-cli
npm install honkit --save-dev
```
Replace `gitbook` command with `honkit` command.
```diff
"scripts": {
- "build": "gitbook build",
+ "build": "honkit build",
- "serve": "gitbook serve"
+ "serve": "honkit serve"
},
```
After that, HonKit just works!
Examples of migration:
- [Add a Github action to deploy · DjangoGirls/tutorial](https://github.com/DjangoGirls/tutorial/pull/1666)
- [Migrate from GitBook to Honkit · swaroopch/byte-of-python](https://github.com/swaroopch/byte-of-python/pull/88)
- [replace Gitbook into Honkit · yamat47/97-things-every-programmer-should-know](https://github.com/yamat47/97-things-every-programmer-should-know/pull/2)
- [Migrate misp-book from GitBook to honkit](https://github.com/MISP/misp-book/pull/227)
## Benchmarks
`honkit build` benchmark:
- <https://honkit.github.io/honkit/dev/bench/>
## Licensing
HonKit is licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for the full license text.
HonKit is a fork of [GitBook (Legacy)](https://github.com/GitbookIO/gitbook).
GitBook is licensed under the Apache License, Version 2.0.
Also, HonKit includes [bignerdranch/gitbook](https://github.com/bignerdranch/gitbook) works.
## Sponsors
<a href="https://www.netlify.com">
<img src="https://www.netlify.com/img/global/badges/netlify-color-bg.svg" alt="Deploys by Netlify" />
</a>
@@ -14,7 +14,7 @@
#SBATCH -o slurm_logs/snakemake.%N.%j.out
#SBATCH -e slurm_logs/snakemake.%N.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ken.smith@plantandfood.co.nz
#SBATCH --mail-user=sukanya.denni@univ-rouen.fr
################################################################################
# Useful information to print
@@ -35,29 +35,16 @@ echo 'scontrol show job:'
scontrol show job $SLURM_JOB_ID
echo '########################################'
## get SNG_BIND abs path using python
function SNG_BIND_ABS_PATH {
SNG_BIND="$(python3 - <<END
import os
abs_path = os.getcwd()
print(abs_path)
END
)"
}
SNG_BIND_ABS_PATH
### variables
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
MAX_CORES=4
MAX_CORES=10
PROFILE=".config/snakemake_profile/slurm"
SMK_PATH="workflow/pre-job_snakefiles"
SNG_BIND="/gpfs/scratch/sdenni/wf/GenomAsm4pg"
### Module Loading:
module purge
module load snakemake
module load singularity
module load snakemake/6.5.1
echo 'Starting Snakemake workflow'
@@ -65,10 +52,12 @@ echo 'Starting Snakemake workflow'
mkdir -p slurm_logs
### Snakemake commands
## Dry run
# snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -n -r
# snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -f print
## Run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG
\ No newline at end of file
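# Pass "dry" as the first argument (sbatch job.sh dry) to perform a dry run;
# any other invocation starts the full run.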
if [ "$1" = "dry" ]
then
# dry run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -n -r
else
# run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG
fi
\ No newline at end of file
@@ -14,7 +14,7 @@
#SBATCH -o slurm_logs/snakemake_prejob.%N.%j.out
#SBATCH -e slurm_logs/snakemake_prejob.%N.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ken.smith@plantandfood.co.nz
#SBATCH --mail-user=sukanya.denni@univ-rouen.fr
################################################################################
# Useful information to print
@@ -35,30 +35,16 @@ echo 'scontrol show job:'
scontrol show job $SLURM_JOB_ID
echo '########################################'
## get SNG_BIND abs path using python
function SNG_BIND_ABS_PATH {
SNG_BIND="$(python3 - <<END
import os
abs_path = os.getcwd()
print(abs_path)
END
)"
}
SNG_BIND_ABS_PATH
### variables
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
MAX_CORES=4
PROFILE=".config/snakemake_profile/slurm"
SMK_PATH="workflow/pre-job_snakefiles"
SNG_BIND="/gpfs/scratch/sdenni/wf/GenomAsm4pg"
### Module Loading:
module purge
module load snakemake
module load singularity
module load snakemake/6.5.1
echo 'Starting Snakemake - data preparation'
......
configfile: ".config/masterconfig.yaml"
res_path=config["root"] + "/" + config["resdir"]
###### Include all scripts & rules necessary to run the workflow ######
### Scripts
# get parameters from masterconfig
include: "scripts/from_config/hifiasm_mode.py"
include: "scripts/from_config/parameters.py"
include: "scripts/from_config/target_list.py"
include: "scripts/path_helper.py"
### paths
if config["root"].startswith("."):
abs_root_path = get_abs_root_path()
res_path = get_res_path()
else:
abs_root_path = config["root"]
res_path = abs_root_path + "/" + config["resdir"]
### Rules
include: "rules/01_pre_asm_qc.smk"
## PRE ASSEMBLY QC
include: "rules/01_qc.smk"
## ASSEMBLY
include: "rules/02_asm.smk"
# Statistics
include: "rules/03_asm_qc.smk"
@@ -24,25 +20,29 @@ include: "rules/03.5_asm_qc_merqury.smk"
# Purging
include: "rules/04_purge_dups.smk"
include: "rules/05_purged_asm_qc.smk"
include: "rules/05.5_pa_qc_merqury.smk"
include: "rules/05.5_purged_asm_qc_merqury.smk"
# Link final assembly
include: "rules/06_sym_link_hap.smk"
# Automatic report
## AUTOMATIC REPORT
include: "rules/07_report.smk"
## runtime
include: "rules/00_runtime.smk"
###### get filenames for workflow ######
if config["get_all_filenames"]:
IDS=get_files_id(abs_root_path + "/" + config["resdir"] + "/" + config["fastxdir"])
else:
IDS=config["IDS"]
bamIDS=check_bam(abs_root_path + "/" + config["resdir"] + "/" + config["bamdir"], IDS)
fastqIDS=check_fastq(abs_root_path + "/" + config["resdir"] + "/" + config["fastxdir"], IDS)
IDS=config["IDS"]
bamIDS=check_bam(IDS)
fastqIDS=check_fastq(IDS)
####
RUNID = run_id(config["IDS"])
BID_RUN = run_BFid(bamIDS)
FID_RUN = run_BFid(fastqIDS)
###### results path ######
res_path=config["root"] + "/" + config["resdir"]
###### Target files ######
### raw data stats
## raw data stats
longqc_output = expand(res_path + "/{Bid}/{run}/01_raw_data_QC/02_longQC", zip,
run=BID_RUN, Bid=bamIDS),
fastqc_output = expand(res_path + "/{Fid}/{run}/01_raw_data_QC/01_fastQC/{Fid}_fastqc.{ext}", zip,
@@ -61,23 +61,24 @@ REP_TRIO_ID = for_report_trio(IDS)
RUNID_TRIO = run_id(REP_TRIO_ID)
BUSCO_LIN_TRIO = busco_lin(REP_TRIO_ID)
report_trio_output = expand(res_path + "/{runid}/report_trio_{id}.{lin}.html", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, lin = BUSCO_LIN_TRIO)
### SYM LINK
# symbolic link to final assembly
## symbolic link to final assembly
symb_link1 = expand(res_path + "/{runid}/{id}_hap{n}.fa", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"])
symb_link2 = expand(res_path + "/{runid}/{id}_hap{n}.fa", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"])
# PURGE_DUPS CUTOFFS GRAPH
## PURGE_DUPS CUTOFFS GRAPH
cut_eval1 = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/00_assembly/{id}_hap{n}/cutoffs_graph_hap{n}.png", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"])
cut_eval2 = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/00_assembly/{id}_hap{n}/cutoffs_graph_hap{n}.png", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"])
# BUSCO
## BUSCO
busco_reg = expand(res_path + "/{runid}/02_genome_assembly/01_raw_assembly/01_assembly_QC/busco/{id}_hap{n}/short_summary.specific.{lin}.{id}_hap{n}.txt", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"], lin = BUSCO_LIN)
busco_purged_reg = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/01_assembly_QC/busco/{id}_purged_hap{n}/short_summary.specific.{lin}.{id}_purged_hap{n}.txt", zip,
@@ -88,6 +89,12 @@ busco_trio = expand(res_path + "/{runid}/02_genome_assembly/01_raw_assembly/01_a
busco_purged_trio = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/01_assembly_QC/busco/{id}_purged_hap{n}/short_summary.specific.{lin}.{id}_purged_hap{n}.txt", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"], lin = BUSCO_LIN_TRIO)
## RUNTIME
time = expand(res_path + "/{runid}/runtime.{id}.{lin}.txt", zip,
runid = RUNID_REG, id=REP_ID, lin=BUSCO_LIN)
time_trio = expand(res_path + "/{runid}/runtime_trio.{id}.{lin}.txt", zip,
runid = RUNID_TRIO, id=REP_TRIO_ID, lin=BUSCO_LIN_TRIO)
rule_all_input_list = [
longqc_output,
fastqc_output,
@@ -100,10 +107,12 @@ rule_all_input_list = [
busco_reg,
busco_purged_reg,
busco_trio,
busco_purged_trio
busco_purged_trio,
time,
time_trio
]
##### target files #####
#### target files
rule all:
input:
all_input = rule_all_input_list
\ No newline at end of file
# Hi-C mode tutorial
Please look at the [quick start](../Quick-start.md) first; some of the steps are omitted here.
This tutorial shows how to use the workflow in Hi-C assembly mode, which takes PacBio HiFi data and Hi-C data as input.
## 1. Config file
**TO-DO : add a toy dataset fasta and hi-c.**
```bash
cd GenomAsm4pg/.config
```
Modify `masterconfig.yaml`. The PacBio HiFi file is `toy_dataset_hi-c.fasta`; its name is used as the key in the config. The Hi-C files are `data_r1.fasta` and `data_r2.fasta`.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
```
## 2. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 3. Run
If the dry run is successful, you can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use parental data, follow the [Trio assembly mode tutorial](Trio-tutorial.md).
To go further with the workflow, go [here](../Going-further.md).
# Trio mode tutorial
Please look at the [quick start](../Quick-start.md) first; some of the steps are omitted here.
This tutorial shows how to use the workflow in trio assembly mode, which takes PacBio HiFi data and parental reads as input.
## 1. Config file
**TO-DO : add a toy dataset fasta and parental fasta.**
```bash
cd GenomAsm4pg/.config
```
Modify `masterconfig.yaml`. The PacBio HiFi file is `toy_dataset_trio.fasta`; its name is used as the key in the config. The parental read files are `data_p1.fasta` and `data_p2.fasta`.
Parental data is used as k-mers; you can use either Illumina or PacBio HiFi reads.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset_trio"]
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
## 2. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 3. Run
If the dry run is successful, you can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use Hi-C data, follow the [Hi-C assembly mode tutorial](Hi-C-tutorial.md).
To go further with the workflow, go [here](../Going-further.md).
# Going further
[TOC]
## 1. Multiple datasets
You can run the workflow on multiple datasets at the same time.
### 1.1. All datasets
With `masterconfig.yaml` as follows, running the workflow will assemble each dataset in its specific assembly mode.
You can add as many datasets as you want, each with different parameters.
```yaml
IDS: ["toy_dataset", "toy_dataset_hi-c", "toy_dataset_trio"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
### 1.2. On chosen datasets
You can remove datasets from `IDS` to assemble only the chosen genomes:
```yaml
IDS: ["toy_dataset", "toy_dataset_trio"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
Running the workflow with this config will assemble only `toy_dataset` and `toy_dataset_trio`.
## 2. Different run names
If you want to try different parameters on the same dataset, changing the run name will create a new directory and keep the previous data.
In the [Hi-C tutorial](Assembly-Mode/Hi-C-tutorial.md), we used the following config.
```yaml
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
```
If you want to compare the Hi-C and default assembly modes, you can run the workflow with a different run name and the default mode.
```yaml
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: default_comparaison
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
You will end up with two sub-directories for `toy_dataset_hi-c` (`hi-c_tutorial` and `default_comparaison`), and the data from the previous Hi-C run is kept.
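After both runs finish, the two run directories sit side by side under the dataset directory. A sketch of the expected layout, assuming the default output directory names:
```bash
# relative to the `root` path set in masterconfig.yaml
ls workflow_results/00_input_data/toy_dataset_hi-c/
# expected: default_comparaison  hi-c_tutorial
```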
## 3. The same dataset with different parameters at once
If you want to do the previous example in one run, you will have to create a symbolic link to the fasta under a different filename (see the sketch at the end of this section).
YAML files do not allow multiple uses of the same key. The following config does not work.
```yaml
## DOES NOT WORK
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_hi-c:
run: default_comparaison
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
**TO COMPLETE**
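A minimal sketch of the symbolic-link approach mentioned above; the `_default` suffix in the new filename is only an example:
```bash
cd GenomAsm4pg/tutorial_data/hi-c
# a second filename pointing at the same reads, usable as a second config key
ln -s toy_dataset_hi-c.fasta toy_dataset_hi-c_default.fasta
```
You can then add `toy_dataset_hi-c_default` to `IDS` and give it its own entry with a different `run` name and `mode: default`.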
## 4. Optional fastq and bam files
If fastq and bam files are available and you want to do raw QC with FastQC and LongQC, add the `fastq` and/or `bam` key to your config. The fasta, fastq and bam files must share the same basename. For example:
```yaml
IDS: ["toy_dataset"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
fastq: "./GenomAsm4pg/tutorial_data/toy_dataset.fastq"
bam: "./GenomAsm4pg/tutorial_data/toy_dataset.bam"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
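A quick sanity check that the three files share the same basename (a sketch using the tutorial paths above):
```bash
ls ./GenomAsm4pg/tutorial_data/toy_dataset.{fasta,fastq,bam}
```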
# Troubleshooting
[TOC]
## One of the BUSCO rules failed
The first time you run the workflow, the BUSCO lineage might be downloaded multiple times. This can create a conflict between the jobs using BUSCO and may interrupt some of them. In that case, you only need to rerun the workflow once everything is done.
## Snakemake locked directory
When you try to rerun the workflow after cancelling a job, you may have to unlock the results directory. To do so, go in `.config/snakemake_profile/slurm` and uncomment line 14 of `config.yaml`. Run the workflow once to unlock the directory (it should only take a few seconds). Still in `config.yaml`, comment line 14. The workflow will be able to run and create outputs.
# Workflow output
[TOC]
## Directories
There are three directories for the data produced by the workflow:
- An automatic report is generated in the `RUN` directory.
- `01_raw_data_QC` contains all quality control run on the reads. FastQC and LongQC create HTML reports on fastq and bam files respectively, read stats are given by Genometools, and predictions of genome size and heterozygosity are given by Genomescope (in directory `04_kmer`).
- `02_genome_assembly` contains two assemblies. The first, in `01_raw_assembly`, is the assembly obtained with hifiasm. The second, in `02_after_purge_dups_assembly`, is the hifiasm assembly after haplotig removal by purge_dups. Both assemblies have a `01_assembly_QC` directory containing assembly statistics from Genometools (`assembly_stats`), BUSCO analyses (`busco`), k-mer profiles with KAT (`katplot`), completeness and QV stats with Merqury (`merqury`), and assembled telomeres with FindTelomeres (`telomeres`).
- `benchmark` contains the runtimes of the main programs.
```
workflow_results
├── 00_input_data
└── FILENAME
└── RUN
├── 01_raw_data_QC
│ ├── 01_fastQC
│ ├── 02_longQC
│ ├── 03_genometools
| └── 04_kmer
| └── genomescope
└── 02_genome_assembly
├── 01_raw_assembly
│ ├── 00_assembly
| └── 01_assembly_QC
| ├── assembly_stats
| ├── busco
| ├── katplot
| ├── merqury
| └── telomeres
└── 02_after_purge_dups_assembly
├── 00_assembly
| ├── hap1
| └── hap2
└── 01_assembly_QC
├── assembly_stats
├── busco
├── katplot
├── merqury
└── telomeres
```
## Additional files
- Symbolic links to haplotype 1 and haplotype 2 assemblies after purge_dups
- HTML report with the main results from each program
- Runtime file with the total workflow runtime for the dataset
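To locate these additional files after a run (a sketch; the actual filenames also include the run name, dataset name and BUSCO lineage):
```bash
# list the per-run HTML reports and runtime files
find workflow_results -name "report*.html" -o -name "runtime*.txt"
```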
# Workflow steps and program versions
All images here will be pulled automatically by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
## 1. Pre-assembly
- Conversion of PacBio bam to fasta & fastq
- [smrtlink](https://www.pacb.com/support/software-downloads/) 9.0.0
- Fastq to fasta conversion
- [seqtk](https://github.com/lh3/seqtk) 1.3
- Raw data quality control
- [fastqc](https://github.com/s-andrews/FastQC) 0.11.5
- [LongQC](https://github.com/yfukasawa/LongQC) 1.2.0c
- Metrics
- [genometools](https://github.com/genometools/genometools) 1.5.9
- K-mer analysis
- [jellyfish](https://github.com/gmarcais/Jellyfish) 2.3.0
- [genomescope](https://github.com/tbenavi1/genomescope2.0) 2.0
## 2. Assembly
- Assembly
- [hifiasm](https://github.com/chhylp123/hifiasm) 0.16.1
- Metrics
- [genometools](https://github.com/genometools/genometools) 1.5.9
- Assembly quality control
- [BUSCO](https://gitlab.com/ezlab/busco) 5.3.1
- [KAT](https://github.com/TGAC/KAT) 2.4.1
- Error rate, QV & phasing
- [meryl](https://github.com/marbl/meryl) and [merqury](https://github.com/marbl/merqury) 1.3
- Detect assembled telomeres
- [FindTelomeres](https://github.com/JanaSperschneider/FindTelomeres)
- **Biopython** 1.75
- Haplotigs and overlaps purging
- [purge_dups](https://github.com/dfguan/purge_dups) 1.2.5
- **matplotlib** 0.11.5
## 3. Report
- **R markdown** 4.0.3
# Docker images
The programs are pulled automatically as images by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
- [smrtlink](https://hub.docker.com/r/bryce911/smrtlink/tags)
- [seqtk](https://hub.docker.com/r/nanozoo/seqtk)
- [fastqc](https://hub.docker.com/r/biocontainers/fastqc/tags)
- [LongQC](https://hub.docker.com/r/grpiccoli/longqc/tags)
- [genometools](https://hub.docker.com/r/biocontainers/genometools/tags)
- [jellyfish](https://quay.io/repository/biocontainers/kmer-jellyfish?tab=tags)
- [genomescope](https://hub.docker.com/r/abner12/genomescope)
- [hifiasm](https://quay.io/repository/biocontainers/hifiasm?tab=tags)
- [BUSCO](https://hub.docker.com/r/ezlabgva/busco/tags)
- [KAT](https://quay.io/repository/biocontainers/kat)
- [meryl and merqury](https://quay.io/repository/biocontainers/merqury?tab=tags)
- [Biopython for FindTelomeres](https://quay.io/repository/biocontainers/biopython?tab=tags)
- [purge_dups](https://quay.io/repository/biocontainers/purge_dups?tab=tags)
- [matplotlib as companion to purge_dups](https://hub.docker.com/r/biocontainers/matplotlib-venn/tags)
- [R markdown](https://hub.docker.com/r/reslp/rmarkdown/tags)
# Quick start
This tutorial shows how to use the workflow in default assembly mode, which takes PacBio HiFi data as input.
[TOC]
## Clone repository
```bash
cd .
git clone https://forgemia.inra.fr/asm4pg/GenomAsm4pg.git
```
## 1. Cluster profile setup
```bash
cd GenomAsm4pg/.config/snakemake_profile
```
The current profile is made for SLURM. If you use it, change line 13 of the `cluster_config.yml` file to your email address.
To run this workflow on another HPC, create another profile (https://github.com/Snakemake-Profiles) and add it in the `.config/snakemake_profile` directory. Change the `CLUSTER_CONFIG` and `PROFILE` variables in `job.sh` and `prejob.sh` scripts.
## 2. Config file
**TO-DO : add a toy fasta.**
```bash
cd ..
```
Modify `masterconfig.yaml`. `root` is the path where the output data will be written.
```yaml
# absolute path to your desired output path
root: ./GenomAsm4pg/tutorial_output
```
The reads file is `toy_dataset.fasta`; its name is used as the key in the config.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
## 3. Create slurm_logs directory
```bash
cd ..
mkdir slurm_logs
```
SLURM logs for each rule will be in this directory; there are `.out` and `.err` files for the workflow (`snakemake.cortex*`) and for each rule (`rulename.cortex*`).
## 4. Mail setup
Modify line 17 to your email address in `job.sh`.
## 5. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 6. Run
If the dry run is successful, check that the `SNG_BIND` variable in `job.sh` is the same as the `root` variable in `masterconfig.yaml`.
If Singularity is not in the HPC environment, add `module load singularity` under `module load snakemake/6.5.1`.
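The module-loading block of `job.sh` would then look like this:
```bash
### Module Loading:
module purge
module load snakemake/6.5.1
module load singularity   # add this line if Singularity is not already available
```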
You can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use additional Hi-C data or parental data, follow the [Hi-C assembly mode tutorial](Assembly-Mode/Hi-C-tutorial.md) or the [Trio assembly mode tutorial](Assembly-Mode/Trio-tutorial.md). To go further with the workflow, go [here](Going-further.md).
# Optional: data preparation
If your data is in a tarball, this companion workflow will extract the data and convert bam files to fastq and fasta if necessary.
[TOC]
## 1. Config file
```bash
cd GenomAsm4pg/.config
```
Modify the `data` variable in `.config/masterconfig.yaml` to be the path to the directory containing all input tar files.
This workflow can automatically determine the names of the files in the specified `data` directory, or run only on given files:
- `get_all_tar_filename: True` will uncompress all tar files. If you want to choose the files to uncompress, use `get_all_tar_filename: False` and give the filenames as a list in `tarIDS`.
## 2. Run
Modify the `SNG_BIND` variable in `prejob.sh`; it has to be the same as the `root` variable in `.config/masterconfig.yaml`. Change line 17 to your email address.
If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Then run
```bash
sbatch prejob.sh
```
## 3. Outputs
This will create multiple directories to prepare the data for the workflow. You will end up with a `bam_files` directory containing all *bam* files (renamed to the tar filename if your data was named "ccs.bam") and a `fastx_files` directory containing all *fasta* and *fastq* files. The `extract` directory contains all other files that were in the tarball.
```
workflow_results
└── 00_raw_data
├── bam_files
├── extract
└── fastx_files
```
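A quick way to confirm that the conversion produced the files the main workflow needs (a sketch, assuming the default directory names above):
```bash
# every dataset should now have a fasta.gz (and, when available, a fastq.gz) here
ls workflow_results/00_raw_data/fastx_files/
```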
# <A HREF="https://forgemia.inra.fr/asm4pg/GenomAsm4pg"> asm4pg </A>
Asm4pg is an automatic and reproducible genome assembly workflow for pangenomic applications using PacBio HiFi data.
doc: [Gitlab pages](https://asm4pg.pages.mia.inra.fr/genomasm4pg)
![workflow DAG](doc/fig/rule_dag.svg)
## Asm4pg Requirements
- snakemake >= 6.5.1
- singularity
The workflow does not work on HPCs that do not allow a job to submit other jobs.
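A quick check that the requirements are available on the cluster front end:
```bash
snakemake --version      # should report 6.5.1 or newer
singularity --version
```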
## Tutorials
The three assembly modes from hifiasm are available.
- [Quick start (default mode)](doc/Quick-start.md)
- [Hi-C mode](doc/Assembly-Mode/Hi-C-tutorial.md)
- [Trio mode](doc/Assembly-Mode/Trio-tutorial.md)
## Outputs
[Workflow outputs](doc/Outputs.md)
## Optional Data Preparation
If your [data is in a tarball](doc/Tar-data-preparation.md)
## Known errors
You may run into [these errors](doc/Known-errors.md)
## Software
[Software used in the workflow](doc/Programs.md)