Compare revisions: asm4pg/GenomAsm4pg
Changes are shown as if the source revision was being merged into the target revision.
Commits on Source (35), showing 794 additions and 372 deletions.
# absolute/relative path to your desired output path
root: .
# absolute path to your desired output path
root: /output/path
####################### optional prejob - data preparation #######################
# path to tar data
data: test_data
data: /path
# list of tar names
get_all_tar_filename: True
tarIDS: []
get_all_tar_filename: False
tarIDS: "tar_filename"
####################### job - workflow #######################
# number of threads used by pigz
pigz_threads: 4
### CONFIG
get_all_filenames: True
IDS: ["sd_0001.ccs", "sd_0002.ccs", "sd_0003.ccs"]
sd_0001.ccs:
run: run001
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
sd_0002.ccs:
run: run002
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
sd_0003.ccs:
run: run003
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
####################### workflow output directories #######################
# results directory
resdir: workflow_results
### PREJOB
# extracted raw data
rawdir: 00_raw_data
bamdir: 00_raw_data/bam_files
fastxdir: 00_raw_data/fastx_files
# extracted input data
rawdir: 00_input_data
bamdir: 00_input_data/bam_files
fastxdir: 00_input_data/fastx_files
### JOB
# QC
@@ -50,11 +50,13 @@
!*.yaml
!Snakefile
!*.smk
!slurm_logs/
!*.svg
# 3) add a pattern to track the file patterns of section2 even if they are in
# subdirectories
!*/
node_modules
node_modules/*
# 4) specific files or folder to TRACK (the '**' sign means 'any path')
......
# requiring the environment of NodeJS LTS
image: node:lts
# add 'node_modules' to cache for speeding up builds
cache:
paths:
- node_modules/ # Node modules and dependencies
before_script:
- npm init --yes
- npm install honkit --save-dev
test:
stage: test
script:
- npx honkit build . public # build to public path
only:
- branches # this job will affect every branch except 'main'
except:
- main
# the 'pages' job will deploy and build your site to the 'public' path
pages:
stage: deploy
script:
- npx honkit build . public # build to public path
- cp -r workflow/doc/fig public/workflow/doc/ # fix missing images asset not copied to public
artifacts:
paths:
- public
expire_in: 1 week
only:
- main # this job will affect only the 'main' branch
# <A HREF="https://forgemia.inra.fr/asm4pg/GenomAsm4pg"> asm4pg </A>
An automatic and reproducible genome assembly workflow for pangenomic applications using PacBio HiFi data.
This workflow uses [Snakemake](https://snakemake.readthedocs.io/en/stable/) to quickly assemble genomes, with an HTML report summarizing the obtained assembly stats.
A first script (`prejob.sh`) takes `.tar` file(s) as input, converts `.bam` files to `.fastq.gz`/`.fasta.gz`, and creates the `00_raw_data` folder with several subfolders (the detailed folder structure is described below). This step can be skipped if the user already has `fasta.gz`/`fastq.gz` files placed in folders with the same structure. `fastq.gz` is mandatory for the raw data QC steps, and `fasta.gz` is mandatory if QC is not required. For a single assembly run with `job.sh`, the user must combine multiple HiFi runs into a single input.
A first script (```prejob.sh```) prepares the data until *fasta.gz* files are obtained. A second script (```job.sh```) runs the genome assembly and stats.
A second script (`job.sh`) runs the genome assembly and stats.
doc: [Gitlab pages](https://asm4pg.pages.mia.inra.fr/genomasm4pg)
![workflow DAG](fig/rule_dag.svg)
![workflow DAG](workflow/doc/fig/rule_dag.svg)
## Table of contents
- [ asm4pg ](#-asm4pg-)
- [Table of contents](#table-of-contents)
- [Repo directory structure](#repo-directory-structure)
- [Requirements](#requirements)
- [Workflow steps, programs \& Docker images pulled by Snakemake](#workflow-steps-programs--docker-images-pulled-by-snakemake)
- [How to run the workflow](#how-to-run-the-workflow)
- [Profile setup](#profile-setup)
- [Workflow execution](#workflow-execution)
- [Running the prejob](#running-the-prejob)
- [Running the main workflow](#running-the-main-workflow)
- [Dry run](#dry-run)
- [Outputs](#outputs)
- [Known problems/errors](#known-problemserrors)
- [HPC](#hpc)
- [BUSCO](#busco)
- [HiFi assembly](#hifi-assembly)
- [Snakemake locked directory](#snakemake-locked-directory)
- [How to cite asm4pg?](#how-to-cite-asm4pg)
- [License](#license)
- [Contacts](#contacts)
[TOC]
## Repo directory structure
```
├── README.md
├── job.sh
├── prejob.sh
├── workflow
│ ├── rules
│ ├── modules
│ ├── scripts
│ ├── pre-job_snakefiles
| └── Snakefile
@@ -60,266 +39,20 @@ A second script (`job.sh`) runs the genome assembly and stats.
```
## Requirements
- snakemake >= 6.5.1
- slurm
- conda
- singularity
## Workflow steps, programs & Docker images pulled by Snakemake
All images here will be pulled automatically by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
**Pre-assembly**
- Conversion of PacBio bam to fasta & fastq
- **smrtlink** (https://www.pacb.com/support/software-downloads/)
- image version: 9.0.0.92188 ([link](https://hub.docker.com/r/bryce911/smrtlink/tags))
- Fastq to fasta conversion
- **seqtk** (https://github.com/lh3/seqtk)
- image version: 1.3--dc0d16b ([link](https://hub.docker.com/r/nanozoo/seqtk))
- Raw data quality control
- **fastqc** (https://github.com/s-andrews/FastQC)
- image version: v0.11.5_cv4 ([link](https://hub.docker.com/r/biocontainers/fastqc/tags))
- **LongQC** (https://github.com/yfukasawa/LongQC)
- image version: latest (April 2022) ([link](https://hub.docker.com/r/grpiccoli/longqc/tags))
- Metrics
- **genometools** (https://github.com/genometools/genometools)
- image version: v1.5.9ds-4-deb_cv1 ([link](https://hub.docker.com/r/biocontainers/genometools/tags))
- K-mer analysis
- **jellyfish** (https://github.com/gmarcais/Jellyfish)
- image version: 2.3.0--h9f5acd7_3 ([link](https://quay.io/repository/biocontainers/kmer-jellyfish?tab=tags))
- **genomescope** (https://github.com/tbenavi1/genomescope2.0)
- image version: 2.0 ([link](https://hub.docker.com/r/abner12/genomescope))
**Assembly**
- Assembly
- **hifiasm** (https://github.com/chhylp123/hifiasm)
- image version: 0.16.1--h5b5514e_1 ([link](https://quay.io/repository/biocontainers/hifiasm?tab=tags))
- Metrics
- **genometools** (same as Pre-assembly)
- Assembly quality control
- **busco** (https://gitlab.com/ezlab/busco)
- image version: v5.3.1_cv1 ([link](https://hub.docker.com/r/ezlabgva/busco/tags))
- **kat** (https://github.com/TGAC/KAT)
- image version: 2.4.1--py35h355e19c_3 ([link](https://quay.io/repository/biocontainers/kat))
- Error rate, QV & phasing
- **meryl** and **merqury** (https://github.com/marbl/meryl, https://github.com/marbl/merqury)
- image version: 1.3--hdfd78af_0 ([link](https://quay.io/repository/biocontainers/merqury?tab=tags))
- Detect assembled telomeres
- **FindTelomeres** (https://github.com/JanaSperschneider/FindTelomeres)
- **Biopython** image version: 1.75 ([link](https://quay.io/repository/biocontainers/biopython?tab=tags))
- Haplotigs and overlaps purging
- **purge_dups** (https://github.com/dfguan/purge_dups)
- image version: 1.2.5--h7132678_2 ([link](https://quay.io/repository/biocontainers/purge_dups?tab=tags))
- **matplotlib** image version: v0.11.5-5-deb-py3_cv1 ([link](https://hub.docker.com/r/biocontainers/matplotlib-venn/tags))
**Report**
- **R markdown**
- image version: 4.0.3 ([link](https://hub.docker.com/r/reslp/rmarkdown/tags))
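If your compute nodes have no direct internet access, you can pre-fetch an image manually with Singularity before the first run. A minimal sketch using the BUSCO image listed above; the registry path is inferred from its Docker Hub link and may differ if you pull from the project's container registry:
```bash
# pre-fetch one of the images listed above so the first workflow run
# does not have to download it (example: the BUSCO image)
singularity pull docker://ezlabgva/busco:v5.3.1_cv1
```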
## How to run the workflow
[wiki](https://forgemia.inra.fr/asm4pg/GenomAsm4pg/-/wikis/home)
### Profile setup
The current profile is made for SLURM. To run this workflow on another HPC, create another profile (https://github.com/Snakemake-Profiles) and add it in the `.config/snakemake_profile` directory. Change the `CLUSTER_CONFIG` and `PROFILE` variables in `job.sh` and `prejob.sh`.
If you are using the current SLURM setup, change line 13 of the `cluster_config.yml` file to your email address.
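For reference, the two variables as they ship in `job.sh` (the same pair exists in `prejob.sh`):
```bash
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
PROFILE=".config/snakemake_profile/slurm"
```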
## Workflow execution
Navigate into the `GenomAsm4pg` directory to run the bash scripts.
## Running the prejob
Create a test_data folder to hold the test data that will be used to run the pipeline.
```
$ mkdir -p test_data
```
Download the test data from `raw.github...` and place it in the `test_data` folder.
Modify the following variables in these files:
`.config/masterconfig.yaml`:
- `root`
- The path where you want the output to be. This can be relative or absolute
- Set this to be the repository folder, `.`.
- `data`
- The path where you want the input data (`.tar`) to be.
- Set this to `test_data`.
- Alternatively, you have the option of running only on user-specified files:
- Setting `get_all_tar_filename: True` will uncompress all tar files.
- If you want to choose the files to uncompress, set `get_all_tar_filename: False` and type out the filenames as a list in `tarIDS`.
`./prejob.sh`:
- Line 17, `#SBATCH --mail-user=`
- Set this to your email address.
- `Module Loading:`
- If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Once these variables have been set, run the following:
```bash
sbatch prejob.sh
```
This will create multiple directories to prepare the data for the workflow. You will end up with a `bam_files` directory containing all _bam_ files (renamed to the tar filename if your data was named "ccs.bam") and a `fastx_files` directory containing all _fasta_ and _fastq_ files. The `extract` directory contains all other files that were in the tarball.
```
workflow_results
└── 00_raw_data
├── bam_files
├── extract
└── fastx_files
```
## Running the main workflow
The `fastx_files` directory will be the starting point for the assembly workflow. You can add other datasets but the workflow needs a _fasta.gz_ file. If _bam_ files or _fastq.gz_ files are available, the workflow runs raw data quality control steps.
You will have to modify other variables in `.config/masterconfig.yaml`:
- Setting `get_all_filenames: True` will take all of the `.fasta.gz` files in the `fastx_files` directory and set them as a list in `IDS`.
- Alternatively, give the fasta filenames as a list in `IDS` to specify files you want to run the pipeline on.
Your config should also follow this template:
```yaml
# default assembly mode
sample_1_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
# trio assembly mode
sample_2_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: path/to/parent/1/reads
p2: path/to/parent/2/reads
# hi-c assembly mode
sample_3_file_name:
run: name
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: path/to/r1/reads
r2: path/to/r2/reads
```
- Make sure to set `sample_1_file_name` to match the file names in the `fastx_files` directory. An example can be seen in the `masterconfig.yaml` file, which is configured to run on the provided test data.
- Choose your run name by setting `run`.
- Specify the organism ploidy with `ploidy`.
- Choose the BUSCO lineage with `busco_lineage`.
- There are 3 modes to run hifiasm. In all cases, the organism has to be sequenced with PacBio HiFi. To choose the mode, set the variable `mode` to one of:
- `default` for a HiFi-only assembly.
- `trio` if you have parental reads (either HiFi or short reads) in addition to the sequencing of the organism.
- Add a key corresponding to your filename and modify the variables `p1` and `p2` to be the parental reads. Supported filetypes are _fasta_, _fasta.gz_, _fastq_ and _fastq.gz_.
- `hi-c` if the organism has been sequenced in paired-end Hi-C as well.
- Add a key corresponding to your filename and modify the variables `r1` and `r2` to be the paired-end Hi-C reads. Supported filetypes are _fasta_, _fasta.gz_, _fastq_ and _fastq.gz_.
Modify the following variables in `./job.sh`:
- Line 17, `#SBATCH --mail-user=`
- Set this to your email address.
- `Module Loading`
- If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Once these variables have been set, run the following:
```bash
sbatch job.sh
```
All the SLURM output logs are in the `slurm_logs` directory. There are `.out` and `.err` files for the workflow (`snakemake.cortex*`) and for each rule (`rulename.cortex*`).
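For example, to follow a running job from the repository root (a sketch; the node name and job ID in the filenames will differ):
```bash
# list the logs produced so far
ls slurm_logs/
# follow the main workflow log as it is written
tail -f slurm_logs/snakemake.*.out
```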
### Dry run
To check if the workflow will run fine, you can do a dry run: uncomment line 56 in `job.sh` and comment line 59, then run
```bash
sbatch job.sh
```
Check the `snakemake.cortex*.out` file in the `slurm_logs` directory; you should see a summary of the workflow.
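For example, a quick way to look at the end of that log, where the summary is printed (a sketch; the exact layout of the summary depends on your Snakemake version):
```bash
# show the last lines of the dry-run log, which contain the job summary
tail -n 40 slurm_logs/snakemake.*.out
```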
### Outputs
These are the directories for the data produced by the workflow:
- An automatic report is generated in each `RUN` directory.
- `01_raw_data_QC` contains all quality control run on the reads. FastQC and LongQC create HTML reports on fastq and bam files respectively, read stats are given by Genometools, and predictions of genome size and heterozygosity are given by Genomescope (in directory `04_kmer`).
- `02_genome_assembly` contains two assemblies. The first, in `01_raw_assembly`, is the assembly obtained with hifiasm. The second, in `02_after_purge_dups_assembly`, is the hifiasm assembly after haplotig removal by purge_dups. Both assemblies have a `01_assembly_QC` directory containing assembly statistics from Genometools (`assembly_stats`), BUSCO analyses (`busco`), k-mer profiles with KAT (`katplot`), completeness and QV stats with Merqury (`merqury`), and assembled telomeres with FindTelomeres (`telomeres`).
```
workflow_results
├── 00_raw_data
└── FILENAME
└── RUN
├── 01_raw_data_QC
│ ├── 01_fastQC
│ ├── 02_longQC
│ ├── 03_genometools
| └── 04_kmer
| └── genomescope
└── 02_genome_assembly
├── 01_raw_assembly
│ ├── 00_assembly
| └── 01_assembly_QC
| ├── assembly_stats
| ├── busco
| ├── katplot
| ├── merqury
| └── telomeres
└── 02_after_purge_dups_assembly
├── 00_assembly
| ├── hap1
| └── hap2
└── 01_assembly_QC
├── assembly_stats
├── busco
├── katplot
├── merqury
└── telomeres
```
## Known problems/errors
### HPC
The workflow does not work on HPCs that do not allow a job to submit other jobs.
### BUSCO
The first time you run the workflow, if there are multiple samples, the BUSCO lineage might be downloaded multiple times. This can create a conflict between the jobs using BUSCO and may interrupt some of them. In that case, you only need to rerun the workflow once everything is done.
### HiFi assembly
If your pipeline fails at the hifiasm step, this may be a result of improper input data being provided. Please make sure that there are no 'N' or undefined bases in your raw data.
### Snakemake locked directory
When you try to rerun the workflow after cancelling a job, you may have to unlock the results directory. To do so, go in `.config/snakemake_profile/slurm` and uncomment line 14 of `config.yaml`. Run the workflow once to unlock the directory (it should only take a few seconds). Still in `config.yaml`, comment line 14. The workflow will be able to run and create outputs.
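Alternatively, the directory can be unlocked from the command line with Snakemake's `--unlock` flag. A minimal sketch, assuming you run it from the repository root with the shipped SLURM profile:
```bash
# one-off unlock of the results directory; --unlock does not execute any rule
snakemake --profile .config/snakemake_profile/slurm --unlock
```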
## How to cite asm4pg?
We are currently writing a publication about asm4pg. Meanwhile, if you use the pipeline, please cite it using the address of this repository.
## License
The content of this repository is licensed under <A HREF="https://choosealicense.com/licenses/gpl-3.0/">GNU GPLv3</A>.
## Contacts
For any troubleshooting, issue or feature suggestion, please use the issue tab of this repository.
For any other question, or if you want to help in developing asm4pg, please contact Ludovic Duvaux at ludovic.duvaux@inrae.fr.
# Summary
* [Introduction](README.md)
* [Documentation summary](workflow/documentation.md)
* [Requirements](workflow/documentation.md#asm4pg-requirements)
* [Tutorials](workflow/documentation.md#tutorials)
* [Quick start](workflow/doc/Quick-start.md)
* [Hi-C mode](workflow/doc/Assembly-Mode/Hi-C-tutorial.md)
* [Trio mode](workflow/doc/Assembly-Mode/Trio-tutorial.md)
* [Outputs](workflow/documentation.md#outputs)
* [Workflow output](workflow/doc/Outputs.md)
* [Optional data preparation](workflow/documentation.md#optional-data-preparation)
* [if your data is in a tarball archive](workflow/doc/Tar-data-preparation.md)
* [Going further](workflow/doc/Going-further.md)
* [Troubleshooting](workflow/documentation.md#known-errors)
* [known errors](workflow/doc/Known-errors.md)
* [Software Dependencies](workflow/documentation.md#programs)
* [Programs listing](workflow/doc/Programs.md)
* [Gitlab pages using honkit](honkit.md)
# HonKit
HonKit builds beautiful books using GitHub/Git and Markdown.
![HonKit Screenshot](./honkit.png)
## Documentation and Demo
HonKit documentation is built by HonKit!
- <https://honkit.netlify.app/>
## Quick Start
### Installation
- Requirement: [Node.js](https://nodejs.org) [LTS](https://nodejs.org/about/releases/) version
The best way to install HonKit is via **NPM** or **Yarn**.
```
$ npm init --yes
$ npm install honkit --save-dev
```
⚠️ Warning:
- If you have installed `honkit` globally, you must install each plugin globally as well
- If you have installed `honkit` locally, you must install each plugin locally as well
We recommend installing `honkit` locally.
### Create a book
HonKit can set up a boilerplate book:
```
$ npx honkit init
```
If you wish to create the book in a new directory, you can do so by running `honkit init ./directory`.
Preview and serve your book using:
```
$ npx honkit serve
```
Or build the static website using:
```
$ npx honkit build
```
You can start to write your book!
For more details, see [HonKit's documentation](https://honkit.netlify.app/).
## Docker support
HonKit provides a Docker image at [honkit/honkit](https://hub.docker.com/r/honkit/honkit).
This docker image includes built-in dependencies for PDF/epub.
```
docker pull honkit/honkit
docker run -v `pwd`:`pwd` -w `pwd` --rm -it honkit/honkit honkit build
docker run -v `pwd`:`pwd` -w `pwd` --rm -it honkit/honkit honkit pdf
```
For more details, see [docker/](./docker/).
## Usage examples
HonKit can be used to create a book, public documentation, enterprise manual, thesis, research papers, etc.
You can find a list of [real-world examples](https://honkit.netlify.app/examples.html) in the documentation.
## Features
* Write using [Markdown](https://honkit.netlify.app/syntax/markdown.html) or [AsciiDoc](https://honkit.netlify.app/syntax/asciidoc.html)
* Output as a website or [ebook (pdf, epub, mobi)](https://honkit.netlify.app/ebook.html)
* [Multi-Languages](https://honkit.netlify.app/languages.html)
* [Lexicon / Glossary](https://honkit.netlify.app/lexicon.html)
* [Cover](https://honkit.netlify.app/ebook.html)
* [Variables and Templating](https://honkit.netlify.app/templating/)
* [Content References](https://honkit.netlify.app/templating/conrefs.html)
* [Plugins](https://honkit.netlify.app/plugins/)
* [Beautiful default theme](./packages/@honkit/theme-default)
## Fork of GitBook
HonKit is a fork of [GitBook (Legacy)](https://github.com/GitbookIO/gitbook).
[GitBook (Legacy)](https://github.com/GitbookIO/gitbook) is [deprecated](https://github.com/GitbookIO/gitbook/commit/6c6ef7f4af32a2977e44dd23d3feb6ebf28970f4) and an inactive project.
HonKit aims to smooth the migration from GitBook (Legacy) to HonKit.
### Compatibility with GitBook
- Almost all plugins work without changes!
- Support `gitbook-plugin-*` packages
- You should install these plugins via npm or yarn
- `npm install gitbook-plugin-<example> --save-dev`
### Differences with GitBook
- Supports Node.js 14+
- Improve `build`/`serve` performance
- `honkit build`: use file cache by default
- `honkit serve`: 28.2s → 0.9s in [examples/benchmark](examples/benchmark)
- Also, support `--reload` flag for force refresh
- Improve plugin loading logic
- Reduce cost of finding `honkit-plugin-*` and `gitbook-plugin-*`
- Support `honkit-plugin-*` and `@scope/honkit-plugin-*` (GitBook does not support a scoped module)
- Remove `install` command
- Instead, just use `npm install` or `yarn install`
- Remove `global-npm` dependency
- You can use HonKit with another npm package manager like `yarn`
- Update dependencies
- Upgrade to nunjucks@2, highlight.js etc...
- It will reduce bugs
- TypeScript
- Rewritten in TypeScript
- Monorepo codebase
- Easy to maintain
- [Docker support](./docker)
### Migration from GitBook
Replace `gitbook-cli` with `honkit`.
```
npm uninstall gitbook-cli
npm install honkit --save-dev
```
Replace `gitbook` command with `honkit` command.
```diff
"scripts": {
- "build": "gitbook build",
+ "build": "honkit build",
- "serve": "gitbook serve"
+ "serve": "honkit serve"
},
```
After that, HonKit just works!
Examples of migration:
- [Add a Github action to deploy · DjangoGirls/tutorial](https://github.com/DjangoGirls/tutorial/pull/1666)
- [Migrate from GitBook to Honkit · swaroopch/byte-of-python](https://github.com/swaroopch/byte-of-python/pull/88)
- [replace Gitbook into Honkit · yamat47/97-things-every-programmer-should-know](https://github.com/yamat47/97-things-every-programmer-should-know/pull/2)
- [Migrate misp-book from GitBook to honkit](https://github.com/MISP/misp-book/pull/227)
## Benchmarks
`honkit build` benchmark:
- <https://honkit.github.io/honkit/dev/bench/>
## Licensing
HonKit is licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for the full license text.
HonKit is a fork of [GitBook (Legacy)](https://github.com/GitbookIO/gitbook).
GitBook is licensed under the Apache License, Version 2.0.
Also, HonKit includes [bignerdranch/gitbook](https://github.com/bignerdranch/gitbook) works.
## Sponsors
<a href="https://www.netlify.com">
<img src="https://www.netlify.com/img/global/badges/netlify-color-bg.svg" alt="Deploys by Netlify" />
</a>
@@ -14,7 +14,7 @@
#SBATCH -o slurm_logs/snakemake.%N.%j.out
#SBATCH -e slurm_logs/snakemake.%N.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ken.smith@plantandfood.co.nz
#SBATCH --mail-user=sukanya.denni@univ-rouen.fr
################################################################################
# Useful information to print
@@ -35,29 +35,16 @@ echo 'scontrol show job:'
scontrol show job $SLURM_JOB_ID
echo '########################################'
## get SNG_BIND abs path using python
function SNG_BIND_ABS_PATH {
SNG_BIND="$(python3 - <<END
import os
abs_path = os.getcwd()
print(abs_path)
END
)"
}
SNG_BIND_ABS_PATH
### variables
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
MAX_CORES=4
MAX_CORES=10
PROFILE=".config/snakemake_profile/slurm"
SMK_PATH="workflow/pre-job_snakefiles"
SNG_BIND="/gpfs/scratch/sdenni/wf/GenomAsm4pg"
### Module Loading:
module purge
module load snakemake
module load singularity
module load snakemake/6.5.1
echo 'Starting Snakemake workflow'
@@ -65,10 +52,12 @@ echo 'Starting Snakemake workflow'
mkdir -p slurm_logs
### Snakemake commands
## Dry run
# snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -n -r
# snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -f print
## Run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG
\ No newline at end of file
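# Pass "dry" as the first argument (sbatch job.sh dry) to perform a dry run;
# any other invocation starts the full run.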
if [ "$1" = "dry" ]
then
# dry run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG -n -r
else
# run
snakemake --profile $PROFILE -j $MAX_CORES --use-singularity --singularity-args "-B $SNG_BIND" --cluster-config $CLUSTER_CONFIG
fi
\ No newline at end of file
@@ -14,7 +14,7 @@
#SBATCH -o slurm_logs/snakemake_prejob.%N.%j.out
#SBATCH -e slurm_logs/snakemake_prejob.%N.%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ken.smith@plantandfood.co.nz
#SBATCH --mail-user=sukanya.denni@univ-rouen.fr
################################################################################
# Useful information to print
@@ -35,30 +35,16 @@ echo 'scontrol show job:'
scontrol show job $SLURM_JOB_ID
echo '########################################'
## get SNG_BIND abs path using python
function SNG_BIND_ABS_PATH {
SNG_BIND="$(python3 - <<END
import os
abs_path = os.getcwd()
print(abs_path)
END
)"
}
SNG_BIND_ABS_PATH
### variables
CLUSTER_CONFIG=".config/snakemake_profile/slurm/cluster_config.yml"
MAX_CORES=4
PROFILE=".config/snakemake_profile/slurm"
SMK_PATH="workflow/pre-job_snakefiles"
SNG_BIND="/gpfs/scratch/sdenni/wf/GenomAsm4pg"
### Module Loading:
module purge
module load snakemake
module load singularity
module load snakemake/6.5.1
echo 'Starting Snakemake - data preparation'
......
configfile: ".config/masterconfig.yaml"
res_path=config["root"] + "/" + config["resdir"]
###### Include all scripts & rules necessary to run the workflow ######
### Scripts
# get parameters from masterconfig
include: "scripts/from_config/hifiasm_mode.py"
include: "scripts/from_config/parameters.py"
include: "scripts/from_config/target_list.py"
include: "scripts/path_helper.py"
### paths
if config["root"].startswith("."):
abs_root_path = get_abs_root_path()
res_path = get_res_path()
else:
abs_root_path = config["root"]
res_path = abs_root_path + "/" + config["resdir"]
### Rules
include: "rules/01_pre_asm_qc.smk"
## PRE ASSEMBLY QC
include: "rules/01_qc.smk"
## ASSEMBLY
include: "rules/02_asm.smk"
# Statistics
include: "rules/03_asm_qc.smk"
@@ -24,25 +20,29 @@ include: "rules/03.5_asm_qc_merqury.smk"
# Purging
include: "rules/04_purge_dups.smk"
include: "rules/05_purged_asm_qc.smk"
include: "rules/05.5_pa_qc_merqury.smk"
include: "rules/05.5_purged_asm_qc_merqury.smk"
# Link final assembly
include: "rules/06_sym_link_hap.smk"
# Automatic report
## AUTOMATIC REPORT
include: "rules/07_report.smk"
## runtime
include: "rules/00_runtime.smk"
###### get filenames for workflow ######
if config["get_all_filenames"]:
IDS=get_files_id(abs_root_path + "/" + config["resdir"] + "/" + config["fastxdir"])
else:
IDS=config["IDS"]
bamIDS=check_bam(abs_root_path + "/" + config["resdir"] + "/" + config["bamdir"], IDS)
fastqIDS=check_fastq(abs_root_path + "/" + config["resdir"] + "/" + config["fastxdir"], IDS)
IDS=config["IDS"]
bamIDS=check_bam(IDS)
fastqIDS=check_fastq(IDS)
####
RUNID = run_id(config["IDS"])
BID_RUN = run_BFid(bamIDS)
FID_RUN = run_BFid(fastqIDS)
###### results path ######
res_path=config["root"] + "/" + config["resdir"]
###### Target files ######
### raw data stats
## raw data stats
longqc_output = expand(res_path + "/{Bid}/{run}/01_raw_data_QC/02_longQC", zip,
run=BID_RUN, Bid=bamIDS),
fastqc_output = expand(res_path + "/{Fid}/{run}/01_raw_data_QC/01_fastQC/{Fid}_fastqc.{ext}", zip,
@@ -61,23 +61,24 @@ REP_TRIO_ID = for_report_trio(IDS)
RUNID_TRIO = run_id(REP_TRIO_ID)
BUSCO_LIN_TRIO = busco_lin(REP_TRIO_ID)
report_trio_output = expand(res_path + "/{runid}/report_trio_{id}.{lin}.html", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, lin = BUSCO_LIN_TRIO)
### SYM LINK
# symbolic link to final assembly
## symbolic link to final assembly
symb_link1 = expand(res_path + "/{runid}/{id}_hap{n}.fa", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"])
symb_link2 = expand(res_path + "/{runid}/{id}_hap{n}.fa", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"])
# PURGE_DUPS CUTOFFS GRAPH
## PURGE_DUPS CUTOFFS GRAPH
cut_eval1 = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/00_assembly/{id}_hap{n}/cutoffs_graph_hap{n}.png", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"])
cut_eval2 = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/00_assembly/{id}_hap{n}/cutoffs_graph_hap{n}.png", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"])
# BUSCO
## BUSCO
busco_reg = expand(res_path + "/{runid}/02_genome_assembly/01_raw_assembly/01_assembly_QC/busco/{id}_hap{n}/short_summary.specific.{lin}.{id}_hap{n}.txt", zip,
runid=RUNID_REG, id=REP_ID, n=["1", "2"], lin = BUSCO_LIN)
busco_purged_reg = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/01_assembly_QC/busco/{id}_purged_hap{n}/short_summary.specific.{lin}.{id}_purged_hap{n}.txt", zip,
@@ -88,6 +89,12 @@ busco_trio = expand(res_path + "/{runid}/02_genome_assembly/01_raw_assembly/01_a
busco_purged_trio = expand(res_path + "/{runid}/02_genome_assembly/02_after_purge_dups_assembly/01_assembly_QC/busco/{id}_purged_hap{n}/short_summary.specific.{lin}.{id}_purged_hap{n}.txt", zip,
runid=RUNID_TRIO, id=REP_TRIO_ID, n=["1", "2"], lin = BUSCO_LIN_TRIO)
## RUNTIME
time = expand(res_path + "/{runid}/runtime.{id}.{lin}.txt", zip,
runid = RUNID_REG, id=REP_ID, lin=BUSCO_LIN)
time_trio = expand(res_path + "/{runid}/runtime_trio.{id}.{lin}.txt", zip,
runid = RUNID_TRIO, id=REP_TRIO_ID, lin=BUSCO_LIN_TRIO)
rule_all_input_list = [
longqc_output,
fastqc_output,
@@ -100,10 +107,12 @@ rule_all_input_list = [
busco_reg,
busco_purged_reg,
busco_trio,
busco_purged_trio
busco_purged_trio,
time,
time_trio
]
##### target files #####
#### target files
rule all:
input:
all_input = rule_all_input_list
\ No newline at end of file
# Hi-C mode tutorial
Please look at the [quick start](../Quick-start.md) first; some of the steps are omitted here.
This tutorial shows how to use the workflow in Hi-C assembly mode, which takes PacBio HiFi data and Hi-C data as input.
## 1. Config file
**TO-DO : add a toy dataset fasta and hi-c.**
```bash
cd GenomAsm4pg/.config
```
Modify `masterconfig.yaml`. The PacBio HiFi file is `toy_dataset_hi-c.fasta`; its name is used as the key in the config. The Hi-C files are `data_r1.fasta` and `data_r2.fasta`.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
```
## 2. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 3. Run
If the dry run is successful, you can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use parental data, follow the [Trio assembly mode tutorial](Trio-tutorial.md).
To go further with the workflow, go [here](../Going-further.md).
# Trio mode tutorial
Please look at the [quick start](../Quick-start.md) first; some of the steps are omitted here.
This tutorial shows how to use the workflow in trio assembly mode, which takes PacBio HiFi data and parental reads as input.
## 1. Config file
**TO-DO : add a toy dataset fasta and parental fasta.**
```bash
cd GenomAsm4pg/.config
```
Modify `masterconfig.yaml`. The PacBio HiFi file is `toy_dataset_trio.fasta`; its name is used as the key in the config. The parental read files are `data_p1.fasta` and `data_p2.fasta`.
Parental data is used as k-mers; you can use either Illumina or PacBio HiFi reads.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset_trio"]
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
## 2. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 3. Run
If the dry run is successful, you can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use Hi-C data, follow the [Hi-C assembly mode tutorial](Hi-C-tutorial.md).
To go further with the workflow, go [here](../Going-further.md).
# Going further
[TOC]
## 1. Multiple datasets
You can run the workflow on multiple datasets at the same time.
### 1.1. All datasets
With `masterconfig.yaml` as follows, running the workflow will assemble each dataset in its specific assembly mode.
You can add as many datasets as you want, each with different parameters.
```yaml
IDS: ["toy_dataset", "toy_dataset_hi-c", "toy_dataset_trio"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
### 1.2. On chosen datasets
You can remove datasets from `IDS` to assemble only the chosen genomes:
```yaml
IDS: ["toy_dataset", "toy_dataset_trio"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_trio:
fasta: ./GenomAsm4pg/tutorial_data/trio/toy_dataset_trio.fasta
run: trio_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: trio
p1: ./GenomAsm4pg/tutorial_data/trio/data_p1.fasta
p2: ./GenomAsm4pg/tutorial_data/trio/data_p2.fasta
```
Running the workflow with this config will assemble only `toy_dataset` and `toy_dataset_trio`.
## 2. Different run names
If you want to try different parameters on the same dataset, changing the run name will create a new directory and keep the previous data.
In the [Hi-C tutorial](Assembly-Mode/Hi-C-tutorial.md), we used the following config.
```yaml
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
```
If you want to compare the Hi-C and default assembly modes, you can run the workflow with a different run name and the default mode.
```yaml
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
fasta: ./GenomAsm4pg/tutorial_data/hi-c/toy_dataset_hi-c.fasta
run: default_comparaison
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
You will end up with two sub-directories for `toy_dataset_hi-c` (`hi-c_tutorial` and `default_comparaison`), and the data from the previous Hi-C run is kept.
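After both runs finish, the two run directories sit side by side under the dataset directory. A sketch of the expected layout, assuming the default output directory names:
```bash
# relative to the `root` path set in masterconfig.yaml
ls workflow_results/00_input_data/toy_dataset_hi-c/
# expected: default_comparaison  hi-c_tutorial
```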
## 3. The same dataset with different parameters at once
If you want to do the previous example in one run, you will have to create a symbolic link to the fasta under a different filename (see the sketch at the end of this section).
YAML files do not allow multiple uses of the same key. The following config does not work.
```yaml
## DOES NOT WORK
IDS: ["toy_dataset_hi-c"]
toy_dataset_hi-c:
run: hi-c_tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: hi-c
r1: ./GenomAsm4pg/tutorial_data/hi-c/data_r1.fasta
r2: ./GenomAsm4pg/tutorial_data/hi-c/data_r2.fasta
toy_dataset_hi-c:
run: default_comparaison
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
**TO COMPLETE**
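A minimal sketch of the symbolic-link approach mentioned above; the `_default` suffix in the new filename is only an example:
```bash
cd GenomAsm4pg/tutorial_data/hi-c
# a second filename pointing at the same reads, usable as a second config key
ln -s toy_dataset_hi-c.fasta toy_dataset_hi-c_default.fasta
```
You can then add `toy_dataset_hi-c_default` to `IDS` and give it its own entry with a different `run` name and `mode: default`.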
## 4. Optional fastq and bam files
If fastq and bam files are available and you want to do raw QC with FastQC and LongQC, add the `fastq` and/or `bam` key to your config. The fasta, fastq and bam files must share the same basename. For example:
```yaml
IDS: ["toy_dataset"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
fastq: "./GenomAsm4pg/tutorial_data/toy_dataset.fastq"
bam: "./GenomAsm4pg/tutorial_data/toy_dataset.bam"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
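A quick sanity check that the three files share the same basename (a sketch using the tutorial paths above):
```bash
ls ./GenomAsm4pg/tutorial_data/toy_dataset.{fasta,fastq,bam}
```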
# Troubleshooting
[TOC]
## One of the BUSCO rules failed
The first time you run the workflow, the BUSCO lineage might be downloaded multiple times. This can create a conflict between the jobs using BUSCO and may interrupt some of them. In that case, you only need to rerun the workflow once everything is done.
## Snakemake locked directory
When you try to rerun the workflow after cancelling a job, you may have to unlock the results directory. To do so, go in `.config/snakemake_profile/slurm` and uncomment line 14 of `config.yaml`. Run the workflow once to unlock the directory (it should only take a few seconds). Still in `config.yaml`, comment line 14. The workflow will be able to run and create outputs.
# Workflow output
[TOC]
## Directories
There are three directories for the data produced by the workflow:
- An automatic report is generated in the `RUN` directory.
- `01_raw_data_QC` contains all quality control run on the reads. FastQC and LongQC create HTML reports on fastq and bam files respectively, read stats are given by Genometools, and predictions of genome size and heterozygosity are given by Genomescope (in directory `04_kmer`).
- `02_genome_assembly` contains two assemblies. The first, in `01_raw_assembly`, is the assembly obtained with hifiasm. The second, in `02_after_purge_dups_assembly`, is the hifiasm assembly after haplotig removal by purge_dups. Both assemblies have a `01_assembly_QC` directory containing assembly statistics from Genometools (`assembly_stats`), BUSCO analyses (`busco`), k-mer profiles with KAT (`katplot`), completeness and QV stats with Merqury (`merqury`), and assembled telomeres with FindTelomeres (`telomeres`).
- `benchmark` contains the runtimes of the main programs.
```
workflow_results
├── 00_input_data
└── FILENAME
└── RUN
├── 01_raw_data_QC
│ ├── 01_fastQC
│ ├── 02_longQC
│ ├── 03_genometools
| └── 04_kmer
| └── genomescope
└── 02_genome_assembly
├── 01_raw_assembly
│ ├── 00_assembly
| └── 01_assembly_QC
| ├── assembly_stats
| ├── busco
| ├── katplot
| ├── merqury
| └── telomeres
└── 02_after_purge_dups_assembly
├── 00_assembly
| ├── hap1
| └── hap2
└── 01_assembly_QC
├── assembly_stats
├── busco
├── katplot
├── merqury
└── telomeres
```
## Additional files
- Symbolic links to haplotype 1 and haplotype 2 assemblies after purge_dups
- HTML report with the main results from each program
- Runtime file with the total workflow runtime for the dataset
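To locate these additional files after a run (a sketch; the actual filenames also include the run name, dataset name and BUSCO lineage):
```bash
# list the per-run HTML reports and runtime files
find workflow_results -name "report*.html" -o -name "runtime*.txt"
```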
# Workflow steps and program versions
All images here will be pulled automatically by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
## 1. Pre-assembly
- Conversion of PacBio bam to fasta & fastq
- [smrtlink](https://www.pacb.com/support/software-downloads/) 9.0.0
- Fastq to fasta conversion
- [seqtk](https://github.com/lh3/seqtk) 1.3
- Raw data quality control
- [fastqc](https://github.com/s-andrews/FastQC) 0.11.5
- [LongQC](https://github.com/yfukasawa/LongQC) 1.2.0c
- Metrics
- [genometools](https://github.com/genometools/genometools) 1.5.9
- K-mer analysis
- [jellyfish](https://github.com/gmarcais/Jellyfish) 2.3.0
- [genomescope](https://github.com/tbenavi1/genomescope2.0) 2.0
## 2. Assembly
- Assembly
- [hifiasm](https://github.com/chhylp123/hifiasm) 0.16.1
- Metrics
- [genometools](https://github.com/genometools/genometools) 1.5.9
- Assembly quality control
- [BUSCO](https://gitlab.com/ezlab/busco) 5.3.1
- [KAT](https://github.com/TGAC/KAT) 2.4.1
- Error rate, QV & phasing
- [meryl](https://github.com/marbl/meryl) and [merqury](https://github.com/marbl/merqury) 1.3
- Detect assembled telomeres
- [FindTelomeres](https://github.com/JanaSperschneider/FindTelomeres)
- **Biopython** 1.75
- Haplotigs and overlaps purging
- [purge_dups](https://github.com/dfguan/purge_dups) 1.2.5
- **matplotlib** 0.11.5
## 3. Report
- **R markdown** 4.0.3
# Docker images
The programs are pulled automatically as images by Snakemake the first time you run the workflow. It may take some time. Images are only downloaded once and reused automatically by the workflow.
Images are stored on the project's container registry but come from various container libraries:
- [smrtlink](https://hub.docker.com/r/bryce911/smrtlink/tags)
- [seqtk](https://hub.docker.com/r/nanozoo/seqtk)
- [fastqc](https://hub.docker.com/r/biocontainers/fastqc/tags)
- [LongQC](https://hub.docker.com/r/grpiccoli/longqc/tags)
- [genometools](https://hub.docker.com/r/biocontainers/genometools/tags)
- [jellyfish](https://quay.io/repository/biocontainers/kmer-jellyfish?tab=tags)
- [genomescope](https://hub.docker.com/r/abner12/genomescope)
- [hifiasm](https://quay.io/repository/biocontainers/hifiasm?tab=tags)
- [BUSCO](https://hub.docker.com/r/ezlabgva/busco/tags)
- [KAT](https://quay.io/repository/biocontainers/kat)
- [meryl and merqury](https://quay.io/repository/biocontainers/merqury?tab=tags)
- [Biopython for FindTelomeres](https://quay.io/repository/biocontainers/biopython?tab=tags)
- [purge_dups](https://quay.io/repository/biocontainers/purge_dups?tab=tags)
- [matplotlib as companion to purge_dups](https://hub.docker.com/r/biocontainers/matplotlib-venn/tags)
- [R markdown](https://hub.docker.com/r/reslp/rmarkdown/tags)
# Quick start
This tutorial shows how to use the workflow in default assembly mode, which takes PacBio HiFi data as input.
[TOC]
## Clone repository
```bash
cd .
git clone https://forgemia.inra.fr/asm4pg/GenomAsm4pg.git
```
## 1. Cluster profile setup
```bash
cd GenomAsm4pg/.config/snakemake_profile
```
The current profile is made for SLURM. If you use it, change line 13 of the `cluster_config.yml` file to your email address.
To run this workflow on another HPC, create another profile (https://github.com/Snakemake-Profiles) and add it in the `.config/snakemake_profile` directory. Change the `CLUSTER_CONFIG` and `PROFILE` variables in `job.sh` and `prejob.sh` scripts.
## 2. Config file
**TO-DO : add a toy fasta.**
```bash
cd ..
```
Modify `masterconfig.yaml`. `root` is the path where the output data will be written.
```yaml
# absolute path to your desired output path
root: ./GenomAsm4pg/tutorial_output
```
The reads file is `toy_dataset.fasta`; its name is used as the key in the config.
```yaml
####################### job - workflow #######################
### CONFIG
IDS: ["toy_dataset"]
toy_dataset:
fasta: "./GenomAsm4pg/tutorial_data/toy_dataset.fasta"
run: tutorial
ploidy: 2
busco_lineage: eudicots_odb10
mode: default
```
## 3. Create slurm_logs directory
```bash
cd ..
mkdir slurm_logs
```
SLURM logs for each rule will be in this directory; there are `.out` and `.err` files for the workflow (`snakemake.cortex*`) and for each rule (`rulename.cortex*`).
## 4. Mail setup
Modify line 17 to your email address in `job.sh`.
## 5. Dry run
To check the config, first do a dry run of the workflow.
```bash
sbatch job.sh dry
```
## 6. Run
If the dry run is successful, check that the `SNG_BIND` variable in `job.sh` is the same as the `root` variable in `masterconfig.yaml`.
If Singularity is not in the HPC environment, add `module load singularity` under `module load snakemake/6.5.1`.
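The module-loading block of `job.sh` would then look like this:
```bash
### Module Loading:
module purge
module load snakemake/6.5.1
module load singularity   # add this line if Singularity is not already available
```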
You can run the workflow.
```bash
sbatch job.sh
```
## Other assembly modes
If you want to use additional Hi-C data or parental data, follow the [Hi-C assembly mode tutorial](Assembly-Mode/Hi-C-tutorial.md) or the [Trio assembly mode tutorial](Assembly-Mode/Trio-tutorial.md). To go further with the workflow, go [here](Going-further.md).
# Optional: data preparation
If your data is in a tarball, this companion workflow will extract the data and convert bam files to fastq and fasta if necessary.
[TOC]
## 1. Config file
```bash
cd GenomAsm4pg/.config
```
Modify the `data` variable in `.config/masterconfig.yaml` to be the path to the directory containing all input tar files.
This workflow can automatically determine the names of the files in the specified `data` directory, or run only on given files:
- `get_all_tar_filename: True` will uncompress all tar files. If you want to choose the files to uncompress, use `get_all_tar_filename: False` and give the filenames as a list in `tarIDS`.
## 2. Run
Modify the `SNG_BIND` variable in `prejob.sh`; it has to be the same as the `root` variable in `.config/masterconfig.yaml`. Change line 17 to your email address.
If Singularity is not in the HPC environment, add `module load singularity` under Module loading.
Then run
```bash
sbatch prejob.sh
```
## 3. Outputs
This will create multiple directories to prepare the data for the workflow. You will end up with a `bam_files` directory containing all *bam* files (renamed to the tar filename if your data was named "ccs.bam") and a `fastx_files` directory containing all *fasta* and *fastq* files. The `extract` directory contains all other files that were in the tarball.
```
workflow_results
└── 00_raw_data
├── bam_files
├── extract
└── fastx_files
```
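A quick way to confirm that the conversion produced the files the main workflow needs (a sketch, assuming the default directory names above):
```bash
# every dataset should now have a fasta.gz (and, when available, a fastq.gz) here
ls workflow_results/00_raw_data/fastx_files/
```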
# <A HREF="https://forgemia.inra.fr/asm4pg/GenomAsm4pg"> asm4pg </A>
Asm4pg is an automatic and reproducible genome assembly workflow for pangenomic applications using PacBio HiFi data.
doc: [Gitlab pages](https://asm4pg.pages.mia.inra.fr/genomasm4pg)
![workflow DAG](doc/fig/rule_dag.svg)
## Asm4pg Requirements
- snakemake >= 6.5.1
- singularity
The workflow does not work on HPCs that do not allow a job to submit other jobs.
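A quick check that the requirements are available on the cluster front end:
```bash
snakemake --version      # should report 6.5.1 or newer
singularity --version
```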
## Tutorials
The three assembly modes from hifiasm are available.
- [Quick start (default mode)](doc/Quick-start.md)
- [Hi-C mode](doc/Assembly-Mode/Hi-C-tutorial.md)
- [Trio mode](doc/Assembly-Mode/Trio-tutorial.md)
## Outputs
[Workflow outputs](doc/Outputs.md)
## Optional Data Preparation
If your [data is in a tarball](doc/Tar-data-preparation.md)
## Known errors
You may run into [these errors](doc/Known-errors.md)
## Software
[Software used in the workflow](doc/Programs.md)