Skip to content
Snippets Groups Projects
README.md 31.2 KiB
Newer Older
Jules Sabban's avatar
Jules Sabban committed
# Methylation Analysis for ChickStress
In this repository are stored some scripts, command lines, graphics and files for the methylation analysis of the ChickStress project.
Jules Sabban's avatar
Jules Sabban committed
All of these files are developed and produced during Jules Sabban's internship from April to September 2020, in GenEpi team under the supervision of Guillaume Devailly and Frederique Pitel.
Jules Sabban's avatar
Jules Sabban committed

## Details of the different scripts and files
Jules Sabban's avatar
Jules Sabban committed
* [Make R environnement](#make-r-environnement)
* [nf-core/methylseq](#nf-coremethylseq)
* [SeqOccin project pipelines](#seqoccin-project-pipelines)
	+ [BS_preprocessing.R](#bs_preprocessingr)
Jules Sabban's avatar
Jules Sabban committed
		+ [Making specific VCF with LiftOver (BED file)](#making-specific-vcf-with-liftover-bed-file)
	+ [MethDiffDSS_2.R](#methdiffdss_2r)
* [R scripts](#r-scripts)
Jules Sabban's avatar
Jules Sabban committed
	+ [coverage_exploration.R](#coverage_explorationr)
	+ [analyse_methyl_cov.R](#analyse_methyl_covr)
	+ [coverage_threshold_exploration_rrbs.R](#coverage_threshold_exploration_rrbsr)
	+ [MHM_exploration.R](#mhm_explorationr)
* [Perl scripts](#perl-scripts)
	+ [make_TSS.pl](#make_tsspl)
* [RNAseq analysis](#rnaseq-analysis)
	+ [search_expression.pl](#search_expressionpl)
	+ [expression_boxplot.R](#expression_boxplotr)
Jules Sabban's avatar
Jules Sabban committed
	+ [search_DE_genes.R](#search_de_genesr)
	+ [count_DE_genes.pl](#count_de_genespl)
* [DMC analysis](#dmc-analysis)
	+ [plotDMC.R](#plotdmcr)
	+ [Files generated with BedTools](#files-generated-with-bedtools)
		+ [Galgal6.gene.bed](#galgal6genebed)
		+ [annotation.bed](#annotationbed)
	+ [search_region.pl](#search_regionpl)
	+ [search_gene_expression.pl](#search_gene_expressionpl)
	+ [boxplot_gene_expression_with_DMC.R](#boxplot_gene_expression_with_dmcr)
	+ [search_meth_n_exp.pl](#search_meth_n_exppl)
Jules Sabban's avatar
Jules Sabban committed
* [BisSNP](#bissnp)
	+ [Bash pipeline](#bash-pipeline)
	+ [vcfsorter.pl](#vcfsorterpl)
	+ [bissnp_easy_usage.pl](#bissnp_easy_usagepl)
Jules Sabban's avatar
Jules Sabban committed
## Make R environnement
Because of some R packages wrongly installed on Genotoul, it's advisable to create a personal R environnement in its home directory. The aim is to create a personal directory which will contain all R packages installations, for each R version used. For it, create a file named `.Rprofile` in the personal home directory and copy and paste the following lines (from Guillaume Devailly) :
```
# default CRAN mirror
options(repos = structure(c(CRAN = "https://ftp.igh.cnrs.fr/pub/CRAN/")))

# R package installation directory
my_path <- file.path("~", "work", "Rpackages")
if(!file.exists(my_path)) {
	dir.create(my_path)
}

my_path <- file.path("~", "work", "Rpackages", paste(R.version$major, sub("\\..*$", "", R.version$minor), sep = "."))
if(!file.exists(my_path)) {
	dir.create(my_path)
}

.libPaths(c(my_path, .libPaths()))
```

To install properly all of R packages used in the SeqOccin pipeline (see bellow) and used in the scripts of this repository, just run the script `install_Rpackages.R` as show bellow :
```
./install_Rpackages.R 
```

***NB** : I used 3.6.2 R version.*

## nf-core/methylseq<a id="nf-coremethylseq"></a>
`nf-core/methylseq` is a pipeline using NextFlow that compute first steps of bioinformatics analysis of methylation sequencing raw data. Documentation is available here : https://nf-co.re/methylseq

Example of running `nf-core/methylseq` on raw data :
```
nextflow run ./methylseq/main.nf \
	-profile genotoul \
	-name run20200421 \
	--reads '/work2/genphyse/genepi/RRBS/ChickStress/Methylseq/data/FRPL*.fastq.gz' \
	--single_end \
	--cytosine_report \
	--fasta '/work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.dna.toplevel.fa' \
	--rrbs \
	--clip_r1 3 \
	--save_reference \
	--email jules.sabban@inrae.fr \
	--max_memory 64GB \
	--max_cpus 16
```

***NB** :  To use the new version of the pipeline (1.5) on Genotoul, NextFlow v.20 is needed. To have it,  it is necessary to use this before run* `nf-core/methylseq` :
```
module load bioinfo/nfcore-Nextflow-v19.04.0
module unload bioinfo/Nextflow-v19.04.0
module load bioinfo/Nextflow-v20.01.0
``` 

`nf-core/methylseq` produces a lot of exit files, an example of these on Genotoul cluster is here : `/work2/genphyse/genepi/RRBS/ChickStress/Methylseq/run20200421/results`

## SeqOccin project pipelines<a id="seqoccin-project-pipelines"></a>
Documentation of the two next pipelines is searchable here : https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis

Jules Sabban's avatar
Jules Sabban committed
Recap of the different runs of `BS_preprocessing.R` and `MethDiffDSS_2.R` :  
| preprocessing | PP_params | normalisation  | methdiff | MD_params | details |
| :-----------: | :-------: | :------------: | :------: | :-------: | :-----: |
|run20200427 | 100 - 5 - 0.8 | YES | run20200429 | CT/HS 3 - 0.5 - 0.05 |    |
| |  |  | run20200501 | CT/HS 3 - 0.5 - 0.10 |    |
|run20200428 | 100 - 5 - 0.8 |  |  |  |  to remove ?  |
|run20200505 | 95 - 10 - 0.2 |  |  |  |  to remove ?  |
Jules Sabban's avatar
Jules Sabban committed
|run20200506 | 100 - 5 - 0.8 | YES | run20200507_C | CT/HS 3 - 0.5 - 0.05 |  ncores = 16  |"
| |  |  | run20200507_S | M/F 3 - 0.5 - 0.05 |  ncores = 16 : v |
|run20200512 | 100 - 5 - 0.8 | NO | run20200512 | CT/HS 3 - 0.5 - 0.05 |    |
|run20200513 | 100 - 5 - 0.8 | YES (MAJ : v) | run20200513 | M/F 3 - 0.5 - 0.05 | RNAseq verified sex |
Jules Sabban's avatar
Jules Sabban committed
| |  |  | run20200515 | CT/HS 3 - 0.5 - 0.05 | |  
Jules Sabban's avatar
Jules Sabban committed
|run20200618 | 100 - 5 - 0.8 | YES | run20200619_S | M/F 3 - 0.5 - 0.05 | Galgal6/R- VCF |
| |  |  | run20200619_C | CT/HS 3 - 0.5 - 0.05 | Galgal6/R- VCF |
|run20200623 | 100 - 5 - 0.8 | NO | run20200623_S | M/F 3 - 0.5 - 0.05 | Galgal6/R- VCF |
| |  |  | run20200623_C | CT/HS 3 - 0.5 - 0.05 | Galgal6/R- VCF |
Jules Sabban's avatar
Jules Sabban committed
|run20200909 | 100 - 5 - 0.8 | NO | run20200910_S | M/F 3 - 0.5 - 0.05 |all sequencing files available (75) |
| |  |  | run20200910_C | CT/HS 3 - 0.5 - 0.05 |all sequencing files available (75) |
|run20200925 | 1000 - 5 - 0.8 | NO | run20200925_S | M/F 3 - 0.5 - 0.05 |all sequencing files available (75) |
| |  |  | run20200925_C | CT/HS 3 - 0.5 - 0.05 |all sequencing files available (75) |
**Explication of this array :**
1st column is the directories of BS_preprocessing.R runs.
2nd column indicates parameters of BS_preprocessing.R used.
3rd column indicates if normalization step was occur in BS_preprocessing.R.
4th column is the directories of MethDiffDSS_2.R runs.
5th column indicates parameters of MethDiffDSS_2.R used.
6th column give some details on run of BS_preprocessing.R and/or MethDiffDSS_2.R.

`run2020####` is the directory of the script in the column, from `/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff`. 
**For example** : in column methdiff, `run20200513` means that MethDiffDSS_2.R was run in condition male/female, with parameters dmr.numC=3, dmr.propDMC=0.5 and alpha=0.05. And the work directory is `/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200513`. This run used outputs from `/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200513`, that ran with parameters x=100, y=5 and z=0.8.
### BS_preprocessing.R<a id="bs_preprocessingr"></a>
The first step of the differential methylation analysis is the run of [BS_preprocessing.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/blob/BSseq-DiffAnalysis/script_BSseq/BS_preprocessing.R). This R script prepare outputs of Bismark (generated by nf-core/methylseq) to MethDiffDSS_2.R (see below).

Example of running `BS_preprocessing.R` : 
```
Rscript /home/jsabban/work/GitHub/axis2-epigenectics/script_BSseq/BS_preprocessing.R \
	-o "/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200513" \
	-m "/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/metadata_2.tsv" \
	-p "/work2/genphyse/genepi/RRBS/ChickStress/Methylseq/run20200421/results/bismark_methylation_calls/stranded_CpG_report/" \
	--replicate=FALSE \
	--mergeStrands=TRUE \
	--process_functions="/home/jsabban/work/GitHub/axis2-epigenectics/script_BSseq/BS_process_functions.R" \
	--vcf="/work2/genphyse/genepi/RRBS/ChickStress/galgal6/gallus_gallus.vcf" \
	-x 100 \
	-y 5 \
	-z 0.8
```
***NB** : Resources needed to run the job on Genotoul : `--mem=20G`.*
Jules Sabban's avatar
Jules Sabban committed
#### Making specific VCF with LiftOver (BED file)
In a second time, a specific BED file to our data was made from a VCF made with our line of hens and under galgal5, to make more specific analysis. The conversion from galgal5 to galgal6 was done with awk manipulation, [LiftOver](#https://genome.sph.umich.edu/wiki/LiftOver) and perl scripts (`merge_bed.pl` and `snp_from_bed.pl`). 

1/ Prepare et make BED file format from the entire VCF file :
```
VCF=/work2/genphyse/genepi/RRBS/ChickStress/vcf/RpRm_F0_15indiv_DNA_BiAl_Filtered_Pass_7902186SNP.vcf.gz

echo -e "#CHROM\tSTART\tEND\tID\tSCORE\tNAME\tSTRAND\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tFEEDAGENE_Rm-197-0815\tFEEDAGENE_Rm-199-0641\tFEEDAGENE_Rm-202-0624\tFEEDAGENE_Rm-203-0754\tFEEDAGENE_Rm-304-0576\tFEEDAGENE_Rm-342-0819\tFEEDAGENE_Rm-375-0933" > tmp.bed

zcat -f $VCF | awk '{OFS="\t"; if(!/^#/){print "chr"$1,$2-1,$2,$3,"0",NR,".",$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' >> tmp.bed
```
2/ Make BED file with few column to use as LiftOver input :
```
echo -e "#CHROM\tSTART\tEND\tNAME\tSTRAND\tREF\tALT" > few_columns.bed

zcat -f $VCF | awk '{OFS="\t"; if(!/^#/){print "chr"$1,$2-1,$2,NR,".",$4,$5}}' >> few_columns.bed
```
3/ LiftOver run from Galgal5 to Galgal6 on the previous BED file :
```
liftOver few_columns.bed galGal5ToGalGal6.over.chain few_columns_To_Galgal6.bed few_columns_To_Galgal6_unmapped.bed
```
4/ Merge columns of LiftOver output and `few_columns.bed` to retrieve all the data from the initial VCF with `merge_bed.pl` :
```
./merge_bed.pl few_columns_To_Galgal6.bed tmp.bed > all_columns_To_Galgal6.bed
```
5/ Selection of SNP positions with the script named `snp_from_bed.pl` : 
```
./snp_from_bed.pl all_columns_To_Galgal6.bed > snp.bed
```
The file `snp.bed` is used for new run of `BS_preprocessing.R` with the `--vcf` argument.

### MethDiffDSS_2.R<a id="methdiffdss_2r"></a>
The second step of the differential methylation analysis is the run of [MethDiffDSS_2.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/blob/BSseq-DiffAnalysis/script_BSseq/MethDiffDSS_2.R). This pipeline make research of DMC and DMR, then exports some graphics. 

Example of running `MethDiffDSS_2.R` :
```
Rscript  /home/jsabban/work/GitHub/axis2-epigenectics/script_BSseq/MethDiffDSS_2.R \
	-d "/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200513/Bismark_format/" \
	-f 249_bismark.tsv,250_bismark.tsv,251_bismark.tsv,252_bismark.tsv,253_bismark.tsv,254_bismark.tsv,255_bismark.tsv,256_bismark.tsv,257_bismark.tsv,258_bismark.tsv,259_bismark.tsv,260_bismark.tsv,261_bismark.tsv,262_bismark.tsv,263_bismark.tsv,264_bismark.tsv,265_bismark.tsv,266_bismark.tsv,267_bismark.tsv,268_bismark.tsv,269_bismark.tsv,270_bismark.tsv \
	--pool1 250_bismark.tsv,256_bismark.tsv,257_bismark.tsv,261_bismark.tsv,262_bismark.tsv,263_bismark.tsv,265_bismark.tsv,269_bismark.tsv,270_bismark.tsv \
	--pool2 249_bismark.tsv,251_bismark.tsv,252_bismark.tsv,253_bismark.tsv,254_bismark.tsv,255_bismark.tsv,258_bismark.tsv,259_bismark.tsv,260_bismark.tsv,264_bismark.tsv,266_bismark.tsv,267_bismark.tsv,268_bismark.tsv \
	--dmr TRUE \
	--dmr.type qvalue \
	--dmr.numC 3 \
	--dmr.propDMC 0.5 \
	--alpha 0.05 \
	--correct BH \
	--gff /work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.99.gff3 \
	--type 'gene,exon,five_prime_UTR,three_prime_UTR' \
	--tss /work2/genphyse/genepi/RRBS/ChickStress/galgal6/tss.txt \
	--plots TRUE \
	--parallel TRUE \
	--ncores 16 \
	-o /work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200513/results \
	--process_functions "/home/jsabban/work/GitHub/axis2-epigenectics/script_BSseq/BS_DSS_analysis.R"
```
***NB** : Resources needed to run the job on Genotoul : `--mem=70G` and `--cpus-per-task=16`.*

Some graphics are stored in the directory `MethDiff_results`. The PDF title `image_DMC.pdf` was generated by `MethDiffDSS_2.R`. Some QQ-plots of p-values are also stored in this directory, they were obtained with these lines, in R session :

```
library("qqman")
png("qqplot_<normalization>_<condition>.png",height =  619, width =  633)
qq(res_DMC$pval)
title("QQplot sur <CTHS/sexes> \n<Avec/Sans>  normalisation")
dev.off()
```
***Idea** : integrate this QQ-plot in the MethDiffDSS_2.R pipeline.*

Jules Sabban's avatar
Jules Sabban committed

## R scripts <a id="r-scripts"></a>
Jules Sabban's avatar
Jules Sabban committed
### coverage_exploration.R <a id="coverage_explorationr"></a>
Jules Sabban's avatar
Jules Sabban committed
At the end of the run of `BS_preprocessing.R` and `MethDiffDSS_2.R`, this script exports means of coverage of all sites and the number of samples in which the position corresponds to a CpG site for each DMC.
Jules Sabban's avatar
Jules Sabban committed
Three arguments are needed and must be specified in the script  :
* `opt$bismark` : path to the `bismark_format` directory generated by `BS_preprocessing`
* `opt$dmc`  : path to DMC file generated by `MethDiffDSS_2.R`
* `opt$out` : path to an output directory
Jules Sabban's avatar
Jules Sabban committed
Two others arguments allow to describe data viewed ; `opt$com1` and `opt$com2` that can take character value.  A third optional argument is available : `opt$cpg`. This is the path to a folder containing the * cpg.vcf.gz output from BisSNP. If this argument is specified, the script will calculate the number of individuals for which each position is a CpG site.

To run `coverage_exploration.R`, give value to `opt$bismark`, `opt$dmc`, `opt$out`, `opt$com1` and `opt$com2`, `opt$cpg` as many times as there are files to compare, in the MAIN part of the script.
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
The script produces three output files according to the arguments entered :
* `opt$out` (require) : exports comparison file named `DMC_coverage.txt`
* `opt$cpg` (optional) : exports a count file named `count_cpg.txt`  which contains for each CpG site sequenced the number of individuals in which it is found. `opt$cpg` allows also the export of `count_cpg_in_dmc.txt` which contains for each DMC, the mean coverage and previous information about number of CpG site. Note that `NA` means the DMC is not on CpG site.
Jules Sabban's avatar
Jules Sabban committed

Running `coverage_exploration.R` :
```
./coverage_exploration.R
```
Jules Sabban's avatar
Jules Sabban committed
***NB** : If all arguments are empty, le function will not run. Copy and paste the 7-lines block as many times as there are files to compare.*
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
Examples of MAIN part :
Jules Sabban's avatar
Jules Sabban committed
```
opt=NULL
opt$bismark="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200623/Bismark_format/"
opt$dmc="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200623_S/results/DMC.txt"
opt$out="/home/jsabban/work/coverage_exploration/"
opt$com1="Males vs Femelles"
opt$com2="Nouveaux fichiers"
Jules Sabban's avatar
Jules Sabban committed
opt$cpg="/home/jsabban/work/BisSNP/work/cpg_compressed/"
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
run_analyse(opt$bismark, opt$dmc, opt$out, opt$com1, opt$com2, FALSE, opt$cpg)
Jules Sabban's avatar
Jules Sabban committed

opt=NULL
opt$bismark="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200909/Bismark_format/"
opt$dmc="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200910_S/results/DMC.txt"
opt$out="/home/jsabban/work/coverage_exploration/"
opt$com1="Males vs Femelles"
opt$com2="Tous les fichiers"
Jules Sabban's avatar
Jules Sabban committed
opt$cpg="/home/jsabban/work/BisSNP/work/cpg_compressed/"
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
run_analyse(opt$bismark, opt$dmc, opt$out, opt$com1, opt$com2, TRUE, opt$cpg)
Jules Sabban's avatar
Jules Sabban committed

#-------------
opt=NULL
opt$bismark="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200623/Bismark_format/"
opt$dmc="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200623_C/results/DMC.txt"
opt$out="/home/jsabban/work/coverage_exploration/"
opt$com1="Controles vs Stresses"
opt$com2="Nouveaux fichiers"
Jules Sabban's avatar
Jules Sabban committed
opt$cpg=NULL
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
run_analyse(opt$bismark, opt$dmc, opt$out, opt$com1, opt$com2, TRUE, opt$cpg)
Jules Sabban's avatar
Jules Sabban committed

opt=NULL
opt$bismark="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200909/Bismark_format/"
opt$dmc="/work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200910_C/results/DMC.txt"
opt$out="/home/jsabban/work/coverage_exploration/"
opt$com1="Controles vs Stresses"
opt$com2="Tous les fichiers"
Jules Sabban's avatar
Jules Sabban committed
opt$cpg=NULL
Jules Sabban's avatar
Jules Sabban committed

Jules Sabban's avatar
Jules Sabban committed
run_analyse(opt$bismark, opt$dmc, opt$out, opt$com1, opt$com2, TRUE, opt$cpg)
### analyse_methyl_cov.R  <a id="analyse_methyl_covr"></a>
Jules Sabban's avatar
Jules Sabban committed
This script allows the analysis of the C/T distribution and the level of methylation, with output files from the nf-core/methylseq pipeline run.

***NB** : Not yet suitable to run from a command line, all input data are entered in the 'Main' part.*
Jules Sabban's avatar
Jules Sabban committed

`analyse_methyl_cov.R` exports four types of graphics :
* A boxplot of the coverage per individual, titled `Boxplot_couverture_par_individus.pdf`
* A point plot where are represented mean ratios of the coverage on the W chromosome to the coverage on all chromosomes, titled `Ratio_couverture_W_vs_autres.pdf`
* A line plot of the distribution of the methylation levels per individual, titled `Distribution_niv_meth_par_individu.pdf`
* A line plot per individual of the distribution of the methylation levels (exported in a separated directory, one plot per individual), titled `Distribution_niveau_methylation_*.png`, where the wildcard is the identifier of the individual

Running of analyse_methyl_cov.R :
``` 
./analyse_methyl_cov.R
```

### coverage_threshold_exploration_rrbs.R <a id="coverage_threshold_exploration_rrbsr"></a>
Jules Sabban's avatar
Jules Sabban committed
This script produces two graphics to analyse the outputs of [BS_preprocessing.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/blob/BSseq-DiffAnalysis/script_BSseq/BS_preprocessing.R) depending on the value of selection parameters, and to verify the effect of the various treatments carried out on the data.

Firstly, the script produces a plot titled `minimum_coverage_thresholding.png` that explores different values for the thresholding of 'minCoverage' argument of [BS_preprocessing.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/blob/BSseq-DiffAnalysis/script_BSseq/BS_preprocessing.R). The aim of this analysis is to check if one threshold value that would be more suitable than another to filter raw data.

In a second time, the script produces a boxplot with several sections, titled `all_thresholding_steps.png` that highlights the evolution of the number of CpG sites as the various treatments carried out during [BS_preprocessing.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/blob/BSseq-DiffAnalysis/script_BSseq/BS_preprocessing.R) progress. One of the goals is to verify the effectiveness of the normalization stage.

***NB** : All input data are entered in the script.*
Jules Sabban's avatar
Jules Sabban committed

Running of coverage_threshold_exploration_rrbs.R :
``` 
./coverage_threshold_exploration_rrbs.R
```

### MHM_exploration.R <a id="mhm_explorationr"></a>
This very simple R script use coverage outputs from Bismark to determine how many CpG are in the MHM region, for each embryos and how many are shared by all embryos.
Define the path to the `methylation_coverage` output in line 4 before run.

Running MHM_exploration.R
```
./MHM_exploration.R
```

## Perl scripts <a id="perl-scripts"></a>
### make_TSS.pl <a id="make_tsspl"></a>
Jules Sabban's avatar
Jules Sabban committed
This is a very simple script in Perl, using to produce a TSS file from a GFF file.
A Transcription Start Site file contains the list of genomic position where transcription starts and on which strand this site is.
Columns are : chr &nbsp;  tss &nbsp; strand

Example of running make_TSS.pl :
````
./make_TSS.pl /work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.99.gff3 > ./tss.tsv
````
## RNAseq analysis <a id="rnaseq-analysis"></a>
### search_expression.pl <a id="search_expressionpl"></a>
Jules Sabban's avatar
Jules Sabban committed
This script written in Perl retrieves the expression values of a particular chromosome from an RNA-seq data file and generates a boxplot representing the distribution of expression values by individual, and sorted according to their median. This script is a little workflow that automates the search carried out by successively calling several unix programs used on the command line and a single call to the R script : [expression_boxplot.R](#expression_boxplot) (see below).

***NB** : The treatment need a file containing the targeted gene list, with one gene_ID per line ([see example](https://forgemia.inra.fr/jules.sabban1/methylation-analysis-for-chickstress/-/blob/master/RNAseq/expression/list_geneID_from_RNA_file.txt))*
Jules Sabban's avatar
Jules Sabban committed

Example of running search_expression.pl :
```
./search_expression.pl \
	--rna /work/project/1000rnaseqchick/internal_runs/RpRm/embr/F0_galgal5_V87dec2016_GTFrennes/Results/Summary/RSEM_multimap_summary_RpRm_embr_F0_genes_TPM.tsv \
	--gtf /work/project/1000rnaseqchick/reference/7b_gg_EnsemblRennesAldbNcbiNoncodeFragencode_20180731_all_52075g_101188t_gg5_v87v93_withFGF21_usedForPubli_additionOfExonID_withUTRandStarStopCodons.gtf \
	--chr Z \
	--dir /home/jsabban/work/testBS/testExpression
```
### expression_boxplot.R <a id="expression_boxplotr"></a>
Jules Sabban's avatar
Jules Sabban committed
This R script is used in [search_expression.pl](#search_expression). It traces boxplot of chromosome expression values passed in argument and sorted by median for all individuals. For each gene, if at least 80% of individuals have a zero expression value, it is removed from the data for all individuals.

This script exports a boxplot titled `Boxplot_expression_<chr>_par_individu.pdf`, where `chr` is the chromosome's name passed as an argument. 

***NB** : `expression_boxplot.R` can also be call regardless of the previous script.*
Jules Sabban's avatar
Jules Sabban committed

Example of running expression_boxplot.R :
```
Rscript ./expression_boxplot.R \
		--file /home/jsabban/work/testBS/testExpression/expression_W.txt \
		--out /home/jsabban/work/testBS/testExpression/expression \
		--chr Z
```

Jules Sabban's avatar
Jules Sabban committed
### search_DE_genes.R <a id="search_de_genesr"></a>
This script was used after the GeneSwitch's pipeline run, to make differential expression analysis of CTHS and sexes in the same run. It exports tables of all genes `DE_*_all.txt` and only differentially expressed genes `DE_*_signif.txt`. It also exports two Volcano plots in one PDF file. All outputs are stored in a directory named `DE/`.
***NB** : No option in command line is available, change data in the script if necessary.*

Running of search_de_genes.R :
```
Rscript ./search_DE_genes.R
```

### count_DE_genes.pl <a id="count_de_genespl"></a>
This Perl script counts the number of differentially expressed genes on one chromosome. To run, this script need three unflagged arguments which are : (1) the path to a GFF file, (2) the chromosome targeted and (3) path to list of DE genes.

Example of run of `count_DE_genes.pl` :
```
./count_DE_genes.pl \
	/work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.99.gff3 \
	Z \
	/work/GitHub/methylation-analysis-for-chickstress/RNAseq/DE/DE_sexes_signif.txt
```

The file `RNAseq/DE_Z.txt` is the output from the example.

## DMC analysis <a id="dmc-analysis"></a>
This directory containing scripts, inputs et result files of the analysis of DMC found with [MethDiffDSS_2.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis).
The two directories 'CTHS' and 'sexes' respectively contain all results of the analysis of DMC in the cases of  differential methylation analysis on CT/HS and male/female.

### plotDMC.R <a id="plotdmcr"></a>
The first step of DMC analysis is to check methylation levels on each DMC found with [MethDiffDSS_2.R](https://forgemia.inra.fr/seqoccin/axis2-epigenectics/-/tree/BSseq-DiffAnalysis). To perform this, `plotDMC.R` is used with the output file `DMC.txt` from a MethDiffDSS_2.R run.
`plotDMC.R` exports a violins or boxplot chart, that represents methylation levels in each pool (for example : CT/HS, male/female) for each DMC.

***NB** : All of the seven arguments are required.*

Example of running of plotDMC.R :
``` 
Rscript ./plotDMC.R \
		--out /work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200515/results/ \
		--data /work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/preprocessing/run20200513/Bismark_format/ \
		--dmc /work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200515/results/DMC.txt \
		--names CT,HS \
		--pool1 249,250,252,254,255,257,258,259,266,267,269,270 \
		--pool2 251,253,256,260,261,262,263,264,265,268 \
		--plot boxplot
```

### Files generated with BedTools <a id="files-generated-with-bedtools"></a>
DMC analysis need some few files generated with BedTools : `Galgal6.gene.bed` and `*/Annotation.bed`.

#### Galgal6.gene.bed <a id="galgal6genebed"></a>
This BED format file is necessary for the next script. This file contain, in each line, the Ensembl gene ID and the gene name associated. 
The six columns are : chr &nbsp; coord_begin &nbsp; coord_end &nbsp; Ensembl_info &nbsp; . &nbsp; strand.

How this file was generated (thanks to Sylvain Foissac for this command line) :
```
zcat -f /work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.100.gtf | sed 's/"//g;s/;//g' | awk '$3=="gene"{id="NA";name="NA";for (i=8;i<NF;i++){if($i=="gene_id")id=$(i+1);if($i=="gene_name")name=$(i+1)} print $1,$4-1,$5,id"_"name,".",$7}' OFS="\t" | bedtools sort -i - | uniq > Galgal6.gene.bed
```

***NB** : To use the line on Genotoul, the `bioinfo/bedtools-2.27.1` module is needed.*
#### annotation.bed <a id="annotationbed"></a>
This file is in the sub-directory `CTHS` or `sexes`, this is the first step of DMC analysis. Is generated with Bedtools closest, which associates each DMC the gene in which it is located or the nearest gene, indicating the distance between the two.
Bedtools need DMC file (bed format) and the previously generated file.

How this file was generated, example for CTHS :
```
bedtools closest -D a -a /work2/genphyse/genepi/RRBS/ChickStress/AnalyseDiff/methdiff/run20200515/results/DMC.bed -b ./Galgal6.gene.bed > ./CTHS/Annotation.bed
```

The `-D a` argument indicates that the distance must be collected.

***NB** : To use the line on Genotoul, the `bioinfo/bedtools-2.27.1` module is needed.*
For clarity, and given that the columns of this file are not all identical to the conventions of BED files, it is necessary to add a header, with two following  simple command lines, by replacing the wildcard by 'CT' or 'sexes' :
```
echo -e "chr_dmc\tstart_dmc\tend_dmc\tpool1\tdiff\tstrand_dmc\tchr_gene\tstart_gene\tend_gene\tgene\t.\tstrand_gene\tdistance" | cat - */annotation.bed > */annotation_2.bed
mv */annotation_2.bed */annotation.bed
```
### search_region.pl <a id="search_regionpl"></a>
The third step is the association between each DMC and the genomic region in which it is located. This is make with the script `search_region.pl`, which take two arguments (without flag). The first one is the previous file, and the second is a GTF file.

Example of running `search_region.pl` in the case of CTHS analysis :
```
./search_region.pl ./CTHS/annotation.bed /work2/genphyse/genepi/RRBS/ChickStress/galgal6/Gallus_gallus.GRCg6a.100.gtf > ./CTHS/annotation_with_region.bed
```

### search_gene_expression.pl <a id="search_gene_expressionpl"></a>
This script is the fourth step of the DMC analysis, and this script make three files in a new repository `*/expression/` : `gene_expression.tsv`, `StudentTest.txt` and `boxplot_gene_expression_with_dmc.png`.
* `gene_expression.tsv` contains the list of individuals in the first line and levels of expression for each gene in each individual.
* `StudentTest.txt` contains the pvalue of a Student test for each gene and a boolean ; TRUE if the pvalue is significative, FALSE otherwise. 
* `boxplot_gene_expression_with_dmc.png` is a boxplot of expression level for each gene in each condition.

Jules Sabban's avatar
Jules Sabban committed
**WARNING** : The script retrieves columns from a counting table which are not always the same. It's possible to have to change the numbers of columns in line 56.

The arguments of `search_gene_expression.pl` are :
* rna : path to an expression level counting table of the entire transcriptome.
* bed : path to the previously generated file (DMC annotation bed format file)
* col : the number of the column where are gene_id in bed file (first column is 0).
* dir : path to the work directory
* pool1 : labels of pool1, separated by `,`
* pool2 : labels of pool2, separated by `,`

This script export a boxplot and pvalue table thanks to the next R script.

Example of running `search_gene_expression.pl`, case of CTHS analysis :
```
./search_gene_expression.pl \
	--rna /work/project/1000rnaseqchick/internal_runs/RpRm/embr/F0_galgal5_V87dec2016_GTFrennes/Results/Summary/RSEM_multimap_summary_RpRm_embr_F0_genes_TPM.tsv \
	--bed ./CTHS/annotation_with_region.bed \
	--col 9 \
	--cond CT,HS \
	--dir ./CTHS \
	--pool1 249,250,252,254,255,257,258,259,266,267,269,270 \
	--pool2 251,253,256,260,261,262,263,264,265,268
```

Example of running `search_gene_expression.pl`, case of sexes analysis :
```
./search_gene_expression.pl \
	--rna /work/project/1000rnaseqchick/internal_runs/RpRm/embr/F0_galgal5_V87dec2016_GTFrennes/Results/Summary/RSEM_multimap_summary_RpRm_embr_F0_genes_TPM.tsv \
	--bed ./sexes/annotation_with_region.bed \
	--col 9 \
	--cond male,female \
	--dir ./sexes \
	--pool1 263,262,256,250,261,269,265,270,257 \
	--pool2 252,254,268,259,251,249,260,267,253,266,255,264,258
```
### boxplot_gene_expression_with_DMC.R <a id="boxplot_gene_expression_with_dmcr"></a>
This script performs statistic testing (Student Test) and make boxplot titled `boxplot_gene_expression_with_dmc.png`. It is call in the previous script with the next five arguments :
* out : path to result directory
* data : path to TSV file containing formatted expression levels (see above)
* names : names of differential methylation analysis, such as `CT/HS` or `male/female`
* pool1 : labels of pool1, separated by `,`
* pool2 : labels of pool2, separated by `,`

Example of running `boxplot_gene_expression_with_DMC.R` :
```
Rscript ./boxplot_gene_expression_with_DMC.R :
	--out CTHS/expression/ \
	--data CTHS/expression/gene_expression.tsv \
	--names CT,HS \
	--pool1 249,250,252,254,255,257,258,259,266,267,269,270 \
	--pool2 251,253,256,260,261,262,263,264,265,268
Jules Sabban's avatar
Jules Sabban committed
```

### search_meth_n_exp.pl <a id="search_meth_n_exppl"></a>
This script is use to determine which genes are associate with DMC (DM genes) and differentially expressed between two conditions. It need two or three arguments without flag. the first one must be the DE genes list. The second is the DM genes list (BED format : adapt line 18 if needed). Should redirect outputs with `>` followed by a path to the output file.

Example of running `search_meth_n_exp.pl` :
```
./search_meth_n_exp.pl \
	/home/jsabban/work/tmp/rnaseq_gg6_0619/DE/DE_CTHS.txt \
	/home/jsabban/work/GitHub/methylation-analysis-for-chickstress/DMC_analysis/CTHS/annotation_with_region.bed \
	> /home/jsabban/work/GitHub/methylation-analysis-for-chickstress/DMC_analysis/CTHS/methylated_and_expressed_genes.txt
Jules Sabban's avatar
Jules Sabban committed
```

## BisSNP
In parallel of `BS_preprocessing.R`, and to check SNP filtration of the pipeline of SeqOccin, a VCF is produce from the output of nf-core/methylseq, with `BisSNP`. The aim is to produce a file containing SNP positions from RRBS. 

### Bash pipeline
The Bash script `run_bissnp.sh` is a pipeline which run BisSNP for all BAM file in a directory.
The steps are :
* Creation of an index of the reference genome with `picard`
* Sorting of the reference VCF file with a Perl scrip from `GATK`
* Sorting of each BAM file with `picard`
* Run Perl script of `BisSNP` (bissnp_easy_usage.pl) for each BAM file
Jules Sabban's avatar
Jules Sabban committed
* Compression of each VCF (snp and cpg) produced with `bcftools view`
* Indexation of each compressed VCF (snp and cpg) with `bcftools index`
Jules Sabban's avatar
Jules Sabban committed
* Merge of all VCF in unique VCF with `bcftools merge`

Running of `run_bissnp.sh` :
```
sbatch run_bissnp.sh
```

***NB** : To run this script, change values of sbatch option and values in the `## Paths` part.*

### vcfsorter.pl
This is a script from GATK that sort VCF according to chromosome and position.

### bissnp_easy_usage.pl
This is a script from the GitHub repository of BisSNP, which make easy the run of all step of BisSNP.
To use `bissnp_easy_usage.pl`, it is necessary to clone the GitHub BisSNP repository and make symbolic link of JAR file to the BisSNP JAR installed on the cluster.