2019-10-27

How to map NGS data using BWA-MEM

Hi sir,
It is definetly autumn. I recommend to visit Miyajima and try “Momiji Manju”.
Miyajima (cited from MATCHA)

Today, I introduce how to map NGS data using BWA-MEM.

BWAMEM aligner is one of the most famous tools in NGS analysis. Some tool requires data processed using BWA-MEM (It is difficult for beginners…).

I would like to suggest that you buy >20GB external hard drive/SSD/USB flash drive before you try the following analysis because NGS analysis manages numerous amounts of data. I usually 500 GB external hard drive to save NGS-associated data (saving directory: /Volumes/databank1/ngs). I don’t use my internal storage, it’s for Steam.

We introduce it, step by step. If you want to see the only mapping step, scroll down to “STEP4”.

STEP1: Creating a visual environment

1: Create visual environment

1	conda create --name bwa_env;

*If you cannot use conda command in your terminal, please visit How to install Anaconda.

2: Activate the visual environment

1	source activate bwa_env;

3: Add packages into the visual environment

1
2
3

conda config --add channels bioconda; # add channel
conda install --channel bioconda sra-tools sickle-trim flash bwa samtools; # install packages from the channel
conda config --remove channels bioconda; # remove the channel

STEP2: Preparing FASTQ file

*In this example, I use published amplicon sequence data. If you have the filtered & merged FASTQ files, skip this section.

4: Download compressed FASTQ files

Information Page of DRR147084 in SRA (DRR database)
You don’t have to access the site when you download it. Type the following command.

1	fastq-dump DRR147084 --split-files --gzip --outdir /Volumes/databank1/ngs; # Download from SRR(DRR)

5: Wait for a minute until you can type a command in your terminal again

*You can kill time by reading this paper which is the original report using the downloading deep sequencing data.

6: Filter the raw NGS files using sickle

1
2
3

sickle pe -f /Volumes/databank1/ngs/DRR147084_1.fastq.gz -r /Volumes/databank1/ngs/DRR147084_2.fastq.gz -t sanger \
-o /Volumes/databank1/ngs/qfltr_DRR147084_1.fastq -p /Volumes/databank1/ngs/qfltr_DRR147084_2.fastq \
-s /Volumes/databank1/ngs/single_DRR147084.fastq; # Trim average quality <\Q20 and read length < 20bp

*qfltr_DRR147084_1.fastq, qfltr_DRR147084_2.fastq, single_DRR147084.fastq are my choice. You can name it freely.
*Do you want to trim adapters before that? You can check the article.

7: Merge paired-end reads using FLASh

1	cd /Volumes/databank1/ngs/;flash /Volumes/databank1/ngs/qfltr_DRR147084_1.fastq /Volumes/databank1/ngs/qfltr_DRR147084_2.fastq;cd ~/Desktop;

The merged file “out.extendedFrags.fastq” is saved into /Volumes/databank1/ngs/ when the process is done.

STEP3: Generate the fasta file index of the human reference genome (version 38)

*If you got the NGS data from the other organism, you have to try it using reference genome of that. This site will be helpful to find the genome you want.

8: Download compressed files of the human reference genome (version 38)

1	wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.chroms.tar.gz --directory-prefix /Volumes/databank1/ngs/;

9: Unpack the files

1
2
3

cd /Volumes/databank1/ngs/;
gunzip -c /Volumes/databank1/ngs/hg38.analysisSet.chroms.tar.gz | tar xopf -;
cd ~/Desktop;

10: Conbine files into hg38.fa

1	for i in `seq 1 22` X Y;do cat /Volumes/databank1/ngs/hg38.analysisSet.chroms/chr$i.fa >> /Volumes/databank1/ngs/hg38.fa;done;

11: Delete immediate files to save storage

1	rm -rf /Volumes/databank1/ngs/hg38.analysisSet.chroms/;

12: Generate the fasta file index of hg38

1	bwa index -p /Volumes/databank1/ngs/hg38 /Volumes/databank1/ngs/hg38.fa;

13: Wait for about 20-30 minutes until you can type a command in your terminal again

It is a long time. You can take a nap and browse the shops.

STEP4: Mapping the merged FASTQ file using human reference genome (hg38)

1	bwa mem -t 1 /Volumes/databank1/ngs/hg38 /Volumes/databank1/ngs/out.extendedFrags.fastq \| samtools sort -o /Volumes/databank1/ngs/out.extendedFrags.fastq.bam;

*/Volumes/databank1/ngs/hg38 indicates the name of the reference index.

The mapped bam file “out.extendedFrags.fastq.bam” is saved into /Volumes/databank1/ngs/ when the process is done.
I guess that you can use it for some analysis.

(Option): View the mapped bam file using Integrative Genomics Viewer (IGV)

14: make index of mapped file.

1	samtools index /Volumes/databank1/ngs/out.extendedFrags.fastq.bam;

15: Download IGV

Download Page

Download Page For IGV

16: Install application

When I finished the download, the application “IGV_2.7.2.app” has been already installed in the Download directory.

17: Start IGV_2.7.2.app

18: Click “Genomes”>”Load Genome File…”

19: Select “/Volumes/databank1/ngs/hg38.fa”

20: Click “File” > “Load from File…”

21: Select /Volumes/databank1/ngs/out.extendedFrags.fastq.bam

*The “DRR147084” is amplicon sequence data of ATP5B MMEJ-assisted knock-in sample. The gene region of ATP5B is “chr12:56,638,302-56,638,353”

amplicon sequence data in IGV

*You can check the region using genome browser.

This is a simple example. You can change the procedure because there is no royal road to NGS analysis (fortunately or unfortunately, I don’t know…).

Additionally, you can get the following variant calling if you use specific tools such as CrispRVariants.

amplicon sequence data in CrispRVariants

Environement

Hardware

MacBook Air (2017 model)
Processor 1.8 GHz Intel Core i5
Memory 8GB
500 GB external hard drive

Software

macOS Mojave
GNU bash version : 3.2.57(1)-release (x86_64-apple-darwin18)
conda version : 4.7.11
conda-build version : 3.17.6
python version : 2.7.15.final.0

Thank you so much…

n|nja

Bioinfomatics & Genome editing

How to map NGS data using BWA-MEM

STEP1: Creating a visual environment

1: Create visual environment

2: Activate the visual environment

3: Add packages into the visual environment

STEP2: Preparing FASTQ file

4: Download compressed FASTQ files

5: Wait for a minute until you can type a command in your terminal again

6: Filter the raw NGS files using sickle

7: Merge paired-end reads using FLASh

STEP3: Generate the fasta file index of the human reference genome (version 38)

8: Download compressed files of the human reference genome (version 38)

9: Unpack the files

10: Conbine files into hg38.fa

11: Delete immediate files to save storage

12: Generate the fasta file index of hg38

13: Wait for about 20-30 minutes until you can type a command in your terminal again

STEP4: Mapping the merged FASTQ file using human reference genome (hg38)

(Option): View the mapped bam file using Integrative Genomics Viewer (IGV)

14: make index of mapped file.

15: Download IGV

16: Install application

17: Start IGV_2.7.2.app

18: Click “Genomes”>”Load Genome File…”

19: Select “/Volumes/databank1/ngs/hg38.fa”

20: Click “File” > “Load from File…”

21: Select /Volumes/databank1/ngs/out.extendedFrags.fastq.bam

Environement

STEP1: Creating a visual environment

1: Create visual environment

2: Activate the visual environment

3: Add packages into the visual environment

STEP2: Preparing FASTQ file

4: Download compressed FASTQ files

5: Wait for a minute until you can type a command in your terminal again

6: Filter the raw NGS files using sickle

7: Merge paired-end reads using FLASh

STEP3: Generate the fasta file index of the human reference genome (version 38)

8: Download compressed files of the human reference genome (version 38)

9: Unpack the files

10: Conbine files into hg38.fa

11: Delete immediate files to save storage

12: Generate the fasta file index of hg38

13: Wait for about 20-30 minutes until you can type a command in your terminal again

STEP4: Mapping the merged FASTQ file using human reference genome (hg38)

(Option): View the mapped bam file using Integrative Genomics Viewer (IGV)

14: make index of mapped file.

15: Download IGV

16: Install application

17: Start IGV_2.7.2.app

18: Click “Genomes”>”Load Genome File…”

19: Select “/Volumes/databank1/ngs/hg38.fa”

20: Click “File” > “Load from File…”

21: Select /Volumes/databank1/ngs/out.extendedFrags.fastq.bam

22: Enter the genome region of the amplicon sequence in the search form

Environement