How to map NGS data using BWA-MEM

Hi sir,
It is definetly autumn. I recommend to visit Miyajima and try “Momiji Manju”.
Miyajima (cited from MATCHA)

Today, I introduce how to map NGS data using BWA-MEM.

BWAMEM aligner is one of the most famous tools in NGS analysis. Some tool requires data processed using BWA-MEM (It is difficult for beginners…).

I would like to suggest that you buy >20GB external hard drive/SSD/USB flash drive before you try the following analysis because NGS analysis manages numerous amounts of data. I usually 500 GB external hard drive to save NGS-associated data (saving directory: /Volumes/databank1/ngs). I don’t use my internal storage, it’s for Steam.

We introduce it, step by step. If you want to see the only mapping step, scroll down to “STEP4”.

STEP1: Creating a visual environment

1: Create visual environment

1
conda create --name bwa_env;

*If you cannot use conda command in your terminal, please visit How to install Anaconda.

2: Activate the visual environment

1
source activate bwa_env;

3: Add packages into the visual environment

1
2
3
conda config --add channels bioconda; # add channel
conda install --channel bioconda sra-tools sickle-trim flash bwa samtools; # install packages from the channel
conda config --remove channels bioconda; # remove the channel

STEP2: Preparing FASTQ file

*In this example, I use published amplicon sequence data. If you have the filtered & merged FASTQ files, skip this section.

4: Download compressed FASTQ files

Information Page of DRR147084 in SRA (DRR database)
You don’t have to access the site when you download it. Type the following command.

1
fastq-dump DRR147084 --split-files --gzip --outdir /Volumes/databank1/ngs; # Download from SRR(DRR)

5: Wait for a minute until you can type a command in your terminal again

*You can kill time by reading this paper which is the original report using the downloading deep sequencing data.

6: Filter the raw NGS files using sickle

1
2
3
sickle pe -f /Volumes/databank1/ngs/DRR147084_1.fastq.gz -r /Volumes/databank1/ngs/DRR147084_2.fastq.gz -t sanger \
-o /Volumes/databank1/ngs/qfltr_DRR147084_1.fastq -p /Volumes/databank1/ngs/qfltr_DRR147084_2.fastq \
-s /Volumes/databank1/ngs/single_DRR147084.fastq; # Trim average quality <\Q20 and read length < 20bp

*qfltr_DRR147084_1.fastq, qfltr_DRR147084_2.fastq, single_DRR147084.fastq are my choice. You can name it freely.
*Do you want to trim adapters before that? You can check the article.

7: Merge paired-end reads using FLASh

1
cd /Volumes/databank1/ngs/;flash /Volumes/databank1/ngs/qfltr_DRR147084_1.fastq /Volumes/databank1/ngs/qfltr_DRR147084_2.fastq;cd ~/Desktop;

The merged file “out.extendedFrags.fastq” is saved into /Volumes/databank1/ngs/ when the process is done.

STEP3: Generate the fasta file index of the human reference genome (version 38)

*If you got the NGS data from the other organism, you have to try it using reference genome of that. This site will be helpful to find the genome you want.

8: Download compressed files of the human reference genome (version 38)

1
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.chroms.tar.gz --directory-prefix /Volumes/databank1/ngs/;

9: Unpack the files

1
2
3
cd /Volumes/databank1/ngs/;
gunzip -c /Volumes/databank1/ngs/hg38.analysisSet.chroms.tar.gz | tar xopf -;
cd ~/Desktop;

10: Conbine files into hg38.fa

1
for i in `seq 1 22` X Y;do cat /Volumes/databank1/ngs/hg38.analysisSet.chroms/chr$i.fa >> /Volumes/databank1/ngs/hg38.fa;done;

11: Delete immediate files to save storage

1
rm -rf /Volumes/databank1/ngs/hg38.analysisSet.chroms/;

12: Generate the fasta file index of hg38

1
bwa index -p /Volumes/databank1/ngs/hg38 /Volumes/databank1/ngs/hg38.fa;

13: Wait for about 20-30 minutes until you can type a command in your terminal again

It is a long time. You can take a nap and browse the shops.

STEP4: Mapping the merged FASTQ file using human reference genome (hg38)

1
bwa mem -t 1 /Volumes/databank1/ngs/hg38 /Volumes/databank1/ngs/out.extendedFrags.fastq | samtools sort -o /Volumes/databank1/ngs/out.extendedFrags.fastq.bam;

*/Volumes/databank1/ngs/hg38 indicates the name of the reference index.

The mapped bam file “out.extendedFrags.fastq.bam” is saved into /Volumes/databank1/ngs/ when the process is done.
I guess that you can use it for some analysis.

(Option): View the mapped bam file using Integrative Genomics Viewer (IGV)

14: make index of mapped file.

1
samtools index /Volumes/databank1/ngs/out.extendedFrags.fastq.bam;

15: Download IGV

Download Page

Download Page For IGV

16: Install application

When I finished the download, the application “IGV_2.7.2.app” has been already installed in the Download directory.

17: Start IGV_2.7.2.app

18: Click “Genomes”>”Load Genome File…”

19: Select “/Volumes/databank1/ngs/hg38.fa”

20: Click “File” > “Load from File…”

21: Select /Volumes/databank1/ngs/out.extendedFrags.fastq.bam

22: Enter the genome region of the amplicon sequence in the search form

*The “DRR147084” is amplicon sequence data of ATP5B MMEJ-assisted knock-in sample. The gene region of ATP5B is “chr12:56,638,302-56,638,353”

amplicon sequence data in IGV

*You can check the region using genome browser.

This is a simple example. You can change the procedure because there is no royal road to NGS analysis (fortunately or unfortunately, I don’t know…).

Additionally, you can get the following variant calling if you use specific tools such as CrispRVariants.

amplicon sequence data in CrispRVariants

Environement

Hardware

  • MacBook Air (2017 model)
  • Processor 1.8 GHz Intel Core i5
  • Memory 8GB
  • 500 GB external hard drive

Software

  • macOS Mojave
  • GNU bash version : 3.2.57(1)-release (x86_64-apple-darwin18)
  • conda version : 4.7.11
  • conda-build version : 3.17.6
  • python version : 2.7.15.final.0

Thank you so much…