2.1 Sequencing QC

Teaching: 90 min || Exercises: 30 min

Overview

Questions:
  • How do you perform basic statistics to check the quality of Oxford Nanopore Technology (ONT) sequence reads?
  • How do you identify contaminants in your sequences?
  • What are the various tools used for carrying out quality control?
Learning Objectives:
  • To perform a Quality Control (QC) assessment of high throughput third generation sequence data - ONT
  • Interpret and critically evaluate data quality reports
  • To identify possible contamination in high throughput sequence data
  • Run simple scripts to achieve quality control of your genomic data
Key Points:
  • Before analysing your nanopore sequencing data, it is critical to quality control the data and only carry on with genomes that pass quality control checks
  • Best practice quality control assessments include:
    • Base quality check
    • Identifying mismatches
    • GC content and bias checks
    • False insertions and deletions
    • Identifying contamination
    • etc.
  • Best practice quality control tools for performing the quality checks include
  • Other tools which we will not explore here include
  • It is a good practice to clean up your directory as you proceed from analysis to analysis to avoid getting the error message that tells you you are running out of space/memory.

2.1.1 Background

Before we delve into having a look at our own genomic data. Lets take some minutes to explore what to look out for when performing Quality Control (QC) checks on our sequences. For this part of the course, we will largely focus on third generation sequences obtained from ONT sequencers. As you may already know from our previous lesson, the main output files expected from our ONT sequencer are .fast5 files which are readily converted to FastQ files for downstream analysis. Here, we will attempt to quality trim our Nanopore data using PoreChop and/or fastp and assess basic sequencing QC utilizing NanoPlot, fastq-scan and FastQC. Other frequently used QC tools include PycoQC. We would then check for contamination using both GC content statistics and Kraken 2. We will also combine reports from all these tools using MultiQC.

2.1.2 Joining multiple FastQ files

Normally, Guppy barcoder writes demultiplexed reads to multiple FastQ files per barcode, each having (normally) 4000 reads. To go further with your sequence data, you will have to join these FastQ files together into a single file per barcode.

We can do this in two ways with simple lines of commands.

-1 Unzip gunzip the files, combine them into one zipped file gzip and redirect it > into the required file name.

The basic command looks like this.

gunzip -c *.fastq.gz | gzip -c > joined.fastq.gz

You can now run the command on your first set of fastq files.

gunzip -c *.fastq.gz | gzip -c > barcode30.fastq.gz

This uses a lot of memory and may not be the best option to perforn this task.

-2 The second way to do this is to concatenate the files using the cat command.

cat *.fastq.gz > barcode30.fastq.gz

Let’s navigate into any of our ONT sequence directory and run any of the above codes. For instance let’s run the command on barcode30. Navigate to the barcode30 directory, copy paste the command and run it.

QC assessment of ONT data

As you may already know, QC is an important part of any analysis. In this section we are going to look at some of the metrics and graphs that can be used to assess the QC of NGS data.

Base quality

Illumina sequencing technology relies on sequencing by synthesis. One of the most common problems with this is dephasing. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or perhaps misses to incorporate one at all. The more cycles that are run (i.e. the longer the read length gets), the greater the accumulation of these types of errors gets. This leads to a heterogeneous population in the cluster, and a decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

Base Quality

Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality reads, for example using a tool called Trimmomatic. In this workshop, we will do this using the tool fastp. In addition to trimming low quality reads, fastpwill also be used to trim off Illumina adapter/primer sequences.

The figures below shows an example of a high-quality read data (left) and a poor quality read data (right).

High-quality read data

Poor quality read data

Base Quality Comparison

In addition to Phasing noise and signal decay resulting from dephasing issues described above, there are several different reasons for a base to be called incorrectly. You can lookup these later by clicking here.

Mismatches per cycle

Aligning reads to a high-quality reference genome can provide insight to the quality of a sequencing run by showing you the mismatches to the reference sequence. This can help you detect cycle-specific errors. Mismatches can occur due to two main causes, sequencing errors and differences between your sample and the reference genome, which is important to bear in mind when interpreting mismatch graphs. The figures below show an example of a good run (top) and a bad one (bottom). In the first figure, the distribution of the number of mismatches is even between the cycles, which is what we would expect from a good run. However, in the second figure, two cycles stand out with a lot of mismatches compared to the other cycles.

Good run

Poor run

GC bias

It is a good idea to compare the GC content of the reads against the expected distribution in a reference sequence. The GC content varies between species, so a shift in GC content like the one seen below (right image) could be an indication of sample contamination. In the left image below, we can see that the GC content of the sample is about the same as for the theoretical reference, at ~65%. However, in the right figure, the GC content of the sample shows two distribution, one is closer to 40% and the other closer to 65%, indicating that there is an issue with this sample — a possible missed sample. Note that, suspecting contamination is perfectly fine for the species we are dealing with (MTBC). For other bacteria where there may be possibility of gene transfer, one can imagine that, such a situation may be from inheriting some plasmids that have a totally different GC content to the bacteria chromosome (This is arguable though).

Single GC distribution

Double GC distribution

Figure: Base Quality Comparison

GC content by cycle

Looking at the GC content per cycle can help detect if the adapter sequence was trimmed. For a random library, it is expected to be little to no difference between the different bases of a sequence run, so the lines in this plot should be parallel with each other like in the first of the two figures below. In the second of the figures, the initial spikes are likely due to adapter sequences that have not been removed.

Good run

Poor run

Insert size

For paired-end sequencing the size of DNA fragments also matters. In the first of the examples below, the insert size peaks around 440 bp. In the second however, there is also a peak at around 200 bp. This indicates that there was an issue with the fragment size selection during library prep.

Good run

Poor run

Insertions/Deletions per cycle

Sometimes, air bubbles occur in the flow cell, which can manifest as false indels. The spike in the second image provides an example of how this can look.

Good run

Poor run

In addition to the QC plots you’ve encountered so far, there are other metrics that are generated with very powerful tools. For this workshop, we will explore these quality metrics with the help of fastq-scan and FastQC tools. It is often not a good practice to carry on analysis on samples that are contaminated with sequences from other species. We will identify contamination using either one of two ways. As earlier mentioned, the GC content varies between species, so a shift in GC content could be an indication of sample contamination. One other way of identifying sample contamination is by using specialized tools to determine/predict the species composition of your sample. For this course, we will determine species composition using the Kraken 2 database.

2.1.4 Generating QC stats and metrics

We are now ready to explore some quality metrics on our Nanopore sequence data.

Tools used and how to get help

This tutorial uses NanoPlot, fastq-scan, FastQC, fastp, PoreChop, Kraken 2, Bracken and MultiQC, which have been preinstalled for you in a virtual environment called qc_ont. This is to help speed up the pace of the workshop. However, NanoPlot is in a seperate environment due to dependency issues.

If for some reason, this environment is not pre-created for you, you can refer to the setup page for creating this conda environment. This will also be a good practice for you. In a latter chapter of the course, you will also be introduced to how to set up these virtual environments and explore its usefulness. For each tool, to get help messages that describe how they are used, you can simply type the name of the tool and hit enter. This only works if you activate the environment in which the tools have already been installed. Alternatively, you can use the help flag -help or -h as appropriate. You can also invoke the manual of any tool (where available) by using the man command followed by the name of the tool.

Disk Usage I — Before analysis

Before we start investigating our genomic sequences, let’s pause and check the space of our current working directory.

You can do this with the disk usage du command

du -h
Current Disk Space In QC Directory ~497M

Now, keep this value in mind, we will come back to it at the end of the chapter.

NanoPlot

Nanoplot is a plotting tool for long read sequencing data and alignments. It is also available as a web service.

In addition to various plots also a NanoStats file is created summarizing key features of the dataset.

This script performs data extraction from Oxford Nanopore sequencing data in the following formats:

  • fastq files (can be bgzip, bzip2 or gzip compressed)
  • fastq files generated by albacore, guppy or MinKNOW containing additional information (can be bgzip, bzip2 or gzip compressed)
  • sorted bam files
  • sequencing_summary.txt output table generated by albacore, guppy or MinKnow basecalling (can be gzip, bz2, zip and xz compressed)
  • fasta files (can be bgzip, bzip2 or gzip compressed) Multiple files of the same type can be offered simultaneously
Help

I have tried several options to pull out the help function of NanoPlot, only to find out that the N and P should be in capital letters.

NanoPlot
usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge] [-o OUTDIR] [--no_static] [-p PREFIX]
                [--tsv_stats] [--info_in_report] [--maxlength N] [--minlength N] [--drop_outliers] [--downsample N]
                [--loglength] [--percentqual] [--alength] [--minqual N] [--runtime_until N] [--readtype {1D,2D,1D2}]
                [--barcoded] [--no_supplementary] [-c COLOR] [-cm COLORMAP]
                [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]] [--plots [{kde,hex,dot} ...]]
                [--legacy [{kde,dot,hex} ...]] [--listcolors] [--listcolormaps] [--no-N50] [--N50] [--title TITLE]
                [--font_scale FONT_SCALE] [--dpi DPI] [--hide_stats]
                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram file [file ...] | --pickle pickle | --feather file [file ...])

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:
  -h, --help            show the help and exit
  -v, --version         Print version and exit.
  -t, --threads THREADS
                        Set the allowed number of threads to be used by the script
  --verbose             Write log messages also to terminal.
Usage

for nanopore data

NanoPlot
usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge] [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats] [--info_in_report] [--maxlength N]
                [--minlength N] [--drop_outliers] [--downsample N] [--loglength] [--percentqual] [--alength] [--minqual N] [--runtime_until N] [--readtype {1D,2D,1D2}]
                [--barcoded] [--no_supplementary] [-c COLOR] [-cm COLORMAP] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]] [--plots [{kde,hex,dot} ...]]
                [--legacy [{kde,dot,hex} ...]] [--listcolors] [--listcolormaps] [--no-N50] [--N50] [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI] [--hide_stats]
                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram file [file ...] | --pickle pickle | --feather file [file ...])

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:
  -h, --help            show the help and exit
  -v, --version         Print version and exit.
  -t, --threads THREADS
                        Set the allowed number of threads to be used by the script
  --verbose             Write log messages also to terminal.
...
OUTPUT

NanoPlot creates:

  • a statistical summary
  • a number of plots
  • a html summary file

Now run Nanoplot on your sample and interpret the html file produced.

NanoPlot -t 2 --fastq barcode30.fastq.gz --maxlength 40000 --plots dot --legacy hex
Copying files and directories with SCP or Rsync

The genral synthax for performing the above functions are:

Using scp

  • To copy a file from a remote server:
scp ssh user@IP.address:/path/file_name /local/destination/path/
  • To copy a directory from a remote server:
scp -r ssh user@IP.address:/path/directory[/] /local/destination/path/
  • To copy a file to a remote server:
scp ssh /local/path/file_name user@IP.address:/destination/path/
  • To copy a directory to a remote server:
scp -r ssh /local/path/directory[/]  user@IP.address:/destination/path/

Using rsync

Because Rsync transfers files recursively, you do not need to add the -r flag. You can use the following commands to transfer the files in an archived or compressed manner:

  • To copy a file from a remote server:
rsync [-avz] user@IP.address:/path/file_name /local/destination/path/
  • To copy a directory from a remote server:
rsync [-avz] user@IP.address:/path/directory[/] /local/destination/path/
  • To copy a file to a remote server:
rsync [-avz] /local/path/file_name user@IP.address:/destination/path/
  • To copy a directory to a remote server:
rsync [-avz] /local/path/directory[/]  user@IP.address:/destination/path/

To copy your html file from the server to your downloads, open a new terminal and cd to your downloads. Run the below command

rsync -azhp student??@10.27.7.13:/home/student??/ONT/NanoPlot-report.html .
scp -p ssh student??@10.27.7.13:/home/student??/ONT/NanoPlot-report.html .

Now that we are done with NanoPlot, let’s deactivate the nanoplot environment and go ahead and activate the qc_ont environment, so we can run the remaining tools:

mamba deactivate
mamba activate qc_ont

fastq-scan

fastq-scan is a very quick tool that reads a FASTQ from STDIN and outputs summary statistics (read lengths, per-read qualities, per-base qualities) in JSON format.

Help

Do this to get the help information for fastq-scan

fastq-scan -h
Usage: cat FASTQ | fastq-scan [options]
Version: 1.0.0

Optional arguments:
 -g INT   Genome size for calculating estimated sequencing coverage. (Default 1)
 -p INT   ASCII offset for input quality scores, can be 33 or 64. (Default 33)  
 -q       Print only the QC stats, do not print read lengths or per-base quality scores
 -v       Print version information and exit
 -h       Show this message and exit
Usage

fastq-scan reads from STDIN, so pretty much any FASTQ output can be piped into fastq-scan. There are a few things to be aware of. It assumes that all FASTQ entries are the four line variant. Also, it has a PHRED offset (33 vs 64) guesser function. By default it is set to PHRED33, it could produce errors if there are not any PHRED33 or PHRED64 specific characters in the quality scores.

You can now go ahead and perform the fastq-scan with the below command.

Note

Try this command out first. You may have to add < after the zcat command depending on your OS. zcat works differently with different OS.

Do this to run any of the ONT data and report on your findings:

zcat barcode30.fastq.gz | fastq-scan -g 4000000 > barcode30_fastq-scan.json

*NB. In the above command, we assume a genome size of 4Mb.

You should now see one new file generated in your current directory.

To see if the file have been created, use this command:

ls | grep "fastq-scan.json"
barcode30_fastq-scan.json

To view the statistics for your data use this command:

cat barcode30_fastq-scan.json
Note

The file created is rather small, so you can afford to use cat to view the entire content. You may want to use head or less for huge files.

As you may have realized, the content of the output file doesn’t look friendly.

Let’s convert the .json files into a more friendly format, say tsv. You should know what tsv files are by now. If not, you can visite this link to explore more on file-formats to equip yourself.

Parse json files into tsv format

We will run a simple python script to achieve this purpose. In your working directory, you will find a file named fastq-scan_parser.py. This is a simple python script that will do the job for us.

We can go ahead and execute that python script by running the command below:

python fastq-scan_parser.py

You should now see another file generated in your directory called fastq-scan_summary.tsv. What the python script did was to extract the relevant information from all .json files in the working directory and convert them to a .tsv file.

Now let’s go ahead and have a look at the .tsv file with the command

cat fastq-scan_summary.tsv
sample  total_bp    coverage    read_total  read_min    read_mean   read_std    read_median read_max    read_25th   read_75th   qual_min    qual_mean   qual_std    qual_max    qual_median qual_25th   qual_75th
barcode30   305515607   71.0501 3024907 101 101.0   0.0 101 101 101 101 14  35.2243 3.49844 37  37  35  37

Alternatively, you can open the tsv file with any appropriate GUI software (excel or libreoffice)

FastQC

FastQC is a program designed to spot potential problems in high throughput sequencing datasets. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report which summarises the results. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

FastQC will highlight any areas where the library looks unusual and where you should take a closer look. The program is not tied to any specific type of sequencing technique and can be used to look at libraries coming from a large number of different experiment types (Genomic Sequencing, ChIP-Seq, RNA-Seq, BS-Seq etc etc).

The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application
Help

Do this to get the help information for FastQC

fastqc -h

            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

        fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the program and exit
    
    -o --outdir     Create all output files in the specified output directory.
                    Please note that this directory must exist as the program
                    will not create it.  If this option is not set then the 
                    output file for each sequence file is created in the same
                    directory as the sequence file which was processed.
                    
    --casava        Files come from raw casava output. Files in the same sample
                    group (differing only by the group number) will be analysed
                    as a set rather than individually. Sequences with the filter
                    flag set in the header will be excluded from the analysis.
                    Files must have the same names given to them by casava
                    (including being gzipped and ending with .gz) otherwise they
                    won't be grouped together correctly.
...
Usage

FastQC reads a set of sequence files and produces from each one a quality control report consisting of a number of different modules, each one of which will help to identify a different potential type of problem in your data. If no files to process are specified on the command line then the program will start as an interactive graphical application. If files are provided on the command line then the program will run with no user interaction required. In this mode it is suitable for inclusion into a standardised analysis pipeline.

The general format of the command is:

fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 seqfile2 .. seqfileN

You can now go ahead and perform the fastQC with the below command.

Note

Do this to run your Nanopore data:

fastqc --threads 4 barcode30.fastq.gz
Started analysis of barcode30.fastq.gz
Approx 5% complete for barcode30.fastq.gz
Approx 10% complete for barcode30.fastq.gz
Approx 15% complete for barcode30.fastq.gz
Approx 20% complete for barcode30.fastq.gz
Approx 25% complete for barcode30.fastq.gz
...
Output

If you specified an output -o directory then you should look out for that file being created in that directory. For our situation, we didn’t specify any output directory so the result will just be in the current directory. You should now see two new files generated in your current directory.

How do you tell which file has been recently produced?
Hint

Perform a simple ls command with the arguments -lhrt and the last file in the output should be the most recent.

ls -lhrt

We are interested in the one that ends with .html. Go ahead and open it. Being an .html file, it will prefer to open in a browser, and that’s just how we want to make sense out of it.

We have already seen some of the content of the output file from the background to this chapter. However, this time, let’s look at a few more and also with some more details.

Basic Statistics

The first information you encounter is the basic statistics. The Basic Statistics module generates some simple composition statistics for the file analysed.

  • Filename: The original filename of the file which was analysed
  • File type: Says whether the file appeared to contain actual base calls or colorspace data which had to be converted to base calls
  • Encoding: Says which ASCII encoding of quality values was found in this file.
  • Total Sequences: A count of the total number of sequences processed. There are two values reported, actual and estimated. At the moment these will always be the same.
  • Filtered Sequences: If running in Casava mode sequences flagged to be filtered will be removed from all analyses. The number of such sequences removed will be reported here. The total sequences count above will not include these filtered sequences and will the number of sequences actually used for the rest of the analysis.
  • Sequence Length: Provides the length of the shortest and longest sequence in the set. If all sequences are the same length only one value is reported.
  • %GC: The overall %GC of all bases in all sequences

FastQC Basic Statistics

Per Base Sequence Quality

The plot shows an overview of the range of quality values across all bases at each position in the FastQ file.

Per Base Sequence Quality

For each position a BoxWhisker type plot is drawn. The elements of the plot are as follows: - The central red line is the median value - The yellow box represents the inter-quartile range (25-75%) - The upper and lower whiskers represent the 10% and 90% points - The blue line represents the mean quality

The y-axis on the graph shows the quality scores. The higher the score the better the base call. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

It should be mentioned that there are number of different ways to encode a quality score in a FastQ file. FastQC attempts to automatically determine which encoding method was used, but in some very limited datasets it is possible that it will guess this incorrectly (ironically only when your data is universally very good!). The title of the graph will describe the encoding FastQC thinks your file used.

NB. Results from this module will not be displayed if your input is a BAM/SAM file in which quality scores have not been recorded.

Warning A warning will be issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.

Failure This module will raise a failure if the lower quartile for any base is less than 5 or if the median for any base is less than 20.

Common reasons for warnings

The most common reason for warnings and failures in this module is a general degradation of quality over the duration of long runs. In general sequencing chemistry degrades with increasing read length and for long runs you may find that the general quality of the run falls to a level where a warning or error is triggered.

If the quality of the library falls to a low level then the most common remedy is to perform quality trimming where reads are truncated based on their average quality. For most libraries where this type of degradation has occurred you will often be simultaneously running into the issue of adapter read-through so a combined adapter and quality trimming step is often employed.

Another possibility is that a warn / error is triggered because of a short loss of quality earlier in the run, which then recovers to produce later good quality sequence. This can happen if there is a transient problem with the run (bubbles passing through a flow cell for example). You can normally see this type of error by looking at the per-tile quality plot (if available for your platform). In these cases trimming is not advisable as it will remove later good sequence, but you might want to consider masking bases during subsequent mapping or assembly.

If your library has reads of varying length then you can find a warning or error is triggered from this module because of very low coverage for a given base range. Before committing to any action, check how many sequences were responsible for triggering an error by looking at the sequence length distribution module results.

Per Sequence Quality Scores

The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc), however these should represent only a small percentage of the total sequences.

Per Sequence Quality Scores

If a significant proportion of the sequences in a run have overall low quality then this could indicate some kind of systematic problem - possibly with just part of the run (for example one end of a flow cell).

NB. Results from this module will not be displayed if your input is a BAM/SAM file in which quality scores have not been recorded.

Warning A warning is raised if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.

Failure An error is raised if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.

Common reasons for warnings This module is generally fairly robust and errors here usually indicate a general loss of quality within a run. For long runs this may be alleviated through quality trimming. If a bi-modal, or complex distribution is seen then the results should be evaluated in concert with the per-tile qualities (if available) since this might indicate the reason for the loss in quality of a subset of sequences.

Per Base Sequence Content

Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.

Per Base Sequence Content

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5’ end of the reads. Whilst this is a true technical bias, it isn’t something which can be corrected by trimming and in most cases doesn’t seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.

Warning This module issues a warning if the difference between A and T, or G and C is greater than 10% in any position.

Failure This module will fail if the difference between A and T, or G and C is greater than 20% in any position.

Common reasons for warnings

There are a number of common scenarios which would elicit a warning or error from this module.

Overrepresented sequences: If there is any evidence of overrepresented sequences such as adapter dimers or rRNA in a sample then these sequences may bias the overall composition and their sequence will emerge from this plot. Biased fragmentation: Any library which is generated based on the ligation of random hexamers or through tagmentation should theoretically have good diversity through the sequence, but experience has shown that these libraries always have a selection bias in around the first 12bp of each run. This is due to a biased selection of random primers, but doesn’t represent any individually biased sequences. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn’t seem to adversely affect the ability to measure expression. Biased composition libraries: Some libraries are inherently biased in their sequence composition. The most obvious example would be a library which has been treated with sodium bisulphite which will then have converted most of the cytosines to thymines, meaning that the base composition will be almost devoid of cytosines and will thus trigger an error, despite this being entirely normal for that type of library If you are analysing a library which has been aggressively adapter trimmed then you will naturally introduce a composition bias at the end of the reads as sequences which happen to match short stretches of adapter are removed, leaving only sequences which do not match. Sudden deviations in composition at the end of libraries which have undergone aggressive trimming are therefore likely to be spurious.

[Per Sequence GC content]

We will talk more about GC content under identifying contaminiation. For now, let’s just have a look at our plot.

Per_sequence_GC Content

Per Base N Content

If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base] call

This module plots out the percentage of base calls at each position for which an N was called.

Per Base N Content

It’s not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.

Warning This module raises a warning if any position shows an N content of >5%.

Failure This module will raise an error if any position shows an N content of >20%.

Common reasons for warnings

The most common reason for the inclusion of significant proportions of Ns is a general loss of quality, so the results of this module should be evaluated in concert with those of the various quality modules. You should check the coverage of a specific bin, since it’s possible that the last bin in this analysis could contain very few sequences, and an error could be prematurely triggered in this case.

Another common scenario is the incidence of a high proportions of N at a small number of positions early in the library, against a background of generally good quality. Such deviations can occur when you have very biased sequence composition in the library to the point that base callers can become confused and make poor calls. This type of problem will be apparent when looking at the per-base sequence content results.

I am sure by now, you will be bored with the many quality control tools. Now, let’s try our hands on some trimming to improve the quality of our sequences. To do this, we will use fastpand PoreChop.

fastp

click here to view the publication on fastp fastp is a tool designed to provide fast all-in-one pre-processing for FastQ files.

It is mostly used on short-reads illumina data ban can be applied to long-read ONT.

Help

Do this to get the help information for fastp

fastp --help
fastp: an ultra-fast all-in-one FASTQ pre-processor
version 0.23.2
usage: fastp [options] ... 
options:
  -i, --in1                            read1 input file name (string [=])
  -o, --out1                           read1 output file name (string [=])
  -I, --in2                            read2 input file name (string [=])
  -O, --out2                           read2 output file name (string [=])
      --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
      --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
...

Features

  • comprehensive quality profiling for both before and after filtering data (quality - curves, base contents, KMER, Q20/Q30, GC Ratio, duplication, adapter contents…)
  • filter out bad reads (too low quality, too short, or too many N…)
  • cut low quality bases for per read in its 5’ and 3’ by evaluating the mean quality - from a sliding window (like Trimmomatic but faster).
  • trim all reads in front and tail
  • cut adapters. Adapter sequences can be automatically detected, which means you don’t have to input the adapter sequences to trim them.
  • correct mismatched base pairs in overlapped regions of paired end reads, if one base is with high quality while the other is with ultra low quality
  • trim polyG in 3’ ends, which is commonly seen in NovaSeq/NextSeq data. Trim polyX in 3’ ends to remove unwanted polyX tailing (i.e. polyA tailing for mRNA-Seq data)
  • preprocess unique molecular identifier (UMI) enabled data, shift UMI to sequence name.
  • report JSON format result for further interpreting.
  • visualize quality control and filtering results on a single HTML page (like FASTQC - but faster and more informative).
  • split the output to multiple files (0001.R1.gz, 0002.R1.gz…) to support parallel - processing. Two modes can be used, limiting the total split file number, or limiting the lines of each split file.
  • support long reads (data from PacBio / Nanopore devices).
  • support reading from STDIN and writing to STDOUT
  • support interleaved input
  • support ultra-fast FASTQ-level deduplication
Usage

for single end data (not compressed)

fastp -i in.fq -o out.fq

for paired end data (gzip compressed)

fastp -i in_R1.fq.gz -I in_R2.fq.gz -o out_R1.fq.gz -O out_R2.fq.gz

By default, the HTML report is saved to fastp.html (can be specified with -h option), and the JSON report is saved to fastp.json (can be specified with -j option).

You can now go ahead and perform the fastp with the below command. We will perform the run on one of our Nanopore reads.

Note
fastp --cut_front --cut_tail --trim_poly_x --cut_mean_quality 10 --qualified_quality_phred 10 --unqualified_percent_limit 10 --length_required 50 --in1 barcode30.fastq.gz --out1 barcode30.trim.fastq.gz --json barcode30_fastp.json --html barcode30_fastp.html --thread 4
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 54368
total bases: 257262215
Q20 bases: 140726297(54.7015%)
Q30 bases: 59979993(23.3147%)

Read1 after filtering:
total reads: 2788
total bases: 13508368
Q20 bases: 9981247(73.8894%)
Q30 bases: 4923721(36.4494%)

Filtering result:
reads passed filter: 2788
reads failed due to low quality: 51575
reads failed due to too many N: 0
reads failed due to too short: 5
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
reads with polyX in 3' end: 142
bases trimmed in polyX tail: 4340

Duplication rate (may be overestimated since this is SE data): 0%

JSON report: barcode30_fastp.json
HTML report: barcode30_fastp.html

fastp --cut_front --cut_tail --trim_poly_x --cut_mean_quality 10 --qualified_quality_phred 10 --unqualified_percent_limit 10 --length_required 50 --in1 barcode30.fastq.gz --out1 barcode30.trim.fastq.gz --json barcode30_fastp.json --html barcode30_fastp.html --thread 4
fastp v0.23.2, time used: 5 seconds

...

You can also read out what is printed out on the terminal as a log file by passing the output to a textfile named fastp.log. This way, you don’t see all the long print out on the terminal.

fastp --cut_front --cut_tail --trim_poly_x --cut_mean_quality 10 --qualified_quality_phred 10 --unqualified_percent_limit 10 --length_required 50 --in1 barcode30.fastq.gz --out1 barcode30.trim.fastq.gz --json barcode30_fastp.json --html barcode30_fastp.html --thread 4 > fastp.log

Note that the analysis perfomed above is just for demonstration purposes hence parameters set may not represent the true states.

You should now see three new files generated in your current directory.

barcode30.trim.fastq.gz
barcode30_fastp.json
barcode30_fastp.html

Now let’s go ahead and see what our output looks like by opening the barcode30_fastp.html file. You should be able to interpret the output by now.

Note

Note that the date presented below are just for demonstration purposes adn do not represent the actual data from the barcode30.fastq.gz reads.

Summary of fastp report

Let’s have a look at the quality of the reads before and after trimming. What obvious differences do you notice?

Read Quality before filtering

Read Quality after filtering

Read Quality before and after fastp processing

Base content quality before filtering

Base content quality after filtering

Base Content Quality before and after fastp processing

PoreChop

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Help

Do this to get the help information for PoreChop

porechop --help
usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY] [-t THREADS] [-b BARCODE_DIR]
                [--barcode_threshold BARCODE_THRESHOLD] [--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed]
                [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME]
                [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD]
                [--no_split] [--discard_middle] [--middle_threshold MIDDLE_THRESHOLD]
                [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE] [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE]
                [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]

Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting reads with internal adapters

Main options:
  -i INPUT, --input INPUT               FASTA/FASTQ of input reads or a directory which will be recursively searched for FASTQ files
                                        (required)
  -o OUTPUT, --output OUTPUT            Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed reads will be printed to stdout)
  --format {auto,fasta,fastq,fasta.gz,fastq.gz}
                                        Output format for the reads - if auto, the format will be chosen based on the output filename or the
                                        input read format (default: auto)
  -v VERBOSITY, --verbosity VERBOSITY   Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full - output will go to stdout if
                                        reads are saved to a file and stderr if reads are printed to stdout (default: 1)
  -t THREADS, --threads THREADS         Number of threads to use for adapter alignment (default: 4)
...
Usage

For basic adapter trimming:

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz

How it works

Find matching adapter sets

Porechop first aligns a subset of reads (default 10000 reads, change with --check_reads) to all known adapter sets. Adapter sets with at least one high identity match (default 90%, change with --adapter_threshold) are deemed present in the sample.

Identity in this step is measured over the full length of the adapter. E.g. in order to qualify for a 90% match, an adapter could be present at 90% identity over its full length, or it could be present at 100% identity over 90% of its length, but a 90% identity match over 90% of the adapter length would not be sufficient.

The alignment scoring scheme used in this and subsequent alignments can be modified using the --scoring_scheme option (default: match = 3, mismatch = -6, gap open = -5, gap extend = -2).

Trim adapters from read ends

The first and last bases in each read (default 150 bases, change with --end_size) are aligned to each present adapter set. When a long enough (default 4, change with --min_trim_size) and strong enough (default 75%, change with –end_threshold) match is found, the read is trimmed. A few extra bases (default 2, change with --extra_end_trim) past the adapter match are removed as well to ensure it’s all removed.

Identity in this step is measured over the aligned part of the adapter, not its full length. E.g. if the last 5 bases of an adapter exactly match the first 5 bases of a read, that counts as a 100% identity match and those bases will be trimmed off. This allows Porechop to effectively trim partially present barcodes.

The default --end_threshold is low (75%) because false positives (trimming off some sequence that wasn’t really an adapter) shouldn’t be too much of a problem with long reads, as only a tiny fraction of the read is lost.

Split reads with internal adapters

The entirety of each read is aligned to the present adapter sets to spot cases where an adapter is in the middle of the read, indicating a chimera. When a strong enough match is found (default 85%, change with --middle_threshold), the read is split. If the resulting parts are too short (default less than 1000 bp, change with --min_split_read_size), they are discarded.

The default --middle_threshold (85%) is higher than the default --end_threshold (75%) because false positives in this step (splitting a read that is not chimeric) could be more problematic than false positives in the end trimming step. If false negatives (failing to split a chimera) are worse for you than false positives (splitting a non-chimera), you should reduce this threshold (e.g. --middle_threshold 75).

Extra bases are also removed next to the hit, and how many depends on the side of the adapter. If we find an adapter that’s expected at the start of a read, it’s likely that what follows is good sequence but what precedes it may not be. Therefore, a few bases are trimmed after the adapter (default 10, change with --extra_middle_trim_good_side) and more bases are trimmed before the adapter (default 100, change with --extra_middle_trim_bad_side). If the found adapter is one we’d expect at the end of the read, then the “good side” is before the adapter and the “bad side” is after the adapter.

Here is a real example of the “good” and “bad” sides of an adapter. The adapter is in the middle of this snippet (SQK-NSK007_Y_Top at about 90% identity). The bases to the left are the “bad” side and their repetitive nature is clear. The bases to the right are the “good” side and represent real biological sequence.

TGTTGTTGTTGTTATTGTTGTTATTGTTGTTGTATTGTTGTTATTGTTGTTGTTGTACATTGTTATTGTTGTATTGTTGTTATTGTTGTTGTATTATCGGTGTACTTCGTTCAGTTACGTATTACTATCGCTATTGTTTGCAGTGAGAGGTGGCGGTGAGCGTTTTCAAATGGCCCTGTACAATCATGGGATAACAACATAAGGAACGGACCATGAAGTCACTTCT

Discard reads with internal adapters

If you run Porechop with --discard_middle, the reads with internal adapters will be thrown out instead of split.

If you plan on using your reads with Nanopolish, then the --discard_middle option is required. This is because Nanopolish first runs nanopolish index to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop without --discard_middle, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.

This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination.

Porechop can perform a lot of other very important analysis including demultiplexing. You can find more by following this link.

For now let’s make our hands dry by performing porechop on any of our reads.

porechop -i barcode30.fastq.gz -o barcode30_porechopped.fastq.gz
Loading reads
barcode30.fastq.gz
54,368 reads loaded


Looking for known adapter sets
10,000 / 10,000 (100.0%)
                                        Best               
                                        read       Best    
                                        start      read end
  Set                                   %ID        %ID     
  SQK-NSK007                                82.1       80.0
  Rapid                                    100.0        0.0
  RBK004_upstream                           86.8        0.0
  SQK-MAP006                                78.6       80.0
...

We see from our run, three basic analysis:

  • Trimming adapter from the reads
  • Splitting reads containing adapter
  • saving trimed reads to file

Porechop basic analysis

Also take time to go throught the list of adapters. Do you see anything unusual?

Hint You will find that only the adapters used for your samples are highlighted

Known issues

Adapter search

Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:

  • As the number of kits/barcodes has grown, adapter-search as part of the Porechop’s pipeline has become increasingly slow.
  • Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
  • Many ONT adapters share common sequence with each other, making false positive adapter finds possible.
Exercise 2.1.1: Going back to run fastqc and fastp

Now that you are almost becoming a pro in the use fo the commandline, - write down the code to run fastqc and fastp on your output from the previous run. In this case our trimmed reads - barcode30_porechopped.fastq.gz - Go on and run the codes and compare the results to the previous fastqc and fastp outputs. - What do you think are the advantages and disadvantages of using any of the two trimming tools - fastp and porechop?.

fastqc --threads 4 barcode30_porechopped.fastq.gz
fastp --cut_front --cut_tail --trim_poly_x --cut_mean_quality 10 --qualified_quality_phred 10 --unqualified_percent_limit 10 --length_required 50 --in1 barcode30.fastq.gz --out1 barcode30_porechopped.fastq.gz --json barcode30_fastp.json --html barcode30_porechopped.fastp.html --thread 4

2.1.5 Identifying contamination

It is always a good idea to check that your data is from the species you expect it to be. A very useful tool for this is Kraken. In this workshop we will go through how you can use Kraken to check your samples for contamination.

Kraken 2

click here to view the publication on Kraken 2 kraken is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer.

The first version of Kraken used a large indexed and sorted list of -mer/LCA pairs as its database. While fast, the large memory requirements posed some problems for users, and so Kraken 2 was created to provide a solution to those problems.

Help

Do this to get the help information for Kraken 2

Kraken2 --help
Usage: kraken2 [options] <filename(s)>

Options:
  --db NAME               Name for Kraken 2 DB
                          (default: none)
  --threads NUM           Number of threads (default: 1)
  --quick                 Quick operation (use first hit or hits)
  --unclassified-out FILENAME
                          Print unclassified sequences to filename
  --classified-out FILENAME
                          Print classified sequences to filename
  --output FILENAME       Print output to filename (default: stdout); "-" will
                          suppress normal output
  --confidence FLOAT      Confidence score threshold (default: 0.0); must be
                          in [0, 1].
...
Usage

Kraken 2 is run on a database. You don’t have to worry about the database for now. All has been setup for you. To set up the database yourself, visit the setup page

The general synthax for running Kraken 2 is:

kraken2 --db $DBNAME [other_options] <filename(s)>

To run Kraken 2, you need to provide the path to the database. By default, the input files are assumed to be in FASTA format, so in this case we also need to tell Kraken that our input files are in FASTQ format, gzipped, and that they are ONT reads.

You can now go ahead and run Kraken 2 with the below command.

Note

Take note of these: 1. You have to specify the directory to the database 2. Our input files will be the porechopped fastq sequences from the fastp analysis.

Do this to run Kraken 2:

kraken2 --db ../database/minikraken2_v1_8GB --threads 8 --unclassified-out barcode30.unclassified#.fastq --classified-out barcode30.classified#.fastq --report barcode30.kraken2.report.txt --output barcode30.kraken2.out --gzip-compressed --report-zero-counts barcode30_porechopped.fastq.gz
Loading database information... done.
53949 sequences (251.01 Mbp) processed in 30.316s (106.8 Kseq/m, 496.79 Mbp/m).
  51565 sequences classified (95.58%)
  2384 sequences unclassified (4.42%)

You should now see some new files generated in your current directory. How many new files have been generated? Do you see that the fastq files are unzipped?

Let’s save some disk space by zipping the two new fastq files created
gzip *.fastq

You can try to read the two kraken report output text files created, but I’m sure that won’t make real sense to you.

Quick look at Kraken 2 report

Run the command to have a quick look at the Kraken 2 report.

head -n 20 barcode30.kraken2.report.txt

The six columns in this file are:

  • Percentage of reads covered by the clade rooted at this taxon
  • Number of reads covered by the clade rooted at this taxon
  • Number of reads assigned directly to this taxon
  • A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-’.
  • NCBI taxonomy ID
  • Scientific name

Just like we did for the fastq-scan output, we will have to convert the Kraken 2 output to a more readable format. Before we do this, we will have to perform one more analysis on our sample. The Kraken 2 often does not provide exhaustive results. To re-estimate taxonomic abundance of the samples we have to do this with another tool called Bracken using the output of Kraken 2.

Bracken

click here to view the publication on Bracken Bracken is a companion program to Kraken 1, KrakenUniq, or Kraken 2 While Kraken classifies reads to multiple levels in the taxonomic tree, Bracken allows estimation of abundance at a single level using those classifications (e.g. Bracken can estimate abundance of species within a sample)

Help

Do this to get the help information for fastq-scan

bracken -h
Usage: bracken -d MY_DB -i INPUT -o OUTPUT -w OUTREPORT -r READ_LEN -l LEVEL -t THRESHOLD
  MY_DB          location of Kraken database
  INPUT          Kraken REPORT file to use for abundance estimation
  OUTPUT         file name for Bracken default output
  OUTREPORT      New Kraken REPORT output file with Bracken read estimates
  READ_LEN       read length to get all classifications for (default: 100)
  LEVEL          level to estimate abundance at [options: D,P,C,O,F,G,S,S1,etc] (default: S)
  THRESHOLD      number of reads required PRIOR to abundance estimation to perform reestimation (default: 0)
Usage

The general synthax is:

bracken -d ${KRAKEN_DB} -t ${THREADS} -k ${KMER_LEN} -l ${READ_LEN} -i {INPUT} -o {OUTPUT}

NB. Bracken relies on the Kraken 2 database, so we will specify the same directory as before and also use the output report of Kraken 2 as our input.

You can now go ahead and perform Bracken with the below command.

Note

Run Bracken with the following command:

bracken -l S -t 10 -d ../database/minikraken2_v1_8GB -i barcode30.kraken2.report.txt -o barcode30_bracken_S.tsv
>> Checking for Valid Options...
 >> Running Bracken
      >> python src/est_abundance.py -i barcode30.kraken2.report.txt -o barcode30_bracken_S.tsv -k ../database/minikraken2_v1_8GB/database100mers.kmer_distrib -l S -t 10
PROGRAM START TIME: 08-09-2023 17:20:28
>> Checking report file: barcode30.kraken2.report.txt
BRACKEN SUMMARY (Kraken report: barcode30.kraken2.report.txt)
    >>> Threshold: 10
    >>> Number of species in sample: 11780
      >> Number of species with reads > threshold: 10
      >> Number of species with reads < threshold: 11770
    >>> Total reads in sample: 53949
      >> Total reads kept at species level (reads > threshold): 16065
      >> Total reads discarded (species reads < threshold): 273
      >> Reads distributed: 35185
      >> Reads not distributed (eg. no species above threshold): 42
      >> Unclassified reads: 2384
BRACKEN OUTPUT PRODUCED: barcode30_bracken_S.tsv
PROGRAM END TIME: 08-09-2023 17:20:28
  Bracken complete.

You should now see two new files generated in your current directory. You can take a look at them as they are small files. You can already guess what specie we are working with from the barcode30_bracken_S.tsv file.

Producing a more friendly species composition file

Imagine you have over 100 samples and have produced Bracken output files for each file using a simple bash script. Ho do you check each individual file to see what species abundance there is? This will be tedious right? How can we make things easier for us? For instance have them all in one .tsv file to examine at a go. Your guess is as good as mine. …Let’s write a script for that.

To do this, we can simply parse all the Kraken 2 and Bracken results into a more friendly .tsv file that summarises all the output into one file. We will achieve this with a simple python script.

Parse Kraken 2 and Bracken results files into tsv format

We will run a simple python script to achieve this purpose. In your working directory, you will find a file named kraken_parser.py. This is a simple python script that will do the job for us.

We can go ahead and execute that python script by running the command below:

python kraken_parser.py

You should now see another file generated in your directory called Bracken_species_composition.tsv.

Now let’s go ahead and have a look at the tsv file with the command below. Remember, we have run analysis on only one sample and so will expect only one line of results.

cat Bracken_species_composition.tsv
name    Escherichia coli    Salmonella enterica other
barcode30_bracken   59.43420856332985   34.6785021632105    5.887289273459658

Alternatively, you can open the tsv file with any appropriate GUI software (excel or libreoffice)

Now, what if we want to also see all the QC statistics in one go provided we performed the analysis for multiple samples (or in our case even for just one sample).

Thankfully, bioinformaticians are not sleeping, there is one final tool (MultiQC) we will discuss here which puts all that analysis we have performed into a single report that is viewable on a web browser, yes an .html file.

MultiQC

click here to view the publication on MultiQC or check out their website

MultiQC is a tool to create a single report with interactive plots for multiple bioinformatics analyses across many samples.

Reports are generated by scanning given directories for recognised log files. These are parsed and a single HTML report is generated summarising the statistics for all logs found. MultiQC reports can describe multiple analysis steps and large numbers of samples within a single plot, and multiple analysis tools making it ideal for routine fast quality control.

Help

Do this to get the help information for multiqc

multiqc -h

Usage

Once installed, you can use MultiQC by navigating to your analysis directory (or a parent directory) and running the tool.

The general synthax is:

multiqc .

Yes, that’s it! MultiQC will scan the specified directory (. is the current dir) and produce a report detailing whatever it finds.

You can now go ahead and perform the MultiQC with the below command. We will only add the -f flag which overwrites any existing MultiQC reports.

Note

Perform the analysis with this command:

multiqc -f .

The report is created in multiqc_report.html by default. Tab-delimited data files are also created in a directory called multiqc_data/, containing extra information. These can be easily inspected using Excel.

You should now see two new files generated in your current directory.

Let’s have a quick look the report by simply opening it in a web browser.

We should now know what each plot represent. If you don’t at this point, then you may want to start from the beginning of this document.

2.1.6 QC bash script: Putting it all together

Now let’s try this out! We will generate QC stats for all five Nanopore reads. We will perform all the analysis above (except fastp) in one run while generating all the various files at each stage within the pipeline in tandem.

Running multiple samples with a simple bash script

All the fastq files needed for this analysis are located in the current working QC directory. You can have a look by simply running the below command to check out all the reads present in the current directory.

check out all the reads present in the current directory
ls *.fastq.gz 
barcode30.fastq.gz  barcode33.fastq.gz barcode34.fastq.gz barcode36.fastq.gz barcode40.fastq.gz

Let’s have a look at the QC bash script we are going to run:

cat qcloop_ont.sh

The script contains several commands, some are combined together using pipes. (UNIX pipes is a very powerful and elegant concept which allows us to feed the output of one command into the next command and avoid writing intermediate files. If you are not comfortable with UNIX or how scripts look like and how they work, consider having a go at the UNIX tutorial or more specifically bonus shell script.

Now run the script to create the QC stats (this may take a while):

bash qcloop.sh
###########################################
#### loop bash script for running QC ######
###########################################
#               #
#  ___   ____   #
# / _ \ / ___|  #
# | | | | |     #
# | |_| | |___  #
# \__\_\ \____| #
#               #
#               #
############################################
fastq-scan_parser.py exists and will run in a second.
 <<<<<<<running fastq-scan_parser>>>>>>>
Traceback (most recent call last):
...

The script will produce all the intermediate and final files we have encountered in this session.

Perform a simple ls command with the arguments -lhrt and view the most recent files created.

ls -lhrt

Which files do you consider most useful?

Now, let’s have a look at the two most important files here — The multiqc_report.html and the Bracken_species_composition.tsv.

Exercise 2.1.2: MultiQC and Species Composition Report Interpretation

Have a quick look at both MultiQC report and Bracken species composition files and attempt these questions

  1. How many genomes were analysed?
  2. What is the average read length of each read?
  3. What is the average GC content of each genome?
  4. Will you pass all the reads as being good reads? Given a phred quality score threshold of 20.
  5. Identify the reads that failed QC and report what could be wrong with the sequences.
  6. What are the most abundant species identified in each of the genomes?
  7. Are there any suspected contaminants in any of the genomes?

If you are looking for a solution, I am sorry, no solution.

If you are having any challenges just hold on, we will solve this together in class.

If you have successfully gone through the exercise above, congratulations!!! You are a bioinformatics QC expert now.

Now, let’s take particular note of the porechopped genomes that we are most confident in. We will carry out our de novo assembly of long reads from these genomes in our next lesson.

Disk Usage II — Cleaning up after analysis

Now that we are done investigating our genomic sequences, let’s pause again and check the space of our current working directory.

You can do this with the disk usage du command

du -h

How much disk space have you used since the start of the analysis? I’m sure it’s more than 4G. That’s fine. Let’s do some cleaning up. We will remove rm all files that we may not need for further analysis. Most of these files are intermediate files and can always be reproduced if we need to.

You will find this cleaning process very useful in the next chapter where we will generate tonnes and tonnes of data.

remove all porechopped fastq files if not needed

rm *porechopped*

remove all classified and unclassified fastq files if not needed

rm *classified*

remove all Kraken 2 output if not needed

rm *kraken2.out

Exercise 2.1.4: Advance Exercise

If you have gone through this material in good time and still want to try your hands on some more advanced stuff, you are free to do this exercise.

Try modifying the qcloop_ont script to be able to carry out analysis on a specific pair of fastq reads given that you have more than one pair of fastq in your working directory (assuming you want to perform QC again on only one of the sequences that passed our quality checks).

You can call the new script `qcloop_ont_modified``

You should be able to call out the script to carry out the analysis on the specified fastq files with the below command

bash qcloop_ont_modified barcode30.fastq.gz

2.1.7 deactivate qc_ont environment

Now that we are done with all our analysis, let’s deactivate the qc_ont environment:

mamba deactivate

2.1.8 Credit

Information on this page has been adapted and modified from the following source(s):

  • https://github.com/sanger-pathogens/QC-training

  • https://github.com/wdecoster/NanoPlot

  • https://github.com/rpetit3/fastq-scan

  • https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

  • https://github.com/OpenGene/fastp

  • https://github.com/rrwick/Porechop

  • https://github.com/DerrickWood/kraken2

  • https://github.com/jenniferlu717/Bracken

  • https://github.com/ewels/MultiQC