Application Note

Assembly and annotation of plastid genomes using QIAGEN CLC Genomics Workbench

Introduction

Chloroplast sequences are highly conserved, making

them very useful for taxonomic classification of plants.

Next-generation sequencing (NGS) allows cost-efficient

sequencing of whole chloroplast genomes. The archetypical

chloroplast genome follows a distinct architecture:

a large single copy region, a small single copy region,

and a pair of inverted repeats (IRs). The sequence of the

IRs is challenging to assemble from short reads alone,

so a strategy of combined long- and short-read hybrid

sequencing and assembly is usually preferred to obtain

the complete chloroplast genome sequence. Once the

genome sequence has been obtained, the next challenge

is annotation of the genes in the sequence.

In this application note, we describe how to assemble and

annotate plastid genomes using QIAGEN CLC Genomics

Workbench. We describe three distinct workflows. In the

first workflow, the chloroplast sequence is assembled

de novo using chloroplast reads extracted from whole

genome sequencing (WGS) data by first mapping the

reads to a related plastid reference. In the second workflow,

the chloroplast sequence is assembled from a sub-sampled

WGS dataset. This second workflow is suitable when no

related plastid genome is available. The third workflow

uses very long but low-fidelity reads for de novo assembly,

followed by contig polishing using short reads. This

workflow is used for species where long IRs are expected

to be present in the chloroplast genome.

We cover all the main steps required for efficient

chloroplast assembly:

1. Extracting plastid reads from the WGS data

2. Reducing sequencing datasets prior to de novo

assembly

3. De novo assembly of plastid genomes using long reads

4. Contig polishing using short reads

5. Transferring annotation from plastids of related plants

6. Validating and visualizing the newly assembled

chloroplasts

2 Assembly and annotation of plastid genomes using QIAGEN CLC Genomics Workbench 01/2022

Figure 1. QIAGEN CLC Genomics Workbench-assembled and annotated alfalfa chloroplast (left) and alfalfa chloroplast from a different cultivar,

imported from GenBank (right).

An important consideration when assembling plastid

genomes is the average length of reads used for

the assembly. Many plastids contain long IRs that

interfere with the accuracy and efficiency of assembly.

The plastid genomes themselves are usually

110–200 kb long, and the IRs can be 10–30 kb

long. Because of these repeats, we need reads that are

long enough to be unambiguously placed in the contig.

Not all plants have long IRs, and the first and second

workflows are suitable in these cases. We use alfalfa

for the first two workflows. “Shorter” long NGS reads

suffice for plastid assembly in such species. For the third

workflow, we use a rice dataset. Rice plastid genomes

contain IRs that are approximately 20 kb long. The

long-read dataset used here for this workflow includes

reads that are up to 84 kb long.
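As a rough illustration of why read length matters, the sketch below estimates what fraction of a read set is long enough to cross an IR. It is not part of the Workbench. The function name, the example read lengths, and the 1 kb of unique flanking sequence required on each side of the repeat are assumptions made for this example.

```python
def spanning_fraction(read_lengths, ir_length, flank=1_000):
    """Fraction of reads long enough to cover an inverted repeat
    plus `flank` bases of unique sequence on each side.

    `flank` is an assumed safety margin, not a value from the note.
    """
    if not read_lengths:
        return 0.0
    needed = ir_length + 2 * flank
    spanning = sum(1 for length in read_lengths if length >= needed)
    return spanning / len(read_lengths)


# Hypothetical read lengths (in bases), for illustration only.
example_lengths = [5_000, 15_000, 25_000, 84_000]

# Rice IRs are approximately 20 kb long according to the note.
print(spanning_fraction(example_lengths, ir_length=20_000))
```

A read only helps resolve the repeat if it reaches unique sequence on both sides, which is why the check adds a flank to the IR length rather than comparing against the IR length alone.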

Results

All three workflows produced fully assembled and

annotated plastid genomes. Figure 1 shows the

visualizations produced by QIAGEN CLC Genomics

Workbench for two different alfalfa chloroplasts.

The Workbench-assembled genome contains the

same number of annotations as the GenBank alfalfa

plastid reference. However, the Workbench-assembled

genome differs in length by about 300 nt compared

to the reference genome because it originates from a

different alfalfa cultivar.

Data

The data used for the assemblies of alfalfa chloroplast

(Workflows 1 and 2) are from Chen et al., 2020. The

alfalfa reads are available in the Sequence Read Archive

(SRA): a long-read PacBio dataset (SRR11285798)

and an Illumina dataset (SRR9026574). The data

used for the assembly of rice chloroplast (Workflow 3)

are from a study by researchers at the University of

Arizona: a long-read dataset (SRR10302209) and a

short-read dataset (SRR10302299).

The alfalfa plastid reference sequence used in

Workflow 1 has the GenBank accession NC_042841.1.

The rice plastid reference sequence used in Workflow 3

has GenBank accession NC_008155.1.

Workflow 1. Chloroplast assembly using a plastid reference from a related species

In this workflow, shown in Figure 2, we collected the

relevant plastid reads from the WGS long-read data

by mapping them to a related plastid reference. Then,

the reads were sub-sampled (to reduce the dataset)

and assembled de novo. The annotations were

transferred from a related plastid genome using the

“Create Whole Genome Alignment” tool. The short reads

were also mapped to the reference plastid

genome and sub-sampled. These short reads were only

used to evaluate the quality of the de novo assembled

plastid genome contig.

Figure 2. Chloroplast assembly using a related reference genome to extract the plastid reads from WGS datasets. The flowchart shows the following steps:

Inputs: import a related plastid genome with annotation; import long and short NGS reads.

Long reads: map long reads to a related plastid; extract and sample mapped reads; de novo assemble the long reads; align the de novo assembled plastid contig with a related plastid, shift the start position and transfer the annotation.

Short reads: map short reads to a related plastid; extract and sample mapped reads; map short reads to de novo assembled contigs; call variants on mapped reads to confirm the contig quality.

Data preparation

The first step in all three workflows was to import a

related plastid genome with annotation from GenBank.

We imported it directly into QIAGEN CLC Genomics Workbench

using the “Download from GenBank” importer. Both

the long and short NGS reads were imported using

the Workbench “Short Read Archive” importer. The

long reads were mapped using “Map Long Reads

to Reference” under the “Long Read Support” folder

(Figure 3). The short reads were mapped using

“Map Reads to Reference” under the “Resequencing

Analysis” folder (Figure 4).

The plastid read-mapping coverage was excessive

for both datasets: 28,000x coverage for long reads

and 15,000x coverage for short reads. Both sets

of mapped reads were extracted using the “Extract

Reads” tool under the “Utility Tools” folder. For efficient

assembly and further data analysis, the reads were

sampled using the “Sample Reads” utility tool (Figure 5)

to significantly reduce the amount of data. For

the PacBio dataset, we down-sampled to 5,000

reads representing approximately 500x coverage

of the plastid genome, and, for the Illumina data,

we down-sampled to 100,000 reads representing

approximately 120x coverage.
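The relationship between read count and coverage quoted above can be checked with simple arithmetic: average coverage is the total number of sequenced bases divided by the genome length. The sketch below is only an illustration. The mean read lengths are assumptions (the note gives a 10 to 17 kb range for the PacBio reads and does not state the alfalfa Illumina read length), and the genome length is the contig size reported for this workflow.

```python
import math


def mean_coverage(n_reads, mean_read_len, genome_len):
    """Expected average depth: total sequenced bases / genome length."""
    return n_reads * mean_read_len / genome_len


def reads_for_coverage(target_cov, mean_read_len, genome_len):
    """Number of reads needed to reach a target average depth (rounded up)."""
    return math.ceil(target_cov * genome_len / mean_read_len)


GENOME_LEN = 125_637        # contig length reported for Workflow 1

# Assumed mean read lengths, not stated in the note:
PACBIO_MEAN_LEN = 12_500    # assumption: a value inside the 10-17 kb range
ILLUMINA_READ_LEN = 150     # assumption: a common Illumina read length

print(round(mean_coverage(5_000, PACBIO_MEAN_LEN, GENOME_LEN)))
print(round(mean_coverage(100_000, ILLUMINA_READ_LEN, GENOME_LEN)))
print(reads_for_coverage(500, PACBIO_MEAN_LEN, GENOME_LEN))
```

Under these assumed read lengths, the estimates land close to the approximately 500x and 120x figures given for the down-sampled datasets.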

De novo assembly

The alfalfa reads used in this workflow are high-fidelity

PacBio reads. The reads are between 10 and 17 kb

in length. This length is sufficient to easily assemble

chloroplasts without long IRs. With a large quantity of

high-fidelity reads, we found the best word size was

28 rather than the default of 13. Running the “De

Novo Assemble Long Reads” tool (Figure 6) with this

configuration produced a single circular contig of

125,637 nucleotides (Figure 7).

Figure 3. The long read mapping tool.

Figure 4. The short read mapping tool.

Figure 5. Tools used to extract reads from mappings and to sample reads.

Figure 6. The “Long Read Support” tools with the “De Novo Assemble Long Reads” tool selected.

Figure 7. The single circular contig of 125,637 nucleotides produced by the “De Novo Assemble Long Reads” tool.

Figure 8. The “Basic Variant Detection” tool under the “Variant Detection” folder.

Validation of the assembly quality

In the next step, we confirmed that the assembled

contig was free of errors by mapping Illumina reads to

it and calling variants. The short reads were mapped

using the “Map Reads to Reference” tool under the

“Resequencing Analysis” folder (Figure 4). The variants

were called using the “Basic Variant Detection” tool

under the “Variant Detection” folder (Figure 8). No

significant variants were found. The highest variant

frequency found was 19% in a homopolymeric area.

The validation we performed here confirms that the de

novo assembled contig is of a high quality and does

not contain assembly errors.
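The reasoning behind this check is that reads mapped back to their own assembly should agree with it. A genuine error in the contig would appear as a variant supported by most reads, whereas low-frequency calls, such as the 19% call in a homopolymer, are more likely sequencing noise. The sketch below illustrates that filter. The `Variant` record, the 50% cutoff and the example calls are all hypothetical and do not reflect the output format or defaults of the “Basic Variant Detection” tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    position: int      # 1-based position on the contig
    ref: str           # base(s) in the assembled contig
    alt: str           # base(s) supported by the reads
    frequency: float   # percentage of reads supporting the variant (0-100)


def suspect_variants(variants, min_frequency=50.0):
    """Return calls frequent enough to suggest an error in the contig.

    The 50% default is an assumed cutoff chosen for this illustration.
    """
    return [v for v in variants if v.frequency >= min_frequency]


# Hypothetical calls; only the 19% maximum mirrors the note.
calls = [
    Variant(position=10_412, ref="A", alt="-", frequency=19.0),
    Variant(position=57_930, ref="T", alt="C", frequency=6.5),
]

print(suspect_variants(calls))
```

An empty result corresponds to the situation described above: no variant is frequent enough to indicate that the contig itself is wrong.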

Whole genome alignment and transfer of annotations

In this step, we annotated the de novo assembled

contig by aligning it to a related chloroplast reference

genome and transferring its annotations to the new

contig.

The “Create Whole Genome Alignment” tool does

not just align the genomes but also shifts the contig’s

start position relative to a reference genome. It can

also transfer the reference genome annotations to the

newly assembled contig. The tool can be found under

the “Whole Genome Alignment” folder (Figure 9).

These tools are available when the Whole Genome

Alignment plugin is installed.
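Shifting the start position of a circular contig amounts to rotating the sequence so that it begins at the same point as the reference. The sketch below shows the idea with a simple exact-match search for a short anchor sequence taken from the start of the reference. This is an assumption-laden illustration, not the algorithm used by the “Create Whole Genome Alignment” tool, which works from an alignment. The function names and toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def rotate_to_anchor(contig, anchor):
    """Rotate a circular contig so that it starts with `anchor`.

    The anchor may straddle the current start/end junction, so the search
    runs on the contig extended by its own first len(anchor) - 1 bases.
    Both strands are tried, forward strand first.
    """
    if not anchor:
        raise ValueError("anchor must not be empty")
    for candidate in (contig, reverse_complement(contig)):
        extended = candidate + candidate[: len(anchor) - 1]
        start = extended.find(anchor)
        if start != -1:
            return candidate[start:] + candidate[:start]
    raise ValueError("anchor not found on either strand")


# Toy example: the anchor ATGG spans the junction of the circular contig.
print(rotate_to_anchor("GGTACCAT", "ATGG"))
```

The rotated sequence has the same length and content as the input; only the coordinate origin (and possibly the strand) changes, which is what makes annotation coordinates comparable with the reference.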

In the settings dialog, it is important to select “Rearrange

contigs” and “Copy annotations from reference”. The

“Genetic Code” option should be set to “11” for plastid

genomes (Figure 10).

The resulting newly assembled contig contained the

same number of annotations as the GenBank alfalfa

reference genome. The annotations can be displayed

in tabular or graphical views (Figure 11).

Figure 9. The “Create Whole Genome Alignment” tool.

Figure 10. The recommended parameters for transferring plastid annotations.

Workflow 2. Chloroplast assembly without a reference plastid

In this section, we describe a workflow for assembling

plant plastids using the same NGS data but without

extracting the plastid reads from WGS data. Instead,

we went directly to reducing the WGS data by

sampling the reads. The assembly workflow is shown

in Figure 12.

Data preparation: Sampling

The purpose of sampling is to reduce the number of

nuclear genomic reads. Plant genomes usually contain

plastid-related fragments in their nuclear chromosomes.

To prevent nuclear homologs of plastid sequences from

being included in the plastid assembly, we reduced the

dataset size, which reduced the number of both nuclear

and plastid reads. Because plastid-related nuclear reads

are far less abundant than true plastid reads, sampling

left too few of them to be incorporated

into the plastid assembly.

Despite the small genome size of chloroplasts, the

chloroplast-originating reads usually comprise 5–6%

of WGS data from green plant tissues. This high

percentage results from the fact that a green cell can

contain a few hundred chloroplasts, each with its

own genome. In this dataset, we had approximately

28,000x coverage for the plastid genome. We reduced

this excessive coverage not just to suppress nuclear

homologs, but also to prevent erroneous assembly

caused by systematic sequencing errors. These errors

become apparent at excessively high coverage. We

sampled 2% of the long reads (about 110,000 reads) from the

original data set using the “Sample Reads” tool in the

“Utility Tools” folder (Figure 5).
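The effect of random sampling on coverage is proportional: keeping a given fraction of the reads keeps, on average, the same fraction of the coverage. The sketch below illustrates this with a seeded random sample. It is a conceptual stand-in, not the implementation of the “Sample Reads” tool, and the function names and the toy read list are assumptions.

```python
import random


def sample_fraction(reads, fraction, seed=0):
    """Randomly keep about `fraction` of the reads, without replacement.

    A fixed seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    n_keep = round(len(reads) * fraction)
    return rng.sample(reads, n_keep)


def expected_coverage(original_coverage, fraction):
    """Average coverage expected after sampling a fraction of the reads."""
    return original_coverage * fraction


# Toy read identifiers, for illustration only.
toy_reads = [f"read_{i}" for i in range(1_000)]
subset = sample_fraction(toy_reads, 0.02)

print(len(subset))
print(expected_coverage(28_000, 0.02))
```

Applied to the figures in the text, sampling 2% of a dataset with approximately 28,000x plastid coverage is expected to leave a few hundred-fold coverage of the plastid genome, which is still ample for assembly.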

Figure 11. Annotation displayed in graphical (left) and tabular (right) views.

De novo assembly

The “De Novo Assemble Long Reads” tool (Figure 6)

with the default parameters assembled these

approximately 110,000 long reads into over 2,000

contigs (Figure 13).

The second longest contig in the table was circular, as

is expected for a chloroplast contig. This contig also

had the same number of nucleotides as the previously

assembled contig from the first workflow. We expected

the correctly assembled plastid contig to have

disproportionately high coverage, as there are hundreds

of copies of the plastid genome in each cell. To confirm

that the candidate contig had the highest coverage, we

mapped the long reads back to all contigs. This was

done using the “Map Long Reads to Reference” tool

(Figure 3). As expected, the candidate contig had an

average coverage of over 600x, which is significantly

higher than that of any other contig in the mapping table;

the other contigs presumably originate from the nuclear

genome (Figure 14).
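The selection logic described here (a circular contig of plausible plastid size with the highest mapping coverage) can be sketched as a small filter over a contig table. The `Contig` record and all example values except the 125,637-nucleotide length are hypothetical, and the 110 to 200 kb size window is taken from the typical plastid genome sizes mentioned earlier in this note.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int           # nucleotides
    circular: bool
    mean_coverage: float  # average mapping depth


def pick_plastid_candidate(contigs, min_len=110_000, max_len=200_000):
    """Return the circular contig of plastid-like size with the highest
    mean coverage, or None if no contig qualifies."""
    candidates = [
        c for c in contigs
        if c.circular and min_len <= c.length <= max_len
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.mean_coverage)


# Hypothetical contig table, for illustration only.
table = [
    Contig("contig_1", 310_000, False, 12.0),
    Contig("contig_2", 125_637, True, 640.0),
    Contig("contig_3", 42_000, False, 9.0),
]

best = pick_plastid_candidate(table)
print(best.name if best else "no candidate")
```

Combining circularity, size and coverage makes the choice less dependent on any single criterion, since nuclear contigs can occasionally be long or well covered but rarely satisfy all three conditions.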

Figure 12. Chloroplast assembly using sampled reads from WGS datasets. The flowchart shows the following steps:

Input: import long and short NGS reads.

Long reads: sample long reads; de novo assemble long reads and map long reads back to contigs; align the de novo assembled plastid contig with a related plastid, shift the start position and transfer the annotation.

Short reads: sample short reads; map short reads to de novo assembled contigs; call variants on mapped reads to confirm the contig quality.

Figure 13. Contigs assembled from the reduced WGS dataset.

Figure 14. The mapping coverage information for the de novo assembled contigs. The longest circular contig is selected.

Validating the assembly quality

The assembly quality validation was performed as

described for Workflow 1, by mapping a subset of

short reads and calling the variants. No variants were

found, which confirmed that the contig is of high quality

and does not contain assembly errors.

Annotating the longest circular contig

The annotation was performed as described for

Workflow 1. We used the “Create Whole Genome

Alignment” tool (Figure 9) to transfer the annotation

from a related plastid genome. The tool produced the

same annotations as shown in Figure 11 for the plastid

assembled with Workflow 1.

Two other options for annotating newly assembled

plastid genomes are to search for coding sequences

using the “Find Open Reading Frames” tool and then

annotate these with either the “Annotate with DIAMOND”

or the “Annotate with BLAST” tool in the Microbial

Genomics Module (Figure 15).

Figure 15. “Annotate with DIAMOND” and “Annotate with BLAST” tools in the Microbial Genomics Module.

Figure 16. Chloroplast assembly using long low-fidelity reads. The flowchart shows the following steps:

Inputs: import a related plastid genome with annotation; import long and short NGS reads.

Long reads: map long reads to a related plastid; extract and sample mapped reads; de novo assemble the long reads and polish the contigs with short reads; align the de novo assembled plastid contig with a related plastid, shift the start position and transfer the annotation.

Short reads: map short reads to a related plastid; extract and sample mapped reads; map short reads to de novo assembled contigs; call variants on mapped reads to confirm the contig quality.

Workflow 3. Chloroplast assembly from datasets containing long inverted repeats

Finally, we describe a workflow for assembling plant

plastid genomes with long IRs. For correct assembly

of these plastid genomes, a portion of the long reads

should be longer than the length of the repeats. This

workflow (Figure 16) is similar to Workflow 1, but with

an additional step for polishing the contigs created

using long but imperfect reads. The short Illumina reads

are used in the polishing step.

Data preparation: selecting long chloroplast reads for assembly

The long reads used here were low-fidelity PacBio

reads up to 84 kb in length, long enough to

span the 20 kb rice plastid IRs. There were over

300,000 of these long PacBio reads. They were

mapped to a related plastid (NC_008155.1). This

resulted in about 20,000 mapped reads, which were

then extracted from the mapping (see Figures 3 and 5

for the tools used). In this dataset of long reads, there

were erroneously short reads, which we excluded from

further processing by removing reads under 2 kb in

length. From the resulting set of approximately 19,000

reads, we used 500 randomly sampled reads for de

novo assembly (Figure 17).
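The selection steps above (discard reads under 2 kb, then draw 500 reads at random) can be sketched as follows. This only illustrates the logic and is not how the Workbench tools are implemented. The function name, the dictionary of read lengths and the seed are assumptions made for the example.

```python
import random


def select_reads_for_assembly(read_lengths, min_len=2_000, n_sample=500, seed=42):
    """Drop reads shorter than `min_len`, then randomly sample `n_sample`
    of the remaining read names (all of them if fewer remain).

    `read_lengths` maps read name -> read length in bases.
    """
    kept = sorted(name for name, length in read_lengths.items() if length >= min_len)
    if len(kept) <= n_sample:
        return kept
    rng = random.Random(seed)  # fixed seed for a reproducible selection
    return sorted(rng.sample(kept, n_sample))


# Hypothetical set of 20,000 mapped reads with synthetic lengths.
mapped = {f"read_{i}": 500 + i for i in range(20_000)}

selected = select_reads_for_assembly(mapped)
print(len(selected))
```

Filtering before sampling matters: if the order were reversed, some of the 500 sampled reads could be removed by the length filter, leaving fewer reads than intended for the assembly.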

Figure 17. Selection of long reads for de novo assembly. The figure shows four successive read sets: the original dataset of long reads; the reads mapped to the related plastid; the mapped reads >2 kb in length; and the randomly sampled reads used for de novo assembly.

De novo assembly and polishing with high-quality short reads

We ran the “De Novo Assemble Long Reads” tool

(Figure 6) on the 500 long reads selected in the previous

step, with a word size of 18. This produced a

single circular contig of 134,674 nucleotides (Figure 18).

For the polishing step, two million short reads were

sampled from the WGS Illumina dataset. This dataset

contained paired-end Illumina reads approximately

151 nucleotides long. The sampled reads were then mapped to the

related plastid (NC_008155.1). The mapped reads

(approximately 100,000) were extracted with the

“Extract Reads” tool (Figure 5) and used to polish the

de novo assembled contig. Polishing was performed

using the “Polish with Reads” tool, found under the “Long

Read Support” folder (Figure 6). After polishing,

the contig was about 200 nucleotides shorter

(Figure 19).

Rice chloroplast assembly validation and annotation

The assembly was validated

as described for Workflow 1 in the “Validation of

the assembly quality” section. The annotation was

transferred as described in the “Whole genome

alignment and transfer of annotations” section using

the “Create Whole Genome Alignment” tool (Figure 9).

Figure 18. The single circular contig produced by the long-read assembler.

Figure 19. The single circular contig after polishing.

Summary

This application note describes three different

workflows for plastid assembly using QIAGEN

CLC Genomics Workbench. The choice of tools and

workflows depends on the structure of the plastid in the

species of interest, as well as the type of sequencing

data. Assembling plastids with long IRs requires reads

that are long enough to span the repeats. Such long

reads are usually of low fidelity and the assemblies

require polishing. Assembling plastids without long

IRs can be achieved using “shorter” high-fidelity

long reads and does not require contig polishing.

Another step we emphasize is the reduction of NGS

datasets before assembling plastids. We describe

different de novo assembly workflows with and without

preselection of chloroplast reads from whole genome

sequencing data.

1126623 01/2022

QIAGEN CLC Genomics products are intended for molecular biology applications. These products are not intended for

the diagnosis, prevention or treatment of a disease.

For up-to-date licensing information and product-specific disclaimers, see the respective QIAGEN OmicSoft Land product

website. Further information can be requested from [email protected] or by contacting your local account

manager at [email protected].

Reference

Chen, H., Zeng, Y., Yang, Y. et al. (2020) Allele-aware chromosome-level genome assembly and efficient transgene-free genome editing for the autotetraploid cultivated alfalfa. Nature Communications 11, 2494. https://doi.org/10.1038/s41467-020-16338-x

Trademarks: QIAGEN®, Sample to Insight® (QIAGEN Group); Illumina® (Illumina, Inc.); PacBio® (Pacific Biosciences of California, Inc.). Registered names, trademarks, etc. used in this document, even when not specifically marked as such, may still be protected by law.

Ordering www.qiagen.com/bioinformatics Technical Support digitalinsights.qiagen.com/support Website digitalinsights.qiagen.com

Learn more and request a trial at digitalinsights.qiagen.com/GXWB.