17
Nov

sequencing.png

Take home message: use capture arrays to enrich the protein-coding region and then parallel sequence by Illumina.


Online methods from PMID: 19684571

Genomic DNA samples

Targeted capture was performed on genomic DNA from eight HapMap individuals (four Yoruba (NA18507, NA18517, NA19129 and NA19240), two East Asians (NA18555 and NA18956) and two European-Americans (NA12156 and NA12878)) and four European-American individuals affected by Freeman–Sheldon syndrome (FSS10066, FSS10208, FSS22194 and FSS24895). Genomic DNA for HapMap individuals was obtained from Coriell Cell Repositories. Genomic DNA for FSS individuals was obtained by M.B.

Oligonucleotides and adaptors

All oligonucleotides were synthesized by Integrated DNA Technologies and resuspended in nuclease-free water to a stock concentration of 100 M. Sequences are shown in Supplementary Table 5. Double-stranded library adaptors SLXA_1 and SLXA_2 were prepared to a final concentration of 50 M by incubating equimolar amounts of SLXA_1_HI and SLXA_1_LO together and SLXA_2_HI and SLXA_2_LO together at 95 °C for 3 min and then leaving the adaptors to cool to room temperature in the heat block.

Shotgun library construction

Shotgun libraries were generated from 10 g of genomic DNA (gDNA) using protocols modified from the standard Illumina protocol12. Each library provided sufficient material for hybridization to two microarrays. For each sample, gDNA in 300 ul 1x Tris-EDTA was first sonicated for 30 min using a Bioruptor (Diagenode) set at high, then end-repaired for 45 min in a 100 ul reaction volume using 1x End-It Buffer, 10 ul dNTP mix and 10 ul ATP as supplied in the End-It DNA End-Repair Kit (Epicentre). The fragments were then A-tailed for 20 min at 70 °C in a 100 ul reaction volume with 1x PCR buffer (Applied Biosystems), 1.5 mM MgCl2, 1 mM dATP and 5 U AmpliTaq DNA polymerase (Applied Biosystems). Next, library adaptors SLXA_1 and SLXA_2 were ligated to the A-tailed sample in a 90 ul reaction volume with 1x Quick Ligation Buffer (New England Biolabs) with 5 lx Quick T4 DNA Ligase (New England Biolabs) and each adaptor in 10 molar excess of sample. Samples were purified on QIAquick columns (Qiagen) after each of these four steps and DNA concentration determined on a Nanodrop-1000 (Thermo Scientific) when necessary.

Each sample was subsequently size-selected for fragments of size 150–250 bp using gel electrophoresis on a 6% TBE-polyacrylamide gel (Invitrogen). A gel slice containing the fragments of interest was then excised and transferred to a siliconized 0.5 ml microcentrifuge tube (Ambion) with a 20 G needle-punched hole in the bottom. This tube was placed in a 1.5 ml siliconized microcentrifuge tube (Ambion), and centrifuged in a tabletop microcentrifuge at 16,110g for 5 min to create a gel slurry that was then resuspended in 200 ul 1x Tris-EDTA and incubated at 65 °C for 2 h, with periodic vortexing. This allowed for passive elution of DNA, and the aqueous phase was then separated from gel fragments by centrifugation through 0.2 m NanoSep columns (Pall Life Sciences) and the DNA recovered using a standard ethanol precipitation.

Recovered DNA was resuspended in elution buffer (EB; 10 mM Tris-Cl, pH 8.5, Qiagen) and the entire volume used in a 1 ml bulk PCR reaction volume with 1x iProof High-Fidelity Master Mix (Bio-Rad) and 0.5 M each of primers SLXA_FOR_AMP and SLXA_REV_AMP with the conditions: 98 °C for 30 s, 20 cycles at 98 °C for 30 s, 65 °C for 10 s and 72 °C for 30 s, and finally 72 °C for 5 min. PCR products were purified across four QIAquick columns (Qiagen) and all the eluants pooled.

Design of exome capture arrays

We targeted all well-annotated protein-coding regions as defined by the CCDS (version 20080902). Coordinates were extracted from entries with 'public' status, and regions with overlapping coordinates were merged. This resulted in a target with 164,007 discontiguous regions summing to 27,931,548 bp. By comparison, coding sequence defined by all of RefSeq (NCBI 36.3) comprises 31.9 Mb (14% larger). Hybridization probes against the target were designed primarily such that they were evenly spaced across each region. Probes were also constrained (1) to be relatively unique, such that the average occurrence of each 15-mer in the probe sequence is less than 1008, (2) to be between 20 and 60 bases in length, with preference for longer probes, and (3) to have a calculated melting temperature (Tm) 69 °C, with preference for higher Tm values. Tm was calculated by 64.9 + 41 (number of G + Cs - 16.4)/length of probe.

Two arrays (Agilent, 244K format) were designed and used per individual. The first array was common to all individuals, and contained 241,071 probes designed mainly against the subset of the target that was also found in a previous version of the CCDS (CCDS20070227). For most exomes, the second array was custom-designed specifically against target regions that had not been adequately represented after capture on the first array and subsequent sequencing. For two individuals (FSS10066, FSS10208), the matching was to a different individual's first-array data. However, this did not seem to have a significant effect on performance, probably because features capturing poorly on the first array largely did so consistently. Additionally, all of the second arrays also targeted sequences found in CCDS20080902 that were not in CCDS20070227 and hence not targeted by the first array. A subset of arrays used lacked control grids.

Targeted capture by hybridization to DNA microarrays

Hybridizations to Agilent 244K arrays were performed following manufacturer's instructions with modifications. For each enrichment, a 520 l hybridization solution containing 20 g of the bulk-amplified genomic DNA library, 1 aCGH hybridization buffer (Agilent), 1 blocking agent (Agilent), 50 g human CotI DNA (Invitrogen) and 0.92 nmol each of the blocking oligonucleotides SLXA_FOR_AMP, SLXA_REV_AMP, SLXA_FOR_AMP_rev and SLXA_REV_AMP_rev was incubated at 95 °C for 3 min and then at 37 °C for at least 30 min. The hybridization solution was then loaded and the hybridization chamber assembled following the manufacturer's instructions. Incubation was done at 65 °C for at least 66 h with rotation at 20 r.p.m. in a hybridization oven (Agilent).

After hybridization, the slide-gasket sandwich was removed from the chamber and placed in a 50 ml conical tube filled with aCGH Wash Buffer 1 (Agilent). The slide was separated from the gasket while in the buffer and then washed, first with fresh aCGH Wash Buffer 1 at room temperature for 10 min on an orbital shaker (VWR) set on low speed, and then in pre-warmed aCGH Wash Buffer 2 (Agilent) at 37 °C for 5 min. Both washes were also done in 50 ml conical tubes.

A Secure-Seal ( SA2260, Grace Bio Labs) was then affixed firmly over the active area of the washed slide and heated briefly according to the manufacturer's instructions. One port was sealed with a seal tab and the seal chamber completely filled with approximately 1 ml of hot EB (95 °C). The other port was sealed and the slide incubated at 95 °C on a heat block. After 5 min, one port was unsealed and the solution recovered. DNA was purified from the solution using a standard ethanol precipitation.

Precipitated DNA was resuspended in EB and the entire volume used in a 50 l PCR volume comprising of 1x iTaq SYBR Green Supermix with ROX (Bio-Rad) and 0.2 M each of primers SLXA_FOR_AMP and SLXA_REV_AMP. Thermal cycling was done in a MiniOpticon Real-time PCR system (Bio-Rad) with the following programme: 95 °C for 5 min, then 30 cycles of 95 °C for 30 s, 55 °C for 2 min and 72 °C for 2 min. Each sample was monitored and extracted from the PCR machine when fluorescence began to plateau. Samples were then purified on a QIAquick column (Qiagen) and sequenced.

Sequencing

All sequencing of post-enrichment shotgun libraries was carried out on an Illumina Genome Analyzer II as single-end 76 bp reads, following the manufacturer's protocols and using the standard sequencing primer. Image analysis and base calling was performed by the Genome Analyser Pipeline version 1.0 or 1.3 with default parameters, but with no pre-filtering of reads by quality. Quality values were recalibrated by alignment to the reference human genome with the Eland module.

Read mapping

The reference human genome used in these analyses was UCSC assembly hg18 (NCBI build 36.1), including unordered sequence (chrN_random.fa) but not including alternate haplotypes. For each lane, reads with calibrated qualities were extracted from the Eland export output. Base qualities were rescaled and reads mapped to the human reference genome using Maq (version 0.7.1) 13. Unmapped reads were dumped using the –u option and subsequently used for indel mapping. Mapped reads that overlapped target regions ('target reads') were used for all other analyses.

Target masking

All possible 76-bp reads that overlapped the aggregate target were simulated, mapped using Maq and consensus called using Maq assemble with parameters –q 1 –r 0.2 –t 0.9. Target coordinates that had read depth <76 (that is, half of the expected depth), reflecting a poor ability to have reads confidently mapped to them (Supplementary Data 1), were removed from consideration for downstream analyses, leaving a 26,553,795 bp target.

Variant calling

All reads with a map score >0 from each individual were merged and filtered for duplicates such that only the read with the highest aggregate base quality at any given start position and orientation was retained. Sequence calls were obtained using Maq assemble with parameters –r 0.2 –t 0.9, and only coordinates with at least 8 coverage and an estimated Phred-like consensus quality value of at least 30 were used for downstream variant analyses.

Comparison of sequence calls to array genotypes, dbSNP and whole genome sequencing

For the eight HapMap individuals, sequence calls were compared to array-based genotyping data (Illumina Human1M-Duo) provided by Illumina. We excluded from consideration genotyping assays where all eight individuals were called by the arrays as homozygous non-reference as well as the MHC locus at chromosome 6:32500001–33300000, as both sets are likely to be error-enriched in the genotyping data. We downloaded dbSNP(v129) from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts on 13 May 2008. Approximately 14.2 million non-redundant coordinates were defined by this file set. For comparison of NA18507 cSNPs to whole genome data, variant lists were obtained from Illumina 12.

Identification of coding indels

Reads for which Maq was unsuccessful in identifying an ungapped alignment were converted to fasta format and mapped to the human reference genome with cross_match (v1.080812, http://www.phrap.org), using parameters –gap_ext -1 –bandwidth 10 -minmatch 20 –maxmatch 24. Output options –tags –discrep_lists –alignments –score_hist were also set. Alignments with an indel were then filtered for those that: (1) had a score at least 40 more than the next best alignment; (2) mapped at least 75 bases of the read; (3) had no substitutions in addition to the indel; and (4) overlapped a target region. Reads from filtered alignments that mapped to the negative strand were then reverse-complemented and, together with the rest of the filtered reads, re-mapped with cross_match using the same parameters. This was to reduce ambiguity in called indel positions due to different read orientations. After the second mapping, alignments were re-filtered using the same criteria (1) to (4). For each sample, a putative indel event was called if at least two filtered reads covered the same event. A fasta file containing the sequences of all called events 75 bp, as well as the reference sequence at the same positions, was then generated for each individual. All the reads from each individual were then mapped to its 'indel reference' with Maq using default parameters. Reads that mapped multiple times (map score 0) or had redundant start sites were removed, after which the number of reads mapping to either the reference or the non-reference allele was counted for each individual and indel. An indel was called if there were at least eight non-reference allele reads making up at least 30% of all reads at that genomic position. Indels were called as heterozygous if non-reference alleles were 30–70% of reads at that position, and homozygous non-reference if >70%.

Variant annotation

For cSNP annotation, we constructed a local server that integrates data from NCBI (including dbSNP and Consensus CDS files) and from UCSC Genome Bioinformatics. We also generated PolyPhen predictions 24 for all cSNPs identified here, using the PolyPhen Grid Gateway and Perl scripts supplied by I. Adzhubey. The server reads files with SNP locations and alleles, and produces annotation files available for download. Annotation includes dbSNP rs IDs, overlapping-gene accession numbers, SNP function (for example, whether coding missense), conservation scores, HapMap minor-allele frequencies and various protein annotations (sequence, position, amino acid changes with physicochemical properties and PolyPhen classification). Indels were considered annotated by dbSNP if an entry was found with the same allele (or reverse complemented) within 1 bp of the variant position. This was to allow for ambiguities in calling the indel position.

Calculation of genome-wide estimates

Extrapolated estimates for the genome-wide number of cSNPs of various classes (Table 2b) were calculated based on the number of cSNP calls in that individual, the estimated sensitivity for making a variant call in that individual at any given position within the aggregate target (based on the fraction of array-based genotypes of that class that were successfully called; calculated separately for heterozygous and homozygous non-reference variants), and extrapolation to an estimated exome size of exactly 30 Mb (that is, multiplying by 30/26.6 = 1.13). A similar approach was taken to estimate the genome-wide number of uncommon cSNPs introducing nonsense codons, starting with the number observed in each individual and extrapolating based on estimated sensitivity for heterozygote detection and an estimated exome size of exactly 30 Mb.

In this proof-of-principle paper

Nature. 2009 Sep 10;461(7261):272-6. Epub 2009 Aug 16.

Targeted capture and massively parallel sequencing of 12 human exomes.

Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J.

Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA. sarahng@u.washington.edu
Genome-wide association studies suggest that common genetic variants explain only a modest fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. Although DNA sequencing costs have fallen markedly, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. We have therefore sought to develop second-generation methods for targeted sequencing of all protein-coding regions ('exomes'), to reduce costs while enriching for discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of 12 humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman-Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases of coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of non-synonymous variants by predicted functional impact.

PMID: 19684571

* On average, 6.4 gigabases (Gb) of mappable sequence was generated per individual (20-fold less than whole genome sequencing with the same platform
* the average fold-coverage of each exome was 51x
* A comparison of our data to past reports on exonic or exomic array-based capture revealed roughly equivalent capture specificity, but greater completeness in terms of coverage and variant calling
* On average, 17,272 cSNPs were called per individual, of which 92% were already annotated in a public database (dbSNP v129)
* Our non-redundant cSNP catalogue included 225 nonsense mutations (112 novel) and 102 splice-site disruptions (49 novel).
* We developed and applied an approach for identifying indels from our unpaired 76 bp reads. In total, 664 coding indels were called in one or more individuals. On average, 166 coding indels were called per individual, of which 63% were previously annotated in dbSNP ... To assess specificity, we attempted PCR and Sanger sequencing of 28 novel coding indels chosen at random. Of 21 successful assays, 20 coding indels were verified and 1 was a false positive.
* Specifically, MYH3 is the only gene where: (1) at least one (but not necessarily the same) nonsynonymous cSNP, splice-site disruption or coding indel is observed in all four individuals with FSS; (2) the mutations are not in dbSNP, nor in the eight HapMap exomes.
* our analysis suggests that direct sequencing of exomes of small numbers of unrelated individuals (but more than one) with a shared monogenic disorder can serve as a genome-wide scan for the causative gene.
* A clear limitation of exome sequencing is that it does not identify the structural and noncoding variants found by whole genome sequencing.


Later, the same group used this technique and identified the cause of a rare genetic disorder - Miller syndrome.

Nat Genet. 2009 Nov 13. [Epub ahead of print]

Exome sequencing identifies the cause of a mendelian disorder.

Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ.

[1] Department of Genome Sciences, University of Washington, Seattle, Washington, USA. [2] These authors contributed equally to this work.

We demonstrate the first successful application of exome sequencing to discover the gene for a rare mendelian disorder of unknown cause, Miller syndrome (MIM%263750). For four affected individuals in three independent kindreds, we captured and sequenced coding regions to a mean coverage of 40x and sufficient depth to call variants at approximately 97% of each targeted exome. Filtering against public SNP databases and eight HapMap exomes for genes with two previously unknown variants in each of the four individuals identified a single candidate gene, DHODH, which encodes a key enzyme in the pyrimidine de novo biosynthesis pathway. Sanger sequencing confirmed the presence of DHODH mutations in three additional families with Miller syndrome. Exome sequencing of a small number of unrelated affected individuals is a powerful, efficient strategy for identifying the genes underlying rare mendelian disorders and will likely transform the genetic analysis of monogenic traits.

PMID: 19915526

Exome sequencing 1.png

Exome sequencing 2.png

Exome sequencing 3.png

There are at least three aspects of this approach where we see substantial scope for improvement.

* The first relates to missed variant calls, either due to low coverage or because some variants are not identified easily with current sequencing platforms—for example, those within repeat tracts in coding sequences.

* The second is that our filtering relied on a public SNP database (dbSNP) that has a highly uneven ascertainment of variation across the genome. It would be better to rely on catalogs of common variation that are ascertained in a single study either exome wide (as with the eight HapMap exomes2) or genome wide (for example, as with the 1,000 genomes project) and where estimates of allele frequency are available. Increasing the number of control exomes progressively reduces the relevance of dbSNP to this analysis (Supplementary Fig. 2). Furthermore, as increasingly deep catalogs of polymorphism become available, it may be necessary to establish frequency-based thresholds for defining common variation that is unlikely to be causal for disease.

* A third concern is that the specificity of this approach is currently reduced by a subset of genes that recurrently appear to be enriched for new variants. These include long genes, but also genes that are subject to systematic technical artifacts (for example, mismapped reads due to duplicated or highly similar sequences in the genome). For sequences that are known to be duplicated or have paralogs (for example, genes from large gene families, or pseudogenes), these artifacts are mostly removed during read alignment (as reads with nonunique placements are removed from consideration). However, duplicated sequences not represented in the reference genome are not removed and spuriously appear as enriched for new variants (for example, CDC27).

Leave a Reply