Skate Genome Sequence Assembly And Annotation
The purpose of this page is to document NEBC's plans to assemble and annotate the genome sequence of the little skate. We will hold a number of teleconferences to discuss and revise this initial outline, which was written following discussions at the Annual NECC Meeting, held March 12, 2010 in Burlington, VT.
[next videoconference: April 15th, 10am]
Genome Sequence Data
Genomic DNA samples were obtained at MDIBL and sent to Dr. Bruce Kingham at UDE. Sequencing libraries were constructed, and some of the first lanes of paired-end sequence are expected in late March.
- How many lanes of sequencing do we expect to have given our budget?
- What is the breakdown of the lanes by type of sequencing library? e.g., How many are single-end, how many are paired-end, how many are mate pair?
- What is the rough schedule for sequencing those lanes?
NEBC plans to use two different de novo assemblers to generate contigs that will be annotated at the Skate Genome Annotation Workshops. The first assembler is Velvet http://www.ebi.ac.uk/~zerbino/velvet/ , an open source software package from the European Bioinformatics Institute (EBI). The second is CLCbio's http://www.clcbio.com/ new de novo assembler (released in March 2010). Jim Vincent at UVM has acquired an NSF TeraGrid account to run Velvet and also has a powerful workstation to run CLCbio's assembler. Ben King at MDIBL will also be working on the assembly.
Depending on the schedule of when sequence data is generated at UDE, we plan to create data sets that include all data generated up to a particular date. These "data freezes" will then be assembled and the resulting contigs compared.
Contigs from each assembler will be compared with one another to validate the assemblies. This will be a challenging endeavour, but some initial analyses include:
- Comparing the distribution of contig lengths from each assembler (e.g., N25, N50, N75 lengths) in addition to mean contig lengths.
- Comparing the longest contig from each assembler to all contigs from the other assembler.
- Taking a small set of genes and finding which contigs from which assembly best represent those genes. Here, a small set of genes previously cloned in the little skate can be used in addition to a small set of human proteins. We hope to find contigs that contain portions of exons or entire exons in this process.
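The contig length comparison in the first item can be sketched in a few lines of Python. This is only an illustration, assuming the contig lengths from each assembler have already been read into a list of integers; the function names `nx_length` and `summarize_contigs` are hypothetical and are not part of Velvet or CLCbio's software. Here the Nx length is taken to be the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the assembled bases.

```python
def nx_length(lengths, fraction):
    """Return the Nx contig length for the given fraction (0.5 gives N50)."""
    if not lengths:
        raise ValueError("no contig lengths supplied")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total past the
        # threshold defines the Nx length.
        if running >= threshold:
            return length
    return ordered[-1]


def summarize_contigs(lengths):
    """Collect the summary statistics used to compare two assemblies."""
    return {
        "count": len(lengths),
        "mean": sum(lengths) / len(lengths),
        "longest": max(lengths),
        "N25": nx_length(lengths, 0.25),
        "N50": nx_length(lengths, 0.50),
        "N75": nx_length(lengths, 0.75),
    }
```

Running `summarize_contigs` once per assembler on the same data freeze would give two small tables that can be compared side by side.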
As assemblies are generated, they will need to be annotated. The overall strategy still needs to be discussed. Below are notes about our discussions so far.
Requirements and Tools
1. Genome Browser To View Annotations
We plan to use GBrowse http://gmod.org/wiki/GBrowse ; the University of Delaware has a GBrowse server available for participants to access via a web browser.
2. Database To Store List of Genes and Annotations
We plan to use the Chado http://gmod.org/wiki/Chado component of GMOD http://gmod.org/wiki/ ; Marc Farnum Rendino firstname.lastname@example.org at UVM has been learning how to set it up, etc.
3. Genome Sequence Annotation Tool
We plan to use Apollo http://apollo.berkeleybop.org/current/index.html since it is part of the GMOD project and should let us get up and running quickly. Marc Farnum Rendino email@example.com at UVM has been learning how to set it up, etc.
4. BLAST Web Interface
We need to be able to search the following nucleotide sequence databases:
- Contigs and singletons from any of the assemblies - This is needed to look for specific genes.
- All reads (yes, the millions and millions of reads) - This is needed to look for specific genes in the entire data set and to validate contigs. If a gene or exon appears to be missing after searching the contigs and singletons, then one could check whether any reads represent that feature.
5. Automated Sequence Analyses
Given a set of contigs from an assembly, we need a number of analyses performed to generate the "tracks" of evidence that are displayed in Apollo for annotators to evaluate. Analyses include the following.
- Identify repetitive sequences using RepeatMasker or NCBI WindowMasker
- Align all contigs (longer than a particular length as some may be very short, e.g., 200bp) against
  - NCBI "nr" (non-redundant) protein database (blastx) - perhaps just human subset
  - Skate transcriptome contigs (blastn)
  - ~31,000 Skate ESTs (blastn)
  - Elephant shark genome assembly (blastn and tblastx)
- Run ab initio gene prediction tool(s) against contigs longer than a particular length, e.g., 2kb, to predict gene models
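The length cutoffs mentioned above (e.g., 200bp before alignment, 2kb before gene prediction) could be applied with a small filter such as the following. It is a minimal sketch, assuming the contigs are available as a plain FASTA text stream; `read_fasta` and `longer_than` are hypothetical helper names, and the actual cutoffs still need to be decided.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longer_than(records, length_cutoff):
    """Keep only records whose sequence is strictly longer than the cutoff."""
    for header, sequence in records:
        if len(sequence) > length_cutoff:
            yield header, sequence
```

For example, `longer_than(read_fasta(open("contigs.fa")), 2000)` would yield only the contigs passed on to gene prediction, where `contigs.fa` is a placeholder file name.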
Genome Annotation Workshop Examples
1. Small Region of Human Chr. Containing Wnt1
At least this region: chr12:49,371,196-49,377,435
Also several genes nearby (WNT10B is proximal).
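When preparing this example, the region string has to be turned into numeric coordinates before the sequence can be extracted. The sketch below shows one way to do that; it assumes the coordinates are 1-based and inclusive, as genome browsers usually display them, and the function names are hypothetical.

```python
def parse_region(region):
    """Split a string like 'chr12:49,371,196-49,377,435' into (chrom, start, end)."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    # Remove the thousands separators before converting to integers.
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start > end:
        raise ValueError("malformed region: " + region)
    return chrom, start, end


def region_length(start, end):
    """Length of a region, assuming 1-based inclusive coordinates."""
    return end - start + 1
```

Under that assumption, the region above spans 6,240 bases.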
2. Skate HOXA Cluster
GenBank FJ944024 (http://www.ncbi.nlm.nih.gov/nuccore/254212170)
3. (may not be necessary) Human Chromosome 21 - The shortest chromosome.
RefSeq NC_000021 (http://www.ncbi.nlm.nih.gov/nuccore/NC_000021.8)
Apollo Input File Requirements
For these examples, prepare an input file for Apollo that contains results from the following analyses.
- RepeatMasker (http://www.repeatmasker.org)
- GENSCAN (http://genes.mit.edu/GENSCAN.html)
- FGENESH (http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&sub...)
- FGENESH+ (http://linux1.softberry.com/berry.phtml?topic=fgenes_plus&group=programs...)
- BLAST alignments from searching sequence against:
  - Human proteins (BLASTX vs. human subset of 'nr' from NCBI)
  - All proteins (BLASTX vs. 'nr' from NCBI)
  - Skate ESTs (BLASTN vs. subset of 31,167 Raja erinacea ESTs in dbEST)
    - 14,726 ESTs from Normalized Mixed Tissue Library - Adult liver, kidney, brain, testis, ovary, gill, heart, spleen, rectal gland. NCBI dbEST libraries: "Skate Multiple Tissues, Normalized" (9,028 ESTs) and "Little Skate Multiple Tissues, Normalized" (5,698 ESTs).
    - 6,016 ESTs from Normalized Liver Library - NCBI dbEST library "Little Skate Liver, Normalized"
    - 5,600 ESTs from Embryonic Library - Embryonic stages 19, 20 and 25 sequenced at University of Washington. NCBI dbEST library "Little skate (Leucoraja erinacea) embryo tissues; 5' sequences of ESTs".
    - 4,825 ESTs from Embryonic Cell Line (LEE-1) - Cell line derived from mincing a single embryo head and body. NCBI dbEST library "Little skate (Leucoraja erinacea) embryo cell line 1 (LEE-1): 5' sequences of ESTs".
  - Elephant shark genome sequence scaffolds (TBLASTX vs. 685,094 sequence records that are "bundled" together with GenBank accession number AAVX01000000)
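As a sanity check when building the skate EST BLAST database, the per-library counts listed above can be tallied against the stated total of 31,167. The snippet below is a small sketch using only the numbers given on this page; the dictionary and function names are hypothetical.

```python
# EST counts per library, copied from the list above.
EST_LIBRARIES = {
    "Normalized Mixed Tissue Library": 14726,
    "Normalized Liver Library": 6016,
    "Embryonic Library": 5600,
    "Embryonic Cell Line (LEE-1)": 4825,
}

# The mixed tissue set is split across two dbEST libraries.
MIXED_TISSUE_PARTS = {
    "Skate Multiple Tissues, Normalized": 9028,
    "Little Skate Multiple Tissues, Normalized": 5698,
}

EXPECTED_TOTAL = 31167


def counts_are_consistent():
    """Return True if the library counts add up to the stated totals."""
    mixed_ok = (sum(MIXED_TISSUE_PARTS.values())
                == EST_LIBRARIES["Normalized Mixed Tissue Library"])
    total_ok = sum(EST_LIBRARIES.values()) == EXPECTED_TOTAL
    return mixed_ok and total_ok
```

The same comparison could be repeated against the number of sequences actually downloaded from dbEST before formatting the database.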
A working copy of the agenda for the annotation workshop to be held in Delaware May 24-28, 2010 is available.