Suggested Methods for Annotation
Please note: For subsequent phases of this project,
a form-based annotation submission system, with more stringent annotation
quality standards, is being developed. The methods described below were
part of a project to investigate at what level researchers would be willing
to submit annotations on a purely volunteer basis.
The following is a suggested procedure for annotating
genes for a particular pathway or functional class with links to the tools
needed.
The critical contribution we hope you, as a participant in the project,
will make is to use your advanced knowledge of the biology and function
of these genes to gauge more accurately the function of a given gene
(even if your analysis suggests that little can be said about the nature
of certain genes). This is not an easy process; however, our hope is that
you will receive adequate recognition for your contribution through the
acknowledgment and peer review process being initiated.
Please note: If you would like help with some
of the more tedious aspects of this procedure (e.g., BLASTs of a series
of genes or P. aeruginosa sequences), please contact us. In particular,
if you find there is a larger-than-expected number of genes in the
pathway/functional class you are studying, please contact us for help.
You may wish to view our Glossary
of general terminology if you are unfamiliar with any of the terms
used.
Step 1:
Find genes from other organisms that are members
of the functional class or pathway you are studying (these genes will
subsequently be used to pull out homologs in the P. aeruginosa genomic
sequence). To do this, perform a search of databases with Entrez
using relevant keywords, and retrieve a selection of appropriate protein
sequences. Note gene names commonly used for these genes. There are some
minor differences in approaches, depending on whether you are annotating
genes for a pathway, or annotating a family of related genes:
Annotating pathway genes: You will want
to retrieve all the genes associated with the pathway from organisms that
are most similar to P. aeruginosa in terms of evolutionary relatedness
and in terms of function. The degree to which each gene retrieved from
GenBank has been studied experimentally should also be noted, and genes
that are well studied should be considered the best models.
Annotating a family of genes of similar structure
or function: A selection of different members of the family should
be retrieved which have as little similarity to each other as possible,
in order to provide a good spectrum of sequences with which to search for
P. aeruginosa orthologs and paralogs. The amount of experimental data
available for each gene should be noted, and well-studied genes should be
retrieved where possible.
Step 2:
Use NCBI's unfinished genomes BLAST server (or examine other options
for performing the BLAST using NCBI's standalone or network BLAST programs)
to identify P. aeruginosa sequences similar to the genes retrieved above.
Using each of the genes retrieved in step 1, perform a TBLASTN search using
the default cutoff (cut and paste the protein sequence from the GenBank
file into the TBLASTN window and be sure to choose P. aeruginosa
in the list of organisms on the page so you only search against this one
genome). For any E. coli genes you have retrieved, you may wish
to refer to the file available for participants on the PseudoCAP file
server that contains a TBLASTN of the contigs using all E. coli genes
(though it uses a fairly stringent cutoff of 1e-20 and so would only be
useful for detection of genes that are quite well conserved). For some
genes that are reasonably conserved between organisms, the other BLASTP
and BLASTX analyses available on the file server may also be useful when
gene hunting, since you can use the "find" feature in your spreadsheet
or analogous program to search these tables by keyword. However, we wish
to emphasize that the tables of in silico analysis available on
the file server at this time are predominantly of use in providing a rough
overview of the genome. BLAST analysis as described above is a more effective
approach for gene discovery when specific genes, or families of genes,
are being searched for.
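If you run many TBLASTN searches locally, filtering the hits by E-value in a script can save time. The following is a minimal sketch only. It assumes the results were saved as tab-separated text in the 12-column tabular layout written by current NCBI BLAST+ (`-outfmt 6`), which is not necessarily the format returned by the servers described on this page. The function name `filter_hits` is hypothetical.

```python
def filter_hits(lines, max_evalue=1e-20):
    """Keep tabular BLAST hits with an E-value at or below max_evalue.

    Assumed columns (BLAST+ -outfmt 6): qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    Returns tuples of (query, contig, start, end, strand, evalue).
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        sstart, send = int(fields[8]), int(fields[9])
        # In TBLASTN output, sstart > send means the hit is on the
        # complementary strand of the contig.
        strand = "+" if sstart <= send else "-"
        hits.append((fields[0], fields[1], min(sstart, send),
                     max(sstart, send), strand, evalue))
    return hits
```

The default of 1e-20 mirrors the stringent cutoff mentioned above; pass a larger `max_evalue` when looking for less conserved genes.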
Step 3:
Retrieve P. aeruginosa contig sequence
ranges that are found to be similar to the genes chosen above. Use
PathoGenesis's contig
range retrieval tool and retrieve the range with
similarity (according to contig number and range given in the above TBLASTN
output) plus an additional 1-2 kb of sequence flanking the range. Retrieve
both the DNA sequence and translated sequence. If the TBLASTN output indicated
that the similarity was to the complementary strand, you will need to click
on the "reverse complement" option to get the appropriate translated sequence.
Often it is useful to also type in the range given for an ORF (from the
table of GeneMark.hmm ORFs available from the file server) into the contig
range retrieval tool, and then compare the result with the ORF you
believe is located in the range.
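For those working with a local copy of a contig sequence, the range extension and reverse-complement steps described above can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of the contig range retrieval tool itself: coordinates are taken to be 1-based and inclusive, the default flank of 1500 bp is simply a value within the suggested 1-2 kb, and the function names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def extract_with_flank(contig, start, end, flank=1500, minus_strand=False):
    """Extract a 1-based inclusive range plus flanking sequence.

    The range is clamped to the ends of the contig. Returns the sequence
    and the actual start and end coordinates that were extracted.
    """
    lo = max(1, min(start, end) - flank)
    hi = min(len(contig), max(start, end) + flank)
    seq = contig[lo - 1:hi]
    if minus_strand:
        # The similarity was to the complementary strand.
        seq = reverse_complement(seq)
    return seq, lo, hi
```

Returning the clamped coordinates makes it easy to record exactly which contig range was examined.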
Step 4:
Examine the contig ranges obtained in step 3 for the presence of ORFs.
Examine the translated sequence from the contig ranges retrieved to find
an ORF, and then refer back to the DNA sequence to determine if there is
an appropriate ribosome binding site upstream of the predicted start codon(s).
For reference, the 3' end of the P. aeruginosa 16S rRNA is 5'-TCCACCTCCTTA-3'
and the site upstream of a given gene to which the 16S rRNA binds is
thought to be 5'-AGGAGGU-3' in the mRNA (AGGAGGT in the DNA sequence), or
a shorter variation thereof. The most common start codon is AUG (ATG in
the DNA sequence); however, remember that GUG is frequently used, and UUG,
AUU, and CUG are also used occasionally.
See the genetic code information from NCBI for more
information on other rare variations from the common genetic code. Again,
it is useful to compare the position of any ORF that you find with the
ORF(s) deduced in that region by GeneMark.hmm
(a list of ORFs is available on the file server).
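As a rough aid to this manual inspection, the sketch below scans the forward strand of a DNA sequence for ORFs and then looks upstream of a start codon for a ribosome binding site resembling AGGAGGT. Several choices are assumptions made for illustration only: just ATG, GTG and TTG are treated as starts, the first start codon after a stop is reported (the longest possible ORF), the upstream window is 20 nt, and a match of at least 4 consecutive bases of the motif counts. To search the complementary strand, reverse-complement the sequence first. This is not a substitute for GeneMark.hmm.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed start codons for this sketch
STOPS = {"TAA", "TAG", "TGA"}
SD_MOTIF = "AGGAGGT"             # DNA form of 5'-AGGAGGU-3'


def find_orfs(dna, min_codons=50):
    """Return (start, end) pairs, 0-based and half-open, stop codon included."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i                      # first start after a stop
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def best_rbs(dna, orf_start, window=20, min_len=4):
    """Find the longest piece of SD_MOTIF in the window upstream of a start.

    Returns (matched bases, spacing in nt to the start codon), or None.
    """
    upstream = dna[max(0, orf_start - window):orf_start].upper()
    for size in range(len(SD_MOTIF), min_len - 1, -1):
        for k in range(len(SD_MOTIF) - size + 1):
            piece = SD_MOTIF[k:k + size]
            pos = upstream.rfind(piece)
            if pos != -1:
                return piece, len(upstream) - (pos + size)
    return None
```

Each candidate start should still be checked by eye, since alternative in-frame start codons downstream may have a better ribosome binding site.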
Step 5:
Compare the contig ranges and deduced ORFs with NCBI's nr database and
COGs database. Using NCBI's Advanced
BLAST 2.0 Server, cut and paste the DNA sequence for the contig range
retrieved in step 3 into the server's sequence window, and choose BLASTX
as the program. You should also perform a BLASTP analysis using the deduced
protein sequence from the ORF you predicted in step 4. Finally, a similar
comparison should be performed by searching against the COGs
(clusters of orthologous groups) and BLOCKS
databases.
Step 6:
Evaluate the results of database comparisons in step 5. The BLAST
results should be evaluated, considering the following issues:
- What database genes are most similar to the P. aeruginosa ORF?
- Does the level of similarity suggest they are orthologs, or is the
  similarity somewhat unusual - possibly even suggesting horizontal transfer?
- Is the similarity in each case over the entire length of the
  P. aeruginosa sequence, or in a certain domain/region of the sequence?
- Is the similarity in each case over the entire length of the corresponding
  nr database sequence, or in a certain domain/region of the sequence?
- Where there is similarity, are there notations for the database sequence
  describing motifs in the relevant domain?
- Has this database gene been studied experimentally?
- Have the relevant domains, where there is similarity, been studied
  experimentally?
- Are there neighbouring genes which may provide clues regarding functional
  assignment?
- Do neighbouring genes appear in similar arrangements in other genomes?
- Is there Pseudomonas N-terminal sequence data in the nr database
  that is highly similar to the start of the ORF (or start of predicted
  processed protein)? This data can be valuable in linking genomic sequence
  data and experimental data for Pseudomonas proteins.
- Does the ORF's predicted protein sequence have a similar predicted
  molecular weight and pI to a previously studied Pseudomonas protein?
- The BLOCKS and COGs results may also be considered, and a reasonable
  estimation of what can be determined from all of these analyses should
  be summarized.
You may wish to review some important points regarding limitations
of current analysis.
The most relevant literature should be noted.
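The questions above about whether similarity spans the entire length of each sequence can be made concrete by computing alignment coverage for both the P. aeruginosa ORF (the query) and the database sequence. The sketch below assumes 1-based inclusive alignment coordinates, as reported in BLASTP output, with lengths in amino acids. The 0.8 threshold for calling a sequence "fully covered" is an arbitrary illustrative value, and the function names are hypothetical.

```python
def coverage(aln_start, aln_end, seq_length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    return (abs(aln_end - aln_start) + 1) / seq_length


def describe_hit(q_start, q_end, q_len, s_start, s_end, s_len, full=0.8):
    """Summarize whether similarity spans the query, the subject, or both."""
    q_cov = coverage(q_start, q_end, q_len)
    s_cov = coverage(s_start, s_end, s_len)
    if q_cov >= full and s_cov >= full:
        label = "similar over the full length of both sequences"
    elif q_cov >= full:
        label = "query matches only a region of the database sequence"
    elif s_cov >= full:
        label = "database sequence matches only a region of the query"
    else:
        label = "similarity limited to a region of both sequences"
    return q_cov, s_cov, label
```

A hit labelled as matching only a region of the database sequence is a cue to check which domain is involved before transferring a functional assignment.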
Step 7:
Complete other analyses as necessary - such as PSORT, SignalP
or transmembrane/secondary structure prediction programs if you have
reason to believe the protein may be a cytoplasmic membrane protein,
periplasmic protein, outer membrane protein, or secreted protein. You may
also wish to search for any motifs flanking the genes (such as fur boxes
or transcription terminators).
Step 8:
Determine appropriate gene names. Search Entrez
(nucleotides) for a gene name you propose, to determine how prevalent it
is and what type of gene it commonly refers to. All gene names should follow
the "three lower case letters followed by an upper case letter" format
and should follow existing naming conventions for that type of gene, where
applicable.
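The basic format rule can be checked mechanically. The following minimal sketch tests only the "three lower case letters followed by an upper case letter" pattern stated above; it cannot tell whether a name follows the existing conventions for that type of gene. The function name is hypothetical.

```python
import re

# Three lower-case letters followed by one upper-case letter, e.g. "corA".
GENE_NAME = re.compile(r"[a-z]{3}[A-Z]")


def is_valid_gene_name(name):
    """Return True if the whole name matches the four-character format."""
    return GENE_NAME.fullmatch(name) is not None
```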
Step 9:
Summarize your results. We will need the following information from
you:
Please note that in some cases you may wish to submit only certain
annotation information - such as a gene and protein name and predicted
function (based on detailed analysis), without calling the start and stop
codon, depending on your confidence in the sequence data.
- The contig number and range for each gene (indicate which contig version
  you are referring to).
- Whether this range matches an ORF called by GeneMark.
- The amino acid sequence of each deduced ORF.
- Proposed name for each gene, with a brief explanation for why the
  particular name was chosen.
- Proposed name for each corresponding protein.
- Any alternate protein names and/or gene names that have been used.
- EC number for the protein, if applicable.
- Map location of the gene, if known.
- Any other notations, such as motifs identified.
- List any functional categories that you believe this gene should belong
  to. More than one is acceptable. Appropriate terminology from existing
  classification schemes (e.g., Blattner's and Monica Riley's schemes) is
  not required - descriptive words are adequate at this time.
- A summary indicating your methods (you may cite this page) and rationale
  for your approach (what genes you used to pull out P. aeruginosa
  orthologs/paralogs and why), and your rationale for the conclusions you
  came to (e.g., "The deduced protein was similar to E. coli CorA, a
  magnesium/cobalt transport protein; however, it was similar only to the
  C-terminus of CorA, where there exist putative cytoplasmic membrane
  spanning domains. Analogous transmembrane spanning domains were also
  identified in the P. aeruginosa sequence. The lack of similarity in the
  N-terminus, yet similarity with a domain of a protein known to associate
  with the cytoplasmic membrane, leads to a conclusion that this protein
  should be assigned as a 'putative cytoplasmic membrane associated
  protein'."). Please take into consideration the limitations of current
  analysis when generating your conclusions.
- Note relevant literature that supports your conclusions.
- Mention Medline accession numbers for the relevant sequences which support
  your conclusions (i.e., those with which a gene shares similarity).
- You may wish to edit the BLASTP table found on the file server, indicating
  the proposed changes to the gene name, protein name, ORF start and stop,
  etc.
- Indicate which individuals and institutions you would like to have
  referenced with this annotation on the final submitted genome sequence.
  Please note that the individuals will be listed solely with their
  annotation - they will not become authors of the publication of the
  genome sequence.
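If you keep your results in a script or spreadsheet export, a simple record type can help ensure that nothing on the list above is forgotten before you write up a submission. The sketch below is purely illustrative: the class, its field names and the plain-text layout are hypothetical and do not represent a format required by the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotationSummary:
    """One gene's worth of the information requested in step 9."""
    contig: str
    contig_version: str
    start: Optional[int]                 # None if the start is not being called
    end: Optional[int]                   # None if the stop is not being called
    matches_genemark_orf: Optional[bool]
    protein_sequence: str
    gene_name: str
    gene_name_rationale: str
    protein_name: str
    alternate_names: List[str] = field(default_factory=list)
    ec_number: Optional[str] = None
    map_location: Optional[str] = None
    functional_categories: List[str] = field(default_factory=list)
    methods_and_rationale: str = ""
    literature: List[str] = field(default_factory=list)
    credited: List[str] = field(default_factory=list)

    def to_text(self):
        """Render the record as 'field: value' lines, flagging empty fields."""
        lines = []
        for name, value in vars(self).items():
            if value is None or value == "" or value == []:
                value = "not provided"
            elif isinstance(value, list):
                value = "; ".join(value)
            lines.append(f"{name}: {value}")
        return "\n".join(lines)
```

Fields rendered as "not provided" show at a glance which items remain to be filled in or were deliberately left out.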
Step 10:
Submit your analysis to the PseudoCAP moderator. Submissions should
contain the above listed information and can be submitted within
the body of an email, as an attached text or word-processing file, or as
a cut-and-paste document in our simple form. See
the information page on making submissions
for more information.
Your analysis will then be subject to review by the moderator and then
the PAGAC, and then, once accepted, entered into a PathoGenesis genome
database. Please note that this submitted annotation, even if accepted by
the committee, will not necessarily appear in the final annotated sequence
exactly as the author has indicated - it will simply form a strong "vote"
for what the annotation should be in that genome region. Final decisions
regarding annotation of the region (for example, gene name) will be at the
discretion of members of the Pseudomonas Genome Project. However, the
PseudoCAP participant will still receive recognition for their submission
in the form of an "authorship" notation, even if some aspects of their
annotation are changed. A collection of annotations will later be released
to the PseudoCAP file server for viewing by all participants, and a fairly
complete annotation of the genome will also be released to participants so
they may view it and check it for errors.
Pseudomonas aeruginosa Community Annotation Project
Last updated: December 15, 1998
Copyright © 1998