Suggested Methods for Annotation

Please note: For subsequent phases of this project, a form-based annotation submission system, with more stringent annotation quality standards, is being developed. The methods described below were part of a project to investigate at what level researchers would be willing to submit annotations on a purely volunteer basis.

The following is a suggested procedure for annotating genes for a particular pathway or functional class with links to the tools needed.

The critical contribution we feel you, as a participant in the project, will make is using your advanced knowledge of the biology and function of these genes to gauge more accurately the function of a given gene (even if your analysis suggests that little can be said about the nature of certain genes). This is not an easy process; however, our hope is that you will receive adequate recognition for your contribution through the acknowledgment and peer review process being initiated.

Please note: If you would like help with some of the more tedious aspects of this procedure (e.g. BLASTs of a series of genes or P. aeruginosa sequences), please contact us. In particular, if you find that the pathway/functional class you are studying contains more genes than expected, please contact us for help.

You may wish to view our Glossary of general terminology if you are unfamiliar with any of the terms used.

Step 1:

Find genes from other organisms that are members of the functional class or pathway you are studying (these genes will subsequently be used to pull out homologs in the P. aeruginosa genomic sequence). To do this, perform a search of databases with Entrez using relevant keywords, and retrieve a selection of appropriate protein sequences. Note gene names commonly used for these genes. There are some minor differences in approaches, depending on whether you are annotating genes for a pathway, or annotating a family of related genes:
Annotating pathway genes: You will want to retrieve all the genes associated with the pathway from organisms that are most similar to P. aeruginosa in terms of evolutionary relatedness and in terms of function. The degree to which each gene retrieved from GenBank has been studied experimentally should also be noted, and genes that are well studied should be considered the best models.
Annotating a family of genes of similar structure or function: A selection of different members of the family should be retrieved which have as little similarity to each other as possible, in order to provide a good spectrum of sequences with which to search for P. aeruginosa orthologs and paralogs. The amount of experimental data available for each gene should be noted, and well-studied genes should be retrieved in preference to others.
 

Step 2:

Use NCBI's unfinished genomes BLAST server (or examine other options for performing the BLAST using NCBI's standalone or network BLAST programs) to identify P. aeruginosa sequences similar to the genes retrieved above. Using each of the genes retrieved in step 1, perform a TBLASTN search using the default cutoff (cut and paste the protein sequence from the GenBank file into the TBLASTN window, and be sure to choose P. aeruginosa in the list of organisms on the page so that you search against only this one genome). For any E. coli genes you have retrieved, you may wish to refer to the file available for participants on the PseudoCAP file server that contains a TBLASTN of the contigs using all E. coli genes (though it uses a fairly stringent cutoff of 1e-20 and so would only be useful for detecting genes that are quite well conserved). For some genes that are reasonably conserved between organisms, the other BLASTP and BLASTX analyses available on the file server may also be useful when gene hunting, since you can use the "find" feature in your spreadsheet or analogous program to search these tables by keyword. However, we wish to emphasize that the tables of in silico analysis available on the file server at this time are predominantly of use in providing a rough overview of the genome. BLAST analysis as described above is a more effective approach for gene discovery when you are searching for specific genes or families of genes.
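If you prefer a script to the spreadsheet "find" feature, the keyword search can be sketched as below. This is only an illustration: the layout of the tables on the file server is not described here, so the tab-delimited format and the column names "evalue" and "description" are assumptions, as is the function name filter_hits.

```python
import csv
import io


def filter_hits(table_text, keyword, max_evalue=1e-20):
    """Return rows whose description contains `keyword` and whose
    E-value is at or below `max_evalue`.

    Assumption: the table is tab-delimited with a header row that
    includes "evalue" and "description" columns. The real file server
    tables may be laid out differently.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    keyword = keyword.lower()
    return [
        row
        for row in reader
        if keyword in row["description"].lower()
        and float(row["evalue"]) <= max_evalue
    ]


# Made-up rows, for illustration only.
sample = (
    "query\tcontig\tevalue\tdescription\n"
    "q1\t12\t1e-35\tDNA gyrase subunit A\n"
    "q2\t40\t1e-05\tDNA gyrase subunit B\n"
    "q3\t7\t1e-50\tribosomal protein S1\n"
)

for hit in filter_hits(sample, "gyrase"):
    print(hit["query"], hit["contig"], hit["evalue"])
```

Relaxing max_evalue (for example to 1e-3) would also return weaker matches such as the second row, which is why a stringent cutoff like 1e-20 only finds well-conserved genes.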
 

Step 3:

Retrieve P. aeruginosa contig sequence ranges that are found to be similar to the genes chosen above. Use PathoGenesis's contig range retrieval tool and retrieve the range with similarity (according to contig number and range given in the above TBLASTN output) plus an additional 1-2 kb of sequence flanking the range. Retrieve both the DNA sequence and translated sequence. If the TBLASTN output indicated that the similarity was to the complementary strand, you will need to click on the "reverse complement" option to get the appropriate translated sequence. Often it is useful to also type in the range given for an ORF (from the table of GeneMark.hmm ORFs available from the file server) into the contig range retrieval tool, and then compare the result with the ORF you believe is located in the range.
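For participants working with a local copy of a contig, the range retrieval described above can be approximated as follows. This is a rough sketch, not the PathoGenesis tool itself: it assumes 1-based inclusive coordinates (as usually reported in BLAST output), and the function names and the default flank of 1000 bases are chosen here for illustration.

```python
# Sketch only: assumes 1-based, inclusive coordinates; the real
# retrieval tool may behave differently.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def extract_range(contig, start, end, flank=1000, complementary=False):
    """Return contig[start..end] plus `flank` bases on each side.

    start and end are 1-based and inclusive; they are swapped if given
    in descending order. The flank is clipped at the contig ends.
    Set complementary=True when the TBLASTN hit was on the
    complementary strand.
    """
    lo, hi = min(start, end), max(start, end)
    begin = max(lo - flank, 1)
    finish = min(hi + flank, len(contig))
    segment = contig[begin - 1:finish]
    return reverse_complement(segment) if complementary else segment


# Toy contig, for illustration only.
contig = "AAACCCGGGTTT"
print(extract_range(contig, 4, 6, flank=2))
print(extract_range(contig, 4, 6, flank=2, complementary=True))
```

For a real hit you would pass the contig sequence and the range from the TBLASTN output, with a flank of 1000 to 2000 bases as suggested above.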
 

Step 4:

Examine the contig ranges obtained in step 3 for the presence of ORFs. Examine the translated sequence from the contig ranges retrieved to find an ORF, and then refer back to the DNA sequence to determine if there is an appropriate ribosome binding site upstream of the predicted start codon(s). For reference, the 3' end of the P. aeruginosa 16S rRNA is 5'-TCCACCTCCTTA-3' and the specific point of attachment of the 16S rRNA sequence upstream of a given gene is thought to be 5'-AGGAGGU-3' (or a shorter variation thereof). The most common start codon is AUG (ATG in the DNA sequence); however, remember that GUG is frequently used, and UUG, AUU, and CUG are also used as start codons. See the genetic code information from NCBI for more information on other rare variations from the common genetic code. Again, it is useful to compare the position of any ORF that you find with the ORF(s) deduced in that region by GeneMark.hmm (a list of ORFs is available on the file server).
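The check for a ribosome binding site upstream of each candidate start codon can be roughed out in a few lines. The sketch below makes several assumptions that are not part of the procedure above: a "shorter variation" of the binding site is taken to mean any substring of AGGAGGT at least 4 bases long, the search window is the 20 bases immediately upstream of the start codon, and all names are hypothetical. It only flags candidates and is no substitute for inspecting the sequence yourself.

```python
# Start codons from the text, written in DNA notation.
START_CODONS = {"ATG", "GTG", "TTG", "ATT", "CTG"}

# DNA form of the 5'-AGGAGGU-3' site named in the text.
SD_FULL = "AGGAGGT"


def sd_variants(min_len=4):
    """All substrings of SD_FULL with at least `min_len` bases.

    Assumption: this is one possible reading of "a shorter variation".
    """
    n = len(SD_FULL)
    return {
        SD_FULL[i:j]
        for i in range(n)
        for j in range(i + min_len, n + 1)
    }


def candidate_starts(dna, window=20, min_len=4):
    """List (index, codon, motif) for every start codon in `dna`.

    index is the 0-based position of the codon's first base; motif is
    the longest site variant found in the `window` bases upstream, or
    None if there is none. All positions are scanned, not one frame.
    """
    dna = dna.upper()
    # Longest variants first; alphabetical order breaks ties.
    variants = sorted(sd_variants(min_len), key=lambda v: (-len(v), v))
    results = []
    for i in range(len(dna) - 2):
        codon = dna[i:i + 3]
        if codon not in START_CODONS:
            continue
        upstream = dna[max(i - window, 0):i]
        motif = next((v for v in variants if v in upstream), None)
        results.append((i, codon, motif))
    return results


# Toy sequence, for illustration only.
for index, codon, motif in candidate_starts("CCAGGAGGTCCCCCATGAAATAA"):
    print(index, codon, motif)
```

Because every position is scanned, start codons that fall outside the reading frame of your ORF will also be listed, so the output should be compared against the ORF found in the translated sequence.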
 

Step 5:

Compare the contig ranges and deduced ORFs with NCBI's nr database and COGs database. Using NCBI's Advanced BLAST 2.0 Server, cut and paste the DNA sequence for the contig range retrieved in step 3 into the server's sequence window, and choose BLASTX as the program. You should also perform a BLASTP analysis using the deduced protein sequence from the ORF you predicted in step 4. Finally, a similar comparison should be performed by searching against the COGs (clusters of orthologous groups) and BLOCKS databases.
 

Step 6:

Evaluate the results of database comparisons in step 5. The BLAST results should be evaluated, considering the following issues: You may wish to review some important points regarding limitations of current analysis.
The most relevant literature should be noted.
 

Step 7:

Complete other analyses as necessary - such as PSORT, SignalP or transmembrane/secondary structure prediction programs if you have reason to believe the protein may be a cytoplasmic membrane protein, periplasmic protein, outer membrane protein, or secreted protein. You may also wish to search for any motifs flanking the genes (such as fur boxes or transcription terminators).
 

Step 8:

Determine appropriate gene names. Search Entrez (nucleotides) for a gene name you propose, to determine how prevalent it is and what type of gene it commonly refers to. All gene names should follow the "three lower case letters followed by an upper case letter" format and should follow existing naming conventions for that type of gene, where applicable.
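The name format can be checked mechanically before you submit. The sketch below tests only the "three lower case letters followed by an upper case letter" pattern; it says nothing about whether a name follows existing conventions for that type of gene, and the function name is chosen here for illustration.

```python
import re

# Three lower case letters followed by one upper case letter.
GENE_NAME = re.compile(r"[a-z]{3}[A-Z]")


def is_valid_gene_name(name):
    """Return True if `name` matches the required four-character format."""
    return GENE_NAME.fullmatch(name) is not None


for name in ["algD", "AlgD", "alg", "algDD"]:
    print(name, is_valid_gene_name(name))
```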
 

Step 9:

Summarize your results. We will need the following information from you:

Please note that in some cases you may wish to submit only certain annotation information, such as a gene and protein name and predicted function (based on detailed analysis), without calling the start and stop codon, depending on your confidence in the sequence data.

Step 10:

Submit your analysis to the PseudoCAP moderator. Submissions should contain the above listed information and can be submitted either within the body of an email, as an attached text or word-processing file, or as a cut and paste document into our simple form. See the information page on making submissions for some more information.

Your analysis will be reviewed first by the moderator and then by the PAGAC and, once accepted, entered into a PathoGenesis genome database. Please note that a submitted annotation, even if accepted by the committee, will not necessarily appear in the final annotated sequence exactly as the author has indicated; it will simply form a strong "vote" for what the annotation should be in that genome region. Final decisions regarding annotation of the region (for example, gene name) will be at the discretion of members of the Pseudomonas Genome Project. However, the PseudoCAP participant will still receive recognition for their submission in the form of an "authorship" notation, even if some aspects of their annotation are changed. A collection of annotations will later be released to the PseudoCAP file server for viewing by all participants, and a fairly complete annotation of the genome will also be released to participants so that they may view it and check it for errors.
 
 



Pseudomonas aeruginosa Community Annotation Project
Last updated: December 15, 1998
Copyright © 1998