Results of Gene Prediction exercise

I. GENSCAN results

GENSCAN predicts a single protein from this genomic sequence:

>/tmp/01_15_13-09:49:42.fasta|GENSCAN_predicted_peptide_1|1120_aa
MAVCVGSHCRSGNVRVVNGNVKRLLYRNETIPLDECAAGNLSGQLEFKWNATYNVIAMRN
LPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQL
GGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCFFLIIDVLPCRNPVGVVTV
VVALKSPFDSSSRGSTDSSALTGQAVWVRIKSFHIFFGDVLELLSEFGGLEIAFEKFFLG
SSSVEPGSYLGDFLQQWRASISESRRTGWGQCGLAVKRQFVVWSNKPGHSSYLPIEGYTK
VFIRLVVVDRFTIKADTRTTTCHIGTLFSSSSGSQPIHPVIRPREQTLYLCFSHSSIVSL
GGYGDIFFLYSCDWYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDR
LLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVS
NVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGP
LEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIP
ISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSD
ETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDV
FARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLT
DSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIR
PSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSG
SFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMT
LRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSS
AADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHM
SYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV

II. BLASTX results

A. Top hits

A BLASTX search of the genomic sequence, run as described in the exercise, produces the top hits shown in the screenshot below.

Look at the column showing E values.

Notice the sharp break in the E values after the first four hits. The E value rises from 4e-87 for the last of the eyeless isoforms to 8e-27 for the next protein, which is encoded by the twin of eyeless gene. That is a jump of about 60 orders of magnitude. A long tail of significant hits follows. These are all proteins related to eyeless by sequence similarity.
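The break in the hit list can also be located numerically. The sketch below is only an illustration: it looks for the largest jump in log10(E) between neighboring hits. Only the values 4e-87 and 8e-27 come from the text above; the other E values and the function name are hypothetical, and an E value of exactly 0 would need special handling because its logarithm is undefined.

```python
import math

def largest_evalue_gap(evalues):
    """Return (index, gap) for the largest jump in log10(E) between neighbours.

    `evalues` must be sorted from best (smallest) to worst, as in a BLAST hit
    list. The returned index is the position of the last hit before the gap.
    """
    logs = [math.log10(e) for e in evalues]
    gaps = [logs[i + 1] - logs[i] for i in range(len(logs) - 1)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return best, gaps[best]

# Only 4e-87 and 8e-27 are taken from the text; the rest are hypothetical.
example = [1e-120, 1e-110, 1e-95, 4e-87, 8e-27, 1e-25, 3e-24]
index, gap = largest_evalue_gap(example)
print(index, round(gap, 1))
```

With this made-up list, the largest gap falls after the fourth hit, which is the pattern described for the eyeless isoforms.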

Our working hypothesis is that only the four eyeless isoforms are encoded by the genomic segment that we used as the query sequence. How would you test this hypothesis?

BLASTX HITS

B. Alignments

Here is the alignment for the top hit, isoform B encoded by eyeless. Notice that there are five segments that align. Remember that we are using a six-frame translation of the query DNA sequence, so the coordinates on the query sequence don't correspond to amino acid positions in a predicted protein. The subject sequence coordinates are informative, however.
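To see why query coordinates are nucleotide positions rather than amino acid positions, it can help to write out a six-frame translation yourself. The following is a minimal sketch, assuming the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are our own choices and are not guaranteed to match the frame numbering that BLASTX reports. The short DNA string is a toy example, not part of the exercise sequence.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """Return a dict mapping frame label (+1..+3, -1..-3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])      # forward strand
        frames[-(offset + 1)] = translate(rc[offset:])    # reverse strand
    return frames

frames = six_frame("ATGGCCTAA")  # toy sequence
for label in (1, 2, 3, -1, -2, -3):
    print(label, frames[label])
```

Each amino acid in an aligned segment corresponds to three nucleotides of the query, so a segment of n residues spans 3n nucleotides on the genomic sequence.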

Make a model of isoform B of eyeless using only these alignments. Is this model realistic? The most important questions here are:

Is the entire protein encoded on a single strand of genomic DNA?
How do you account for segments encoded by different reading frames?
Are any amino acids of isoform B present in more than one aligned segment? Why?
Are any amino acids of isoform B missing in the alignments shown? Why?
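The last two questions can be checked systematically from the subject coordinates of the aligned segments. The sketch below is one way to do it: it reports which residues of the protein are covered by more than one segment and which are not covered at all. The coordinates and protein length in the example are hypothetical placeholders; substitute the subject start and end positions you read from the alignment.

```python
def coverage_report(segments, protein_length):
    """Summarise how aligned segments cover a protein.

    `segments` is a list of (start, end) subject coordinates, 1-based and
    inclusive. Returns (overlaps, gaps), each a list of (start, end) ranges.
    """
    depth = [0] * (protein_length + 2)  # index 0 is unused
    for start, end in segments:
        for pos in range(start, end + 1):
            depth[pos] += 1

    def runs(predicate):
        # Collect maximal runs of positions whose depth satisfies `predicate`.
        found, run_start = [], None
        for pos in range(1, protein_length + 2):
            if pos <= protein_length and predicate(depth[pos]):
                if run_start is None:
                    run_start = pos
            elif run_start is not None:
                found.append((run_start, pos - 1))
                run_start = None
        return found

    return runs(lambda d: d > 1), runs(lambda d: d == 0)

# Hypothetical coordinates, for illustration only.
overlaps, gaps = coverage_report([(1, 40), (38, 120), (131, 200)], 210)
print("covered more than once:", overlaps)
print("not covered:", gaps)
```

Residues that appear in two segments and residues that appear in none are both worth examining against the positions where the segments change reading frame.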

BLASTX alignment

Here is the alignment for the next hit, isoform A encoded by eyeless.

Make a model of isoform A of eyeless using only these alignments. Is this model realistic? The most important questions here are:

Is the entire protein encoded on a single strand of genomic DNA?
How do you account for segments encoded by different reading frames?
Are any amino acids of isoform A present in more than one aligned segment? Why?
Are any amino acids of isoform A missing in the alignments shown? Why?

BLASTX alignment

III. BLASTP Comparison of the GENSCAN predicted protein with isoform A of eyeless

The dot plot comparison of the GENSCAN protein (X axis) vs. isoform A of eyeless (Y axis) is shown below. Notice that there are three segments that align.
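If you have not worked with dot plots before, a small text version may help you read the figure. The sketch below is a simplification: it marks exact matches of short words, whereas the plot produced by BLAST is based on scored alignments. Both toy sequences are invented; the X sequence is the Y sequence with four extra residues inserted in the middle.

```python
def dot_matches(x_seq, y_seq, word=3):
    """Return (x, y) start positions (0-based) where `word` residues match exactly."""
    index = {}
    for y in range(len(y_seq) - word + 1):
        index.setdefault(y_seq[y:y + word], []).append(y)
    return [(x, y)
            for x in range(len(x_seq) - word + 1)
            for y in index.get(x_seq[x:x + word], [])]

def text_plot(x_seq, y_seq, word=3):
    """Render the matches as text: one row per Y position, one column per X position."""
    hits = set(dot_matches(x_seq, y_seq, word))
    return "\n".join(
        "".join("*" if (x, y) in hits else "." for x in range(len(x_seq)))
        for y in range(len(y_seq)))

y_toy = "MKTAYIAKQRLG"        # invented sequence
x_toy = "MKTAYIWWWWAKQRLG"    # same sequence with WWWW inserted
matches = dot_matches(x_toy, y_toy)
print(text_plot(x_toy, y_toy))
```

In this toy case the diagonal stops, skips several columns along X, and resumes on a parallel diagonal. Compare that pattern with the figure when you answer the questions that follow.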

Why is there a break in the diagonal line? Consider the two longest alignments. What is the meaning of the gap along the X axis when considering these two alignments?

The shortest alignment represents a segment near the end of the protein on the Y axis. Notice that this same segment is also represented in the longest alignment. How do you explain this?

The aligned sequences are presented below the dot plot.

GENSCAN Dot plot

GENSCAN BLASTP

IV. Augustus results

A. Predicted proteins

Augustus (with the species set to Drosophila melanogaster) predicts two very similar proteins from this genomic sequence:

>Augustus_prediction1
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHP
HSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEV
VSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDL
MQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASV
TAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSL
EKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTIN
GLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGA
MYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSG
YEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTS
AHVAPGKQQFFASCFYSPWV

>Augustus_prediction2
MSSACTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYF
ATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQY
KRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL
NSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGP
SLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERT
HYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPST
LSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHH
TALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAY
ALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGK
QQFFASCFYSPWV

B. BLASTP Comparison of the AUGUSTUS predicted proteins with isoform A of eyeless

The proteins predicted by Augustus are similar to each other, and both are very close to isoform A of eyeless. Explain the discrepancies that remain among the three sequences.

The dot plot and the alignment of Augustus protein 1 with isoform A of eyeless are shown below.

AUGUSTUS dot plot

AUGUSTUS BLASTP

The dot plot and the alignment of Augustus protein 2 with isoform A of eyeless are shown below.

AUGUSTUS dot plot

AUGUSTUS BLASTP

The dot plot and the alignment of Augustus protein 1 with Augustus protein 2 are shown below.
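A quick way to see where the two predictions part ways is to strip off the sequence they share at the end and look at what is left. The sketch below does this for the first residues of each prediction, copied from the sequences in part A and truncated at the same downstream position. The function name is our own, and the comparison covers only these fragments; run it on the full sequences to confirm whether the remainder is identical.

```python
def split_at_common_suffix(seq_a, seq_b):
    """Return (unique_prefix_a, unique_prefix_b, shared_suffix)."""
    shared = 0
    while (shared < min(len(seq_a), len(seq_b))
           and seq_a[-1 - shared] == seq_b[-1 - shared]):
        shared += 1
    return (seq_a[:len(seq_a) - shared],
            seq_b[:len(seq_b) - shared],
            seq_a[len(seq_a) - shared:])

# First residues of each Augustus prediction, truncated at ...VIAMRN.
pred1_start = "MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRN"
pred2_start = "MSSACTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRN"

only_1, only_2, shared = split_at_common_suffix(pred1_start, pred2_start)
print("unique to prediction 1:", only_1)
print("unique to prediction 2:", only_2)
print("shared:", shared)
```

Within these fragments, the two predictions differ only in a short stretch at the N-terminus.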

AUGUSTUS dot plot

AUGUSTUS BLASTP

V. Search for human ortholog

The list of top hits and the alignment of the best hit are shown below. The top two hits are isoforms of PAX6. There is a slight gap in the E values before significant hits on other members of the PAX gene family.

Human ey BLASTP

In the alignment, note that the three aligned segments have very different E values (listed under "Expect"). The first alignment is 94% identical, 96% positive. "Positive" in this context counts both identical residues and conservative substitutions, that is, substitutions with a positive score in the scoring matrix. All eight amino acid substitutions from the best-matched segment in the first alignment are shown below.

You can review the structures of the amino acids in the Wikipedia entry on proteinogenic amino acids.

Substitutions are classified as conservative or nonconservative using a BLOSUM substitution matrix (BLOSUM62 is the BLASTP default).

Query Position (Drosophila eyeless)	Subject Position (Human PAX6)	Change (Drosophila -> Human)	Change Type
50	17	G (GLY) -> N (ASN)	nonconservative
114	81	A (ALA) -> P (PRO)	nonconservative
121	88	S (SER) -> A (ALA)	conservative
141	108	Q (GLN) -> S (SER)	nonconservative
143	110	N (ASN) -> G (GLY)	nonconservative
165	132	A (ALA) -> S (SER)	conservative
166	133	Q (GLN) -> E (GLU)	conservative
168	135	E (GLU) -> Q (GLN)	conservative
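The classification in the table can be reproduced from the matrix scores. The sketch below assumes the BLASTP default matrix, BLOSUM62, and treats a substitution as conservative when its score is positive, which is how BLAST marks a position with "+". The eight scores were transcribed by hand from the standard BLOSUM62 matrix, so check them against the matrix your own search used.

```python
# BLOSUM62 scores for the eight residue pairs in the table, transcribed by hand
# from the standard matrix; verify against the matrix your BLAST run used.
BLOSUM62_SUBSET = {
    ("G", "N"): 0,
    ("A", "P"): -1,
    ("S", "A"): 1,
    ("Q", "S"): 0,
    ("N", "G"): 0,
    ("A", "S"): 1,
    ("Q", "E"): 2,
    ("E", "Q"): 2,
}

# (query position, Drosophila residue, human residue), as listed in the table.
SUBSTITUTIONS = [
    (50, "G", "N"), (114, "A", "P"), (121, "S", "A"), (141, "Q", "S"),
    (143, "N", "G"), (165, "A", "S"), (166, "Q", "E"), (168, "E", "Q"),
]

def classify(fly, human):
    """Call a substitution conservative when its matrix score is positive."""
    return "conservative" if BLOSUM62_SUBSET[(fly, human)] > 0 else "nonconservative"

for position, fly, human in SUBSTITUTIONS:
    print(position, f"{fly}->{human}", BLOSUM62_SUBSET[(fly, human)], classify(fly, human))

results = [classify(fly, human) for _, fly, human in SUBSTITUTIONS]
```

Note that a score of zero is not counted as positive, which is why a change such as G to N is listed as nonconservative.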

VI. Phenotype

A. FlyBase

It is a good idea to look through the FlyBase entry on eyeless. You will find that the gene is named for the first obvious phenotypic difference that mutant flies exhibited: they have small eyes, or sometimes no eyes. There are dominant alleles that produce small eyes when heterozygous. These are lethal when homozygous, with the affected flies dying as pupae.

B. OMIM

The human ortholog, PAX6, also has mutant alleles associated with eye defects in people. This is evident from the top of the OMIM entry for PAX6, shown in the screenshot below. In that entry, clicking any of the MIM numbers will take you to the corresponding phenotype entry.

PAX6 OMIM

C. Mouse Genome Informatics (MGI)

The mouse ortholog, Pax6, has mutant alleles associated with dominant small eyes and recessive lethality. There is a lot to learn about Pax6 at MGI.