A First Look at D. ananassae fosmid 1049B07

This page shows the initial steps in annotating D. ananassae fosmid 1049B07, including the construction of a gene model for a single gene. It illustrates the following steps:

1. Viewing the fosmid in the GEP UCSC genome browser
2. Finding D. melanogaster orthologs to genes on the fosmid using BLASTX
3. Finding the correct gene symbol for a D. melanogaster ortholog in FlyBase
4. Downloading transcript and CDS information for the D. melanogaster ortholog from the Gene Record Finder
5. Locating the orthologous exons on the fosmid
6. Locating the orthologous CDS on the fosmid
7. Annotating Exon 4, including dealing with a gene on the minus strand
8. Using RNA-Seq data as a tool in annotation
9. Using the Gene Model Checker on a single exon to check our work
10. Comparison of genomic sequences for D. melanogaster and D. ananassae
11. Graphical representation of the D. melanogaster and D. ananassae gene models
12. Species distribution of gain of the intron

1. Viewing the fosmid in the GEP UCSC genome browser

Here is the GEP UCSC genome browser view of D. ananassae fosmid 1049B07 from the 3L control region.

GEP UCSC

We would like to show how to annotate genes on the minus strand in this example. Notice the cluster of aligned D. melanogaster proteins on the right. The direction of transcription is from right to left in this view (note the arrowheads in the introns). This means that this gene is on the minus strand.


2. Finding D. melanogaster orthologs to genes on the fosmid using BLASTX

Next we carry out a BLASTX search of the nr database of D. melanogaster proteins using the sequence of D. ananassae fosmid 1049B07 as the query. BLASTX translates the query in all six reading frames before comparing it to the protein database. The top hits are shown below.

BLASTX

The top hit is the Octopamine-Tyramine receptor. The alignment is shown below.

BLASTX_align

Notice that the alignment is in two segments, to frame -3 and -2.


3. Finding the correct gene symbol for a D. melanogaster ortholog in FlyBase

We go to FlyBase to discover the gene symbol. A search for "octopamine" gives the results shown below.

FlyBase octopamine

Only one gene, Oct-TyrR, is in the 79B region.


4. Downloading transcript and CDS information for the D. melanogaster ortholog from the Gene Record Finder

We go to the Gene Record Finder at GEP to retrieve the transcript and CDS information for this gene. The results are shown below.

Gene Record Finder

Gene Record Finder

Gene Record Finder

Gene Record Finder

We retrieve the transcript and CDS data.


5. Locating the orthologous exons on the fosmid

We will use the transcript data to begin annotating the Oct-TyrR gene on the fosmid.

We carry out five bl2seq BLASTN alignments of the D. ananassae fosmid 1049B07 (the subject sequence) and the individual D. melanogaster exon sequences from the transcript data (the query sequences).

>Oct-TyrR:1
CGAATTGGCGTTGGGTGTGGACGCGAGTGGCGGAAGCAGGCGGCGTTGAT
ATATACGAGCTCTTCCATCTTTCGTGATGCGGTATTAGAAGCTGGCAGCT
CAGAGATTCCCGCAAAGTTTAAGTGACAATTTGCCAGCCAACAACAACTT
TCCGACGCAGGCAACGCAAATTGAATTCGATTCGATTCGATTGTCTCTCG
ATCTTCATCAATTCATTACGCACAGGAAAAGAGGGCGAACCGTAAAGTTC
TGGTGAAAAAGTTTCCTGGGCTCCGTTGGCGTGGCAAAGCGACCGAAACC
AAAACGAAATTTTGAAAATGAGCTTTGCTAACGACGGGCCAAACCAATTA
ACAGAATTCGTTCTTGTGTAATAAATAAATTGCCAACAATTATAACTTGC
AGTCCACTCAGGCATATTCAAATGAAATGTGCCACAAAATGTTTACGGTC
ATTGCAACTCAAAGCGACAGACCATAGACGAGGTGCAAGGTGTTGTGGCA
GTTGCAGAAAAACTAAAAGAAAGCCGTAAGGCTTGACCAAAAATTAATAA
CTGATAAAAGCAG

Results: No significant match.


>Oct-TyrR:2
AAATAAGTCAAAGAAGTCGGGGAAATCGCACTCAACGTCCGCCTTTCCAC
CAAGACGCATGTAAACGCAACCGGAGCCCAAAGAAGGCAAGTGGCAGGGC
AGGGAAAGATGCCATCGGCAGATCAGATCCTGTTTGTAAATGTCACCACA
ACGGTGGCGGCGGCGGCTCTAACCGCTGCGGCCGCCGTCAGCACCACAAA
GTCCGGAAGCGGCAACGCCGCACGGGGCTACACGGATTCGGATGACGATG
CGGGCATGGGAACGGAGGCGGTGGCTAACATATCCGGCTCGCTGGTGGAG
GGCCTGACCACCGTTACCGCGGCATTGAGTACGGCTCAGGCGGACAAGGA
CTCAGCGGGAGAATGCGAAGGAGCTGTGGAGGAGCTGCATGCCAGCATCC
TGGGCCTCCAGCTGGCTGTGCCGGAGTGGGAG

Results are shown below.

Oct-TyrR exon2


>Oct-TyrR:3
TCAAAGAAGTCGGGGAAATCGCACTCAACGTCCGCCTTTCCACCAAGACG
CATGTAAACGCAACCGGAGCCCAAAGAAGGCAAGTGGCAGGGCAGGGAAA
GATGCCATCGGCAGATCAGATCCTGTTTGTAAATGTCACCACAACGGTGG
CGGCGGCGGCTCTAACCGCTGCGGCCGCCGTCAGCACCACAAAGTCCGGA
AGCGGCAACGCCGCACGGGGCTACACGGATTCGGATGACGATGCGGGCAT
GGGAACGGAGGCGGTGGCTAACATATCCGGCTCGCTGGTGGAGGGCCTGA
CCACCGTTACCGCGGCATTGAGTACGGCTCAGGCGGACAAGGACTCAGCG
GGAGAATGCGAAGGAGCTGTGGAGGAGCTGCATGCCAGCATCCTGGGCCT
CCAGCTGGCTGTGCCGGAGTGGGAG

Results are shown below.

Oct-TyrR exon3


>Oct-TyrR:4
GCCCTTCTCACCGCCCTGGTTCTCTCGGTCATTATCGTGCTGACCATCAT
CGGGAACATCCTGGTGATTCTGAGTGTGTTCACCTACAAGCCGCTGCGCA
TCGTCCAGAACTTCTTCATAGTTTCGCTGGCGGTGGCCGATCTCACGGTG
GCCCTTCTGGTGCTGCCCTTCAACGTGGCTTACTCGATCCTGGGGCGCTG
GGAGTTCGGCATCCACCTGTGCAAGCTGTGGCTCACCTGCGACGTGCTGT
GCTGCACTAGCTCCATCCTGAACCTGTGTGCCATAGCCCTCGACCGGTAC
TGGGCCATTACGGACCCCATCAACTATGCCCAGAAGAGGACCGTTGGTCG
CGTCCTGCTCCTCATCTCCGGGGTGTGGCTACTTTCGCTGCTGATAAGTA
GTCCGCCGTTGATCGGCTGGAACGACTGGCCGGACGAGTTCACAAGCGCC
ACGCCCTGCGAGCTGACCTCGCAGCGAGGCTACGTGATCTACTCCTCGCT
GGGCTCCTTCTTTATTCCGCTGGCCATCATGACGATCGTCTACATCGAGA
TCTTCGTGGCCACGCGGCGCCGCCTAAGGGAGCGAGCCAGGGCCAACAAG
CTTAACACGATCGCTCTGAAGTCCACTGAGCTCGAGCCGATGGCAAACTC
CTCGCCCGTCGCCGCCTCCAACTCCGGCTCCAAGTCGCGTCTCCTAGCCA
GCTGGCTTTGCTGCGGCCGGGATCGGGCCCAGTTCGCCACGCCTATGATC
CAGAACGACCAGGAGAGCATCAGCAGTGAAACCCACCAGCCGCAGGATTC
CTCCAAAGCGGGTCCCCATGGCAACAGCGATCCCCAACAGCAGCACGTGG
TCGTGCTGGTCAAGAAGTCGCGTCGCGCCAAGACCAAGGACTCCATTAAG
CACGGCAAGACCCGTGGTGGCCGCAAGTCGCAGTCCTCGTCCACATGCGA
GCCCCACGGCGAGCAACAGCTCTTACCCGCCGGCGGGGATGGCGGTAGCT
GCCAGCCCGGCGGAGGCCACTCTGGAGGCGGAAAGTCGGACGCCGAGATC
AGCACGGAGAGCGGGAGCGATCCCAAAGGTTGCATACAG

Results are shown below.

Oct-TyrR exon4


>Oct-TyrR:5
GTCTGCGTGACTCAGGCGGACGAGCAAACGTCCCTAAAGCTGACCCCGCC
GCAATCCTCGACGGGAGTCGCTGCCGTTTCTGTCACTCCGTTGCAGAAGA
AGACTAGTGGGGTTAACCAGTTCATTGAGGAGAAACAGAAGATCTCGCTT
TCCAAGGAGCGGCGAGCGGCTCGCACCCTGGGCATCATCATGGGCGTGTT
CGTCATCTGCTGGCTGCCCTTCTTCCTCATGTACGTCATTCTGCCCTTCT
GCCAGACCTGCTGCCCCACGAACAAGTTCAAGAACTTCATCACCTGGCTG
GGCTACATCAACTCGGGCCTGAATCCGGTCATCTACACCATCTTCAACCT
GGACTACCGCCGGGCCTTCAAGCGACTTCTGGGCCTGAATTGAGGCTGGC
TGGCGGGGGCGGGTGGAAGATATAAACCGGGCCAATCATGGTTCGAGCGG
GAGAGCGTAACTCAAAGTTTGTGCCAAACTTAAATGGTCGTGTATGCGTT
CAGCGGAGATCTCAGTCTATGTACAGTTGGACCCCCAGTTGATGAACTTC
CGAGTTCAACTTCTCTAACACATATATACTTTCAAATGCGTTCTTGGTGA
ACTCATTTTGAAGAAGTGGATGAATTTGGTAAAGTGTAATAGATTGAATA
TAATTTTTAATGTTTAACGTTTCGGCAAAGTGAAAAGCCCCCACATTGGA
AAGTCAAAGATGAGACTCGAGTGTATATATAGTTTCAAACTAAGTTATTA
TTTCTAGCCGTAATTAAAATACTTTCATTTAGTTTTGAACATTTTTTTAA
TATATTGTTGTTTGGAATCGATTGAGATGTACCACCACATTAAGCGTAGA
TTGTTCAATACTCATACTAAAATGGGTTGTGCTGCGATTAAAGTGAGGAT
GTTGCCTCAAGGCACAGCTACTAGGAAAATCATAAAAATTACATGGTAAA
GAATTATACATGCATTATACTCCAGCTAAGTGGCATCCCAAACGAGAATA
GCATCAAATTGAATTTAATACAATTAAATTAAATGTTTAGGCACAAAGAA
TTGTGGCAACTTTCGTGTTTCACCCTAAGCGTATGGATAACCAAAAAGGT
GTTTGTTAAATTAAATCTGCGCTCAAGATATGTAAGCAACTACTAAGCTA
AATAATAACTTCCAAGAGAGAAACGTTTTCTAGGCATTACTTTAACGATT
TGTATTTATATGTACTTTAATTGTAGGTAAACGATAAACCACTATACCTA
ATGTATACTTTCAAATACGCTTTGGACTATTTGTTAAATAATTTAACGAT
TAATTGTTTTTATGGCATAGCAACTATTGTGTTGAGTGGGCAGCTTAAAG
CTAGCACATCGAAACTTACTTAAGGTAGATAAATGTTTAACTGCACGTTA
CGAAATGCAACAGAGTTGGCGAAAGGACGTAATTCAATGGATGTGTTAAC
TCAAGTACATGCTATATCGTAAATGTATATCACAATTTATGTCTTTTAAC
GACGATGTACGATAGTTTCACTAATTATATTGTTTAACGAGAAAGAGCGA
GCAAAGCGTAAATGAAACAAATAAAAGACACATTCGAATTAAAGTTAT

Results are shown below.

Oct-TyrR exon5
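The five alignments above can also be run locally with the NCBI BLAST+ command-line programs, which accept a `-subject` FASTA file in place of a database. The sketch below only builds the command line. The file names are hypothetical placeholders, and running the command assumes that BLAST+ is installed.

```python
from typing import List

def bl2seq_command(program: str, query_fasta: str, subject_fasta: str) -> List[str]:
    """Build a BLAST+ command that aligns one query file to one subject file."""
    if program not in {"blastn", "tblastn", "blastx"}:
        raise ValueError(f"unsupported program: {program}")
    # -outfmt 6 requests tab-separated output, one line per aligned segment.
    return [program, "-query", query_fasta, "-subject", subject_fasta, "-outfmt", "6"]

# Hypothetical file names: one D. melanogaster exon and the fosmid sequence.
cmd = bl2seq_command("blastn", "Oct-TyrR_exon4.fasta", "fosmid_1049B07.fasta")
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

In the tabular output, a hit on the minus strand of the subject is reported with a subject start that is larger than the subject end, which is the coordinate order used in the tables on this page.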


6. Locating the orthologous CDS on the fosmid

We will use the CDS data to begin annotating the Oct-TyrR gene on the fosmid.

We carry out three bl2seq TBLASTN alignments of the D. ananassae fosmid 1049B07 (the subject sequence) and individual D. melanogaster CDS segments (the query sequences) as shown below.

>Oct-TyrR:2_759_0
MPSADQILFVNVTTTVAAAALTAAAAVSTTKSGSGNAARGYTDSDDDAGM
GTEAVANISGSLVEGLTTVTAALSTAQADKDSAGECEGAVEELHASILGL
QLAVPEWE

TBLASTN


>Oct-TyrR:4_759_0
ALLTALVLSVIIVLTIIGNILVILSVFTYKPLRIVQNFFIVSLAVADLTV
ALLVLPFNVAYSILGRWEFGIHLCKLWLTCDVLCCTSSILNLCAIALDRY
WAITDPINYAQKRTVGRVLLLISGVWLLSLLISSPPLIGWNDWPDEFTSA
TPCELTSQRGYVIYSSLGSFFIPLAIMTIVYIEIFVATRRRLRERARANK
LNTIALKSTELEPMANSSPVAASNSGSKSRLLASWLCCGRDRAQFATPMI
QNDQESISSETHQPQDSSKAGPHGNSDPQQQHVVVLVKKSRRAKTKDSIK
HGKTRGGRKSQSSSTCEPHGEQQLLPAGGDGGSCQPGGGHSGGGKSDAEI
STESGSDPKGCIQ

TBLASTN


>Oct-TyrR:5_759_0
VCVTQADEQTSLKLTPPQSSTGVAAVSVTPLQKKTSGVNQFIEEKQKISL
SKERRAARTLGIIMGVFVICWLPFFLMYVILPFCQTCCPTNKFKNFITWL
GYINSGLNPVIYTIFNLDYRRAFKRLLGLN*

TBLASTN
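Each CDS query above is the conceptual translation of the coding part of the corresponding D. melanogaster exon. As a minimal sketch of that relationship, the code below translates the first line of the Oct-TyrR:4 exon sequence with the standard genetic code and reproduces the start of the Oct-TyrR:4_759_0 peptide. The function and variable names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str) -> str:
    """Translate complete codons from the first base; trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

# First 50 bases of the D. melanogaster Oct-TyrR:4 exon (phase 0, entirely coding).
exon4_first_line = "GCCCTTCTCACCGCCCTGGTTCTCTCGGTCATTATCGTGCTGACCATCAT"
print(translate(exon4_first_line))  # ALLTALVLSVIIVLTI

# The coding part of Oct-TyrR:5 ends with ...CTG AAT TGA, i.e. L, N, stop.
print(translate("CTGAATTGA"))  # LN*
```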


This gives us the information shown below.

Dmel Exon       Dana Start   Dana End
Oct-TyrR:1      N/A          N/A
Oct-TyrR:2      23850        23539
Oct-TyrR:3      23850        23539
Oct-TyrR:4      23538        22465
Oct-TyrR:5 A    22354        21938
Oct-TyrR:5 B    21362        20759

Dmel Exon                    Dana Start   Dana End   Frame
CDS 2 (Oct-TyrR:2_759_0)     23850        23539      -3
CDS 4 (Oct-TyrR:4_759_0)     23487        22465      -3
CDS 5 (Oct-TyrR:5_759_0)     22354        21962      -2

Notice that the coordinates of the start of the coding segments on the fosmid are larger than the coordinates of the end of the coding segments. When we read the minus strand, we move from right to left, that is, from high coordinates to low coordinates.

There is an unusual feature here. Notice that Exon 3 and Exon 4 from the D. melanogaster transcript align to a single continuous sequence: the alignment of D. melanogaster Exon 3 ends at 23539 on the fosmid; the alignment of D. melanogaster Exon 4 begins at 23538. There is no intron separating Exons 3 and 4 in D. ananassae.
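The coordinate reasoning in the two paragraphs above can be sketched in a few lines. The helper names are illustrative; the coordinates are taken from the BLASTN table above.

```python
def span_length(start: int, end: int) -> int:
    """Number of bases in an inclusive span, on either strand."""
    return abs(start - end) + 1

def gap_between(prev_end: int, next_start: int) -> int:
    """Number of bases strictly between two consecutive aligned segments."""
    return abs(prev_end - next_start) - 1

# (Dana Start, Dana End); start > end because the gene is on the minus strand.
exons = {
    "Oct-TyrR:3": (23850, 23539),
    "Oct-TyrR:4": (23538, 22465),
    "Oct-TyrR:5 A": (22354, 21938),
}
for name, (start, end) in exons.items():
    print(name, span_length(start, end), "bp")

gap_3_4 = gap_between(exons["Oct-TyrR:3"][1], exons["Oct-TyrR:4"][0])
gap_4_5 = gap_between(exons["Oct-TyrR:4"][1], exons["Oct-TyrR:5 A"][0])
print(gap_3_4, gap_4_5)  # 0 110
```

A gap of 0 means the two alignments abut on the fosmid, so there is no room for an intron between them. The 110 bases between the Exon 4 and Exon 5 alignments are where the single D. ananassae intron lies.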


7. Annotating Exon 4, including dealing with a gene on the minus strand

Now we go back to the GEP UCSC genome browser to locate the beginning and end of the first exon. Part of the view is shown below.

GEP UCSC genome browser

We would like to locate the ATG that begins at 23,850 according to our alignment. Notice in the screenshot above that the D. melanogaster Oct-TyrR isoforms align to this position.

We use scroll and zoom to move in closer, as shown below.

GEP UCSC genome browser

At first glance, this looks OK. We have an ATG and a methionine at about the right position. However, we know that the alignment is to the reading frame -3. We are viewing an ATG in reading frame +2. We can read the ATG in frame +2 from left to right, but the CDS for this gene is read from right to left. Note the red arrow pointing to the small arrow in the upper left. Clicking the small arrow changes the display as shown below.

GEP UCSC genome browser

Now we see a methionine in frame -3, and we can read the ATG from right to left. The gray text for the DNA sequence (as opposed to the black text in the previous screenshot) shows that we are reading the minus strand.

We confirm that the position of the first base of the ATG is 23850.
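A gene on the minus strand is the reverse complement of the plus strand shown by default, so its start codon appears as CAT when the plus strand is read from left to right. The sketch below, with illustrative helper names, shows this and lists the three fosmid positions occupied by a minus-strand codon whose first base is 23850.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def minus_strand_codon(first_base: int) -> tuple:
    """Coordinates of a minus-strand codon, in the order the codon is read."""
    return (first_base, first_base - 1, first_base - 2)

print(minus_strand_codon(23850))   # (23850, 23849, 23848)
print(reverse_complement("ATG"))   # CAT: the start codon as seen on the plus strand
print(reverse_complement("GT"), reverse_complement("AG"))  # AC CT: donor and acceptor
```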

While the first segment of our D. melanogaster CDS alignment ends at 23539, reading frame -3 is open all the way through the end of the second aligning segment at 22465. Let's see if we can find the end of the second aligning segment. We use scroll and zoom to arrive at the view shown below.

GEP UCSC genome browser


8. Using RNA-Seq data as a tool in annotation

There is a high-scoring predicted donor site immediately after 22465. Because the gene is on the minus strand, we read the GT from right to left. The RNA-Seq tracks support a splice here, and all of the gene prediction programs also call a splice at this position. Using frame -3, we see that the next exon is phase 0.
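The phase can be checked by arithmetic. This sketch assumes the convention in which the donor phase is the number of coding bases left over after the last complete codon of the exon, and the following acceptor must supply the bases that complete that codon. The function names are illustrative.

```python
def donor_phase(coding_bases_before_donor: int) -> int:
    """Bases left over after the last complete codon at the end of an exon."""
    return coding_bases_before_donor % 3

def acceptor_phase_needed(donor: int) -> int:
    """Bases the next exon must contribute to complete the split codon."""
    return (3 - donor) % 3

# Coding bases from the first base of the ATG (23850) through 22465, inclusive.
first_exon_cds = 23850 - 22465 + 1
phase = donor_phase(first_exon_cds)
print(first_exon_cds, phase, acceptor_phase_needed(phase))  # 1386 0 0
```

Since 1386 is a multiple of 3, the exon ends on a complete codon and the next exon starts with a complete codon.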

We zoom out and turn off the detailed RNA-Seq track to get the view below.

GEP UCSC genome browser


9. Using the Gene Model Checker on a single exon to check our work

The modENCODE RNA-Seq data support the idea that the first exon runs from 23850-22465. We can use the Gene Model Checker to verify Exon 4, as shown below.

Gene Model Checker

Our model for Exon 4 looks good. The dotplot is shown below. Notice that the dotplot is color-coded to conform to the D. melanogaster exons. Our model, which merges the first and second exons, aligns perfectly with the D. melanogaster protein. This appears to be one of those rare instances when we have lost an intron in D. ananassae, or gained one in D. melanogaster. The alignment is offset a bit near the end of our first exon. This seems to be a loss of some amino acids in D. ananassae relative to D. melanogaster at this position.

Gene Model Checker

Here is the last exon.

GEP UCSC genome browser

There is a high-scoring predicted acceptor site whose AG ends at 22355. Most of the gene predictors call a splice here, which is supported by the RNA-Seq data. The exon begins at 22354. Using frame -2, we see that this exon is phase 0 as predicted.

We move to the end of the second exon, as shown below.

GEP UCSC genome browser

The end of the coding sequence in frame -2 is 21965. The protein alignment ends here exactly. The stop is 21964-21962. The gene predictors call the stop here.
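As a quick consistency check on these coordinates, the sketch below derives the stop codon position from the last coding base and confirms that the two coding blocks add up to a whole number of codons. The helper name is illustrative. The D. melanogaster peptide lengths (108, 363, and 130 residues) are counted from the CDS sequences shown in step 6.

```python
def stop_codon_minus(last_coding_base: int) -> tuple:
    """Start and end of the stop codon that follows a minus-strand CDS."""
    return (last_coding_base - 1, last_coding_base - 3)

first_block = 23850 - 22465 + 1    # ATG through the donor site
second_block = 22354 - 21965 + 1   # acceptor site through the last coding base
total = first_block + second_block

print(stop_codon_minus(21965))          # (21964, 21962)
print(first_block, second_block, total) # 1386 390 1776
print(total % 3, total // 3)            # 0 592

dmel_residues = 108 + 363 + 130         # CDS 2 + CDS 4 + CDS 5
print(dmel_residues - total // 3)       # 9
```

The second block is 130 codons, the same length as the D. melanogaster CDS 5 peptide. The difference of 9 residues therefore lies in the first block, which is consistent with the offset seen near the end of the first exon in the dot plot.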

Here is a check using the Gene Model Checker. We are warned that the number of coding exons does not match the D. melanogaster ortholog. The dot plot looks really good, giving us confidence in the model, but it is necessary to look more carefully into this.

Gene Model Checker

Gene Model Checker


10. Comparison of genomic sequences for D. melanogaster and D. ananassae

Here is the D. melanogaster genomic sequence for the region including Oct-TyrR from FlyBase.

>3L:22053241,22058240
agttcagtttatgagcgggttattttgattaatgttccttcaactggtgtcaactggtag
gcggacaaaaaggagcccgatagcaaaattcagcaaagctggtaaggacttagtggatag
tgaagtgtcaggcggaagtgaaaagagcagagcaacaattattcgatgaccacagaaggg
gagtggggagcgtattgacttagggcaaaaagtgatatccctgaacatcttagaaccctc
tttccttatcgacacacatttacttaggaccatctttttgacccaattcgttctgacatt
tttaattacttttagataattaatcccaagttcagataattcgagaagatttgtctgtaa
aattgttagggaccttcccaccgtgattcctttaggttttcttttgatggtcccttctat
cgaatgtcctttgggcttattgacaagctgtaaataatttggcatgacatgatcctgata
gccttgtgcggtaactttgtccaccctgatccgtaagtctgaactccttctaaacgcaat
ttgtaaacctataaaaaaagaaaaaagcagctctcatattctttcgtaattatgtgaatg
caagttaaaagtttaagccgattgagccacaattttgtctttccaattgatattagcaaa
cagctatggtatagttattttaatacaatgccatctccatacatggttttttactctctg
cgtacaatacagttaagttaagtaagccagaaaatttcaacgaaaattcccgcctcccaa
cgaagtgttgaagtcacaataatttgctttcctctcaattttcctttcgcagaaataagt
caaagaagtcggggaaatcgcactcaacgtccgcctttccaccaagacgcatgtaaacgc
aaccggagcccaaagaaggcaagtggcagggcagggaaagatgccatcggcagatcagat
cctgtttgtaaatgtcaccacaacggtggcggcggcggctctaaccgctgcggccgccgt
cagcaccacaaagtccggaagcggcaacgccgcacggggctacacggattcggatgacga
tgcgggcatgggaacggaggcggtggctaacatatccggctcgctggtggagggcctgac
caccgttaccgcggcattgagtacggctcaggcggacaaggactcagcgggagaatgcga
aggagctgtggaggagctgcatgccagcatcctgggcctccagctggctgtgccggagtg
ggaggtaataagcacaatgatagtatttcatttcattactaataataacggatagcttct
gcctttttgttttttattaacgtttgacagcaaaaataaaatattcacatgcaaacctta
ccttttaaaaaattgcgttataaatccccttttacttttgaccattattttgatcaagta
tttgtatacccaaccgcctgtttgttcgtacttctgtctgtccgtccgtatatctgtccg
tatgaacagaaactataaaagaatgtacaacaggaagttgagattaagcatattggtttt
agttaacttttacgcagtgcaaccaaaatttaaaaaaatgttataattttaatagattct
atcatttgtcttgccaatttctattgaaagttcgcctgccacaaatttcacttatattaa
gtgtgaccgtcaccacgtagtgcaagtgaccctccctagaccaccaaactggtagtttct
actctttcaataagtctaaataccagtagatcttatttatgttggtttaaagctctccct
aacttcttttctattttaactaggcccttctcaccgccctggttctctcggtcattatcg
tgctgaccatcatcgggaacatcctggtgattctgagtgtgttcacctacaagccgctgc
gcatcgtccagaacttcttcatagtttcgctggcggtggccgatctcacggtggcccttc
tggtgctgcccttcaacgtggcttactcgatcctggggcgctgggagttcggcatccacc
tgtgcaagctgtggctcacctgcgacgtgctgtgctgcactagctccatcctgaacctgt
gtgccatagccctcgaccggtactgggccattacggaccccatcaactatgcccagaaga
ggaccgttggtcgcgtcctgctcctcatctccggggtgtggctactttcgctgctgataa
gtagtccgccgttgatcggctggaacgactggccggacgagttcacaagcgccacgccct
gcgagctgacctcgcagcgaggctacgtgatctactcctcgctgggctccttctttattc
cgctggccatcatgacgatcgtctacatcgagatcttcgtggccacgcggcgccgcctaa
gggagcgagccagggccaacaagcttaacacgatcgctctgaagtccactgagctcgagc
cgatggcaaactcctcgcccgtcgccgcctccaactccggctccaagtcgcgtctcctag
ccagctggctttgctgcggccgggatcgggcccagttcgccacgcctatgatccagaacg
accaggagagcatcagcagtgaaacccaccagccgcaggattcctccaaagcgggtcccc
atggcaacagcgatccccaacagcagcacgtggtcgtgctggtcaagaagtcgcgtcgcg
ccaagaccaaggactccattaagcacggcaagacccgtggtggccgcaagtcgcagtcct
cgtccacatgcgagccccacggcgagcaacagctcttacccgccggcggggatggcggta
gctgccagcccggcggaggccactctggaggcggaaagtcggacgccgagatcagcacgg
agagcgggagcgatcccaaaggttgcatacaggtaagactcttcgtaatcgtgaagatat
tgcaagcccataattgttttacacatttattttaataaaaatatttgtaataaagggtct
tgtgtttacaactgctaataagtctagataataaatacatatacatattatacccgttac
ttatcgagtataagggtatactagattcgttgaaaagtatttaacaggccgactatataa
agtatataaattcttgatcaggatcaaaagccgagtcgatctaggtatgtccgtctgtcc
gattgtccgtatgaacgttgagatctcaggaaatataaaaactagaaggctcagacaaag
acgcagcgcaactttgactcatgttgccacgcccacaaaccgcacaaaactgccccgccc
aaacttttgaaaaacgtcttgatattttttcatatttttataagccttgtaaatttctat
ctatttgccacgctagggtatctaatattcagggaaatcgacaaaaaaattaaaaatcta
aagtgctcttttatttagagagccacatggaacaagcgatataagttgtataagttgtat
gggagaaatggctttttccatgccacttgattaattaaacttttcctcacctcgattcca
ggtctgcgtgactcaggcggacgagcaaacgtccctaaagctgaccccgccgcaatcctc
gacgggagtcgctgccgtttctgtcactccgttgcagaagaagactagtggggttaacca
gttcattgaggagaaacagaagatctcgctttccaaggagcggcgagcggctcgcaccct
gggcatcatcatgggcgtgttcgtcatctgctggctgcccttcttcctcatgtacgtcat
tctgcccttctgccagacctgctgccccacgaacaagttcaagaacttcatcacctggct
gggctacatcaactcgggcctgaatccggtcatctacaccatcttcaacctggactaccg
ccgggccttcaagcgacttctgggcctgaattgaggctggctggcgggggcgggtggaag
atataaaccgggccaatcatggttcgagcgggagagcgtaactcaaagtttgtgccaaac
ttaaatggtcgtgtatgcgttcagcggagatctcagtctatgtacagttggacccccagt
tgatgaacttccgagttcaacttctctaacacatatatactttcaaatgcgttcttggtg
aactcattttgaagaagtggatgaatttggtaaagtgtaatagattgaatataattttta
atgtttaacgtttcggcaaagtgaaaagcccccacattggaaagtcaaagatgagactcg
agtgtatatatagtttcaaactaagttattatttctagccgtaattaaaatactttcatt
tagttttgaacatttttttaatatattgttgtttggaatcgattgagatgtaccaccaca
ttaagcgtagattgttcaatactcatactaaaatgggttgtgctgcgattaaagtgagga
tgttgcctcaaggcacagctactaggaaaatcataaaaattacatggtaaagaattatac
atgcattatactccagctaagtggcatcccaaacgagaatagcatcaaattgaatttaat
acaattaaattaaatgtttaggcacaaagaattgtggcaactttcgtgtttcaccctaag
cgtatggataaccaaaaaggtgtttgttaaattaaatctgcgctcaagatatgtaagcaa
ctactaagctaaataataacttccaagagagaaacgttttctaggcattactttaacgat
ttgtatttatatgtactttaattgtaggtaaacgataaaccactatacctaatgtatact
ttcaaatacgctttggactatttgttaaataatttaacgattaattgtttttatggcata
gcaactattgtgttgagtgggcagcttaaagctagcacatcgaaacttacttaaggtaga
taaatgtttaactgcacgttacgaaatgcaacagagttggcgaaaggacgtaattcaatg
gatgtgttaactcaagtaca

Here is a bl2seq alignment of the sequences using BLASTN.

Dot plot

We see alignments to the D. ananassae fosmid in the range from around 21,000 - 24,000 as noted earlier in the BLAST alignments of the D. melanogaster transcript.

Using only part of the fosmid as the subject sequence gives us a closer look at the aligning segments, shown below.

Dot plot

The alignments are in three gapped segments. Starting at the top left, the first two segments are in line. The middle segment is on a separate line. The gapped alignment in the lower right is on a third line. This shows that there are three segments of conserved sequence between D. melanogaster and D. ananassae, which is what we saw from the CDS alignments using BLAST.

Note the position on the X axis of the end of the second aligned segment and the beginning of the third. There is a substantial insertion in the D. melanogaster sequence relative to the D. ananassae sequence near 3,000 bp on the X axis. This is the missing intron in D. ananassae.

If this is correct, the position of this spot on the D. ananassae fosmid can be read from the first dotplot. It is located near 23,500 on the fosmid.
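The way an insertion shows up in a dot plot can be illustrated with a toy example. The sketch below is not the BLAST algorithm: it simply records exact k-mer matches between two short, invented sequences and counts how many fall on each diagonal. One sequence carries a 5-base insertion, so the matches after the insertion move to a diagonal shifted by 5.

```python
from collections import Counter

def diagonal_offsets(x: str, y: str, k: int = 4) -> Counter:
    """Count exact k-mer matches on each diagonal (position in x minus position in y)."""
    index = {}
    for i in range(len(x) - k + 1):
        index.setdefault(x[i:i + k], []).append(i)
    offsets = Counter()
    for j in range(len(y) - k + 1):
        for i in index.get(y[j:j + k], []):
            offsets[i - j] += 1
    return offsets

# Invented toy sequences: two conserved blocks, with an insertion only in x.
block_a, insertion, block_b = "ACGTTGCA", "TTTTT", "GGATCCTA"
with_insertion = block_a + insertion + block_b   # plays the role of D. melanogaster
without_insertion = block_a + block_b            # plays the role of D. ananassae

offsets = diagonal_offsets(with_insertion, without_insertion)
print(sorted(offsets.items()))  # [(0, 5), (5, 5)]
```

The size of the shift between the two diagonals equals the length of the inserted sequence, which is how the dot plot reveals the extra intron.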

This position was noted earlier in our BLASTN alignment of the D. melanogaster transcript to the fosmid:

Dmel Exon Dana Start Dana End
Oct-TyrR:1 N/A N/A
Oct-TyrR:2 23850 23539
Oct-TyrR:3 23850 23539
Oct-TyrR:4 23538 22465
Oct-TyrR:5 A 22354 21938
Oct-TyrR:5 B 21362 20759

Note that there is no gap in D. ananassae fosmid coordinates between the end of D. melanogaster Exon 3 and the beginning of Exon 4. We can use BLASTN to obtain the coordinates of the D. melanogaster exons on the assembly by using the D. melanogaster exons as query sequences and the D. melanogaster assembly as the subject sequence.

For this search, the query sequences are the exons from the D. melanogaster transcripts, the subject sequence is NCBI genomes restricted to Drosophila melanogaster, and the program is optimized for highly similar sequences (megablast). The results are shown below.

genomic megaBLAST

genomic megaBLAST

genomic megaBLAST

genomic megaBLAST

genomic megaBLAST

Tabulating the coordinates of the D. melanogaster alignments together with our earlier results gives us the results shown below.

Dmel Exon Dmel Start Dmel End Dana Start Dana End
Oct-TyrR:1 22028103 22028665 N/A N/A
Oct-TyrR:2 22054073 22054504 23850 23539
Oct-TyrR:3 22054080 22054504 23850 23539
Oct-TyrR:4 22055064 22056152 23538 22465
Oct-TyrR:5 A 22056782 22058379 22354 21938
Oct-TyrR:5 B 22056782 22058379 21362 20759

These results show that:

1. D. melanogaster Exons 2 and 3 are nearly identical, with the same 3' end on both the D. melanogaster assembly and on the D. ananassae fosmid. The 5' ends differ by 7 bp on the D. melanogaster assembly, but the alignment to the D. ananassae fosmid ends at the same position for both exons.

2. There is an intron in D. melanogaster between the 3' end of Exon 2/3 and the 5' end of Exon 4. This intron is 559 bp in D. melanogaster (22054505 to 22055063), but is missing in D. ananassae.

3. The two segments of D. ananassae that align to Exon 5 of D. melanogaster are separated by a length of sequence that aligns poorly due to divergence, but is contiguous. This is best seen in the dotplots, where the first and second segments of aligning sequence in the upper left are on the same line.
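The arithmetic behind these observations can be reproduced from the D. melanogaster columns of the table. In this sketch the variable names are illustrative. Note that the Exon 4 start and the Exon 2/3 end differ by 560, so 559 bases lie strictly between them.

```python
# (Dmel Start, Dmel End) on the plus strand, from the table above.
dmel = {
    "Oct-TyrR:1": (22028103, 22028665),
    "Oct-TyrR:2": (22054073, 22054504),
    "Oct-TyrR:3": (22054080, 22054504),
    "Oct-TyrR:4": (22055064, 22056152),
    "Oct-TyrR:5": (22056782, 22058379),
}

# Inclusive lengths; these match the lengths of the exon sequences in step 5.
lengths = {name: end - start + 1 for name, (start, end) in dmel.items()}
print(lengths)

# Exons 2 and 3 share a 3' end and differ only at the 5' end.
five_prime_shift = dmel["Oct-TyrR:3"][0] - dmel["Oct-TyrR:2"][0]

# Bases strictly between consecutive exons.
intron_3_4 = dmel["Oct-TyrR:4"][0] - dmel["Oct-TyrR:3"][1] - 1
intron_4_5 = dmel["Oct-TyrR:5"][0] - dmel["Oct-TyrR:4"][1] - 1
print(five_prime_shift, intron_3_4, intron_4_5)  # 7 559 629
```

On the D. ananassae fosmid the corresponding Exon 3 and Exon 4 alignments end at 23539 and begin at 23538, leaving no bases between them.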


11. Graphical representation of the D. melanogaster and D. ananassae gene models

The model is easier to understand in a drawing, shown below.

Gene Model

The drawing (introns not to scale) shows the D. melanogaster gene model on the top line and the D. ananassae gene model below. The D. melanogaster gene has a noncoding Exon 1, two very similar Exons 2 and 3 (shown as a single exon in the model) in which the CDS begins, a coding Exon 4, and the final coding Exon 5. The D. ananassae model has two blocks of CDS separated by a single intron.


12. Species distribution of gain of the intron

It is interesting to investigate which Drosophila species contain the intron found in D. melanogaster but not in D. ananassae. To explore this question, we constructed a single protein sequence that joins the CDS of Exons 2/3 to the CDS of Exon 4 from D. melanogaster, shown below.

>Oct-TyrR:2-4_hybrid
MPSADQILFVNVTTTVAAAALTAAAAVSTTKSGSGNAARGYTDSDDDAGM
GTEAVANISGSLVEGLTTVTAALSTAQADKDSAGECEGAVEELHASILGL
QLAVPEWE
ALLTALVLSVIIVLTIIGNILVILSVFTYKPLRIVQNFFIVSLAVADLTV
ALLVLPFNVAYSILGRWEFGIHLCKLWLTCDVLCCTSSILNLCAIALDRY
WAITDPINYAQKRTVGRVLLLISGVWLLSLLISSPPLIGWNDWPDEFTSA
TPCELTSQRGYVIYSSLGSFFIPLAIMTIVYIEIFVATRRRLRERARANK
LNTIALKSTELEPMANSSPVAASNSGSKSRLLASWLCCGRDRAQFATPMI
QNDQESISSETHQPQDSSKAGPHGNSDPQQQHVVVLVKKSRRAKTKDSIK
HGKTRGGRKSQSSSTCEPHGEQQLLPAGGDGGSCQPGGGHSGGGKSDAEI
STESGSDPKGCIQ

This sequence was used to carry out a TBLASTN search of genus Drosophila (subgenus Sophophora and subgenus Drosophila) at FlyBase. The graphical summary is shown below.

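In the graphical summary, a break (shown as a dotted line) in the most significant alignment near query position 100 indicates the presence of the intron. The Exon 2/3 CDS contributes the first 108 residues of the query, so the junction with the Exon 4 CDS falls at this position.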

Species BLAST
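The rule for reading the graphical summary can be written down as a small test. The junction position comes from the length of the Exon 2/3 CDS. The `spans_junction` helper, its `flank` parameter, and the alignment coordinates passed to it are hypothetical illustrations, not values taken from the FlyBase results.

```python
# D. melanogaster CDS for Exons 2/3 (Oct-TyrR:2_759_0), from step 6.
cds_2 = (
    "MPSADQILFVNVTTTVAAAALTAAAAVSTTKSGSGNAARGYTDSDDDAGM"
    "GTEAVANISGSLVEGLTTVTAALSTAQADKDSAGECEGAVEELHASILGL"
    "QLAVPEWE"
)
junction = len(cds_2)  # last query residue encoded by Exon 2/3
print(junction)        # 108

def spans_junction(q_start: int, q_end: int, junction: int, flank: int = 10) -> bool:
    """True if one aligned segment covers the junction with some margin on both sides."""
    return q_start <= junction - flank and q_end >= junction + flank

# Hypothetical query coordinates for two kinds of result:
print(spans_junction(1, 471, junction))   # one continuous segment: no intron
print(spans_junction(1, 108, junction) or spans_junction(109, 471, junction))  # split at the junction: intron present
```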

It is possible to divide the results into two categories: those with the intron (like D. melanogaster) and those without the intron (like D. ananassae). The results are summarized below.

Intron             No intron
D. melanogaster    D. ananassae
D. simulans        D. pseudoobscura
D. sechellia       D. persimilis
D. yakuba          D. willistoni
D. erecta          D. mojavensis
D. biarmipes       D. grimshawii
D. eugracilis      D. elegans
D. takahashii      D. rhopaloa
                   D. ficusphila
                   D. kikkawai
                   D. bipectinata
                   D. miranda
                   D. virilis

Examination of the taxonomy of the Sequenced Species at FlyBase shows that D. ananassae, a member of the melanogaster group, lacks the intron. It branches off before the melanogaster subgroup, all of whose members have the intron (D. simulans, D. sechellia, D. melanogaster, D. yakuba, and D. erecta). Members of the more distantly related obscura group (D. pseudoobscura and D. persimilis) lack the intron, as does the last remaining species in the Sophophora, D. willistoni. The three species in the distant Drosophila branch, D. mojavensis, D. virilis, and D. grimshawii, also lack the intron.

This suggests that gain of this intron in the melanogaster subgroup occurred after that group branched from D. ananassae around 20 million years ago, and that the ancestral form lacks the intron.