Gene Finding on D. ananassae fosmid 1475K17

Paul Szauter - June 11, 2013

Summary
BLASTX Analysis Graphic summary
BLASTX Analysis Descriptions
BLASTX Analysis Ten-m alignment
BLASTX Analysis Hem alignment
BLASTX Analysis Aats-ile alignment
BLAST Analysis by GEP
BLASTX Analysis at FlyBase
GENSCAN Analysis
UCSC Genome Browser at GEP
Lightning Round Slides

Summary

D. ananassae fosmid 1475K17 encodes three genes: the D. ananassae orthologs of the D. melanogaster genes Hem, Aats-ile, and Ten-m. The Hem and Aats-ile genes are complete, while the 5' end of the Ten-m gene is not on the fosmid. Based on the gene model of the D. melanogaster Ten-m gene, the first and second exons are missing. GENSCAN predicts five peptides encoded on the fosmid. Three of these peptides match the three genes identified by BLASTX analysis, while two (peptides 4 and 5) are invalid predictions.

BLASTX analysis

I used the file fosmid_1475K17.fasta from the src folder in the GEP project file for D. ananassae fosmid 1475K17 as a query sequence in a BLASTX search of Non-redundant protein sequences (nr) restricted to Drosophila melanogaster. All BLAST parameters were left at the default settings.

Graphic summary. The graphic summary is shown below.

BLASTX results

The results suggest that there are three genes on the fosmid with protein sequence similarity to the D. melanogaster protein set.

The left gene (Gene A) has only one good match to D. melanogaster proteins.

The middle gene (Gene B) has one good match to D. melanogaster proteins, with eight additional hits at lower scores. Each of the eight lower-scoring hits covers only a portion of the top hit.

The right gene (Gene C) has multiple good matches. These appear to be isoforms derived from alternative splicing. There are many lower-scoring hits that cover only a portion of the top hit.


Descriptions. All descriptions with E values smaller than 1e-10 are shown below.

BLASTX results


Alignments

Gene C. The first seven alignments in the list of descriptions (tenascin, odd Oz, odz) all align to a region of the fosmid corresponding to Gene C in the graphic summary. The coordinates of the top hit (tenascin major, isoform E) are shown in the table below.

NP_001262211.1    fosmid            alignment
start    end      start    end      frame   E        identity   positive
1425     2533     18566    15240    -1      0.0      98%        99%
2534     3349     15178    12704    -2      0.0      92%        95%
1165     1346     21028    20483    -2      0.0      99%        99%
1026     1164     21534    21118    -3      0.0      97%        98%
933      1033     21941    21639    -1      0.0      95%        98%
1341     1415     20442    20218    -3      0.0      95%        96%
682      933      23192    22362    -1      1e-57    59%        68%
161      307      25759    25316    -2      6e-54    79%        86%
376      667      24499    23609    -2      2e-46    49%        62%
1086     1195     20860    20528    -2      6e-11    39%        52%
1237     1330     21399    21124    -3      1e-09    39%        54%
1140     1234     21387    21118    -3      2e-08    40%        53%
1053     1162     20872    20531    -2      6e-06    35%        50%
1265     1332     21408    21220    -3      0.054    35%        48%

Reordering these by position along NP_001262211.1 gives the results shown below. Alignments with E values larger than 1e-10 (grayed out in the original table) are marked with an asterisk.

NP_001262211.1    fosmid            alignment
start    end      start    end      frame   E        identity   positive
161      307      25759    25316    -2      6e-54    79%        86%
376      667      24499    23609    -2      2e-46    49%        62%
682      933      23192    22362    -1      1e-57    59%        68%
933      1033     21941    21639    -1      0.0      95%        98%
1026     1164     21534    21118    -3      0.0      97%        98%
1053     1162     20872    20531    -2      6e-06    35%        50%        *
1086     1195     20860    20528    -2      6e-11    39%        52%
1140     1234     21387    21118    -3      2e-08    40%        53%        *
1165     1346     21028    20483    -2      0.0      99%        99%
1237     1330     21399    21124    -3      1e-09    39%        54%        *
1265     1332     21408    21220    -3      0.054    35%        48%        *
1341     1415     20442    20218    -3      0.0      95%        96%
1425     2533     18566    15240    -1      0.0      98%        99%
2534     3349     15178    12704    -2      0.0      92%        95%
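The reordering and filtering step can be reproduced with a short script. This is a minimal sketch, not part of the original analysis: the tuples are copied by hand from the table above, and the helper name strand_from_frames is hypothetical.

```python
# HSPs for NP_001262211.1, copied from the table above as
# (protein_start, protein_end, fosmid_start, fosmid_end, frame, e_value).
hsps = [
    (1425, 2533, 18566, 15240, -1, 0.0),
    (2534, 3349, 15178, 12704, -2, 0.0),
    (1165, 1346, 21028, 20483, -2, 0.0),
    (1026, 1164, 21534, 21118, -3, 0.0),
    (933, 1033, 21941, 21639, -1, 0.0),
    (1341, 1415, 20442, 20218, -3, 0.0),
    (682, 933, 23192, 22362, -1, 1e-57),
    (161, 307, 25759, 25316, -2, 6e-54),
    (376, 667, 24499, 23609, -2, 2e-46),
    (1086, 1195, 20860, 20528, -2, 6e-11),
    (1237, 1330, 21399, 21124, -3, 1e-09),
    (1140, 1234, 21387, 21118, -3, 2e-08),
    (1053, 1162, 20872, 20531, -2, 6e-06),
    (1265, 1332, 21408, 21220, -3, 0.054),
]

E_CUTOFF = 1e-10  # threshold used in the text


def strand_from_frames(frames):
    """Infer the strand from BLASTX reading frames (hypothetical helper)."""
    if all(f < 0 for f in frames):
        return "minus"
    if all(f > 0 for f in frames):
        return "plus"
    return "mixed"


# Order the HSPs by their position along the protein.
ordered = sorted(hsps, key=lambda h: (h[0], h[1]))

# Split into alignments at or below the cutoff and weaker ones.
strong = [h for h in ordered if h[5] <= E_CUTOFF]
weak = [h for h in ordered if h[5] > E_CUTOFF]

strand = strand_from_frames([h[4] for h in hsps])

for h in ordered:
    marker = "*" if h[5] > E_CUTOFF else " "
    print(f"{h[0]:5d} {h[1]:5d} {h[2]:6d} {h[3]:6d} {h[4]:+d} {h[5]:<8g} {marker}")
print("strand:", strand)
```

Sorting on the protein coordinates rather than the fosmid coordinates keeps the rows in exon order regardless of which strand the gene is on.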

Summary of Gene C: Gene C appears to be the D. ananassae ortholog of the D. melanogaster gene Ten-m. The gene is on the minus strand (all matching reading frames are -1, -2, or -3). The 5' end of the gene does not appear to be on the fosmid. A view of D. melanogaster Ten-m from FlyBase GBrowse is shown below.

Ten-m GBrowse

The D. melanogaster Ten-m gene has three isoforms derived from alternative splicing. The first and second introns are very large (approximately 70 kb and 30 kb, respectively). If the Ten-m gene of D. ananassae has a similar structure, the 5' end cannot be on the fosmid, which is only 44 kb.
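The size argument can be written out as a small calculation. This is a sketch under stated assumptions: the fosmid length is taken as exactly 44,000 bp (the text says only "44 kb"), the intron sizes are the approximate D. melanogaster values quoted above, and 25759 is the 5'-most fosmid coordinate in the Ten-m alignment table.

```python
FOSMID_LENGTH = 44_000     # assumption: "44 kb" taken as exactly 44,000 bp
FIVE_PRIME_EDGE = 25_759   # 5'-most aligned fosmid coordinate for Ten-m
INTRON_1 = 70_000          # approximate D. melanogaster size
INTRON_2 = 30_000          # approximate D. melanogaster size

# The gene is on the minus strand, so "upstream" means higher coordinates.
upstream_available = FOSMID_LENGTH - FIVE_PRIME_EDGE

# Reaching the next exon upstream requires spanning one intron;
# reaching the first exon requires spanning both.
next_exon_could_fit = upstream_available > INTRON_2
first_exon_could_fit = upstream_available > INTRON_1 + INTRON_2

print(f"upstream sequence available: {upstream_available} bp")
print(f"room for the next upstream exon: {next_exon_could_fit}")
print(f"room for the first exon: {first_exon_could_fit}")
```

Under these assumptions only about 18 kb of fosmid lies upstream of the most 5' alignment, which is less than either of the two large introns.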

To confirm this, I used the Gene Record Finder at GEP. This returns the following protein sequences for the first three exons of D. melanogaster Ten-m:

>Ten-m:13_536_0
MNPYEYESTLDCRDVGGGPTPAHAHPHAQGRTLPMSGHGRPTTDLGPVHG
SQTLQHQNQQNLQAAQAAAQSSHYDYEYQHLAHRPPDTANNTAQRTHGRQ

>Ten-m:12_536_2
FLLEGVTPTAPPDVPPRNPTMSRMQNGRLTVNNPNDADFEPSCLVRTPSG
NVYIPSGNL

>Ten-m:11_536_2
INKGSPIDFKSGSACSTPTKDTLKGYERSTQGCMGPVLPQRSVMNGLPAH
HYSAPMNFRKDLVARCSSPWFGIGSISVLFAFVVMLILLTTTGVIKWNQS
PPCSVLVGNEASEVTAAKSTNTDLSKLHNSSVRAKNGQGIGLAQGQSGLG
AAGVGSGGGSSAATVTTATSNSGTAQGLQSTSASAEATSSAATSSSQSSL
TPSLSSSLANANNG

These three sequences were used as subject sequences in a bl2seq search with BLASTX using fosmid_1475K17.fasta as the query sequence.

The first exon gives no significant match (E = 0.49). The second exon gives no significant match (E = 0.045). The third exon gives a significant match (E = 7e-66) to 25765 - 25316 on the fosmid, corresponding to the first segment of the fosmid that aligns with NP_001262211.1 in the analysis above. This shows that the third exon of Ten-m is the most 5' exon encoded on the fosmid.


Gene A. The eighth hit on the description list is HEM-protein. The coordinates of the alignment are shown in the table below.

NP_524214.1       fosmid            alignment
start    end      start    end      frame   E        identity   positive
394      1121     3703     1520     -2      0.0      99%        98%
1        395      4943     3759     -1      0.0      97%        98%

Reordering these in order of the segments of NP_524214.1 gives the results shown below.

NP_524214.1       fosmid            alignment
start    end      start    end      frame   E        identity   positive
1        395      4943     3759     -1      0.0      97%        98%
394      1121     3703     1520     -2      0.0      99%        98%

Summary of Gene A: Gene A is the ortholog of the D. melanogaster Hem gene. The D. ananassae Hem gene is on the minus strand (the aligning segments are in frames -1 and -2). The gene has two exons, one per aligning segment. NP_524214.1 has 1126 amino acids and the alignments span residues 1 to 1121, so the entire Hem gene appears to be on the fosmid.
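The completeness check amounts to asking how much of the D. melanogaster protein is covered by the aligning segments. A minimal sketch follows; the function name covered_residues is hypothetical, and the segment coordinates are the protein start and end values from the Hem table above.

```python
def covered_residues(segments):
    """Count residues covered by at least one (start, end) segment.

    Coordinates are 1-based and inclusive; overlapping or adjacent
    segments are merged before counting.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(segments):
        if current_end is None or start > current_end + 1:
            # Close the previous merged block, if any, and start a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


HEM_LENGTH = 1126                       # length of NP_524214.1
hem_segments = [(1, 395), (394, 1121)]  # protein coordinates of the two HSPs

covered = covered_residues(hem_segments)
fraction = covered / HEM_LENGTH
print(f"{covered} of {HEM_LENGTH} residues aligned ({fraction:.1%})")
```

The few unaligned residues at the C-terminus do not by themselves show that anything is missing; BLASTX often trims the ends of an alignment where similarity drops.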


Gene B. The ninth hit on the description list is Isoleucyl-tRNA synthetase. The coordinates of the alignment are shown in the table below.

NP_730716.1       fosmid            alignment
start    end      start    end      frame   E        identity   positive
50       467      5833     7086     +1      0.0      96%        98%
462      749      7121     7984     +2      0.0      96%        97%
914      1228     8600     9541     +2      0.0      77%        87%
750      918      8047     8553     +1      0.0      91%        95%
1        58       5621     5806     +2      2e-19    76%        82%

Reordering these in order of the segments of NP_730716.1 gives the results shown below.

NP_730716.1       fosmid            alignment
start    end      start    end      frame   E        identity   positive
1        58       5621     5806     +2      2e-19    76%        82%
50       467      5833     7086     +1      0.0      96%        98%
462      749      7121     7984     +2      0.0      96%        97%
750      918      8047     8553     +1      0.0      91%        95%
914      1228     8600     9541     +2      0.0      77%        87%

Summary of Gene B: Gene B is the ortholog of the D. melanogaster Aats-ile gene. The gene is on the plus strand (all matching reading frames are +1 or +2). The gene has five exons, one per aligning segment. NP_730716.1 has 1229 amino acids and the alignments span residues 1 to 1228, so the entire Aats-ile gene appears to be on the fosmid.
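The exon count is inferred from the number of separate aligning segments. The sketch below lists the unaligned gaps and frame changes between consecutive segments, which is the evidence for treating them as separate exons. It is a rough heuristic only: BLASTX segment ends are not exact splice sites, and an intron whose length is a multiple of three would not change the frame. The coordinates are copied from the Aats-ile table above.

```python
# (protein_start, protein_end, fosmid_start, fosmid_end, frame)
hsps = [
    (1, 58, 5621, 5806, 2),
    (50, 467, 5833, 7086, 1),
    (462, 749, 7121, 7984, 2),
    (750, 918, 8047, 8553, 1),
    (914, 1228, 8600, 9541, 2),
]

# Plus-strand gene: order the segments by their fosmid start coordinate.
hsps.sort(key=lambda h: h[2])

# Unaligned bases between the end of one segment and the start of the next.
gaps = [nxt[2] - prev[3] - 1 for prev, nxt in zip(hsps, hsps[1:])]

# Whether the reading frame changes between consecutive segments.
frame_changes = [prev[4] != nxt[4] for prev, nxt in zip(hsps, hsps[1:])]

candidate_exons = len(hsps)

for gap, changed in zip(gaps, frame_changes):
    print(f"gap of {gap} bp, frame change: {changed}")
print("candidate exons:", candidate_exons)
```

Every boundary here shows both an unaligned gap and a change of reading frame, which is consistent with four introns separating five exons.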


BLAST analysis by GEP

The project folder from GEP contains the following analysis (analysis/BLASTresults/dmel_translation_548_fosmid_1475K17.blx):

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

Ten-m-PE FlyBase:FBpp0303192                                  5823  0.       11
Ten-m-PD FlyBase:FBpp0297244                                  5823  0.       11
Ten-m-PB FlyBase:FBpp0078161                                  5823  0.        8
Hem-PA FlyBase:FBpp0078162                                    3929  0.        2
Aats-ile-PA FlyBase:FBpp0078150                               2289  0.        5
Aats-ile-PC FlyBase:FBpp0078151                               2289  0.        5
Aats-ile-PD FlyBase:FBpp0078152                               2289  0.        5
Ten-a-PE FlyBase:FBpp0289137                                  1374  0.       16
Ten-a-PH FlyBase:FBpp0289439                                  1374  0.       15
Ten-a-PI FlyBase:FBpp0289440                                  1374  0.       15
Ten-a-PJ FlyBase:FBpp0289441                                  1374  0.       15
Ten-a-PN FlyBase:FBpp0301779                                  1374  0.       15
Ten-a-PD FlyBase:FBpp0289136                                  1374  0.       15
Ten-a-PF FlyBase:FBpp0289138                                  1374  0.       15
Ten-a-PK FlyBase:FBpp0300541                                  1374  0.       15
Ten-a-PL FlyBase:FBpp0300542                                  1374  0.       15
Ten-a-PM FlyBase:FBpp0300543                                  1374  0.       15
Ten-a-PG FlyBase:FBpp0289438                                  1374  0.       14
CG5414-PB FlyBase:FBpp0291534                                  229  5.2e-42   6
Sgs1-PA FlyBase:FBpp0077084                                     81  9.4e-18  14
drpr-PE FlyBase:FBpp0306204                                     73  3.4e-14  11
drpr-PF FlyBase:FBpp0306205                                     77  9.9e-14   9
drpr-PB FlyBase:FBpp0072681                                     77  1.6e-13   9
crb-PC FlyBase:FBpp0293268                                     101  1.7e-13   6
crb-PB FlyBase:FBpp0110307                                     101  2.6e-13   6
crb-PD FlyBase:FBpp0306945                                     101  2.7e-13   6
CG5660-PA FlyBase:FBpp0076287                                  113  9.1e-13   6
C901-PA FlyBase:FBpp0073256                                    118  1.5e-12   3
drpr-PA FlyBase:FBpp0072680                                    110  8.3e-12   7
drpr-PC FlyBase:FBpp0301579                                    110  1.6e-11   6
crb-PA FlyBase:FBpp0083987                                     101  2.7e-11   8
Muc96D-PA FlyBase:FBpp0084219                                   80  5.4e-11  10

I combined isoforms into single genes, then identified the segment of the fosmid aligning to each sequence, as summarized in the table below. A representative protein sequence (NP_xxxxx) from D. melanogaster was chosen for each gene (only one isoform for genes with multiple isoforms) and used as the subject sequence for a bl2seq search with BLASTX using fosmid_1475K17.fasta as the query sequence. All aligning segments were combined to give a coordinate range on the fosmid. The E value for the highest scoring segment is shown.

Two of the proteins in the report do not significantly align with the translation of the fosmid in this analysis. All of the remaining alignments correspond to the coordinate range for two of the genes identified above (Aats-ile and Ten-m), aligning with lower scores, indicating homology to these genes. These are not new genes missed by the BLASTX analysis above.

Dmel gene   Representative      fosmid                      E        Conclusion
            protein sequence    start    end      strand
Ten-m       NP_001262211.1      25759    12704    minus     0.0      Dana Ten-m (see above)
Hem         NP_524214.1         4943     1520     minus     0.0      Dana Hem (see above)
Aats-ile    NP_730716.1         5621     9541     plus      0.0      Dana Aats-ile (see above)
Ten-a       NP_001138189.1      21552    12938    minus     0.0      Homology to Ten-m
CG5414      NP_648837.2         5836     7966     plus      2e-48    Homology to Aats-ile
Sgs1        NP_523475.3         14537    14632    plus      2.2      Not significant
drpr        NP_001261276.1      21405    20639    minus     9e-13    Homology to Ten-m
crb         NP_001247284.1      21429    20528    minus     1e-06    Homology to Ten-m
CG5660      NP_648268.1         5848     7657     plus      2e-18    Homology to Aats-ile
C901        NP_572673.1         21399    20525    minus     2e-13    Homology to Ten-m
Muc96D      NP_733106.2         40701    40654    minus     2.8      Not significant
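The conclusions in the table follow a simple rule: a hit is either not significant, or it falls on the same strand and within the coordinate range of a gene already identified. The sketch below applies that rule to the values in the table. The E-value cutoff of 1e-5 is my assumption (the text does not state one; it only needs to separate 1e-06 from 2.2), and the names Region and classify are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    low: int
    high: int
    strand: str


# Coordinate ranges of the three genes identified by the BLASTX analysis.
GENES = [
    Region("Ten-m", 12704, 25759, "minus"),
    Region("Hem", 1520, 4943, "minus"),
    Region("Aats-ile", 5621, 9541, "plus"),
]

E_CUTOFF = 1e-5  # assumption: not stated in the text


def classify(low, high, strand, e_value):
    """Assign a hit to a known gene by strand and coordinate overlap."""
    if e_value > E_CUTOFF:
        return "Not significant"
    for gene in GENES:
        if gene.strand == strand and low <= gene.high and high >= gene.low:
            return f"Homology to {gene.name}"
    return "Possible new gene"


# Remaining hits from the table: (low, high, strand, E value).
hits = {
    "Ten-a": (12938, 21552, "minus", 0.0),
    "CG5414": (5836, 7966, "plus", 2e-48),
    "Sgs1": (14537, 14632, "plus", 2.2),
    "drpr": (20639, 21405, "minus", 9e-13),
    "crb": (20528, 21429, "minus", 1e-06),
    "CG5660": (5848, 7657, "plus", 2e-18),
    "C901": (20525, 21399, "minus", 2e-13),
    "Muc96D": (40654, 40701, "minus", 2.8),
}

conclusions = {name: classify(*hit) for name, hit in hits.items()}
for name, conclusion in conclusions.items():
    print(f"{name:8s} {conclusion}")
```

No hit lands in the "Possible new gene" category, which matches the conclusion that the GEP report contains no genes missed by the BLASTX analysis.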

FlyBase BLAST

FlyBase BLAST makes the analysis of the fosmid easier in some ways. I used the link to FlyBase BLAST from the Tools page of the course website.

I set the Database to Annotated proteins (AA), the Program to BLASTX, and uploaded the fosmid sequence. I restricted the species to D. melanogaster and clicked BLAST.

The graphic output is shown below. Notice that each hit is labeled, unlike the results at NCBI BLAST.

BLAST output

The summary table, shown below, is also useful.

BLAST output


GENSCAN analysis

The GENSCAN results from analysis/Genefinder/Genscan in the project folder predict five proteins on the fosmid:

>fosmid_1475K17|GENSCAN_predicted_peptide_1|1126_aa
MARPIFPNQQKIAEKLIILNDRGLGILTRIYNIKKACGDTKSKPGFLSEKSLESSIKFIV
KRFPNIDVKGLNAIVNIKAEIIKSLSLYYHTFVDLLDFKDNVCELLTTMDACQIHLDITL
NFELTKNYLDLVVTYVSLMIVLSRVEDRKAVLGLYNAAYELQNNQADTGFPRLGQMILDY
EVPLKKLAEEFIPHQRLLANALRSLTSIYALRNLPADKWREMQKLSLVGNPAILLKAVRT
ETMSCEYISLEAMDRWIIFGLLLNHQMLGQYPEVNKIWISALESSWVVALFRDEVLQIHQ
YIQSTFDGIKGYSKRISEVKDAYNTAVQKAAHMHRERRKFLRTALKELALIMTDQPGLLG
PKAIFIFIGLCLARDEILWLLRHNDNPPPVKNKGKSNEDLVDRQLPELLFHMEELRALVR
KYSQVMQRYYVQYLSGFDATDLNIRMQSLQMCPEDESIIFSSLYNIAASLTVKQVEDNEL
FYFRPFRLDWFRLQTYMSVGKAALRITEHIELARLLDSMVFHTRVVDNLDEILVETSDLS
IFCFYNKMFDDQFHMCLEFPAQNRYIIAFPLICSHFQNCTHEMCPEERHHIRERSLSVVN
IFLEEMAKEAKNIITTICDEQCTMADALLPKHCAKILSVQSARKKKDKAKSKHFDDIRKP
GDESYRKTREDLTTMDKLHMALTELCFAINYCPTVNVWEFAFAPREYLCQNLEHRFSRDL
VGMVMFNQETMEIAKPSELLASVRAYMNVLQTVENYVHIDITRVFNNCLLQQTQALDSHG
EKTIAALYNTWYSEVLLRRVSAGNIVFSINQKAFVPISPEGWVPFNPQEFSDLNELRALA
ELVGPYGIKTLNETLMWHIANQVQELKSLVVTNKEVLITLRTSFDKPEVMKEQFKRLQDV
DRVLQRMTIIGVIICFRNLVHEALVDVLDKRIPFLLSSVKDFQEHLPGGDQIRVASEMAS
AAGLLCKVDPTLATTLKSKKPEFDEGEHLTACLLMVFVAVSIPKLARNENSFYRATIDGH
SNNTHCMAAAINNIFGALFTICGQNDMEDRMKEFLALASSSLLRLGQESDKEATRNRESI
YLLLDEIVKQSPFLTMDLLESCFPYVLIRNAYHGVYKQEQILGLVL

>fosmid_1475K17|GENSCAN_predicted_peptide_2|1228_aa
MGKKLERSDVCRVPENINFPAEEENVLQRWREENVFERCSQLSKGKPKYTFYDGPPFATG
LPHYGHILAGTIKDIVTRYAYQQGYHVDRRFGWDCHGLPVEFEIDKLMNIRGPEDVAKMG
ITAYNAECRKIVMRYADEWENIVTRVGRWIDFKNDYKTLYPWYMESIWWIFKQLYDKGLV
YQGVKVMPYSTACTTSLSNFEANQNYKEVVDPCVVIALEAVSLPNTFFLVWTTTPWTLPS
NFACCVHPTMTYVKVRDVKSDRLFILAESRLSYVYKTEAEYEVKDKFAGKTLKDLHYKPL
FPYFAKRGAEVKAYRVLVDEYVTEDSGTGIVHNAPYFGEDDYRVCLAAGLITKSSEVLCP
VDEAGRFTKEASDFEGQYVKDADKQIMAVLKTRGNLVSSGQVKHSYPFCWRSDTPLIYKA
VPSWFVRVEHMSKNLLTCSSQTYWVPDFVKEKRFGNWLREARDWAISRNRYWGTPIPIWR
SPNGDETIVIGSIKQLAELSGVQVEDLHRESIDHIEIPSAVPGNPPLRRIAPVFDCWFES
GSMPFAQQHFPFENEKDFMNNFPADFIAEGIDQTRGWFYTLLVISTALFNKAPFKNLIAS
GLVLAADGQKMSKRKKNYPDPMEVVHKYGADALRLYLINSPVVRAESLRFKEEGVRDIIK
DVFLPWYNAYRFLLQNIARYEKEDLGGKGQYIYERERHLKNMDKASVIDVWILSFKESLL
QFFAEEMKMYRLYTVVPRLTKFIDQLTNWYVRLNRRRIKGELGAEQCIQSLDTLYDVLYT
MVKMMAPFTPYLTEYIFQRLVLFQPPGSLEHADSVHYQMMPVSQSKFIRNDIERSVSLMQ
SVVELGRVMRDRRTLPVKYPVSEIIVIHKDAKVLEAVKNLQDFILSELNVRKLTLSSDKE
KYGVTLRAEPDHKTLGQRLKGNFKAVMAAIKALKDDEIQKQVAQGYFNILDQRIELDEVR
VIYCTSEQVGGHFEAHSDNEVLVLLDMTPNEELLEEGLAREVINRVQKLKKKAQLIPTDP
VLIFHELEANSTKKETLETQAQLKKVLSSYSDMIKTAIKSDFGPFSAEKSAKKRVIASEL
VDLKGIPLKLTICSTEDLQLPNLPFLNVALAEDLKPRFGNGDKASLFLQHNASKQIISLP
QLRSEIEILFGLYGVNFNIYVVDQKSNAKELTSIDKSLNGKLLVLSRGPEALKSKASFDV
PSSPYSKFVNKGSGTAAFIENPKGTTLN

>fosmid_1475K17|GENSCAN_predicted_peptide_3|3291_aa
MRHVAVQTNECARKYAIEFKYCKCMSRTLGPGRGEQQQRNNWRYMFFAAVPQPPIPANDT
RFQTTLRRCKKGGYGMRIRMRTRSPIIYPNTHLIRTRGGAQVACYINKGSPIDFKSGSAC
STPTKETLKGGYDRGTQGCMGPVLPPRSVMNGLPSHHYSAPMNFRKDLAARCSSPWVGVA
AISALVLVVLMLILLTTFGGLHWTQSAPCTVLVGNEASEVTAAKSTNTDLSKLHNSATRS
KNGQAIGPVPGQYGSGGGMGSSTATVTAATSNSGTAQGLQSTSASAEASSSAATTSSQSS
LTPSTSLSSSLANANNGGRDILTRMAADGAGKNKRRNRRSMDVAENGGDVATDETFSNFI
TIESLNREQTGEYFATTPARKLQEVERSSSDRTSFGINGVLSPQGDEEVEDITSDYVYED
EPVPDTSPATQRPRTRQQFGKSLNSNLRSAAKTLVNKKTKYEGEAGKNIRLEQEQKLEAF
IEAGMTLESTTTTATRATTTTESGTSTFVAVIDDDNQSDSSSSGAVPPVLTVLRSDTDDI
VEINTALPEPTEGSILAPFPRSSMANDFQIKGKAVSESGQEKPATDDNNNERDLADNYEL
KPEEPAASATPLQGINQVHSTFLAGEINKSESDFMVNDMASQFEDIDIVKLGEAPSSHEE
AVYKTSSSKDKAVPMAPAQPEAIENEVLKDQDEARQVPLHLRPLKPYVSETIDQPGRRIL
VNLTIATDDGPDNVYTLHVEVPTGGGPHTIKEVLTHEKPPQQADHQAENCVPEPPPRMPD
CPCSCLPPPAPIYLDDRGDSGDSGSAPPVDTDSAPLASTTNGASTSPPLETATILGDRHS
EKDDHGVGNGNSSTVEGESTASSCATNTPSTEIDNHIDAFHTEPPVGGGEPFACPDVMPV
LILEGARTFPARSFPPDGTTFGQISLGQKLTKEIQPYSYWNMQFYQSEPAYVKFDYTIPR
GASIGVYGRRNALPTHTQYHFKEVLSGFSASTRTARAAHLSITREVTRYMEPGHWFMSLY
NDDGDVQELTFYAAVAEDMTQNCPNGCSGNGQCLLGHCQCNPGFGGHDCSESVCPVLCSQ
HGEYTNGECICNPGWKGKECSLRHDECEVADCNGHGHCVSGKCQCMRGYKGKFCEEVDCP
HPNCSGHGFCADGTCICKKGWKGPDCATMDQDALQCLPDCSGHGTFDLDTQTCTCEAKWS
GDDCSKELCDLDCGQHGRCEGDACACDPEWGGEYCNTRLCDTRCNEHGQCKNGTCLCVTG
WNGKHCTIEGCPNSCAGHGQCRVSGEGQWECRCYEGWDGPDCGIALELNCGDSKDNDKDG
LVDCEDPECCASHVCKTSQLCVSAPKPIDVLLRKQPPAITASFFERMKFLIDESSLQNYA
KLETFNESRSAVIRGRVVTSLGMGLVGVRVSTTTLLEGFTLTRDDGWFDLMVNGGGAVTL
QFGRAPFRPQSRIVQVPWNEVVIIDGVVMSMSEEKGLATTTTHTCFAHDYDLMKPVVLAS
WKHGFQGACPDRSAILAESQVIQESLQIPGTGLNLVYHSSRAAGYLSTIKLQLTPDNIPP
TLHLIHLRITIEGILFERVFEADPGIKFTYAWNRLNIYRQRVYGVTTAVVKVGYQYTDCT
DIVWDIQTTKLSGHDMSISEVGGWNLDIHHRYNFHEGILQKGDGSNIYLRNKPRIILTTM
GDGHQRPLECPDCDGLATKQRLLAPVALAAAPDGSLFVGDFNYIRRIMSDGSIRTVVKLN
ATRVSYRYHMALSPLDGTLYVSDPESHQIIRVRDTNNYSQPELNWEAVVGSGERCLPGDE
AHCGDGALAKDAKLAYPKGIAISSDNILYFADGTNIRMVDRDGIVSTLIGNHMHKSHWKP
IPCEGTLKLEEMHLRWPTELAVSPMDNTLHIIDDHMILRMTPDGRVRVISGRPLHCATAS
TAYDTDLATHATLVMPQSIAFGPLGELYVAESDSQRINRVRVIGTDGRIAPFAGAESKCN
CLERGCDCFEAEHYLATSAKFNTIAALSVTPDGHVHIADQANYRIRSVMSSIPEASPSRE
YEIYAPDMQEIYIFNRFGQHVSTRNILTGETTYVFTYNVNTSNGKLSTVTDAAGNKVFLL
RDYTSQVNSIENTKGQKCRLRMTRMKMLHELSTPDNYNVTYEYHGPTGLLKTKLDSTGRS
YVYNYDEFGRLTSAVTPTGRVIELSFDLSVKGAQVKVSENAQKEQSLLIQGATVTVRNGA
AESRTSVDMDGSTTSITPWGHNVQMEVAPYTILAEQSPLLGESYPVPAKQRTEIAGDLAN
RFEWRYFVRRQQPLQAGKQSKGAPRPVTEVGRKLRVNGDNVLTLEYDRETQSVVVLVDDK
QELLNVTYDRTSRPISFRPQSGDYADVDLEYDRFGRLVSWKWGVLQEAYSFDRNGRLNEI
KYGDGSTMVYAFKDMFGSLPLKVTTPRRSDYLLQYDDAGALQSLTTPRGHIHAFSLQTSL
GFFKYQYFSPINRHPFEILYNDEGQILAKIHPHQSGKVAFVYDAAGRLETILAGLSSTHY
TYQDTTSLVKTVEVQEPGFELRREFKYHAGILKDEKLRFGSKNSLASAHYKYAYDGNARL
SGIEMAIDDKELPTTRYKYSQNLGQLEVVQDLKITRNAFNRTVIQDSAKQFFAIVDYDQH
GRVKSVLMNVKNIDVFRLELDYDLRNRIKSQKTTFGRSTAFDKINYNADGHVVEVLGTNN
WKYLYDENGNTVGVVDQGEKFNLGYDIGDRVIKVGDVEFNNYDARGFVVRRGEQKYRYNN
RGQLIHAFERERFQSWYYYDDRSRLVAWHDNQGNTTQYYYANPRTPHLVTHAHFPKLART
MKFFYDDRDMLIAMENADQRYYVATDQNGSPLAFFDLNGGIAKELKRTPFGRIIKDTKPD
FFVPIDFHGGLIDPHTKLIYTEQRQYDPHVGQWMTPQWETLATEMSHPTDVFIYRYHNND
PINPNRPQNYMIDLDAWLQLFGYDLDNMQSRRYTKLAQYTPQASIKSNMLAPDFGVISGL
ECIVEKTSEKFSDFDFVPKPLLKMEPKMRNLLPRISYRRGVFGEGVLLSRIGGRALVSVV
DGSNSVVQDVVSSVFNNSYFLDLHFSIHDQDVFYFVKDNVLKLRDDNEELRRLGGMFNIS
THEVSDHGGSAAKELRLHGPDAVVIVKYGVDPEQERHRILKHAHKRAVERAWELEKQLVA
AGFQGRGDWTEEEKEELVQHGDVDGWIGIDIHSIHKYPQLADDPGNVAFQRDAKRKRRKT
GNSHRSASSRRQMKFGELSALYDYDCNEQLVFNVENSLENQISRRRRNEKY

>fosmid_1475K17|GENSCAN_predicted_peptide_4|111_aa
MEKKPSNNGRFNGRLSEASAIINGRNNSLLRRRLLATTEGELVGGERGGREVAPWTHIID
QNFTDYKAINGIDYSISKTDRNFAFAAQFLRVQPPSRKFNIIALGLAVIAH

>fosmid_1475K17|GENSCAN_predicted_peptide_5|46_aa
XYGTCLCEHCQLDELLLRQLIALPPDNWWDSQRFYVLIVIIVTAAL
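Each GENSCAN header ends with the predicted peptide length (for example, 46_aa). Before submitting the peptides to BLASTP, it can be worth confirming that each pasted sequence matches its declared length. The following is a minimal sketch using only the two shortest peptides above; the function names read_fasta and declared_length are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def declared_length(header):
    """Read the length from a header whose last field looks like '46_aa'."""
    last_field = header.rsplit("|", 1)[1]
    return int(last_field.split("_")[0])


fasta_text = """
>fosmid_1475K17|GENSCAN_predicted_peptide_4|111_aa
MEKKPSNNGRFNGRLSEASAIINGRNNSLLRRRLLATTEGELVGGERGGREVAPWTHIID
QNFTDYKAINGIDYSISKTDRNFAFAAQFLRVQPPSRKFNIIALGLAVIAH

>fosmid_1475K17|GENSCAN_predicted_peptide_5|46_aa
XYGTCLCEHCQLDELLLRQLIALPPDNWWDSQRFYVLIVIIVTAAL
"""

records = read_fasta(fasta_text)
lengths = {header: len(seq) for header, seq in records.items()}

for header, seq in records.items():
    status = "ok" if len(seq) == declared_length(header) else "MISMATCH"
    print(f"{header}: {len(seq)} aa ({status})")
```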

Each protein was used as a query sequence in a BLASTP search of Non-redundant protein sequences (nr) restricted to Drosophila melanogaster. All BLAST parameters were left at the default settings. The results are summarized in the table below. Peptides 4 and 5 do not produce significant alignments with D. melanogaster proteins.

Query                         Top hit                      E      Coverage   Max identity
                              Accession        Gene
GENSCAN_predicted_peptide_1   NP_524214.1      Hem         0.0    100%       98%
GENSCAN_predicted_peptide_2   NP_730716.1      Aats-ile    0.0    99%        91%
GENSCAN_predicted_peptide_3   NP_001262211.1   Ten-m       0.0    97%        86%
GENSCAN_predicted_peptide_4   No significant similarity found.
GENSCAN_predicted_peptide_5   NP_609068.2      Ttll3B      3.2    60%        39%

Because no significant matches were found with peptides 4 and 5, I repeated the BLASTP search of nr with the species restriction turned off.

The top hit for peptide 4 was WP_004022292.1 with an E value of 2.0 (not significant).

The top hit for peptide 5 was WP_005224522.1 (conjugal transfer protein TraB [Marichromatium purpuratum]) with an E value of 7e-06 (marginally significant). The alignment is shown below.

Query  3    GTCLCEHCQLDELLLRQLIALPPDNWWDSQRFYVLIVIIVTA  44
            G CL E  +  E ++ +L  LPP N W  +  +VL+V+I+  
Sbjct  229  GRCLAEEDESPEPVIAELERLPPPNPWPRRLPWVLVVLILAG  270
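The strength of this alignment can be checked by counting identical positions directly. The sketch below uses the two aligned strings exactly as shown above; the alignment has no gaps, so the strings can be compared position by position.

```python
# Aligned segments copied from the BLASTP output above (no gaps).
query = "GTCLCEHCQLDELLLRQLIALPPDNWWDSQRFYVLIVIIVTA"
sbjct = "GRCLAEEDESPEPVIAELERLPPPNPWPRRLPWVLVVLILAG"

aligned_length = len(query)
if aligned_length != len(sbjct):
    raise ValueError("aligned strings must have the same length")

# Count positions where the two sequences carry the same residue.
identities = sum(q == s for q, s in zip(query, sbjct))
percent_identity = 100 * identities / aligned_length

print(f"{identities}/{aligned_length} identical ({percent_identity:.0f}%)")
```

The alignment is short and covers residues 3 to 44 of the 46-residue peptide, so the identity figure should be read together with the E value rather than on its own.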

Summary of GENSCAN analysis: GENSCAN identified the three genes found by the BLASTX search (Hem, Aats-ile, and Ten-m). It made two additional predictions that are invalid.


UCSC Genome Browser at GEP

Here is a view of fosmid 1475K17 in the UCSC Genome Browser at GEP.

BLASTX Alignment of D. melanogaster proteins. The BLASTX track at the top of the image shows alignments to three distinct regions, as was seen in the prior BLASTX analysis. The leftmost D. ananassae gene, Hem, aligns to the protein product of the D. melanogaster Hem gene and no other sequences. The middle D. ananassae gene, Aats-ile, aligns to three D. melanogaster isoforms produced by the D. melanogaster Aats-ile gene. The rightmost D. ananassae gene, Ten-m, aligns to isoforms produced by the D. melanogaster Ten-m gene. In addition, the protein isoforms of the closely related but distinct D. melanogaster Ten-a gene align to the D. ananassae Ten-m gene. These are the same results seen when the fosmid is used as a query sequence in a BLASTX search of D. melanogaster proteins. There are three genes on the fosmid: the D. ananassae orthologs of Hem, Aats-ile, and Ten-m.

GENSCAN predictions. Starting at the left of the fosmid, the first three GENSCAN predictions align to the D. ananassae orthologs of Hem, Aats-ile, and Ten-m, as shown in the previous analysis. The third GENSCAN prediction contains one additional 3' exon and two additional 5' exons in the Ten-m gene that are not supported by BLASTX analysis. The fourth and fifth GENSCAN predictions do not align to sequences predicted by BLASTX analysis to encode proteins. The first three GENSCAN predictions are congruent with predictions from other gene-finding programs, while the invalid fourth and fifth predictions do not match predictions from other gene-finding programs, with the exception of one of the exons of GENSCAN peptide 5, which is also predicted by SNAP.

modENCODE RNA-Seq. Transcripts aligning to Hem, Aats-ile, and Ten-m are seen. There is little evidence of RNA sequences present in mRNA elsewhere on the fosmid.

Conservation. The exons of Hem, Aats-ile, and Ten-m are clearly conserved. The intergenic regions between Hem and Aats-ile and between Aats-ile and Ten-m are not conserved. There is considerable sequence conservation upstream of the rightmost (5') exon in Ten-m; this region is known to be a Ten-m intron, separating the third exon of Ten-m from the second exon, which is not on the fosmid.

UCSC Genome Browser


Lightning Round

Here are the two slides for a four-minute talk on this fosmid. Slides are reduced to half size; actual jpg images are 10 inches wide by 7.5 high at 150 dpi (1500 x 1125 pixels).

slide 1

slide 2