
2nd Bioinformatic Coding Problem: genes into protein primary structures

On this wiki page:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages

Post 3 (Protein Primary Structures from a Scripting Language Point of View) presents the background needed to understand how to translate a protein-coding gene like this:

atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat

attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa

ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac

ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt

tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc

ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta

aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat

agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata

tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt

ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac

ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa

tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac

gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg

atcgaacgagacaaaagaatacgataa

into a protein primary structure like this:

MNKQIDLPIADVQGSLDTRHIAIDRVGIKAIRHPVVVADKGGGSQHTVAQFNMYVNLPHNFKGTHMSRFV

EILNSHEREISVESFEEILRSMVSRLESDSGHIEMAFPYFINKSAPVSGVKSLLDYEVTFIGEIKHGNQY

SFTMKVIVPVTSLCPCSKKISDYGAHNQRSHVTISVRTNSFIWIEDIIRIAEEQASCELYGLLKRPDEKY

VTERAYNNPKFVEDIVRDVAEVLNHDDRIDAYIVESENFESIHNHSAYALIERDKRIR

using the "standard genetic code":

F: ttt S: tct Y: tat C: tgt

F: ttc S: tcc Y: tac C: tgc

L: tta S: tca *: taa *: tga

L: ttg S: tcg *: tag W: tgg

L: ctt P: cct H: cat R: cgt

L: ctc P: ccc H: cac R: cgc

L: cta P: cca Q: caa R: cga

L: ctg P: ccg Q: cag R: cgg

I: att T: act N: aat S: agt

I: atc T: acc N: aac S: agc

I: ata T: aca K: aaa R: aga

M: atg T: acg K: aag R: agg

V: gtt A: gct D: gat G: ggt

V: gtc A: gcc D: gac G: ggc

V: gta A: gca E: gaa G: gga

V: gtg A: gcg E: gag G: ggg

I'd love to have a copy of the necessary translation routine in each of the usual scripting languages - any routines posted in this thread will be added to the above wiki page.
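
As one illustration of the kind of routine being asked for, here is a minimal Python sketch. It is not taken from the wiki page or from any reply in this thread. The names `CODON_TABLE` and `translate` are hypothetical, and the 64-letter string is assumed to encode the same assignments as the table above, with bases ordered t, c, a, g at each codon position.

```python
from itertools import product

# Standard genetic code; "*" marks a stop codon.
# Codons are generated as ttt, ttc, tta, ttg, tct, ... (t, c, a, g order).
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate codon by codon from offset 0, stopping at the first stop codon."""
    # Accept upper-case and RNA input by normalising to lower-case DNA letters.
    seq = dna.lower().replace("u", "t")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]  # raises KeyError on letters other than a, c, g, t
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First line of the gene shown above.
print(translate("atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat"))
# prints MNKQIDLPIADVQGSLDTRH
```

A trailing partial codon is ignored, and ambiguous bases such as `n` are deliberately not handled in this sketch.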


1 Answer

  • Former Member
    Jun 11, 2008 at 10:05 PM
    $inseq  = "atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat";
    $inseq .= "attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa";
    $inseq .= "ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac";
    $inseq .= "ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt";
    $inseq .= "tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc";
    $inseq .= "ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta";
    $inseq .= "aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat";
    $inseq .= "agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata";
    $inseq .= "tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt";
    $inseq .= "ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac";
    $inseq .= "ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa";
    $inseq .= "tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac";
    $inseq .= "gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg";
    $inseq .= "atcgaacgagacaaaagaatacgataa";
    
    $trans  = array(
    "ttt" => "F", "ctt" => "L", "att" => "I", "gtt" => "V",
    "ttc" => "F", "ctc" => "L", "atc" => "I", "gtc" => "V",
    "tct" => "S", "cta" => "L", "ata" => "I", "gta" => "V",
    "tcc" => "S", "ctg" => "L", "atg" => "M", "gtg" => "V",
    "tca" => "S", "cct" => "P", "act" => "T", "gct" => "A",
    "tcg" => "S", "ccc" => "P", "acc" => "T", "gcc" => "A",
    "tta" => "L", "cca" => "P", "aca" => "T", "gca" => "A",
    "ttg" => "L", "ccg" => "P", "acg" => "T", "gcg" => "A",
    "tat" => "Y", "cat" => "H", "aat" => "N", "gat" => "D",
    "tac" => "Y", "cac" => "H", "aac" => "N", "gac" => "D",
    "tgt" => "C", "caa" => "Q", "aaa" => "K", "gaa" => "E",
    "tgc" => "C", "cag" => "Q", "aag" => "K", "gag" => "E",
    "tgg" => "W", "cgt" => "R", "agt" => "S", "ggt" => "G",
    "taa" => "*", "cgc" => "R", "agc" => "S", "ggc" => "G",
    "tga" => "*", "cga" => "R", "aga" => "R", "gga" => "G",
    "tag" => "*", "cgg" => "R", "agg" => "R", "ggg" => "G");
    
    // Accept RNA input as well: map "u" to "t" so the DNA codon table applies.
    $inseq = strtr($inseq, "u", "t");
    // For each stop codon, find its first in-frame occurrence (offset divisible by 3)
    // and keep the smallest such offset; default to the full length if there is none.
    $stops = array("taa", "tga", "tag");
    $end = strlen($inseq);
    foreach ($stops as $stop) {
      $pos = strpos($inseq, $stop);
      while ($pos !== false && $pos % 3 != 0) {
        $pos = strpos($inseq, $stop, $pos + 1);
      }
      if ($pos !== false && $pos < $end) {
        $end = $pos;
      }
    }
    // Every a/c/g/t codon is a key of $trans, so strtr() consumes the string three letters at a time.
    echo strtr(substr($inseq, 0, $end), $trans);
    
    

    Most of the code (apart from the definition of the input parameters) is an attempt to find the first stop codon efficiently, so that a sequence of several kilobytes is not loaded and translated when the first stop codon actually appears after only a few bytes. If this isn't necessary, the actual algorithm is a one-liner.
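
For comparison, the stop-codon search described here can be sketched in Python, which is not the language of the code above. The function name `first_in_frame_stop` is hypothetical, and the sketch assumes lower-case DNA input read from offset 0.

```python
def first_in_frame_stop(seq, stops=("taa", "tga", "tag")):
    """Return the offset of the first in-frame stop codon, or len(seq) if there is none."""
    end = len(seq)
    for stop in stops:
        pos = seq.find(stop)
        # Skip out-of-frame hits: only offsets divisible by 3 start a real codon.
        while pos != -1 and pos % 3 != 0:
            pos = seq.find(stop, pos + 1)
        if pos != -1:
            end = min(end, pos)
    return end


seq = "atgaaataggcttgataa"  # codons: atg aaa tag gct tga taa
# The "tga" at offset 1 is out of frame and is skipped; "tag" at offset 6 wins.
print(seq[:first_in_frame_stop(seq)])  # prints atgaaa
```

Only the prefix returned by the slice then needs to be translated, which is the saving the paragraph above describes.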

    the language is of course ... well, a little trivial riddle (google some keywords to find out).


    • Hi Anton -

      Thanks for the further guidance.

      I will be sure to post all your comments in the wiki, not just the code snippets, as soon as I'm off work tomorrow.

      Again, I am eager to see what you (and/or others) do with the first "interesting" problem that I will pose this weekend involving protein secondary structure.

      In addition to some regex matters, this problem will also involve something we've talked about before ... submitting a query to a foreign URL from within a WDA application and parsing the html/xml that's returned.

      Best regards

      djh