Skip to Content

Three new bioinformatic coding problems now in WIKI (all require regex)

At the bottom of this page:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages

I've defined three new bioinformatic coding problems (Problems 3a-c).

These all involve regex parsing in one way or another and are therefore a lot more interesting than the first two problems I've presented.

I'm hoping that Anton will continue to knock out solutions, and that others will feel challenged enough to join him.

djh

Add a comment
10|10000 characters needed characters exceeded

Assigned Tags

Related questions

4 Answers

  • author's profile photo Former Member
    Former Member
    Posted on Jun 14, 2008 at 12:34 PM

    Here you go:

    <?
    $inlines = split("\n", file_get_contents("C:\some_path\\pdbs\\1acx.str"));
    $sequence = "TPSGTPVGSV";
    
    $pstruc = $sstruc = "";
    foreach($inlines as $inline){
      if(substr($inline, 0, 3) == "SEQ") {
        $pstruc .= substr($inline, 10, 50);
      }
      if(substr($inline, 0, 3) == "STR") {
        $sstruc .= substr($inline, 10, 50);
      }
    }
    $resstruc = "";
    $pos = strpos($pstruc, $sequence);
    if($pos !== false) {
      $resstruc = substr($sstruc, $pos, strlen($sequence));
    }
    
    if(strlen($resstruc)>=0){
      echo $sequence . " translates to " . $resstruc;
    } else {
      echo "sequence " . $sequence . " not found";
    }
    
    ?>
    

    where ...1acx.str in line 2 is the full path to a regular PDB(Protein Database Format) formatted file and $sequence is the primary structure fragment to search for. Problems a-c are used/solved.

    To be continued.

    anton.

    Add a comment
    10|10000 characters needed characters exceeded

    • Anton -

      Regarding the 1996 STRIDE download of the PDB, it is a little out-of-date because many more sequences have been added and many existing ones have been refined.

      You can get an up-to-date set of the underlying atomic coordinate (".pdb") files at the RCSB PDB site here:

      http://www.rcsb.org/pdb/home/home.do

      Note the yellow banner at the top where it mentions this ftp site:

      ftp://ftp.wwpdb.org.

      Also - note that ".pdb" files do not contain STRIDE read-outs ... see my note below on this point.

      Best

      djh

      Edited by: David Halitsky on Jun 15, 2008 12:20 AM

  • author's profile photo Former Member
    Former Member
    Posted on Jun 14, 2008 at 12:46 PM

    Reading the STRIDE documentation I found out that they supply a database of PDB documents. Actually they supplied it back in 1996 or something, no idea if this knowledge is still valid. Anyway I downloaded the whole database and expanded the 3180 files probably representing as many known and well described sequences.

    So, I assume it would be nice to search the sequence not only in one PDB but the whole database. This is it

    $pdbdir = "C:\some_path\\";
    
    $sequence = "VVVGAPGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVID";
    $start = time();
    
    $pdbs = scandir($pdbdir);
    foreach($pdbs as $pdb) {
      if (strpos($pdb, ".str") !== false){
        $inlines = split("\n", file_get_contents($pdbdir . "\\" . $pdb));
    
        $pstruc = $sstruc = $pdbcode = "";
        foreach($inlines as $inline){
          if (strlen($pdbcode) <= 0) {
            $pdbcode = substr($inline, 75, 4);
          }
          if(substr($inline, 0, 3) == "SEQ") {
            $pstruc .= substr($inline, 10, 50);
          }
          if(substr($inline, 0, 3) == "STR") {
            $sstruc .= substr($inline, 10, 50);
          }
        }
    
        $resstruc = "";
        $pos = strpos($pstruc, $sequence);
        if($pos !== false) {
          $resstruc = substr($sstruc, $pos, strlen($sequence));
        }
        if(strlen($resstruc) >= 1) { break; }
      }
    }
    
    if(strlen($resstruc)>=0){
      echo $sequence . " translates to " . $resstruc . " according to " . $pdbcode;
    } else {
      echo "sequence " . $sequence . " not found in " . $pdbcode;
    }
    
    $end = time();
    $took = $end - $start;
    
    echo "\n------------\nThis run took " . $took . " seconds.";
    
    ?>
    

    assuming that this will take a little longer I added some runtime measurement. The chosen sequence to search for was one from the last file to be parsed to find out the maximum runtime.

    The code stops searching at the first occurence of the search string. If it's found in the first file it takes less than a second, if its found in file number 3180 it takes 470 seconds on my laptop.

    Nice to search the whole DB, but there is room for performance optimisation.

    To be continued.

    anton.

    Add a comment
    10|10000 characters needed characters exceeded

  • author's profile photo Former Member
    Former Member
    Posted on Jun 14, 2008 at 12:52 PM

    why did the earlier run take so long? well we were parsing 250MB of data files opening an closing 3180 files, thats some load for our little helper.

    So why not aggregate the essential information into one file? We quickly write a variant of the last program to do exactly this. From every file we take the needed information and write it into an aggregation file called master.txt. The format we use is

    PDB-Code | primary sequence | secondary sequence

    Here we go

    $pdbdir = "C:\some_path";
    $sequence = "VVVGAPGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVID";
    
    
    //$fh = fopen($pdbdir . "\\master.txt", "w");
    
    $pdbs = scandir($pdbdir);
    foreach($pdbs as $pdb) {
      if (strpos($pdb, ".str") !== false){
        $inlines = split("\n", file_get_contents($pdbdir . "\\" . $pdb));
    
        $pstruc = $sstruc = $pdbcode = "";
        foreach($inlines as $inline){
          if (strlen($pdbcode) <= 0) {
            $pdbcode = substr($inline, 75, 4);
          }
          if(substr($inline, 0, 3) == "SEQ") {
            $pstruc .= substr($inline, 10, 50);
          }
          if(substr($inline, 0, 3) == "STR") {
            $sstruc .= substr($inline, 10, 50);
          }
        }
    fputs($fh, $pdbcode . "|" . $pstruc . "|" . $sstruc . "\n");
      }
    }
    
    fclose($fh);
    ?>
    

    Add a comment
    10|10000 characters needed characters exceeded

    • Anton -

      I am way impressed - not just by your code and how fast you produced it, but also by the speed with which you grasped the STRIDE documentation and put it to very good use.

      Regarding the question you ask at the very end - this question goes to the heart of why protein tertiary structure (the actual 3-dimensional shape of one chain of a protein) is so hard to predict, and why governments, universities, and pharma are throwing so much money at it.

      If protein primary and secondary structure were many-to-one, prediction of protein tertiary strructure would be a lot lot simpler. (For example, if "VVAY" and all "similar" primary structure subsequences mapped to the same secondary structure.)

      But the mapping is, unfortunately, many-to-many because the amino acids (AAs) in a primary structure chain always bond the same way - the COOH (carboxy) of one AA bonds to the HN (amino) of the next one to produce C-N with the OH and H going away to make a water moleculre ... COOH - HN ---> C-N + H20.

      And as pointed out by Linus Pauling (the discoverer of protein secondary structure), the C-N bond can take two forms:

      a) one that is llkely to force the two amino acids to become HH (part of a helix)

      b) one that is likely to force the two amino acids to become EE (part of a "stand")

      You may be surprised to find out that even primary structure subsequences of six amino acids can wind up as different secondary structures - not always "H" vs "E", the variants might be

      E TT H
      H TT E
      EEE HH
      HH EEE
      etc.
      

      Anyway, thanks a lot for what you've done here.

      If you decide to stay involved a little longer, you will very soon be able to understand why there may be real promise in the approach to protein structure that is being taken here:

      http://strucclue.ornl.gov

      (This is why I want to get SAP interested in the vertical bioinformatic sector - you can see why an integrated IDE with a robust DB is so important.)

      BTW, the site mentions Dr. Arthur Lesk as a team member. Arthur is considered one of the fathers of what is called "structural alignment" (as opposed to pure sequence alignment.)

      If you Google him, you'll see that he has written many many books on bioinformatics - all worth purchasing if you're going to become involved in this area. Arthur was at Cambridge (MRC) until recently, when he reached mandatory EU retirement age and took a position at Penn State here in the US.

      Best

      djh

  • author's profile photo Former Member
    Former Member
    Posted on Jun 14, 2008 at 01:01 PM

    finally we use the just compiled master file and search this one instead of the 3180 single files. how long will it take? before we start we add another functionality: given a very short search string the probability will be high to find this sequence encoded more than once. so, let's not stop at the first occurence but ssearch the whole database for all occurences... and measure the time.

    $pdbdir = "C:\some_path";
    $sequence = "VVAY";
    $start = time();
    
    $inlines = split("\n", file_get_contents($pdbdir . "\\master.txt"));
    
    foreach($inlines as $inline){
      $pdbdat = explode("|", $inline);
      $pos = strpos($pdbdat[1], $sequence);
      if($pos !== false) {
        $result["pdbcode"]   = $pdbdat[0];
        $result["resstruc"] = substr($pdbdat[2], $pos, strlen($sequence));
        $results[] = $result;
      }
    }
    
    if(isset($results)) {
      foreach($results as $result){
        echo $sequence . " translates to " . $result["resstruc"] . " according to " . $result["pdbcode"] . "\n";
      } 
    }
    else {
      echo "sequence " . $sequence . " not found.";
    }
    
    $end = time();
    $took = $end - $start;
    
    echo "\n------------\nThis run took " . $took . " seconds.";
    // 9 seconds for all
    //3127 entries in master for 3180 files -> there must be some files containting no SEQ data
    
    ?>
    

    The progamm now needs 9 seconds to execute, that's fine. I think it could even be made much faster by putting the data of master.txt into a relational database an index the search field. We'd end somewhere around one second I guess

    Btw, the output of the last run is

    VVAY translates to HHHH according to 1HDC
    VVAY translates to EEEE according to 1LAF
    VVAY translates to EEEE according to 1LAG
    VVAY translates to EEEE according to 1LAH
    VVAY translates to EEEE according to 1LST
    VVAY translates to EEEE according to 1ORD
    VVAY translates to BEGG according to 1PIV
    VVAY translates to EEGG according to 1POV
    VVAY translates to BEGG according to 1PVC
    VVAY translates to EEEE according to 1YPI
    VVAY translates to HHHH according to 2HSD
    VVAY translates to EEEE according to 2LAO
    VVAY translates to BEGG according to 2PLV
    VVAY translates to EEEE according to 2YPI
    VVAY translates to EEEE according to 3YPI
    VVAY translates to EEEE according to 7TIM
    ------------
    This run took 9 seconds.
    

    It shows what the search string is translated to in the various structures found. What it means that one primary structure translates to different secondary structures remains to be explained by the expert.

    anton.

    Add a comment
    10|10000 characters needed characters exceeded

    • Anton -

      One other note .,,

      each downloadable ".str" file contains the atomic coodinates for an deposition in the PDB (Protein Data Base), but an ".str" file should not be confused with the full and complete ".pdb" file that can be downloaded from the PDB itself.

      An ".str" file and a ".pdb" file both contain the atomic coordinates for a PDB deposition, but the ".pdb" file does not contain the STRIDE read-out that the ".str" file contains, and the ".str" file does not contain certain information that can be found in the ".pdb" file.

      Best

      djh

Before answering

You should only submit an answer when you are proposing a solution to the poster's problem. If you want the poster to clarify the question or provide more information, please leave a comment instead, requesting additional details. When answering, please include specifics, such as step-by-step instructions, context for the solution, and links to useful resources. Also, please make sure that you answer complies with our Rules of Engagement.
You must be Logged in to submit an answer.

Up to 10 attachments (including images) can be used with a maximum of 1.0 MB each and 10.5 MB total.