Skip to Content
avatar image
Former Member

How would you code this in Perl/PHP etc.?

Please note that if you submit an answer to this question, your code will be used in a NOT-FOR-PROFIT scientific research program, and you will be given credit for the code in any publication etc.

Problem Statement:

1. You have a string S of 20-80 characters over the alphabet A-Z (this string is drawn at random from a database DS of strings over A-Z.)

2. You have grouped the letters A-Z into the following three groups a, j, and t:

a = A-I j = J-S t = T-Z.

3. Within the string S, you identify all doublets (two consecutive letters) such that the first letter of the doublet is within group a or group j, and the second letter of the doublet is in group a or j.

4. Since you are only interested in these doublets within S, you represent S (for example) as the string S':

AKS . . . . . BL . . . . . . MD . . . . . EE (where each "." represents a letter in group t.)

where the "group-level" representation of the string S' is the string S":

ajj . . . . . aj . . . . . . ja . . . . . aa

5. You want to search the database DS and return all strings that have the form S" at the group-level.

6. Also, and MOST IMPORTANTLY, you want to return a string Sz from the database even if there is only a rough spacing correspondence between Sz and the template string S". The "allowable spacing difference" rule is that for a doublet Dz in Sz to match a doublet Ds" in S' (at the group-level), there must be no more than four characters between the end of Dz and the start of Ds", or vice-versa.

Edited by: David Halitsky on Jun 25, 2010 10:03 PM

Edited by: David Halitsky on Jun 25, 2010 10:05 PM

Add comment
10|10000 characters needed characters exceeded

  • Get RSS Feed

6 Answers

  • Jun 25, 2010 at 08:06 PM

    Sorry for the all the stupid typos in the original spec - I've now corrected them all so the spec is at least readable ...

    Add comment
    10|10000 characters needed characters exceeded

  • avatar image
    Former Member
    Jun 25, 2010 at 09:35 PM

    Hi David,

    I took a quick look at the pattern and have a question about if the existance of a single "a" or "j" value disqualifies a string. In item (4), the non doublet characters in the note to the side are group "t".

    Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?

    Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?

    The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.

    If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed.

    Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?

    Add comment
    10|10000 characters needed characters exceeded

    • Hi John -

      Thanks for pointing out some serious sins of omission in the spec. To answer your questions:

      Q1:

      Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?

      A1:

      A letter from group a or j that does not occur next to a letter from a or j should be counted as if it were a letter in the group t. That is, in the group-level representation of the string, the letter should be mapped as a "."

      Q2:

      Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?

      A2:

      Yes - including "pseudo-t's" that derive from "singleton" a's or j's in the sense of Q1/A1.

      You also wrote the following:

      "The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.

      If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed. "

      I think my answers to Q1 and Q2 provide the necessary clarification here - if not, please point out any remaining issues.

      Also, you posed Q3:

      Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?

      A3:

      Sorry again! For the sake of this discussion, a search string S is valid only if the number of a and j letters in doublets is at least 10% of the entire length of S.

      Thanks very much again for taking the time you have with this, John.

      Best

      djh

  • avatar image
    Former Member
    Jun 29, 2010 at 02:32 PM

    There are a couple ways to view the proximity of the doublets. For this response I picked proximity as being relative to the current matching doublet and not to the absolute position in the string.

    Due to the nature of the matching, my solution is based upon creating the doublet pattern for both the search string and the strings to be evaluated. The pattern for the search string is then converted to an expression that gives the allowances for the position variation. The example perl is:

    #!/usr/bin/perl  
    $starting = shift;
    $search = create_doublet( $starting ); 
    $search = create_match( $search ); 
    
    print "Input string\n     $starting\n     $search\n\n";
    
    $infile = shift;  
    open( INFILE, $infile );
    
    while (<INFILE>) {    
      chomp;
      my $line = $_;
      my $str = create_doublet( $line );
      print "check  $line\n";
    
      if( $str =~ /$search/ ) {
        print "Match\n\n";
      } else {
        print "No Match\n\n";
      }
    
    }
    close( INFILE );
    
    sub create_doublet {
      $new = shift;
      $new =~ s/[A-I]/A/g;
      $new =~ s/[J-S]/J/g;
      $new =~ s/[T-Z]/t/g;
      $new =~ s/^Jt/tt/g;
      $new =~ s/tJ\s*$/tt/g;
      $new =~ s/^At/tt/g;
      $new =~ s/tA\s*$/tt/g;
      $new =~ s/tAt/ttt/g;
      $new =~ s/tJt/ttt/g;
      return $new;
    }
    
    sub create_match {
      my $pattern;
      $_ = shift;
    
      @fields = /A+|J+|t+/g;
      foreach (@fields) {
        $fld = $_;
        $char = substr( $fld, 0, 1);
        $pattern .= $char;
        $cnt = length "$fld";
        if( $fld =~ /t/ ) {
          $low = $cnt - 3;
          $high = $cnt + 3;
          if( $low < 0 ) {
            $low = 0;
          }
          $pattern .= "{" . "$low" . "," . "$high" . "}";
        } else {
          $pattern .= "{" . "$cnt" . "}";
        }
    
      }
    
      if( $pattern =~ /^(A|J)/ ) {
        $pattern = "t{0,4}" . $pattern;
      }
      if( $pattern =~ /(A|J)\{[0-9]+\}\s*$/ ) {
        $pattern .= "t{0,4}";
      }
    
    
      return $pattern;
      
    }

    When this is run using an input file as follows:

    AASTQWIULKONHBGDUXXIOL
    TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
    WVWAASTQWXXIUKONHBDUXXIOL
    AASTQWIUKIONHBDUXXIOLWWX
    AASTQWIUKONHBDUXXIOLWWX
    AASVTQWVTIUKONHBDUXXWIOL
    AASTQWIUKONHBDUXXIOL

    The result is:

    Input string
         AASTQWIUKONHBDUXXIOL
         t{0,4}A{2}J{1}t{2,8}J{3}A{3}t{0,6}A{1}J{2}t{0,4}
    
    check  AASTQWIULKONHBGDUXXIOL
    No Match
    
    check  TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
    No Match
    
    check  WVWAASTQWXXIUKONHBDUXXIOL
    Match
    
    check  AASTQWIUKIONHBDUXXIOLWWX
    No Match
    
    check  AASTQWIUKONHBDUXXIOLWWX
    Match
    
    check  AASVTQWVTIUKONHBDUXXWIOL
    Match
    
    check  AASTQWIUKONHBDUXXIOL
    Match

    Check the next post for an ABAP style solution.

    Add comment
    10|10000 characters needed characters exceeded

  • avatar image
    Former Member
    Jun 29, 2010 at 02:39 PM

    The ABAP won't paste into the editor correctly so I will send you an email copy.

    Add comment
    10|10000 characters needed characters exceeded

    • John u2013 First, let me thank you for what youu2019ve done here u2013 I cannot tell you how much itu2019s appreciated. Second, if I could lean on you for just a little more of your creativity, Iu2019d like to ask if youu2019d consider doing some revisions that will really add to the utility of what youu2019ve already done. These revisions do not involve the core of what youu2019ve done u2013 the final program will still try to match a search string to corresponding strings in a file, depending again on the position of certain doublets in the search string. What needs to change are: 1) the way in which u201Capprovedu201D doublets are initially identified and the way in which the group representation of these u201Capprovedu201D doublets is determined; 2) the way in which the input file is processed (the one that is searched against) 3) the way in which the matches are ouput. The best way to spec these changes is to propose the formal parametes for an actual Perl command line and then explain what processing should result from each of these parameters. The command line we need can be illustrated as follows:

      matchem u2013core core.txt u2013xpnd xpnd.txt u2013prox n u2013targ db.txt u2013s u2018u2026.u2019
      
      where: matchem is the name of the program core.txt is a file with an initial list of u201Capprovedu201D doublets and their group representation xpnd.txt is a file with instructions as to how to expand the initial u201Ccoreu201D list of approved doublets into a final list of u201Capprovedu201D doublets (this file may be empty, in which case u201Ccore.txtu201D is the final list of approved doublets) the n value of prox sets the u201Cproximity valueu201D for the match (in the original spec, this was 4) db.txt is the file of strings which the program will match the input string against. the u201C-su201Dstring is the search string u2013 the one the program will try to find matches for in db.txt. Details on the core.txt file Here is the u201Ccore.txtu201D file containing the initial list of u201Capprovedu201D doublets and their group representation:
      FF bb
      FL bb 
      LL bb
      FA bb
      LK bl
      LS bl
      LT bl
      LY bl
      LG bb
      IP bb
      IA bb
      ML bb
      VL bb
      VS bl
      VT bl
      VG bb
      SL lb
      SP lb
      PV bb
      PG bb
      TL lb
      TP lb
      AL bb
      AV bb
      AP bb 
      AN bl
      AS bl
      YP lb
      HP lb
      QR ll
      KN ll
      KK ll
      DL ll
      EN ll
      EK ll
      ER ll 
      WR bl
      RV lb
      RY ll
      RW lb
      RR ll
      SG lb
      RR ll
      GF bb
      GL bb
      GY bl
      GK bl
      GW bb
      GR bl
      GS bl
      
      
      As you can see, the group representation of the u201Capprovedu201D doublets uses two letters u2013 a lower-case B and a lower-case L. Details on the xpnd.txt file Here is the xpnd.txt file:
      F LIVPA
      L FIMVPAW
      I FLMVPAG
      M LIVPAG
      V FLIMPAG
      P FLIMVAW
      A FLIMVPG
      W LPG
      G IMVAW
      S TYHQNKDECR
      T SNKDER
      Y SHQNDCR
      H SYQNDCR
      Q SYHKECR
      N STYHKDECR
      K STQNDER
      D STYHNKECR
      E STQNKDR
      C SYHQNDR
      R STYHQNKDEC
      
      And here is an example of the way this file works in conjunction with the u201Ccore.txtu201D file to generate the final list of u201Capprovedu201D doublets. 1) VT is an approved doublet in the core.txt file 2) V is associated with the letters FLIMPAG in the xpnd.txt file 3) T is associated with the letters SNKDER in the xpnd.txt file 4) Therefore, new approved doublets should be generated by taking each element of with each element of , in that order (for example, FS, FN, FK, FD, FE, FR) 5) For each new doublet formed in this way, the group representation should be the same as that of the original doublet in the core.text file (in this case, the group representation for each new doublet would be u201Cblu201D, since u201Cblu201D is the group representation of the original doublet VT 6) Note that the final u201Capprovedu201D doublet list should be u201Cde-duplicatedu201D so that no doublet appears more than once. Details on db.txt file Each string in the db.txt file will be on one line of the file, and each string will be preceded by a u201Cstring idu201D line with information about the string itself. Like this:
      gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)}
      MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGC
      
      Use of the u201Cstring idsu201D in the output file of matches The u201Cstring idu201D for each string in the db.txt file will be used in the output of the program u2013 in particular, when matches are reported in the output of the program, each match must be preceded (on the same line) by the sequence ID of the string in which the match was found. For example:
      gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)} MGKDYYQTLGLARGASDEEIKRAYRRQA
      
      Multiple matches within the same string in the db.txt file We want to look for ALL matches to a given search string within each string of the db.txt file. For shorter search strings, there may be many matches within a single string in the db.txt file. Again, many many thanks, John ...

  • avatar image
    Former Member
    Jun 30, 2010 at 08:46 PM

    I did a quick look at this and the expanded requirements appear to have a conflict with the example information. If the doublets allow an individual letter to compose part of two adjacent doublets, I would expect the representation for that letter would have to be consistent.

    Take sequence RVT as a simple example. I would expect that to convert to lbl. This works fine with the samples based on

    RV   => lb
     VT =>   bl

    However, if I look at sequence DLK

    DL  => ll
     LK =>  bl

    This does not make sense as I read it because the L translates to l in the first doublet and b in the second doublet. This would work if the sequence expanded but you said there would not be an expansion due to the doublet mapping.

    Can you explain how this should be handled?

    Thanks,

    Add comment
    10|10000 characters needed characters exceeded

    • Hi John -

      Those results look good, at least for contiguous search patterns.

      One clarifiicatiion and one question:

      1) Clarification: there will only ever be 20 letters total, corresponding to the 20 core amino acids that occur in proteins. So, there will never be search patterns or DB strings with the letters B, J,O,U,X,Z. I don't think the program will have to be adjusted to take this into account, I just wanted you to be aware of why some letters are "missing". Also, the fact that the expansion rules produce 399 out of the 400 possible combinations of 20x20 letters means that I have to "cut back" the productivity of the expansion rules - I thought this might be the case and I do have a principled way to cut back the expansion rules so that far less than 399 final doublets are produced by them.

      2) Question: Will the program as it stands now allow me to enter a non-contiguous search pattern like:

      CY. . . ARL . . . . . . . . . . MT
      

      with the expectation of finding matches on:

      ll. . . blb . . . . . . . . . . bl  (where "." stands for anything ...
      

      Finally, regarding testing of the PERL as it now stands - if you provide a final copy, I will test thoroughly against the real database with search strings of actual empirical interest, where I know the correct answers in advance.

      Again, let me thank you very much again for what you've done here ... almost there. I think.

      Best

      djh

  • avatar image
    Former Member
    Jul 06, 2010 at 12:36 PM

    In the original definition, the "." character was described as a letter from the set T-Z. That has been carried forward as the sets were redefined to be characters from the input set (described as the alphabet originally) that did not positionally map as a portion of a doublet.

    With that view, the matching of the doublet string was consistent between the search and db string.

    In the clarification just added:

    CY. . . ARL . . . . . . . . . . MT

    with the expectation of finding matches on:

    ll. . . blb . . . . . . . . . . bl (where "." stands for anything ...

    The addition 'where "." stands for anything' makes the intent unclear. Since 'anything' can include mapped characters, the 'greedy' tendency of the matching alogorithm now comes into play. Say the string to check for matches is:

    ll. . . blb . . . . . . . . blblbl

    Using the expression default greedy tendency, the string that matches will be:

    ll. . . blb . . . . . . . . blblbl

    Disabling the greedy tendency, the string that matches will be:

    ll. . . blb . . . . . . . . bl

    Under a single expression, the following possible match string will not be detected:

    ll. . . blb . . . . . . . . blbl

    This will be a problem in any scenario where you have multiple matches allowed at a single offset point. Since the requirements desire all possible matches with overlaps, detection of all possible matches with multiple matches at an anchor point requires generation of a match string per possible combination based on proximity size otherwise only a single greedy tendency can be supported. With the example above, I think that corresponds to 72 patterns with a proximity of 4. This expands greatly as the number of blocks of non-significant match strings increase.

    Can you clarify the matching rules you are looking for?

    Thanks,

    John Benson

    Add comment
    10|10000 characters needed characters exceeded

    • John -

      ZIP of target DB and reduced expansion rules comng at you shortly.

      Regarding example search patterns, I have a suggestion that will reduce the chance of human error (by me) if I attemot to construct the example search patterns manually.

      In particular, when you receive the target DB, could you write a one-off that simply translates all the sequences in this DB into sequences containing just "."s and doublets of interest?

      That way, it would be easy for you to select search patterns of varying length and see if they find themselves, and then, by altering spacing in the selectd patterns, you could easily check your proximity logic.

      I realize this is yet more code - believe me, I really do - but inasmuch as you've truly professionalized this effort so far, it would be great if you could find the time to write the one-off and thereby keep the example search pattern generation technique equally professional ...

      Best

      djh