paulieomeara

Perl script

Hi,

I need a Perl script to do the following, because I'm having real problems with it. I have a file which contains text (see the example below). I need to parse this file and create output for the two chains in the file (A and D). The numbers I'm interested in capturing are, for example, in the first line below: 38 and 146. I need all the numbers for every A and every D in the file. For the entire file I need my output to look like this:
e.g.
Chain A: 34,35,38,39,41
Chain D: 37,38,141,143
with every position captured. I'm having major problems with it, so any help would be great!


Example Text:

PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840
THR  39(A)( 298)   - PRO 100(D)(4058)   :   3.748
LYS  41(A)( 315)   - HIS 146(D)(4409)   :   3.787
THR  42(A)( 320)   - HIS  97(D)(4030)   :   3.683
THR  42(A)( 322)   - ASP  99(D)(4044)   :   3.780
ARG 142(A)(1084)   - TYR  35(D)(3553)   :   3.788
ARG 142(A)(1079)   - PRO  36(D)(3564)   :   3.809
ARG 142(A)(1081)   - TRP  37(D)(3578)   :   4.002
PRO  38(A)(CA)   - PHE  34(A)(CA)   :   5.092
PRO  38(A)(CA)   - LEU  35(A)(CA)   :   5.761
PRO  38(A)(CA)   - PHE  37(A)(CA)   :   3.800
PRO  38(A)(CA)   - THR  39(A)(CA)   :   3.775
ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792
VAL  34(D)(CA)   - ARG  30(D)(CA)   :   5.766
VAL  34(D)(CA)   - LEU  31(D)(CA)   :   5.397
VAL  34(D)(CA)   - LEU  32(D)(CA)   :   5.830
VAL  34(D)(CA)   - VAL  33(D)(CA)   :   3.816
VAL  34(D)(CA)   - TYR  35(D)(CA)   :   3.823
VAL  34(D)(CA)   - PRO  36(D)(CA)   :   5.936
TYR  35(D)(CA)   - LEU  31(D)(CA)   :   5.682
TYR  35(D)(CA)   - LEU  32(D)(CA)   :   5.579
ozo

Could you please explain the relationship between your example output and the example text?
paulieomeara

ASKER

The example text above contains a list of interactions between (A) and (D), (A) and (A), and also (D) and (D). I want to capture the numbers before all the (A)'s and all the (D)'s in the two columns of the input file, as shown above. I also want to remove redundancy, so if one of these numbers occurs more than once, I will only have it once in my output file. My output file should be split into two parts, (A) and (D), with each number that occurs in (A) and (D) appearing once and with the numbers in ascending order, like:
e.g.
Chain (A): 34,35,38,39,41
Chain (D): 37,38,141,143
 

Does this make it clearer?
#!/usr/bin/perl
use strict;
use warnings;

my $datafile = '/path/to/data.dat';
my %chain;

open my $fh, '<', $datafile or die "Cannot open $datafile: $!\n";

while (<$fh>) {
        chomp;
        # First number on the line, then any one character, then the chain letter
        if (/(\d+).(.)/) {
                $chain{$2}->{$1} = $1;
        }
        else {
                print "Incorrect format: $_\n";
        }
}

close $fh;

# Keys come back in no particular order
foreach my $id (keys %chain) {
        print "Chain $id: ";
        print join(',', keys %{$chain{$id}}) . "\n";
}

The output from your sample data is:

Chain A: 142,38,39,41,42
Chain D: 34,35

You don't specify if you want the output sorted or not, so let me know if this is a requirement.
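
If sorting is needed: `keys` returns hash keys in no particular order, and a plain `sort` compares them as strings, so the residue numbers need an explicit numeric sort. A minimal sketch, assuming the numbers are stored as hash keys as in the script above (the `sorted_positions` helper name is only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the keys of a hash ref in ascending numeric order.
sub sorted_positions {
    my ($seen) = @_;
    return sort { $a <=> $b } keys %$seen;
}

# The chain A positions from the output above, stored as hash keys.
my %seen = map { $_ => 1 } (142, 38, 39, 41, 42);

# A plain sort compares as strings, so 142 comes before 38.
print join(',', sort keys %seen), "\n";            # 142,38,39,41,42
# The numeric comparison puts 142 last.
print join(',', sorted_positions(\%seen)), "\n";   # 38,39,41,42,142
```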
OK, then why doesn't chain (A) contain 42,141,142,
and why doesn't chain (D) contain 31,32,34,35,36,97,99,100,146?
Tintin's sample output does not match your sample output.
Which, if any, is correct?
Hi Tintin,

Thanks for that, but I do need the data sorted so it's in order.

Ozo, that was just an example output. The real output will contain all the numbers in both chains.
I misunderstood the requirements.

Change the if test to:

        # Capture number and chain letter on both sides of the "-"
        if (/(\d+)\((.).*-.*\s+(\d+)\((.)/) {
                $chain{$2}->{$1} = $1;
                $chain{$4}->{$3} = $3;
        }
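
Put together, the changed test could look like the sketch below. This is only an illustration and not necessarily the accepted solution: the `collect_chains` helper and the inline sample lines are assumptions, and the numeric sort is added so the positions come out in order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect residue numbers per chain from interaction lines.
sub collect_chains {
    my @lines = @_;
    my %chain;
    for my $line (@lines) {
        # Capture "number(chain" on each side of the "-" separator.
        if ($line =~ /(\d+)\((.).*-.*\s+(\d+)\((.)/) {
            $chain{$2}{$1} = 1;   # left-hand residue
            $chain{$4}{$3} = 1;   # right-hand residue
        }
    }
    return \%chain;
}

# Three lines taken from the example text, one per interaction type.
my $chain = collect_chains(
    'PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840',
    'ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792',
    'VAL  34(D)(CA)   - ARG  30(D)(CA)   :   5.766',
);

# Chains alphabetically, positions numerically.
for my $id (sort keys %$chain) {
    print "Chain ($id): ",
          join(',', sort { $a <=> $b } keys %{ $chain->{$id} }), "\n";
}
```

Using the numbers as hash keys is what removes the duplicates: a position seen twice simply overwrites the same key.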
My output will include all the numbers before all the (A)'s and all the (D)'s in the file. I want only one copy of each number, and I need the output to be in order.

e.g.
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,99,100,146
from the above file sample
ASKER CERTIFIED SOLUTION
Tintin

Hi Tintin,

That works well, but I need the numbers to be in order. Is this possible?
Thank you, that works great. It has eased my mind!
Sorry, I was trying to reproduce your sample output from your sample input.
Without that requirement the program is simple:

#!/usr/bin/perl
use strict;
use warnings;

my $datafile = shift || "Example.text";
my %chain;

open my $fh, '<', $datafile or die "Cannot open $datafile: $!\n";

while (<$fh>) {
        # Record every "number(chain)" pair on the line, e.g. 38(A) or 146(D)
        $chain{$2}->{$1}++ while /(\d+)\((\w)\)/g;
}

close $fh;

# Chains in alphabetical order, positions in numeric order
print "Chain ($_): ", join(",", sort { $a <=> $b } keys %{$chain{$_}}), "\n" for sort keys %chain;

__DATA__
This produces
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,97,99,100,146
from
PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840
THR  39(A)( 298)   - PRO 100(D)(4058)   :   3.748
LYS  41(A)( 315)   - HIS 146(D)(4409)   :   3.787
THR  42(A)( 320)   - HIS  97(D)(4030)   :   3.683
THR  42(A)( 322)   - ASP  99(D)(4044)   :   3.780
ARG 142(A)(1084)   - TYR  35(D)(3553)   :   3.788
ARG 142(A)(1079)   - PRO  36(D)(3564)   :   3.809
ARG 142(A)(1081)   - TRP  37(D)(3578)   :   4.002
PRO  38(A)(CA)   - PHE  34(A)(CA)   :   5.092
PRO  38(A)(CA)   - LEU  35(A)(CA)   :   5.761
PRO  38(A)(CA)   - PHE  37(A)(CA)   :   3.800
PRO  38(A)(CA)   - THR  39(A)(CA)   :   3.775
ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792
VAL  34(D)(CA)   - ARG  30(D)(CA)   :   5.766
VAL  34(D)(CA)   - LEU  31(D)(CA)   :   5.397
VAL  34(D)(CA)   - LEU  32(D)(CA)   :   5.830
VAL  34(D)(CA)   - VAL  33(D)(CA)   :   3.816
VAL  34(D)(CA)   - TYR  35(D)(CA)   :   3.823
VAL  34(D)(CA)   - PRO  36(D)(CA)   :   5.936
TYR  35(D)(CA)   - LEU  31(D)(CA)   :   5.682
TYR  35(D)(CA)   - LEU  32(D)(CA)   :   5.579