Solved

Perl script

Posted on 2004-03-30
Last Modified: 2010-03-04
Hi,

I need a Perl script to do the following, because I'm having real problems with it! I have a file which contains text (see the example below). I need to parse this file and create output for the 2 chains in the file (A and D). The numbers I'm interested in capturing are, for example, in the first line below: 38 and 146. I need all the numbers for every A and every D in the file. For the entire file I need my output to look like this, e.g.:
Chain A: 34,35,38,39,41
Chain D: 37,38,141,143
with every position captured. I'm having major problems with it, so any help would be great!


Example Text:

PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840
THR  39(A)( 298)   - PRO 100(D)(4058)   :   3.748
LYS  41(A)( 315)   - HIS 146(D)(4409)   :   3.787
THR  42(A)( 320)   - HIS  97(D)(4030)   :   3.683
THR  42(A)( 322)   - ASP  99(D)(4044)   :   3.780
ARG 142(A)(1084)   - TYR  35(D)(3553)   :   3.788
ARG 142(A)(1079)   - PRO  36(D)(3564)   :   3.809
ARG 142(A)(1081)   - TRP  37(D)(3578)   :   4.002
PRO  38(A)(CA)   - PHE  34(A)(CA)   :   5.092
PRO  38(A)(CA)   - LEU  35(A)(CA)   :   5.761
PRO  38(A)(CA)   - PHE  37(A)(CA)   :   3.800
PRO  38(A)(CA)   - THR  39(A)(CA)   :   3.775
ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792
VAL  34(D)(CA)   - ARG  30(D)(CA)   :   5.766
VAL  34(D)(CA)   - LEU  31(D)(CA)   :   5.397
VAL  34(D)(CA)   - LEU  32(D)(CA)   :   5.830
VAL  34(D)(CA)   - VAL  33(D)(CA)   :   3.816
VAL  34(D)(CA)   - TYR  35(D)(CA)   :   3.823
VAL  34(D)(CA)   - PRO  36(D)(CA)   :   5.936
TYR  35(D)(CA)   - LEU  31(D)(CA)   :   5.682
TYR  35(D)(CA)   - LEU  32(D)(CA)   :   5.579
Question by:paulieomeara

Expert Comment

by:ozo
ID: 10720667
Could you please explain the relationship between your example output and the example text

Author Comment

by:paulieomeara
ID: 10720735
The example text above contains a list of interactions between (A) and (D), (A) and (A), and also (D) and (D). I want to capture the numbers before all the (A)'s and all the (D)'s in the two lists of the input file, as shown above. I want to remove redundancy, so if one of these numbers occurs more than once, I will only have it once in my output file. My output file should be split into two parts, (A) and (D), with each number that occurs in (A) and (D) found once in my output file and with these numbers occurring in order, like:
EG
Chain (A) : 34,35,38,39,41
Chain (D) : 37,38,141,143      
 

Does this make it clearer?

Expert Comment

by:Tintin
ID: 10720775
#!/usr/bin/perl
use strict;
use warnings;

my $datafile = '/path/to/data.dat';
my %chain;

# Three-argument open with a lexical filehandle
open my $fh, '<', $datafile or die "Can not open $datafile: $!\n";

while (<$fh>) {
        # First number on the line, then the character after the "(" that follows it
        if (/(\d+).(.)/) {
                $chain{$2}->{$1} = $1;
        }
        else {
                print "Incorrect format: $_\n";
        }
}

close $fh;

foreach my $chain (keys %chain) {
        print "Chain $chain: ";
        print join(',', keys %{ $chain{$chain} }) . "\n";
}

The output from your sample data is (hash keys come back in no particular order, so yours may be ordered differently):

Chain A: 142,38,39,41,42
Chain D: 34,35

You don't specify if you want the output sorted or not, so let me know if this is a requirement.
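
If sorting does turn out to be a requirement, note that the default sort compares strings, which puts 142 ahead of 38. A minimal sketch of the difference, assuming the numbers are plain integers (the @positions list here is made up for illustration, taken from the Chain A output above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative list only: the Chain A numbers in the order printed above.
my @positions = (142, 38, 39, 41, 42);

# Default sort compares as strings, so "142" sorts before "38".
my @as_strings = sort @positions;

# The <=> operator compares numerically instead.
my @as_numbers = sort { $a <=> $b } @positions;

print "String sort:  ", join(',', @as_strings), "\n";
print "Numeric sort: ", join(',', @as_numbers), "\n";
```

So for residue numbers, a sort block using <=> is probably what you want.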

Expert Comment

by:ozo
ID: 10720793
Ok, then why doesn't chain (A) contain 42,141,142
and why doesn't chain (D) contain 31,32,34,35,36,97,99,100,146

Expert Comment

by:ozo
ID: 10720828
Tintin's sample output does not match your sample output.
Which, if any, is correct?

Author Comment

by:paulieomeara
ID: 10720830
Hi Tintin,

Thanks for that, but I do need the data sorted so it's in order.

Ozo, that was just an example output. It will contain all the numbers in both the chains.

Expert Comment

by:Tintin
ID: 10720836
I misunderstood the requirements.

Change the if test to:

        if (/(\d+)\((.).*-.*\s+(\d+)\((.)/) {
                $chain{$2}->{$1} = $1;
                $chain{$4}->{$3} = $3;
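
To see what the new pattern captures, here is a small sketch run against the first line of the sample data. The variable names are only for illustration, and it assumes each line has exactly one "-" separating the two residues, as in the sample:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One line copied from the sample data in the question.
my $line = 'PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840';

# In list context the match returns the four captured groups:
# number and chain letter on the left of the "-", then on the right.
my ($left_num, $left_chain, $right_num, $right_chain) =
    $line =~ /(\d+)\((.).*-.*\s+(\d+)\((.)/;

print "left:  $left_num in chain $left_chain\n";
print "right: $right_num in chain $right_chain\n";
```

For this line the left side gives 38 in chain A and the right side gives 146 in chain D, which is why both $chain{$2} and $chain{$4} get updated.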

Author Comment

by:paulieomeara
ID: 10720869
My output will include all the numbers before all the (A)'s and all the (D)'s in the file. I want only one copy of each number, and I need the output to be in order.

EG
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,99,100,146
from the above file sample

Accepted Solution

by:
Tintin earned 500 total points
ID: 10720887
Putting the whole thing together:

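#!/usr/bin/perl
use strict;
use warnings;

my $datafile = 'data';
my %chain;

# Three-argument open with a lexical filehandle
open my $fh, '<', $datafile or die "Can not open $datafile: $!\n";

while (<$fh>) {
        # Capture the residue number and chain letter on each side of the "-"
        if (/(\d+)\((.).*-.*\s+(\d+)\((.)/) {
                $chain{$2}->{$1} = $1;
                $chain{$4}->{$3} = $3;
        }
        else {
                print "Incorrect format: $_\n";
        }
}

close $fh;

# Chains in alphabetical order, numbers in ascending numeric order
foreach my $chain (sort keys %chain) {
        print "Chain $chain: ";
        print join(',', sort { $a <=> $b } keys %{ $chain{$chain} }) . "\n";
}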


Output is now:

Chain A: 34,35,37,38,39,41,42,141,142
Chain D: 30,31,32,33,34,35,36,37,97,99,100,146

Author Comment

by:paulieomeara
ID: 10720889
Hi Tintin,

That works well, but I need them to come out in order. Is this possible?

Author Comment

by:paulieomeara
ID: 10720910
Thank you...that works great.....eased my mind!

Expert Comment

by:ozo
ID: 10720992
Sorry, I was trying to reproduce your sample output from your sample input.
Without that requirement the program is simple:

#!/usr/bin/perl
use strict;
use warnings;

# File name from the command line, falling back to a default
my $datafile = shift || "Example.text";
my %chain;

open my $fh, '<', $datafile or die "Can not open $datafile: $!\n";

while( <$fh> ){
        # Record every number that is directly followed by a one-character chain in parentheses
        $chain{$2}->{$1}++ while /(\d+)\((\w)\)/g;
}

close $fh;

print "Chain ($_): ",join(",",sort{$a<=>$b}keys %{$chain{$_}}),"\n" for sort keys %chain;

__DATA__
This produces
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,97,99,100,146
from
PRO  38(A)( 291)   - HIS 146(D)(4409)   :   3.840
THR  39(A)( 298)   - PRO 100(D)(4058)   :   3.748
LYS  41(A)( 315)   - HIS 146(D)(4409)   :   3.787
THR  42(A)( 320)   - HIS  97(D)(4030)   :   3.683
THR  42(A)( 322)   - ASP  99(D)(4044)   :   3.780
ARG 142(A)(1084)   - TYR  35(D)(3553)   :   3.788
ARG 142(A)(1079)   - PRO  36(D)(3564)   :   3.809
ARG 142(A)(1081)   - TRP  37(D)(3578)   :   4.002
PRO  38(A)(CA)   - PHE  34(A)(CA)   :   5.092
PRO  38(A)(CA)   - LEU  35(A)(CA)   :   5.761
PRO  38(A)(CA)   - PHE  37(A)(CA)   :   3.800
PRO  38(A)(CA)   - THR  39(A)(CA)   :   3.775
ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792
VAL  34(D)(CA)   - ARG  30(D)(CA)   :   5.766
VAL  34(D)(CA)   - LEU  31(D)(CA)   :   5.397
VAL  34(D)(CA)   - LEU  32(D)(CA)   :   5.830
VAL  34(D)(CA)   - VAL  33(D)(CA)   :   3.816
VAL  34(D)(CA)   - TYR  35(D)(CA)   :   3.823
VAL  34(D)(CA)   - PRO  36(D)(CA)   :   5.936
TYR  35(D)(CA)   - LEU  31(D)(CA)   :   5.682
TYR  35(D)(CA)   - LEU  32(D)(CA)   :   5.579
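
The key step is the /g match in a while loop, which visits every number-and-chain pair on a line rather than just the first. A minimal sketch of that step on one sample line (the @pairs array is only for illustration and is not part of the program above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One line copied from the sample data.
my $line = 'ARG 142(A)(CA)   - TYR 141(A)(CA)   :   3.792';

# Each pass of the loop resumes where the previous match ended.
# "(CA)" is skipped because no digits come directly before its "(".
my @pairs;
while ($line =~ /(\d+)\((\w)\)/g) {
    push @pairs, "$2:$1";
}

print join(' ', @pairs), "\n";   # prints "A:142 A:141"
```

Because the pattern does not depend on the "-" separator, it should also cope with lines that hold more or fewer than two residues, though that has only been checked against the sample format shown here.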
