paulieomeara
asked on
Perl script
Hi,
I need a Perl script to do the following, because I'm having real problems with it! I have a file which contains text (see the example below). I need to parse this file and create an output for the 2 chains in the file (A and D). The numbers I'm interested in capturing are, for example in the first line below: 38 and 146. I need all the numbers for every A and every D in the file. For the entire file I need my output to look like this:
eg
Chain A :34,35,38,39,41
Chain D :37,38,141,143 with every position captured. I'm having major problems with it so any help would be great!
Example Text:
PRO 38(A)( 291) - HIS 146(D)(4409) : 3.840
THR 39(A)( 298) - PRO 100(D)(4058) : 3.748
LYS 41(A)( 315) - HIS 146(D)(4409) : 3.787
THR 42(A)( 320) - HIS 97(D)(4030) : 3.683
THR 42(A)( 322) - ASP 99(D)(4044) : 3.780
ARG 142(A)(1084) - TYR 35(D)(3553) : 3.788
ARG 142(A)(1079) - PRO 36(D)(3564) : 3.809
ARG 142(A)(1081) - TRP 37(D)(3578) : 4.002
PRO 38(A)(CA) - PHE 34(A)(CA) : 5.092
PRO 38(A)(CA) - LEU 35(A)(CA) : 5.761
PRO 38(A)(CA) - PHE 37(A)(CA) : 3.800
PRO 38(A)(CA) - THR 39(A)(CA) : 3.775
ARG 142(A)(CA) - TYR 141(A)(CA) : 3.792
VAL 34(D)(CA) - ARG 30(D)(CA) : 5.766
VAL 34(D)(CA) - LEU 31(D)(CA) : 5.397
VAL 34(D)(CA) - LEU 32(D)(CA) : 5.830
VAL 34(D)(CA) - VAL 33(D)(CA) : 3.816
VAL 34(D)(CA) - TYR 35(D)(CA) : 3.823
VAL 34(D)(CA) - PRO 36(D)(CA) : 5.936
TYR 35(D)(CA) - LEU 31(D)(CA) : 5.682
TYR 35(D)(CA) - LEU 32(D)(CA) : 5.579
Could you please explain the relationship between your example output and the example text?
ASKER
The example text above contains a list of interactions between (A) and (D), (A) and (A), and also (D) and (D). I want to capture the numbers before all the (A)'s and all the (D)'s in the 2 lists of the input file, as shown above. What I want to do is remove redundancy, so if one of these numbers occurs more than once, I will only have it once in my output file. My output file should be split in two parts, (A) and (D), with each number that occurs in (A) and (D) found once in my output file and with these numbers occurring in order, like:
EG
Chain (A) : 34,35,38,39,41
Chain (D) : 37,38,141,143
Does this make it clearer?
#!/usr/bin/perl
use strict;
my $datafile = '/path/to/data.dat';
my %chain;
open FILE, $datafile or die "Can not open $datafile $!\n";
while (<FILE>) {
    if (/(\d+).(.)/) {
        $chain{$2}->{$1} = $1;
    }
    else {
        print "Incorrect format: $_\n";
    }
}
close FILE;
foreach my $chain (keys %chain) {
    print "Chain $chain: ";
    print join(',',keys %{$chain{$chain}}) . "\n";
}
The output from your sample data is:
Chain A: 142,38,39,41,42
Chain D: 34,35
You don't specify if you want the output sorted or not, so let me know if this is a requirement.
Ok, then why doesn't chain (A) contain 42,141,142
and why doesn't chain (D) contain 31,32,34,35,36,97,99,100,146?
Tintin's sample output does not match your sample output.
Which, if any, is correct?
ASKER
Hi Tintin,
Thanks for that, but I do need the data sorted so it's in order.
Ozo, that was just an example output. It will contain all the numbers in both the chains.
I misunderstood the requirements.
Change the if test to:
if (/(\d+)\((.).*-.*\s+(\d+)\((.)/) {
    $chain{$2}->{$1} = $1;
    $chain{$4}->{$3} = $3;
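To see what the revised test captures, here is a minimal sketch that applies the same pattern to a single line. The `parse_pair` helper is a hypothetical name used only for illustration, not part of the script above.

```perl
use strict;
use warnings;

# parse_pair is a hypothetical helper for illustration only.
# It returns the (number, chain) pair on each side of the " - " separator,
# or an empty list when the line does not match the expected layout.
sub parse_pair {
    my ($line) = @_;
    if ($line =~ /(\d+)\((.).*-.*\s+(\d+)\((.)/) {
        return ([$1, $2], [$3, $4]);
    }
    return;
}

my @pairs = parse_pair('PRO 38(A)( 291) - HIS 146(D)(4409) : 3.840');
print join(' ', map { "$_->[0]($_->[1])" } @pairs), "\n";   # prints "38(A) 146(D)"
```

The pattern relies on there being exactly one "-" on the line and on each residue number being followed immediately by its chain letter in parentheses.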
ASKER
My output will include all the numbers before all the (A)'s and all the (D)'s in the file. I want only one copy of each number and I need the output to be in order.
EG
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,99,100,146 from the above file sample
ASKER
Hi Tintin,
That works good....but I need them to work in order? Is this possible?
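On getting the numbers in order: one way, sketched here under the assumption that the numbers are stored as hash keys per chain as in the scripts in this thread, is to sort the keys numerically before joining them. The sample data and the `chain_line` helper are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical sample: residue numbers already collected per chain,
# stored as hash keys so duplicates collapse automatically.
my %chain = (
    A => { map { ($_ => 1) } 142, 38, 39, 41, 42, 34, 141 },
    D => { map { ($_ => 1) } 146, 100, 97, 35 },
);

# chain_line is a hypothetical helper: it builds one output line for a
# chain, with its numbers in ascending numeric order.
sub chain_line {
    my ($name, $numbers) = @_;
    my @sorted = sort { $a <=> $b } keys %$numbers;   # numeric, not string, comparison
    return "Chain ($name): " . join(',', @sorted);
}

print chain_line($_, $chain{$_}), "\n" for sort keys %chain;
```

A plain `sort` compares strings, which would put 100 before 35, so the `<=>` comparison is what keeps the list in numeric order.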
ASKER
Thank you...that works great.....eased my mind!
Sorry, I was trying to reproduce your sample output from your sample input.
Without that requirement the program is simple:
#!/usr/bin/perl
use strict;
my $datafile = shift || "Example.text";
my %chain;
open FILE, $datafile or die "Can not open $datafile $!\n";
while( <FILE> ){
    $chain{$2}->{$1}++ while /(\d+)\((\w)\)/g;
}
close FILE;
print "Chain ($_): ",join(",",sort{$a<=>$b}keys %{$chain{$_}}),"\n" for sort keys %chain;
__DATA__
This produces
Chain (A): 34,35,37,38,39,41,42,141,142
Chain (D): 30,31,32,33,34,35,36,37,97,99,100,146
from
PRO 38(A)( 291) - HIS 146(D)(4409) : 3.840
THR 39(A)( 298) - PRO 100(D)(4058) : 3.748
LYS 41(A)( 315) - HIS 146(D)(4409) : 3.787
THR 42(A)( 320) - HIS 97(D)(4030) : 3.683
THR 42(A)( 322) - ASP 99(D)(4044) : 3.780
ARG 142(A)(1084) - TYR 35(D)(3553) : 3.788
ARG 142(A)(1079) - PRO 36(D)(3564) : 3.809
ARG 142(A)(1081) - TRP 37(D)(3578) : 4.002
PRO 38(A)(CA) - PHE 34(A)(CA) : 5.092
PRO 38(A)(CA) - LEU 35(A)(CA) : 5.761
PRO 38(A)(CA) - PHE 37(A)(CA) : 3.800
PRO 38(A)(CA) - THR 39(A)(CA) : 3.775
ARG 142(A)(CA) - TYR 141(A)(CA) : 3.792
VAL 34(D)(CA) - ARG 30(D)(CA) : 5.766
VAL 34(D)(CA) - LEU 31(D)(CA) : 5.397
VAL 34(D)(CA) - LEU 32(D)(CA) : 5.830
VAL 34(D)(CA) - VAL 33(D)(CA) : 3.816
VAL 34(D)(CA) - TYR 35(D)(CA) : 3.823
VAL 34(D)(CA) - PRO 36(D)(CA) : 5.936
TYR 35(D)(CA) - LEU 31(D)(CA) : 5.682
TYR 35(D)(CA) - LEU 32(D)(CA) : 5.579
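For anyone wondering how the one-line loop in the script above picks up both residues on a line, here is a rough sketch of the same idea written out more verbosely. The `collect_chains` name and the three sample lines are chosen for illustration; it takes lines as arguments instead of reading a file so it can be run on its own.

```perl
use strict;
use warnings;

# collect_chains is a hypothetical helper for illustration only.
# It returns a hash reference: chain letter => { residue number => count }.
sub collect_chains {
    my @lines = @_;
    my %chain;
    for my $line (@lines) {
        # In scalar context, /g resumes after the previous match, so the
        # loop visits every "number(letter)" pair on the line.
        while ($line =~ /(\d+)\((\w)\)/g) {
            $chain{$2}{$1}++;
        }
    }
    return \%chain;
}

my $chains = collect_chains(
    'PRO 38(A)( 291) - HIS 146(D)(4409) : 3.840',
    'PRO 38(A)(CA) - PHE 34(A)(CA) : 5.092',
    'VAL 34(D)(CA) - ARG 30(D)(CA) : 5.766',
);

for my $name (sort keys %$chains) {
    print "Chain ($name): ",
        join(',', sort { $a <=> $b } keys %{ $chains->{$name} }), "\n";
}
```

Because the numbers are hash keys, a residue that appears on several lines is stored once, and the value simply counts how often it was seen. Fields such as "( 291)", "(4409)" and "(CA)" are skipped because they are not a number followed directly by a single letter in parentheses.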