StephenMcGowan asked:

Modifying a perl script

Hi there,

I'm trying to modify an existing Perl script because one of the text files it uses has changed.

The original script uses SpeciesId.txt, a file in the following format:

African Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4

Indian Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4


However, I'd like it to use SpeciesId3.txt, a file in the following format (basically no letters [A =, B =, etc.] and no "AND" instances):

African Elephant

826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
etc....

Alpaca

785.4
809.4
836.4
837.4
840.5
841.4
852.4
853.4
868.5
869.4
886.4
892.5
898.5
899.5
908.5
etc....


Would this be an easy modification to make to the existing script (Script.txt)?

Thanks,

Stephen.
SpeciesId.txt
SpeciesId3.txt
Script.txt
Perl

Last comment: StephenMcGowan, 8/22/2022 (Mon)
ozo

sub _init {
    # %species and $len are assumed to be declared elsewhere in the script
    open my $in, '<', 'SpeciesId3.txt' or die "could not open SpeciesId3.txt: $!";
    my $spec;
    while (<$in>) {
        chomp;
        next if /^\s*$/; # skip blank lines
        if (m{^([A-Z]?)\s*=?\s*(\d+(?:\.\d+)?)(?:\s+AND\s+(\d+(?:\.\d+)?))?$}) {
            # handle "letter = mass" lines; a bare mass leaves $1 as an empty string
            push @{$species{$spec}{$1}}, $2;
            push @{$species{$spec}{$1}}, $3 if defined $3;
        } else {
            # handle species name lines
            $spec = $_;
            $len = length($spec) if (length($spec) > $len);
        }
    }
    close $in;
}
ozo

The modification in http:#a39469769 assumed that
African Elephant

826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
etc....
should be treated like
African Elephant

 = 826.4 AND 836.4 AND 840.4 AND 852.4 AND 858.4 AND 886.4 AND 892.5 AND 898.5 AND 904.5 AND 920.5 AND 950.5 AND 1001.5 AND 1015.5
etc....
Is that correct?
If not, can a SpeciesId3.txt line be translated to an equivalent SpeciesId.txt line?
What would a correct translation of a SpeciesId3.txt line be?
wilcoxon

Looking at your code change (and leaving the AND regex alone), I think you are actually treating it as:
African Elephant
A = 826.4
B = 836.4
C = 840.4
...

because the optional AND portion of the regex will never match so $3 will never be anything.

I'm assuming this is the correct treatment...
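
To make the point about the optional AND group concrete, here is a minimal sketch that runs the pattern from the posted _init sub against one new-style line and one old-style line. The helper name captures is hypothetical and only exists for this illustration. On a bare mass the pattern leaves the letter capture as an empty string and the third capture undefined, so every mass for a species ends up under the same empty key.

```perl
use strict;
use warnings;

# Same pattern as in the _init sub posted earlier in the thread.
my $re = qr{^([A-Z]?)\s*=?\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$};

# Hypothetical helper: return the three captures for a line,
# or an empty list if the line does not match at all.
sub captures {
    my ($line) = @_;
    return $line =~ $re ? ($1, $2, $3) : ();
}

for my $line ('826.4', 'F = 2853.5 AND 2869.5') {
    my ($letter, $first, $second) = captures($line);
    printf "%-22s letter='%s' first=%s second=%s\n",
        $line, $letter, $first, defined $second ? $second : 'undef';
}
```

A species name such as "African Elephant" does not match the pattern, which is why it falls through to the else branch in _init.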
ASKER
StephenMcGowan

Hey,

Sorry, I should have explained better.

So, the original file had data which looked like this:

African Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4


The difference with the new, updated file is that:

1) The letters for each mass have been removed (e.g. B = 1453.7, C = 1577.8); each animal now has just a long list of masses, one per line, with no identification letter in front.
2) There are a lot more masses for each animal (the example above has 6; the new file has roughly 100 per animal).
3) The AND handling is no longer required and can be written out of the script. There are no longer cases where two masses are required to generate a +1 score.
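
Given those three differences, the reading step can probably be simplified quite a bit. The following is only a sketch, not the code from Script.txt: the sub name parse_species and the returned structure (species name mapped to a flat list of masses) are assumptions made for illustration. It treats any line that is a bare number as a mass and any other non-blank line as a species name.

```perl
use strict;
use warnings;

# Parse lines in the SpeciesId3.txt layout: a species name, then one mass per line.
# Returns a hash ref: species name => array ref of masses (kept as strings).
# parse_species is a hypothetical name, not a sub from the original script.
sub parse_species {
    my (@lines) = @_;
    my (%species, $spec);
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;           # trim surrounding whitespace and newline
        next if $line eq '';               # skip blank lines
        if ($line =~ /^\d+(?:\.\d+)?$/) {  # a bare mass such as 826.4
            die "mass '$line' appears before any species name\n"
                unless defined $spec;
            push @{ $species{$spec} }, $line;
        } else {                           # anything else is taken as a species name
            $spec = $line;
            $species{$spec} ||= [];
        }
    }
    return \%species;
}

my $species = parse_species(
    "African Elephant", "", "826.4", "836.4", "", "Alpaca", "", "785.4",
);
printf "%s: %d masses\n", $_, scalar @{ $species->{$_} } for sort keys %$species;
```

With a real file the lines could be passed in directly, for example parse_species(<$in>) on an opened handle. Note that a placeholder line such as "etc...." would be taken as a species name, so this assumes the real file contains only names, masses and blank lines.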

So, in essence, the script takes a CSV data file, reads the masses from it, and runs through each animal's list looking for matches.
The top 5 matches are reported with their match scores and match percentages (the score as a percentage of the number of masses in that animal's list).
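
As a rough sketch of that matching step (the sub name score_species is hypothetical, and the masses below are illustrative values rather than data from the attached files): each species mass that also occurs among the CSV masses adds +1 to the score, and the percentage is the score divided by the length of that species' list. Masses are compared as exact strings here, which is an assumption; the original script may well compare them differently.

```perl
use strict;
use warnings;

# Count how many masses in one species list also occur among the CSV masses,
# and express that count as a percentage of the species list length.
# Both arguments are array refs; masses are compared as exact strings.
sub score_species {
    my ($csv_masses, $species_masses) = @_;
    my %seen  = map { $_ => 1 } @$csv_masses;         # lookup table of CSV masses
    my $score = grep { $seen{$_} } @$species_masses;  # grep in scalar context counts
    my $total = @$species_masses;
    my $percent = $total ? 100 * $score / $total : 0; # avoid dividing by zero
    return ($score, $percent);
}

# Illustrative values only.
my ($score, $percent) = score_species(
    [qw(826.4 840.4 900.5)],
    [qw(826.4 836.4 840.4 852.4)],
);
printf "%d matches, %.0f%%\n", $score, $percent;   # prints "2 matches, 50%"
```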

Thanks for getting back to me, sorry if this has caused confusion!

Hope this is clearer!
ASKER CERTIFIED SOLUTION
ozo

ASKER
StephenMcGowan

Here's a simple example.

So the CSV file has 12 masses.

It then checks the masses against the SpeciesID text file:

Hippo

2
3
5
6
7
32
94
109

Rhino

2
3
5
12
24
33
58

Hyena

1
4
30
34
56
60
64


Penguin

1
2
4
9
12
30
32
33
36
50
58
60

Ox

1
3
4
13
70


Matching against the mass column in the CSV file, the scores would be like this:

Hippo

2 [+1 score]
3
5
6
7
32 [+1 score]
94
109

Rhino

2 [+1 score]
3
5
12 [+1 score]
24
33 [+1 score]
58 [+1 score]

Hyena

1 [+1 score]
4 [+1 score]
30 [+1 score]
34
56
60 [+1 score]
64


Penguin

1 [+1 score]
2 [+1 score]
4 [+1 score]
9 [+1 score]
12 [+1 score]
30 [+1 score]
32 [+1 score]
33 [+1 score]
36 [+1 score]
50 [+1 score]
58 [+1 score]
60 [+1 score]

Ox

1 [+1 score]
3
4 [+1 score]
13
70


So the masses in CSV file A5 are most likely to be Penguin, since all 12 of its masses match (100%), followed by Rhino and Hyena with a score of 4 each (both lists are 7 masses long, so each gets a match percentage of 57%).

So the top five for A5 would be:

1) Penguin  12 matches  100%
2) Rhino    4 matches   57%
3) Hyena    4 matches   57%
4) Ox       2 matches   40%
5) Hippo    2 matches   25%


Just a simplified example of what I'm trying to achieve out of this script.
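
For the ranking itself, one possible sketch is below. The sub name top_matches and the entry layout (name, matches, total) are assumptions for illustration, and the match counts are the ones from the simplified example above. The thread does not say how ties should be ordered, so breaking ties by percentage and then by name is an arbitrary choice here; with it, Hyena is listed ahead of Rhino.

```perl
use strict;
use warnings;

# Rank species by match count, then by match percentage, then by name,
# and keep only the first $n entries. Each entry is a hash ref with the
# keys name, matches and total (the length of that species' mass list).
sub top_matches {
    my ($n, @entries) = @_;
    my $pct = sub { $_[0]{total} ? $_[0]{matches} / $_[0]{total} : 0 };
    my @ranked = sort {
           $b->{matches} <=> $a->{matches}
        || $pct->($b)    <=> $pct->($a)
        || $a->{name}    cmp $b->{name}
    } @entries;
    $#ranked = $n - 1 if @ranked > $n;   # truncate to the top $n
    return @ranked;
}

# Counts taken from the simplified Hippo/Rhino/Hyena/Penguin/Ox example.
my @top = top_matches(5,
    { name => 'Hippo',   matches => 2,  total => 8  },
    { name => 'Rhino',   matches => 4,  total => 7  },
    { name => 'Hyena',   matches => 4,  total => 7  },
    { name => 'Penguin', matches => 12, total => 12 },
    { name => 'Ox',      matches => 2,  total => 5  },
);
printf "%d) %-8s %2d matches  %3.0f%%\n",
    $_ + 1, $top[$_]{name}, $top[$_]{matches},
    100 * $top[$_]{matches} / $top[$_]{total}
    for 0 .. $#top;
```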
20130730-p12-A5-EXAMPLE.csv
SpeciesID-EXAMPLE.txt