Solved

perl soundex function help

Posted on 2011-02-26
5
389 Views
Last Modified: 2012-05-11
I am attempting to write a soundex function.
With the below code here is the result when i enter Euler:
U
1
1000
It is grabbing the the last number rather than the corresponding letter to number
Any help woild be appreciated

THE SOUNDEX ALGORITHM
 
The "Soundex" method of transforming names to a code that groups names/words that sound alike was originally  
1.  Retain the first letter of the name for use in step 5.
 
2.  Assign the following numbers to the letters in the word (case insensitive):
 
LETTERS TO CONVERT  CODE TO USE
B F P V b f p v  1
C G J K Q S X Z c g j k q s x z  2
D T d t  3
L l  4
M N m n  5
R r  6
 
3.  If two or more letters with the same code were adjacent in the original name (before step 2), including the
first and any subsequent letters (see "Lloyd” example below), omit all but the first.
 
4.  Drop all occurrences of A, E, H, I, O, U W, and Y in other positions.   Also drop any non-alphabetic
characters such as the dash (-) and apostrophe (').
 
5.  Convert to the form "letter, digit, digit, digit, digit" by adding trailing zeros (if there are less than four digits), or
by dropping rightmost digits (if there are more than four) where the "letter” comes from step 1.  Your “letter”
should be capitalized.
 
For example, the names Euler, Gauss, Hilbert, Knuth, Lloyd, and Lukasieicz have the respective codes E4600, G2000, H4163, K5300, L3000, L2220.  Notice especially how Lloyd is converted to L3000 and how Euler converted to E4600 and not E6000.
my $temp = uc shift;    # gets the name passed to the function
#  chomp $temp;        # and uppercases it

    $temp =~ s/^[^a-z]//; # remove first character if not letter
  
    my ($f, $fc) = ($temp =~ /(\w)(\w+)/); #save the 1st char
  
 
    my $xx = ($fc =~ s/AEHIOUWY//);#remove letters

    my $temp2 = ( $xx =~ tr/BFPVCGJKQSXZDTLMNR/111122222222334556/); #replace remaining letters with numbers
 
     my $voo = ($temp2 =~ /^(.)/);  # Grab the numbers
     my $foo = ($voo =~  s/^$fc+//); #replace any leading spaces
     my $boo = ($foo =~  tr///cs);  #tran non-alpha chars to spaces
     my $zoo = ($boo =~  tr/0//d); #delete all zeros
     my $sdxV = ($zoo =~ $f . $zoo . '000'); #concatinate the 1st letter with corresponding number and add 3 zeros if needed
     
    my $sdxValue =  ($sdxV =~ s/^(.{4}).*/$1/);
   print "$sdxValue\n";

print "$f\n";
print "$voo\n";
print "$sdxValue\n";

Open in new window

0
Comment
Question by:fac66
  • 3
  • 2
5 Comments
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
This may not be the only bug, but you have alrady upcased $temp on input, and so you need to either use a /i on the search or use upper-case A-Z, like this:

 $temp =~ s/^[^A-Z]//; # remove first character if not letter

Open in new window

0
 

Author Comment

by:fac66
Comment Utility
Yes you are correct i caught that early on.
Maybe you can help me with another problem.

I need to replace duplicate letters with only one for example Lloyd is returning 4300
where is thould return 3000. If you read above explaining the soundex conversion rules.
None of these work:

my $temp2 = ($fc =~ s/(\w)\1/$1/g);
 #$fc =~ s/(\w)\1/$1/;
 #my $temp2 = ($fc =~ s/(.)(?=.*?\1)/$1/g);
 #my $temp2 = ($fc =~ tr/a-zA-Z/a-zA-Z/s);
 
my $temp = uc shift;    # gets the name passed to the function

    $temp =~ s/^[^A-Z]//; # remove first character if not letter
    my ($f, $fc) = ($temp =~ /(\w)(\w+)/);
    print "$f\n";
    #print "$fc\n";
    $fc =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    $fc =~  s/0+//g; #gets rid of extra 0's

#this section I am trying to eliminate duplicate letters
    my $temp2 = ($fc =~ s/(\w)\1/$1/g);
    #$fc =~ s/(\w)\1/$1/;
    #my $temp2 = ($fc =~ s/(.)(?=.*?\1)/$1/g);
    #my $temp2 = ($fc =~ tr/a-zA-Z/a-zA-Z/s);
    #print "$temp2\n";
     print "$fc\n";
    if ($fc =~ /^(.{4}).*/g)
        {
          print "$fc\n";
       }
          else 
        {
                $fc = $fc."0"x(4 - length($fc));
                print "$fc\n";
         }

Open in new window

0
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
Comment Utility
I made several changes.  You were trying to get rid out doubled letters too late.

Plus there is no regex (that I could find) that would match doubled letters anywhere in the string.  The \1 hack only gets rid of double letters at the start of the string, I believe.

my $temp = uc shift;    # gets the name passed to the function

$temp =~ s/^[^A-Z]//;   # remove first character if not letter

# Eliminate doubled-letters

@x = split(//,$temp);
$temp = '';
$lastx = '';

foreach $x ( @x)
{
    if ( ( $lastx eq '' ) || ( $x ne $lastx ) )
    {
        $temp .= $x;
        $lastx = $x;
    }
}
print STDERR "$temp\n";

# Split into first letter and the rest

my ($f, $fc) = ($temp =~ /(\w)(\w+)/);
print "$f\n";

# soundex weighting

$fc =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
$fc =~  s/0+//g; #gets rid of extra 0's
print "$fc\n";

# just the first four

if ($fc =~ /^(.{4}).*/g)
{
    print "$fc\n";
}
else 
{
    $fc = $fc."0"x(4 - length($fc));
    print "$fc\n";
}

Open in new window

0
 
LVL 16

Expert Comment

by:sjklein42
Comment Utility
I think this:

$fc = substr($fc.'000',0,4);    # first four digits, pad with zeroes on right if needed
print "$fc\n";

Open in new window


can replace this:

if ($fc =~ /^(.{4}).*/g)
{
    print "$fc\n";
}
else 
{
    $fc = $fc."0"x(4 - length($fc));
    print "$fc\n";
} 

Open in new window

0
 

Author Comment

by:fac66
Comment Utility
Thank you that worked
0

Featured Post

How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
String Substitution 4 65
Reading fields from the text file. 4 85
delete query using perl dbi 3 92
remove duplicates from the csv file 13 95
I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
I have been pestered over the years to produce and distribute regular data extracts, and often the request have explicitly requested the data be emailed as an Excel attachement; specifically Excel, as it appears: CSV files confuse (no Red or Green h…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now