Link to home
Start Free TrialLog in
Avatar of justmorri
justmorriFlag for United States of America

asked on

Help with Perl While Loop

Is there any way to make this script run faster? Right now it is taking almost an hour to
process a 91,000 flat file. Here is the while loop code which runs through each line in the
file and insert it into an oracle database table.  

my $line = 0;
my $c = 0;

while (<IN>) {
  chomp($_);
  $line += 1;

  my ($field1,
    $field2,
    $field3,
    $field4,
    $field5,
    $field6,
    $field7,
    $field8,
    $field9,
    $field10,
    $field11,
    $field12,
    $field13,
    $field14,
    $field15,
    $field16,
    $field17,
    $field18,
    $field19,
    $field20,
    $field21,
    $field22,
    $field23,
    $field24) = split(/ /,$_);

  &fix($field1);
  &fix($field2);
  &fix($field3);
  &fix($field4);
  &fix($field5);
  &fix($field6);
  &fix($field7);
  &fix($field8);
  &fix($field9);
  &fix($field10);
  &fix($field11);
  &fix($field12);
  &fix($field13);
  &fix($field14);
  &fix($field15);
  &fix($field16);
  &fix($field17);
  &fix($field18);
  &fix($field19);
  &fix($field20);
  &fix($field21);
  &fix($field22);                
  &fix($field23);
  &fix($field24);
     
  if ($field1 ne ' ') {
    $field1 =~ s/\///g;
    my ($month, $day, $year) = unpack "A2A2A4", $field1;  
    $field1 = $year.''.$month.''.$day;                    
  }
 
  $field23 = lc($field23);
  $field24 = lc($field24);

  if (($field15 ne "A") && ($field16 ne "B")) {
    eval {
      $sql = "insert into myTable values (\'$field1\',\'$field2\',\'$field3\',\'$field4\',\'$field5\',\'$field6\',\'$field7\',\
'$field8\',\'$field9\',\'$field10\',\'$field11\',\'$field12\',\'$field13\',\'$field14\',\'$field15\',\'$field16\',\
'$field17\',\'$field18\',\'$field19\',\'$field20\',\'$field21\',\'$field22\',\'$field23\',\'$field24\') ";
      $sth = $dbh->prepare($sql);
      $sth->execute();
    };
    if ($@) { print "$@\n"; next; }
    $dbh->commit();
    $c += 1;
  }
}


Subroutine that's called -

sub fix {
  $_[0] =~ s/^ *//;
  $_[0] =~ s/ *$//;              
  $_[0] =~ s/'/''/g;            
  if (! $_[0]) {
    $_[0] = ' ';
    }
  }
Avatar of ozo
ozo
Flag of United States of America image

The perl code can be streamlined, but I would expect that most of the time would be taken by the  $sth->execute();
You might try moving the prepare out of the loop and doing a bind in the loop, or only doing a commit after several inserts.
There are a number of inefficiencies and questionable coding practices in that code, but without knowing where the script is spending most of its time, we can't say for certain which of those inefficiencies you should focus on fixing.

You can and should use the Devel::NYTProf module to learn where the script is spending most of its time.  We can also make some educated guesses, like ozo did, and correct some of those inefficiencies.

If you use ' ' for the pattern in the split statement instead of / /, it will strip the leading and trailing spaces from the fields, so you won't need to do that in the fix() sub.  And, instead of creating 24 sequentially numbered scalars, it would be better to use an array.

Instead of calling the fix() sub 24 times for each line in the file, it would be better to alter the sub to have it accept an array reference.

my @fields = split(' ', $_);
fix(\@fields);

Open in new window

An even better option/approach would be to use the Text::CSV module to parse the lines of the file.

ozo has already suggested moving the prepare statement, which should defiantly be done.  When doing that, you need to use placeholders in the prepare statement instead of passing the $fieldX vars.  Those vars (or the array) would be passed in the execute statement.
Placeholders and Bind Values

There are additional improvements that I can suggest, but start with those and we'll cover the others as needed.
Avatar of justmorri

ASKER

Thanks, ozo and FishMonger! I'll try those suggestions and report back.
my $c = 0;
my @field;
$sth = $dbh->prepare("insert into myTable values (?,?,?,?,?,?,?,?,?,?,?,?.?,?,?,?,?,?,?,?,?,?,?,?)");
while (<IN>) {
  @field[1..24] = split;
  next unless  $field[15] ne "A" && $field[16] ne "B";
  tr/'/"/, $_||=' ' for @field;
  $_=join'',(split m{/*})[4..7,0..3] for $field[1];
  $_ = lc for @field[23,24];
   eval {
      $sth->execute(@field[1..24]);
   };
   if ($@) { print "$@\n"; next; }
   $dbh->commit();
   $c += 1;
}
my $line = $.;
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Do you need Perl at all?

When loading data from a text file into Oracle SQL Loader or External Tables is typically the best/fastest.
I think so. I need to process the file on a daily basis out of cron.

I'm having problems with the split splitting properly as some fields have spaces between their values.

Example of a line in the file:
01/01/2013      123456789       username01        P               abc@email.com    Alfa    B       Canine                   Alfa    B       Canine                   123 South 1st Street    New York        NY      10000   Some Park    Main            +1 212 555 1234 +1 212 555 4321 75 Some Blvd    New York   NY      10000   Some Really Long Org Name    Some Really Long Org Name   ABC+D Water   Gone      Status_Range                        

I'm trying -

my @field;
$sth = $dbh->prepare("insert into myTable values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");

while (<IN>) {
  @field[1..37] = split(/ /, $_);
  next unless $field[29] ne "Some Really Long Org Name" && $field[30] ne "Some Really Long Org Name";
  tr/'/"/, $_||=' ' for @field;
  $_=join'',(split m{/*})[4..7,0..3] for $field[1];
  $_ = lc for @field[32,33];
   eval {
      $sth->execute(@field[1..37]);
   };
   if ($@) { print "$@\n"; next; }
   $dbh->commit();
}


There are actually 37 fields in this file not 24. :)  Also, for the second field (123456789, in this example) I'd like to substitute "None" (without the quotes) if the field is blank.
>>I'd like to substitute "None" (without the quotes) if the field is blank.

Databases allow null values for a reason.  Use them.  Why store data to say no data exists?
True, true. I'm in favor of that! Unfortunately, this is legacy table (which does not allow nulls -ech), and legacy code that I am trying to improve in small steps.
I hate bad designs that you are forced to live with.

Anyway, did you look into SQL Loader to replace the Perl Script?  SQL Loader was written to load data into Oracle from Text files and does this very efficiently.
> some fields have spaces between their values.
Your use of / / instead of ' ' suggested that you wanted multiple spaces to indicate empty fields.
If fields can be separated by multiple spaces, you probably want to use " ", which is the default:
   @field[1..37] = split;
Why wouldn't you be able to use SQL Loader with cron?
I'm not very familiar with using SQL Loader from the unix cmd line...
> if the field is blank.
If there can be multiple spaces between fields, how do you represent a blank field?
If the number of spaces between fields is always between 4 and 8,
so less than 4 spaces in a row are part of a field, and 19 spaces in a row, as between Canine and 123 indicates an empty field, then you might use
@field[1..37] = split/ {4,8}/;
sqlldr is only command line.

Setting up the control file is the hard part.

Once you have it set up the actual command is simple:
sqlldr username/password control=mycontrolfile.ctl


You've provided an example line from the file.  If you can provide the table description and the expected results I'm sure we can come up with a sample control file.
also, as ozo has suggested, the file layout.  Delimited or fixed width work best for sql loader.
With split/ / or split' ', the s/^ *// and s/ *$// would have been unnecessary, since the spaces would separate the fields instead of being part of the field, but with something like split/ {4,8}/, the s/^ *// may become necessary
  s/^ *//, tr/'/"/, $_||=' ' for @field;
And it may have to be done before the "Some Really Long Org Name" check,
or, depending on how often that case happens, it may be faster to make the check allow for leading space:
next if grep/ *Some Really Long Org Name$/,@field[29,30];
Is there any way to not have to pass the user/password from the command line?
>Is there any way to not have to pass the user/password from the command line?

Looks like you can store it in a PARFILE as well.  Never used PARFILEs with sql loader but the docs say you can.
Hmm.. I think I'll work on the sql loader way after I get this ancient way with a Perl script working. :)
Personally I would stop the ancient way and focus efforts on SQL Loader.  Trust me, it is pretty quick at what it does.

If you can:  Create a very simple test case.  Single table, single column.  Should be a simple tweak to your Perl script to handle it and a very simple sql loader script.

Set up some representative sample of rows that you expect in production and test them.
Ok, will tackle the SQL Loader way! :)
Here is the table description:

Name            Null     Type        
--------------- -------- ------------
FIELD1          NOT NULL VARCHAR2(8)    # date of birth in file is mm/dd/yyyy but need to reformat to yyyymmdd
FIELD2          NOT NULL VARCHAR2(9)    # replace blanks with None
FIELD3          NOT NULL VARCHAR2(16)
FIELD4          NOT NULL VARCHAR2(2)  
FIELD5          NOT NULL VARCHAR2(64)
FIELD6          NOT NULL VARCHAR2(31)
FIELD7          NOT NULL VARCHAR2(20)
FIELD8          NOT NULL VARCHAR2(32)
FIELD9          NOT NULL VARCHAR2(8)  
FIELD10         NOT NULL VARCHAR2(16)
FIELD11         NOT NULL VARCHAR2(31)
FIELD12         NOT NULL VARCHAR2(20)
FIELD13         NOT NULL VARCHAR2(32)
FIELD14         NOT NULL VARCHAR2(16)
FIELD15         NOT NULL VARCHAR2(16)
FIELD16         NOT NULL VARCHAR2(96)
FIELD17         NOT NULL VARCHAR2(32)
FIELD18         NOT NULL VARCHAR2(32)
FIELD19         NOT NULL VARCHAR2(16)
FIELD20         NOT NULL VARCHAR2(64)
FIELD21         NOT NULL VARCHAR2(24)
FIELD22         NOT NULL VARCHAR2(24)
FIELD23         NOT NULL VARCHAR2(32)
FIELD24         NOT NULL VARCHAR2(32)
FIELD25         NOT NULL VARCHAR2(96)
FIELD26         NOT NULL VARCHAR2(64)
FIELD27         NOT NULL VARCHAR2(32)
FIELD28         NOT NULL VARCHAR2(16)
FIELD29         NOT NULL VARCHAR2(64)   # Some Really Long Org Name
FIELD30         NOT NULL VARCHAR2(64)   # Some Really Long Org Name
FIELD31         NOT NULL VARCHAR2(64)
FIELD32         NOT NULL VARCHAR2(16)
FIELD33         NOT NULL VARCHAR2(24)
FIELD34         NOT NULL VARCHAR2(16)
FIELD35         NOT NULL VARCHAR2(1)  
FIELD36         NOT NULL VARCHAR2(1)  
FIELD37         NOT NULL VARCHAR2(1)  


Fields 6-8 are names and sometimes have apostrophes.
I looked at the file in vi with :set list and see that the fields are separated with ^I
and lines are terminated with ^M$  

I was removing the ctrl-Ms with: perl -pi -e 's/\cM//g' $1
>>I looked at the file in vi with :set list and see that the fields are separated with ^I

Tab delimited makes sql loader even easier.

I'm about ready to leave work.  I'll work on this later at home.  Shouldn't be difficult.  Give me a little commute time and I should have something for you in an hour or two.

Can you post some actual tab delimited data (in a code block) for me?

This is a code block

Open in new window


It will cut down on errors when I add them to what you already posted.
Wooo, thank you!! :)

Here's a line!

01/01/2013      123456789       username01      P       abc@email.com   Alfa    B       Canine                  Alfa    B       Canine                  123 South 1st Street    New York        NY
      10000   Some Park       Main            +1 212 555 1234 +1 212 555 4321 75 Some Blvd    New York        NY      10000   Some Really Long Org Name       Some Really Long Org Name       ABC+D Water     Gone    Status_Range                            Y               ^M

Open in new window

>>Here's a line!

This still doesn't have the tab separation.  I'll make some guesses.  It should be enough to get you off and running.
ASKER CERTIFIED SOLUTION
Avatar of slightwv (䄆 Netminder)
slightwv (䄆 Netminder)

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
This test box is driving me nuts. Now I get this error with the sqlldr -

Message 2100 not found; No message file for product=RDBMS, facility=ULMessage 2100 not found; No message file for product=RDBMS
Make sure your ORACLE_HOME is properly set.  Typically in the past when I get message not found messages it is from an environment setup issue.
Getting there... :) How do you handle possible single quotes in the data?
>>How do you handle possible single quotes in the data?

SQL Loader handles them for you.

Look at my example above.  The second record I added has a single quote but it's hard to see:  Al'fa.
Ah, yeah! So this works but unfortunately when I test it out using a real data file, it only loads 124 of the 91k rows before it errors out with "MAXIMUM ERROR COUNT EXCEEDED - Above statistics reflect partial run." It seems some of the records sent have two tabs where there should be one and data is shifted over for some records. The old Perl script seems to have handled that better.
>>It seems some of the records sent have two tabs where there should be one

Two tabs side by side typically means a null value for a field.  This by itself is normal.

>> and data is shifted over for some records

Explain?  What it looks like in an editor (like vi) and what it actually is, is two different things.

There should be a 'log' file generated that will tell you why the records failed and a 'bad' file that has the ones that failed.
Actually, my mistake. A column was out of place! I fixed that and the data loads in 3 mins 1 second!  THANKS!!!
No problem.   Glad to help.  Told you it was very good at what it does!

You might be able to make it even faster but you need to understand what it actually doing.  It isn't for all situations.

Read up on Direct Path loading:
http://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_concepts.htm#SUTIL997
Ok, thank you! :)

I encountered a problem... how do I compare this new table that contains NULLS with another table that has ' ' instead of NULL? When I compare them using MERGE, all 91+k records having their fields with a ' ' value updated with a NULL value.  I don't want the field to be updated if it has ' ' and this new field has a NULL.
You have lost me.

In Oracle an empty string, '', is technically considered a null.

For example:
select 'hello' from dual where '' is null;

Going from your previous requirements, you wanted a 'None' string inserted when the field was null.  My posted test case does this.  Look at the NVL function on FIELD2.

You now mention a MERGE statement.  Where did this come into the picture?

Can you update your requirements or explain a lot more about what you are wanting to do?


If the loaded data is being loaded into a staging table, you might be better off using External Tables (mentioned above).  These use SQL Loader style syntax but allow you to use them like a 'normal' table without having to actually load the data into a 'normal' table.
Maybe this is something new? Should I start a new question with new points?

tab1 - the table populated by the data file with sqlLoader; contains current records
tab2 - existing table of all records - current and historical

I have a script that compares tab1's fields with tab2's field and makes updates to records
in tab2 that have changed.

My tables -

SQL> desc tab1
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 FIELD1                                             VARCHAR2(9)
 FIELD2                                             VARCHAR2(16)
 FIELD3                                             VARCHAR2(2)
 FIELD4                                             VARCHAR2(64)
 FIELD5                                             VARCHAR2(8)
 FIELD6                                             VARCHAR2(31)


SQL> desc tab2;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 FIELD1                                    NOT NULL VARCHAR2(9)
 FIELD2                                    NOT NULL VARCHAR2(16)
 FIELD3                                    NOT NULL VARCHAR2(2)
 FIELD4                                    NOT NULL VARCHAR2(64)
 FIELD5                                    NOT NULL VARCHAR2(8)
 FIELD6                                    NOT NULL VARCHAR2(31)
 
 
 sql part of my script -
 
 MERGE INTO tab2 a
  USING
    (select field1,field2,field3,field4,field5,field6 from tab1
     minus
     select field1,field2,field3,field4,field5,field6 from tab2) b
  ON (a.field2 = b.field2)
  WHEN MATCHED THEN
  UPDATE SET
    a.field1 = b.field1,
    a.field3 = b.field3,
    a.field4 = b.field4,
    a.field5 = b.field5,
    a.field6 = b.field6
   
The error message:
DBD::Oracle::db do failed: ORA-01407: cannot update ("TAB2"."FIELD1") to NULL
>> Should I start a new question with new points?

Yes.  You can post the link to this one in the new one as a reference.  The reason is that you asked how to make the load of records faster.  That was answered.  The merge piece is different.

When you ask the new question please provide sample data (for both tables) and expected results.  With what you posted, I believe External Tables is better than SQL Loader into a temp table.

As long as the new question is in the Oracle Zone, I'll see it (as will all the Oracle Experts that monitor that zone).