Read in data from 100,000+ files via command line

I have about 100,000 files scattered among about 20 directories.  For each file there is an associated .arf file, which references the actual file and has details about it.  What I would like to do is have a script be able to read these .arf files and create a MySQL insert statement from the fields in the .arf file to add them all into a single table.  I would like the MySQL code to be contained in a single .sql file once all of the .arf files have been read in.

Here is an example of an .arf file:
FILEID: "TIF490336"
PATH: "/optical/incoming/TIF490336"
TYPE: "TIF"
SECLEV: "10"
STATID: ""
USRID: "admin"
REQDATE: "08/02/2012"
REQTIME: "09:02:32"
GENDATE: "08/03/2012"
GENTIME: "09:02:32"
PROGID: ""
GROUPID: "Check Stubs"
DESC: "August"

Open in new window


The fields in each .arf file are all identical, so that should make things a little easier.  The path here is not important and will probably be ignored but it should be imported as well.

Any thoughts on how to tackle this problem?
LVL 1
bdhtechnologyAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

KimputerIT ManagerCommented:
I would make a VBscript.

It needs to go through all the files, including subfolders (recursive sub or function)
Each art file needs to be read. Because each art file is the same you know what each lines mean.
After reading it 13 times. also write to a specified .sql file 13 times, with the correct mysql insert command.

After this runs, you will have one .sql file filled with mysql insert commands (13 per art file).

From the cmd prompt, log into mysql, and imprt the sql file back.

If you need help with the actual code, let me know.
Duncan RoeSoftware DeveloperCommented:
There's no VBscript in Linux that I know of, but find and xargs should get you started. Can you clarify - are the .arf files in a separate directory or are they, for instance, paired with the target files such that there exists both /optical/incoming/TIF490336 and /optical/incoming/TIF490336.arf?
Do you have more .arf files than files to be cataloged in mysql? (otherwise, you can simply process all your .arf files).
I'd suggest
1. create a file which lists all .arf files (full pathnames)
2. Make a script which takes the path of a .arf file and outputs the corresponding INSERT SQL statement
3. run this script on every file in the list, appending output to another file (use >> redirection)
4. Run mysql to take command input from that file
gheistCommented:
find /optical/ -name \*/[AaWw][Rr][Ff] -print0 | xargs --0 P 2 -n 1000 sqldo some on 1000 files in 2 threads
Your Guide to Achieving IT Business Success

The IT Service Excellence Tool Kit has best practices to keep your clients happy and business booming. Inside, you’ll find everything you need to increase client satisfaction and retention, become more competitive, and increase your overall success.

Gerwin Jansen, EE MVETopic Advisor Commented:
What do the actual files look like? Are the values separated in some way? Please post an example.
Surranoapplication managerCommented:
find /optical -name "*.arf" | while read f; do
  cat "$f" | gawk '{ 
    attname=substr($1,1,length($1)-1); 
    attvalue= substr($2,2,length($2)-2); a[attname]=attvalue 
  }
  END { nlist=""; vlist=""; 
    for (n in a) { nlist=nlist","n; vlist=vlist",'\''"a[n]"'\''" }
    nlist=substr(nlist,2);vlist=substr(vlist,2);
    printf "insert into `mytable` (%s) values (%s);\n", nlist,vlist;
  }' >>myinsertfile.sql
done

Open in new window

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
bdhtechnologyAuthor Commented:
@Kimputer, @duncan_roe is correct, I do not have access to VB inside the Linux command shell.

@duncan_roe
The .arf files are in the same directory as the associated file they reference.  There directories are: 2012, 2013.1, STMT, TIF1, TIF2, TIF3, ... on up to TIF18
bdhtechnologyAuthor Commented:
@Surrano
That is almost exactly what I am looking for.  Actually it is what I had asked for but that is my fault for not clarifying.  The path is not /optical, that was an old path, but not a big deal I can modify to the new path accordingly.  I need to record the path the file was found in so I can process it later.  So if at the end of the SQL statement there was an additional column that contained the directory it was in, i.e. a column named NEWPATH, that had values in it like TIF1, TIF2 or TIF3, etc. depending on where the .arf file was located.

Also if there was a way to echo the SQL statement and print it out to the .sql file as well that would be helpful.  It takes quite a while to process the 100,000 files so that would give me an indication of the progress while it is running.
Surranoapplication managerCommented:
So one by one:
- modifying path is piece of cake indeed.
- you may pass the filename (or part of it if you wish) to gawk using -v flag
- to echo the sql statement (and not only the filename) you may use the "tee -a" command.
- assuming path is something like "/path/to/TIF1/somename.arf" use code below
- I also replaced the iteration with sequential append so that fields appear exactly at the same order as in the file. (it doesn't matter in sql though)

find $SOMEPATH -name "*.arf" | while read f; do
  newpath="$(basename $(dirname "$f"))"
  cat "$f" | gawk -v p="$newpath" '{ 
    attname=substr($1,1,length($1)-1); nlist=nlist","attname;
    attvalue= substr($2,2,length($2)-2); vlist=vlist",'\''"attvalue"'\''";
  }
  END { 
    printf "insert into `mytable` (newpath%s) values ('\''%s'\''%s);\n", nlist, p, vlist;
  }' | tee -a myinsertfile.sql
done

Open in new window

Surranoapplication managerCommented:
To write progress every 1000 lines only, use the append (>>) instead of tee -a and add the following snippet before "done" (end of cycle):

  cnt=$((cnt+1))
  [ $(($cnt%1000)) -eq 0 ] && echo "File #$cnt: $f"

Open in new window

bdhtechnologyAuthor Commented:
Awesome that is precisely what I am looking for.  One more quick question, how would I adjust it if I want 'newpath' to contain the relative path to the actual file (Not the .arf file).  I have the following but that adds the .arf extension obviously:
newpath="$(basename $(dirname "$f"))"/$(basename $f)"
Surranoapplication managerCommented:
For one, it sounds as if you'd like to use the FILEID field not the file name. that needs some more tweaking of the awk code.

For another, using basename:
newpath="$(basename $(dirname "$f"))/$(basename $f .arf)" 

Open in new window

It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Shell Scripting

From novice to tech pro — start learning today.