• Status: Solved

Read in data from 100,000+ files via command line

I had originally asked the following question:

This at first seemed to work exactly the way I needed; however, I just discovered that values containing spaces are not read in correctly.

Below is the last iteration of the code:

echo `date`
find . -name "*.arf" | while read f; do
  cnt=$((cnt+1))
  newpath="$(basename $(dirname "$f"))"
#/$(basename $f)"
  cat "$f" | gawk -v p="$newpath" '{ 
    attname=substr($1,1,length($1)-1); nlist=nlist"`, `"attname;
    attvalue= substr($2,2,length($2)-2); vlist=vlist", '\''"attvalue"'\''";
  }
  END { 
    printf "insert into `mydatabase`.`archives` (`NEWPATH%s`) values ('\''%s'\''%s);\n", nlist, p, vlist;
  }' >> myinsertfile.sql
#| tee -a myinsertfile.sql
  [ $(($cnt%100)) -eq 0 ] && echo "File #$cnt: $f"
done

echo "Total Files: $cnt"

echo `date`

For the following .arf file:
FILEID: "TIF490336"
PATH: "/optical/incoming/TIF490336"
SECLEV: "10"
USRID: "admin"
REQDATE: "08/02/2012"
REQTIME: "09:02:32"
GENDATE: "08/03/2012"
GENTIME: "09:02:32"
GROUPID: "Check Stubs"
DESC: "August"

It produced the following SQL statement:
insert into `mydatabase`.`archives` (NEWPATH,FILEID,PATH,TYPE,SECLEV,STATID,USRID,REQDATE,REQTIME,GENDATE,GENTIME,PROGID,GROUPID,`DESC`) values ('TIF18','TIF490336','/optical/incoming/TIF490336','TIF','10','','admin','08/02/2012','09:02:32','08/03/2012','09:02:32','','Chec','August');

As a result, the GROUPID column contains 'Chec' instead of 'Check Stubs'. How can the code above be adjusted to handle values that contain spaces?
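The truncation can be reproduced on a single line, outside the loop. The snippet below is only a minimal sketch and not part of the original script: it assumes gawk or awk is on the PATH, and the variable names (AWK, line, nf, second, value) are made up for the illustration. It feeds the GROUPID line from the sample file to the same substr() expression used above.

```shell
# Use gawk when available, otherwise fall back to plain awk
# (assumption: at least one of the two is installed).
AWK=$(command -v gawk || command -v awk)

line='GROUPID: "Check Stubs"'

# With the default field separator the line is split on whitespace,
# so $2 is only the first word of the quoted value.
nf=$(printf '%s\n' "$line" | "$AWK" '{ print NF }')
second=$(printf '%s\n' "$line" | "$AWK" '{ print $2 }')

# Same expression as in the script: drop the first and last character of $2.
value=$(printf '%s\n' "$line" | "$AWK" '{ print substr($2, 2, length($2) - 2) }')

echo "fields: $nf"
echo "second field: $second"
echo "extracted value: $value"
```

The line is seen as three fields, the second field is "Check without its closing quote, and removing its first and last character leaves Chec.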
1 Solution
The problem is that gawk splits fields on whitespace by default, so a value such as "Check Stubs" ends up spread over two fields.  If you make it split on : or " characters instead, you can change 4 lines of the code to:
  cat "$f" | gawk -F'[:"]' -v p="$newpath" '{ 
      nlist=nlist "`, `" $1;
      vlist=vlist ", '\''" $3 "'\''";

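The rest of the script stays the same.  This assumes that the format of the input file is consistent: the only colon on a line is the one immediately after the attribute name, and the value is always enclosed in double quotes.  A value that itself contains a colon, such as REQTIME: "09:02:32", is split further, so $3 holds only the part before that colon (09).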
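If the values can contain colons, as the REQTIME and GENTIME lines in the sample file do, one possible variation is to split on the double quote only and strip the trailing colon from the name. The following is a hedged sketch rather than a tested drop-in replacement: parse_arf is a hypothetical helper name, the name|value output format is chosen only for the demonstration, and it still assumes exactly one pair of double quotes per line (a value with an embedded double quote would be cut short).

```shell
# Use gawk when available, otherwise fall back to plain awk
# (assumption: at least one of the two is installed).
AWK=$(command -v gawk || command -v awk)

# Splitting on : or " cuts a time value at its first inner colon.
short=$(printf '%s\n' 'REQTIME: "09:02:32"' | "$AWK" -F'[:"]' '{ print $3 }')
echo "with -F'[:\"]': $short"

# Hypothetical helper: print "name|value" for each line read from stdin.
parse_arf() {
  "$AWK" -F'"' '{
    name = $1
    sub(/:[ \t]*$/, "", name)   # drop the trailing colon and any blanks
    print name "|" $2           # $2 is the text between the first pair of quotes
  }'
}

out=$(printf '%s\n' 'REQTIME: "09:02:32"' 'GROUPID: "Check Stubs"' | parse_arf)
echo "$out"
```

In the script above, the same idea would mean passing -F'"' to gawk, appending the cleaned-up name to nlist and appending $2 to vlist.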
bdhtechnology (Author) commented:
Perfect, that's exactly what I needed!
Question has a verified solution.
