Solved

Read in data from 100,000+ files via command line

Posted on 2014-01-02
Last Modified: 2014-01-03
I have about 100,000 files scattered among about 20 directories.  For each file there is an associated .arf file, which references the actual file and has details about it.  What I would like to do is have a script be able to read these .arf files and create a MySQL insert statement from the fields in the .arf file to add them all into a single table.  I would like the MySQL code to be contained in a single .sql file once all of the .arf files have been read in.

Here is an example of an .arf file:
FILEID: "TIF490336"
PATH: "/optical/incoming/TIF490336"
TYPE: "TIF"
SECLEV: "10"
STATID: ""
USRID: "admin"
REQDATE: "08/02/2012"
REQTIME: "09:02:32"
GENDATE: "08/03/2012"
GENTIME: "09:02:32"
PROGID: ""
GROUPID: "Check Stubs"
DESC: "August"



The fields in each .arf file are identical, which should make things a little easier. The PATH value shown here is out of date and will probably be ignored later, but it should still be imported along with the other fields.

Any thoughts on how to tackle this problem?
Question by:bdhtechnology
11 Comments
 
LVL 35

Expert Comment

by:Kimputer
I would write a VBScript.

It would need to go through all the files, including subfolders (a recursive sub or function), and read each .arf file. Because every .arf file has the same layout, you know what each line means. After reading the 13 field lines, write the corresponding MySQL INSERT command to a specified .sql file.

After this runs, you will have one .sql file filled with MySQL INSERT commands (one per .arf file, built from its 13 fields).

From the cmd prompt, log into MySQL and import the .sql file.

If you need help with the actual code, let me know.
 
LVL 34

Expert Comment

by:Duncan Roe
There's no VBscript in Linux that I know of, but find and xargs should get you started. Can you clarify - are the .arf files in a separate directory or are they, for instance, paired with the target files such that there exists both /optical/incoming/TIF490336 and /optical/incoming/TIF490336.arf?
Do you have more .arf files than files to be cataloged in mysql? (otherwise, you can simply process all your .arf files).
I'd suggest
1. create a file which lists all .arf files (full pathnames)
2. Make a script which takes the path of a .arf file and outputs the corresponding INSERT SQL statement
3. run this script on every file in the list, appending output to another file (use >> redirection)
4. Run mysql to take command input from that file
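A self-contained sketch of steps 1–3 run against a throwaway directory (all names here are hypothetical; `arf2sql` is a placeholder for the step-2 script that would emit the real INSERT):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/TIF1"
printf 'FILEID: "TIF000001"\n' > "$tmp/TIF1/TIF000001.arf"

# 1. list all .arf files (full pathnames)
find "$tmp" -name '*.arf' > "$tmp/arf_list.txt"

# 2. placeholder for the script that turns one .arf file into an INSERT statement
arf2sql() { printf 'insert for %s\n' "$1"; }

# 3. run it on every listed file, appending with >>
while IFS= read -r f; do
  arf2sql "$f" >> "$tmp/inserts.sql"
done < "$tmp/arf_list.txt"

lines=$(wc -l < "$tmp/inserts.sql")
echo "$lines"   # one line per .arf file found
```

Step 4 would then be something like `mysql -u youruser -p yourdb < inserts.sql` (credentials hypothetical).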
 
LVL 61

Expert Comment

by:gheist
find /optical/ -iname '*.arf' -print0 | xargs -0 -P 2 -n 1000 sqldo
(i.e. run "sqldo", your SQL-generating script, on 1000 files at a time, in 2 parallel processes)
 
LVL 37

Expert Comment

by:Gerwin Jansen
What do the actual files look like? Are the values separated in some way? Please post an example.
 
LVL 8

Accepted Solution

by:
Surrano earned 500 total points
find /optical -name "*.arf" | while read f; do
  gawk '
  NF {
    attname = substr($1, 1, length($1)-1);          # strip the trailing colon
    match($0, /"(.*)"/);                            # take everything between the outer quotes,
    a[attname] = substr($0, RSTART+1, RLENGTH-2);   # so multi-word values like "Check Stubs" survive
  }
  END { nlist=""; vlist="";
    # backtick-quote column names because DESC is a reserved word in MySQL
    for (n in a) { nlist=nlist",`"n"`"; vlist=vlist",'\''"a[n]"'\''" }
    nlist=substr(nlist,2); vlist=substr(vlist,2);
    printf "insert into `mytable` (%s) values (%s);\n", nlist, vlist;
  }' "$f" >>myinsertfile.sql
done



 
LVL 1

Author Comment

by:bdhtechnology
@Kimputer, @duncan_roe is correct, I do not have access to VB inside the Linux command shell.

@duncan_roe
The .arf files are in the same directory as the associated file they reference.  The directories are: 2012, 2013.1, STMT, TIF1, TIF2, TIF3, ... on up to TIF18
 
LVL 1

Author Comment

by:bdhtechnology
@Surrano
@Surrano
That is almost exactly what I am looking for.  Actually it is what I had asked for, but that is my fault for not clarifying.  The path is not /optical (that was an old path), but that's not a big deal; I can modify it to the new path.  I also need to record the directory each file was found in so I can process it later.  So at the end of the SQL statement I would like an additional column, e.g. one named NEWPATH, with values like TIF1, TIF2, or TIF3, depending on where the .arf file was located.

Also if there was a way to echo the SQL statement and print it out to the .sql file as well that would be helpful.  It takes quite a while to process the 100,000 files so that would give me an indication of the progress while it is running.
 
LVL 8

Expert Comment

by:Surrano
So one by one:
- modifying path is piece of cake indeed.
- you may pass the filename (or part of it if you wish) to gawk using -v flag
- to echo the sql statement (and not only the filename) you may use the "tee -a" command.
- assuming path is something like "/path/to/TIF1/somename.arf" use code below
- I also replaced the iteration with a sequential append so that the fields appear in exactly the same order as in the file (it doesn't matter in SQL, though).

find "$SOMEPATH" -name "*.arf" | while read f; do
  newpath="$(basename "$(dirname "$f")")"
  gawk -v p="$newpath" '
  NF {
    attname = substr($1, 1, length($1)-1);
    nlist = nlist ",`" attname "`";                 # backtick-quote: DESC is reserved in MySQL
    match($0, /"(.*)"/);                            # take everything between the outer quotes
    vlist = vlist ",'\''" substr($0, RSTART+1, RLENGTH-2) "'\''";
  }
  END {
    printf "insert into `mytable` (`newpath`%s) values ('\''%s'\''%s);\n", nlist, p, vlist;
  }' "$f" | tee -a myinsertfile.sql
done
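The newpath extraction is just nested dirname/basename; on a path shaped like the poster's it behaves as follows (the path itself is hypothetical):

```shell
f="/path/to/TIF1/TIF490336.arf"   # hypothetical full path to an .arf file
newpath="$(basename "$(dirname "$f")")"   # parent directory's name only
echo "$newpath"   # -> TIF1
```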


 
LVL 8

Expert Comment

by:Surrano
To report progress only every 1000 files, use append (>>) instead of tee -a and add the following snippet before "done" (the end of the loop):

  cnt=$((cnt+1))
  [ $(($cnt%1000)) -eq 0 ] && echo "File #$cnt: $f"
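A quick check of the modulo arithmetic, assuming 2500 files: the message fires at files 1000 and 2000, i.e. twice.

```shell
cnt=0
printed=0
i=0
while [ "$i" -lt 2500 ]; do
  i=$((i+1))
  cnt=$((cnt+1))
  # fires only when cnt is a multiple of 1000
  if [ $((cnt % 1000)) -eq 0 ]; then printed=$((printed+1)); fi
done
echo "$printed"   # -> 2
```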


 
LVL 1

Author Comment

by:bdhtechnology
Awesome that is precisely what I am looking for.  One more quick question, how would I adjust it if I want 'newpath' to contain the relative path to the actual file (Not the .arf file).  I have the following but that adds the .arf extension obviously:
newpath="$(basename $(dirname "$f"))"/$(basename $f)"
 
LVL 8

Expert Comment

by:Surrano
For one, it sounds as if you'd like to use the FILEID field, not the file name. That needs some more tweaking of the awk code.

For another, using basename:
newpath="$(basename $(dirname "$f"))/$(basename $f .arf)" 
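basename's second argument strips a known suffix, so the .arf extension drops off (the path below is hypothetical):

```shell
f="/path/to/TIF1/TIF490336.arf"   # hypothetical full path to an .arf file
newpath="$(basename "$(dirname "$f")")/$(basename "$f" .arf)"
echo "$newpath"   # -> TIF1/TIF490336
```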


