Solved

Read in data from 100,000+ files via command line

Posted on 2014-01-02
11
332 Views
Last Modified: 2014-01-03
I have about 100,000 files scattered among about 20 directories.  For each file there is an associated .arf file, which references the actual file and has details about it.  What I would like to do is have a script be able to read these .arf files and create a MySQL insert statement from the fields in the .arf file to add them all into a single table.  I would like the MySQL code to be contained in a single .sql file once all of the .arf files have been read in.

Here is an example of an .arf file:
FILEID: "TIF490336"
PATH: "/optical/incoming/TIF490336"
TYPE: "TIF"
SECLEV: "10"
STATID: ""
USRID: "admin"
REQDATE: "08/02/2012"
REQTIME: "09:02:32"
GENDATE: "08/03/2012"
GENTIME: "09:02:32"
PROGID: ""
GROUPID: "Check Stubs"
DESC: "August"

Open in new window


The fields in each .arf file are all identical, so that should make things a little easier.  The path here is not important and will probably be ignored but it should be imported as well.

Any thoughts on how to tackle this problem?
0
Comment
Question by:bdhtechnology
11 Comments
 
LVL 35

Expert Comment

by:Kimputer
ID: 39752892
I would make a VBscript.

It needs to go through all the files, including subfolders (recursive sub or function)
Each art file needs to be read. Because each art file is the same you know what each lines mean.
After reading it 13 times. also write to a specified .sql file 13 times, with the correct mysql insert command.

After this runs, you will have one .sql file filled with mysql insert commands (13 per art file).

From the cmd prompt, log into mysql, and imprt the sql file back.

If you need help with the actual code, let me know.
0
 
LVL 34

Expert Comment

by:Duncan Roe
ID: 39752986
There's no VBscript in Linux that I know of, but find and xargs should get you started. Can you clarify - are the .arf files in a separate directory or are they, for instance, paired with the target files such that there exists both /optical/incoming/TIF490336 and /optical/incoming/TIF490336.arf?
Do you have more .arf files than files to be cataloged in mysql? (otherwise, you can simply process all your .arf files).
I'd suggest
1. create a file which lists all .arf files (full pathnames)
2. Make a script which takes the path of a .arf file and outputs the corresponding INSERT SQL statement
3. run this script on every file in the list, appending output to another file (use >> redirection)
4. Run mysql to take command input from that file
0
 
LVL 62

Expert Comment

by:gheist
ID: 39753194
find /optical/ -name \*/[AaWw][Rr][Ff] -print0 | xargs --0 P 2 -n 1000 sqldo some on 1000 files in 2 threads
0
Windows Server 2016: All you need to know

Learn about Hyper-V features that increase functionality and usability of Microsoft Windows Server 2016. Also, throughout this eBook, you’ll find some basic PowerShell examples that will help you leverage the scripts in your environments!

 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 39753225
What do the actual files look like? Are the values separated in some way? Please post an example.
0
 
LVL 8

Accepted Solution

by:
Surrano earned 500 total points
ID: 39753476
find /optical -name "*.arf" | while read f; do
  cat "$f" | gawk '{ 
    attname=substr($1,1,length($1)-1); 
    attvalue= substr($2,2,length($2)-2); a[attname]=attvalue 
  }
  END { nlist=""; vlist=""; 
    for (n in a) { nlist=nlist","n; vlist=vlist",'\''"a[n]"'\''" }
    nlist=substr(nlist,2);vlist=substr(vlist,2);
    printf "insert into `mytable` (%s) values (%s);\n", nlist,vlist;
  }' >>myinsertfile.sql
done

Open in new window

0
 
LVL 1

Author Comment

by:bdhtechnology
ID: 39753560
@Kimputer, @duncan_roe is correct, I do not have access to VB inside the Linux command shell.

@duncan_roe
The .arf files are in the same directory as the associated file they reference.  There directories are: 2012, 2013.1, STMT, TIF1, TIF2, TIF3, ... on up to TIF18
0
 
LVL 1

Author Comment

by:bdhtechnology
ID: 39753570
@Surrano
That is almost exactly what I am looking for.  Actually it is what I had asked for but that is my fault for not clarifying.  The path is not /optical, that was an old path, but not a big deal I can modify to the new path accordingly.  I need to record the path the file was found in so I can process it later.  So if at the end of the SQL statement there was an additional column that contained the directory it was in, i.e. a column named NEWPATH, that had values in it like TIF1, TIF2 or TIF3, etc. depending on where the .arf file was located.

Also if there was a way to echo the SQL statement and print it out to the .sql file as well that would be helpful.  It takes quite a while to process the 100,000 files so that would give me an indication of the progress while it is running.
0
 
LVL 8

Expert Comment

by:Surrano
ID: 39753595
So one by one:
- modifying path is piece of cake indeed.
- you may pass the filename (or part of it if you wish) to gawk using -v flag
- to echo the sql statement (and not only the filename) you may use the "tee -a" command.
- assuming path is something like "/path/to/TIF1/somename.arf" use code below
- I also replaced the iteration with sequential append so that fields appear exactly at the same order as in the file. (it doesn't matter in sql though)

find $SOMEPATH -name "*.arf" | while read f; do
  newpath="$(basename $(dirname "$f"))"
  cat "$f" | gawk -v p="$newpath" '{ 
    attname=substr($1,1,length($1)-1); nlist=nlist","attname;
    attvalue= substr($2,2,length($2)-2); vlist=vlist",'\''"attvalue"'\''";
  }
  END { 
    printf "insert into `mytable` (newpath%s) values ('\''%s'\''%s);\n", nlist, p, vlist;
  }' | tee -a myinsertfile.sql
done

Open in new window

0
 
LVL 8

Expert Comment

by:Surrano
ID: 39753605
To write progress every 1000 lines only, use the append (>>) instead of tee -a and add the following snippet before "done" (end of cycle):

  cnt=$((cnt+1))
  [ $(($cnt%1000)) -eq 0 ] && echo "File #$cnt: $f"

Open in new window

0
 
LVL 1

Author Comment

by:bdhtechnology
ID: 39753628
Awesome that is precisely what I am looking for.  One more quick question, how would I adjust it if I want 'newpath' to contain the relative path to the actual file (Not the .arf file).  I have the following but that adds the .arf extension obviously:
newpath="$(basename $(dirname "$f"))"/$(basename $f)"
0
 
LVL 8

Expert Comment

by:Surrano
ID: 39753686
For one, it sounds as if you'd like to use the FILEID field not the file name. that needs some more tweaking of the awk code.

For another, using basename:
newpath="$(basename $(dirname "$f"))/$(basename $f .arf)" 

Open in new window

0

Featured Post

The Eight Noble Truths of Backup and Recovery

How can IT departments tackle the challenges of a Big Data world? This white paper provides a roadmap to success and helps companies ensure that all their data is safe and secure, no matter if it resides on-premise with physical or virtual machines or in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
SQL Server group with two values 4 34
list of sudo access for date range 5 29
IP 10.0.1.2 / 255.0.0.0 61 56
Select question from MySQL 1 12
If you have heard of RFC822 date formats, they can be quite a challenge in SQL Server. RFC822 is an Internet standard format for email message headers, including all dates within those headers. The RFC822 protocols are available in detail at:   ht…
I. Introduction There's an interesting discussion going on now in an Experts Exchange Group — Attachments with no extension (http://www.experts-exchange.com/discussions/210281/Attachments-with-no-extension.html). This reminded me of questions tha…
Learn several ways to interact with files and get file information from the bash shell. ls lists the contents of a directory: Using the -a flag displays hidden files: Using the -l flag formats the output in a long list: The file command gives us mor…
In a recent question (https://www.experts-exchange.com/questions/29004105/Run-AutoHotkey-script-directly-from-Notepad.html) here at Experts Exchange, a member asked how to run an AutoHotkey script (.AHK) directly from Notepad++ (aka NPP). This video…

856 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question