AWK: Pythagoras bp script Part deux

I have amazing results from running the unconstrained code.

Can I put an IF statement in this?

Something like $WHEN > $BEFORE AND $WHEN < $AFTER.

I've not got anywhere trying with SQL and databases. Filemaker is helping, but it's... Filemaker, which is peculiar to itself.

I need to enter a constraint such that $WHEN in ACC.CSV falls between $BEFORE and $AFTER in SITES.CSV; if not, skip to the next record.
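
In AWK the test would look something like this (a minimal sketch, assuming WHEN is field $39 of ACC.CSV and BEFORE/AFTER come from fields $10 and $11 of the new SITES.CSV layout; the +0 forces numeric comparison):

WHEN = $39 + 0                           # from the ACC.CSV record being processed
# BEFORE and AFTER come from the SITES.CSV record just read
if ((WHEN > BEFORE) && (WHEN < AFTER)) {
   # inside the window - go on to the distance calculation
}
# otherwise skip straight to the next site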

I was going to attempt to put the code in myself, but it struck me it may involve setting variables, so I haven't.

After running this I found the second file's contents appended to the end of the output file. I'm still not sure I believe it, but any thoughts about that are welcome.

# ACC.csv
#CODE,ACC_EC,ACC_NC,Long,Lat,POL,Severity,No_Vehicles,No_Casualties,Date,DayofWeek,Time,LA_District,LA_Highway,1st_Road_Class,1st_Road_Number,Road_Type,LIMIT,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Zebra-Human,Zebra_Machine,Visibility,Weather_Conditions,Surface,Special_Conditions,Road_Hazards,Urban_Rural,Police_Attended,LSOAofLocation,ACCREF,Month,Minute,ACCDAY,Hour,YEAR,WHEN,ACC_EC5,ACC_NC5
#   a = ACC_EC ($2)
#   b = ACC_NC ($3)
#
# SITES.csv
# SORT,POL,REF,CAM ETG,CAM NTG,INSTALL,MONTH,WHEN,TYPE,CAM_ETGkm,CAM_NTGkm  - OLD HEADS
# SORT,POL,REF,SITE_ETG,SITE_NTG,INSTALL,MONTH,WHEN,TYPE,BEFORE,AFTER  - NEW HEADS
#   r = REF ($3)
#   c = SITE_ETG ($4)
#   d = SITE_NTG ($5)

# One time logic before records are read from input file
BEGIN {
   # Change field separator from the default whitespace to comma for our file
   FS = ","
   file2 = "SITES.CSV"
}

# Main processing loop handling each line of the file
{
   # Check if first line (header)
   if (NR == 1) {

      # If this is header just write it to output adding new column headers
      print $0 ",REF,SITE_ETG,SITE_NTG,Distance"

   } else {

      # skip blank lines
      if ($0 != "") {

         # Save input record from file1
         source = $0

         # Save needed fields for calcs below; strip double quotes if any remain (quotes previously removed from the file)
         a = $2         # ACC_EC 
         b = $3         # ACC_NC
         gsub("\"", "", a)
         gsub("\"", "", b)

         Bestx = 999999999
         # Scan every record in SITES.CSV and keep the closest site
         while ((getline < file2) > 0) {
            if ($1 != "SORT") {
               r = $3
               c = $4      # SITE_ETG
               d = $5      # SITE_NTG
               # Pythagorean (straight-line) distance between accident and site
               x = sqrt(((a-c)^2)+((b-d)^2))
               if (x < Bestx) {
                  Bestx = x
                  Bestr = r
                  Bestc = c
                  Bestd = d
               }
            }
         }
         printf("%s,%s,%s,%s,%f\n", source, Bestr, Bestc, Bestd, Bestx)
         close(file2)
      }
   }
}
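
For reference, this version would be run with something like the following (the script file name here is just an example; ACC.csv is the main input, and SITES.CSV is opened by the script itself via getline):

awk -f pythag.awk ACC.csv > acc_nearest.csv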


Anthony Mellor (Chartered Accountant) asked:

Bill Prew commented:
I don't understand this:

After running this I found the second file contents appended to the end of the output file, still not sure I believed it but any thoughts about that welcome.

The script finds the closest match and adds the key data elements from that to the end of the main file record.  Isn't that what you expected?

~bp
Bill Prew commented:
Can you post up test files for this please.

~bp
Anthony Mellor (Chartered Accountant, Author) commented:
Ignore my comment about what was appended. I don't want to interfere with what you have written, as I am not sure what I was doing at the time, and I now have working data, which is great.

Test data attached.
I have added a text file with the column numbers, just for reference.


I'm exploring all sorts of other solutions, but meanwhile AWK is simply doing it. And I can follow what it does, which is great, as well as now writing my own simpler one-liners, which I store in my process tracking doc.

Please use the file names apart.csv (previously acc.csv, uploaded here as AccFMExportExtract.csv) and sites.csv, or just leave them as they were and I'll edit to taste.

Apparently "we" think this would take Filemaker a week to do, so AWK in 21 hours is pretty good, be interested to see what the constraint does to that.
Sites.csv
ACC_Col_Nos.txt
AccFMExportExtract.csv

Bill Prew commented:
Okay, a few things:

  1. I added the logic related to WHEN, BEFORE and AFTER you mentioned.

  2. I noticed that the apart.csv data file you attached already had the results of the prior AWK execution, with the new columns at the end.  Wouldn't we be starting with a file without those there and adding them again?  If not then we are going to double up on them when this execution of the updated AWK program adds those columns back.

  3. Remind me again how large the sites.csv file will be in reality. I'm trying to decide if there is any way we might load it once into memory and then loop over that array, rather than rereading the whole file millions of times (once for each apart.csv record); a rough sketch of the idea is below. But it might be too big...
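
(A minimal sketch of the load-once idea, assuming the SITES.csv layout above; the real version would keep each needed field in its own array slot:

BEGIN {
   FS = ","
   while ((getline < "sites.csv") > 0)
      if ($1 != "SORT")              # skip header
         sitesRow[++sitesMax] = $0   # whole record into memory
   close("sites.csv")
}

The main loop then iterates i = 1..sitesMax over the array instead of rereading the file for every apart.csv record.)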

~bp
Bill Prew commented:
Okay, I have a working version that uses an array for the sites.csv file, but with that smaller dataset it's hard to judge performance. Once I hear back on the prior three questions I can do further testing here, or send it on to you for testing there.

~bp
Anthony Mellor (Chartered Accountant, Author) commented:
1. I am curious what the constraint will do, slow it down or speed it up?
2. Yes I want to add the new results doubled up at the end of the first results please.
3. The sites.csv file you have is the complete file, only 3,336 records. It's the other one that's 6 million.

I think loading sites into RAM is a very good idea (I think that's what you mean). I'm running all this now from a RAM disk, which helps significantly; I've not yet run AWK on it, hadn't thought about that, but I guess it may be much faster... maybe 21 hours will be history. It only struck me yesterday; I haven't used one for decades, and it makes SSD look slow.

Yes sure, ready to test here .

Come to think of it, can you load the big file as an array as well? It's big as a record count, but only a 700 MB (roughly) file.

I just checked and it's showing as 1.47 GB. That's because everything is wrapped in double quotes, which I'll be stripping out; in fact your script does anyway. They add a lot of fat to the file size. I guess your extract has double quotes too. There is only one column that actually needs them, and that's only if it's being loaded into Excel (ignore this, it's just chat, no action requested), which with 6 million records it won't be, though extracts may be: the "Code/Accident Index" column. Some columns appear to (and do) duplicate data. This is because of the historic and modern file layouts; I have used AWK to make old and new compatible with each other, so there is duplication I need to keep for backwards compatibility and for forward additions of new data (years).
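
(For the quote stripping, a one-liner along these lines would do it, file names just examples:

awk '{ gsub(/"/, ""); print }' apart.csv > apart_noquotes.csv

It removes every double quote, which is only safe here because no field relies on quoting to protect embedded commas.)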

Anthony
Bill Prew commented:
Okay, here's the next version; give this a test please. This loads sites.csv into an in-memory array and iterates over it there. Also, I removed the logic that strips double quotes from the apart.csv fields, since they aren't there anymore; figured it might save a tiny amount.

I don't think there would be any value to loading the apart.csv file into memory, since we only read it once now.

# One time logic before records are read from input file
BEGIN {
   # Change field separator from the default whitespace to comma for our file
   FS = ","

   # Names for index numbers to store fields from sites file in array
   sitesRef = 1
   sitesEtg = 2
   sitesNtg = 3
   sitesBefore = 4
   sitesAfter = 5
   sitesMax = 0

   # Load entire sites file into memory array
   sitesFile = "sites.csv"
   while ((getline < sitesFile) > 0) {
      # Skip header line
      if ($1 != "SORT") {
         # Save to next array element
         sitesMax += 1
         sites[sitesMax, sitesRef] = $3
         sites[sitesMax, sitesEtg] = $4+0
         sites[sitesMax, sitesNtg] = $5+0
         sites[sitesMax, sitesBefore] = $10+0
         sites[sitesMax, sitesAfter] = $11+0
      }
   }
   close(sitesFile)

}

# Main processing loop handling each line of the file
{
   # Check if first line (header)
   if (NR == 1) {

      # If this is header just write it to output adding new column headers
      print $0 ",REF,SITE_ETG,SITE_NTG,Distance"

   } else {

      # skip blank lines
      if ($0 != "") {

         # Save input record from file1
         source = $0

         # Save needed fields for calcs below; +0 forces numeric values (double quotes previously stripped from file)
         ACC_EC = $2+0
         ACC_NC = $3+0
         WHEN = $39+0

         # Remove double quotes (may not be needed any more?)
         # gsub("\"", "", ACC_EC)
         # gsub("\"", "", ACC_NC)
         # gsub("\"", "", WHEN)

         BestDistance = 999999999.0
         # Find closest point in sites data (array)
         for (i = 1; i <= sitesMax; i++) {
            # Get times and check if we should check distance
            BEFORE = sites[i, sitesBefore]
            AFTER = sites[i, sitesAfter]
            if ((WHEN > BEFORE) && (WHEN < AFTER)) {
               # Times okay, calculate distance, find smallest
               REF = sites[i, sitesRef]
               SITE_ETG = sites[i, sitesEtg]
               SITE_NTG = sites[i, sitesNtg]
               Distance = sqrt(((ACC_EC-SITE_ETG)^2)+((ACC_NC-SITE_NTG)^2))

               # If this is the best distance so far save its info
               if (Distance < BestDistance) {
                  BestDistance = Distance
                  BestRef = REF
                  BestEtg = SITE_ETG
                  BestNtg = SITE_NTG
               }
            }
         }

         # Write output line, add new columns at the end
         printf("%s,%s,%s,%s,%f\n", source, BestRef, BestEtg, BestNtg, BestDistance)
      }
   }
}

# ACC.csv
# $1 = CODE
# $2 = ACC_EC
# $3 = ACC_NC
# $4 = Long
# $5 = Lat
# $6 = POL
# $7 = Severity
# $8 = No_Vehicles
# $9 = No_Casualties
# $10 = Date
# $11 = DayofWeek
# $12 = Time
# $13 = LA_District
# $14 = LA_Highway
# $15 = 1st_Road_Class
# $16 = 1st_Road_Number
# $17 = Road_Type
# $18 = LIMIT
# $19 = Junction_Detail
# $20 = Junction_Control
# $21 = 2nd_Road_Class
# $22 = 2nd_Road_Number
# $23 = Zebra-Human
# $24 = Zebra_Machine
# $25 = Visibility
# $26 = Weather_Conditions
# $27 = Surface
# $28 = Special_Conditions
# $29 = Road_Hazards
# $30 = Urban_Rural
# $31 = Police_Attended
# $32 = LSOAofLocation
# $33 = ACCREF
# $34 = Month
# $35 = Minute
# $36 = ACCDAY
# $37 = Hour
# $38 = YEAR
# $39 = WHEN 
# $40 = ACC_EC5
# $41 = ACC_NC5
## $42 = REF                                   
## $43 = SITE_ETG                              
## $44 = SITE_NTG                              
## $45 = Distance                              
## $46 = Radius                                
#
# SITES.csv
# $1 = SORT
# $2 = POL
# $3 = REF
# $4 = SITE_ETG
# $5 = SITE_NTG
# $6 = INSTALL
# $7 = MONTH
# $8 = WHEN
# $9 = TYPE
# $10 = BEFORE
# $11 = AFTER


~bp
Anthony Mellor (Chartered Accountant, Author) commented:
yay! I'll run it now...
Bill Prew commented:
After that, if you are feeling adventurous, you could try MAWK. Make sure it's not installed on your system already: just type it on the command line and see if it errors. I don't think macOS ships with it natively, but I could be wrong.

If it's not there, you could try installing a port of MAWK to macOS; here are a couple of links to one version I found:


MAWK is a version of AWK tweaked for faster performance. Not sure if it will work with this script, or perform any better, but you could try it.
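
(Something like this, assuming the port puts a mawk binary on your PATH; the script uses only standard AWK features, so in principle it runs unchanged:

which mawk                                      # check whether it's installed
mawk -f EE29008005.awk apart.csv > out3.csv     # same invocation as awk
)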

~bp
Anthony Mellor (Chartered Accountant, Author) commented:
What's the command line for running this script? I can't figure out the input file names... odd as that may sound.

awk -f constraint.awk file1 sites.csv > apartdat.txt



started 21:53, process awk 1741, input file 985.4 MB, so I guess we expect say 1.1 GB output...
Bill Prew commented:
I used the following. The first CSV is the larger of the data files, which you said you'd rename to this. The second file name is the new output CSV produced. sites.csv is assumed in the code and read in during the BEGIN { } logic.

awk -f EE29008005.awk apart.csv > out3.csv


~bp
Anthony Mellor (Chartered Accountant, Author) commented:
OK, that's running as

awk -f constraint.awk apart.csv > out3.csv



start 21:58, process 1927, input file 985.4 MB, so I guess we expect say 1.1 GB output...

and stopped; the file is too big, I forgot to strip the quotes... of course it's after 9pm, when my IQ drops even lower than usual...

start 22:06, process 2087, input file 985.4 MB, so I guess we expect say 1.1 GB output...

finished... before 22:27... seems too quick; out3.csv is exactly the same size as the input file.
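
(A quick sanity check when the output size looks suspicious, assuming both files are in the working directory:

wc -l apart.csv out3.csv       # record counts should match
head -1 out3.csv               # header should end with ,REF,SITE_ETG,SITE_NTG,Distance
)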


Extract of the top of out3.csv

,REF,SITE_ETG,SITE_NTG,Distance
2000010SU0982,522270,200330,,,1,3,4,3,21/3/2000,3,8:20,33,,1,25,4,70,0, , ,0,0,1,1,1,0,0,,1,,010SU0982,3,20,21,8,2000,24003,52227,20033,135,523655,201833,2043.828271,9999
2000010SU0983,536010,204970,,,1,3,2,1,9/3/2000,5,10:25,430,,3,10,2,70,1,4,3,0,0,1,1,1,0,0,,1,,010SU0983,3,25,9,10,2000,24003,53601,20497,177,536450,205168,482.497668,500
2000010SU0984,519480,204200,,,1,3,4,1,23/3/2000,5,13:06,33,,5,175,6,60,0, , ,0,0,2,1,1,0,0,,1,,010SU0984,3,6,23,13,2000,24003,51948,20420,111,518266,201875,2622.865037,9999
2000010SU0985,520760,202280,,,1,3,3,1,15/3/2000,4,8:40,33,,1,25,4,70,0, , ,0,0,2,1,1,0,0,,1,,010SU0985,3,40,15,8,2000,24003,52076,20228,111,518266,201875,2526.669943,9999
2000010SU0986,523250,199890,,,1,3,3,1,6/3/2000,2,11:20,33,,1,25,4,70,5,5,1,0,0,1,1,1,0,0,,1,,010SU0986,3,20,6,11,2000,24003,52325,19989,136,523686,201824,1982.536759,9999
2000010SU0987,521490,201610,,,1,3,2,1,6/3/2000,2,16:55,33,,1,25,4,70,0, , ,0,0,2,1,1,0,0,,1,,010SU0987,3,55,6,16,2000,24003,52149,20161,135,523655,201833,2176.454456,9999
2000010SU0988,521680,201190,,,1,3,4,2,3/3/2000,6,14:46,33,,1,25,4,70,0, , ,0,0,2,1,1,0,0,,1,,010SU0988,3,46,3,14,2000,24003,52168,20119,135,523655,201833,2077.034906,9999
2000010SU0989,511930,197690,,,1,3,3,1,24/2/2000,5,7:30,33,,3,4008,3,40,0, , ,0,0,2,1,1,0,0,,1,,010SU0989,2,30,24,7,2000,24002,51193,19769,80,511755,197738,181.463495,250
2000010SU0990,512140,195650,,,1,3,1,1,21/2/2000,2,9:45,33,,4,462,6,30,3,4,6,0,0,1,1,2,0,0,,1,,010SU0990,2,45,21,9,2000,24002,51214,19565,88,512534,195435,448.844071,500
2000010SU0991,511840,197370,,,1,3,4,1,8/2/2000,3,17:40,33,,3,4008,3,70,0, , ,0,0,4,5,2,0,0,,1,,010SU0991,2,40,8,17,2000,24002,51184,19737,80,511755,197738,377.689026,500



(source) $0 is missing... it refers to file1 - isn't that different in this version? Edit: but it's just a comment, and anyway it's the input file.

and we have no results appended, even after print $0 at line 38 - at that line, how does it know which file's header row it's printing? Edit: because it's the input file...
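
(To illustrate that last point: getline < file does not advance NR, so NR counts only records from the file named on the command line:

BEGIN { while ((getline line < "sites.csv") > 0) n++ }   # NR is still 0 afterwards
NR == 1 { print "this is line 1 of apart.csv" }

which is why the NR == 1 test always sees the apart.csv header.)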


Here's what I ran:

awk -f constraint.awk apart.csv > out3.csv



and the contents of constraint.awk, which I pasted from here directly into a terminal editor (called pico), are identical, line for line, to the script posted above.


Anthony Mellor (Chartered Accountant, Author) commented:
I'm just checking my input file; based on previous experience, it's been my files that caused your script to fail...
Bill Prew commented:
Yes, I suspect a file problem...
Anthony Mellor (Chartered Accountant, Author) commented:
Regrettably it's not just the file: the whole disk just died. I have a copy, but not up to the minute, so I will recreate the files from the "original data" - a useful if unplanned test of my documented process, and of what I had in mind as an extra backup, though not one I planned to use just yet.

This then means rerunning "script 1" again, so we will see if using a RAM disk reduces the 21 hours.

So, a hold-up while I recover, or rather rebuild. DiskWarrior found sweet nothing on the drive, so it's all gone and it's back to my copy. Think I'll run a backup overnight. I have been switching "off" everything I can find so as to release resources - a risk, and it seems I chose red and the ball has landed on black. Such is the roulette of life and computing.

Happily all my AWK code is safe.

good night :-)
Bill Prew commented:
Sorry to hear that, good luck.

~bp
Anthony Mellor (Chartered Accountant, Author) commented:
Thanks - I'll slog through, no problem - and having learned so much from these questions and answers, it won't take me anything like as long. I shall consider it homework, or perhaps a mock exam... actually, not so "mock". 2:49 am here, so start time will be delayed.
Anthony Mellor (Chartered Accountant, Author) commented:
script is running...
Bill Prew commented:
Chug chug chug ...  traveling this week so maybe sparse responses but curious to see how this goes.

~bp
Anthony Mellor (Chartered Accountant, Author) commented:
bon voyage

watch this space :-)
Anthony Mellor (Chartered Accountant, Author) commented:
That and this

awk -f EE29008005.awk apart.csv > out3.csv

thanks.
Anthony Mellor (Chartered Accountant, Author) commented:
Finished in 33 hours, running on a little laptop where I could leave it to it.

Oh dear: identical results to the first script. I copied the wrong script code into it.

Starting over...