Link to home
Start Free TrialLog in
Avatar of Anthony Mellor
Anthony MellorFlag for United Kingdom of Great Britain and Northern Ireland

asked on

AWK: Pythagoras bp script

Wondering if I broke the script as all the results are bestx 999999999:

# ACC.csv
#CODE,ACC_EC,ACC_NC,Long,Lat,POL,Severity,No_Vehicles,No_Casualties,Date,DayofWeek,Time,LA_District,LA_Highway,1st_Road_Class,1st_Road_Number,Road_Type,LIMIT,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Zebra-Human,Zebra_Machine,Visibility,Weather_Conditions,Surface,Special_Conditions,Road_Hazards,Urban_Rural,Police_Attended,LSOAofLocation,ACCREF,Month,Minute,ACCDAY,Hour,YEAR,WHEN,ACC_EC5,ACC_NC5
#   a = ACC_EC ($2)
#   b = ACC_NC ($3)
#
# SITES.csv
# SORT,POL,REF,CAM ETG,CAM NTG,INSTALL,MONTH,WHEN,TYPE,CAM_ETGkm,CAM_NTGkm OLD HEADS
# SORT,POL,REF,SITE_ETG,SITE_NTG,INSTALL,MONTH,WHEN,TYPE,BEFORE,AFTER  - NEW HEADS
#   r = REF ($3)
#   c = SITE_ETG ($4)
#   d = SITE_NTG ($5)

# One time logic before records are read from input file
BEGIN {
   # Change field separator from space to comma for our file
   FS = ","
   file2 ="SITES.CSV"
}

# Main processing loop handling each line of the file
{
   # Check if first line (header)
   if (NR == 1) {

      # If this is header just write it to output adding new column headers
      print $0 ",REF,SITE_ETG,SITE_NTG,Distance"

   } else {

      # skip blank lines
      if ($0 != "") {

         # Save input record from file1
         source = $0

         # Save needed fields for calcs below, remove double quotes if found DOUBLE QUOTES PREVIOUSLY STRIPPED FROM FILE
         a = $2         # ACC_EC 
         b = $3         # ACC_NC
         gsub("\"", "", a)
         gsub("\"", "", b)

         Bestx = 999999999
         # Add new columns at the end
         while ((getline < file2 ) > 0 ) {
            if ($1 != "SORT") {
               r = $3
               c = $4      # SITE_ETG
               d = $5      # SITE_NTG
               x = sqrt(((a-c)^2)+((b-d)^2))
               if (x < Bestx) {
                  Bestx = x
                  Bestr = r
                  Bestc = c
                  Bestd = d
               }
            }
         }
         printf("%s,%s,%s,%s,%f\n", source, Bestr, Bestc, Bestd, Bestx)
         close(file2)
      }
   }
}

Open in new window


run time was 1 hour 48 minutes.
Avatar of Bill Prew
Bill Prew

Did you make any changes to the AWK script?

Are you sure the input file is compatible?

~bp
Avatar of Anthony Mellor

ASKER

I changed the names but not the script as it were. Mistakes quite possible. I am just uploading extracts of the two input files and the output file. I have added an x so as to differentiate them here at my end. Now attached  - you beat me to it responding so quickly.

the out.csv output file is 116mb which is a significant culling of the 6m records.

These extracts are copied and pasted from the original files using macos text edit in "text" mode.

I have not altered the field positions so all your field numbers remain good; so hopefully if I broke it you can see how.

and did I do the command line correctly?
 awk -f Pythagoras.awk file1  SITES.CSV >out.txt
I had a copy of acc.csv also named file1 as the lack of file1 being mentioned confuses me so I put both in the directory acc.csv and file1 being the same content.

1am here,  must sleep. It was a thrill to see your script running :-)
ACCx.csv
SITESx.CSV
Outx.csv
ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
thank you and good morning. I must have mixed up a before and an after, doing this too late at night, silly mistakes.

Now changed and just to prove it:

0002a500: 2c32 3339 3638 2c32 3431 3131 0a33 3333  ,23968,24111.333
0002a510: 302c 3930 2c33 3637 2c33 3932 3232 352c  0,90,367,392225,
0002a520: 3739 3736 3230 2c32 3030 332c 332c 3234  797620,2003,3,24
0002a530: 3033 392c 4d2c 3233 3935 352c 3234 3039  039,M,23955,2409
0002a540: 380a 3333 3331 2c39 302c 3336 382c 3339  8.3331,90,368,39
0002a550: 3233 3330 2c38 3038 3834 302c 3230 3034  2330,808840,2004
0002a560: 2c31 302c 3234 3035 382c 4d2c 3233 3937  ,10,24058,M,2397
0002a570: 342c 3234 3131 370a 3333 3332 2c39 302c  4,24117.3332,90,
0002a580: 3336 392c 3339 3236 3835 2c38 3038 3638  369,392685,80868
0002a590: 352c 3230 3034 2c34 2c32 3430 3532 2c4d  5,2004,4,24052,M
0002a5a0: 2c32 3339 3638 2c32 3431 3131 0a33 3333  ,23968,24111.333
0002a5b0: 332c 3930 2c33 3730 2c33 3932 3939 352c  3,90,370,392995,
0002a5c0: 3830 3239 3530 2c32 3030 332c 342c 3234  802950,2003,4,24
0002a5d0: 3034 302c 4d2c 3233 3935 362c 3234 3039  040,M,23956,2409
0002a5e0: 390a 3333 3334 2c39 302c 3337 312c 3339  9.3334,90,371,39
0002a5f0: 3336 3133 2c36 3631 3433 352c 3139 3934  3613,661435,1994
0002a600: 2c34 2c32 3339 3332 2c46 2c32 3338 3438  ,4,23932,F,23848
0002a610: 2c32 3339 3931 0a33 3333 352c 3930 2c33  ,23991.3335,90,3
0002a620: 3732 2c33 3934 3235 352c 3833 3636 3230  72,394255,836620
0002a630: 2c31 3939 342c 312c 3233 3932 392c 462c  ,1994,1,23929,F,
0002a640: 3233 3834 352c 3233 3938 380a 3333 3336  23845,23988.3336
0002a650: 2c39 302c 3337 332c 3339 3636 3438 2c36  ,90,373,396648,6
0002a660: 3539 3132 382c 3230 3035 2c31 302c 3234  59128,2005,10,24
0002a670: 3037 302c 4d2c 3233 3938 362c 3234 3132  070,M,23986,2412
0002a680: 39 

Open in new window


speaking of doing this while tired, I just ran your code and got nothing, opened the file and I had deleted the script last night!
Anyway, it's been running since 9am so we will see in a while.
this is a little off topic, but given the volume of data you may consider using a spatial index rather than calculating all possible distances.
if you have no db server, spatialite works with sqlite and might be a good pick.
currently I am thinking of applying every method that works - which is to say, every method I can learn how to apply or someone gives me in a box.
AWK script has been running for 10:38 hours to now, cpu 99,5% to 100%
I'l leave until tomorrow morning.
Yikes...

~bp
well if it works we will know how long it takes :-)
I suppose I could leave it and leave it, it's only taking a small corner to do its thing and oddly I am not being hampered by it's apparent 100% cpu.

can AWK update a visible counter when running? At least we would know it is actually doing something
AWK could write to the console, but you would slow it down some just in doing that.  Easiest way to check it is to see if the output file is growing in size...

~bp
yes I wondered if that was possible .. I'll go look
yay, 586 mb last updated a minute ago...
extract of output attached, program is still running
OutExt.csv
the growth of the output file should be linear, so if you know the expected size or number of line, you'll be at home...

given the new information indexing one way or another seems the right thing to do... by far, so i'm still keen on spatial indexes of some sort
EE seems to be a place where I demonstrate to myself how I make silly mistakes. I was so accustomed to looking at the blank line in terminal under the AWK command line , I failed to realise it has completed as evidenced by the return of the prompt.

Run time 9am yesterday to 6.15 am today. 21 and a quarter hours. Out file size 958.9mb compared to in size of 799.7mb

Last night I sent a small output extract for comparison with the legacy data and it's more accurate.

yeeha! Success, thanks bp :-)
you can figure out the duration by counting the lines : i assume the output file is supposed to produce the same number of lines as your first file. alternatively you can grab the size of both the first 1000 lines of file1 and the output file and figure out the expected size of the output file.

it does not seem that weird that you observe few writes : the awk script has to loop through file2 for each line it outputs. but 2 hours seem really long... maybe some bufferisation kicks in
2 hours? It was 21 hours.. :-)
Perfect solution: fixed it.