Anthony Mellor (United Kingdom)

asked on

Awk & Pythagoras - applying csv file data to the problem

This is a follow on to here:
https://www.experts-exchange.com/questions/29004621/awk-and-Pythagoras.html

How do I use awk to read two pairs of coordinates from csv files with the following headers:
ACCYR,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,ACCDAY,Day_of_Week,A8H,Local_Authority_(District),Local_Authority_(Highway),1st_Road_Class,1st_Road_Number,Road_Type,Speed_limit,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,ACCREF,ACCMTH,A8M,Accident_Index,Date,Time
Accident_Index,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,Time,Local_Authority_(District),Local_Authority_(Highway),1st_Road_Class,1st_Road_Number,Road_Type,Speed_limit,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,ACCREF,ACCMTH,A8M,ACCYR,ACCDAY,A8H



square root of (((a-c)^2)+((b-d)^2))

where a, b, c and d are Ordnance Survey coordinates called Eastings and Northings, taken from these columns of the two csv file headers above:

file1 (Row 1 above)
a = Location_Easting_OSGR
b = Location_Northing_OSGR
file2 (Row 2 above)
c = Location_Easting_OSGR
d = Location_Northing_OSGR

All coordinates are 6 digits, e.g. 123456.

File 1 is 3,300-odd records.
File 2 is about 3 to 6 million records, depending (on me).



And in case it's relevant: I want to know which record in file 1 has the least distance from each record in file 2, but that might be better as another question.
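To make the formula concrete, here is a worked example with made-up coordinates (not taken from the files): with a = 530000, b = 180000, c = 529400 and d = 180800,

square root of (((530000-529400)^2)+((180000-180800)^2)) = square root of (600^2 + 800^2) = square root of 1,000,000 = 1,000

and since OSGR Eastings and Northings are in metres, that is a distance of exactly 1 km.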
Bill Prew

A few questions:

  1. How are the records of the two files "matched up"?  Do you want to pair every record in the second file with every record in the first file, or is there some other key(s) used to match them?
  2. What should the output be, a new delimited file?  And which columns should be in it?

~bp
Anthony Mellor (ASKER)

We are arriving at the crux of the matter. There is no relational connection (no key) between these two files at all, so we are creating one, a cross join in SQL terms I think. We calculate the distance (Pythagoras) between every record in file 1 and every record in file 2, record the shortest distance from each file 2 record to its nearest file 1 record, and save that in the (now) related file 2 record.
This is what broke Excel.

Output is the same file 2, plus the result of the shortest distance to a record in file 1 (an SQL sub-query(?) and an Excel array formula).

I fear the awk solution will break my i.q.
Also, these feel like fairly large numbers of records, and I have to be honest and say I'm not sure how well awk will scale for this task.  I would suggest we try a test with a few thousand records in the files and see how that performs.  Can you provide a couple of test files like that?

Based on a few earlier questions I gather this data is on a legacy system somewhere (or used to be) that doesn't have the technology or horsepower to analyze the data there?

~bp
Anthony,

It's an interesting problem and I will be checking in a bit more this afternoon, but I have some personal stuff to attend to also, so I may be slower to respond.  But I will spend some time on this, and if you can provide some test data that would be helpful.

~bp
Yes, the legacy system is Silicon Office and it takes three weeks to run these calcs.
I have replicated it in Excel for 500,000 records, but Excel then can't save the file. Then I used array formulae and that does work nicely, but only for 500,000 records. Then I shifted to SQLite, which handles it OK, in a very few minutes. Once in SQLite I am faced with doing what we are doing here using SQL, which I know will do it, but again I am a novice with that, and it struck me that I could have done this in DOS with Edlin (a while ago...), so maybe modern editors/scripts may also work, and hey presto! I fell over awk. Yes, I understand scale may be an issue; that has been the issue throughout all of this little project.
Of course the data would not have fitted on the DOS machine in the first place.
The machine here is a MacPro with multiple processors (I forget how many) and 128GB RAM, so that is as scaled up as I am going to get, and presumably that will help awk along. The data is in fact speed cameras and accidents.

So in short I am looking for the nicest solution and I do like the idea of awk, especially when I saw how it chewed through x million records today.
P.S. The output file also needs to say which record it matches in the other file.

Don't do anything yet; the headers are wrong. File 1 is not the right file. I will fix that now.
ACCYR,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,ACCDAY,Day_of_Week,A8H,Local_Authority_(District),Local_Authority_(Highway),1st_Road_Class,1st_Road_Number,Road_Type,Speed_limit,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Special_Conditions_at_Site,Carriageway_Hazards,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,LSOA_of_Accident_Location,ACCREF,ACCMTH,A8M,Accident_Index,Date,Time
SORT,POL,REF,CAM ETG,CAM NTG,INSTALL,MONTH,WHEN,TYPE,CAM_ETGkm,CAM_NTGkm



square root of (((a-c)^2)+((b-d)^2))

where a, b, c and d are Ordnance Survey coordinates called Eastings and Northings, taken from these columns of the two csv file headers above:

file1 (Row 1 above)
a = Location_Easting_OSGR
b = Location_Northing_OSGR
file2 (Row 2 above)
c = ETG
d = NTG

All coordinates are 6 digits, e.g. 123456.

In file 2, the reference to pick up and carry with the shortest distance is REF.

File 2 is 3,300-odd records.
File 1 is about 3 to 6 million records, depending (on me).

Assuming you are still up for this, I am aware it's mission leap; I'm just curious whether awk can do it, because if it can, I think for our purposes it may be an unexpected and apt solution, and I would be happy to become a student of it, having seen its reference material. It looks kind of like the stuff I used to write a few decades ago.
And now the extracts of the above:

camlist is file 2
csv-extract is file 1

This is real data, but only an extract of it; hopefully enough to get some "hits" (matches).
We have a couple of other constraints we apply, such as only accepting matches up to (say) 1km from the site, and an installation date of the site not more than 84 months before and 59 months after (i.e. 7 years before to 5 years after), but the real grunt work is the Pythagoras calcs. That said, the time constraints are a second tier because they have to be built in for that measure. However, our current measure (in this Question) is "all time" for our records, which I think is 23 years, i.e. no time constraints.

Take your time, I'm happy to have some help with this. It's not work, it's personal voluntary stuff.

I have been reading the links you gave me. I can see how awk might be being asked a bit much by this Question.
That said, I am fascinated by the 80's addition of user-designed formulae (I forget the correct word), also a feature of Excel, my main tool; I often use the same formulae over and over again. Also that it was designed as a quick two-line command language, a bit like batch files of old. It looks like, if I could get the hang of the (admittedly probably seen as "hostile") codes, it would be easy to use, with built-in variables and field recognition. Wish I had come across it back in the day, 1984 to be exact.
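(The term is probably "user-defined functions", which arrived in awk in the mid-1980s. Purely as an illustrative sketch, the Pythagoras step could be wrapped up as one:

function dist(a, b, c, d) {
   return sqrt((a - c)^2 + (b - d)^2)
}

and then called wherever it is needed, e.g. x = dist(a, b, c, d).)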
CAMLISTEXTRACT.csv
CSV-EXTRACT.CSV
ASKER CERTIFIED SOLUTION
Bill Prew

This solution is only available to members of Experts Exchange.
I agree it's a case of at what point I import the data into a database (now PostgreSQL, previously MySQL and before that SQLite).
How about I use awk to get the source data in the right order and with the right columns (so e.g. adding the year and the reduced-length 6-digit fields you have already done)? If you agree, then that leaves only the reordering of the columns, in so far as I have 7 CSV files and 44 DAT (tab delimited, I think) files with the same data but different headers, except for some extra columns present in both data sets.

I've posted an AWK column-reorder question; I'll just find the link. (And no, I regret it wasn't me who wrote the code snippet; I found it elsewhere.)

https://www.experts-exchange.com/questions/29005085/AWK-How-do-I-reorder-the-columns-in-a-csv.html
Agreed, wrong app for the calcs.
Bill, any chance of a copy of the test code you tried?
Of course.  Set the name of file2 in the BEGIN block, and then redirect to save the output to a file.  I ran it like this (after having to clean up the end-of-line characters in one of the files; they were wrong for Windows):

gawk -f EE29004649.awk CSV-EXTRACT.csv > 

# One time logic before records are read from input file
BEGIN {
   # Change the field separator from whitespace to comma for our csv files
   FS = ","
   file2 = "CAMLISTEXTRACT.csv"
}

# Main processing loop handling each line of the file
{
   # Check if first line (header)
   if (NR == 1) {

      # If this is header just write it to output adding new column headers
      print $0 ",REF,CAM ETG,CAM NTG,Distance"

   } else {

      # skip blank lines
      if ($0 != "") {

         # Save input record from file1
         source = $0

         # Save needed fields for calcs below, remove double quotes if found
         a = $2         # Location_Easting_OSGR
         b = $3         # Location_Northing_OSGR
         gsub("\"", "", a)
         gsub("\"", "", b)

         # Loop through every record in file2, outputting one line per pairing
         while ((getline < file2 ) > 0 ) {
            if ($1 != "SORT") {
               c = $4      # CAM ETG
               d = $5      # CAM NTG
               x = sqrt(((a-c)^2)+((b-d)^2))
               printf("%s,%s,%s,%s,%f\n", source, $3, $4, $5, x)
            }
         }
         close(file2)
      }
   }
}


~bp
thanks
<BEGIN BRAINSTORMING MODE>

Just curious, do you have a Windows computer you could run this on, and what length of time would be acceptable for runtime?

I did a small proof of concept in VBScript on Windows, just for curiosity, using the large sample dataset provided above.  Without writing any output it can chug through all the data, match each record in the larger file to each record in the smaller file, and do all the math in about 1 minute.  That's the good news.  The bad news is that when I add in outputting each line of the larger file with the resulting calculations from the smaller file, the runtime goes way up.  I didn't even let it finish; it didn't feel like the right approach.

You have 57,000 records in the large file here, and about 1,000 in the smaller file.  That means if we write out each permutation we would have about 57,000,000 records.  Not sure what you would do with them.  And in addition the large file is about 11.5MB in size, so the output file produced would likely be over 12GB.  Feels like too much raw data to me.

I think we may be able to use this underlying calculation approach, and then if we can home in on what data you really need out of this, maybe we can shrink down the output to make it perform better.  Help me understand better how you are going to use this, and whether we can reduce the output by looking for the best choice, etc.

Just for completeness here is the quick VBS script I did, but I could likely get a little more performance out of it.  I have the output commented out currently.

Const ForReading = 1
Const ForWriting = 2
Const ForAppending = 8
Const TristateTrue = -1
Const TristateFalse = 0
Const TristateUseDefault = -2

strFile1 = "B:\ee\EE29004649\CAMLISTEXTRACT.CSV"
strFile2 = "B:\ee\EE29004649\CSV-EXTRACT.CSV"
strOutFile = "B:\ee\EE29004649\out.txt"

Set objFSO = CreateObject("Scripting.FileSystemObject")

' Read the smaller (camera) file into memory and split it into lines
Set objFile1 = objFSO.OpenTextFile(strFile1, ForReading, False, TriStateUseDefault)
strData = objFile1.ReadAll()
arrFile1 = Split(strData, vbCrLf)
objFile1.Close

' Split each camera line into its fields
For i = LBound(arrFile1) To UBound(arrFile1)
    arrFile1(i) = Split(arrFile1(i), ",")
Next

' Open the larger (accident) file for reading and the output file for writing
Set objFile2 = objFSO.OpenTextFile(strFile2, ForReading, False, TriStateUseDefault)
Set objOutFile = objFSO.OpenTextFile(strOutFile, ForWriting, True, TriStateUseDefault)

' Process the accident file line by line
FirstLine = True
Do Until objFile2.AtEndOfStream
    strData = objFile2.ReadLine
    If strData <> "" Then
        If FirstLine = True Then
            FirstLine = False
'            objOutFile.WriteLine strData & ",REF,CAM ETG,CAM NTG,Distance"
        Else
            arrFile2 = Split(strData, ",")
            a = Replace(arrFile2(1), """", "")   ' Location_Easting_OSGR
            b = Replace(arrFile2(2), """", "")   ' Location_Northing_OSGR
            ' Compare this accident against every camera record
            For i = 1 To UBound(arrFile1)
                r = arrFile1(i)(2)               ' REF
                c = arrFile1(i)(3)               ' CAM ETG
                d = arrFile1(i)(4)               ' CAM NTG
                x = sqr(((a-c)^2)+((b-d)^2))     ' Pythagoras distance
'                objOutFile.WriteLine strData & "," & r & "," & c & "," & d & "," & x
            Next
        End If
    End If
Loop

objFile2.Close
objOutFile.Close


~bp
The Pythagoras application IS the process to reduce the volume of raw data, so you are exactly right: too much raw data and we have to cull it, which we are doing (have done, of course, in reality) with Pythagoras' famous formula.

In case it's relevant to your thoughts here's aikimark's VBA solution I used in Excel:

https://www.experts-exchange.com/questions/28998243/Compare-3-000cells-with-each-one-of-500-000-cells-how.html?anchorAnswerId=41996653#a41996653
Okay, I adjusted my code to find the closest coordinates and only output one record per input, adding the desired fields from the closest point.  Creating the output for the closest points ran for 2.5 minutes on my desktop, which is decent power but not massive.  Just to give you a benchmark for this approach.  I know that isn't all the data, but the I/O is the largest driver, and it sounds like this test file was 57,000 records of about 60,000,000 in the real data set.  So if I do the math, it would run for about (60,000,000/57,000)*2.5 minutes, which works out to about 44 hours.

~bp
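Bill's adjusted script itself is in the accepted solution above. Purely as a hedged illustration of the closest-match idea (not his actual code), a minimal awk sketch might look like the following, assuming the same field positions and file names as his earlier script (accidents: Easting in $2, Northing in $3; cameras in CAMLISTEXTRACT.csv: REF in $3, ETG in $4, NTG in $5), and loading the small camera file into memory once instead of re-reading it for every accident record:

# Sketch only - assumes the same file layouts as the earlier script
BEGIN {
   FS = ","
   file2 = "CAMLISTEXTRACT.csv"

   # Load the (small) camera file into arrays once
   n = 0
   while ((getline line < file2) > 0) {
      split(line, f, ",")
      if (line != "" && f[1] != "SORT") {
         n++
         ref[n] = f[3]      # REF
         etg[n] = f[4]      # CAM ETG
         ntg[n] = f[5]      # CAM NTG
      }
   }
   close(file2)
}

# Header line: pass it through with the new column names added
NR == 1 {
   print $0 ",REF,CAM ETG,CAM NTG,Distance"
   next
}

# Every other non-blank line: find the nearest camera and output one record
$0 != "" {
   a = $2; b = $3                          # Easting / Northing
   gsub("\"", "", a); gsub("\"", "", b)
   best = -1
   for (i = 1; i <= n; i++) {
      x = sqrt((a - etg[i])^2 + (b - ntg[i])^2)
      if (best < 0 || x < best) { best = x; bi = i }
   }
   printf("%s,%s,%s,%s,%f\n", $0, ref[bi], etg[bi], ntg[bi], best)
}

It would be run in the same way as before, e.g. gawk -f nearest.awk CSV-EXTRACT.csv > output.csv (the script and output names here are just placeholders).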
Hi Anthony,

Bill said:
> "You have 57,000 records in the large file here, and about 1,000 in the smaller file.  That means if we write out each permutation we would have about 57,000,000 records."

Then you said:
"the Pythagoras application IS the process to reduce the volume of raw data, so you are exactly right, too much raw data and we have to cull it, which we are doing (have done of course in reality) with Pythagoras' famous formula."

And you say that because all you want written out is:
> "...which record in file 1 it is, that has the least distance from each record in file2"?
i.e. pairs of closest records, right?
So if there are 57,000 accident records, then you will want 57,000 records in the output file regardless of the number of records in the camera file, right?

Edit: Looks as if Bill is already onto this, but I'd still like to confirm I'm correct pls Anthony.

tel2
Hi Bill,

> "...and it sounds like this test file was 57,000 records of about 60,000,000 in the real data set."
Do you mean "6,000,000", rather than "60,000,000"?

The original post says:
> "file 2 is about 3 to 6 million records depending (on me)."
Sounds to me as if comparing each record in file1 with each record in file2 in a brute-force way could be a bit (or even a byte) slow, because of the number of records.  But I don't see any point in measuring the separations of pairs of records which could not possibly be closer than ones we've already checked.

Here's one way I'm thinking that principle may be able to be implemented:
- Think of the map being a grid of (maybe square) blocks.
- Assign each camera location into a block in the grid.
- For each accident location:
    - Check the distances to each of the cameras in the same block (in the grid), and the adjacent blocks.  (You can't just look in the centre block because there could be a camera in an adjacent block which is closer than the closest one in the current block.  By the same kind of logic, if you don't find any cameras in the centre block or the 8 around it, you'd better look at the 16 blocks around those, and so on. *)

To detail the first part a bit more, describing the data structures:
- For each camera file record:
    - Read it from the file.
    - From the X & Y co-ordinates, generate a block#.  (E.g. if X=12345 & Y=67890 you could put this camera in block# 123:678 by simply knocking 2 (or whatever) digits off the right of X & Y, or dividing them by some number.)
    - Push this camera record onto an array which is in a hash which has the block# as its key (e.g. '123:678').  (So, given a block# you can find an array of all cameras in that block.)
- Now work through the accident records, as described in the previous paragraph.

*However, let's say the blocks are 1km x 1km.  If the accident was near the top-left corner of its block, and the closest camera in the 9 blocks (that same block + 8 adjacent blocks) is near the bottom-right corner of that same block, then the separation could be up to about 1.414km.  This means there may be closer cameras which are outside the 9 blocks (e.g. 1.1 blocks to the left).  So, you'd have to have a check that if the separation between the accident and the closest camera is more than 1km (the length of a side of a block), then check blocks outside the 9 blocks.  This principle also extends if the closest camera was found in the adjacent 8 blocks with a separation of more than 2km, and so on.

That should speed things up considerably.  Being optimistic, if there are 10,000 blocks (approx 100x100) in the grid it might speed things up about 1000 times, assuming only 9 blocks need to be checked in most cases (i.e. an average of 10 blocks searched).  10,000 blocks / 10 blocks = 1000 times faster.

I don't plan to code this myself, but if I did I'd use Perl.  But then I use Perl for almost everything, so who am I to recommend a language?

Does anyone see any issues with this strategy?
I'm sure there will be faster methods, and I'm thinking about one, but it's far from complete.

Anthony:
Q1. How fast do you need this to run (maximum duration)?
Q2. How often is it likely to be run?

tel2
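For concreteness, tel2's grid idea could be sketched in awk roughly as follows (tel2 suggests Perl; this is only an illustration, and the 1km block size, file names and field positions are assumptions carried over from the earlier scripts). Each camera is bucketed by its truncated coordinates, and each accident checks only its own block plus the eight neighbours, falling back to a simple full scan, rather than an expanding ring search, when nothing is found within one block width:

BEGIN {
   FS = ","
   blk = 1000                              # block size in metres (assumed)
   file2 = "CAMLISTEXTRACT.csv"

   # Bucket every camera by its truncated coordinates (block number)
   n = 0
   while ((getline line < file2) > 0) {
      split(line, f, ",")
      if (line != "" && f[1] != "SORT") {
         n++
         ref[n] = f[3]; etg[n] = f[4]; ntg[n] = f[5]
         key = int(f[4] / blk) ":" int(f[5] / blk)
         cell[key] = cell[key] " " n       # list of camera indexes in this block
      }
   }
   close(file2)
}

NR == 1 { print $0 ",REF,Distance"; next }

$0 != "" {
   a = $2; b = $3
   gsub("\"", "", a); gsub("\"", "", b)
   bx = int(a / blk); by = int(b / blk)
   best = -1

   # Check this block and its 8 neighbours
   for (dx = -1; dx <= 1; dx++) {
      for (dy = -1; dy <= 1; dy++) {
         m = split(cell[(bx + dx) ":" (by + dy)], idx, " ")
         for (j = 1; j <= m; j++) {
            i = idx[j]
            x = sqrt((a - etg[i])^2 + (b - ntg[i])^2)
            if (best < 0 || x < best) { best = x; bi = i }
         }
      }
   }

   # If nothing was found within one block width, fall back to a full scan
   if (best < 0 || best > blk) {
      for (i = 1; i <= n; i++) {
         x = sqrt((a - etg[i])^2 + (b - ntg[i])^2)
         if (best < 0 || x < best) { best = x; bi = i }
      }
   }

   printf("%s,%s,%f\n", $0, ref[bi], best)
}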
Yes, six million is the big file.
I'm still reading everything else. Thanks for your interest, happy Saturday :-)
Six million is better; that means my code could run in 4 to 5 hours then...

~bp
I am uploading what was my original working proof file; if Excel were capable of scaling up (and it has no chance of that), this file shows how it would look. The crucial formula is in AI13. That contains Pythagoras and a time constraint based on months before and months after. This shows every distance from every accident to every camera. By having the calculation actually make that selection, we reduce the output results from 6m x 3300 to 6m x 1, so saving a vast amount of output. That process I will show in my following comment, with the file I used to achieve it.

I am sharing these to give myself some chance of what I say by way of description being seen to actually match what I am doing, i.e. so you can see for yourselves. I'll gladly explain these files further. So here we have the first version attached.
SiteTest07-VariableCoordExtract.xlsx
Light bulb moment: I forgot to mention that we cull by excluding all records where the matched site is more than (say) 1km distant. I say "say" because I need that to be a variable. In fact, as regards culling, if above 1km distant, it's whatever distance gives us a number of results manageable in Excel. I (think I) know we can manage 500,000. I gather about 80% fall away as too distant, if that helps. What defines "too" is another matter, hence 1km being a variable. Note kilometres, not miles.
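One way to keep that cut-off a variable rather than hard-coded (a sketch only; the variable and script names are placeholders) is to pass it to awk on the command line:

gawk -v maxdist=1000 -f nearest.awk CSV-EXTRACT.csv > culled.csv

and then, inside the script, only output a record when its nearest site qualifies:

   if (best <= maxdist) {
      printf("%s,%s,%f\n", $0, ref[bi], best)
   }

with maxdist in metres, since the Eastings and Northings are in metres (so 1000 = 1km).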
Hi Tel, that's an interesting take on the distance question. Currently we are using Pythagoras (which I can't change) to reproduce the legacy results upon which all the final reports are based. The reason it's interesting is that I have been wondering about the situation where there are multiple nearby ("nearby" to be defined) sites, all with different installation dates. The Excel file above has a variable for desensitising the coordinates, pretty much as you envision, reducing the 6 digits to 4 etc. However, that still means iterating through each of the 3,335 records in the small file; and then there is the time constraint (84 months before to 59 after the install month number).

Yes: you Perl, me Excel (since I crunch only numbers, but not usually this many).

Anthony - I've found the "array method" file; I'll upload it when Excel stops crashing.
@Bill, I am curious what that would look like in awk. Can it do that?

@tel2 it will rarely be run, once a small set of variations has been saved.
Run time doesn't really matter, but a minute or two would be nice ;-)

Remember: this has to be transparent, meaning ANYONE can understand and see for themselves that the calculations are right.
Excel does provide validation of small sets of data, indeed whole regions, but a "black box" will fail the test. We are seeking a self-evident truth that detractors cannot pick apart with bs.

You might opine there may be more politics involved than tech.

I am still working on uploading the array solution. I had forgotten how on the edge of Excel's capability I have been. It's like swimming in syrup.

Anthony
I just realised this Question already has a Solution. So our part two of this Question continues here:
https://www.experts-exchange.com/questions/29006698/Awk-Pythagoras-or-SQL-applying-csv-file-data-to-the-problem-Part-2-continuation.html