Avatar of Anthony Mellor
Anthony MellorFlag for United Kingdom of Great Britain and Northern Ireland

asked on 

AWK: Pytthagoras bp script Part REVISITED

edit to add: the attached files are not extracts, they are the actual files I used to produce the output file "out.txt"
That said, the acc.csv is a very small sample of the main file.
 

This is the unconstrained script.
The input files sites.csv is being appended to the end of the output file

Code is here with line 18 fixed (I had a strange double quote):

# ACC.csv
#CODE,ACC_EC,ACC_NC,Long,Lat,POL,Severity,No_Vehicles,No_Casualties,Date,DayofWeek,Time,LA_District,LA_Highway,1st_Road_Class,1st_Road_Number,Road_Type,LIMIT,Junction_Detail,Junction_Control,2nd_Road_Class,2nd_Road_Number,Zebra-Human,Zebra_Machine,Visibility,Weather_Conditions,Surface,Special_Conditions,Road_Hazards,Urban_Rural,Police_Attended,LSOAofLocation,ACCREF,Month,Minute,ACCDAY,Hour,YEAR,WHEN,ACC_EC5,ACC_NC5
#   a = ACC_EC ($2)
#   b = ACC_NC ($3)
#
# SITES.csv
#  OLD  SORT,POL,REF,CAM ETG,CAM NTG,INSTALL,MONTH,WHEN,TYPE,CAM_ETGkm,CAM_NTGkm
# SORT,POL,REF,SITE_ ETG,SITE_NTG,INSTALL,MONTH,WHEN,TYPE,BEFORE,AFTER
#   r = REF ($3)
#   c = SITE_ ETG ($4)
#   d = SITE_NTG ($5)

# One time logic before records are read from input file
BEGIN {
   # Change field separator from space to comma for our file
   FS = ","
   file2 ="SITES.CSV"
}

# Main processing loop handling each line of the file
{
   # Check if first line (header)
   if (NR == 1) {

      # If this is header just write it to output adding new column headers
      print $0 ",REF,CAM ETG,CAM NTG,Distance"

   } else {

      # skip blank lines
      if ($0 != "") {

         # Save input record from file1
         source = $0

         # Save needed fields for calcs below, remove double quotes if found DOUBLE QUOTES PREVIOUSLY STRIPPED FROM FILE
         a = $2         # ACC_EC 
         b = $3         # ACC_NC
         gsub("\"", "", a)
         gsub("\"", "", b)

         Bestx = 999999999
         # Add new columns at the end
         while ((getline < file2 ) > 0 ) {
            if ($1 != "SORT") {
               r = $3
               c = $4      # SITE_ETG
               d = $5      # SITE_NTG
               x = sqrt(((a-c)^2)+((b-d)^2))
               if (x < Bestx) {
                  Bestx = x
                  Bestr = r
                  Bestc = c
                  Bestd = d
               }
            }
         }
         printf("%s,%s,%s,%s,%f\n", source, Bestr, Bestc, Bestd, Bestx)
         close(file2)
      }
   }
}

Open in new window


I ran the above script like this:

MacPro:5-CombineBOTH ADM$ awk -f unlimited.awk ACC.CSV sites.csv > out.txt

Open in new window


The attached out.txt file shows the output, which (erroneously) includes the entire sites.csv file and some odd headers right at the end.

(Reconstruction is complete) 01:36
acc.csv
Sites.csv
out.txt
* AWKShell Scripting

Avatar of undefined
Last Comment
Anthony Mellor
ASKER CERTIFIED SOLUTION
Avatar of Bill Prew
Bill Prew

Blurred text
THIS SOLUTION IS ONLY AVAILABLE TO MEMBERS.
View this solution by signing up for a free trial.
Members can start a 7-Day free trial and enjoy unlimited access to the platform.
See Pricing Options
Start Free Trial
Avatar of Anthony Mellor

ASKER

I am humbled by your (Bill's) patience with my repetition of the same error. Thank you.
Avatar of Anthony Mellor

ASKER

Notes to self for reference:

BOM

The BOM code (Byte order Mark) for UTF-8 is hex EF BB BF.
On Macos TEXTEDIT automatically deletes these upon save.
Excel does not add this when saving as csv.

When displayed (with a double click on the out.txt file uploaded above) The first line of the display of line 1 of the unwanted appended sites.csv file starts with the characters    the first of these is described here thus:

ï
Symbol Name:      Latin Small Letter I With Diaeresis
Html Entity:      &iuml;
Hex Code:      &#xef;
Decimal Code:      &#239;
Unicode Group:      Latin-1 Supplement

So should I see this character ï (or indeed ) I should look for this BOM code (first three bytes of the file)and can remove it with a simple open and save using Textedit instead of using a hex editor.

CR LF  Wiki

Extract:
LF:    Unix and Unix-like systems (Linux, MacOS, FreeBSD,
LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal, displayed as "^M" in some editors)

CR+LF, '\r\n', 0x0D0A  Microsoft Windows, DOS

I rather like this bit of history:

"The concepts of line feed (LF) and carriage return (CR) are closely associated, and can be either considered separately or together. In the physical media of typewriters and printers, two axes of motion, "down" and "across", are needed to create a new line on the page. Although the design of a machine (typewriter or printer) must consider them separately, the abstract logic of software can combine them together as one event. This is why a newline in character encoding can be defined as LF and CR combined into one (known variously as CR+LF, CRLF, LF+CR, or LFCR)."

Although "down and across" loses the the real life experience of the line feed which was a rather unexciting and quiet sort of "click" whereas the carriage return was a great big lever usually "levered" with a flourish by the typist (now able to rest for a moment) resulting in a ding (bell sound) after the whole typewriter carriage whizzes across the machine ending with a satisfying clunk, ready for the next characters at the beginning of the new line.

So a Line feed LF moves the type head (the edit cursor) down one line, but keeps typing from the same vertical axis point, to get back (return!) to the beginning of this new line, requires a carriage return CR.

So Windows replicates the ancient typewriter with LF CR whereas Unix feeds (in fact rolls) a line and throws in the carriage return for free. What happens if I do not want the carriage to return? That's where modern word processing takes over with indents and so on.

WHY 0X ?

It indicates Hex.
"Integer Literals. An integer literal can be a decimal, octal, or hexadecimal constant. A prefix specifies the base or radix: 0x or 0X for hexadecimal, 0 for octal, and nothing for decimal." So it's saying "unless I say otherwise it's Decimal".


The X - I reckon because it sounds like Hex.
The 0 ... it's complicated, but there are reasons.
It means "what follows is a hex constant".
Shell Scripting
Shell Scripting

The term 'shell' refers to a general class of text-based command interpreters most often associated with the UNIX and Linux operating systems. Popular shells include Bourne, Debian Almquist (dash), Korn (ksh), Bourne Again (bash) and the C shell family (csh). Some view the DOS 'cmd' prompt as a minimal shell of sorts. It is also possible to install Cygwin on Windows and emulate a full Unix environment with complete shell capabilities. Terminal emulators, such as xterm, GNOME Terminal and OS X Terminal, can be used to access shell.

11K
Questions
--
Followers
--
Top Experts
Get a personalized solution from industry experts
Ask the experts
Read over 600 more reviews

TRUSTED BY

IBM logoIntel logoMicrosoft logoUbisoft logoSAP logo
Qualcomm logoCitrix Systems logoWorkday logoErnst & Young logo
High performer badgeUsers love us badge
LinkedIn logoFacebook logoX logoInstagram logoTikTok logoYouTube logo