Shell script needed to parse fields of text files, fastest way possible.

I have a bunch of text files from which I need to parse only a few items. For example:

A sample text file:
   X4066414  APPLES                    2.500      LB
1231      Acme Mart Co                                 36.1200      90.3000      90.3000
1414      Foliage Acres, Inc.                           37.9700      94.9250      94.9250

  Z1064411  ORANGES                  2.750      LB
1231      Acme Mart Co                                 33.2300      91.3825      91.3825
1414      Foliage Acres, Inc.                           34.4400      94.7100      94.7100

I am checking hundreds of these .txt files. The only thing I need from them is the list
of companies and their IDs in the following format, so I can import it into a database or spreadsheet.

1231;Acme Mart Co
1414;Foliage Acres, Inc.

Some of these files have 10, 20, or more entries per .txt file like the above.

Some will also have more than 2 companies, like:
 Z1064411  ORANGES                  2.750      LB
1231      Acme Mart Co                                 33.2300      91.3825      91.3825
1414      Foliage Acres, Inc.                           34.4400      94.7100      94.7100
1414      Grover Brand, Inc.                           34.4400      94.7100      94.7100

There is a blank line between each of the groups of items (i.e. APPLES, ORANGES), as shown in the above example, if that helps to pull out just the main companies and their respective IDs.

Any help would be appreciated, thanks in advance!



cybrthug Asked:
 
glassd Commented:
Assuming the lines of interest always start with a digit and always end with three value fields:

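awk '/^[0-9]/ {
  # Field 1 is the ID; the last three fields are the numeric values.
  # Everything in between is the company name.
  name = $2
  for (i = 3; i <= NF - 3; i++) {
    name = name " " $i
  }
  # Building the name first avoids a trailing space after it.
  print $1 ";" name
}' <filename> | sort -u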
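
Since there are hundreds of files, it may be quicker to hand them all to a single awk process instead of starting awk once per file. A rough sketch under the same assumptions as above (lines of interest start with a digit and end with three value fields); the function name extract_companies is only an illustrative choice:

```shell
# extract_companies FILE...
# Prints unique "ID;Company" lines gathered from all the given files.
# Assumption: company lines start with a digit and end with three numeric fields.
extract_companies() {
  awk '/^[0-9]/ {
    name = $2
    for (i = 3; i <= NF - 3; i++) {
      name = name " " $i
    }
    print $1 ";" name
  }' "$@" | sort -u
}
```

It could then be called as extract_companies ./*.txt > companies.csv, where companies.csv is a hypothetical output name.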
 
ozo Commented:
What marks the end of "Foliage Acres, Inc." that tells you that "34.4400" is not part of the company name?
How do you know that "APPLES" and "ORANGES" are not company names?
 
cybrthug (Author) Commented:
The 4-digit ID is the only thing that will help in locating the region to start with.
From there you'd have to check how many 4-digit IDs there are and then grab maybe up to 40 characters, starting one space after each ID. Does that help any?
The first line can always be skipped, as each file will start like this:
Z1064411  ORANGES                  2.750      LB

Always drop to the second line and start counting company IDs until it hits the next blank line, which starts the next group (e.g. APPLES), which I will not need. As long as I can get the first group from each file, that is all that matters.
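
If only the first group of each file matters, one possible approach is to stop reading a file's company lines at the first blank line that follows them. This is just a sketch: it keeps the field-based parsing (ID first, three value fields last) rather than cutting a fixed 40 characters, and the function name first_group_companies and the variables skip and seen are made up for illustration:

```shell
# first_group_companies FILE...
# Prints unique "ID;Company" lines taken only from the first group of each file.
# Assumption: groups are separated by a blank (or whitespace-only) line.
first_group_companies() {
  awk '
    FNR == 1 { skip = 0; seen = 0 }        # new file: reset the state
    skip     { next }                      # first group already finished
    /^[0-9]/ {
      seen = 1
      name = $2
      for (i = 3; i <= NF - 3; i++) {
        name = name " " $i
      }
      print $1 ";" name
      next
    }
    seen && /^[[:space:]]*$/ { skip = 1 }  # blank line after companies ends the group
  ' "$@" | sort -u
}
```

The FNR == 1 reset is what lets a single awk run handle many files, for example first_group_companies ./*.txt.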
 
cybrthug (Author) Commented:
Excellent, thank you!
Question has a verified solution.