sorting a flat file in Unix

Posted on 2006-05-16
Last Modified: 2011-09-20
I asked this question before but did not get an answer I could use.
I have a flat file that looks somewhat like this

Field A    Field B    Field C
--------   --------   --------
A123B      RANDO      123
B120T      MRAND      567
M234K      OMRAN      678
A123B      DOMRA      999

I use a custom sort function to extract the needed fields and sort on them.  Up until now, the only sort I needed was by the last digit of Field A, then
if (tmp == 0)
  compare the Field C values
then return the tmp variable.
However, a new requirement has now been added and I am not sure how to implement it.
I need to sort the same way (hopefully without redoing the way I currently sort), but records whose Field A is EXACTLY the same should all stay together, still sub-sorted by Field C.  Otherwise, records need to be sorted the way they have been: by the last digit of Field A first, then by Field C.

Well, I am using simplified data for you, but here is a sketch of the simplified pseudocode.

$FIELDA_1 = substr($a,1,5);
$FIELDA_2 = substr($b,1,5);

$FIELDC_1 = substr($a,14,3);
$FIELDC_2 = substr($b,14,3);

# numeric part of FIELD A
$Reverse_1 = ($1) if $FIELDA_1 =~ /(\d+)/;
$Reverse_2 = ($1) if $FIELDA_2 =~ /(\d+)/;

# last digit of the numeric part
$FirstCharacter_1 = substr ($Reverse_1, length($Reverse_1) - 1);
$FirstCharacter_2 = substr ($Reverse_2, length($Reverse_2) - 1);

$tmp = $FirstCharacter_1 <=> $FirstCharacter_2;

# tie on the last digit: fall back to Field C
if ($tmp == 0) {
    if ($FIELDC_1 > $FIELDC_2) {
        $tmp = 1;
    } elsif ($FIELDC_1 < $FIELDC_2) {
        $tmp = -1;
    }
}

return $tmp;

I would prefer to keep my code the same for the most part, but if you suggest using something like map i can do that but will have to give more detail since I am not familiar with it.
Question by:feldmani
    LVL 6

    Expert Comment

    I think you should combine Field A and Field C into a single string for each record.
    When you sort by that combined string ($temp), you will get the order you want.
    I hope this helps.
    Phuoc H. Nguyen
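
One way to read the "combine the fields into a string" suggestion is sketched below in C. The function name make_sort_key and the key layout are assumptions made up for this illustration, not something from the thread. The key holds the last digit of Field A followed by Field C zero-padded to ten digits, so plain string comparison gives the original two-level order (last digit, then Field C). It assumes Field C is a non-negative number.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a key of the form "D:CCCCCCCCCC" into key.
 * D is the last digit found in field_a ('0' if there is none).
 * C... is field_c zero-padded to 10 digits, so that comparing keys with
 * strcmp() orders them the same way as comparing the numbers. */
static void make_sort_key(const char *field_a, long field_c,
                          char *key, size_t key_size)
{
    char last_digit = '0';
    for (const char *p = field_a; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            last_digit = *p;   /* overwritten until the last digit is reached */
        }
    }
    snprintf(key, key_size, "%c:%010ld", last_digit, field_c);
}
```

Note that a key like this only reproduces the old ordering. Keeping records with an identical Field A together needs extra information in the key (or a grouping step), which this sketch does not attempt.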
    LVL 41

    Expert Comment

     sub byField {
        $a1 = substr( $a, 1, 3 );  # Field A (number) - record 1
        $c1 = substr( $a, 12 );    # Field C          - record 1
        $a2 = substr( $b, 1, 3 );  # Field A (number) - record 2
        $c2 = substr( $b, 12 );    # Field C          - record 2
        if ( $a1 == $a2 ) {        # Are A Fields numerically equal?
          $c1 <=> $c2;             #   yes: order by Field C
        } else {                   #
          $a1 <=> $a2;             #   no: order by Field A
        }                          #
      }

      $filename = "data.txt";

      open( DATA, "<$filename" ) or die "Unable to open $filename. $!";
      chomp( @data = <DATA> );
      close( DATA );
      print "----+----1----+----2\n";
      foreach $line ( @data ) {
        $line =~ s/  +/|/g;        # collapse runs of spaces into one '|' (modifies @data in place)
        print "$line\n";
      }
      print "\n\n----+----1----+----2\n";
      @info = sort byField @data;
      foreach $line ( @info ) {
         $line =~ s/\|/  /;        # restore the first separator
         $line =~ s/\|/ /;         # restore the second separator
         print "$line\n";
      }
    LVL 1

    Expert Comment

    Unix has an awesome sort utility!

    The C code below runs the sort program; it is equivalent to running this from the command line:
    sort -k 1.5,1.5d -k 3,3n FILENAME > sorted

    It sorts the file FILENAME by the 5th character of the first field (-k 1.5,1.5d), resolves ties numerically on the third field (-k 3,3n), then redirects the output to the flat text file sorted. This method might be a bit far-fetched for your case, but you can also play around with the arguments, say, if tmp == 0 then change your args to do ... etc. Of course, it's probably a good idea to move the following code out of main and into a function.
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <fcntl.h>

    int main(int argc, char** argv)
    {
        char * args[] = {"sort", "-k", "1.5,1.5d", "-k", "3,3n", "FILENAME", NULL};
        pid_t pid=0;
        int status=0;
        int fd=0;

        fd = open("sorted", O_WRONLY | O_TRUNC | O_CREAT, 0666); /* redirect to here */
        if (fd < 0)
            return 1; /* could not create the output file */
        if (!(pid = fork())) {
            dup2(fd,1); /* point the child's stdout at the file */
    /*    printf("In the child\n"); */
            execv("/bin/sort", args); /* run the sort command with the above args */
            _exit(127); /* only reached if execv failed, e.g. sort is not in /bin */
        }
        close(fd); /* the parent no longer needs the descriptor */

        waitpid(pid, &status, 0); /* wait for the sort to finish */

        /* continue on with your program */

        return 0;
    }
    LVL 1

    Author Comment

    poid99, this is ALMOST what I am looking for.  However, the position of the character is not constant.  As I said in the question, the sort has to be by the LAST character of Field A, NOT by the fifth character.  Basically I have to take the substr, get rid of the zeros, then sort.
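
Since Field A does not have a fixed width, a fixed key position such as 1.5 cannot pick the right character. The C sketch below shows one way to find it without relying on column offsets. It reads "last character" as the last digit inside Field A, which is how the original question phrases it; that reading, and the function name last_digit_of_first_field, are assumptions made for this illustration. It does not attempt the "get rid of the zeros" step.

```c
#include <ctype.h>

/* Hypothetical helper: returns the last digit (0-9) that occurs inside the
 * first whitespace-delimited field of line, or -1 if that field contains
 * no digit at all. The width of the field does not matter. */
static int last_digit_of_first_field(const char *line)
{
    int digit = -1;

    /* skip any leading blanks before the first field */
    while (*line != '\0' && isspace((unsigned char)*line)) {
        line++;
    }
    /* walk the first field only, remembering the most recent digit seen */
    for (; *line != '\0' && !isspace((unsigned char)*line); line++) {
        if (isdigit((unsigned char)*line)) {
            digit = *line - '0';
        }
    }
    return digit;
}
```

The returned value can be used as the primary sort key in a qsort() comparison function, with Field C as the tie-breaker.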
    LVL 1

    Expert Comment

    Say the data file is:
    A123B       RANDO     123
    A13B         DOMRA     999
    B120T       MRANA     67
    B120T       MRAND     223
    A13B         MRA           96
    M23M4B  OMRAN     78

    What should the output be?
    LVL 1

    Author Comment

    B120T       MRANA     67
    B120T       MRAND     223

    A13B    MRA         96
    A13B    DOMRA     999

    A123B  RANDO      123

    M23M4B  OMRAN     78

    That is what the output should look like.
    LVL 1

    Accepted Solution

    Is this also acceptable output?:
    M23M4B  OMRAN     78
    A13B    MRA         96
    A13B    DOMRA     999
    A123B  RANDO      123

    I think you first need to break it up. First, sort by the last character of the first field (resolving ties on the first field).

    That turns the sample data from my previous post into:
    B120T       MRANA     67
    B120T       MRAND     223
    A123B       RANDO     123
    A13B         DOMRA     999
    A13B         MRA           96
    M23M4B  OMRAN     78

    then sort each sub group:

    {B120T       MRANA     67,
    B120T       MRAND     223}
    {A123B       RANDO     123,
    A13B         DOMRA     999,
    A13B         MRA           96,
    M23M4B  OMRAN     78}

    What if you store the records in some sort of linked structure? The next pointer points to another record that has an identical field 1.

    typedef struct ARecord {
       char* field1;
       char* field2;
       int field3;
       struct ARecord* next;
    } Record;
    Record table[N];

    Before linking:
    table[0] = {A123B       RANDO     123   NULL}
    table[1] = {A13B         DOMRA     999   NULL}
    table[2] = {A13B         MRA           96   NULL}
    table[3] = {M23M4B  OMRAN     78   NULL}
    After linking records that have an identical field 1:
    table[0] = {A123B       RANDO     123   NULL}
    table[1] = {A13B         DOMRA     999   {A13B      MRA     96   NULL}}
    table[2] = {M23M4B  OMRAN     78   NULL}

    Start at the last entry (N - 1) and work backwards:
    for (int i = N - 1; i > 0; i--)
      if (strcmp(table[i].field1, table[i-1].field1) == 0) {
          table[i-1].next = copy_entry(table[i]);  /* chain the duplicate onto its neighbour */
          delete_entry(table, i);
      }

    then to sort each subgroup:
    for (0 to N)
       // sort each list so:
       // table[i].field3 < table[i].next->field3 ...
       sort table[i](from table[i].field3 to table[i].next->field3 ...)

    // table[1] = {A13B         DOMRA     999   {A13B      MRA     96   NULL}}
    // becomes:
    // table[1] = {A13B      MRA     96   {A13B         DOMRA     999   NULL}}

    then sort the table itself by the field3 of each head record:
       sort table[0 ..  N].field3

    table[0] = {M23M4B  OMRAN     78   NULL}
    table[1] = {A13B      MRA     96   {A13B         DOMRA     999   NULL}}
    table[2] = {A123B       RANDO     123   NULL}
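
The grouping idea above can also be sketched in C without a linked list, by giving every record the smallest field 3 of its group and letting qsort() do the rest. This is only a sketch under stated assumptions: the Row type, the group_min_c member and the function names are made up for this example, and the rule "groups that tie on the last digit are ordered by their smallest field 3" is inferred from the output the author posted, not stated anywhere in the thread.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *field_a;
    const char *field_b;
    int field_c;
    int group_min_c;   /* smallest field_c among rows with the same field_a */
} Row;

/* Last digit occurring in s, or -1 if s has no digit. */
static int last_digit(const char *s)
{
    int d = -1;
    for (; *s != '\0'; s++)
        if (isdigit((unsigned char)*s)) d = *s - '0';
    return d;
}

/* Order: last digit of field_a, then the group's smallest field_c,
 * then field_a itself (keeps identical values adjacent), then field_c. */
static int compare_rows(const void *pa, const void *pb)
{
    const Row *x = pa, *y = pb;
    int dx = last_digit(x->field_a), dy = last_digit(y->field_a);
    if (dx != dy) return (dx > dy) - (dx < dy);
    if (x->group_min_c != y->group_min_c)
        return (x->group_min_c > y->group_min_c) - (x->group_min_c < y->group_min_c);
    int by_a = strcmp(x->field_a, y->field_a);
    if (by_a != 0) return by_a;
    return (x->field_c > y->field_c) - (x->field_c < y->field_c);
}

/* Fills in group_min_c for every row, then sorts the array in place. */
static void sort_grouped(Row *rows, size_t n)
{
    /* O(n^2) scan: simple, and adequate for a sketch */
    for (size_t i = 0; i < n; i++) {
        rows[i].group_min_c = rows[i].field_c;
        for (size_t j = 0; j < n; j++)
            if (strcmp(rows[i].field_a, rows[j].field_a) == 0 &&
                rows[j].field_c < rows[i].group_min_c)
                rows[i].group_min_c = rows[j].field_c;
    }
    qsort(rows, n, sizeof rows[0], compare_rows);
}
```

Because all rows with the same field 1 share the same last digit and the same group minimum, they compare equal on the first two keys and end up next to each other, sub-sorted by field 3. For large files the quadratic scan could be replaced by a first sort on field 1 followed by one linear pass.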

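    # a side note: the following sed command extracts the last char of the first field and pastes it onto
    # the end of each line. My first attempt didn't quite work, likely because it allowed only a single
    # space or tab between fields; this version (GNU sed) accepts runs of whitespace:
    sed "s/^\w*\(\w\)[ \t]\+\w\+[ \t]\+\w\+[ \t]*$/\1/" < DATAFILE | paste DATAFILE -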
