superfly18


Parsing Fixed Width Text into Delimited

I am having quite a bit of difficulty loading a fixed width file into a database.   For some reason, the text seems to wrap when importing it using DTS. The file has several thousand rows, and the width of an entire row is 1600 characters.  There are many columns, each defined by a fixed width. Some columns do run into one another.  How can I write a parser for this in C++ to convert my fixed width columns into | delimited, and ensure that each row is only 1600 characters?  The width of each column is:

4
8
1
5
5
12
3
80
3
20
9
30
160
35
12
80
3
20
9
30
160
35
1
1
8
8
1
8
8
15
6
8
200
12
12
20
3
2
9
30
40
120
1
8
8
20
3
20
8
15
2
35
35
2
205


Help!
_corey_

Are you asking how to validate the character length of each text row is 1600 including or not including spaces?

So then you'd want to trim all whitespace into a single | delimiter between text?

corey

ASKER

Here is an example. Imagine the following rows were 1600 characters long and had columns of widths 2, 4, 6, 8, 10, 12:

aabbbbcccccc88888888yyyyyyyyyy111111111111
hhnnnnlllllluuuuuuuuggggggggggmmmmmmmmmmmm
The output I am looking for is:

aa|bbbb|cccccc|88888888|yyyyyyyyyy|111111111111
hh|nnnn|llllll|uuuuuuuu|gggggggggg|mmmmmmmmmmmm

But I think it is important to note that each row can only be a maximum of 1600 characters.

Thanks!!!

Well, there's a couple ways to do it.

I assume you can read the column indices (the offset at which each column ends) into an array of integers.  For the example you had, something like an array with the values:

2, 6, 12, 20, 30, 42 ...

Then, for each row you read into a string perform something like:

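// Stop before the last column: its end offset is the end of the row,
// and inserting there would append a trailing '|'.
for (int i = 0; i < numColumns - 1; ++i)
{
  rowString.insert(columns[i] + i, 1, '|');
}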

Adding i because you're adding extra characters to the string which modifies the column location in the string.

This method could cause a lot of string copying, but if it's a single pass then it's no big deal.

Get it?  You're calculating where the delimiter should be from the column widths and then looping through and inserting one there.

corey
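
For what it's worth, the same idea can be done in plain C without repeated inserts, by copying each column into an output buffer and writing the delimiter between columns. This is only a minimal sketch, assuming the column widths (not the cumulative offsets) are held in an int array. The name delimit_row is made up for illustration and does not come from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `row` into `out`, placing `delim` between
 * fixed-width columns. `widths` holds the width of each column.
 * Returns 0 on success, or -1 if the row is shorter than the widths
 * require or if `out` is too small to hold the result. */
int delimit_row(const char *row, const int *widths, size_t ncols,
                char delim, char *out, size_t outsize)
{
    size_t rowlen = strlen(row);
    size_t pos = 0;   /* read position in row  */
    size_t o = 0;     /* write position in out */

    if (outsize == 0)
        return -1;

    for (size_t i = 0; i < ncols; i++) {
        size_t w = (size_t)widths[i];
        if (pos + w > rowlen)
            return -1;                       /* row too short */
        /* room for the column, a delimiter (if not first), and the NUL */
        if (o + w + (i ? 1 : 0) + 1 > outsize)
            return -1;
        if (i)
            out[o++] = delim;
        memcpy(out + o, row + pos, w);
        o += w;
        pos += w;
    }
    out[o] = '\0';
    return 0;
}
```

Called with the widths 2, 4, 6, 8, 10, 12 and the first example row, this should produce the first line of the output the asker listed. Each character is copied once, so the string-copying concern mentioned above does not arise.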
ASKER CERTIFIED SOLUTION
brettmjohnson
I wanted to change things so that I took input from a file and wrote back to a new file.  For some reason I screwed something up, and I don't compile.  Any thoughts?

Thanks!


/* This filter reads fixed-width 1600 byte records from stdin,
 * extracts the individual fixed-width columns, strips them of
 * leading and trailing spaces, and writes them to stdout as
 * '|' delimited records.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>


static char * trimWhite(char *input);

#define ArrayCount(x) (sizeof(x)/sizeof(x[0]))
#define LINEBUF 16384
#define RECORDLEN 1600
#define DELIM "|"

int colWidths[] = {
  4, 8, 1, 5, 5, 12, 3, 80, 3, 20, 9, 30, 160, 35, 12, 80, 3, 20, 9, 30,
  160, 35, 1, 1, 8, 8, 1, 8, 8, 15, 6, 8, 200, 12, 12, 20, 3, 2, 9, 30,
  40, 120, 1, 8, 8, 20, 3, 20, 8, 15, 2, 35, 35, 2, 205
};

int main (int arc, char ** argv)
{
  int i, pos;
  char input[LINEBUF];
  char value[LINEBUF];
  FILE * inputFile;
  FILE *  outputFile;



   inputFile = fopen ("inputfile.txt" , "r");
   outputFile = fopen("outputFile", "w");

   if (inputFile == NULL) perror ("Error opening file");

   while (fgets(input, sizeof(input), inputFile)) {
    if (strlen(input) != RECORDLEN)
      continue; // Not a valid fixed width record

    /* Pull the individual fields out of the fixed length record.
     * Trim leading and trailing whitespace. Make sure the record
     * doesn't contain any embedded '|' delimiter characters.
     * Write the '|'-delimitted line to the output
     */
    for (i = pos = 0; i < ArrayCount(colWidths); i++) {
      int width = colWidths[i];
      assert(width < (sizeof(value)+1));        // don't overrun value buffer
      assert(strstr(input, DELIM));                     // don't have embedded '|' chars
      strncpy(value, input+pos, width);         // copy the fixed-width substring
      value[width] = '\0';
      pos += width;
      printf(outputFile,(i) ? DELIM : "", trimWhite(value));
    }
    puts("");   // add newline to the end of each record
  }

  return 0;
}


/* Trim leading and trailing whitespace from the input string.
 * This routine trims in-place, modifying the input string.
 * The address of the first non-white character is returned.
 * If the returned pointer points to a NUL byte, the whole string
 * was white.
 */
static char * trimWhite(char *input)
{
  char *start, *end;
  if ((start = input)) {
    while (*start && (*start <= ' ')) start++;
    if (*start) {
      for (end = start; *end; end++);
      while ((end > start) && (*end <= ' ')) end--;
      *(end+1) = '\0';
    }
  }
  return start;
}
> I wanted to change things so that I took input from a file and wrote back to a new file.  
> For some reason I screwed something up, and I don't compile.  Any thoughts?

The easy way would have been to use I/O redirection.  That is why I coded it as a filter:
myprog <inputFile.txt >outputFile.txt

However, to modify the program to write to other than stdout, replace the calls to
printf() and puts() with calls to fprintf() and fputs():

...
      fprintf(outputFile, "%s%s",(i) ? DELIM : "", trimWhite(value));
    }
    fputs("\n", outputFile);   // add newline to the end of each record (fputs(), unlike puts(), does not append one)
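
The file-output step could also be wrapped in a small function so the write errors get checked. This is just a sketch under assumptions: write_record is a hypothetical name, and it takes the already-trimmed field strings as an array, which is not how the program above is organized.

```c
#include <stdio.h>

/* Hypothetical helper: write `nfields` strings to `out`, separated by
 * `delim` and followed by a newline.
 * Returns 0 on success, -1 if any write fails. */
int write_record(FILE *out, const char *const *fields, size_t nfields,
                 const char *delim)
{
    for (size_t i = 0; i < nfields; i++) {
        /* no delimiter before the first field */
        if (fprintf(out, "%s%s", i ? delim : "", fields[i]) < 0)
            return -1;
    }
    return (fputc('\n', out) == EOF) ? -1 : 0;
}
```

The caller would still need to check that fopen() succeeded for both files before using them, and fclose() the output file when done so buffered data is flushed.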


Ok, I tried it both ways as a command line filter and otherwise.  For some reason the output file is blank...?
The column widths you supplied add up to 1598, not 1600 as you previously stated,
so the line length check is failing.  It would have failed anyway, since fgets() leaves the
line terminator (NL or CRLF) in place in the input buffer.
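
One way to catch this kind of mismatch early is to add up the widths in code and compare the total against the expected record length. A small sketch, where total_width is a name invented here for illustration:

```c
#include <stddef.h>

/* Hypothetical helper: return the sum of `ncols` column widths. */
size_t total_width(const int *widths, size_t ncols)
{
    size_t total = 0;
    for (size_t i = 0; i < ncols; i++)
        total += (size_t)widths[i];
    return total;
}
```

Calling total_width(colWidths, ArrayCount(colWidths)) on the array from the program gives the 1598 mentioned above.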

Since the extraction loop using strncpy() will extract no more than 1598 characters
from the input line, you can change the line length test to only verify that the
input line has at least that number of characters, rather than exactly that number
of characters.  Any further characters in the input line (including the line terminators)
will be ignored.

Change
  #define RECORDLEN 1600
to
  #define RECORDLEN 1598

and change
   if (strlen(input) != RECORDLEN)
to
   if (strlen(input) < RECORDLEN)
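
An alternative to the relaxed test is to strip the line terminator first, so the length being checked covers only the record data. A minimal sketch, assuming lines end in NL or CRLF; chomp and is_valid_record are hypothetical names, not part of the program above:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: cut the line at the first CR or LF, in place.
 * Returns the length of what remains. */
size_t chomp(char *line)
{
    size_t len = strcspn(line, "\r\n");
    line[len] = '\0';
    return len;
}

/* Nonzero if the line holds at least `minlen` characters before its
 * terminator. Note that this strips the terminator as a side effect. */
int is_valid_record(char *line, size_t minlen)
{
    return chomp(line) >= minlen;
}
```

With this, the test inside the read loop could be written as if (!is_valid_record(input, RECORDLEN)) continue; and an exact-length comparison on the return value of chomp() would also be possible.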



And one last bug fix in the code that verifies that input doesn't have an embedded '|' delimiter:

Change
      assert(strstr(input, DELIM));       // don't have embedded '|' chars
to
      assert(strstr(input, DELIM) == NULL); // don't have embedded '|' chars
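
Keep in mind that assert() is compiled out when NDEBUG is defined, and when it does fire it aborts the whole run. If bad records should be skipped or reported instead, an ordinary runtime check may suit better. A sketch, with has_embedded_delim being a made-up name for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if the first `len` bytes of `record`
 * contain the delimiter character `delim`. */
int has_embedded_delim(const char *record, size_t len, char delim)
{
    return memchr(record, delim, len) != NULL;
}
```

It could be called once per line, before the column loop, rather than once per column.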
Brett:
for some reason, it is only taking the first 5 lines of some files....why would this be?

Please let me know!
I suspect your files have line length issues.  The only reason the code would skip a line
is if the line length doesn't satisfy the test:
   if (strlen(input) < RECORDLEN)
      continue; // Not a valid fixed width record

Have you spent any time reading the code to understand what it does and how it does it?

Time to do some debugging.  After reading each line, print its length to stderr (to distinguish
the debugging output from normal delimited line output):
   int linenum = 0;
   while (fgets(input, sizeof(input), inputFile)) {
      fprintf(stderr, "Line %d length: %lu\n", ++linenum, (unsigned long)strlen(input));