
Handling variable-length records in C under z/OS

benedictherold
I am trying to read a variable-length record on the mainframe using the C programming language, but I am not able to get the desired output. Generally in COBOL, when we read a VB file, the first 4 bytes constitute the length of the record and the remainder is the data. Similarly, is there any means to read the length of the record and then the data? I have attached the code below.

The issue I am facing is that in C the data is read as a stream, and a last character x'15' is read at the end of every line. I need to skip reading that x'15'. The catch here is that my data also contains x'15', so I can't ignore all of them.

Let me know if any other information is required.
#include <stdio.h>
#include <stdlib.h>
 
int main()
{
    unsigned int str_len = 0;
    unsigned char *str;
    FILE *fp;
    if ( !(fp = fopen("dd:TEST", "rb, recfm=vb, lrecl=133") ) )
    {
         printf("Error in Opening Input File\n");
         return -1;
    }
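    /* Attempt: treat the first 4 bytes of each record as its length. */
    while (fread(&str_len, 4, 1, fp) == 1)   /* stop when no more data, instead of testing feof() first */
    {
         str = calloc(str_len + 1, sizeof(unsigned char));   /* +1 keeps the buffer NUL-terminated for %s */
         if (str == NULL || fread(str, str_len, 1, fp) != 1)
         {
              free(str);
              break;
         }
         printf("Data in the line %s\n", str);
         free(str);
    }
    fclose(fp);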
    return 0;
}

Top Expert 2009

Commented:
If you could post an example of such a file, that would give us a better idea of what you're dealing with.

>> and a last character x'15' is read at the end of every line.

I'm not sure what character you mean ... I'm guessing it's the octal value 15, which is the CR character (\r) in the ASCII set, and could indeed be used at the end of a line (usually together with the LF character (\n) with octal value 12).

If so, then more information on the format of the file would be nice, since it seems like it uses both the length and a line terminator to determine boundaries between records, which is a bit weird.

Author

Commented:
Hi Infinity08,
The file is simple with the below parameters.
LRECL = 133
BLKSIZE = 1330
RECFM = VB
The data in it is also simple. If I FTP it and attach it here, most of the parameters will be lost.

Value 15 we get is at the end of every line. It is a EBCDIC data hex value. I have attached a test code which illustrates this.

The input data in the dataset is
ABCDZX
EFGHIJYU
KLMNOPQRSWT

The output of the program is
 Data c1 represents A
 Data c2 represents B
 Data c3 represents C
 Data c4 represents D
 Data e9 represents Z
 Data e7 represents X
 Data 15 represents
0Data c5 represents E
 Data c6 represents F
 Data c7 represents G
 Data c8 represents H
 Data c9 represents I
 Data d1 represents J
 Data e8 represents Y
 Data e4 represents U
 Data 15 represents
0Data d2 represents K
 Data d3 represents L
 Data d4 represents M
 Data d5 represents N
 Data d6 represents O
 Data d7 represents P
 Data d8 represents Q
 Data d9 represents R
 Data e2 represents S
 Data e6 represents W
 Data e3 represents T
 Data 15 represents

Please let me know if you require more inputs

#include <stdio.h>
#include <stdlib.h>
 
int main()
{
    unsigned int str_len = 0;
    int dat;
    FILE *fp;
    if ( !(fp = fopen("dd:TEST", "r, recfm=vb, lrecl=133") ) )
    {
         printf("Error in Opening Input File\n");
         return -1;
    }
    while((dat = fgetc(fp)) != EOF)
    {
         printf("Data %x represents %c\n", dat, dat);
    }
    return 0;
}


Top Expert 2009

Commented:
>> Value 15 we get is at the end of every line. It is a EBCDIC data hex value.

In EBCDIC, the character with value 0x15 is indeed the newline character.


>> I have attached a test code which illustrates this.

I thought you said that every record started with the size of the record ? I can't see that anywhere in your output. All I see are strings, separated by newline characters.
Top Expert 2009

Commented:
>> All I see are strings, separated by newline characters.

And if that's how the file format is, then why not just use fgets to read one line of input ?

        http://cplusplus.com/reference/clibrary/cstdio/fgets/

Author

Commented:
Hi Infinity08,
         Using fgets would work great if the file had only viewable characters. I face problem in the file since the actual data also has the new line value. In that case it fails to pick that up as a single line.

The data and its hex representation is below.
,--------------------------------------------------------------------
ABCDZX
CCCCEE
123497
,--------------------------------------------------------------------
EFGHIJYU
CCCCCDEE
56789184
,--------------------------------------------------------------------
KLMNOPQRSWT
DDDDDDDDEEE
23456789263
,--------------------------------------------------------------------
EFGH.JYU
CCCC1DEE
56785184
,--------------------------------------------------------------------

The last line of data (EFGH.JYU) is treated as 2 lines.

Please let me know if you require any inputs.
Top Expert 2009

Commented:
>> I face problem in the file since the actual data also has the new line value.

fgets reads data until it reaches the newline value. So, the newline is important for the correct working of fgets.


>> I face problem in the file since the actual data also has the new line value.

Well, then how do you distinguish between a newline character being part of the data or a delimiter between the data ?

And how about the length that was put in the file before the data, as you mentioned earlier ?

I must say, I'm getting more and more confused, since the file format seems to change with every post you make ... Can't you just post an example file ? (you can attach it to your post)


>> The data and its hex representation is below.

I'm not sure I understand your notation ... What are those 3 lines supposed to mean ? I assume the first line is the string data, but what are the second and third lines ? And how are they related to the first line ?

Author

Commented:
>>how do you distinguish between a newline character being part of the data or a delimiter between the data ?
The actual data will be visible. But the delimiter will not be visible in the editor.


>>I must say, I'm getting more and more confused, since the file format seems to change with every post you make ... Can't you just post an example file ? (you can attach it to your post)
 I have attached the file. It is an EBCDIC file, so I believe you will be more confused this time. Please open it in an EBCDIC editor, or if you need to see the hex values you can use a hex editor.

>>What are those 3 lines supposed to mean ?I assume the first line is the string data, but what are the second and third lines ? And how are they related to the first line ?
1st line is the data
2nd first half of the hex value
3rd second half of the hex value.

Example
A - First line
C - Second line
1 - Third line

This means 'A' is represented as 'C1' in hex (EBCDIC)
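
For readers unfamiliar with this vertical hex display, here is a minimal sketch in C of how the second and third rows combine into byte values. The function names `hex_digit_value` and `decode_vertical_hex` are made up for illustration and are not part of any editor or library.

```c
#include <stddef.h>

/* Convert one hex digit character to its value, or -1 if it is not a hex digit. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Combine the high-nibble row (second display line) and the low-nibble row
 * (third display line) into bytes.
 * Returns the number of bytes written, or -1 if a character is not a hex
 * digit, the rows differ in length, or the output buffer is too small.
 */
long decode_vertical_hex(const char *high_row, const char *low_row,
                         unsigned char *out, size_t out_size)
{
    size_t i = 0;
    while (high_row[i] != '\0' && low_row[i] != '\0') {
        int hi = hex_digit_value(high_row[i]);
        int lo = hex_digit_value(low_row[i]);
        if (hi < 0 || lo < 0 || i >= out_size) return -1;
        out[i] = (unsigned char)((hi << 4) | lo);
        i++;
    }
    if (high_row[i] != '\0' || low_row[i] != '\0') return -1;  /* rows differ in length */
    return (long)i;
}
```

Applied to the rows "CCCC1DEE" and "56785184" from the earlier post, the fifth byte comes out as 0x15, which is the value the editor shows as "." in EFGH.JYU.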

TEST
Top Expert 2009

Commented:
>> >>how do you distinguish between a newline character being part of the data or a delimiter between the data ?
>> The actual data will be visible. But the delimiter will not be visible in the editor.

I meant : how do you know which newline characters are part of the data, and which ones are used as delimiters ? You need some more information to be able to make that distinction. If the file simply contains text data with newlines in it, you have not enough information to make that distinction.

That is where something like a length field comes in handy ...


>> 2nd first half of the hex value
>> 3rd second half of the hex value.

Wow, I would not have guessed that. Usually, when representing hex values, you represent them as separate values, without splitting them up. ie. :

        A  B  C  D  Z  X
        C1 C2 C3 C4 E9 E7

It's easier to read, and a lot clearer ;)


>> I have attached the file.

That's exactly what I needed ;) Now, that file contains just the characters :

        ABCDZXEFGHIJYUKLMNOPQRSWTEFGH\nJYU

(where \n represents a newline character). Is this a representative example of the kind of input file you'll deal with ?

You still haven't clarified why you mentioned that the length was encoded in the file ... Could you elaborate a bit on that ?

In any case, if the file you posted is representative, and data fields might contain newline characters, even though the newline character is used as the delimiter between data fields, then there's no way to know which newlines are part of the data, and which are used as delimiters ... Not without extra information. The length field you mentioned would give the kind of extra information you need to decide what kind of newline it is.
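
To make the ambiguity concrete, here is a minimal sketch in C that splits a byte stream on every 0x15, the way a newline-delimited reader would. The names `EBCDIC_NL`, `sample_text_stream` and `count_delimited_records` are illustrative only. The sample bytes are an assumption: they are the four records from the hex listing above, each followed by the x'15' that the text-mode dump showed at the end of every line.

```c
#include <stddef.h>

#define EBCDIC_NL 0x15  /* EBCDIC newline, as seen in the byte dump */

/* The four sample records as a text-mode stream would deliver them
   (an assumed reconstruction; the fourth record has 0x15 inside its data). */
const unsigned char sample_text_stream[] = {
    0xC1, 0xC2, 0xC3, 0xC4, 0xE9, 0xE7, 0x15,                          /* ABCDZX      */
    0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xD1, 0xE8, 0xE4, 0x15,              /* EFGHIJYU    */
    0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xE2, 0xE6, 0xE3,
    0x15,                                                              /* KLMNOPQRSWT */
    0xC5, 0xC6, 0xC7, 0xC8, 0x15, 0xD1, 0xE8, 0xE4, 0x15               /* EFGH.JYU    */
};

/* Count the pieces obtained when every 0x15 byte is treated as a terminator. */
size_t count_delimited_records(const unsigned char *data, size_t len)
{
    size_t count = 0;
    size_t start = 0;  /* offset just past the most recent terminator */
    for (size_t i = 0; i < len; i++) {
        if (data[i] == EBCDIC_NL) {
            count++;
            start = i + 1;
        }
    }
    if (start < len) count++;  /* trailing piece with no terminator */
    return count;
}
```

For the sample stream this counts five pieces although only four records were written, because nothing in the bytes themselves marks the 0x15 inside the fourth record as data.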

Author

Commented:
Hi Infinity08,

>>I meant : how do you know which newline characters are part of the data, and which ones are used as delimiters ? You need some more information to be able to make that distinction. If the file simply contains text data with newlines in it, you have not enough information to make that distinction.
>>That is where something like a length field comes in handy ...

On mainframes, a file is always delimited by its record size. With fixed-length records the length stays constant, so it is easy to read the data by supplying that fixed length. In case of variable length records, IBM says that the first 4 bytes has the length of record and then comes the data to that length. I am able to read that in COBOL. But I am not able to do it in C, hence the problem. As per post ID 24894281, I read the file byte by byte and was not able to find the length; instead I saw a delimiter x'15' at the end of every line. Any suggestions on this?


>>Wow, I would not have guessed that. Usually, when representing hex values, you represent them as separate values, without splitting them up. ie. :

You are right. I am sorry if I confused you. I copied it straight from the editor and pasted it here.


>>Is this a representative example of the kind of input file you'll deal with ?
Yes this is the test file I created with some of the characters in the actual file. The actual file is a IBM AFP (similar to PDF from Adobe) with the size of 200GB

>>You still haven't clarified why you mentioned that the length was encoded in the file ... Could you elaborate a bit on that ?
In variable length records, IBM says that the first 4 bytes has the length of record and then comes the data to that length. I am able to read that in COBOL. But I am not able to do it in C.

>>In any case, if the file you posted is representative, and data fields might contain newline characters, even though the newline character is used as the delimiter between data fields, then there's no way to know which newlines are part of the data, and which are used as delimiters ... Not without extra information. The length field you mentioned would give the kind of extra information you need to decide what kind of newline it is.
On mainframes there are no such newlines, I believe. All the data is treated as records, and every data set defined on the mainframe is delimited by the record length. If you define a data set with a fixed record size of 80 characters and try to insert 90 characters of data, the excess automatically shifts to the next line/record. So there can be newline characters (x'15') as part of the data which are not actually treated as newline characters.

I hope this clarifies your question.
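
For reference, the 4-byte prefix described above is the record descriptor word (RDW): a 2-byte big-endian length that counts the RDW itself, followed by two bytes that are zero for unspanned records. The following is a minimal sketch in C, under the assumption that you hold a buffer in which the RDWs are actually present; the byte dumps in this thread show that the plain C stream did not deliver them. `record_span` and `split_rdw_records` are illustrative names.

```c
#include <stddef.h>

/* Location of one record's data inside the buffer. */
typedef struct {
    size_t offset;  /* offset of the data, just past the 4-byte RDW */
    size_t length;  /* data length, excluding the RDW */
} record_span;

/*
 * Walk a buffer of RDW-prefixed records and fill in spans[].
 * Returns the number of records found, or -1 if an RDW is truncated,
 * its length is impossible, or spans[] is too small.
 */
long split_rdw_records(const unsigned char *buf, size_t len,
                       record_span *spans, size_t max_spans)
{
    size_t pos = 0;
    size_t n = 0;
    while (pos < len) {
        if (len - pos < 4) return -1;                    /* truncated RDW */
        size_t rec_len = ((size_t)buf[pos] << 8) | buf[pos + 1];
        if (rec_len < 4 || rec_len > len - pos) return -1;
        if (n >= max_spans) return -1;
        spans[n].offset = pos + 4;
        spans[n].length = rec_len - 4;
        n++;
        pos += rec_len;                                  /* jump to the next RDW */
    }
    return (long)n;
}
```

Because the walk is driven by the length field alone, a 0x15 inside a record's data is never inspected and cannot be mistaken for a boundary.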
Top Expert 2009

Commented:
>> IBM says that the first 4 bytes has the length of record and then comes the data to that length. I am able to read that in COBOL.

From your explanation, it seems that the file used with the COBOL code is not the same file as the one you used with your C code. That would explain the loss of information between the two files.


>> The actual file is a IBM AFP

That's a format used primarily for printing, sometimes also for storing large amounts of data.

It's not likely that the sample file you posted is representative of such a file. There is likely to be at least some header data that adds meta information.


It all comes down to this, as I said earlier. If the file you are dealing with is similar in format to the one you posted, then you don't have enough information to determine which newline characters are part of the data, and which are delimiters.

So, either the file you posted is not representative of the actual file you are dealing with, or what you're trying to do is impossible without more information.


>> with the size of 200GB

Can't you generate a small sample file (in the same way as the 200GB file was generated) that you can then upload here ?

Author

Commented:
>>From your explanation, it seems that the file used with the COBOL code is not the same file as the one you used with your C code. That would explain the loss of information between the two files.

For testing I had used same file. Where COBOL was able to pick the 4 bytes length and C was not.

>>It's not likely that the sample file you posted is representative of such a file. There is likely to be at least some header data that adds meta information.

Yes, you are right. I created the test file with similar properties to see how it handles that file. I am aware that AFP data has tags based on which the AFP reader reads the file. I have attached a sample AFP file for your reference.

afptech2.zip
Top Expert 2009
Commented:
>> For testing I had used same file. Where COBOL was able to pick the 4 bytes length and C was not.

That doesn't make sense ... You mean that for COBOL, that data magically appears, while for C it's not there ? Are you sure you are using the same files ? Are you sure COBOL isn't using a different file ? Or even more than one file ? (like an index file and a data file)


>> I have attached a sample AFP file for your reference.

I looked at a part of the file at random (offset 0x1600 to 0x16FF), and posted below what came up. You'll notice that there's quite a bit of extra information there apart from the strings, and it looks very logical and straightforward ... not like a simple newline delimited file.

Notice also that the value between the <0x90> and <0xdb> bytes right before the string data, contains the size of the string data (+ 2 bytes). The rest of the header data looks to be type information, indexes, and such stuff.
                       <0xdb>1<SEL>L<0x0e><0xf7><SEL>G<HT><0xa0><SEL>G<RNL><0x90><0x19><0xdb>Converting Files to AFP
<SEL>G<0x10><0x26><ETX><0xdb>1<SEL>L<0x0f><0xe6><SEL>G<HT><0xa0><SEL>G<RNL><0x90><0x2f><0xdb>Indexing AFP Files for Enhanced Viewer Access
<SEL>G<0x18><0xc0><ETX><0xdb>2<SEL>L<0x10><0xd6><SEL>G<HT><0xa0><SEL>G<RNL><0x90><0x30><0xdb>Transferring AFP Data Files to the Workstation
<SEL>G<0x18><0x1e><ETX><0xdb>4<SEL>L<0x11><0xc5><SEL>G<HT><0xa0><LF>                   <0xdb>Using AFP Resources with the Viewer
<SEL>G<0x14><0x34><ETX><0xdb>4<SEL>L<0x12><0xb4><SEL>G<HT><0xa0><SEL>G<RNL><0x90><0x18><0xdb>Using Form ...


Author

Commented:
>>That doesn't make sense ... You mean that for COBOL, that data magically appears, while for C it's not there ? Are you sure you are using the same files ? Are you sure COBOL isn't using a different file ? Or even more than one file ? (like an index file and a data file)

That means in COBOL I know how to handle the first 4 bytes, and in C I am not aware how to handle them (I am new to C programming on mainframes). I am sure that I am using only the data file in both C and COBOL (I want to assure you that I am not doing any AFP processing).

>>Notice also that the value between the <0x90> and <0xdb> bytes right before the string data, contains the size of the string data (+ 2 bytes). The rest of the header data look to be type information, indexes, and such stuff.
I am not interested in the AFP processing. I am familiar with AFP processing and I am aware of what the data above represents. I am not concerned about AFP processing here. My concern is to read variable-length records, since some of the files we handle are not AFP.


I got the desired output by using the open statement below in the code from post 24894281.

fp = fopen("dd:TEST", "rb, recfm=vb, lrecl=133")

The end of line character x'15' was not read.

The output was

 Data c1 represents A
 Data c2 represents B
 Data c3 represents C
 Data c4 represents D
 Data e9 represents Z
 Data e7 represents X
 Data c5 represents E
 Data c6 represents F
 Data c7 represents G
 Data c8 represents H
 Data c9 represents I
 Data d1 represents J
 Data e8 represents Y
 Data e4 represents U
 Data d2 represents K
 Data d3 represents L
 Data d4 represents M
 Data d5 represents N
 Data d6 represents O
 Data d7 represents P
 Data d8 represents Q
 Data d9 represents R
 Data e2 represents S
 Data e6 represents W
 Data e3 represents T
 Data c5 represents E
 Data c6 represents F
 Data c7 represents G
 Data c8 represents H
 Data 15 represents  
0Data d1 represents J
 Data e8 represents Y
 Data e4 represents U


It would be great if some one can help me to read the variable record data set in C even though i got the desired output.
Top Expert 2009
Commented:
>> I am not concerned about AFP processing here.

Well, you posted the AFP file as a representative sample of the input file you're dealing with.
And now you say that you're not dealing with an AFP file.


>> It would be great if some one can help me to read the variable record data set in C even though i got the desired output.

We're going around in circles. Allow me to summarize it once more :

(a) if the file you have only contains text and newline characters, then there is NO way you can know which of these newline characters are used as separators, and which ones are part of the text.

(b) if your COBOL code is able to make that distinction, then it means it has access to more information than what is in that file, or that it is using a different file.

There's nothing else to say ... Without the extra information, you cannot know which newlines are separators and which ones aren't.
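
One possible source of that extra information, offered as an assumption about the z/OS XL C/C++ runtime and not something established in this thread: opening the data set for record I/O by adding `type=record` to the fopen mode string. In that mode each `fread` call is documented to return one record, so the boundary comes from the access method and not from any x'15' byte in the data. A minimal sketch in C follows; `read_record` is a hypothetical helper name.

```c
#include <stdio.h>

/*
 * Read the next record from fp into buf, which holds up to max_len bytes.
 * Returns the number of bytes read, or -1 at end of file or on error.
 *
 * Assumption: fp was opened on z/OS for record I/O, so that one fread call
 * with size 1 corresponds to one record. On a platform without record I/O
 * the same call simply returns the next max_len bytes of the stream.
 */
long read_record(FILE *fp, unsigned char *buf, size_t max_len)
{
    size_t got;
    if (max_len == 0) return -1;
    got = fread(buf, 1, max_len, fp);  /* size 1, so the return value is a byte count */
    if (got == 0 && (feof(fp) || ferror(fp))) return -1;
    return (long)got;
}
```

A call might look like `fp = fopen("dd:TEST", "rb, type=record")` followed by repeated `read_record(fp, buf, sizeof buf)` until it returns -1. A 133-byte buffer is enough here, since LRECL=133 includes the 4-byte length prefix and leaves at most 129 bytes of data. Check the behaviour against the IBM runtime documentation before relying on it.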