
lex and csv file parsing

I'm trying to parse a comma-separated value (CSV) file with lex and ran into a difficulty.

If I have 4 fields per line (field1, field2, field3, field4), this pattern only grabs up to field3:
[a-zA-Z0-9]+/[,]

That means that when I want to print the entire line (after making sure the line contains a valid value in, say, field3), I can't get access to field4. The regular expression above seems to act like a tokenizer: it only recognizes fields that come before a comma. field4 has no comma after it, so it never falls into the match range.

Any suggestions on how I could also grab field4 and print the entire line?

Thanks!
Asked by jade03

avizit commented:
Just offhand (I haven't tried it yet), can you make two types of tokens?

T1        [a-zA-Z0-9]+
T2        ,

and then when you are running it, just discard the T2

jade03 (Author) commented:
hmm...could you elaborate a bit on what you mean by "discard the T2"?

I'm new to lex...normally, when you have a bunch of patterns listed sequentially, will it run one after the other and apply as many as possible to the same input line?

jhshukla commented:
[a-zA-Z0-9]+,[a-zA-Z0-9]+,[a-zA-Z0-9]+,[a-zA-Z0-9]+

avizit commented:
Okay, I created a very crude lex file to do what you want:

+++++++++++++++++
D       [0-9]
L       [a-zA-Z_]
%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#define FIELD 256
#define SEP 257
#define NL 258

void write_tok(int TOK);
%}

%%
{L}({L}|{D})*  { write_tok(FIELD); }
","            { write_tok(SEP); }
"\n"           { write_tok(NL); }
" " { write_tok(SEP);}
.              { /* ignore any other character */ }
%%

void write_tok(int TOK){
  switch(TOK){
  case FIELD:
    printf("%s", yytext);
    break;
  case SEP:
    printf("\t");
    break;
  case NL:
    printf("\n");
    break;
  default:
    printf("Error");
    exit(1);
  }
}
int main(void) {
  yylex();
  return(0);
}
+++++++++++++++++++++++++++++++

Try it and see if the above works for you. Then you can modify it to suit your requirements.

abhijit

jade03 (Author) commented:
avizit,

Your solution looks good, but when a field consists only of digits, it doesn't get printed.

For example, the input

hello, how are, you-doing, 3000, today

prints

hello how are youdoing today

and the field with a hyphen has lost its hyphen.

I tried adding an extra {D} after the first {L} in your pattern, but then nothing gets printed.

Like I said, I'm a bit new to lex, so I may be slow in catching on to the syntax.

Another question. Originally I did something like this: I malloc an array and store each token into it as I get it, i.e.:

array[count-1] = yytext

When I print yytext with printf("%s\n", yytext), it shows just the token without the "," after it. But when I print the stored value with printf("%s\n", array[0]) (or array[1], etc.), I get the token followed by a comma. Why does the comma show up once yytext has been stored in the array?

jade03 (Author) commented:
also, what does the "_" after the Z mean?

L       [a-zA-Z_]

avizit commented:
That is the underscore. I added it in case your field contains the '_' (underscore) character. I also changed the field regex so that a field must start with a letter, i.e. {L}({L}|{D})*

You should be able to modify it to your requirements.

jade03 (Author) commented:
Great! I get it now...oh, could you explain the case of the missing "hyphen" in the final printout when there is a hyphen in the input? I don't see where it's being chopped off...

jade03 (Author) commented:
avizit, I got it! I figured out the case of the missing hyphen! :)

Thank you sooo much for all your help!

jade03 (Author) commented:
oh, could you please tell me what 256, 257, 258 means?

avizit commented:
Oh, those are just #defines for FIELD, SEP, and NL.
They aren't strictly required. I just used them to let write_tok() know what to do,
e.g. if it's SEP, just print a '\t'.

jade03 (Author) commented:
right...I guess I meant to say if there's any significance to those particular numbers..could you have used different numbers?

avizit commented:
Umm, generally try to use numbers beyond the range of normal character codes (256 and up here), so a token number can't be confused with a single character's value.

jade03 (Author) commented:
oh ok...thanx! :)

jade03 (Author) commented:
avizit,

Quick question for you: how do global variables work in lex? I declared a global variable inside %{ %} and modified get_token to take an extra parameter, passed by reference. Each call to get_token passes this global variable so that its value is retained throughout, because I want to keep a count to find the 3rd field in each line, check whether it equals a specific word, and then print that line. But it seems like my global variable gets reset each time get_token is called.

for example, I have:

({alpha}|{digit})*      { get_token(FIELD, &count); }
({blank}*{comma}+{blank}*)*      { get_token(COMMA, &count);}

Then inside the switch cases I increment count accordingly, but after it's incremented, when get_token is called from the second rule above, the value of count seems to go back to 0.

So I'm curious how to keep a global variable around without losing its value.

I can add more points or make this into a new question for you if you want.

Thanks!

avizit commented:
I don't remember offhand how global variables behave in lex, but the following is an example of how to do what you want:
D       [0-9]
L       [a-zA-Z_]
%{
#include <stdio.h>
#define FIELD 256
#define SEP 257
#define NL 258

void write_tok(int TOK);
void count(void);

%}

%%
{L}({L}|{D})*  { count(); write_tok(FIELD); }
","            { write_tok(SEP); }
"\n"           { write_tok(NL); }
" " { write_tok(SEP);}
.              { /* ignore any other character */ }
%%

void write_tok(int TOK){
return;
}

void count(void){
        static int cnt = 0;   /* static: keeps its value between calls */
        cnt++;
        printf("count: %d\n", cnt);
}

int main(void) {
  yylex();
  return(0);
}


+++
This is similar to the earlier example; I just changed write_tok to do nothing and added a count function.

jade03 (Author) commented:
thanx, avizit! You're very helpful! :)

avizit commented:
you are welcome :)

jade03 (Author) commented:
avizit,

sorry to bother you, one more quick question:

how do I set up an expression to try to find all words that contain the following subsequence of letters: "c" "a" "t", ie: scratch, create etc?

I tried the following combos, but they didn't seem to work:

{alpha}?c{alpha}?a{alpha}?t

({alpha}*"c"{alpha}*"a"{alpha}*"t" )

what am I doing wrong?

avizit commented:
This is the third question you have asked in a single question.

Try this:

L       [a-zA-Z_]
C       [cC]
A       [aA]
T       [tT]
%{
#include <stdio.h>

%}

%%

{L}*{C}+{L}*{A}+{L}*{T}{L}*   { printf("cat found\n");}
%%

int main(void) {
  yylex();
  return(0);
}
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.