Solved

lex and csv file parsing

Posted on 2004-11-22
Medium Priority
609 Views
Last Modified: 2012-06-27
I'm trying to parse a comma-separated value file with lex and have run into a difficulty.

If I have 4 fields per line (field1, field2, field3, field4),

this pattern only grabs up to field3:
[a-zA-Z0-9]+/[,]

That means that when I want to print out the entire line (after making sure the line contains a valid value in, say, field3), I can't get access to field4. The regex above acts like a tokenizer that only recognizes fields "before" a comma, so field4, which has no comma after it, never matches.

Any suggestions on how I could also grab field four and print out the entire line?

thanx!
Question by:jade03
20 Comments
 
LVL 11

Expert Comment

by:avizit
ID: 12651101
Just offhand (I haven't tried it yet), but could you make two types of tokens:

T1        [a-zA-Z0-9]+
T2        ,

and then, when you are running it, just discard the T2 tokens?

0
 

Author Comment

by:jade03
ID: 12651290
hmm...could you elaborate a bit on what you mean by "discard the T2"?

I'm new to lex...normally, when you have a bunch of patterns listed sequentially, will it run one after the other and apply as many as possible to the same input line?
0
 
LVL 9

Expert Comment

by:jhshukla
ID: 12651382
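Match all four fields in one pattern (lex allows only one trailing-context `/` per rule, so the commas are matched directly):
[a-zA-Z0-9]+,[a-zA-Z0-9]+,[a-zA-Z0-9]+,[a-zA-Z0-9]+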
0
 
LVL 11

Accepted Solution

by:
avizit earned 400 total points
ID: 12651390
Okay I created a very crude lex file to do what you want

+++++++++++++++++
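D       [0-9]
L       [a-zA-Z_]
%{
#include <stdio.h>
#include <stdlib.h>   /* exit() */

/* Token codes start above 255 so they cannot clash with character codes. */
#define FIELD 256
#define SEP 257
#define NL 258

void write_tok(int tok);
%}

%%
{L}({L}|{D})*  { write_tok(FIELD); }
","            { write_tok(SEP); }
"\n"           { write_tok(NL); }
" "            { write_tok(SEP); }
.              { /* ignore any other character */ }
%%

/* Print each token: fields as-is, separators as tabs, newlines as newlines. */
void write_tok(int tok) {
  switch (tok) {
  case FIELD:
    printf("%s", yytext);
    break;
  case SEP:
    printf("\t");
    break;
  case NL:
    printf("\n");
    break;
  default:
    fprintf(stderr, "Error: unknown token %d\n", tok);
    exit(1);
  }
}

/* Only needed when not linking with -ll or -lfl: no more input after EOF. */
int yywrap(void) { return 1; }

int main(void) {
  yylex();
  return 0;
}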
+++++++++++++++++++++++++++++++

Try it and see if the above works for you; then you can modify it to suit your requirements.

abhijit
0
 

Author Comment

by:jade03
ID: 12651571
avizit,

your soln looks good, but somehow, when I have a field that consists of all numbers, it doesn't get printed out...

ie: hello, how are, you-doing, 3000, today

prints out hello how are youdoing today

and the field with a hyphen is now missing a hyphen...

I tried adding an extra {D} following the first {L} you have there, but then nothing gets printed out...

Like I said, I'm a bit new to lex...so I may be slow in catching on to the syntax...

Another question: originally I did something like this.

I malloc an array and store each token into it each time I get a token, i.e.:

array[count-1] = yytext

When I print yytext with printf("%s\n", yytext), it shows just the token without the "," after it.
But when I print the array element with printf("%s\n", array[0]) (or array[1], etc.),
I get the token followed by a comma...why does the array entry include the comma when yytext does not?
0
 

Author Comment

by:jade03
ID: 12651587
also, what does the "_" after the Z mean?

L       [a-zA-Z_]
0
 
LVL 11

Expert Comment

by:avizit
ID: 12651625
That was the underscore. I added it in case your field contains the '_' (underscore) character. I also changed the field regex so that a field can only start with a letter, i.e., {L}({L}|{D})*

you should be able to modify it to your requirements
0
 

Author Comment

by:jade03
ID: 12651657
Great! I get it now...oh, could you explain the case of the missing "hyphen" in the final printout when there is a hyphen in the input? I don't see where it's being chopped off...
0
 

Author Comment

by:jade03
ID: 12651683
avizit, I got it! I figured out the case of the missing hyphen! :)

Thank you sooo much for all your help!
0
 

Author Comment

by:jade03
ID: 12651714
oh, could you please tell me what 256, 257, 258 means?
0
 
LVL 11

Expert Comment

by:avizit
ID: 12651757
Oh, those are just defines for

FIELD, SEP, NL.
Actually they are not really required; I just used them to let write_tok() know what to do,
e.g. if it's SEP, just print a '\t'.

0
 

Author Comment

by:jade03
ID: 12651892
right...I guess I meant to say if there's any significance to those particular numbers..could you have used different numbers?
0
 
LVL 11

Expert Comment

by:avizit
ID: 12651922
Umm, generally try to use numbers beyond the normal ASCII character codes.
0
 

Author Comment

by:jade03
ID: 12651977
oh ok...thanx! :)
0
 

Author Comment

by:jade03
ID: 12659354
avizit,

quick question for you...how do global variables work in lex? I declared a global variable inside %{ %} and modified get_token to take an extra param, passed by reference. Each call to get_token passes this global variable so that its value is retained throughout, because I want to keep a count to find the 3rd field in each line, check whether it equals a specific word, and then print out that line...but it seems like my global variable gets reset each time get_token is called...

for example, I have:

({alpha}|{digit})*      { get_token(FIELD, &count); }
({blank}*{comma}+{blank}*)*      { get_token(COMMA, &count);}

then inside the switch cases, I increment count accordingly, but it seems that after it's incremented, when get_token is called in the 2nd line above, the value of count goes back to 0...

so I'm curious to know how to keep a global variable around w/o losing its value...


I can add more points or make this into a new question for u if u want...

thanx!



0
 
LVL 11

Expert Comment

by:avizit
ID: 12661407
I don't remember exactly how global variables work, but the following is an example of how to do what you want.
D       [0-9]
L       [a-zA-Z_]
%{
#include <stdio.h>
#define FIELD 256
#define SEP 257
#define NL 258

void write_tok(int tok);
void count(void);

%}

%%
{L}({L}|{D})*  { count(); write_tok(FIELD); }
","            { write_tok(SEP); }
"\n"           { write_tok(NL); }
" "            { write_tok(SEP); }
.              { /* ignore any other character */ }
%%

/* Deliberately does nothing in this example. */
void write_tok(int tok) {
  (void)tok;
}

/* The static local keeps its value between calls. */
void count(void) {
  static int cnt = 0;
  cnt++;
  printf("count: %d\n", cnt);
}

/* Only needed when not linking with -ll or -lfl: no more input after EOF. */
int yywrap(void) { return 1; }

int main(void) {
  yylex();
  return 0;
}


+++
It's similar to the earlier example; I just changed write_tok to do nothing and added a count function.

0
 

Author Comment

by:jade03
ID: 12662054
thanx, avizit! You're very helpful! :)
0
 
LVL 11

Expert Comment

by:avizit
ID: 12662081
you are welcome :)
0
 

Author Comment

by:jade03
ID: 12686711
avizit,

sorry to bother you, one more quick question:

How do I set up an expression to find all words that contain the following subsequence of letters: "c" "a" "t", e.g. scratch, create, etc.?

I tried the following patterns, but they didn't seem to work:

{alpha}?c{alpha}?a{alpha}?t

({alpha}*"c"{alpha}*"a"{alpha}*"t" )

what am I doing wrong?
0
 
LVL 11

Expert Comment

by:avizit
ID: 12692787
This is the third question you have asked in a single question .

try this

L       [a-zA-Z_]
C       [cC]
A       [aA]
T       [tT]
%{
#include <stdio.h>

%}

%%

{L}*{C}+{L}*{A}+{L}*{T}{L}*   { printf("cat found\n");}
%%

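/* Only needed when not linking with -ll or -lfl: no more input after EOF. */
int yywrap(void) { return 1; }

int main(void) {
  yylex();
  return 0;
}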
0
