Solved

Tokenizing a string without separators.  How to parse a string in C?

Posted on 2006-11-20
Last Modified: 2010-04-15
I am writing a parser to handle an ASCII string representing voltages.  The intent is to have this string separated into seven tokens.  The string format is:

#FFFFFF0FF0<cr>

The # tells my program to select a "hex" case.  The <CR> at the end signifies a carriage return, which should be treated as an "end of line" flag.  This last <CR> completes the string.

I want the string tokenized as follows:

# FF FF FF 0 FF 0 <CR>

There are no separators; only the position within the string indicates each token.  Remember that these are being sent as ASCII characters over a serial line and I'll need to convert them from ASCII to integers (subtract 48 from the byte.)

The goal is to be able to perform calculations on the individual tokens.  I am going to use the # as a start flag (not stored in a variable) and the <CR> as an end flag.

I realize this is a pretty basic issue for most programmers, but to me it's a hobby.  I'm trying to build a microcontroller-driven servo and send it commands with my PC.  I'm going to keep plugging away at the solution myself, but hope you can help.  I would like to get an answer as soon as possible.

Thanks,
Bill
Question by:wwward0
 
LVL 12

Accepted Solution

by:
rajeev_devin earned 200 total points
ID: 17985406
Can be extracted this way:

char str[] = "#FFFFFF0FF0\n";
char tokens[7][10];
sscanf(str, "%1s%2s%2s%2s%1s%2s%1s",tokens[0],
                              tokens[1],
                              tokens[2],
                              tokens[3],
                              tokens[4],
                              tokens[5],
                              tokens[6]);
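
The tokens extracted this way are still strings, and subtracting 48 from a byte only gives the right value for the characters '0' to '9', not for 'A' to 'F'. As a minimal sketch, each token from tokens[1] to tokens[6] could be converted with the standard strtol function in base 16. The helper name hex_token_to_int and the -1 error value are assumptions made for this illustration, not part of the original answer.

```c
#include <stdlib.h>

/* Convert one hex token such as "FF" or "0" to its integer value.
   Returns -1 if the token is empty, contains a non-hex character,
   or does not fit in eight bits. (Hypothetical helper.) */
int hex_token_to_int(const char *token)
{
    char *end;
    long value;

    if (token[0] == '\0')
        return -1;

    value = strtol(token, &end, 16);   /* base 16: accepts 0-9, a-f, A-F */

    /* strtol stops at the first character it cannot use; anything
       left over means the token was not pure hex. */
    if (*end != '\0' || value < 0 || value > 0xFF)
        return -1;

    return (int)value;
}
```

Calling hex_token_to_int(tokens[1]) on the example string would then give the first eight-bit value as an int, ready for calculations.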
 
LVL 53

Expert Comment

by:Infinity08
ID: 17985893
>> I want the string tokenized as follows:
>> 
>> # FF FF FF 0 FF 0 <CR>

A few questions :

1) is <CR> 1 character (ie. the carriage return character with ASCII value 13), or are these 4 characters (<, C, R and >) ?

2) will the first 3 values always be 2 characters wide, the fourth 1 character wide, the fifth 2 characters wide, and the last 1 character wide ? If not, what are the possible values, and how do you determine where to split (tokenize) them ?
 
LVL 45

Expert Comment

by:Kent Olsen
ID: 17987086

Hi rajeev,

All kinds of algorithms exist to tokenize that string.  I'll be glad to provide some code for you.

But I believe that the algorithm is in need of a tune up.  Given that the last 4 chars of '0FF0' represent 0, FF, and 0, what string would represent 0F F0?


Kent
 
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 150 total points
ID: 17987212
The following should do. You could easily add some checks in case the string contains invalid characters.

Regards, Alex

#include <stdlib.h>   /* malloc, free */
#include <stdio.h>    /* printf */
#include <string.h>   /* strlen, strcpy, memset */

int tokenize(const char *input, char*** ppszTokens)
{
     int count;
     size_t i;
     char c;
     char* p;
     int toggle;
     size_t len;
     char** pszTokens;


     len = strlen(input);
     if (len == 0 || input[0] != '#')
          return -1;   /* wrong input */

     // allocate space for max possible tokens
     *ppszTokens = (char**)malloc(len*sizeof(char*));
     pszTokens = *ppszTokens;
     toggle = 0;
     // allocate space for max possible output
     p = (char*) malloc(2*len + 5);
     memset(p, 0, 2*len + 5);
     count = 0;
     for (i = 0; i < len; i++)
     {
           c = input[i];
           *p++ = c;
           if (c == '#' || c == '0')
           {
               pszTokens[count++] = p-1;
               *p++ = 0;
           }
           else if (c == '\n')
           {
               pszTokens[count++] = --p;
               strcpy(p, "<CR>");
               p += 5;
               break;
           }
           else if (toggle)
           {
                toggle = 0;    
               pszTokens[count++] = p-2;
               *p++ = 0;
           }
           else
                toggle = 1;
     }
     return count;
}

int main()
{
   char** pszTokens;
   char   input[] = "#FFFFFF0FF0\n";
   int    count;
   int    i;

   count = tokenize(input, &pszTokens);
   for (i = 0; i < count; ++i)
   {
      printf("%s ", pszTokens[i]);
   }
   free(pszTokens[0]);
   free(pszTokens);
   return 0;   /* report success rather than the token count */
}
 
LVL 2

Expert Comment

by:avsrivastava
ID: 17990030
Infinity and Kdo,
I am just guessing, but from the way Bill has tokenised the string, it appears that only two values are allowed: a high (FF) and a low (0). It would not make any sense otherwise.
 
LVL 53

Expert Comment

by:Infinity08
ID: 17990059
avsrivastava,

that's what I thought, but it's best to be sure before offering the best solution.

On the other hand, if it's the case that there are only 2 values (FF and 0), then it's maybe better to use  F and 0 or 1 and 0 ?
 
LVL 53

Expert Comment

by:Infinity08
ID: 17990093
>> then it's maybe better to use  F and 0 or 1 and 0 ?
especially because it seems that it's not always gonna be 6 values, but there might be more or less.
And if each token is only one character, it makes decoding a LOT easier.
 
LVL 45

Expert Comment

by:Kent Olsen
ID: 17990182

Well, the original statement "a parser to handle an ASCII string representing voltages" isn't descriptive enough, and I seriously doubt that we've settled on the string's true representation.

If he's got exactly 6 voltages, there are exactly 10 characters between the '#' and the '<', and positions 4 and 6 are always 0, then a lot of methods work.

But we'll need a lot more answers.

-- Are the voltages for positions 1,2, 3, and 5 0/FF or can any value in the range 0/FF work?
-- Can the voltages for positions 4 and 6 be anything other than 0?
-- Does the poster believe that a single digit of zero represents a zero voltage?  If so, how are voltages of 1,2,3,4,5,6,7,8,9,A,B,C,D,E,F represented?

Perhaps these are masks.  (All bits on, all bits off.)  Then parsing the string is a snap.  Upon encountering the '#' just switch() on 'F', '0', and '<'.


Input.  Need Input.....   :~}
Kent
 

Author Comment

by:wwward0
ID: 17992201
Infinity08:
1.  The "<CR>" carriage return is in fact the ASCII value 13.  I am specifically looking for the string to terminate with a carriage return to signal the end of the string.
2.  The values must always be the same length.  This is intended to take ASCII characters representing hex values.  If the length is incorrect, then it's a bad command and should be ignored.

Kdo:
The string is an example.  The first three pairs of bytes will represent an eight-bit value when converted from hex to integer.  The single-digits are basically intended to signal one of 16 possible modes (0-f).  The string 0FF0 could exist if I selected mode 0, value FF, second mode 0.

avsrivastava:
For double-byte groups, the values can be anything from 00-FF in hex.  It should represent an eight bit integer once converted from hex to int (this is done elsewhere in the program.)

itsmeandnobodyelse:
Thanks, I'm checking this out!
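
The framing rules described in this comment (a command starts at '#', ends at a carriage return, and is ignored if its length is wrong) could be sketched as a small reader that is fed one received byte at a time. This is only an illustration under those stated rules: the names frame_reader, frame_reader_feed and FRAME_LEN are hypothetical, and how bytes actually arrive from the serial port is left out.

```c
#include <stddef.h>

#define FRAME_LEN 11              /* '#' plus 10 hex characters; CR is not stored */

struct frame_reader {
    char   buf[FRAME_LEN + 1];    /* collected characters plus terminating NUL */
    size_t used;                  /* characters collected so far; 0 = idle */
};

/* Feed one received byte. Returns 1 when buf holds a complete,
   NUL-terminated command of the expected length, 0 otherwise. */
int frame_reader_feed(struct frame_reader *r, char byte)
{
    if (byte == '#') {            /* start flag: begin a new command */
        r->buf[0] = '#';
        r->used = 1;
        return 0;
    }
    if (r->used == 0)             /* not inside a command: ignore the byte */
        return 0;

    if (byte == '\r') {           /* end flag: accept only the exact length */
        int complete = (r->used == FRAME_LEN);
        r->buf[r->used] = '\0';
        r->used = 0;              /* ready for the next command */
        return complete;
    }
    if (r->used == FRAME_LEN) {   /* too long: it's a bad command, drop it */
        r->used = 0;
        return 0;
    }
    r->buf[r->used++] = byte;
    return 0;
}
```

The struct should start zeroed (for example struct frame_reader r = { {0}, 0 };). When frame_reader_feed returns 1, r.buf contains the command without the carriage return, so it can be handed to whichever tokenizer is chosen.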



 
LVL 53

Assisted Solution

by:Infinity08
Infinity08 earned 150 total points
ID: 17993905
This is an example of how you could extract all values :

    #include <stdio.h>
    #include <stdlib.h>
   
    int main(void) {
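      const char * test = "#FFFFFF0FF0\r";
      /* %x stores into an unsigned int, so the targets are unsigned */
      unsigned int val1 = 0;
      unsigned int val2 = 0;
      unsigned int val3 = 0;
      unsigned int mode1 = 0;
      unsigned int val4 = 0;
      unsigned int mode2 = 0;
     
      /* %2x reads at most two hex digits, %1x at most one.
         The trailing \r matches any run of whitespace (including none),
         so it does not by itself verify the terminator. */
      int ret = sscanf(test, "#%2x%2x%2x%1x%2x%1x\r", &val1, &val2, &val3, &mode1, &val4, &mode2);
      if (ret == 6) {
        fprintf(stdout, "val1 : %u\n", val1);
        fprintf(stdout, "val2 : %u\n", val2);
        fprintf(stdout, "val3 : %u\n", val3);
        fprintf(stdout, "mode1 : %u\n", mode1);
        fprintf(stdout, "val4 : %u\n", val4);
        fprintf(stdout, "mode2 : %u\n", mode2);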
      }
      else {
        fprintf(stdout, "Invalid input !!\n");
      }
      return 0;
    }
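
sscanf is fairly lenient: the %x conversions skip leading whitespace, and nothing in the call confirms that the string ends with a carriage return or has exactly the expected length. If bad commands must be rejected, a stricter alternative is to check every position by hand. The following is a sketch under the format described by the asker ('#', three hex pairs, one mode digit, one hex pair, one mode digit, CR); struct command, parse_command, hex_digit and hex_pair are names invented for this example.

```c
#include <ctype.h>
#include <string.h>

struct command {
    int val1, val2, val3;   /* eight-bit values, 0x00-0xFF */
    int mode1;              /* mode, 0x0-0xF */
    int val4;               /* eight-bit value, 0x00-0xFF */
    int mode2;              /* mode, 0x0-0xF */
};

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';                          /* the "subtract 48" case */
    return toupper((unsigned char)c) - 'A' + 10; /* 'A'-'F' and 'a'-'f' */
}

/* Value of two consecutive hex digits starting at s. */
static int hex_pair(const char *s)
{
    return hex_digit(s[0]) * 16 + hex_digit(s[1]);
}

/* Returns 1 and fills *out on success, 0 for any malformed command. */
int parse_command(const char *s, struct command *out)
{
    size_t i;

    /* '#' + 10 hex characters + CR = 12 characters in total */
    if (strlen(s) != 12 || s[0] != '#' || s[11] != '\r')
        return 0;

    for (i = 1; i <= 10; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;

    out->val1  = hex_pair(s + 1);
    out->val2  = hex_pair(s + 3);
    out->val3  = hex_pair(s + 5);
    out->mode1 = hex_digit(s[7]);
    out->val4  = hex_pair(s + 8);
    out->mode2 = hex_digit(s[10]);
    return 1;
}
```

Compared with the sscanf version, this trades brevity for explicit checks, which may matter when the bytes come from a noisy serial line.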
 

Author Comment

by:wwward0
ID: 18042735
Thank you all for your help, you pointed me in the right direction, and I have a working setup!

:)

Bill