Tokenizing a string without separators.  How to parse a string in C?

Posted on 2006-11-20
Medium Priority
Last Modified: 2010-04-15
I am writing a parser to handle an ASCII string representing voltages.  The intent is to have this string separated into seven tokens.  The string format is:


The # indicates to my program to select a "hex" case.  The <cr> at the end signifies a carraige return which should be treated as an "end of line" flag.  This last <CR> completes the string.

I want the string tokenized as follows:

# FF FF FF 0 FF 0 <CR>

There are no separators, only the position within the string is indicative of each token.  Remember that these are being sent as ASCII characters over a serial line and I'll need to convert them from ASCII to integers (subtract 48 from the byte.)

The goal is to be able to perform calculations on the individual tokens.  I am going to use the # as a start flag (not stored in a variable) and the <CR> as an end flag.

I realize this is a pretty basic issue for most programmers, but to me it's a hobby.  I'm trying to build a microcontroller-driven servo and send it commands with my PC.  I'm going to keep plugging away at the solution myself, but hope you can help.  I would like to get an answer as soon as possible.

Question by:wwward0
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
  • 2
  • +3
LVL 12

Accepted Solution

rajeev_devin earned 800 total points
ID: 17985406
Can be extracted this way:

char str[] = "#FFFFFF0FF0\n";
char tokens[7][10];
sscanf(str, "%1s%2s%2s%2s%1s%2s%1s",tokens[0],
LVL 53

Expert Comment

ID: 17985893
>> I want the string tokenized as follows:
>> # FF FF FF 0 FF 0 <CR>

A few questions :

1) is <CR> 1 character (ie. the carriage return character with ASCII value 13), or are these 4 characters (<, C, R and >) ?

2) will the first 3 values always be 2 characters wide, the fourth 1 character wide, the fifth 2 characters wide, and the last 1 character wide ? If not, what are the possible values, and how do you determine where to split (tokenize) them ?
LVL 45

Expert Comment

by:Kent Olsen
ID: 17987086

Hi rajeev,

All kinds of algorithms exist to tokenize that string.  I'll be glad to provide some code for you.

But I believe that the algorithm is in need of a tune up.  Given that the last 4 chars of '0FF0' represent 0, FF, and 0, what string would represent 0F F0?

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

LVL 39

Assisted Solution

itsmeandnobodyelse earned 600 total points
ID: 17987212
The following should do. You easily might add some checks if the string could contain invalid chars.

Regards, Alex

#include <malloc.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <memory.h>

int tokenize(const char *input, char*** ppszTokens)
     int count;
     size_t i;
     char c;
     char* p;
     int toggle;
     size_t len;
     char** pszTokens;

     len = strlen(input);
     if (len == 0 || input[0] != '#')
          return -1;   /* wrong input */

     // allocate space for max possible tokens
     *ppszTokens = (char**)malloc(len*sizeof(char*));
     pszTokens = *ppszTokens;
     toggle = 0;
     // allocate space for max possible output
     p = (char*) malloc(2*len + 5);
     memset(p, 0, 2*len + 5);
     count = 0;
     for (i = 0; i < len; i++)
           c = input[i];
           *p++ = c;
           if (c == '#' || c == '0')
               pszTokens[count++] = p-1;
               *p++ = 0;
           else if (c == '\n')
               pszTokens[count++] = --p;
               strcpy(p, "<CR>");
               p += 5;
           else if (toggle)
                toggle = 0;    
               pszTokens[count++] = p-2;
               *p++ = 0;
                toggle = 1;
     return count;

int main()
   char** pszTokens;
   char   input[] = "#FFFFFF0FF0\n";
   int    count;
   int    i;

   count = tokenize(input, &pszTokens);
   for (i = 0; i < count; ++i)
      printf("%s ", pszTokens[i]);
   return count;

Expert Comment

ID: 17990030
Infinity and Kdo,
I am just guessing, but the way Bill has tokenised the string it appears only 2 values a high(FF) and a low(0) are allowed values. Would not make any sense otherwise.
LVL 53

Expert Comment

ID: 17990059

that's what I thought, but it's best to be sure before offering the best solution.

On the other hand, if it's the case that there are only 2 values (FF and 0), then it's maybe better to use  F and 0 or 1 and 0 ?
LVL 53

Expert Comment

ID: 17990093
>> then it's maybe better to use  F and 0 or 1 and 0 ?
especially because it seems that it's not always gonna be 6 values, but there might be more or less.
And if each token is only one character, it makes decoding a LOT easier.
LVL 45

Expert Comment

by:Kent Olsen
ID: 17990182

Welll, the original statement "a parser to handle an ASCII string representing voltages" isn't descript enough, though I doubt seriously that we've settled on the string's true representation.

If he's got exactly 6 voltages, the are exactly 10 characters between the '#' and the '<', and positions 4 and 6 are always 0, then a lot of methods work.

But we'll need a lot more answers.

-- Are the voltages for positions 1,2, 3, and 5 0/FF or can any value in the range 0/FF work?
-- Can the voltages for positions 2 and 6 be anything other than 0?
-- Does the poster believe that a single digit of zero represents a zero voltage?  If so, how are voltages of 1,2,3,4,5,6,7,8,9,A,B,C,D,E,F represented?

Perhaps these are masks.  (All bits on, all bits off.)  Then parsing the string is a snap.  Upon encountering the '#' just switch() on 'F', '0', and '<'.

Input.  Need Input.....   :~}

Author Comment

ID: 17992201
1.  The "<CR>" carriage return is in fact the ASCII value 13.  I am specifically looking for the string to terminate with a carriage return to signal the end of the string.
2.  The values must always be the same length.  This is intended to take ASCII characters representing hex values.  If the length is incorrect, the it's a bad command and should be ignored.

The string is an example.  The first three pairs of bytes will represent an eight-bit value when converted from hex to integer.  The single-digits are basically intended to signal one of 16 possible modes (0-f).  The string 0FF0 could exist if I selected mode 0, value FF, second mode 0.

For double-byte groups, the values can be anything from 00-FF in hex.  It should represent an eight bit integer once converted from hex to int (this is done elsewhere in the program.)

Thanks, I'm checking this out!

LVL 53

Assisted Solution

Infinity08 earned 600 total points
ID: 17993905
This is an example of how you could extract all values :

    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
      char * test = "#FFFFFF0FF0\r";
      int val1 = 0;
      int val2 = 0;
      int val3 = 0;
      int mode1 = 0;
      int val4 = 0;
      int mode2 = 0;
      int ret = sscanf(test, "#%02x%02x%02x%01x%02x%01x\r", &val1, &val2, &val3, &mode1, &val4, &mode2);
      if (ret == 6) {
        fprintf(stdout, "val1 : %d\n", val1);
        fprintf(stdout, "val2 : %d\n", val2);
        fprintf(stdout, "val3 : %d\n", val3);
        fprintf(stdout, "mode1 : %d\n", mode1);
        fprintf(stdout, "val4 : %d\n", val1);
        fprintf(stdout, "mode2 : %d\n", mode2);
      else {
        fprintf(stdout, "Invalid input !!\n");
      return 0;

Author Comment

ID: 18042735
Thank you all for your help, you pointed me in the right direction, and I have a working setup!



Featured Post


Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Have you thought about creating an iPhone application (app), but didn't even know where to get started? Here's how: ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Important pre-programming comments: I’ve never tri…
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see so…
The goal of this video is to provide viewers with basic examples to understand and use pointers in the C programming language.
Video by: Grant
The goal of this video is to provide viewers with basic examples to understand and use while-loops in the C programming language.
Suggested Courses

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question