Tokenizing a string without separators.  How to parse a string in C?

Posted on 2006-11-20
Last Modified: 2010-04-15
I am writing a parser to handle an ASCII string representing voltages.  The intent is to have this string separated into seven tokens.  The string format is:

#FFFFFF0FF0<cr>


The # indicates to my program to select a "hex" case.  The <cr> at the end signifies a carriage return which should be treated as an "end of line" flag.  This last <CR> completes the string.

I want the string tokenized as follows:

# FF FF FF 0 FF 0 <CR>

There are no separators, only the position within the string is indicative of each token.  Remember that these are being sent as ASCII characters over a serial line and I'll need to convert them from ASCII to integers (subtract 48 from the byte.)
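One caveat on the "subtract 48" idea: it only covers the characters '0' through '9'. The letter 'F' is ASCII 70, so 70 - 48 gives 22 rather than 15. A minimal sketch of a conversion that also handles A-F follows; the helper names `hex_digit_value` and `hex_byte_value` are made up for illustration and do not come from the thread.

```c
/* Hypothetical helpers, not part of the original question. */

/* Returns the value 0..15 of one ASCII hex digit, or -1 if c is not a hex digit. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';        /* '0' is ASCII 48 */
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;   /* 'A' is ASCII 65 */
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;   /* accept lowercase too */
    return -1;
}

/* Combines two hex digits into a value 0..255, or returns -1 if either is invalid. */
int hex_byte_value(char hi, char lo)
{
    int h = hex_digit_value(hi);
    int l = hex_digit_value(lo);
    if (h < 0 || l < 0)
        return -1;
    return h * 16 + l;
}
```

With these, the pair "FF" converts to 255 and the single mode digit "0" converts to 0.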

The goal is to be able to perform calculations on the individual tokens.  I am going to use the # as a start flag (not stored in a variable) and the <CR> as an end flag.
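Since the bytes arrive one at a time over a serial line, the start and end flags can drive a small receive routine. The following is only a sketch under assumptions: the payload is taken to be exactly 10 characters (as in the example string), and the names `frame_rx`, `frame_feed` and `FRAME_LEN` are hypothetical.

```c
#include <stddef.h>

#define FRAME_LEN 10   /* assumed payload length between '#' and CR */

typedef struct {
    char   buf[FRAME_LEN + 1]; /* payload plus terminating NUL */
    size_t len;                /* payload characters stored so far */
    int    active;             /* nonzero once '#' has been seen */
} frame_rx;

/* Feed one received byte. Returns 1 when a complete, correctly sized
   frame is available in rx->buf, and 0 otherwise. */
int frame_feed(frame_rx *rx, char c)
{
    if (c == '#') {             /* start flag: begin (or restart) a frame */
        rx->active = 1;
        rx->len = 0;
        return 0;
    }
    if (!rx->active)
        return 0;               /* ignore bytes outside a frame */
    if (c == '\r') {            /* end flag: accept only the exact length */
        int ok = (rx->len == FRAME_LEN);
        rx->buf[rx->len] = '\0';
        rx->active = 0;
        return ok;
    }
    if (rx->len >= FRAME_LEN) { /* too long: drop the frame */
        rx->active = 0;
        return 0;
    }
    rx->buf[rx->len++] = c;
    return 0;
}
```

Initialise the struct to all zeros, call `frame_feed` for every byte read from the port, and process `rx.buf` whenever it returns 1. Neither flag character is stored, which matches the intent of using them only as markers.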

I realize this is a pretty basic issue for most programmers, but to me it's a hobby.  I'm trying to build a microcontroller-driven servo and send it commands with my PC.  I'm going to keep plugging away at the solution myself, but hope you can help.  I would like to get an answer as soon as possible.

Question by:wwward0
Accepted Solution

rajeev_devin earned 200 total points
ID: 17985406
Can be extracted this way:

char str[] = "#FFFFFF0FF0\n";
char tokens[7][10];
sscanf(str, "%1s%2s%2s%2s%1s%2s%1s",tokens[0],
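The call above is cut off in the source. A complete version along the same lines might look like the sketch below, assuming the remaining arguments are simply `tokens[1]` through `tokens[6]`; the wrapper name `split_tokens` is hypothetical. Checking the return value of `sscanf` catches strings that are too short.

```c
#include <stdio.h>

/* Hypothetical wrapper around the sscanf approach shown above.
   Splits str into seven fixed-width tokens: "#", three 2-character
   values, a 1-character mode, a 2-character value, a 1-character mode.
   Returns 1 if all seven tokens were read, 0 otherwise. */
int split_tokens(const char *str, char tokens[7][10])
{
    int n = sscanf(str, "%1s%2s%2s%2s%1s%2s%1s",
                   tokens[0], tokens[1], tokens[2], tokens[3],
                   tokens[4], tokens[5], tokens[6]);
    return n == 7;   /* sscanf returns the number of fields it stored */
}
```

Note that the tokens are still text at this point ("FF", "0", and so on), so a hex-to-integer conversion is needed before doing arithmetic on them. The trailing end-of-line character is not one of the seven tokens; `%s` never stores whitespace.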

Expert Comment

ID: 17985893
>> I want the string tokenized as follows:
>> # FF FF FF 0 FF 0 <CR>

A few questions :

1) is <CR> 1 character (ie. the carriage return character with ASCII value 13), or are these 4 characters (<, C, R and >) ?

2) will the first 3 values always be 2 characters wide, the fourth 1 character wide, the fifth 2 characters wide, and the last 1 character wide ? If not, what are the possible values, and how do you determine where to split (tokenize) them ?

Expert Comment

ID: 17987086

Hi rajeev,

All kinds of algorithms exist to tokenize that string.  I'll be glad to provide some code for you.

But I believe that the algorithm is in need of a tune up.  Given that the last 4 chars of '0FF0' represent 0, FF, and 0, what string would represent 0F F0?


Assisted Solution

itsmeandnobodyelse earned 150 total points
ID: 17987212
The following should do. You easily might add some checks if the string could contain invalid chars.

Regards, Alex

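#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Splits input into tokens: '#' and '0' are single-character tokens,
   '\n' becomes the token "<CR>", and all other characters are paired.
   Returns the token count, or -1 on bad input or allocation failure.
   No validation of the characters themselves is done here. */
int tokenize(const char *input, char*** ppszTokens)
{
     int count;
     size_t i;
     char c;
     char* p;
     int toggle;
     size_t len;
     char** pszTokens;

     len = strlen(input);
     if (len == 0 || input[0] != '#')
          return -1;   /* wrong input */

     // allocate space for max possible tokens
     pszTokens = (char**)malloc(len*sizeof(char*));
     // allocate zero-filled space for max possible output
     p = (char*) calloc(2*len + 5, 1);
     if (pszTokens == NULL || p == NULL)
     {
          free(pszTokens);
          free(p);
          *ppszTokens = NULL;
          return -1;   /* out of memory */
     }
     *ppszTokens = pszTokens;
     toggle = 0;
     count = 0;
     for (i = 0; i < len; i++)
     {
           c = input[i];
           *p++ = c;
           if (c == '#' || c == '0')
           {
               // single-character token
               pszTokens[count++] = p-1;
               *p++ = 0;
           }
           else if (c == '\n')
           {
               // replace the end-of-line character by a printable "<CR>"
               pszTokens[count++] = --p;
               strcpy(p, "<CR>");
               p += 5;
           }
           else if (toggle)
           {
               // second character of a pair completes the token
               toggle = 0;
               pszTokens[count++] = p-2;
               *p++ = 0;
           }
           else
           {
               // first character of a pair
               toggle = 1;
           }
     }
     return count;
}

int main()
{
   char** pszTokens;
   char   input[] = "#FFFFFF0FF0\n";
   int    count;
   int    i;

   count = tokenize(input, &pszTokens);
   for (i = 0; i < count; ++i)
      printf("%s ", pszTokens[i]);
   if (count > 0)
   {
      free(pszTokens[0]);   /* first token points at the start of the text buffer */
      free(pszTokens);
   }
   return 0;
}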

Expert Comment

ID: 17990030
Infinity and Kdo,
I am just guessing, but from the way Bill has tokenised the string it appears that only two values are allowed, a high (FF) and a low (0). It would not make any sense otherwise.

Expert Comment

ID: 17990059

that's what I thought, but it's best to be sure before offering the best solution.

On the other hand, if it's the case that there are only 2 values (FF and 0), then it's maybe better to use  F and 0 or 1 and 0 ?

Expert Comment

ID: 17990093
>> then it's maybe better to use  F and 0 or 1 and 0 ?
especially because it seems that it's not always gonna be 6 values, but there might be more or less.
And if each token is only one character, it makes decoding a LOT easier.

Expert Comment

ID: 17990182

Well, the original statement "a parser to handle an ASCII string representing voltages" isn't descriptive enough, though I doubt seriously that we've settled on the string's true representation.

If he's got exactly 6 voltages, there are exactly 10 characters between the '#' and the '<', and positions 4 and 6 are always 0, then a lot of methods work.

But we'll need a lot more answers.

-- Are the voltages for positions 1,2, 3, and 5 0/FF or can any value in the range 0/FF work?
-- Can the voltages for positions 4 and 6 be anything other than 0?
-- Does the poster believe that a single digit of zero represents a zero voltage?  If so, how are voltages of 1,2,3,4,5,6,7,8,9,A,B,C,D,E,F represented?

Perhaps these are masks.  (All bits on, all bits off.)  Then parsing the string is a snap.  Upon encountering the '#' just switch() on 'F', '0', and '<'.

Input.  Need Input.....   :~}

Author Comment

ID: 17992201
1.  The "<CR>" carriage return is in fact the ASCII value 13.  I am specifically looking for the string to terminate with a carriage return to signal the end of the string.
2.  The values must always be the same length.  This is intended to take ASCII characters representing hex values.  If the length is incorrect, then it's a bad command and should be ignored.

The string is an example.  The first three pairs of bytes will represent an eight-bit value when converted from hex to integer.  The single-digits are basically intended to signal one of 16 possible modes (0-f).  The string 0FF0 could exist if I selected mode 0, value FF, second mode 0.

For double-byte groups, the values can be anything from 00-FF in hex.  It should represent an eight bit integer once converted from hex to int (this is done elsewhere in the program.)

Thanks, I'm checking this out!
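The rules described above (a leading '#', ten hex digits in the layout 2-2-2-1-2-1, a terminating carriage return, and anything else ignored as a bad command) can be sketched as a strict parser. This is an illustration only; the `command` struct, its field names, and the functions `hex_val` and `parse_command` are assumptions, not the author's actual code.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical container for one decoded command. */
typedef struct {
    unsigned char val1, val2, val3; /* three 8-bit values, 0x00..0xFF */
    unsigned char mode1;            /* first mode, 0..15 */
    unsigned char val4;             /* fourth 8-bit value */
    unsigned char mode2;            /* second mode, 0..15 */
} command;

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return toupper((unsigned char)c) - 'A' + 10;
}

/* Parses "#VVVVVVMVVM\r" into *out. Returns 1 on success and 0 for a
   bad command (wrong length, missing flags, or a non-hex character). */
int parse_command(const char *s, command *out)
{
    size_t i;

    if (strlen(s) != 12 || s[0] != '#' || s[11] != '\r')
        return 0;
    for (i = 1; i <= 10; ++i)
        if (!isxdigit((unsigned char)s[i]))
            return 0;

    out->val1  = (unsigned char)(hex_val(s[1]) * 16 + hex_val(s[2]));
    out->val2  = (unsigned char)(hex_val(s[3]) * 16 + hex_val(s[4]));
    out->val3  = (unsigned char)(hex_val(s[5]) * 16 + hex_val(s[6]));
    out->mode1 = (unsigned char)hex_val(s[7]);
    out->val4  = (unsigned char)(hex_val(s[8]) * 16 + hex_val(s[9]));
    out->mode2 = (unsigned char)hex_val(s[10]);
    return 1;
}
```

For the example string "#FFFFFF0FF0\r" this yields 255, 255, 255, mode 0, 255, mode 0. Checking every character by position avoids the 0FF0 ambiguity raised earlier in the thread, because each field is identified by where it sits rather than by what it contains.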


Assisted Solution

Infinity08 earned 150 total points
ID: 17993905
This is an example of how you could extract all values :

    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
      const char * test = "#FFFFFF0FF0\r";
      /* %x stores into an unsigned int */
      unsigned int val1 = 0;
      unsigned int val2 = 0;
      unsigned int val3 = 0;
      unsigned int mode1 = 0;
      unsigned int val4 = 0;
      unsigned int mode2 = 0;
      /* the number in each conversion is the maximum field width;
         the trailing \r matches any whitespace, including none */
      int ret = sscanf(test, "#%2x%2x%2x%1x%2x%1x\r", &val1, &val2, &val3, &mode1, &val4, &mode2);
      if (ret == 6) {
        fprintf(stdout, "val1 : %u\n", val1);
        fprintf(stdout, "val2 : %u\n", val2);
        fprintf(stdout, "val3 : %u\n", val3);
        fprintf(stdout, "mode1 : %u\n", mode1);
        fprintf(stdout, "val4 : %u\n", val4);
        fprintf(stdout, "mode2 : %u\n", mode2);
      }
      else {
        fprintf(stdout, "Invalid input !!\n");
      }
      return 0;
    }

Author Comment

ID: 18042735
Thank you all for your help, you pointed me in the right direction, and I have a working setup!



Question has a verified solution.
