Solved

Tokenizing a string without separators.  How to parse a string in C?

Posted on 2006-11-20
Last Modified: 2010-04-15
I am writing a parser to handle an ASCII string representing voltages.  The intent is to have this string separated into seven tokens.  The string format is:

#FFFFFF0FF0<cr>

The # tells my program to select a "hex" case.  The <cr> at the end signifies a carriage return, which should be treated as an "end of line" flag.  This last <CR> completes the string.

I want the string tokenized as follows:

# FF FF FF 0 FF 0 <CR>

There are no separators, only the position within the string is indicative of each token.  Remember that these are being sent as ASCII characters over a serial line and I'll need to convert them from ASCII to integers (subtract 48 from the byte.)
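One caveat on the conversion: subtracting 48 maps the characters '0' to '9' onto 0 to 9, but the hex digits 'A' to 'F' need their own offset. A minimal sketch, assuming ASCII input; the helper names hex_digit_value and hex_byte_value are hypothetical and not part of the original post:

```c
/* Hypothetical helper: value of one ASCII hex digit, or -1 if invalid. */
static int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';       /* '0' is ASCII 48 */
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;  /* 'A' is ASCII 65 */
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;  /* accept lowercase too */
    return -1;
}

/* Hypothetical helper: combine two hex digits into a value 0..255,
   or return -1 if either character is not a hex digit. */
static int hex_byte_value(char hi, char lo)
{
    int h = hex_digit_value(hi);
    int l = hex_digit_value(lo);

    if (h < 0 || l < 0)
        return -1;
    return h * 16 + l;
}
```

With these, hex_byte_value('F', 'F') gives 255, and a single mode digit can go through hex_digit_value directly.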

The goal is to be able to perform calculations on the individual tokens.  I am going to use the # as a start flag (not stored in a variable) and the <CR> as an end flag.

I realize this is a pretty basic issue for most programmers, but to me it's a hobby.  I'm trying to build a microcontroller-driven servo and send it commands with my PC.  I'm going to keep plugging away at the solution myself, but hope you can help.  I would like to get an answer as soon as possible.

Thanks,
Bill
Question by:wwward0
LVL 12

Accepted Solution

by:
rajeev_devin earned 200 total points
ID: 17985406
Can be extracted this way:

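#include <stdio.h>

int main(void)
{
    char str[] = "#FFFFFF0FF0\r";   /* <CR> is ASCII 13 */
    char tokens[7][10];

    /* %Ns reads at most N non-whitespace characters into a string */
    int n = sscanf(str, "%1s%2s%2s%2s%1s%2s%1s",
                   tokens[0], tokens[1], tokens[2], tokens[3],
                   tokens[4], tokens[5], tokens[6]);
    if (n != 7)
        printf("Invalid input\n");
    return 0;
}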
LVL 53

Expert Comment

by:Infinity08
ID: 17985893
>> I want the string tokenized as follows:
>>
>> # FF FF FF 0 FF 0 <CR>

A few questions :

1) is <CR> 1 character (ie. the carriage return character with ASCII value 13), or are these 4 characters (<, C, R and >) ?

2) will the first 3 values always be 2 characters wide, the fourth 1 character wide, the fifth 2 characters wide, and the last 1 character wide ? If not, what are the possible values, and how do you determine where to split (tokenize) them ?
LVL 45

Expert Comment

by:Kdo
ID: 17987086

Hi rajeev,

All kinds of algorithms exist to tokenize that string.  I'll be glad to provide some code for you.

But I believe that the algorithm is in need of a tune up.  Given that the last 4 chars of '0FF0' represent 0, FF, and 0, what string would represent 0F F0?


Kent
LVL 39

Assisted Solution

by:itsmeandnobodyelse
itsmeandnobodyelse earned 150 total points
ID: 17987212
The following should do. You could easily add some checks in case the string contains invalid chars.

Regards, Alex

#include <stdlib.h>   /* malloc, free */
#include <stdio.h>
#include <string.h>   /* strlen, strcpy, memset */

int tokenize(const char *input, char*** ppszTokens)
{
     int count;
     size_t i;
     char c;
     char* p;
     int toggle;
     size_t len;
     char** pszTokens;


     len = strlen(input);
     if (len == 0 || input[0] != '#')
          return -1;   /* wrong input */

     // allocate space for max possible tokens
     *ppszTokens = (char**)malloc(len*sizeof(char*));
     pszTokens = *ppszTokens;
     toggle = 0;
     // allocate space for max possible output
     p = (char*) malloc(2*len + 5);
     memset(p, 0, 2*len + 5);
     count = 0;
     for (i = 0; i < len; i++)
     {
           c = input[i];
           *p++ = c;
           if (c == '#' || c == '0')
           {
               pszTokens[count++] = p-1;
               *p++ = 0;
           }
           else if (c == '\r' || c == '\n')   /* <CR> is ASCII 13 */
           {
               pszTokens[count++] = --p;
               strcpy(p, "<CR>");
               p += 5;
               break;
           }
           else if (toggle)
           {
                toggle = 0;    
               pszTokens[count++] = p-2;
               *p++ = 0;
           }
           else
                toggle = 1;
     }
     return count;
}

int main()
{
   char** pszTokens;
   char   input[] = "#FFFFFF0FF0\n";
   int    count;
   int    i;

   count = tokenize(input, &pszTokens);
   for (i = 0; i < count; ++i)
   {
      printf("%s ", pszTokens[i]);
   }
   free(pszTokens[0]);
   free(pszTokens);
   return 0;
}
LVL 2

Expert Comment

by:avsrivastava
ID: 17990030
Infinity and Kdo,
I am just guessing, but from the way Bill has tokenised the string it appears that only 2 values, a high (FF) and a low (0), are allowed. It would not make any sense otherwise.
LVL 53

Expert Comment

by:Infinity08
ID: 17990059
avsrivastava,

that's what I thought, but it's best to be sure before offering the best solution.

On the other hand, if it's the case that there are only 2 values (FF and 0), then it's maybe better to use  F and 0 or 1 and 0 ?
LVL 53

Expert Comment

by:Infinity08
ID: 17990093
>> then it's maybe better to use  F and 0 or 1 and 0 ?
especially because it seems that it's not always gonna be 6 values, but there might be more or less.
And if each token is only one character, it makes decoding a LOT easier.
LVL 45

Expert Comment

by:Kdo
ID: 17990182

Well, the original statement "a parser to handle an ASCII string representing voltages" isn't descriptive enough, and I seriously doubt that we've settled on the string's true representation.

If he's got exactly 6 voltages, there are exactly 10 characters between the '#' and the '<', and positions 4 and 6 are always 0, then a lot of methods work.

But we'll need a lot more answers.

-- Are the voltages for positions 1,2, 3, and 5 0/FF or can any value in the range 0/FF work?
-- Can the voltages for positions 4 and 6 be anything other than 0?
-- Does the poster believe that a single digit of zero represents a zero voltage?  If so, how are voltages of 1,2,3,4,5,6,7,8,9,A,B,C,D,E,F represented?

Perhaps these are masks.  (All bits on, all bits off.)  Then parsing the string is a snap.  Upon encountering the '#' just switch() on 'F', '0', and '<'.


Input.  Need Input.....   :~}
Kent

Author Comment

by:wwward0
ID: 17992201
Infinity08:
1.  The "<CR>" carriage return is in fact the ASCII value 13.  I am specifically looking for the string to terminate with a carriage return to signal the end of the string.
2.  The values must always be the same length.  This is intended to take ASCII characters representing hex values.  If the length is incorrect, then it's a bad command and should be ignored.
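
The two rules above (terminate on ASCII 13, ignore commands of the wrong length) can be sketched as a small byte-at-a-time receiver. This is only an illustration under assumptions: the names frame_reader, frame_reader_feed and PAYLOAD_LEN are hypothetical, and how bytes actually arrive from the serial port is left out.

```c
#include <stddef.h>

#define PAYLOAD_LEN 10          /* hex digits between '#' and <CR> */

struct frame_reader {           /* hypothetical name */
    char   buf[PAYLOAD_LEN + 1];
    size_t len;
    int    in_frame;
};

/* Feed one received byte. Returns 1 when buf holds a complete,
   NUL-terminated payload of exactly PAYLOAD_LEN characters, else 0. */
static int frame_reader_feed(struct frame_reader *r, char c)
{
    if (c == '#') {                 /* start flag: begin a new command */
        r->len = 0;
        r->in_frame = 1;
        return 0;
    }
    if (!r->in_frame)
        return 0;                   /* ignore bytes outside a command */
    if (c == '\r') {                /* end flag: ASCII 13 */
        r->in_frame = 0;
        if (r->len != PAYLOAD_LEN)
            return 0;               /* wrong length: bad command, ignore */
        r->buf[r->len] = '\0';
        return 1;
    }
    if (r->len >= PAYLOAD_LEN) {    /* too long: drop the command */
        r->in_frame = 0;
        return 0;
    }
    r->buf[r->len++] = c;
    return 0;
}
```

The reader only checks framing and length. Whether each payload character is a valid hex digit still has to be checked when the payload is converted.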

Kdo:
The string is an example.  The first three pairs of bytes will represent an eight-bit value when converted from hex to integer.  The single-digits are basically intended to signal one of 16 possible modes (0-f).  The string 0FF0 could exist if I selected mode 0, value FF, second mode 0.

avsrivastava:
For double-byte groups, the values can be anything from 00-FF in hex.  It should represent an eight bit integer once converted from hex to int (this is done elsewhere in the program.)

itsmeandnobodyelse:
Thanks, I'm checking this out!



LVL 53

Assisted Solution

by:Infinity08
Infinity08 earned 150 total points
ID: 17993905
This is an example of how you could extract all values :

    #include <stdio.h>

    int main(void) {
      const char * test = "#FFFFFF0FF0\r";
      /* %x stores into an unsigned int */
      unsigned int val1 = 0;
      unsigned int val2 = 0;
      unsigned int val3 = 0;
      unsigned int mode1 = 0;
      unsigned int val4 = 0;
      unsigned int mode2 = 0;

      /* %2x reads at most 2 hex digits, %1x at most 1 */
      int ret = sscanf(test, "#%2x%2x%2x%1x%2x%1x\r", &val1, &val2, &val3, &mode1, &val4, &mode2);
      if (ret == 6) {
        fprintf(stdout, "val1 : %u\n", val1);
        fprintf(stdout, "val2 : %u\n", val2);
        fprintf(stdout, "val3 : %u\n", val3);
        fprintf(stdout, "mode1 : %u\n", mode1);
        fprintf(stdout, "val4 : %u\n", val4);
        fprintf(stdout, "mode2 : %u\n", mode2);
      }
      else {
        fprintf(stdout, "Invalid input !!\n");
      }
      return 0;
    }
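
One thing to keep in mind with sscanf: whitespace in the format string matches zero or more whitespace characters, so a "\r" in the format does not prove that the carriage return was actually received, and a return value of 6 does not prove that the length is right. If bad commands must be rejected, a stricter check can be put in front. A minimal sketch, assuming the 12-character layout discussed in this thread; the names parse_command, struct command and CMD_LEN are hypothetical:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define CMD_LEN 12   /* '#' + 10 hex digits + '\r' */

struct command {     /* hypothetical name */
    unsigned int val1, val2, val3, mode1, val4, mode2;
};

/* Returns 1 and fills *out on success, 0 if the command is malformed. */
static int parse_command(const char *s, struct command *out)
{
    size_t i;

    /* exact length, start flag, and carriage return at the end */
    if (strlen(s) != CMD_LEN || s[0] != '#' || s[CMD_LEN - 1] != '\r')
        return 0;

    /* everything in between must be a hex digit */
    for (i = 1; i < CMD_LEN - 1; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;

    /* fixed-width fields: 2,2,2,1,2,1 hex digits */
    return sscanf(s + 1, "%2x%2x%2x%1x%2x%1x",
                  &out->val1, &out->val2, &out->val3,
                  &out->mode1, &out->val4, &out->mode2) == 6;
}
```

Calling parse_command("#FFFFFF0FF0\r", &cmd) fills cmd with 255, 255, 255, 0, 255, 0, while a string with a missing digit or a different terminator is rejected.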

Author Comment

by:wwward0
ID: 18042735
Thank you all for your help, you pointed me in the right direction, and I have a working setup!

:)

Bill