String splitter.

Posted on 2003-03-20
Medium Priority
Last Modified: 2012-06-27

If I had a string called sentence, and that string held a sentence, how would I split the string up and store each word separately so that I can analyse the sentence?
Would I need loads of other strings such as word1, word2, word3, or is there a better way?

Thanks in advance for any help.
Question by:bigmit
LVL 22

Expert Comment

ID: 8173324
First of all.  Just so you understand.

We cannot do your schoolwork for you.  That is unethical and is grounds for removal from this site.  (for both you and the experts involved.)  

We can provide only limited help with academic assignments.  We can answer specific (direct) questions, like you might ask your teacher.  We can review your work and post suggestions, again, like your teacher might do.


(Note: you did the right thing.  You asked for very specific help on an assignment.  That's okay.)

Unfortunately, the answer is "it depends".  

Sometimes a sentence can be processed 1 word at a time.  In that case you only need to read through the string from beginning to end, find each word, extract it to some 2nd string and process that 2nd string.

Sometimes you need to find and extract every word and keep copies of each word.  In that case, yes, you need multiple strings to store the words.  Sometimes you could use explicitly declared strings, like word1, word2, word3, etc. to store these words.  This would require you to know ahead of time exactly how many words are in the string.  That is probably rare, but not impossible.  More often you can't do that.  In that case, you need some sort of container to store the words.  Usually this would be something like an array or a C++ array-like object such as a vector.
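To illustrate the "1 word at a time" case, here is a minimal sketch that needs only one extra string for the current word. The function name longest_word and the choice of spaces and tabs as the only separators are assumptions made for the example, not something from the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest word in `sentence`, looking at
// one word at a time. Only spaces and tabs are treated as separators.
std::string longest_word(const std::string& sentence)
{
    const char* separators = " \t";
    std::string longest;

    // Position of the first character that belongs to a word.
    std::size_t start = sentence.find_first_not_of(separators);
    while (start != std::string::npos) {
        // Position of the separator that ends the word (npos if none).
        std::size_t end = sentence.find_first_of(separators, start);
        std::size_t count =
            (end == std::string::npos) ? std::string::npos : end - start;

        // The single temporary string that holds the current word.
        std::string word = sentence.substr(start, count);
        if (word.size() > longest.size())
            longest = word;

        // Skip the separators to find the start of the next word.
        start = sentence.find_first_not_of(separators, end);
    }
    return longest;
}
```

Punctuation stays attached to the words here, so "sentence." counts as nine characters.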

Author Comment

ID: 8173412
Regarding the response from nietod,

Thank you very much for the help, but this is not for schoolwork. I am no longer at school. I am currently constructing a translator and have got it doing a direct translation, but the direct translation then needs its grammar corrected, and that is what I am working on at the moment. I don't think that school children would be asked to do a complex program like this.


Expert Comment

ID: 8173428
Like nietod said, we're not going to do your HW, but here are some hints, with pseudocode to follow.

To break up the string you could use strtok() (a very useful function, check out msdn.microsoft.com for MS's documentation), or you could work through the string char by char until you find a space character and there's a word.

As far as creating new strings, it's like nietod said, it depends.  If you just want to look at one word at a time, you need just one extra temporary string:

while (still words in original sentence)
      temp_string = find_next_word();

See?  If you need to store all the strings somewhere, you can use an array of character strings.  Like:

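// Array of 20 strings, each up to 254 chars plus the terminating null
char temp_string[20][255];
int cur_string = 0;

while (still words in original sentence && cur_string < 20)
      // C arrays cannot be assigned with =, so copy the word in
      strcpy(temp_string[cur_string++], find_next_word());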


LVL 22

Expert Comment

ID: 8173503
>>  I dont think that school children would be asked to do a complex program like this.
This sort of work is typical of introductory computer programming.  If you can't do this, I don't know how you will be able to do your translator program, which must require 100s of more difficult tasks.  I suspect you are going to need to do a lot more study before you can write your program.

>> To break up the string you could use strtok() (a very useful function
I would STRONGLY recommend you never use this function.  It is an old C function, not a C++ one, and it has numerous flaws.

LVL 22

Accepted Solution

nietod earned 80 total points
ID: 8173529
This code will break a string of any number of words into its separate words.  If this is for homework, however, there is no way a teacher would ever accept it, since the STL does all the work for you.

#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main()
{
   string sentence = "this is a sentence.";
   vector<string> WordList; // vector to store all the words extracted.
   istringstream Stream(sentence); // stream to read the sentence from.
   string Word;

   while (Stream >> Word) // read 1 word; the loop ends when none are left.
      WordList.push_back(Word); // Save the word.

   // The words are now all stored in WordList.
   // Note they would still contain punctuation.
   // You might need to remove that.  I don't know.
   return 0;
}
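
On the punctuation point, one possible way to clean each word after extraction is sketched below. The helper name trim_punctuation is made up for the example, and it assumes that stripping punctuation only from the ends of a word is what you want, so an apostrophe inside a word survives.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: removes punctuation from both ends of a word,
// so "sentence." becomes "sentence" while "don't" keeps its apostrophe.
std::string trim_punctuation(const std::string& word)
{
    std::size_t first = 0;
    std::size_t last = word.size();

    // Advance past leading punctuation.
    while (first < last &&
           std::ispunct(static_cast<unsigned char>(word[first])))
        ++first;

    // Back up over trailing punctuation.
    while (last > first &&
           std::ispunct(static_cast<unsigned char>(word[last - 1])))
        --last;

    return word.substr(first, last - first);
}
```

Each extracted word could be passed through this before it is pushed into the vector.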
LVL 12

Assisted Solution

Salte earned 80 total points
ID: 8173592
I can just refer to what nietod and others have said.

Like nietod, I will strongly advise against strtok(). The function is not useful at all. The one place where it can do useful work is the one place where strchr() does the job better.

Also, as nietod said, there are functions in std::string that do the job better.

Having said all that I suggest you take a look at how compilers work. I believe your translator is a translator for human languages and not programming languages like C++ etc. However, that just means that it is more complicated than, say, compiling C++ is.

The techniques will be very much the same, you should build a structure of the program where you have several components working together in order to understand the sentence. The fact that you are working on a 'string' level implies to me that you haven't even looked at the problem properly yet.

When you start talking about 'tokens' and 'grammar' and 'rules' we can get back to you and give you proper advice on how to proceed. Translating human languages has much in common with compiling a computer program - the grammar is just more complicated, and the rules are very ad hoc and also require some heuristics.

Proper translation programs, as you find in AI and so on, will essentially first read the text much the same way a compiler reads it, splitting it into tokens. The program then has a 'grammar' and applies the grammar rules in order to make sense of the tokens. Only then has it reached the point where it can apply a rule-based system that actually attempts to understand what the words mean in such a way that it is possible to translate them.
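
As a rough idea of what that first tokenizing stage might look like, here is a minimal sketch. The Token type, the three token kinds and the function name tokenize are all assumptions made for illustration, and it handles plain ASCII text only. A real translator would need far more than this.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for a first lexing pass over plain ASCII text.
enum TokenKind { WORD, NUMBER, PUNCTUATION };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits `text` into tokens: runs of letters, runs of digits, and single
// punctuation characters. Whitespace separates tokens and is dropped.
std::vector<Token> tokenize(const std::string& text)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }

        std::size_t start = i;
        Token tok;
        if (std::isalpha(c)) {
            while (i < text.size() &&
                   std::isalpha(static_cast<unsigned char>(text[i])))
                ++i;
            tok.kind = WORD;
        } else if (std::isdigit(c)) {
            while (i < text.size() &&
                   std::isdigit(static_cast<unsigned char>(text[i])))
                ++i;
            tok.kind = NUMBER;
        } else {
            ++i;                       // any other character stands alone
            tok.kind = PUNCTUATION;
        }
        tok.text = text.substr(start, i - start);
        tokens.push_back(tok);
    }
    return tokens;
}
```

Note that this simple rule splits "don't" into three tokens. Deciding how to treat cases like that is exactly the sort of thing the grammar stage has to deal with.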

If you have problems splitting a string into words I very strongly believe you won't even get past the first few feet of the marathon of making a translator program.

So I really consider your program to be an exercise in futility.


Expert Comment

ID: 8173888
I've been using strtok() with C++ for years, and I think it's a great function.  Could you guys explain a bit more why you don't like it?  Besides the fact that it's an old function.
LVL 12

Expert Comment

ID: 8174146
Several problems with it, but the main gripe I have with it is that it is useless.

You have two situations where strtok() can be used:

1. You are looking at a string and somewhere in that string is a specific character that terminates the current token. The token is assumed to be at the start of the string.

In this case you are looking at a specific character and strtok does essentially exactly the same as strchr() does. The only difference is that strchr() does this without modifying the original string (it can be const) and strchr() is more basic. In fact strtok() is implemented by using strchr() in most implementations. So it is faster to just use strchr() directly since strtok() doesn't really add any useful functionality.

2. You are looking at a string as in situation 1, but this time there are several possible characters that might terminate your token. This case is not directly handled by strchr(). As I said, strtok() is typically implemented using strchr() or some similar function, so yes, strchr() can handle it, but not without some code around it. However, in this case strtok() bluntly replaces the character that terminated the token with a null byte, and whatever character was there is lost forever.

So in the one case where strtok() adds some functionality, it screws up by providing a bulldozer solution and destroying all trace of which character terminated the token.

So, in case 1 it does a useful job per se but strchr() does it better; in case 2 it does a job beyond what strchr() can offer, but it moves on like a bulldozer and completely screws everything up. It's like that fat guy who sits on your chair and completely breaks it - "oops, what did I destroy this time?" - is something a personification of strtok() would typically say.

Add to that the fact that it wreaks havoc because it modifies the string, and you absolutely have a function you should avoid at all costs.

In addition, strtok() operates by modifying a static variable stored somewhere. This means that you cannot use strtok() in parallel on two different strings. It also means that you cannot use strtok() safely in threaded software. At most one of the threads can use strtok() at a time.
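
The destructive part is easy to see for yourself. The small helper below is hypothetical and exists only to demonstrate the effect. It calls strtok() once and reports how long the caller's buffer appears to be afterwards.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical demonstration helper: tokenizes `buffer` once with strtok()
// and returns the length the buffer appears to have afterwards.
// `buffer` must be writable (not a string literal).
std::size_t length_after_first_strtok(char* buffer, const char* delims)
{
    // strtok() overwrites the delimiter that ends the first token with '\0'.
    std::strtok(buffer, delims);

    // The buffer now appears to end where that delimiter used to be.
    return std::strlen(buffer);
}
```

Given a buffer holding "hello there" and a single space as the delimiter, the buffer afterwards reads as just "hello". The space is gone, and nothing tells you which character used to be there.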

Especially these days, with so many better alternatives, there's no reason to use strtok(). A better alternative (one that would still be usable from C) would be a function that did the following:

1. Returned the token without modifying the input string.
2. Kept the state in an explicit variable so that you could call several independent instances of it.
3. Always returned what character actually terminated the token whenever you wanted that info.

I don't see strtok() do any of those. A good suggestion would be something like (use struct so it would work on C also):

struct strtok_t {
   const char * str; // the original string.
   const char * cur; // current position.
};

and then:

const char * better_strtok(struct strtok_t * p,
                           const char * str,
                           const char * delims,
                           int * toklen,
                           char * delim);

The idea is to look for a token in str. The token starts at str and is terminated by any character in delims. The length of the token is returned in toklen and the character that terminated the token is returned in delim.

If you're not interested in knowing which character terminated the token you can give a NULL pointer as the last argument (in C++ this would typically have a default value of NULL or 0).

In repeated calls to better_strtok the str is of course NULL as in old strtok.

The strtok_t object must always be given, it is used to store the state.

strtok_t k;
int len;

const char * p = better_strtok(&k,"hello there\n", " ",&len,0);

will return p pointing at the start of "hello there\n", with len == 5. Since the delim pointer is 0 you don't get the delimiter back.

If you want to set the string to a 0 to get only the string of the first token you could do a:

const_cast<char *>(p)[len] = 0;

however, that would break as the string given above is a literal string. If it wasn't you can set it to 0 to emulate old strtok() by that code above.

If you don't need that you can use the length to dup the string to heap:

char * tok1 = new char[len+1];
memcpy(tok1, p, len); // copy the token's characters (memcpy is in <cstring>)
tok1[len] = 0;

Here the token is saved in a separate string. In C++ a simple:

string tok1(p,p+len);

would do the same.

Next you can call:

char c;
const char * q = better_strtok(& k, 0, " \n", & len, &c);

Now q would point at "there\n", len would be 5 again (since "there" also has 5 letters), and c would be '\n'.

another call to better_strtok(&k, 0, ...);

would return 0 indicating no more tokens.
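
For what it's worth, the function described above could be implemented along these lines. This is only a sketch of the idea. It assumes that leading delimiters are skipped the way old strtok() does, and that delim receives '\0' when the token runs to the end of the string. Neither detail is spelled out above.

```cpp
#include <cstring>

// State for one independent tokenizing session.
struct strtok_t {
    const char* str;  // the original string.
    const char* cur;  // where the next scan starts.
};

// Sketch of the proposed interface. Pass the string on the first call and
// 0 on later calls. Returns a pointer to the token (not null terminated)
// or 0 when no tokens are left. The input string is never modified.
const char* better_strtok(strtok_t* p,
                          const char* str,
                          const char* delims,
                          int* toklen,
                          char* delim)
{
    if (str) {                 // first call: start a new session
        p->str = str;
        p->cur = str;
    }

    // Assumption: skip any leading delimiters, as old strtok() does.
    const char* start = p->cur + std::strspn(p->cur, delims);
    if (*start == '\0') {
        p->cur = start;
        return 0;              // nothing left
    }

    // The token runs up to the next delimiter or the end of the string.
    std::size_t n = std::strcspn(start, delims);
    *toklen = static_cast<int>(n);
    if (delim)
        *delim = start[n];     // '\0' if the string ended the token

    // Step over the delimiter, but never past the terminating null.
    p->cur = (start[n] != '\0') ? start + n + 1 : start + n;
    return start;
}
```

Because all the state lives in the strtok_t object, two strings can be tokenized side by side simply by using two objects.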

In any case, strtok() as given in the C library is a useless function that is to be avoided at all costs! True enough, it isn't as bad as gets(), which is a true horror, but it is worse than scanf().

All those functions should be avoided at all cost, well, scanf() might be tolerated in certain specific cases, but they are more exceptions than the rule and since it is usually newbie C programmers who use scanf and it is exactly those programmers (the newbies) who shouldn't use it, it is in general a bad idea and also deserves its place on the list of functions you should avoid.

LVL 22

Expert Comment

ID: 8174185
>> I've been using strtok() with C++ for years, and I think it's a great function.  
>> Could you guys explain a bit more why you don't like it?  Besides the fact
>> that it's an old function.
First of all, all of the old C string functions are inherently unsafe and inefficient.  They shouldn't be used except maybe when interfacing with C code, or C-like code.  (And even that can be avoided to some degree.)

But strtok() is even more error prone.  One of several problems is that it uses an internal static variable to manage parsing the string.  This makes it impossible to parse 2 or more strings simultaneously and can lead to errors in any reasonably complex parsing task.  Worse, even if you are parsing one string at a time in your algorithm, if you are running multiple threads and each is parsing strings at the same time, they can interfere with each other and lead to incomprehensible and hard-to-reproduce bugs.  The use of static variables for such a purpose is a clear mistake.  The use of any function that makes such a mistake is equally bad.
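
By contrast, when each parse keeps its own state, working through 2 strings at once is no problem. The sketch below uses one std::istringstream per string. The function name interleave_words is made up for the example.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical example: takes words alternately from two sentences.
// Each stream object holds its own parsing position, so the two parses
// cannot disturb each other.
std::vector<std::string> interleave_words(const std::string& a,
                                          const std::string& b)
{
    std::istringstream stream_a(a);
    std::istringstream stream_b(b);
    std::vector<std::string> result;
    std::string word;
    bool more_a = true;
    bool more_b = true;

    while (more_a || more_b) {
        if (more_a) {
            if (stream_a >> word) result.push_back(word);
            else more_a = false;   // first sentence is exhausted
        }
        if (more_b) {
            if (stream_b >> word) result.push_back(word);
            else more_b = false;   // second sentence is exhausted
        }
    }
    return result;
}
```

Doing the same with 2 interleaved strtok() sessions would not work, because the second session would overwrite the hidden position of the first.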
LVL 12

Expert Comment

ID: 8174487
Btw, in C++ the obvious solution to the strtok() problem is to make a class. The class can keep track of the state, hold the delimiter and so on. You can run several simultaneous uses of 'strtok' by simply giving each its own instance of that state data.

class token_parser {
   const char * orig_string; // start of the string being parsed.
   const char * cur;         // start of the current token.
   const char * next;        // where the next scan begins.
   int len;                  // length of the current token.
   int delimch;              // character that ended the current token.
public:
   token_parser(const char * str)
     : orig_string(str), cur(0), next(str), len(0), delimch(0)
   {}

   void reset()
   { next = orig_string; cur = 0; len = delimch = 0; }

   const char * cur_token() const { return cur; }
   int cur_token_length() const { return len; }
   int cur_delim() const { return delimch; }

   const char * next_token(const char * delims);
};

next_token() would be a function that worked much like strtok() does today but instead of null terminating the string it would just store the length in len.

one simple implementation of next_token() would be like this:

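const char * token_parser::next_token(const char * delims)
{
   if (*next == 0) return 0;  // no more tokens.
   cur = next;
   // strchr() (from <cstring>) also matches the terminating null,
   // so this loop stops at the end of the string.
   while (strchr(delims,*next) == 0)
      ++next;
   delimch = *next;
   len = static_cast<int>(next - cur);
   if (*next != 0) ++next;    // step over the delimiter, never past the end.
   return cur;
}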

Such a class - even though it is very simple and close to what strtok does, is way better.

One drawback is that it uses pointers to specific strings; if those strings go away, the class object is bad.

Using std::string for the orig_string member will avoid that problem and actually makes the class useful as is.
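
A variant along those lines might look like the sketch below. The class name string_token_parser and the choice to return each token as a std::string copy are assumptions for the example, not part of the class above. Like the pointer version, it does not skip leading delimiters, so 2 delimiters in a row yield an empty token.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant that owns a copy of the text, so the parser stays
// valid even if the caller's original string goes away.
class string_token_parser {
    std::string text;    // owned copy of the string being parsed
    std::size_t next;    // index where the next scan begins
    std::string token;   // most recently extracted token
    char delim;          // delimiter that ended it, '\0' at end of text
public:
    explicit string_token_parser(const std::string& s)
        : text(s), next(0), delim('\0') {}

    void reset() { next = 0; token.clear(); delim = '\0'; }

    // Extracts the next token. Returns false when no input is left.
    bool next_token(const std::string& delims)
    {
        if (next >= text.size())
            return false;

        std::size_t end = text.find_first_of(delims, next);
        if (end == std::string::npos) {
            token = text.substr(next);   // token runs to the end
            delim = '\0';
            next = text.size();
        } else {
            token = text.substr(next, end - next);
            delim = text[end];           // remember what ended the token
            next = end + 1;              // continue after the delimiter
        }
        return true;
    }

    const std::string& cur_token() const { return token; }
    char cur_delim() const { return delim; }
};
```

The copy costs a little memory, but it removes the lifetime problem entirely.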

Main problems such a simple class cannot handle:

It doesn't handle regexps and more complicated delimiters. Sometimes you don't want to terminate on just a character; the rule is more like "but if there is a '.' then I also allow more characters until a '$' and actually terminate at that '$'".

You can fake it, though, by first doing a next_token(" ."); and, if the delimiter was '.', doing another next_token("$");

In this case you also see the benefit of this strtok class not inserting a null byte: the pointer originally returned from next_token(" .") still points at the whole string, and adding the two lengths together + 1 (for the '.' delimiter) gives the combined length of the full token, intact.

So, drop strtok(); it just isn't worth using.


Expert Comment

ID: 8209667
for a peer-reviewed, simple tokenizer,


[ about boost.org ] boost.org is an organization supported by many c++ standards committee members and provides 100% free, peer-reviewed, cross-platform libraries.  many of the boost libraries, such as their smart pointer library, are expected to end up in the next revision of the standard


Expert Comment

ID: 9510725
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Split points between Salte & nietod

Please leave any comments here within the next seven days.


EE Cleanup Volunteer
