Program to count occurences of words in a given text

JavaInTheMorning
JavaInTheMorning used Ask the Experts™
on
Hi.
I need to write a program that counts occurences of words in a text inputed by the keyboard (or with the < token from a text file).

I thought about doing it with 2 arrays - one for the words and one for the count (same index as the words).
But I believe it's just too slow to do it that way, right?

Can someone suggest a better way (We have not yet learned to use structures so we're not supposed to do that...)
Thank you very much!
Itsik
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®

Commented:
It depends on whether you are given a word, and then count all occurances of it in a file? Or do you want to count the occurances of ALL the words in a certain file?

Author

Commented:
I need to count occurences of ALL the words in the file or inputeed text.

Commented:
I would say use two arrays..  words[i] and count[i] ..

then read word by word from the file.. let's say the file says:

"Something plus something else plus other stuff something"

Start reading from the file.. and at each occurance of a word, delete it and increase the count for it.. so after the first pass we will have

"        plus       else  plus   other stuff          "

words[1] will be "something" and count[1] will be 3

then we do it again

"                     else      other stuff          "

then words[2] will be plus and count[2] will be 2

so you repeat until you are done with all the words..


The actual code is still complex.. for example deleting words from a file isn't that easy if you're a beginner.. you should create a temp file and copy everything in it and do your counting there.

You won't need structures but you need to know how to handle strings (comparing and stuff) and a bit advanced file input/output
Ensure you’re charging the right price for your IT

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Hiya,

The slow bit of this sort of thing is in looking up each word you come across so you can add 1 to its count.  The cost of doing this increases disproportionately the more words you have.  (And no offence to kooheji, but with all that file i/o that's probably even slower!).

How about splitting into 3 phases (and I know this might seem daunting for a beginner, but these kind of techniques will serve you well) ...

Given the same input:
"something plus something else plus other stuff something"

1. Do one pass of your input, parsing it into separate words and building up a simple array:

something
plus
something
else
plus
other
stuff
something

2. Then sort your array.  Yeah, I know - this is the tough bit because a bad sorting algorithm can make the whole thing grind to a halt, but if you use something like QuickSort (I can give you an algorithm to use) you shouldn't have a problem.  (I'm assuming here that you have to hand-craft this code...? otherwise there are far easier and probably more efficient ways of doing it!).

Now you have:
else
other
plus
plus
something
something
something
stuff

3. Do a single pass through this array, counting each word until it changes, outputting each one with the final count:

1 else
1 other
2 plus
3 something
1 stuff

Ta-Dah!
¦¬)

There are several variations of this.  A good one to get around having to sort them separately is to build up a binary tree as you read them in, rather than a simple array, but that has other complications if you're not allowed to use structures yet.

Hope that helps.

Commented:
I see you point Dave, but I proposed incrementing the current word and getting rid of it so you don't need to search for it internally, otherwise if it's the cost of going through the file it could be argued that doing the sort would be the same as me looking for the occurances because you need to do that anyway while sorting. Anyway I think your message would be easier. What do you think Java?
"... it could be argued that doing the sort would be the same ..."

You're quite right, it's a case of finding and recognising a good balance, which is why I hesitated when I got to the sorting bit.  I think that's the point of the original question: just having two arrays and a simple serial look-up for each word *would* work, and for small input wouldn't suffer much, but the question is about finding a better way to do it.  One of the most important lessons a beginner will learn is that there is very rarely a single right answer - you often need to play with different approaches, make a judgement call, and go with it.

Another very important advantage your method has is that you never need to load the entire input into memory.  If you only ever expect to have relatively small files then memory probably isn't a problem, but as a general rule it's probably not safe to assume that will always be the case.  My method may be faster (given the right sorting algorithm) but is going to be very hungry for memory and would seriously slow down if it started to compete for more memory than was readily available.

Maybe that's all a bit acedemic, but these are things we need to consider now and then so it's all worth airing.

Author

Commented:
Thank you both very much!
Dave, I am going to use your algorithm, can you please send me the quickSort algo?
Thanks!
WTF????

My comments aren't going in.  Can someone let me know if they can read this (and my previous one).

TIA

Commented:
You posted three comments.. two of them say "no text" and the third is your request for clarification.. if you are having some sort of problem maybe it's a one time thing? Or maybe a bug with EE
Thanks kooheji.

Nope, there's something wrong here.  I can't post my code - it's only about 90 lines, I've seen some posts a lot bigger than that.  Hmmmm.

Sorry about this Java.  I've put it on the web for you to get instead:

http://www.omeno.com/ee/QSort.java

As the name suggests, it's a java class - you did want it in java, didn't you?  If not, I've got C and VB versions too.

I've tried it out for performance on my 1.7 P4 and it did 40,000 words (about 250k of text) in 200 milliseconds.

Have fun.  If you have any problems or questions just post another comment here and I'll pick it up.

Author

Commented:
Thanks man!
I wanted it is C, but I'll manage to translate it :)

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial