Solved

Count word frequency using Binary Search Tree

Posted on 2004-04-10
6
2,354 Views
Last Modified: 2012-08-14
Hi ppl,

I have been trying to code a program which counts the number of times a word appears in  a text file and in which line it occurs. I have a simple BST program which was given to me by my friend. What the simple program does is simply do a BST on integers. Mine should have a class called record which contains details about the frequency, line number in which it appears n the word itself. Each node should contain a word(class Record). So far i have managed to break apart the words and from the text file. I will be posting my header file and driver file in a sec. Just use any small text file.

Thanks,
Arjun
0
Comment
Question by:arjun84
  • 4
  • 2
6 Comments
 
LVL 1

Author Comment

by:arjun84
Comment Utility
Here is the header file:

#ifndef BST_H
#define BST_H

typedef int elem;

class record {
public:
      record() {sentence = ("\0"); element = ("\0"); freq_count = 0;}
      //~record();
      int getSentence(int);
private:
      string sentence;
      string element;
      int freq_count;
};

class node {                     // Binary tree node class
public:
    elem element;                // The node's value
    node* left;                    // Pointer to left child
    node* right;                   // Pointer to right child

    node() { left = right = NULL; }
    node(elem tmp2, node* l = NULL, node* r = NULL)
      { element = tmp2; left = l; right = r; }
   ~node() { }                  // Destructor

    elem value() const { return element; };
    void setValue(elem val) { element = val; }
    node* leftchild() const { return left; }
    node* rightchild() const { return right; }
    bool isLeaf() const            // Return TRUE if is a leaf
        { return (left == NULL) && (right == NULL); }
};

class tree {
private:
    node* root;
    void clearStub(node*);       // Helper functions for public routines
    void insertStub(node*&, const elem);
    node*deletemin(node*&);
    void removeStub(node*&, int);
    elem findStub(node*, int) const;
    void printStub(const node*, int) const;
public:
    tree() { root = NULL; }
   ~tree() { clearStub(root); }
    void clear() { clearStub(root); root = NULL; }
    void insert(elem val) { insertStub(root, val); }
    void remove(const val) { removeStub(root, val); }
    elem find(const val) const { return findStub(root, val); }
    bool isEmpty() const { return root == NULL; }
    void print() const {
        if (root == NULL) cout << "The tree is empty.\n";
        else printStub(root, 0);
    }
};
#endif

####################################################################

And the Driver File:

#include <iostream>
#include <assert.h>
#include <fstream>
#include <string>

using namespace std;

#include "bst.h"

#define      SIZE      10

#define UNSUCCESSFUL -1
#define key(X) (int)(X)
#define visit(X) cout << rt->element << " ";

//Global Variables
string sentence;
int numberOfLines;

void tree::clearStub(node* rt) {// Clear all nodes from the tree
    if (rt == NULL) return;
    clearStub(rt->leftchild());
    clearStub(rt->rightchild());
    delete rt;
}

void tree::insertStub(node*& rt, const elem val) {
    if (rt == NULL)                  // Empty tree: create node
        rt = new node(val, NULL, NULL);
    else if (key(val) < key(rt->value()))
        insertStub(rt->left, val);     // Check left
    else insertStub(rt->right, val); // Check right
}

void tree::removeStub(node*& rt, int val) {
    if (rt == NULL) cout << val << " is not in the tree.\n";
    else if (val < key(rt->value()))       // Check left
        removeStub(rt->left, val);
    else if (val > key(rt->value()))       // Check right
        removeStub(rt->right, val);
    else {                                 // Found it: remove it
        node* temp = rt;
        if (rt->left == NULL)            // Only a right child
            rt = rt->right;                  // so point to right
        else if (rt->right == NULL)          // Only a left child
            rt = rt->left;                   // so point to left
        else {                               // Both children are non-empty
            temp = deletemin(rt->right);// Replace with min value
        rt->setValue(temp->value());       // in right subtree
        }
        delete temp;                   // Return node to free store
    }
}

node* tree::deletemin(node*& rt) {
    assert(rt != NULL);       // There must be a node to delete
    if (rt->left != NULL)     // Continue left
        return deletemin(rt->left);
    else                      // Found it
        { node* temp = rt; rt = rt->right; return temp; }
}

elem tree::findStub(node* rt, int val) const {
    if (rt == NULL) return UNSUCCESSFUL; // Empty tree: nothing found
    else if (val < key(rt->value()))             // Check left
        return findStub(rt->leftchild(), val);
    else if (val == key(rt->value())) return rt->value();
    else return findStub(rt->rightchild(), val); // Check right
}

// Print the nodes of a tree; use indentation to indicate tree shape
void tree::printStub(const node *rt, int level) const {
    if (rt == NULL) return;               // Empty tree
        printStub(rt->rightchild(), level+1);  // Print left subtree
    for (int i=0; i<level; i++)           // Indent to level
        cout << "    ";
    cout << rt->value() << "\n";          // Print node value
    printStub(rt->leftchild(), level+1); // Print right subtree
}

void preorder(node* rt)        // rt is the root of a subtree
{
    if (rt == NULL) return;         // Empty subtree
    visit(rt);              // visit performs whatever action is desired
    preorder(rt->leftchild());
    preorder(rt->rightchild());
}

int record::getSentence(int numberOfLines)
{
      numberOfLines = 0;
      int count = 1;
      char *tokenPtr;
      char string[SIZE][101];

      ifstream inFile1("testing.txt", ios::in);//This is the file I used, U may change it.

      if(!inFile1)
      {
            cerr<<"ERROR: File does not exist"<<endl;
            void main();
      }

      while(!inFile1.eof())
      {
            getline(inFile1, sentence);
            numberOfLines++;//Count number of lines
      }

      inFile1.close();

      ifstream inFile2("testing.txt", ios::in);//Change to the same name here as well..

      if(!inFile2)
      {
            cerr<<"ERROR: File does not exist"<<endl;
            void main();
      }

      while(numberOfLines>0)
      {
            cout<<"\n\nThis is line number: "<<count<<endl;
            inFile2.getline(string[numberOfLines], 100);
            tokenPtr = strtok(string[numberOfLines], " ,.?!");

            while (tokenPtr != NULL)
            {
                  cout<<tokenPtr<<endl;
                  tokenPtr = strtok(NULL, " ,.?!"); //Using delimiters " ,.?!" to seperate words
            }
            numberOfLines--;
            count++;
      }


      cout<<"Number of lines in the file are: "<<count - 1<<endl;

      return 0;
}

void main()
{
      char choice;
      record tmp;

      do{
            cout<<"\n\n(1) Load Text"<<endl;
            cout<<"(2) Display Sorted Words"<<endl;
            cout<<"(3) Exit"<<endl;

            cout<<"Please select an option: ";
            cin >>choice;

            switch(toupper(choice))
            {
            case 'L': tmp.getSentence(numberOfLines);
                  break;
            case 'D':
                  break;
            case 'E':
                  break;
            };
      }while(toupper(choice) != 'E');
      
}
//The code given below is what was used for testing the BST.
//You can use it too, just replace the whole main body with this stuff.
    /*tree t;
    cout << "insert 30\n"; t.insert(30);  
    cout << "insert 13\n"; t.insert(13);
    cout << "insert 50\n"; t.insert(50);
    cout << "insert 10\n"; t.insert(10);
    cout << "insert 20\n"; t.insert(20);
    cout << "insert 40\n"; t.insert(40);
    cout << "insert 70\n"; t.insert(70);
    cout << "insert 35\n"; t.insert(35);
    cout << "insert 90\n"; t.insert(90);
    t.print();
    cout << "remove 13\n"; t.remove(13);  
    t.print();
    cout << "remove 50\n"; t.remove(50);  
    t.print();
    t.clear();
    t.print();*/

########################################################
I am expecting the output something like this:
Word    Frequency    LineNum
and       5                 2, 4
for        3                  1, 3
what     2                  2,3



Note: It will be better if the words are displayed in alphabetical order.

Thanks,
Arjun
0
 
LVL 11

Expert Comment

by:bcladd
Comment Utility
(1) What is your question?

(2) In InsertStub
  - Why is val passed by constant value? It would seem that constant reference would make more sense
  - Why are you using the key() macro? Or rather, why haven't you defined the key() macro to work with your data?
  - What _is_ the sorting order you expect to use in your BST? If your output should be sorted in word order, doesn't it make sense to use the words as the key for the BST?
 
(3) General question: If you're keeping count and line number information, isn't it possible for a word to appear on multiple lines? I am wondering what you will do in that case. Also, the code you show does not handle nodes with duplicate keys (that is, it always inserts a new node). You will want to make sure you know what happens the second time your code encounters "the" or "a" in the input (assuming English input file).

Good luck, -bcl
0
 
LVL 1

Author Comment

by:arjun84
Comment Utility
First of all thanks for such a fast response.

(1) My main question is how to write a BST program which stores the words and their frequency of appearence in the text file.
Like for example if u have a text file like this:
Peter brought some butter but the butter was bitter
so peter brought some better butter to make the
bitter butter better.

The program must give output like this:
Word     Frequency      Line Numbers
peter      2                   1, 2
brought  2                    1, 2
better     2                    2, 3
and so on...

(2) The code is my friends and I don't understand all of it... :\ But I guess the const does not make a difference anyway. You can remove it...
I am not entirely sure what the key macro does, but I guess its like comparing the values to see which tree to go to next rite? As in go to the left part or the right part of the tree and then insert the new node there.
By the sorting order you mean post, pre or inorder rite? Thats were I am not sure as well.. What I want is just to display in alphabetical order. So I think only inorder will work? Not sure there though.

(3) The file I have pasted here is the unmodified one. It uses val in the BST. val is just a int data type. What I have been trying is to completely remove the val thing and replace it with my class. So instead of storing integers in the node's data, I will have the class record which contains the word, its frequency and the line numbers in which it appears. And yah, the input file is in English. We can leave aside the duplicate nodes aside first. I can handle that.

Thanks once again,
Arjun
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 11

Accepted Solution

by:
bcladd earned 200 total points
Comment Utility
What I was trying to say in my previous post is that this sounds like homework. The experts here are always ready to help you solve your homework but we cannot do it for you.

Thus your failure to understand the code you are basing your work on means you have some work to do.

I agree that you want to replace the data in the nodes with some more complex structure. Using your class, record (you might want to come up with a more descriptive name than that), you should change the BST code to store records. That will require redefining the key macro (among other things). As your friend has not decided to include comments describiing the function of the key macro, permit me to explain it: As you know, a BST is a data structure that stores information in some sorted order (the S in BST). The code shown uses the key macro to extract the sort key from the value stored in a BST node. For the original purpose, that means casting it to an int (technically the cast is unnecessary in the original use but that is not important); for your purposes it means extracting the part of the element that is going to be used to sort the BST. That sounds a lot like it should be the word itself, right? So the key macro should take a reference to a record and return the word value stored in the record. Note that macros do have some pitfalls so you will want to be careful when you write the new key.

Alternatively you can just remove key by writing your comparisons out. That might be preferable here as it will help you understand the changes you are making as you make them.

Suggestion: Comment out everything but the insert and print routines and modify the driver to just insert four or five words (not read from a file, just hard coded in). That way you can look at the tree, knowing what it should look like because you can calculate it by hand. Make sure you comment out the routines in header files, too and see if you can get it to compile. A couple of pointers: elem is the type of elements stored in the tree. You will want to change the typedef of elem; as mentioned above, you will want to modify key to handle records; you will also need to modify the visit macro OR overload the output operator to handle the record type.

Hope this helps,
-bcl
0
 
LVL 1

Author Comment

by:arjun84
Comment Utility
After reading your description and some other online descriptions of the key macro, I think i found my fault. Need to change my code and try though. But I am pretty sure it lies in the usage of the macro key. Yah anyway I don't ask home work questions here :) Haha.. you can check my previous questions :)

Thanks for the guidelines.. Will get back to you ASAP after i try it out agian.

Thanks yet again
Arjun
0
 
LVL 1

Author Comment

by:arjun84
Comment Utility
Yep finally I got it to work a few days back..
Thanks for helping!

Arjun
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

This article shows you how to optimize memory allocations in C++ using placement new. Applicable especially to usecases dealing with creation of large number of objects. A brief on problem: Lets take example problem for simplicity: - I have a G…
Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
The goal of the video will be to teach the user the difference and consequence of passing data by value vs passing data by reference in C++. An example of passing data by value as well as an example of passing data by reference will be be given. Bot…
The viewer will be introduced to the technique of using vectors in C++. The video will cover how to define a vector, store values in the vector and retrieve data from the values stored in the vector.

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now