Programming to find word occurrences in a text document

Given an arbitrary text document written in English, write a program that will generate a
concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies.
Bonus: label each word with the sentence numbers in which each occurrence appeared.

Has anyone done this? Any code would be helpful.
countrymeister Asked:
 
käµfm³d 👽 Commented:
Well "best" is a subjective term  : )

I would create a sorted dictionary of the words, and I would create a custom class to hold the word count and line occurrences for each word. The custom class could look like this:

public class WordData
{
    public int WordCount { get; set; }
    public System.Collections.Generic.List<int> LineOccurrences { get; set; }
}



The dictionary could then look like this:

System.Collections.Generic.SortedDictionary<string, WordData> concordance =
            new System.Collections.Generic.SortedDictionary<string, WordData>();



As you iterate over the lines of the file, and over the words in each line, you record each word in the dictionary. Afterwards you can simply iterate the keys of the sorted dictionary (since it's already sorted) and print the data for each word. Putting it all together could look something like this:

class Program
{
    static void Main(string[] args)
    {
        System.Collections.Generic.SortedDictionary<string, WordData> concordance =
            new System.Collections.Generic.SortedDictionary<string, WordData>();

        using (System.IO.StreamReader reader = new System.IO.StreamReader("input.txt"))
        {
            int currentLine = 0;

            while (!reader.EndOfStream)
            {
                string line = reader.ReadLine();
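                // Split on common word separators; RemoveEmptyEntries drops the empty
                // strings produced when two separators are adjacent.
                string[] words = line.Split(new string[] { ", ", ". ", " ", ",", ".", " \"", "\" ", "? ", "! ", "?", "!" }, System.StringSplitOptions.RemoveEmptyEntries);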

                currentLine++;

                foreach (string word in words)
                {
                    if (!concordance.ContainsKey(word))
                    {
                        concordance.Add(word, new WordData() { WordCount = 0, LineOccurrences = new System.Collections.Generic.List<int>() });
                    }

                    concordance[word].WordCount++;
                    concordance[word].LineOccurrences.Add(currentLine);
                }
            }
        }

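        // Keys come back in sorted order. Every entry has at least one line
        // occurrence, so reading LineOccurrences[0] below is safe.
        foreach (string word in concordance.Keys)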
        {
            System.Console.Write("{0}\n\tCount: {1}\n\tLine Occurrences: ", word, concordance[word].WordCount.ToString());
            System.Console.Write(concordance[word].LineOccurrences[0].ToString());

            for (int i = 1; i < concordance[word].LineOccurrences.Count; i++)
            {
                System.Console.Write(", {0}", concordance[word].LineOccurrences[i].ToString());
            }

            System.Console.WriteLine();
        }

        System.Console.ReadKey();
    }
}

public class WordData
{
    public int WordCount { get; set; }
    public System.Collections.Generic.List<int> LineOccurrences { get; set; }
}
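
The program above records line numbers, while the bonus in the question asks for sentence numbers. Below is a minimal sketch of one way to do that, not a definitive implementation. It assumes that a sentence ends at a run of `.`, `!` or `?` (so abbreviations such as "e.g." will be split incorrectly) and that words should be compared case-insensitively. The names `SentenceConcordance` and `Build` are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceConcordance
{
    // Maps each word to the 1-based numbers of the sentences it appears in.
    // A word that occurs twice in one sentence gets that number twice,
    // so the list length is also the word's frequency.
    public static SortedDictionary<string, List<int>> Build(string text)
    {
        // Assumption: "The" and "the" are the same word.
        var result = new SortedDictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

        // Assumption: sentences are separated by runs of '.', '!' or '?'.
        string[] sentences = Regex.Split(text, @"[.!?]+");
        int sentenceNumber = 0;

        foreach (string sentence in sentences)
        {
            MatchCollection words = Regex.Matches(sentence, @"\w+");
            if (words.Count == 0)
            {
                continue; // skip fragments that contain no words
            }

            sentenceNumber++;

            foreach (Match match in words)
            {
                List<int> occurrences;
                if (!result.TryGetValue(match.Value, out occurrences))
                {
                    occurrences = new List<int>();
                    result.Add(match.Value, occurrences);
                }
                occurrences.Add(sentenceNumber);
            }
        }

        return result;
    }

    public static void Main()
    {
        var concordance = Build("The cat sat. The dog ran! Did the cat run?");
        foreach (var entry in concordance)
        {
            Console.WriteLine("{0}: count {1}, sentences {2}",
                entry.Key, entry.Value.Count, string.Join(", ", entry.Value));
        }
    }
}
```

To process a file instead of a string, the whole text could be read with `System.IO.File.ReadAllText` and passed to `Build`, since sentences can span several lines.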


 
käµfm³d 👽 Commented:
Is this homework?
 
countrymeister (Author) Commented:
No, I am just curious about the best way to do it.
 
käµfm³d 👽 Commented:
In the above, you would need to tweak the Split() part. I have included some common word separators, but I'm sure I missed a few (the colon [ : ], for example).
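
As a rough illustration of such a tweak, the separators could be kept in one array of single characters so that adding the colon or a bracket is a one-character change. This is only a sketch: the separator set below is an assumption and is still not exhaustive, and the names `SeparatorDemo` and `SplitWords` are made up for the example.

```csharp
using System;

public static class SeparatorDemo
{
    // Assumed separator set: whitespace plus common punctuation.
    // Extend it with any other characters that show up in your documents.
    private static readonly char[] Separators =
    {
        ' ', '\t', ',', '.', ':', ';', '?', '!', '"', '(', ')'
    };

    public static string[] SplitWords(string line)
    {
        // RemoveEmptyEntries discards the empty strings that appear
        // between adjacent separators such as ", " or ": ".
        return line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        foreach (string word in SplitWords("He said: \"stop, now!\" (twice)."))
        {
            Console.WriteLine(word);
        }
    }
}
```

Because every separator is a single character, pairs such as `", "` no longer need their own entries; the empty strings between adjacent separators are simply dropped.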
 
countrymeister (Author) Commented:
Thanks, Kaufmed.

I will try it.
 
countrymeister (Author) Commented:
Thanks
 
mareanoben Commented:
Instead of hard-coding each non-word symbol as in the code above, we can use Regex.Split:
                .
                .
                .
                string line = reader.ReadLine();
                string[] result = splitWordsOnly(line);
                // Requires "using System.Linq;" for Where() and ToArray().
                string[] words = result.Where(w => w != string.Empty).ToArray();

The splitWordsOnly method looks like this:

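    // Requires "using System.Text.RegularExpressions;".
    private static string[] splitWordsOnly(string line)
    {
        // \W+ matches one or more non-word characters, so any run of
        // punctuation or whitespace acts as a single separator.
        return Regex.Split(line, @"\W+");
    }

The LINQ query filters out the empty strings that Regex.Split returns when a line starts or ends with non-word characters.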
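
For reference, here is a small self-contained sketch of the same idea with the filtering folded into one method. The class name `RegexSplitDemo` and the sample sentence are made up for the illustration; the method mirrors `splitWordsOnly` from the comment above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    public static string[] SplitWordsOnly(string line)
    {
        // Split on runs of non-word characters, then drop the empty strings
        // that Regex.Split yields when the line starts or ends with one.
        return Regex.Split(line, @"\W+")
                    .Where(w => w != string.Empty)
                    .ToArray();
    }

    public static void Main()
    {
        string[] words = SplitWordsOnly("\"Hello,\" she said: hello again!");
        Console.WriteLine(string.Join("|", words)); // prints Hello|she|said|hello|again
    }
}
```

Note that `\W` also treats the apostrophe as a separator, so "don't" is split into "don" and "t", while digits and underscores count as word characters. Whether that is acceptable depends on the documents being indexed.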
Question has a verified solution.
