
Programming to find word occurrences in a text document

Given an arbitrary text document written in English, write a program that will generate a
concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies.
Bonus: label each word with the sentence numbers in which each occurrence appeared.

Has anyone done this? Any code would be helpful.
Asked by: countrymeister
 
käµfm³d 👽 Commented:
Is this homework?
 
countrymeister (Author) Commented:
No, I am just curious about the best way to do it.
 
käµfm³d 👽 Commented:
Well "best" is a subjective term  : )

I would create a sorted dictionary of the words, and I would create a custom class to hold the word count and line occurrences for each word. The custom class could look like this:

public class WordData
{
    public int WordCount { get; set; }
    public System.Collections.Generic.List<int> LineOccurrences { get; set; }
}


The dictionary could then look like this:

System.Collections.Generic.SortedDictionary<string, WordData> concordance =
            new System.Collections.Generic.SortedDictionary<string, WordData>();


As you iterate over the lines of the file, and over the words in each line, you add each word to the dictionary. Afterwards you can simply iterate the keys of the sorted dictionary (they are already in sorted order) and print the data for each word. Note that this records line numbers rather than sentence numbers. Putting it all together could look something like this:

class Program
{
    static void Main(string[] args)
    {
        System.Collections.Generic.SortedDictionary<string, WordData> concordance =
            new System.Collections.Generic.SortedDictionary<string, WordData>();

        using (System.IO.StreamReader reader = new System.IO.StreamReader("input.txt"))
        {
            int currentLine = 0;

            while (!reader.EndOfStream)
            {
                string line = reader.ReadLine();
                string[] words = line.Split(new string[] { ", ", ". ", " ", ",", ".", " \"", "\" ", "? ", "! ", "?", "!" }, System.StringSplitOptions.RemoveEmptyEntries);

                currentLine++;

                foreach (string word in words)
                {
                    if (!concordance.ContainsKey(word))
                    {
                        concordance.Add(word, new WordData() { WordCount = 0, LineOccurrences = new System.Collections.Generic.List<int>() });
                    }

                    concordance[word].WordCount++;
                    concordance[word].LineOccurrences.Add(currentLine);
                }
            }
        }

        foreach (string word in concordance.Keys)
        {
            System.Console.Write("{0}\n\tCount: {1}\n\tLine Occurrences: ", word, concordance[word].WordCount.ToString());
            System.Console.Write(concordance[word].LineOccurrences[0].ToString());

            for (int i = 1; i < concordance[word].LineOccurrences.Count; i++)
            {
                System.Console.Write(", {0}", concordance[word].LineOccurrences[i].ToString());
            }

            System.Console.WriteLine();
        }

        System.Console.ReadKey();
    }
}

public class WordData
{
    public int WordCount { get; set; }
    public System.Collections.Generic.List<int> LineOccurrences { get; set; }
}
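
For the bonus part of the question (sentence numbers rather than line numbers), the same idea can be adapted. What follows is only a sketch under stated assumptions: a sentence is taken to end at '.', '?' or '!' (so abbreviations such as "e.g." and ellipses will be miscounted), words are compared case-insensitively, and an apostrophe counts as part of a word. SentenceConcordance and Build are hypothetical names that do not appear in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SentenceConcordance
{
    // Maps each word to the sentence number of every occurrence.
    // Assumption: '.', '?' and '!' end a sentence; any character that is not
    // a letter, digit or apostrophe ends a word.
    public static SortedDictionary<string, List<int>> Build(string text)
    {
        var result = new SortedDictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        var current = new StringBuilder();
        int sentence = 1;

        // The appended space guarantees that the final word is flushed.
        foreach (char c in text + " ")
        {
            if (char.IsLetterOrDigit(c) || c == '\'')
            {
                current.Append(c);
                continue;
            }

            if (current.Length > 0)
            {
                string word = current.ToString();
                List<int> sentences;
                if (!result.TryGetValue(word, out sentences))
                {
                    sentences = new List<int>();
                    result.Add(word, sentences);
                }
                sentences.Add(sentence);
                current.Clear();
            }

            if (c == '.' || c == '?' || c == '!')
            {
                sentence++;
            }
        }

        return result;
    }

    public static void Main()
    {
        // "input.txt" is the same placeholder file name used above.
        string text = File.ReadAllText("input.txt");
        foreach (KeyValuePair<string, List<int>> entry in Build(text))
        {
            Console.WriteLine("{0}\n\tCount: {1}\n\tSentences: {2}",
                entry.Key, entry.Value.Count, string.Join(", ", entry.Value));
        }
    }
}
```

Reading the whole file at once keeps the sentence counter correct when a sentence spans several lines; for very large files the same loop could be fed from a StreamReader instead.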

 
käµfm³d 👽 Commented:
In the above, you would need to tweak the Split() part. I have included some common word separators, but I'm sure I missed a few (the colon [ : ], for example).
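
One possible way to make that tweak, offered as a sketch rather than a complete answer: keep the separators in a single char array, so adding a missed one is a one-character change. Because RemoveEmptyEntries discards the empty strings produced by adjacent separators, the two-character entries such as ", " are no longer needed. LineSplitter, SplitWords and the separator list itself are hypothetical and certainly not exhaustive.

```csharp
using System;

public static class LineSplitter
{
    // Hypothetical, non-exhaustive separator list; extend it to suit the input text.
    private static readonly char[] Separators =
    {
        ' ', '\t', ',', '.', '?', '!', ':', ';', '"', '(', ')', '[', ']'
    };

    // Splits a line on any of the separators and drops the empty entries
    // that appear between adjacent separators.
    public static string[] SplitWords(string line)
    {
        return line.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Prints He, said, Wait, stop and now, one per line.
        foreach (string word in SplitWords("He said: \"Wait; stop (now)!\""))
        {
            Console.WriteLine(word);
        }
    }
}
```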
 
countrymeister (Author) Commented:
Thanks, Kaufmed. I will try it.
 
countrymeister (Author) Commented:
Thanks
 
mareanoben Commented:
Instead of hard-coding each non-word symbol as in the code above, we can use Regex.Split:
                .
                .
                .
                string line = reader.ReadLine();
                string[] result = splitWordsOnly(line);
                // Where() and ToArray() require "using System.Linq;".
                string[] words = result.Where(w => w != string.Empty).ToArray();

The splitWordsOnly method looks like this:

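    // Requires "using System.Text.RegularExpressions;".
    private static string[] splitWordsOnly(string line)
    {
        // \W+ matches one or more non-word characters, i.e. anything other than
        // letters, digits and the underscore. No RegexOptions are needed here.
        return Regex.Split(line, @"\W+");
    }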

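The LINQ query filters out the empty strings that Regex.Split returns when a line begins or ends with a non-word character. Note that \W+ also splits on apostrophes, so "don't" becomes "don" and "t".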
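
A possible variation on the same idea, sketched here as an assumption rather than as part of the answer above: match the words directly with Regex.Matches instead of splitting on what lies between them. That way no empty strings are produced and no filtering step is needed. WordMatcher and MatchWords are hypothetical names, and the pattern (word characters optionally joined by apostrophes) is just one plausible definition of a word.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Returns every run of word characters in the line, keeping inner
    // apostrophes so that "don't" stays a single word.
    public static string[] MatchWords(string line)
    {
        return Regex.Matches(line, @"\w+(?:'\w+)*")
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Prints Don't, stop, she and said, one per line.
        foreach (string word in MatchWords("\"Don't stop,\" she said."))
        {
            Console.WriteLine(word);
        }
    }
}
```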