I'm trying to build a program that sorts a stream of statements into relevant and irrelevant ones with respect to a particular domain (subject area). What algorithms and frameworks would be helpful?
I shall clarify further with an example.
Take a subject like economics. Given a set of sentences and phrases, I want to decide for each one whether it belongs to the field of economics or not. Something about cooking or the weather should go in the irrelevant category, while something about profits or GDP should go in the relevant category. I understand that I will need some sort of knowledge base for the particular domain, i.e. economics.
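To make the goal concrete, here is a rough sketch of the behaviour I'm after, using the most naive approach I can think of: a hand-picked keyword list standing in for the knowledge base. The class name `KeywordRelevanceFilter`, the method `isRelevant`, and the seed terms are all placeholders I made up for illustration, and I assume a real solution would replace the keyword lookup with something learned from domain data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeywordRelevanceFilter {
    // Hypothetical, hand-picked seed terms standing in for a real economics knowledge base.
    private static final Set<String> ECONOMICS_TERMS = new HashSet<>(Arrays.asList(
            "gdp", "profit", "profits", "inflation", "market", "markets", "trade", "tax", "taxes"));

    // A statement counts as relevant if at least one of its words is a seed term.
    static boolean isRelevant(String statement) {
        // Lowercase, then split on anything that is not a letter to get rough word tokens.
        String[] tokens = statement.toLowerCase(Locale.ROOT).split("[^a-z]+");
        for (String token : tokens) {
            if (ECONOMICS_TERMS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up sample statements: one about economics, one about cooking, one about weather.
        String[] stream = {
            "Quarterly profits rose as GDP growth picked up.",
            "Simmer the sauce for ten minutes before serving.",
            "Expect light rain and a cold wind tomorrow."
        };
        for (String statement : stream) {
            String label = isRelevant(statement) ? "relevant" : "irrelevant";
            System.out.println(label + ": " + statement);
        }
    }
}
```

This obviously breaks down quickly (no stemming, no handling of ambiguous words, no notion of context), which is why I'm asking what a proper approach would look like.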
I need pointers to where I can start.
How do I go about collecting the domain data?
What should the basic processing structure of the system look like?
I'm planning to use Java for the implementation.
Tutorials would also be very much appreciated.