I want to build a tool for mining data from an HTML page. I want the user to
select an element on a web page, and train my application to recognize it
when the page is later updated. For example, suppose the user wants to extract some
data from a financial web site: his total balance, plus the table
of the latest transactions. He would highlight those elements
inside the HTML page. The application should then analyze the
structure of the highlighted elements, and learn how to find them in similar pages (even
when they are not identical). What I really need is an algorithm to
"understand" a single element (by its structure, position in the page, or any
other method), and then look at a new page and choose the most
similar element (which should probably be the right one).
I want to find a way to represent each element by a number. That way, I can look for "similar" items (or sort candidates by how closely they match).
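To make the idea concrete, here is a rough sketch of what I have in mind, using only Python's standard `html.parser`. It does not reduce an element to one number. Instead it describes each element by a set of features (tag, depth, ancestor path, id, classes) and gives every candidate in the new page a similarity score against the highlighted element, using Jaccard similarity. The names `ElementCollector`, `extract`, `jaccard` and `best_match`, the choice of features, and the sample pages are all my own assumptions for illustration, not a known-good method.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementCollector(HTMLParser):
    """Collects a set of structural features for every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags of the currently open ancestors
        self.elements = []  # one (attrs dict, feature set) pair per element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        features = {f"tag:{tag}", f"depth:{len(self.stack)}"}
        features.add("path:" + "/".join(self.stack + [tag]))
        if self.stack:
            features.add(f"parent:{self.stack[-1]}")
        if attrs.get("id"):
            features.add(f"id:{attrs['id']}")
        for cls in (attrs.get("class") or "").split():
            features.add(f"class:{cls}")
        self.elements.append((attrs, features))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract(html):
    """Return the (attrs, features) pairs for every element in the page."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


def jaccard(a, b):
    """Similarity in [0, 1]: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(target_features, new_html):
    """Return (score, attrs) of the most similar element in the new page."""
    scored = [(jaccard(target_features, feats), attrs)
              for attrs, feats in extract(new_html)]
    if not scored:
        raise ValueError("the new page contains no elements")
    return max(scored, key=lambda pair: pair[0])


# Made-up sample pages: the new version adds an ad block and extra classes.
old_page = ('<html><body><div class="summary">'
            '<span id="balance" class="amount">100</span></div>'
            '<table class="tx"><tr><td>a</td></tr></table></body></html>')
new_page = ('<html><body><div class="ad">x</div><div class="summary box">'
            '<span id="balance" class="amount big">250</span></div>'
            '<table class="tx"><tr><td>b</td></tr></table></body></html>')

# Pretend the user highlighted the balance element in the old page.
target = next(f for a, f in extract(old_page) if a.get("id") == "balance")
score, attrs = best_match(target, new_page)
print(score, attrs)
```

Sorting the `scored` list instead of taking the maximum would give the "sort by best match" behaviour. I suspect the features would need weights (an `id` match should probably count for more than a depth match), but I don't know how to choose them.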
Does anyone have an idea for this?