Solved

How to find an element in an HTML document

Posted on 2003-11-27
Last Modified: 2013-11-19
Hi,

I want to build a tool for data mining from an HTML page. I want the user to
select an element from a web page, and train my application to recognize it
in later versions of that page. For example, suppose the user wants to extract some
data from a financial web site. He wants to extract his total balance, plus the table
of the latest transactions. To do that, he highlights the elements
inside the HTML page. The application should then analyze the
structure of those elements and learn how to find them in similar pages (even
when they are not identical). What I really need is an algorithm to
"understand" a single element (by its structure, position in the page, or any
other method), and then look in a new page and choose the most
similar element (which should probably be the right one).

I want to find a way to represent each element by a number. That way, I can look for "similar" items (or sort them by most matching).
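
One way to read "represent each element by a number" is to reduce every element to a small signature and score candidates against the trained one. The sketch below is only an illustration under assumptions that are not in the question: it uses Python's standard html.parser instead of IHTMLDocument2, the signature (tag path, attributes, direct text) is one possible choice, and the names SignatureCollector, similarity and best_match as well as the 0.5/0.4/0.1 weights are made up for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SignatureCollector(HTMLParser):
    """Record a (path, attrs, text) signature for every element in a page."""

    def __init__(self):
        super().__init__()
        self.open = []        # signatures of the currently open elements
        self.elements = []    # all signatures, in document order

    def handle_starttag(self, tag, attrs):
        parent_path = self.open[-1]["path"] if self.open else ()
        element = {"path": parent_path + (tag,),
                   "attrs": {f"{k}={v}" for k, v in attrs},
                   "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self.open.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i]["path"][-1] == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        if self.open:
            self.open[-1]["text"] += data


def collect(html):
    parser = SignatureCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


def similarity(a, b):
    """Score in [0, 1]; the weights are arbitrary and would need tuning."""
    path_score = SequenceMatcher(None, a["path"], b["path"]).ratio()
    union = a["attrs"] | b["attrs"]
    attr_score = len(a["attrs"] & b["attrs"]) / len(union) if union else 1.0
    text_score = SequenceMatcher(None, a["text"].strip(),
                                 b["text"].strip()).ratio()
    return 0.5 * path_score + 0.4 * attr_score + 0.1 * text_score


def best_match(trained, html):
    """Return the element of the new page that scores highest."""
    return max(collect(html), key=lambda e: similarity(trained, e))


OLD = ('<html><body><div id="acct">'
       '<span class="balance">1,200.00</span></div></body></html>')
NEW = ('<html><body><div class="banner">Sale!</div><div id="acct">'
       '<b>Total</b><span class="balance">1,350.75</span></div></body></html>')

# "Training": the element the user highlighted in the old page.
trained = next(e for e in collect(OLD) if "class=balance" in e["attrs"])
match = best_match(trained, NEW)
print(match["text"])  # prints 1,350.75
```

Sorting the candidates by this score instead of taking the maximum gives the "most matching first" list. The text weight is kept low on purpose, since values such as a balance are expected to change between visits.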

Does anyone have an idea for this?

Regards,
    Gilad Novik
Question by:gilad_no
10 Comments
 
LVL 19

Accepted Solution

by:
RanjeetRain earned 125 total points
ID: 9834010
You will be better off studying the SGML standard, from which markup languages such as HTML and XML derive. The specification is extremely comprehensive and will help you understand how a program can implement HTML (or any other markup language) parsing.

You may also take a shortcut: try going through the source code of some XML parsers. The source code of some of these APIs is still freely available, and reading it may give you a good idea of how to go about it.
0
 
LVL 19

Expert Comment

by:RanjeetRain
ID: 9834018
As far as I can read it, you also want to perform fuzzy searches. Again, read XML implementations, specifically how they parse documents that do not conform to a DTD. That will get you closer to what you want.
0
 

Author Comment

by:gilad_no
ID: 9834029
I don't just want to parse the document. I'm using IHTMLDocument2 to hold it. I'm looking for a way to learn a single element and then retrieve it from a similar page.
0
 
LVL 19

Expert Comment

by:RanjeetRain
ID: 9834093
Well, this is one of those simple things. But I must warn you, all logic will fail if you have a malformed (not well-formed) document.

The algorithm:

1. Maintain a list of recognized elements (take any HTML reference and prepare a list of all HTML elements).
2. Search for tokens and retrieve them from the HTML page.
3. Compare each one with the token you are looking for.
4. If they match, begin pushing subsequent tokens onto a stack until you find the closing token for that token.
5. When you come across a closing element, check whether it matches the token at the top of the stack.
6. If it does, pop it and proceed. If it doesn't, you have a malformed document. Handle the situation.
7. Continue retrieving tokens until you find the closing token for the element you are looking for.
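
The steps above can be sketched roughly as follows. This is not a definitive implementation: it assumes Python's standard html.parser as the tokenizer (the thread itself uses IHTMLDocument2), it replaces the full element list of step 1 with just the set of void elements, and the names ElementExtractor and extract are invented for the example. "Handle the situation" is interpreted here as flagging the page and unwinding the stack to the matching tag.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementExtractor(HTMLParser):
    """Collect the text inside the first element matching a tag and attributes."""

    def __init__(self, tag, required_attrs):
        super().__init__()
        self.tag = tag
        self.required = set(required_attrs.items())
        self.stack = []          # open tags seen since the target opened
        self.text = []           # text collected inside the target
        self.done = False
        self.malformed = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.append(tag)            # step 4: push nested tokens
        elif tag == self.tag and self.required <= set(attrs):
            self.stack.append(tag)            # step 3: the target was found

    def handle_endtag(self, tag):
        if self.done or not self.stack or tag in VOID_TAGS:
            return
        if self.stack[-1] == tag:             # steps 5-6: closer matches the top
            self.stack.pop()
        else:
            self.malformed = True             # mismatched closer
            if tag in self.stack:             # recover: unwind to the match
                while self.stack.pop() != tag:
                    pass
        if not self.stack:
            self.done = True                  # step 7: the target is closed

    def handle_data(self, data):
        if self.stack and not self.done:
            self.text.append(data)


def extract(html, tag, attrs):
    """Return (text inside the element, whether bad nesting was seen)."""
    parser = ElementExtractor(tag, attrs)
    parser.feed(html)
    parser.close()
    pieces = [t.strip() for t in parser.text if t.strip()]
    return " ".join(pieces), parser.malformed


PAGE = ('<div><table id="tx"><tr><td>Rent</td><td>-900</td></tr></table>'
        '<p>footer</p></div>')
print(extract(PAGE, "table", {"id": "tx"}))   # prints ('Rent -900', False)
```

Note that this only retrieves an element whose tag and attributes are already known; it does not address finding the element again after the page layout changes.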

0
 
LVL 17

Assisted Solution

by:rstaveley
rstaveley earned 125 total points
ID: 9837124
You can get the IHTMLElement from http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/document2/elementfrompoint.asp ... but that assumes that its position remains constant.

Are the HTML documents of your own making or do you need to design something to deal with any old HTML document?

0
 

Author Comment

by:gilad_no
ID: 9839660
The documents aren't mine. I need to extract data from several web pages (which the user chooses). The author of the site may change the pages from time to time, so I need to find a way to deal with that.
0
 
LVL 17

Expert Comment

by:rstaveley
ID: 9841892
> may change

That's a tough one. Are you able to agree on a convention with the author to make the parsing easier?

e.g.

    <!-- Here it is -->
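
If such a convention could be agreed on, picking up the element that follows the marker comment is straightforward. A minimal sketch, assuming Python's standard html.parser and assuming the convention is "the first non-void element after the comment holds the data"; the class name MarkerReader and the helper read_marked are hypothetical.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they do not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MarkerReader(HTMLParser):
    """Collect the text of the first element that follows a marker comment."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.armed = False    # marker seen, waiting for the next start tag
        self.depth = 0        # nesting depth inside the captured element
        self.text = []
        self.done = False

    def handle_comment(self, data):
        if not self.done and data.strip() == self.marker:
            self.armed = True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1               # nested element inside the capture
        elif self.armed and not self.done:
            self.armed = False
            self.depth = 1                # capture starts here

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True              # captured element is closed

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def read_marked(html, marker):
    parser = MarkerReader(marker)
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip()


PAGE = ('<p>Hello</p><!-- Here it is -->'
        '<span class="x">Balance: <b>42.10</b></span><p>bye</p>')
print(read_marked(PAGE, "Here it is"))    # prints Balance: 42.10
```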
0
 
LVL 7

Expert Comment

by:jconde
ID: 9844480
This is a complex problem. It is usually solved with advanced genetic programming / algorithms (AI) that "learn" on the run. This kind of problem calls for a lot of knowledge about handling trees (red-black trees if everything is stored in memory, B+ trees for reads from disk). Once you have a structure of the "elements" that might be of interest to the user, a fuzzy search, as previously stated, is the way to go.
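
As a very small illustration of the fuzzy-search part only (not of the genetic or learning approach), candidate texts pulled from a new page can be ranked against a label remembered from training. This sketch assumes Python's standard difflib; the function name find_label and the 0.6 cutoff are arbitrary choices for the example.

```python
from difflib import get_close_matches


def normalize(text):
    """Lower-case and collapse whitespace so cosmetic changes do not matter."""
    return " ".join(text.lower().split())


def find_label(trained_label, candidate_texts, cutoff=0.6):
    """Return the candidate closest to the trained label, or None."""
    by_normalized = {normalize(t): t for t in candidate_texts}
    hits = get_close_matches(normalize(trained_label), list(by_normalized),
                             n=1, cutoff=cutoff)
    return by_normalized[hits[0]] if hits else None


texts = ["Recent transactions", "Total Balance:", "Log out"]
print(find_label("Total balance", texts))   # prints Total Balance:
```

A match on the label can then serve as an anchor: the value of interest is often in a sibling or neighbouring cell of the element that carries the label.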
0
 
LVL 9

Expert Comment

by:tinchos
ID: 10286201
No comment has been added lately, so it's time to clean up this TA.
I will leave the following recommendation for this question in the Cleanup topic area:

Split: RanjeetRain {http:#9834093} & rstaveley {http:#9841892}

Please leave any comments here within the next seven days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

Tinchos
EE Cleanup Volunteer
0


Question has a verified solution.
