Solved

process HTML documents

Posted on 2005-04-25
Medium Priority
1,465 Views
Last Modified: 2013-11-17
Hi!:

I need to extract the content from an HTML file: only the meaningful text, without the markup tags, special characters, or HTML character entities. Is there any way to pull all of that text out of an HTML file?

READ YA
Mike
Question by:m1k3
26 Comments
 
LVL 5

Accepted Solution

by:
Greybird earned 260 total points
ID: 13864390
If you want the text content of the HTML, you could use a TCppWebBrowser (named Browser in the example below) and open the page in it.

Then read the body's InnerText:
AnsiString aTextContent= "";
Variant vDocument = Browser->ControlInterface->Document;
if (((IDispatch *)vDocument) != NULL)
{
   Variant vBody = vDocument.OlePropertyGet("Body");
   Variant vTextContent = vBody.OlePropertyGet("InnerText");
   aTextContent = vTextContent;
}
 
LVL 19

Assisted Solution

by:Gary Benade
Gary Benade earned 40 total points
ID: 13864621
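Once you have the body text, you could strip the control tags out by doing something like this (vTextContent is assumed to be a String here).

String strippedText;
int lt;
while ((lt = vTextContent.Pos("<")) > 0)
{
   // keep the text in front of the tag; the space stops words from being glued together
   strippedText += vTextContent.SubString(1, lt - 1).Trim() + " ";
   vTextContent.Delete(1, lt - 1);            // the string now starts at '<'
   int gt = vTextContent.Pos(">");
   if (gt == 0) { vTextContent = ""; break; } // unterminated tag: drop the rest
   vTextContent.Delete(1, gt);
}
strippedText += vTextContent.Trim();          // text after the last tag

If you already have the HTML on disk, you could probably strip the header with this (assuming vTextContent contains the HTML file in its entirety and the file has a </head> tag):

vTextContent.Delete(1, vTextContent.LowerCase().Pos("</head>")+6);

and then run the code above on what's left in vTextContent to get strippedText.

e.g.

    FILE * tmp;
    char buffer[10240] = "";   // zero-filled, so an unreadable file yields an empty string
    if ((tmp = fopen("c:\\web sites\\autodealer\\homepage.htm", "r")) != NULL)
    {
      // read at most sizeof(buffer) - 1 bytes so there is room for the terminator;
      // anything beyond that is silently cut off
      size_t n = fread(buffer, 1, sizeof(buffer) - 1, tmp);
      buffer[n] = '\0';
      fclose(tmp);
    }
    String strippedText;
    String vTextContent = buffer;
    int headEnd = vTextContent.LowerCase().Pos("</head>");
    if (headEnd > 0)
       vTextContent.Delete(1, headEnd + 6);   // remove everything up to and including "</head>"
    int lt;
    while ((lt = vTextContent.Pos("<")) > 0)
    {
       strippedText += vTextContent.SubString(1, lt - 1).Trim() + " ";
       vTextContent.Delete(1, lt - 1);
       int gt = vTextContent.Pos(">");
       if (gt == 0) { vTextContent = ""; break; }
       vTextContent.Delete(1, gt);
    }
    strippedText += vTextContent.Trim();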
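
For readers without the VCL String class, the same tag-stripping idea can be sketched in standard C++. This is only an illustration under stated assumptions: stripTags is a hypothetical helper name, not part of any library, and it makes no attempt to handle comments, the bodies of script or style elements, or character entities such as &amp;.

```cpp
#include <string>

// Hypothetical helper: copies everything that lies outside <...> pairs.
// A single space replaces each tag so that adjacent words are not glued together.
std::string stripTags(const std::string& html)
{
    std::string text;
    bool inTag = false;
    for (char c : html)
    {
        if (c == '<')
        {
            inTag = true;               // start skipping
        }
        else if (c == '>' && inTag)
        {
            inTag = false;              // tag finished, resume copying
            if (!text.empty() && text.back() != ' ')
                text += ' ';
        }
        else if (!inTag)
        {
            text += c;                  // ordinary text (a stray '>' is kept as text)
        }
    }
    // drop trailing spaces left behind by closing tags
    while (!text.empty() && text.back() == ' ')
        text.pop_back();
    return text;
}
```

An unterminated tag at the end of the input is simply dropped, which matches the early exit in the loop above.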
 
LVL 5

Expert Comment

by:Greybird
ID: 13869145
My code does not return any tags: InnerText already holds only the text.

It would have returned them if I had read InnerHTML instead...

Author Comment

by:m1k3
ID: 13871158
Thanks for the quick response.... I'll check the tip later today. Thanks (Greybird & hobbit72)

I was getting tired of doing everything the old way.... retrieving character by character... etc, etc... thanks again

READ YA
m1k3
 

Author Comment

by:m1k3
ID: 13871181
...by the way, I almost forgot... does anyone know where I can find a manual for TCppWebBrowser that explains every function, procedure, etc.? I need it to document the programming process...

-THXIADV- READ YA
                   m1k3
 

Author Comment

by:m1k3
ID: 13873200
The files I intend to analyze are in a specific directory on my hard drive, so where do I assign the URL or the path of my files? In hobbit72's comment the path is passed to fopen through the tmp variable, but when I use the TCppWebBrowser component, where do I assign the path? Is it on Browser->ControlInterface->Document, or on some other property or function?

READ YA
m1k3
 
LVL 5

Expert Comment

by:Greybird
ID: 13873508
Here you will find information about navigating the Document object:
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/dhtml_reference_entry.asp

For help with TCppWebBrowser itself, you have the C++ Builder help file, and not much more. Basically, it's just an ActiveX component wrapping Internet Explorer.

To load a page into the browser, you could do:
// if the page is in a Memo, AnsiString, ... previously loaded from a file
WideString source = "about:" + Memo1->Lines->Text;
CppWebBrowser1->Navigate(source, 0, 0, 0, 0);

// or directly
WideString URL = L"file://C:/webpage.html";
CppWebBrowser1->Navigate(URL);
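
Since Navigate takes a URL rather than a file-system path, a local path has to be turned into a file: URL first. A minimal sketch of that conversion in standard C++ follows. The helper name toFileUrl is hypothetical, the input is assumed to be an absolute Windows path, and only backslashes and spaces are translated; other reserved characters are left alone.

```cpp
#include <string>

// Hypothetical helper: turns an absolute Windows path such as
// "C:\\web sites\\page.htm" into "file:///C:/web%20sites/page.htm".
std::string toFileUrl(const std::string& windowsPath)
{
    std::string url = "file:///";
    for (char c : windowsPath)
    {
        if (c == '\\')
            url += '/';        // URLs use forward slashes
        else if (c == ' ')
            url += "%20";      // percent-encode spaces
        else
            url += c;
    }
    return url;
}
```

The three-slash form (empty host, then the drive letter) is the usual way to write a local file URL. The result could then be assigned to the WideString that is passed to Navigate.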
 
LVL 16

Assisted Solution

by:George Tokas
George Tokas earned 40 total points
ID: 13874221
One more suggestion.
If you don't want to display the HTML content of the URL, you can also use TNMHTTP. Calling its Get() function puts the content in the Body property, and you can process it as an AnsiString.
The online help is VERY helpful on this topic, in contrast with TCppWebBrowser.
I've been through hell to work with it and find my way....

gtokas.
 
LVL 5

Expert Comment

by:Greybird
ID: 13875711
If you don't want to display the page, you could also set the Visible property of the TCppWebBrowser to false ;)

or use the IdHTTP component.

I suggested TCppWebBrowser because it has built-in functions that handle all the tags, so you can access the data more easily.

But it really depends on your exact needs.
 

Author Comment

by:m1k3
ID: 13881241
What I want to do is get the text, without any tags or special HTML characters, into a txt file. I don't really want to display the pages, because there will be lots of them and that consumes too much memory and slows down my PC.

read ya
m1k3

PS: thanks for your quick response & patience. I may look like a newbie to you... ;P (always learning new things)
 
LVL 5

Expert Comment

by:Greybird
ID: 13881257
The method I gave you at the beginning ensures that you get only the text, without any tags. You should set the Visible property of the TCppWebBrowser to false.
 

Author Comment

by:m1k3
ID: 13882404
Hi!:

The problem now is that every time I try to open any file, an error occurs:

_ASSERTE:
IsBound() @ c:\bcb\emuvcl\utilcls.h/4249
...

and it doesn't let me run the application properly... this is the code I've implemented:

__fastcall TForm1::TForm1(TComponent* Owner)
        : TForm(Owner)
{
 WideString URL = L"file://PATH/file_name.htm";
 Browser->Navigate(URL);
 AnsiString aTextContent= "";
 Variant vDocument = Browser->ControlInterface->Document;
 if (((IDispatch *)vDocument) != NULL)
 {
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
 }
}

The file extension is correct: when I change it to file_name.html the file logically can't be found, so it can't be an extension problem, but I don't know what to do... I asked for the TCppWebBrowser manual because I'd like to know exactly what I am doing.... :P

READ YA
m1k3
 
LVL 5

Expert Comment

by:Greybird
ID: 13883086
Navigate returns without waiting for the page to be loaded. You have to do the following things:

Add the following private member to Form1:
_di_IDispatch CurDispatch;

Set CurDispatch to NULL in the constructor of Form1:
CurDispatch = NULL;

Implement OnDocumentComplete:
//---------------------------------------------------------------------------
void __fastcall TForm1::CppWebBrowser1DocumentComplete(TObject *Sender,
      LPDISPATCH pDisp, Variant *URL)
{
   if (pDisp == CurDispatch) // pDisp is equal to the browser dispatch only when the whole document is complete, not when just a frame has loaded.
   {
      CurDispatch = NULL; // reset CurDispatch
   }
   // then you are sure that the document is loaded completely
   AnsiString aTextContent= "";
   Variant vDocument = Browser->ControlInterface->Document;
   if (((IDispatch *)vDocument) != NULL)
   {
      Variant vBody = vDocument.OlePropertyGet("Body");
      Variant vTextContent = vBody.OlePropertyGet("InnerText");
      aTextContent = vTextContent;
   }
}
//---------------------------------------------------------------------------
implement OnNavigateComplete2 :
//---------------------------------------------------------------------------
void __fastcall TForm1::CppWebBrowser1NavigateComplete2(TObject *Sender,
      LPDISPATCH pDisp, Variant *URL)
{
   if (!CurDispatch) // if CurDispatch is NULL
    CurDispatch = pDisp; // stores the browser dispatch
}
//---------------------------------------------------------------------------
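
The bookkeeping done by the two handlers can be modelled without the VCL, which may make the idea easier to follow. The sketch below is an illustration only: TopLevelLoadTracker is a hypothetical class, const void* stands in for LPDISPATCH, and it follows the approach of the post above, which assumes that the first NavigateComplete2 event of a navigation comes from the top-level browser rather than from a frame.

```cpp
// Framework-free model of the handler logic; const void* stands in for LPDISPATCH.
class TopLevelLoadTracker
{
public:
    TopLevelLoadTracker() : current_(nullptr) {}

    // Call from OnNavigateComplete2: remember only the first dispatch seen.
    void navigateComplete(const void* disp)
    {
        if (current_ == nullptr)
            current_ = disp;
    }

    // Call from OnDocumentComplete: returns true only when the event belongs
    // to the remembered dispatch, i.e. the whole page has finished loading.
    bool documentComplete(const void* disp)
    {
        if (disp != nullptr && disp == current_)
        {
            current_ = nullptr;   // ready for the next navigation
            return true;
        }
        return false;             // a frame finished, or nothing is pending
    }

private:
    const void* current_;
};
```

In the real handler, the text extraction would run only when documentComplete returns true.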
 

Author Comment

by:m1k3
ID: 13892008
Hi!:

I've solved it... well, actually Greybird did, but you know what I mean

Finally, I've got a question... Is there any problem if I open about 500 files or more??

READ YA
m1k3

PS: Thanks to Greybird, hobbit72 & gtokas. That's why I increased the point value, to give you all some of the points, but Greybird actually wins the 125 +5 original points & the rest is split between hobbit72 & gtokas
 
LVL 5

Expert Comment

by:Greybird
ID: 13892228
Do you mean opening 500 files at the same time, or one after the other?
 

Author Comment

by:m1k3
ID: 13895421
One after another.. but maybe more than 500...
 
LVL 16

Expert Comment

by:George Tokas
ID: 13917108
If fewer than 50 files are open at the same time, then I don't think there will be a problem... With more than 50 there are some limitations from Windows and Borland's compiler...
Anyway, you can read the contents of those files (as many as you like) by creating and using buffers, but remember to close the file handles in order to avoid problems...
As for the solution with TCppWebBrowser, I'm glad you got it to work, because documentation for BCB simply does not exist... That's why I proposed the native THTTP component.

gtokas.

P.S. Thankfully the lack of documentation is made up for by the experience of the people here who share it with all of us...:-)
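
One way to make sure each file handle is released before the next file is opened is to let a stream object close it automatically. A minimal sketch in standard C++, assuming the files fit in memory; readWholeFile is a hypothetical helper name.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the whole file as a string, or "" if it cannot be opened.
// The std::ifstream destructor closes the handle when the function returns,
// so calling this in a loop never keeps more than one file open.
std::string readWholeFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        return "";
    std::ostringstream contents;
    contents << in.rdbuf();   // copy the entire stream into the buffer
    return contents.str();
}
```

Unlike a fixed char array, the string grows to the size of the file, so nothing is truncated.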
 
LVL 5

Expert Comment

by:Greybird
ID: 13919864
If you open them one after the other, there isn't any problem, as gtokas said.
 

Author Comment

by:m1k3
ID: 13974727
Thanks. The deal is that I need to extract the content of some HTML files; as a sample of a population I have between 100 & 500. I only need the information, without the markup tags or the special characters. I used to process one file after another myself, but it's so hard to establish a pattern in the tags, especially span, style & the JavaScript ones, just to name some. With the TCppWebBrowser it looked easy, but now I'm trying to apply parts of my old code. The first part, which eliminates special characters, is OK, but when I try to separate the words, that's where the "destruction" begins: each word is full of special characters, dunno why :S... maybe I'll try the tip that hobbit72 gave me...
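
For the word-separation step, one simple approach is to treat every run of letters and digits as a word and everything else as a separator. The sketch below assumes the extracted text is plain single-byte text held in a std::string; splitWords is a hypothetical helper, not the poster's Makelist function.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits text into words made of letters and digits.
// Any other character (spaces, punctuation, control characters) ends the current word.
std::vector<std::string> splitWords(const std::string& text)
{
    std::vector<std::string> words;
    std::string current;
    for (char c : text)
    {
        // the cast avoids undefined behaviour for characters with negative values
        if (std::isalnum(static_cast<unsigned char>(c)))
        {
            current += c;
        }
        else if (!current.empty())
        {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        words.push_back(current);   // last word, if the text does not end in a separator
    return words;
}
```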
 

Author Comment

by:m1k3
ID: 13983835
Hi:

When I check the first file, which is written right after the HTML file is loaded, I do this:

In the BrowserDocumentComplete function:
...
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
/* I added the next lines */
  stream_aux = fopen ("DUMP.XNF", "w+");
  fprintf (stream_aux, "%s\n", aTextContent.c_str()); /* %s expects a char*, so pass c_str() */
  fclose (stream_aux);
  stream_aux = NULL;
  fflush (stream_aux);

but first, to open an HTML file, I use FindFirstFile & FindNextFile to track down every HTML file, & when one is found:

...
   getcwd(dbuffer, MAXPATH);
   strcpy (dir_url, "file://");
   strcat (dir_url, dbuffer);
   strcat (dir_url, "\\");
   strcat (dir_url, str); /*str has the file name*/
   WideString URL = dir_url;
   Browser->Navigate(URL);
[*]   Removecharacters ();
[*]   Makelist ();

I also tried moving the last 2 functions [*] to right after fflush (stream_aux); in the BrowserDocumentComplete function. The first file (DUMP.XNF) is written correctly; after that, a new file is generated in Removecharacters ():

if (!(DirectoryExists ("c:\\XNF_Files\\")))
  CreateDirectory ("c:\\XNF_Files\\", NULL);
 strcpy (szFileName, "c:\\XNF_Files\\");
 strcat (szFileName, str);
 strcat(szFileName, ".XNF");
 stream_aux = fopen(szFileName, "w+");

& when the function ends, the 2nd file is correct: it has the content without any special characters. But in the Makelist function I generate another DUMP.XNF file that should contain the word list, and it either has some other (special) characters between every character or shows no change from the DUMP.XNF file as it was first created. And when I move the functions to after the fflush (stream_aux); in the BrowserDocumentComplete function, the DUMP.XNF file has nothin'....
Please help :'(.... I really need to get this done, & there are some other processing steps still missing...

READ YA
m1k3
 

Author Comment

by:m1k3
ID: 14083335
I kind of made it work, but I still have one little problem with the TCppWebBrowser component. I search for every "*.htm" file, but when one is found, how do I assign the URL to the component and know when the file I passed to Browser->Navigate(URL); is fully loaded? I assign each file as soon as it is found by FindFirstFile & FindNextFile, and I need to copy the content to a file right after:

if (pDisp == CurDispatch) // pDisp is equal to the browser dispatch only when the whole document is complete, not when just a frame has loaded.
 {
  CurDispatch = NULL; // reset CurDispatch
 }
 Variant vDocument = Browser->ControlInterface->Document;
 if (((IDispatch *)vDocument) != NULL)
 {
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
/* ------- HERE I COPY THE BODY INTO A FILE --------- */
 }
But how can I do this? Every time I run the application, it first goes through every file found & only then starts loading them into the TCppWebBrowser. So where exactly do I need to call the loading, so I can copy the body & start the processing I need to do?? Please tell me... THXADV

READ YA
m1k3
 
LVL 5

Expert Comment

by:Greybird
ID: 14083439
Just to say that I made an error in my post.

You have to put the code that runs after the page has finished downloading just after CurDispatch = NULL;, inside the braces!

Sorry for the confusion.

To wait for the file to be fully processed in your loop that finds the files, you can do something like this:

bool waiting = false;
//...
//FindFirstFile
do
   waiting = true;
   //Navigate
   while (waiting)
   {
 
LVL 5

Expert Comment

by:Greybird
ID: 14083458
Sorry, I accidentally clicked Submit. Here is the complete version.

In the .h, in the private section of the form:
bool waiting; // set it to false in the form constructor

In the .cpp:
// FindFirstFile
do
{
   waiting = true;
   // Navigate
   while (waiting)
   {
      Application->ProcessMessages(); // keep pumping messages so OnDocumentComplete can fire
   }
   // FindNextFile
} while ...

and put waiting = false; after CurDispatch = NULL; in OnDocumentComplete.

Then you can call all your processing in OnDocumentComplete, after your comment "/* ------- HERE I COPY THE BODY INTO A FILE --------- */"
 

Author Comment

by:m1k3
ID: 14110725
I finally made it work, but the process is a little slow. I checked & it analyses the 1st file twice, once at the beginning & once at the end... why is this?? Is it caused by my FindFirstFile & FindNextFile procedure??...

 hFind = FindFirstFile("*.htm", &FN);
 strcpy (str, FN.cFileName);
 if (hFind == INVALID_HANDLE_VALUE)
 {
  ShowMessage(AnsiString("Invalid Handle File. Error ") + GetLastError () + ".");
 }
 else
 {
  if (str)
  {
   waiting = true;
   getcwd(dbuffer, MAXPATH);
   strcpy (dir_url, "file://");
   strcat (dir_url, dbuffer);
   strcat (dir_url, "\\");
   strcat (dir_url, str);
   WideString URL = dir_url;
   Browser->Navigate(URL);
   while (waiting)
   {
    Application->ProcessMessages();
   }
  }
  while (FindNextFile(hFind, &FN) != 0)
  {
   strcpy (str, FN.cFileName);
   if (str)
   {
    waiting = true;
    getcwd(dbuffer, MAXPATH);
    strcpy (dir_url, "file://");
    strcat (dir_url, dbuffer);
    strcat (dir_url, "\\");
    strcat (dir_url, str);
    WideString URL = dir_url;
    Browser->Navigate(URL);
    while (waiting)
    {
     Application->ProcessMessages();
    }
   }
  }
  dwError = GetLastError();
  if (dwError == ERROR_NO_MORE_FILES)
  {
   FindClose(hFind);
  }
  else
  {
   ShowMessage(AnsiString("Invalid handle file. Error ") + GetLastError () + ".");
  }
 }
 

Author Comment

by:m1k3
ID: 14180901
Hi! I've solved the last problem: at the end of the function that searches for every *.html file I added Browser->Navigate(NULL);. The only limitation I saw after running some tests is that some files cannot be opened, dunno why. I tested with 139 files & so far only a few couldn't be opened in the TCppWebBrowser, but this helped me a lot...
Thank you all!!

READ YA
Mike
 
LVL 5

Expert Comment

by:Greybird
ID: 14181472
C Grade is not really a Thank U....
Question has a verified solution.
