• Status: Solved

process HTML documents

Hi!

I need to extract the content from an HTML file: only the meaningful text, without the markup tags, special characters, or HTML entities. Is there any way to pull all of that text out of an HTML file?

READ YA
Mike
3 Solutions
 
Greybird Commented:
If you want the text content of the HTML, you could use a TCppWebBrowser (named Browser in the example below) and open the page in it.

Then use:
AnsiString aTextContent= "";
Variant vDocument = Browser->ControlInterface->Document;
if (((IDispatch *)vDocument) != NULL)
{
   Variant vBody = vDocument.OlePropertyGet("Body");
   Variant vTextContent = vBody.OlePropertyGet("InnerText");
   aTextContent = vTextContent;
}

Gary Benade Commented:
Once you have the body text, you could strip the control tags out by doing something like this.

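String strippedText;
int lt;
while ((lt = vTextContent.Pos("<")) > 0)
{
   // keep the text in front of the tag; the space keeps words apart
   strippedText += vTextContent.SubString(1, lt - 1).Trim() + " ";
   vTextContent.Delete(1, lt);                   // drop that text and the "<"
   int gt = vTextContent.Pos(">");
   if (gt == 0) gt = vTextContent.Length();      // unterminated tag: drop the rest
   vTextContent.Delete(1, gt);
}
strippedText += vTextContent.Trim();             // text after the last tag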

If you already have the HTML on disk, you could strip the header with this (assuming vTextContent contains the entire HTML file):

vTextContent.Delete(1, vTextContent.LowerCase().Pos("</head>")+6);

and then run the code above on what's left in vTextContent to get the stripped text.

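eg.

    char buffer[10240] = "";   // only the first 10 KB of the file is read
    FILE *tmp = fopen("c:\\web sites\\autodealer\\homepage.htm", "r");
    if (tmp != NULL)
    {
       size_t n = fread(buffer, 1, sizeof(buffer) - 1, tmp);
       buffer[n] = '\0';       // fread does not terminate the string
       fclose(tmp);
    }
    String strippedText;
    String vTextContent = buffer;
    int headEnd = vTextContent.LowerCase().Pos("</head>");
    if (headEnd > 0)           // skip the header only if there is one
       vTextContent.Delete(1, headEnd + 6);
    int lt;
    while ((lt = vTextContent.Pos("<")) > 0)
    {
       strippedText += vTextContent.SubString(1, lt - 1).Trim() + " ";
       vTextContent.Delete(1, lt);                // drop that text and the "<"
       int gt = vTextContent.Pos(">");
       if (gt == 0) gt = vTextContent.Length();   // unterminated tag: drop the rest
       vTextContent.Delete(1, gt);
    }
    strippedText += vTextContent.Trim();          // text after the last tag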
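
For anyone who wants to try the tag-stripping idea outside C++ Builder, here is a rough portable sketch using std::string. The names stripTags and collapseSpaces are made up for this example, and it is not a real HTML parser: it only drops anything between < and >, skips the contents of script and style elements, and leaves entities such as &amp; untouched.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: returns the text outside <...> tags and skips the
// contents of <script> and <style> elements. Each tag becomes one space.
std::string stripTags(const std::string& html)
{
    // Lower-case copy, used only for case-insensitive tag matching.
    std::string lower(html);
    for (char& c : lower) c = (char)std::tolower((unsigned char)c);

    static const char* const skipped[] = {"script", "style"};
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') { out += html[i++]; continue; }
        std::size_t close = html.find('>', i);
        if (close == std::string::npos) break;        // unterminated tag: stop
        for (const char* name : skipped) {
            const std::string open = std::string("<") + name;
            if (lower.compare(i, open.size(), open) == 0) {
                // Jump to the end of the matching closing tag.
                std::size_t end = lower.find(std::string("</") + name, close);
                close = (end == std::string::npos) ? end : html.find('>', end);
                break;
            }
        }
        if (close == std::string::npos) break;        // element never closed
        out += ' ';
        i = close + 1;
    }
    return out;
}

// Hypothetical helper: trims the text and collapses runs of whitespace.
std::string collapseSpaces(const std::string& s)
{
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : s) {
        if (std::isspace(c)) { pendingSpace = !out.empty(); continue; }
        if (pendingSpace) { out += ' '; pendingSpace = false; }
        out += (char)c;
    }
    return out;
}
```

Calling collapseSpaces(stripTags(page)) on the contents of a page gives one line of text with single spaces between the pieces.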

Greybird Commented:
My code does not retrieve control tags.

It would have if I had called InnerHTML...
 
m1k3 (Author) Commented:
Thanks for the quick response. I'll try the tips later today. Thanks, Greybird & hobbit72.

I was getting tired of doing everything the old way: reading the file character by character, and so on. Thanks again.

READ YA
m1k3

m1k3 (Author) Commented:
By the way, I almost forgot: does anyone know where I can find a manual for TCppWebBrowser that explains every function, procedure, etc.? I need it to document the programming process.

-THXIADV- READ YA
                   m1k3

m1k3 (Author) Commented:
The files I intend to analyze are in a specific directory on my hard drive, so where do I assign the URL or path of my files? In hobbit72's comment the path goes into fopen for the tmp variable, but with the TCppWebBrowser component, where do I set the path? Is it on Browser->ControlInterface->Document, or in some other property or function?

READ YA
m1k3

Greybird Commented:
Here you will find information about navigating the Document object:
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/dhtml/reference/dhtml_reference_entry.asp

For help with the TCppWebBrowser, you have the C++ Builder help file and not much more. Basically, it's just an ActiveX component wrapping Internet Explorer.

To display the page in the browser, you could do:
// if the page is in a Memo, AnsiString, ... previously loaded from a file
WideString source = "about:" + Memo1->Lines->Text;
CppWebBrowser1->Navigate(source, 0, 0, 0, 0);

// or directly
WideString URL = L"file://C:/webpage.html";
CppWebBrowser1->Navigate(URL);
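
For local files, the usual form of the URL is file:/// followed by the path written with forward slashes. Here is a minimal sketch of building one from a Windows path. toFileUrl is a made-up helper, not part of the VCL, and it only handles backslashes and spaces; other reserved characters (%, #, and so on) would need escaping too.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: turns a Windows path such as
// "C:\\web sites\\page.htm" into a file URL.
// Assumption: only backslashes and spaces need rewriting.
std::string toFileUrl(const std::string& path)
{
    std::string url = "file:///";
    for (char c : path) {
        if (c == '\\')      url += '/';      // URL separators are forward slashes
        else if (c == ' ')  url += "%20";    // spaces are percent-encoded
        else                url += c;
    }
    return url;
}
```

The result could then be assigned to the WideString that is passed to Navigate.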

George Tokas Commented:
One more suggestion.
If you don't want to display the HTML content of the URL, you can also use TNMHTTP. After calling the Get() function, the content is in the Body property and you can process it as an AnsiString.
The online help is VERY helpful on this topic, in contrast with TCppWebBrowser.
I've been through hell working with it to find my way...

gtokas.

Greybird Commented:
If you don't want to display the page, you could also set the Visible property of the TCppWebBrowser to false ;)

Or use the IdHTTP component.

I suggested TCppWebBrowser because it has built-in functions that deal with all the tags, so you can access the data more easily.

But it really depends on your exact needs.

m1k3 (Author) Commented:
What I want to do is get the text, without any tags or special HTML characters, into a txt file. I don't really want to display the pages, because there will be lots of them and that consumes too much memory and slows down my PC.

read ya
m1k3

PS: Thanks for your quick response and patience. I may look like a newbie to you... ;P (always learning new things)

Greybird Commented:
The method I gave you at the beginning ensures that you get only the text, without any tags. To avoid displaying the pages, set the Visible property of the TCppWebBrowser to false.

m1k3 (Author) Commented:
Hi!

The problem now is that every time I try to open a file, an error occurs:

_ASSERTE:
IsBound() @ c:\bcb\emuvcl\utilcls.h/4249
...

and the application won't run properly. This is the code I've implemented:

__fastcall TForm1::TForm1(TComponent* Owner)
        : TForm(Owner)
{
 WideString URL = L"file://PATH/file_name.htm";
 Browser->Navigate(URL);
 AnsiString aTextContent= "";
 Variant vDocument = Browser->ControlInterface->Document;
 if (((IDispatch *)vDocument) != NULL)
 {
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
 }
}

The file extension is correct: when I change it to file_name.html the file can't be found, as expected, so it isn't an extension problem, but I don't know what to do. I asked for the TCppWebBrowser manual because I'd like to know exactly what I'm doing. :P

READ YA
m1k3

Greybird Commented:
Navigate returns without waiting for the page to load, so the document may not be available yet when your constructor reads it. You have to do the following:

Add the following private member to Form1:
_di_IDispatch CurDispatch;

Set CurDispatch to NULL in the constructor of Form1:
CurDispatch = NULL;

Implement OnDocumentComplete:
//---------------------------------------------------------------------------
void __fastcall TForm1::CppWebBrowser1DocumentComplete(TObject *Sender,
      LPDISPATCH pDisp, Variant *URL)
{
   if (pDisp == CurDispatch) // pDIsp is equal to the browser dispatch only if the document is complete, and not when only a frame is loaded.
   {
      CurDispatch = NULL; // reset CurDispatch
   }
   // then you are sure that the document is loaded completely
   AnsiString aTextContent= "";
   Variant vDocument = Browser->ControlInterface->Document;
   if (((IDispatch *)vDocument) != NULL)
   {
      Variant vBody = vDocument.OlePropertyGet("Body");
      Variant vTextContent = vBody.OlePropertyGet("InnerText");
      aTextContent = vTextContent;
   }
}
//---------------------------------------------------------------------------
implement OnNavigateComplete2 :
//---------------------------------------------------------------------------
void __fastcall TForm1::CppWebBrowser1NavigateComplete2(TObject *Sender,
      LPDISPATCH pDisp, Variant *URL)
{
   if (!CurDispatch) // if CurDispatch is NULL
    CurDispatch = pDisp; // stores the browser dispatch
}
//---------------------------------------------------------------------------
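
The bookkeeping in the two handlers can be tried out without the browser at all. The sketch below is a stand-alone illustration, not VCL code: LoadTracker and its method names are made up here, and plain pointers stand in for the LPDISPATCH values. It assumes, as the code above does, that the first NavigateComplete2 of a navigation comes from the top-level browser, and it reports completion only when the pointers match, which is the placement Greybird corrects to later in the thread.

```cpp
#include <cassert>

// Stand-alone model of the CurDispatch bookkeeping (hypothetical names).
class LoadTracker
{
public:
    // Models NavigateComplete2: remember the first pointer seen, which is
    // assumed to belong to the top-level browser.
    void onNavigateComplete(const void* disp)
    {
        if (current_ == nullptr) current_ = disp;
    }

    // Models DocumentComplete: returns true only when the pointer matches
    // the stored one, i.e. the whole document and not just a frame is done.
    bool onDocumentComplete(const void* disp)
    {
        if (current_ == nullptr || disp != current_) return false;
        current_ = nullptr;   // reset, ready for the next navigation
        return true;
    }

private:
    const void* current_ = nullptr;
};
```

In the form, the text extraction would run only in the branch where onDocumentComplete would return true.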

m1k3 (Author) Commented:
Hi!

I've solved it... well, actually Greybird did, but you know what I mean.

One last question: is there any problem if I open about 500 files or more?

READ YA
m1k3

PS: Thanks to Greybird, hobbit72 and gtokas. That's why I increased the point value, to give you all some of the points: Greybird gets the original 125 + 5 points, and the rest is split between hobbit72 and gtokas.

Greybird Commented:
Do you mean opening 500 files at the same time, or one after the other?

m1k3 (Author) Commented:
One after another, but maybe more than 500.

George Tokas Commented:
If fewer than 50 files are open at the same time, I don't think there will be a problem. Beyond 50 there are some limitations from Windows and Borland's compiler.
In any case, you can read the contents of as many files as you like into buffers, but remember to close the file handles to avoid problems.
As for the TCppWebBrowser solution, I'm glad you got it to work, because documentation for it in BCB simply does not exist. That's why I proposed the native TNMHTTP component.

gtokas.

P.S. Thankfully the lack of documentation is made up for by the experience of the people here who share it with all of us... :-)
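
As a rough illustration of the "read into a buffer and close the handle" advice, here is a sketch in standard C++ rather than the VCL. readFile is a made-up helper; because the stream is a local object, the file is closed automatically when the function returns, so only one file is ever open while the files are processed one after the other.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads a whole file into 'contents'.
// Returns false if the file could not be opened.
bool readFile(const std::string& path, std::string& contents)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    std::ostringstream buffer;
    buffer << in.rdbuf();        // copy the entire file into memory
    contents = buffer.str();
    return true;                 // 'in' is closed here by its destructor
}
```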

Greybird Commented:
If you open them one after the other, there isn't any problem, as gtokas said.

m1k3 (Author) Commented:
Thanks. The deal is that I need to extract the content of some HTML files; as a sample of a population I have between 100 and 500 of them. I only need the information, without the markup tags or the special characters. I used to process the files one after another myself, but it's very hard to establish a pattern in the tags, especially span, style and the JavaScript ones, to name a few. With the TCppWebBrowser it looked easy, but now I'm trying to reuse parts of my old code. The first part, which eliminates special characters, works fine, but when I try to separate the words the "destruction" begins: each word comes out full of special characters and I don't know why. :S Maybe I'll try the tip hobbit72 gave me.

m1k3 (Author) Commented:
Hi,

To check the first file, which is written right after the HTML file is loaded, I do this:

In the BrowserDocumentComplete function:
...
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
/* I added the next lines */
  stream_aux = fopen ("DUMP.XNF", "w+");
  fprintf (stream_aux, "%s\n", aTextContent);
  fclose (stream_aux);
  stream_aux = NULL;
  fflush (stream_aux);

But first, to find the HTML files, I use FindFirstFile and FindNextFile, and whenever one is found I run:

...
   getcwd(dbuffer, MAXPATH);
   strcpy (dir_url, "file://");
   strcat (dir_url, dbuffer);
   strcat (dir_url, "\\");
   strcat (dir_url, str); /*str has the file name*/
   WideString URL = dir_url;
   Browser->Navigate(URL);
[*]   Removecharacters ();
[*]   Makelist ();

I also tried moving the last two functions [*] to right after fflush (stream_aux); in the BrowserDocumentComplete function. The first file (DUMP.XNF) is written correctly. After that, Removecharacters () generates a new file:

if (!(DirectoryExists ("c:\\XNF_Files\\")))
  CreateDirectory ("c:\\XNF_Files\\", NULL);
 strcpy (szFileName, "c:\\XNF_Files\\");
 strcat (szFileName, str);
 strcat(szFileName, ".XNF");
 stream_aux = fopen(szFileName, "w+");

When that function ends, the second file is correct: it has the content without any special characters. But in the Makelist function I generate another DUMP.XNF file that should contain the word list, and it either has extra (special) characters between every character or shows no change from the DUMP.XNF file as it was first created. And when I move the two functions to after the fflush (stream_aux); in the BrowserDocumentComplete function, the DUMP.XNF file is empty.
Please help :'( I really need to get this done, and there are other processing steps still missing.

READ YA
m1k3
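
The word-list step itself is not shown in the thread, so the following is only a sketch of one way to do it in standard C++, not the code behind Makelist. splitWords is a made-up helper; it assumes the text is a plain single-byte string and treats every character that is not an ASCII letter or digit as a separator.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits text into words. Anything that is not an
// ASCII letter or digit ends the current word and is dropped.
std::vector<std::string> splitWords(const std::string& text)
{
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (c < 128 && std::isalnum(c)) {
            current += (char)c;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);   // last word, if any
    return words;
}
```

Accented letters are dropped by this rule, so it would need adjusting for pages that are not plain English text.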

m1k3 (Author) Commented:
I kind of made it work, but I still have one small problem with the TCppWebBrowser component. I search for every "*.htm" file, but when one is found, how do I assign its URL to the component, and how do I know when the file I passed to Browser->Navigate(URL); is fully loaded? I assign the file as soon as FindFirstFile or FindNextFile finds it, and I need to copy the content to a file right after:

if (pDisp == CurDispatch) // pDIsp is equal to the browser dispatch only if the document is complete, and not when only a frame is loaded.
 {
  CurDispatch = NULL; // reset CurDispatch
 }
 Variant vDocument = Browser->ControlInterface->Document;
 if (((IDispatch *)vDocument) != NULL)
 {
  Variant vBody = vDocument.OlePropertyGet("Body");
  Variant vTextContent = vBody.OlePropertyGet("InnerText");
  aTextContent = vTextContent;
/* ------- HERE I COPY THE BODY INTO A FILE --------- */
 }
But how can I do this? Every time I run the application it first goes through every file found and only then starts loading them into the TCppWebBrowser. So where exactly do I need to trigger the load, so that I can copy the body and start the processing I need to do? Please tell me. Thanks in advance.

READ YA
m1k3

Greybird Commented:
I made an error in my earlier post.

The code that should run once the page has finished loading must go right after CurDispatch = NULL;, inside the braces!

Sorry for the confusion.

To wait for each file to be fully processed inside your file-finding loop, you can do something like this:
In the .h, in the private section of the form (set it to false in the form's constructor):
bool waiting;

In the .cpp:
//FindFirstFile
do
{
   waiting = true;
   //Navigate
   while (waiting)
   {
      Application->ProcessMessages();
   }
   // FindNextFile
} while ...

and put waiting = false; after the CurDispatch = NULL; in the OnDocumentComplete handler.

Then you can call all your processing in OnDocumentComplete, after your comment "/* ------- HERE I COPY THE BODY INTO A FILE --------- */".
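
One caveat with a wait loop like this: if the completion event never arrives for some file, the loop never exits. Below is a minimal sketch of the same loop with a timeout, written in standard C++11 rather than for the VCL. pumpUntil and its parameters are made up for the example; pump stands in for Application->ProcessMessages() and done stands in for checking the waiting flag.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: keeps calling pump() until done() returns true or
// the timeout expires. Returns true if done() became true in time.
bool pumpUntil(const std::function<bool()>& done,
               const std::function<void()>& pump,
               std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        pump();   // let pending events (such as DocumentComplete) be handled
    }
    return true;
}
```

A false result would mean the page did not finish loading in time, so the loop could log the file name and move on to the next one.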

m1k3 (Author) Commented:
I finally made it work, but the process is a little slow. I checked, and it analyses the first file twice: once at the beginning and once at the end. Why is this? Is it caused by my FindFirstFile and FindNextFile procedure?

 hFind = FindFirstFile("*.htm", &FN);
 strcpy (str, FN.cFileName);
 if (hFind == INVALID_HANDLE_VALUE)
 {
  ShowMessage(AnsiString("Invalid Handle File. Error ") + GetLastError () + ".");
 }
 else
 {
  if (str)
  {
   waiting = true;
   getcwd(dbuffer, MAXPATH);
   strcpy (dir_url, "file://");
   strcat (dir_url, dbuffer);
   strcat (dir_url, "\\");
   strcat (dir_url, str);
   WideString URL = dir_url;
   Browser->Navigate(URL);
   while (waiting)
   {
    Application->ProcessMessages();
   }
  }
  while (FindNextFile(hFind, &FN) != 0)
  {
   strcpy (str, FN.cFileName);
   if (str)
   {
    waiting = true;
    getcwd(dbuffer, MAXPATH);
    strcpy (dir_url, "file://");
    strcat (dir_url, dbuffer);
    strcat (dir_url, "\\");
    strcat (dir_url, str);
    WideString URL = dir_url;
    Browser->Navigate(URL);
    while (waiting)
    {
     Application->ProcessMessages();
    }
   }
  }
  dwError = GetLastError();
  if (dwError == ERROR_NO_MORE_FILES)
  {
   FindClose(hFind);
  }
  else
  {
   ShowMessage(AnsiString("Invalid handle file. Error ") + GetLastError () + ".");
  }
 }

m1k3 (Author) Commented:
Hi! I've solved the last problem: at the end of the function that searches for every *.html file I added Browser->Navigate(NULL);. The only limitation I saw after running some tests is that some files cannot be opened, and I don't know why. I tested with 139 files and so far only a few couldn't be opened in the TCppWebBrowser, but this helped me a lot.
Thank you all!!

READ YA
Mike

Greybird Commented:
C Grade is not really a Thank U....