Solved

Parse HTML and submit form

Posted on 2009-07-13
Last Modified: 2013-11-19
I am trying to use libcurl to access remote web pages via a bank of anonymous web proxies.  I can go to a particular web proxy URL (for example http://www.antifilters.info/) and retrieve their page just fine.  The problem is that I now need to parse the returned HTML, work out which form (the one with the <input type="submit"> button) to use, and fill in all of its fields before submitting.  I'm trying to use libxml but cannot figure out how to do it.  Can anyone give me a clue?

Thanks,
Curt

Question by: 97WideGlide
 
Expert Comment by evilrix:
Try El-Kabong; it's a very simple (and very forgiving) SAX-style HTML parser.

"El-Kabong is a high-speed, forgiving, sax-style HTML parser. Its aim is to provide consumers with a very fast, clean, lightweight library which parses HTML quickly, while forgiving syntactically incorrect tags."

http://sourceforge.net/projects/ekhtml/

Expert Comment by Infinity08:
What I generally recommend is to run the HTML through Tidy (http://tidy.sourceforge.net/) to clean it up and generate proper XHTML, and then use an XML parser (like libxml: http://xmlsoft.org/) to parse it.

The reason is that a lot of the HTML out there on the internet is full of mistakes, so even a specialized parser can easily "misinterpret" it. Tidy is specialized in cleaning up such HTML.

Of course, you can see if a specialized parser like the one evilrix suggested works for you - if so it would be a bit easier to implement ... I've never tried any, so I can't comment on that :)
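
For what it's worth, the Tidy-to-libxml pipeline only takes a handful of calls. Here's a rough sketch in C, assuming the raw page is already in a NUL-terminated buffer called html; these are the standard libtidy and libxml2 C APIs, but double-check the option and header names against the versions you have installed:

#include <stdio.h>
#include <tidy.h>
#include <buffio.h>               /* called tidybuffio.h in newer libtidy releases */
#include <libxml/parser.h>

/* Clean up raw HTML with libtidy, then hand the resulting XHTML to libxml2. */
xmlDocPtr parse_dirty_html(const char *html)
{
    TidyDoc tdoc = tidyCreate();
    TidyBuffer out = {0};
    xmlDocPtr doc = NULL;

    tidyOptSetBool(tdoc, TidyXhtmlOut, yes);    /* emit XHTML */
    tidyOptSetBool(tdoc, TidyForceOutput, yes); /* produce output even if there were errors */

    if (tidyParseString(tdoc, html) >= 0 &&
        tidyCleanAndRepair(tdoc) >= 0 &&
        tidySaveBuffer(tdoc, &out) >= 0)
    {
        doc = xmlReadMemory((const char *)out.bp, (int)out.size,
                            "proxy.xhtml", NULL, 0);
    }

    tidyBufFree(&out);
    tidyRelease(tdoc);
    return doc;    /* NULL if parsing failed */
}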

Author Comment by 97WideGlide:
If we assume that I've got clean XHTML, how would I go about forming the data for a proper submit via libcurl?

Expert Comment by Infinity08:
Here is an example of how to perform an HTTP POST using libcurl:

        http://curl.haxx.se/lxr/source/docs/examples/postit2.c

You'll have to know what the form looks like (i.e. what fields it has, etc.). And since you have the XHTML page that contains the form, you can simply extract the necessary information from it using an XML parser (or an HTML parser).
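
Note that postit2.c does a multipart/form-data upload; for a plain <form method="post"> like the ones on these proxy pages, a URL-encoded POST via CURLOPT_POSTFIELDS is usually all you need. A minimal sketch (the URL and the field string are made-up placeholders; the real values come from the form's action attribute and its input fields):

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    CURL *curl = curl_easy_init();
    CURLcode res;

    if (!curl)
        return 1;

    /* Placeholder target and body - build these from the parsed form. */
    curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/index.php");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
                     "q=http%3A%2F%2Fwww.example.org%2F&hl%5Bremove_scripts%5D=on");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    return 0;
}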

Author Comment by 97WideGlide:
Hmm, maybe I'm asking a question that is too involved.  Sorry if I am.  But it is the details regarding how to take the XML after it is tidied up and creating a valid libcurl POST that I am having trouble with.

I'm attaching the sample markup between the <form>...</form> tags below.  This is the part I want to POST using libcurl.

Thanks,
Curt
 
<form method="post" action="/index.php">
  <table width="570" border="0" align="center" cellpadding="0" cellspacing="0">
    <tr>
      <td>
        <div id="address">
          <div align="left"><input id="address_box" name="q" type="text" class="bar" onfocus="this.select()" value="http://www." /></div>
        </div>
      </td>
      <td><input id="button" type="submit" value="" /></td>
    </tr>
  </table>
  <script type="text/javascript">
  //<![CDATA[
  <!--
  google_ad_client = "pub-9576634561657687";
  google_ad_width = 468;
  google_ad_height = 15;
  google_ad_format = "468x15_0ads_al";
  google_ad_channel = "";
  google_color_border = "FFFFFF";
  google_color_bg = "FFFFFF";
  google_color_link = "ea4b0c";
  google_color_text = "666666";
  google_color_url = "000000";
  //-->
  //]]>
  </script>
  <script type="text/javascript" src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script><br />
  <label><input type="checkbox" name="hl[remove_scripts]" checked="checked" /> Disable JavaScript</label>
  <label><input type="checkbox" name="hl[accept_cookies]" checked="checked" /> Allow Cookies</label>
  <label><input type="checkbox" name="hl[show_images]" checked="checked" /> Show Images</label>
  <label><input type="checkbox" name="hl[base64_encode]" checked="checked" /> Use Base64</label>
  <label><input type="checkbox" name="hl[strip_meta]" checked="checked" /> Strip Meta</label>
  <label><input type="checkbox" name="hl[include_form]" checked="checked" /> Include URL Form</label>
</form>

Expert Comment by Infinity08:
>> But it is the details regarding how to take the XML after it is tidied up and creating a valid libcurl POST that I am having trouble with.

You need to have a few bits of information:

(a) the submit URI, which can be found in the action attribute of the form tag. If it's a relative path, make sure to turn it into a complete URI.

(b) for each input field (input tag in the HTML), the type, name and value. Some of the values will be pre-defined (like the submit button in your example), some will have to be filled in by you.

All this information can be extracted from the HTML. The tricky bit will be to make your code understand it and handle it correctly. I don't know what you're trying to do, but it seems you want to get this to work for any site ... which means you'll have to put some intelligence into your code to understand what each of the fields means.

Maybe if you could describe in a bit more detail what exactly you are trying to do, ...
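
Once you have those name/value pairs out of the tree, assembling the POST body is mostly string work. A rough sketch of gluing URL-encoded pairs together with curl_easy_escape (the form_field struct and the buffer handling are invented for illustration; adapt them to however you store the scraped fields):

#include <string.h>
#include <curl/curl.h>

struct form_field {            /* hypothetical holder for one scraped <input> */
    const char *name;
    const char *value;
};

/* Build "name1=value1&name2=value2..." into buf, URL-encoding each part. */
static void build_post_body(CURL *curl, const struct form_field *fields,
                            int nfields, char *buf, size_t buflen)
{
    int i;

    buf[0] = '\0';
    for (i = 0; i < nfields; i++) {
        char *n = curl_easy_escape(curl, fields[i].name, 0);
        char *v = curl_easy_escape(curl, fields[i].value, 0);

        if (n && v) {
            if (i > 0)
                strncat(buf, "&", buflen - strlen(buf) - 1);
            strncat(buf, n, buflen - strlen(buf) - 1);
            strncat(buf, "=", buflen - strlen(buf) - 1);
            strncat(buf, v, buflen - strlen(buf) - 1);
        }
        curl_free(n);
        curl_free(v);
    }
}

The resulting string then goes to CURLOPT_POSTFIELDS, and the (absolute) URI taken from the action attribute goes to CURLOPT_URL.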

Author Comment by 97WideGlide:
I'll try to provide more detail regarding exactly what I want to do.

If you go to: http://www.publicproxyservers.com/proxy/list_last_test1.html you will see a table of anonymous web proxies.   In a nutshell, under program control, I want to be able to go to each of those proxies in turn and then (through them)  go anonymously to a different site of my choosing.  The steps I have coded already are:

1. Retrieve the web page from the link above and extract the list of proxy URLs.
2. Go to the first proxy in the list and retrieve its web page.

So now, I need to parse the HTML retrieved from the Web Proxy in order to create a valid libcurl POST from the <form method="post"...>...</form> areas of the HTML.

I appreciate you taking an interest in my problem.


Expert Comment by Infinity08:
So, you basically want to extract form information from a web page without prior knowledge of that page. All that is known is that it's supposed to contain a form that allows you to enter a URI and then submit it to get the result. Right?

In other words, your code will have to be quite intelligent to get around all the pitfalls and gotchas involved in this, for example:

(a) some pages might contain more than one form (maybe a search form, or a comment form, or ...) - how will you know which one is the one you need?

(b) some of these forms might require you to fill in more than just the URI, and might have a different way of submitting. How will you know how to use the form?

(c) some of these forms might have multiple fields - how will you find the one that is supposed to contain the URI?

(d) on some pages, JavaScript or other such things might be involved, which complicates the task of submitting automatically.

Just to mention a few.

This is by no means a simple task. For a human user of the website, things are straightforward, since he can see the web page, and understands what he's supposed to do (given all the meta information, and his experience with using a browser). But writing code to perform the same actions is more difficult, since you have to basically make it understand the website, and what's supposed to happen.

An easier approach is probably to gather a list of such web proxies, and all the necessary information, store that in a resource file somewhere (in a format of your choice), and then your code can simply use that resource file to do whatever it needs to do. That way, you perform the actions a human is good at yourself, and leave the rest up to your code.

Author Comment by 97WideGlide:
a) If the page has more than one form tag, I can simply submit the first one found and see if I get the target page back.  If not, try the next one until I get the expected result.

b) A web browser knows how to parse this information and it is pretty well defined at www.w3.org so I am not too concerned about that.  In fact I am just about to break down and just write my own parser.

c) I think this should be pretty easy to guess.  i.e. type="text" (value="" or "http://" or ...)

d) not sure about javascript but I just don't think it can be that difficult.

Am I wrong?

Expert Comment by Infinity08:
>> b) A web browser knows how to parse this information and it is pretty well defined at www.w3.org so I am not too concerned about that.  In fact I am just about to break down and just write my own parser.

A web browser just displays whatever the HTML tells it to. You as the user actually interpret the results, fill in the appropriate fields, and click the appropriate button. That's something that is not part of the W3C standard.


>> c) I think this should be pretty easy to guess.  i.e. type="text" (value="" or "http://" or ...)

What if there are multiple text fields?


>> Am I wrong?

I guess you'll find out ;) I'm not saying it's impossible ... I'm saying it's not trivial, and might involve quite a bit of work to get it working right. If that doesn't scare you off, then go for it! :)

Author Comment by 97WideGlide:
The work doesn't scare me, but getting back to the original question: does anyone know how to parse the <form> tag and create a valid POST via libcurl?

Expert Comment by Infinity08:
>> does anyone know how to parse the <form> tag and create a valid POST via libcurl?

I thought that was what I was answering ;)

But I guess you're looking more for the technical side of things. If so, then check back in the first few posts, where either an HTML parser or an XML parser was suggested. If you run the HTML through such a parser, you'll get a tree as output that represents the structure of the HTML. You can then iterate over that tree to get the elements you want. For specific examples, you can always refer to the documentation for the specific library you intend to use.

Once you have the data you need, you can POST it using libcurl. For that, see the sample code I mentioned earlier.

Author Comment by 97WideGlide:
Damn, I've looked all over the web for hours and can't find any sample code.  Below is what I have so far.  I can get the name of each element but I can't figure out how to get the strings associated with that content.


  

// Here I have already read the remote file into memory and tidied it up with libtidy.
// The retrieved HTML is in output.bp

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

  TidyBuffer output = {0};

  xmlDocPtr doc; /* the resulting document tree */

  doc = xmlReadMemory((const char *)output.bp, output.size, "noname.xml", NULL, 0);
  if (doc == NULL) {
      fprintf(stderr, "Failed to parse document\n");
  }

  // now traverse the tree to find the POST form
  print_element_names(xmlDocGetRootElement(doc));

===========================================================

/* Recursively print the name of every element node in the tree. */
static void
print_element_names(xmlNode *a_node)
{
    xmlNode *cur_node = NULL;

    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        if (cur_node->type == XML_ELEMENT_NODE) {
            printf("%s\n", cur_node->name);
        }
        print_element_names(cur_node->children);
    }
}

Accepted Solution by Infinity08 (earned 500 total points):
>> I can get the name of each element but I can't figure out how to get the strings associated with that content.

What do you mean by "strings"? Which strings?

You probably want to start by finding an element with name "form", and then the child elements with name "input". You can get the attributes of an element using xmlGetProp, as in:

        http://xmlsoft.org/tutorial/ar01s08.html

Note that the rest of that tutorial might also be interesting to you:

        http://xmlsoft.org/tutorial/index.html
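
To make that concrete, here is roughly what reading the attributes off a node looks like, assuming the traversal has already located an element whose name compares equal to "input" (e.g. via xmlStrcmp(node->name, (const xmlChar *)"input") == 0):

#include <stdio.h>
#include <libxml/tree.h>

/* For an element node known to be <input>, print its type, name and value
   attributes (any of them may be absent, in which case xmlGetProp returns NULL). */
static void dump_input_attrs(xmlNode *node)
{
    xmlChar *type  = xmlGetProp(node, (const xmlChar *)"type");
    xmlChar *name  = xmlGetProp(node, (const xmlChar *)"name");
    xmlChar *value = xmlGetProp(node, (const xmlChar *)"value");

    printf("type=%s name=%s value=%s\n",
           type  ? (char *)type  : "(none)",
           name  ? (char *)name  : "(none)",
           value ? (char *)value : "(none)");

    /* xmlGetProp returns newly allocated copies that must be freed */
    if (type)  xmlFree(type);
    if (name)  xmlFree(name);
    if (value) xmlFree(value);
}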

Author Comment by 97WideGlide:
That looks to be exactly what I am looking for.  I'm going to try coding it up in a few hours and I'll report back.   Thank you.

Man, don't know why I couldn't find it.