parse html and submit form

I am trying to use libcurl to access remote web pages via a bank of anonymous web proxies. I can go to a particular web proxy URL and retrieve its page just fine. The problem is that I now need to parse the returned HTML, find out which form (the one with the "submit" input) to use, fill in all associated form fields, and then submit it. I'm trying to use libxml but cannot figure out how to do it. Can anyone give me a clue?


evilrix (Senior Software Engineer, Avast) commented:
Try El-Kabong, it's a very simple (and very forgiving) SAX-style HTML parser.

"El-Kabong is a high-speed, forgiving, sax-style HTML parser. Its aim is to provide consumers with a very fast, clean, lightweight library which parses HTML quickly, while forgiving syntactically incorrect tags."
What I generally recommend is to run the HTML through tidy to clean it up and generate proper XHTML, and then use an XML parser (like libxml) to parse it.

The reason is that a lot of the HTML out there on the internet is full of mistakes, so even a specialized parser can easily "misinterpret" it. Tidy is specialized in cleaning up such HTML.

Of course, you can see if a specialized parser like the one evilrix suggested works for you - if so it would be a bit easier to implement ... I've never tried any, so I can't comment on that :)
97WideGlide (Author) commented:
If we assume that I've got clean XHTML, how would I go about forming the data for a proper SUBMIT via libcurl?

The libcurl distribution includes sample code showing how to perform an HTTP POST.

You'll have to know what the form looks like (ie. what fields it has etc.). And since you have the XHTML page that contains the form, you can simply extract the necessary information from it using an XML parser (or HTML parser).
97WideGlide (Author) commented:
Hmm, maybe I'm asking a question that is too involved. Sorry if I am. But it is the details of how to take the XML, after it is tidied up, and create a valid libcurl POST that I am having trouble with.

I'm attaching sample XML code between the <form>...</form> tags below. This is the part I want to POST using libcurl.

<form method="post" action="/index.php">
<table width="570" border="0" align="center" cellpadding="0"
<div id="address">
<div align="left"><input id="address_box" name="q" type="text"
class="bar" onfocus="" value="http://www." /></div>
<td><input id="button" type="submit" value="" /></td>
<script type="text/javascript">
google_ad_client = "pub-9576634561657687";
google_ad_width = 468;
google_ad_height = 15;
google_ad_format = "468x15_0ads_al";
google_ad_channel = "";
google_color_border = "FFFFFF";
google_color_bg = "FFFFFF";
google_color_link = "ea4b0c";
google_color_text = "666666";
google_color_url = "000000";
</script> <script type="text/javascript" src=
</script><br />
<label><input type="checkbox" name="hl[remove_scripts]" checked=
"checked" /> Disable JavaScript</label> <label><input type=
"checkbox" name="hl[accept_cookies]" checked="checked" /> Allow
Cookies</label> <label><input type="checkbox" name=
"hl[show_images]" checked="checked" /> Show Images</label>
<label><input type="checkbox" name="hl[base64_encode]" checked=
"checked" /> Use Base64</label> <label><input type="checkbox" name=
"hl[strip_meta]" checked="checked" /> Strip Meta</label>
<label><input type="checkbox" name="hl[include_form]" checked=
"checked" /> Include URL Form</label></form>


>> But it is the details regarding how to take the XML after it is tidied up and creating a valid libcurl POST that I am having trouble with.

You need to have a few bits of information :

(a) the submit URI, which can be found in the action attribute of the form tag. If it's a relative path, make sure to turn it into a complete URI.

(b) for each input field (input tag in the HTML), the type, name and value. Some of the values will be pre-defined (like the submit button in your example), some will have to be filled in by you.

All this information can be extracted from the HTML. The tricky bit will be to make your code understand it, and handle it correctly. I don't know what you're trying to do, but it seems you want to get this to work for any site ... which means you'll have to put in some intelligence in your code to understand what each of the fields mean.

Maybe if you could describe in a bit more detail what exactly you are trying to do, ...
97WideGlide (Author) commented:
I'll try to provide more detail regarding exactly what I want to do.

If you go to the proxy-list page, you will see a table of anonymous web proxies. In a nutshell, under program control, I want to be able to go to each of those proxies in turn and then (through them) go anonymously to a different site of my choosing. The steps I have coded already are:

1-retrieve web page from link above and extract list of URLs of proxies
2-go to the first proxy in the list and retrieve its Web page.

So now, I need to parse the HTML retrieved from the Web Proxy in order to create a valid libcurl POST from the <form method="post"...>...</form> areas of the HTML.

I appreciate you taking an interest in my problem.

So, you basically want to extract form information from a web page without prior knowledge of that page. All that is known is that it's supposed to contain a form that allows you to enter a URI, and then submit it to get the result. Right?

In other words, your code will have to be quite intelligent to get around all pitfalls and gotchas involved in this, for example :

(a) some pages might contain more than one form (maybe a search form, or a comment form, or ...) - how will you know which one is the one you need ?

(b) some of these forms might require you to fill in more than just the URI, and might have a different way of submitting. How will you know how to use the form ?

(c) some of these forms might have multiple fields - how will you find the one that is supposed to contain the URI ?

(d) on some pages, JavaScript or other such things might be involved, which complicates the task of submitting automatically.

Just to mention a few.

This is by no means a simple task. For a human user of the website, things are straightforward, since he can see the web page, and understands what he's supposed to do (given all the meta information, and his experience with using a browser). But writing code to perform the same actions is more difficult, since you have to basically make it understand the website, and what's supposed to happen.

An easier approach is probably to gather a list of such web proxies, and all the necessary information, store that in a resource file somewhere (in a format of your choice), and then your code can simply use that resource file to do whatever it needs to do. That way, you perform the actions a human is good at yourself, and leave the rest up to your code.
97WideGlide (Author) commented:
a) If the page has > 1 form tag I can simply submit the first one found and see if I get the target page back.  If not, try the following one until I get the expected result.

b) A web browser knows how to parse this information, and it is pretty well defined, so I am not too concerned about that.  In fact I am just about to break down and write my own parser.

c) I think this should be pretty easy to guess.  i.e. type="text" (value="" or "http://" or ...)

d) not sure about javascript but I just don't think it can be that difficult.

Am I wrong?
>> b) A web browser knows how to parse this information, and it is pretty well defined, so I am not too concerned about that.  In fact I am just about to break down and write my own parser.

A web browser just displays whatever the HTML tells it to. You as the user actually interpret the results, fill in the appropriate fields, and click the appropriate button. That's something that is not part of the W3C standard.

>> c) I think this should be pretty easy to guess.  i.e. type="text" (value="" or "http://" or ...)

What if there are multiple text fields?

>> Am I wrong?

I guess you'll find out ;) I'm not saying it's impossible ... I'm saying it's not trivial, and might involve quite a bit of work to get it working right. If that doesn't scare you off, then go for it ! :)
97WideGlide (Author) commented:
The work doesn't scare me, but getting back to the original question: does anyone know how to parse the <form> tag and create a valid POST via libcurl?
>> does anyone know how to parse the <form> tag and create a valid POST via libcurl ?

I thought that was what I was answering ;)

But I guess you're looking more for the technical side of things. If so, then check back in the first few posts, where either an HTML parser or an XML parser was suggested. If you run the HTML through such a parser, you'll get a tree as output that represents the structure of the HTML. You can then iterate over that tree to get the elements you want. For specific examples, you can always refer to the documentation for the specific library you intend to use.

Once you have the data you need, you can POST it using libcurl. For that, see the sample code I mentioned earlier.
97WideGlide (Author) commented:
Damn, I've looked all over the web for hours and can't find any sample code.  Below is what I have so far.  I can get the name of each element but I can't figure out how to get the strings associated with that content.

// Here I have already read the remote file into memory and tidied it up with libtidy.
//      The retrieved HTML is in output.bp
  TidyBuffer output = {0};
  xmlDocPtr doc; /* the resulting document tree */
  doc = xmlReadMemory((const char *)output.bp, output.size, "noname.xml", NULL, 0);
  if (doc == NULL) {
      fprintf(stderr, "Failed to parse document\n");
  }

  // now traverse the tree to find POST
static void
print_element_names(xmlNode *a_node)
{
    xmlNode *cur_node = NULL;
    for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
        if (cur_node->type == XML_ELEMENT_NODE) {
            printf("%s\n", cur_node->name);
        }
        print_element_names(cur_node->children);
    }
}


>> I can get the name of each element but I can't figure out how to get the strings associated with that content.

What do you mean by "strings" ? Which strings ?

You probably want to start by finding an element with name "form", and then the child elements with name "input". You can get attributes of an element using xmlGetProp.

Note that the rest of the libxml tutorial might also be interesting to you.

97WideGlide (Author) commented:
That looks to be exactly what I am looking for.  I'm going to try coding it up in a few hours and I'll report back.   Thank you.

Man, don't know why I couldn't find it.