C Programming: Getting HTML content from URL

I have a daemon built in C that listens for incoming requests from a client. An example request might be: "www.facebook.com"

The requirements are:
The output of the program will be the HTTP server response minus the HTTP header.  I.e., only the HTML content associated with the URL is emitted on standard output.

Question:
How do I get HTML content from a website in C, and store it in a string?
Asked by: pzozulka
 
Dave Baldwin (Fixer of Problems) commented:
Then, since I assume this is a simplified exercise, you probably 'write' the GET command in proper format to a specific IP address on port 80, which is the standard HTTP port.  Then you 'read' the response.  This shows a simple example session: http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Example_session
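
As a rough illustration of that flow, a minimal client could look something like the sketch below. Treat it as a sketch under assumptions, not a finished solution: the host name is just a placeholder, most error handling is omitted, and HTTP/1.0 is requested so the server closes the connection when it's done, which keeps the read loop simple.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    const char *host = "www.example.com";   /* placeholder host */
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;            /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;        /* TCP */

    /* Resolve the host name and connect to port 80 (HTTP) */
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return 1;
    int sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sockfd < 0 || connect(sockfd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;
    freeaddrinfo(res);

    /* Header lines end in \r\n; a blank line ends the request */
    char request[512];
    snprintf(request, sizeof(request),
             "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
    write(sockfd, request, strlen(request));

    /* Read until the server closes the connection; this prints the
       headers too, and stripping them is a separate step */
    char buf[4096];
    ssize_t n;
    while ((n = read(sockfd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(sockfd);
    return 0;
}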
 
käµfm³d 👽 commented:
In what environment/platform?
 
pzozulka (Author) commented:
Linux
 
Dave Baldwin (Fixer of Problems) commented:
That does not make any sense.  If you are listening for 'requests', then you are the one that is supposed to send the 'response'.
 
pzozulka (Author) commented:
The project requires the daemon to listen for requests from a client for a URL. The server's/daemon's job is to go fetch the HTML from that URL, save the HTML to a string, and return the HTML string back to the requesting client.
 
Dave Baldwin (Fixer of Problems) commented:
If the client is a web browser, it needs the headers to determine what to do with the content.
 
pzozulka (Author) commented:
The client is a C program that uses a socket to attempt a connection to the server's daemon listening on port 9000. Once the connection is made, the client sends a string "www.facebook.com" to the daemon. None of this is important, though. My question has more to do with getting HTML content from a URL using C. How? Any tips?
 
Dave Baldwin (Fixer of Problems) commented:
It would be the same way any other HTTP client does it: by making an HTTP request to the remote server and receiving the response headers followed by the content.  Basically, you are doing the first steps that a web browser does to get a page.

The most complete software other than a browser for downloading web pages is 'wget'.  The source code for it is available so you can use it in your project.  http://www.gnu.org/software/wget/
 
Zoppo commented:
Hi pzozulka,

A widely used library for sending HTTP requests and receiving responses on the client side is libcurl (http://curl.haxx.se/libcurl/).

Here's a very short sample of how it is used: http://curl.haxx.se/libcurl/c/simple.html
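
Stripped down, that sample amounts to roughly the following (assuming libcurl and its headers are installed; link with -lcurl). By default, curl_easy_perform() writes the received body to stdout, and the URL here is just a placeholder:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    CURL *curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
        CURLcode res = curl_easy_perform(curl); /* fetch; body goes to stdout */
        if (res != CURLE_OK)
            fprintf(stderr, "curl_easy_perform() failed: %s\n",
                    curl_easy_strerror(res));
        curl_easy_cleanup(curl);
    }
    return 0;
}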

Hope this helps,

ZOPPO
 
pzozulka (Author) commented:
Since this is for a school project, we are using the school's Linux server and have no control over it. In other words, we have to use functionality that is already built in.

As for wget, I looked at the source file and it's over a quarter million lines of code. It's going to take me a while to find what I'm looking for.

I don't think this is the route our professor had in mind. I have a feeling that, since every HTML web page is in reality simply a text file with an .html extension (I'm not worried about dynamic web content), the project simply needs to read() the file.

Do you have any other ideas? Something simpler that's already built in? I just need to connect to another machine and request to read a file.
 
pzozulka (Author) commented:
I found a similar question on Stack Overflow, and one of the answers seems to hit the mark:
HTTP is built on top of TCP. If you know socket programming, you can write a simple networking application that opens a socket to the desired server and issues an HTTP GET command. Whatever the server responds with, you'll have to remove the HTTP headers that precede the actual document you want.

I already know how to create a socket program. In fact, that's how my client connects to my server daemon; they use read() and write() to communicate back and forth through file descriptors.

Now, can someone guide me on how to "issue an HTTP GET command"?
 
Zoppo commented:
Well, maybe it's worth checking whether libcurl already exists on your machine. It is used by a lot of applications, so it may already be installed as a dependency of some other package.

BTW: Is it an option for you to just call a command line tool?

ZOPPO
 
pzozulka (Author) commented:
I just checked, and libcurl is NOT on that machine. Also, all the work needs to be done in C, using my own code. This is supposed to be a learning exercise, so I don't think the professor would want us to use any existing command-line tools.

After doing a bit of googling, I now know that what I'm looking for is to send an HTTP GET command across the socket. That's what I need help with.
 
pzozulka (Author) commented:
Perfect, this is exactly what I was looking for. So just to clarify, would I make a char pointer, and then write() to the socket connection like this?

char *getCMD = "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"; /* CRLF line endings; blank line ends the request */
write(sockfd, getCMD, strlen(getCMD));

Then do a read() on sockfd to get the HTTP response from the web server?
 
Dave Baldwin (Fixer of Problems) commented:
That sounds about right.  Try it and see.
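
One thing to watch for when you read(): the server sends the headers first, then a blank line (\r\n\r\n), then the content, and your assignment wants only the content. Here is a rough sketch of stripping the headers, under two simplifying assumptions: the whole response fits in one buffer, and the server closes the connection when done (requesting HTTP/1.0, or sending "Connection: close", makes that likely).

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Reads the whole response from an already-connected socket and prints
   only the body. Simplified: a real program would grow the buffer. */
void print_body(int sockfd)
{
    static char resp[65536];
    size_t total = 0;
    ssize_t n;

    while ((n = read(sockfd, resp + total, sizeof(resp) - 1 - total)) > 0)
        total += (size_t)n;                 /* keep reading until the server closes */
    resp[total] = '\0';

    char *body = strstr(resp, "\r\n\r\n");  /* blank line marks the end of the headers */
    if (body)
        fputs(body + 4, stdout);            /* HTML only, headers stripped */
}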
 
pzozulka (Author) commented:
So after reading the RFCs, I understand now that:
Request-Line = {Method} {Request-URI} {HTTP-Version}

Example:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org

What would be the Request-URI for www.facebook.com, or www.google.com, or www.twitter.com?

In none of those examples am I looking for a specific .html document, and there would be no way for me to tell whether those sites use index.html, index.htm, index.php, etc.
 
Dave Baldwin (Fixer of Problems) commented:
This is what Firefox sends for a request to Facebook.  The response you are going to get is probably going to be a 301 Redirect to 'https' instead of a 200 followed by content.  Same with Google and Twitter.  You should probably pick a site that does not use 'https' since your C program won't be able to negotiate a secure connection.
GET / HTTP/1.1
Host: www.facebook.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Firefox/24.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
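
For reference, the start of such a redirect response typically looks something like this (the exact headers vary by server):

HTTP/1.1 301 Moved Permanently
Location: https://www.facebook.com/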

 
masheik commented:
char *getCMD = "GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
write(sockfd, getCMD, strlen(getCMD));

If you want to get the URL or IP address of the web server from the user, you can do something like the following:

 char getReq[255];
 char webServer[255]; /* assume this contains either an IP address or a host name */

 if (validateIP(webServer)) /* you need to write a function to validate the IP address */
 {
    snprintf(getReq, sizeof(getReq), "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", webServer);
 }
 else
 {
    /* need to verify whether it is a valid host name (DNS) first */
    snprintf(getReq, sizeof(getReq), "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", webServer);
 }

Once that is done, you can use the write() call to send the request.

On the server end, you need to wait for incoming connections from the client; you can do that using the accept() call:

while (1) /* serve clients forever */
{
        char get[255];
        char path[255];
        char http[255];
        char reqMsg[255];

        int fd = accept(listenfd, NULL, NULL); /* listenfd is your listening socket; check for errors in real code */

        ssize_t n = recv(fd, reqMsg, sizeof(reqMsg) - 1, 0); /* receive the HTTP GET request */
        if (n > 0)
        {
                reqMsg[n] = '\0';
                sscanf(reqMsg, "%254s %254s %254s", get, path, http); /* parse method, path, and version */
                /* print get, path, and http now */
        }

        /* Next step is to send the requested data to the client */

        /* Frame the data to be sent */

        /* If there is an error, send an error code such as 404 */

        close(fd); /* finally, close the socket */
}
