Solved

C Programming: Getting HTML content from URL

Posted on 2013-10-24
Last Modified: 2013-10-29
I have a daemon built in C that listens for incoming requests from a client. An example request might be: "www.facebook.com"

The requirements are:
The output of the program will be the HTTP server response minus the HTTP header.  I.e., only the HTML content associated with the URL is emitted on standard output.

Question:
How do I get HTML content from a website in C, and store it in a string?
Question by:pzozulka

Expert Comment

by:käµfm³d 👽
ID: 39599145
In what environment/platform?

Author Comment

by:pzozulka
ID: 39599154
Linux

Expert Comment

by:Dave Baldwin
ID: 39599225
That does not make any sense.  If you are listening for 'requests', then you are the one that is supposed to send the 'response'.

Author Comment

by:pzozulka
ID: 39599245
The project requires the daemon to listen for client requests, each naming a URL. The daemon's job is to fetch the HTML from that URL, save it to a string, and return the HTML string to the requesting client.

Expert Comment

by:Dave Baldwin
ID: 39599312
If the client is a web browser, it needs the headers to determine what to do with the content.

Author Comment

by:pzozulka
ID: 39599344
The client is a C program that uses a socket to connect to the server's daemon listening on port 9000. Once the connection is made, the client sends a string such as "www.facebook.com" to the daemon. None of this is important though. My question has more to do with getting HTML content from a URL using C. How? Any tips?

Expert Comment

by:Dave Baldwin
ID: 39599388
It would be the same way that any other HTTP client does it and that is by making an HTTP request to the remote server and receiving the response followed by the content.  Basically you are doing the first steps that a web browser does to get a page.

The most complete software other than a browser for downloading web pages is 'wget'.  The source code for it is available so you can use it in your project.  http://www.gnu.org/software/wget/

Expert Comment

by:Zoppo
ID: 39599652
Hi pzozulka,

A widely used library for requesting and receiving HTTP on the client side is libcurl (http://curl.haxx.se/libcurl/).

Here's a very short sample of how it is used: http://curl.haxx.se/libcurl/c/simple.html

Hope this helps,

ZOPPO

Author Comment

by:pzozulka
ID: 39600697
Since this is for a school project, we are using their Linux server and have no control over it -- in other words, we have to use functionality that is already built in.

As for wget, I looked at the source file and it's over a quarter million lines of code. It's going to take me a while to find what I'm looking for.

I don't think this is the route our professor had in mind. I have a feeling that, since every HTML webpage is in reality simply a text file with a .html extension (not worried about dynamic web content), the project simply needs to read() the file.

Do you have any other ideas? Something simpler that's already built in? I just need to connect to another machine and request to read a file.

Author Comment

by:pzozulka
ID: 39600708
I found a similar question on Stackoverflow and one of the answers seems to hit the mark:
HTTP is built on top of TCP. If you know socket programming, you can write a simple networking application that opens a socket to the desired server and issues an HTTP GET command. Whatever the server responds with, you'll have to remove the HTTP headers that precede the actual document you want.

I already know how to create a socket program. In fact, that's how my client connects to my server daemon, and they use read() and write() on file descriptors to communicate back and forth.

Now, can someone guide me on how to -- "issue an HTTP GET command".
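
The header-removal step from the quoted answer could be sketched as follows. This is only an illustration: http_body is a hypothetical helper name, and it assumes the whole response is already in one NUL-terminated string. HTTP headers end at the first blank line, i.e. the byte sequence "\r\n\r\n".

```c
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the first byte after the blank line that ends the
 * HTTP headers, or NULL if that blank line is not present in the string.
 * response must be NUL-terminated. (http_body is a hypothetical name.) */
const char *http_body(const char *response)
{
    const char *end = strstr(response, "\r\n\r\n");
    return end ? end + 4 : NULL;
}
```

A caveat: an HTTP/1.1 server may send the body with chunked transfer encoding, in which case the text after the headers still contains chunk-size lines. Sending the request as HTTP/1.0 avoids that.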

Expert Comment

by:Zoppo
ID: 39600715
Well, maybe it's worth checking whether libcurl already exists on your machine. It is used by a lot of applications, so it has probably been installed already as a dependency of some other package.

BTW: Is it an option for you to just call a command line tool?

ZOPPO

Author Comment

by:pzozulka
ID: 39600842
I just checked, and libcurl is NOT on that machine. Also, all work needs to be done in C, using my own code. This is supposed to be a learning exercise, so I don't think the professor would want us to use any existing command line tools.

After doing a bit of googling, I now know that what I'm looking for is how to send an HTTP GET command across the socket. That's what I need help with.

Accepted Solution

by:
Dave Baldwin earned 333 total points
ID: 39600996
Then since I assume this is a simplified exercise, you probably 'write' the GET command in proper format to a specific IP address on port 80 which is the standard HTTP port.  Then you 'read' the response.  This shows a 'simple example': http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Example_session
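
A minimal sketch of the connect-and-write part, assuming a POSIX system such as the Linux server mentioned above. connect_to_host and write_all are hypothetical helper names, not something from the assignment. getaddrinfo() turns a host name like "www.example.com" into an address, so you do not need to know the IP in advance.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolves host and opens a TCP connection to the given port ("80" for HTTP).
 * Returns a connected socket descriptor, or -1 on failure. */
int connect_to_host(const char *host, const char *port)
{
    struct addrinfo hints, *res, *p;
    int fd = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try each returned address until one connects. */
    for (p = res; p != NULL; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd == -1)
            continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}

/* write() may accept fewer bytes than requested, so loop until all are sent.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Typical use would be sockfd = connect_to_host("www.example.com", "80") followed by write_all(sockfd, request, strlen(request)), then reading the reply from sockfd.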

Author Comment

by:pzozulka
ID: 39601058
Perfect, this is exactly what I was looking for. So just to clarify, would I make a char pointer, and then write() to the socket connection like this?

char *getCMD = "GET /index.html HTTP/1.1\nHost: www.example.com";
write(sockfd, getCMD, strlen(getCMD));

Then do a read() on sockfd to get the HTTP response from the web server?
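
For the read side, a single read() is not guaranteed to return the whole response, so it is usually done in a loop. A rough sketch, assuming the server closes the connection when it has finished sending (read_all is a hypothetical helper name):

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads from fd until EOF and returns a malloc'd, NUL-terminated buffer.
 * Stores the number of bytes read in *out_len when out_len is not NULL.
 * Returns NULL on error. The caller must free() the result. */
char *read_all(int fd, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (len + 1 >= cap) {            /* keep one byte for the terminator */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        ssize_t n = read(fd, buf + len, cap - len - 1);
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;                       /* peer closed the connection */
        len += (size_t)n;
    }
    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

The loop only ends when the server closes the connection, which is why a request sent as HTTP/1.0 or with a "Connection: close" header is the simpler choice here.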

Expert Comment

by:Dave Baldwin
ID: 39601175
That sounds about right.  Try it and see.
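
One detail to watch: HTTP ends each header line with "\r\n" (CRLF), and the request is only complete once an empty line follows the last header. A request that stops after the Host line with no blank line will leave most servers waiting for more input. A hedged sketch of building the request string (build_get_request is a hypothetical helper name; the "Connection: close" header is my addition so the server closes the socket when it is done):

```c
#include <stdio.h>
#include <string.h>

/* Formats a minimal HTTP/1.1 GET request into buf.
 * Returns the request length, or -1 if buf is too small. */
int build_get_request(char *buf, size_t size, const char *host, const char *path)
{
    int n = snprintf(buf, size,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n",              /* empty line ends the headers */
                     path, host);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

The returned length can be passed straight to write(), e.g. write(sockfd, buf, n).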

Author Comment

by:pzozulka
ID: 39603048
So after reading the RFCs, I now understand that:
Request Line = {Method}     {Request-URI}    {HTTP version}

Example:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org

What would be the Request-URI for www.facebook.com, or www.google.com, or www.twitter.com?

In none of those examples am I looking for a specific .html document, and there would be no way for me to tell whether those sites use index.html, index.htm, index.php, etc.
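
For reference, when the client supplies only a host name, the Request-URI is the root path "/", and the server decides which document that maps to. A small sketch of splitting the client's string into host and path, assuming the input has no "http://" prefix, as in the examples above (split_host_path is a hypothetical helper name):

```c
#include <stddef.h>
#include <string.h>

/* Splits "host" or "host/path" into host and path; path defaults to "/".
 * Assumes the input carries no scheme prefix such as "http://".
 * Returns 0 on success, -1 if the host is empty or a buffer is too small. */
int split_host_path(const char *url, char *host, size_t hsize,
                    char *path, size_t psize)
{
    const char *slash = strchr(url, '/');
    size_t hlen = slash ? (size_t)(slash - url) : strlen(url);
    const char *p = slash ? slash : "/";

    if (hlen == 0 || hlen >= hsize || strlen(p) >= psize)
        return -1;

    memcpy(host, url, hlen);
    host[hlen] = '\0';
    strcpy(path, p);          /* length was checked against psize above */
    return 0;
}
```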

Assisted Solution

by:Dave Baldwin
Dave Baldwin earned 333 total points
ID: 39603092
This is what Firefox sends for a request to Facebook.  The response you are going to get is probably going to be a 301 Redirect to 'https' instead of a 200 followed by content.  Same with Google and Twitter.  You should probably pick a site that does not use 'https' since your C program won't be able to negotiate a secure connection.
GET / HTTP/1.1
Host: www.facebook.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:24.0) Gecko/20100101 Firefox/24.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
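
Since the reply may be a redirect rather than the page itself, it can help to check the status code on the first line of the response before treating the rest as HTML. A minimal sketch (http_status is a hypothetical helper name):

```c
#include <stdio.h>

/* Extracts the numeric status code from the first line of an HTTP response,
 * e.g. 200 or 301. Returns -1 if the status line cannot be parsed. */
int http_status(const char *response)
{
    int major, minor, code;

    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}
```

A 301 or 302 result means the real location is given in the response's Location header rather than in the body.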


Assisted Solution

by:masheik
masheik earned 167 total points
ID: 39607927
char *getCMD = "GET /index.html HTTP/1.1\nHost: www.example.com";
write(sockfd, getCMD, strlen(getCMD));


If you want to get the host name or IP address of the web server from the user, you can do something like the following:

 char getReq[512];
 char webServer[255]; // Assume this contains either an IP address or a host name
 if ( validateIP(webServer))  // You need to write a function to validate the IP address
 { 
    // HTTP header lines end with \r\n, and an empty line ends the request
    snprintf(getReq, sizeof(getReq), "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", webServer); 
 } 
 else 
 { 
    // Need to verify that it is a host name that DNS can resolve
    snprintf(getReq, sizeof(getReq), "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", webServer); 
 }


Once that is done, you can use the write() call to send the request.

On the server end, you need to wait for incoming connections from the client, which you can do with the accept() call:

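while(1) // Infinite loop
{
        char reqMsg[1024];
        char get[255];
        char path[255];
        char http[255];

        // listenfd is assumed to be a socket already set up with bind() and listen()
        int fd = accept(listenfd, NULL, NULL);
        if (fd < 0)
                continue;

        // Receive the HTTP GET request, leaving room for the terminator
        ssize_t n = recv(fd, reqMsg, sizeof(reqMsg) - 1, 0);
        if (n <= 0)
        {
                close(fd);
                continue;
        }
        reqMsg[n] = '\0';

        // Parse the request line; the widths keep sscanf within each buffer
        sscanf(reqMsg, "%254s %254s %254s", get, path, http);

        // print get, path and http now

        // Next step is to send the requested data to the client

        // Frame the data to be sent

        // If there is an error, send the error code, like 404 etc.
        // Finally, close the socket
        close(fd);
}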
