Terminator4
asked on
Fastest httpstring emails extractor
Hi, I need to extract the email addresses from a big string (a Web page).
The function should remove duplicates from the results,
then return a string with the emails separated by whitespace.
I will give all the points to whoever makes this function run the fastest.
In C language
char* returnEmails(char* httpPage){
}
input: a Web page
output: "johndoe@hotmail.com johnsmith@gmail.com"
In case this is homework, I'll just give you some hints (since you mention C and not C++, I'll stick to C) :
1) you can use strchr to look for the '@' character :
http://www.cplusplus.com/reference/clibrary/cstring/strchr.html
2) you can use strtok with " " as the delimiter string to extract the whole email address. You'll possibly have to add delimiters other than ' ', like '(', ')', etc.
http://www.cplusplus.com/reference/clibrary/cstring/strtok.html
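note that strtok modifies the array. So, an alternative could be to use strncpy with a limited length (strncpy does not add the terminating '\0' when the source is longer than that length, so add it yourself) :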
http://www.cplusplus.com/reference/clibrary/cstring/strncpy.html
3) you can store the e-mail addresses you find in an array of strings, then sort that array, and remove the duplicates
4) finally, just create the result string, by placing all strings from the array into it, separated by spaces. You could use sprintf for that :
http://www.cplusplus.com/reference/clibrary/cstdio/sprintf.html
Could you tell if you are restricted to C ? If you can use C++, there are several easier ways of doing this.
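Hints 3 and 4 could be sketched roughly like this. It is only an illustration, not the one right way to do it: the names join_unique and cmp_str are made up for the example, it assumes the addresses have already been collected into an array of strings, and it copies with memcpy instead of sprintf because the lengths are already known.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of char* (hypothetical helper name). */
static int cmp_str(const void *a, const void *b) {
    const char *const *sa = (const char *const *)a;
    const char *const *sb = (const char *const *)b;
    return strcmp(*sa, *sb);
}

/* Sorts items[0..n-1] in place, skips adjacent duplicates, and returns a
   newly allocated string with the unique entries separated by spaces.
   Returns NULL if allocation fails. The caller frees the result. */
char *join_unique(char **items, size_t n) {
    size_t i, total = 1;               /* 1 for the terminating '\0' */
    char *out, *p;

    assert(items != NULL || n == 0);
    if (n > 0) qsort(items, n, sizeof *items, cmp_str);
    for (i = 0; i < n; ++i)
        total += strlen(items[i]) + 1; /* upper bound: entry plus separator */

    out = malloc(total);
    if (out == NULL) return NULL;
    p = out;
    for (i = 0; i < n; ++i) {
        size_t len;
        if (i > 0 && strcmp(items[i], items[i - 1]) == 0)
            continue;                  /* same as the previous entry: skip */
        if (p != out) *p++ = ' ';      /* separator between entries */
        len = strlen(items[i]);
        memcpy(p, items[i], len);
        p += len;
    }
    *p = '\0';
    return out;
}
```

After sorting, equal addresses sit next to each other, so one comparison with the previous entry is enough to drop the duplicates.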
ASKER
Yes this is restricted to C, can you guys make functional code out of this?
It's a one-liner in Perl :)
It's kinda hard to do in C, as it involves sorting.
Also you need to think about equivalent names: Joe_Blow@AOL.com is equivalent to JOE_BLOW@Mail.AOL.Com
That's hard to do as it involves DNS lookups.
As I said : this sounds like an assignment (correct me if I'm wrong), so if you can give it a try with the hints I gave, then we'll be glad to help you along with any problems/questions you might still have.
ASKER
THIS IS NOT A HOMEWORK
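#include <string.h>
char* returnEmails(char* httpPage){
    char* array = NULL; /* result string, still to be allocated and filled */
    char* at = httpPage;
    /* strchr to look for each '@' character, one after the other */
    while ((at = strchr(at, '@')) != NULL) {
        /* TODO: get the characters before and after the '@'? */
        /* TODO: put the found email (in lower case) in array */
        ++at; /* continue searching after this '@' */
    }
    return array;
}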
ASKER
Sorry
ASKER
I think I got carried away because I thought being homework of not has nothing to do with the programming question.
ASKER
lol, the last sentence, change "of" for "or".
Here's how you COULD do it :
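#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* strchr, strspn, strncpy, strstr, strlen, strcpy */
/* characters allowed in the domain part */
#define DOMAIN_CHARS "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
/* characters allowed in an unquoted local part (kept for reference : the loop below tests the excluded characters instead) */
#define LOCAL_CHARS "!#$%&\'*+-./0123456789=?ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz{|}~"
char *returnEmails(char *httpPage) {
    char *httpPtr = httpPage;
    /* start with an empty result string */
    char *addresses = (char*) calloc(1, sizeof(char));
    addresses[0] = '\0';
    while ((httpPtr = strchr(httpPtr, '@'))) {
        char *new_address = 0;
        int total_len = 0;
        /* domain : the longest run of domain characters after the '@' */
        int domain_len = (int) strspn(httpPtr + 1, DOMAIN_CHARS);
        int local_len = 0;
        /* local part : walk backwards from the '@' until a delimiter,
           a non-printable character, or the start of the page */
        while (--httpPtr >= httpPage) {
            char c = *httpPtr;
            if ((0x20 <= c) && (c < 0x7F)) {
                switch (c) {
                    case ' ' :
                    case '\"' :
                    case '(' :
                    case ')' :
                    case ',' :
                    case ':' :
                    case ';' :
                    case '<' :
                    case '>' :
                    case '@' :
                    case '[' :
                    case '\\' :
                    case ']' :
                        break;
                    default :
                        ++local_len;
                        continue;
                }
            }
            break;
        }
        ++httpPtr; /* back to the first character of the address */
        total_len = local_len + 1 + domain_len;
        new_address = (char*) calloc(total_len + 1, sizeof(char));
        strncpy(new_address, httpPtr, total_len);
        new_address[total_len] = '\0';
        /* only add the address if it is not in the result yet */
        if (!strstr(addresses, new_address)) {
            int cur_len = (int) strlen(addresses);
            /* old string + separating space + new address + terminating '\0' */
            addresses = (char*) realloc(addresses, cur_len + 1 + total_len + 1);
            if (cur_len > 0) {
                addresses[cur_len++] = ' '; /* no leading space before the first address */
            }
            strcpy(addresses + cur_len, new_address);
        }
        free(new_address); /* the temporary copy is no longer needed */
        httpPtr += total_len;
    }
    return addresses;
}
int main(void) {
    char *httpPage = "test@yahoo.com sdjrgfjhxb and test2@yahoo-2.com fg 45621; <>te+!&st3@yahoo.com(test@yahoo.com) ";
    char *emails = returnEmails(httpPage);
    /* print through "%s" : the addresses can contain '%', so they must not be used as the format string */
    fprintf(stdout, "%s\n", emails);
    free(emails);
    system("PAUSE"); /* Windows only */
    return 0;
}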
Take a look, and see if you understand what it's doing. Feel free to ask questions if you want.
You'll probably have to fine-tune it to fit your exact needs, but this can at least be a start.
Note that I wrote this code mostly for clarity (to clearly show what's happening), and not for speed in the first place - although it will be quite fast.
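One thing that might need fine-tuning is the duplicate check : strstr also matches inside a longer address, so "st@yahoo.com" would be treated as a duplicate once "test@yahoo.com" is in the result. A possible refinement, as a sketch only (contains_token is a made-up helper name, not part of the code above), is to accept a match only when it is a complete space-separated entry :

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `token` appears in the space-separated `list` as a complete
   entry, 0 otherwise. */
int contains_token(const char *list, const char *token) {
    size_t len;
    const char *p;

    assert(list != NULL && token != NULL);
    len = strlen(token);
    if (len == 0) return 0;
    p = list;
    while ((p = strstr(p, token)) != NULL) {
        /* the match must start at the list start or right after a space */
        int starts = (p == list) || (p[-1] == ' ');
        /* and must end at a space or at the end of the list */
        int ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends) return 1;
        ++p; /* partial match : keep searching after this position */
    }
    return 0;
}
```

It could be called in place of the strstr test, for example if (!contains_token(addresses, new_address)). It still scans the whole result for every address, so for very large pages the sort-and-remove approach from the hints is likely to scale better.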
um, in the general case, e-mail addresses can be a whole lot more complex than you're assuming.
There can be parts in quotes, in angle brackets, and in parentheses.
I checked RFC 2822 on it ... I made some minor simplifications (that shouldn't matter), and as I said, probably the algorithm will have to be refined. But it will extract e-mail addresses from an HTML page. Did you notice something basic that I overlooked ?
>> There can be parts in quotes, in angle brackets, and in parentheses.
btw, I know that the local part of an e-mail address can be in quotes (which is very rarely done), but have never heard of e-mail addresses with angle brackets or parentheses
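As an example of the kind of refinement meant here, the extracted strings could be passed through a few cheap sanity checks before they are accepted. This is just a sketch under simplifying assumptions : looks_like_address is a made-up name, the rules are much looser than RFC 2822, and quoted local parts are not handled at all.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `addr` passes a few cheap sanity checks, 0 otherwise.
   Deliberately much looser than RFC 2822. */
int looks_like_address(const char *addr) {
    const char *at, *domain;
    size_t dlen;

    assert(addr != NULL);
    at = strchr(addr, '@');
    if (at == NULL || at == addr) return 0;     /* no '@', or empty local part */
    if (strchr(at + 1, '@') != NULL) return 0;  /* more than one '@' */
    domain = at + 1;
    dlen = strlen(domain);
    if (dlen < 3) return 0;                     /* shortest accepted form is "a.b" */
    if (strchr(domain, '.') == NULL) return 0;  /* require at least one dot */
    if (domain[0] == '.' || domain[0] == '-') return 0;
    if (domain[dlen - 1] == '.' || domain[dlen - 1] == '-') return 0;
    return 1;
}
```

A check like this would, for instance, reject an address that picked up the full stop at the end of a sentence ("test@yahoo.com."), so the caller could trim it and try again.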
ASKER
>> Did you notice something basic that I overlooked ?
Did you forget to convert the emails in the result array to lower case, so that there are fewer duplicates?
ASKER CERTIFIED SOLUTION
Stilgar.