Terminator4


Fastest httpstring emails extractor

Hi, I need to extract the email addresses from a big string (the contents of a web page),
then remove the duplicates from the results.
The function then returns a string with the emails separated by whitespace.
I will give all the points to the person who makes this function work the fastest.

In C language


char* returnEmails(char* httpPage){




}

input: a Web page

output: "johndoe@hotmail.com johnsmith@gmail.com"
_Stilgar_

I would use RegEx for parsing, put each within a vector / array, and then output it however I want.

Stilgar.
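For reference, a minimal sketch of the regex idea in C, assuming a POSIX system: regcomp and regexec come from <regex.h>, which is not part of standard C and is not available with every Windows compiler. The function name print_emails and the pattern are illustrative assumptions only; the pattern is a deliberately simple approximation and does not cover everything RFC 2822 allows.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: prints every match on its own line to `out`
 * and returns the number of matches, or -1 if the pattern fails to compile. */
int print_emails(const char *text, FILE *out) {
    /* Simplified pattern (an assumption, not a full RFC 2822 matcher). */
    const char *pattern = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    /* Each successful match is non-empty, so p always moves forward. */
    while (regexec(&re, p, 1, &m, 0) == 0) {
        fprintf(out, "%.*s\n", (int)(m.rm_eo - m.rm_so), p + m.rm_so);
        p += m.rm_eo;
        ++count;
    }
    regfree(&re);
    return count;
}
```

Calling print_emails(page, stdout) lists the candidates; removing duplicates and joining them into one string would still have to be done separately.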
In case this is homework, I'll just give you some hints (since you mention C and not C++, I'll stick to C) :

1) you can use strchr to look for the '@' character :

      http://www.cplusplus.com/reference/clibrary/cstring/strchr.html

2) you can use strtok with ' ' as separator to extract the whole email address. Possibly you'll have to add separators other than ' ', like '(', ')', etc.

      http://www.cplusplus.com/reference/clibrary/cstring/strtok.html

    note that strtok modifies the array. So, an alternative could be to use strncpy with a limited length :

      http://www.cplusplus.com/reference/clibrary/cstring/strncpy.html

3) you can store the e-mail addresses you find in an array of strings, then sort that array, and remove duplicates (after sorting, identical addresses end up next to each other)

4) finally, just create the result string, by placing all strings from the array into it, separated by spaces. You could use sprintf for that :

      http://www.cplusplus.com/reference/clibrary/cstdio/sprintf.html
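The four hints above could be combined roughly as follows. This is only a sketch under stated assumptions: the name extract_sorted_unique and the DELIMS set are hypothetical, the token around each '@' is taken as everything up to the nearest delimiter (so trailing punctuation such as a full stop is not stripped), and memcpy is used to build the result instead of sprintf.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical delimiter set; real pages will likely need more characters. */
#define DELIMS " \t\r\n()<>\"',;:"

static int cmp_str(const void *a, const void *b) {
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Returns a malloc'd, space-separated, sorted list of unique candidates,
 * or NULL if an allocation fails. The caller frees the result. */
char *extract_sorted_unique(const char *text) {
    size_t cap = 16, count = 0, total = 1, i;
    char **items = malloc(cap * sizeof *items);
    char *result = NULL;
    const char *at = text;
    int ok = (items != NULL);

    /* Hints 1 and 2: find each '@' and copy the token around it. */
    while (ok && (at = strchr(at, '@')) != NULL) {
        const char *start = at;
        const char *end = at + 1;
        while (start > text && strchr(DELIMS "@", start[-1]) == NULL) {
            --start;
        }
        end += strcspn(end, DELIMS "@");
        if (start < at && end > at + 1) {      /* both parts non-empty */
            size_t len = (size_t)(end - start);
            if (count == cap) {
                char **grown = realloc(items, 2 * cap * sizeof *items);
                if (grown == NULL) { ok = 0; break; }
                items = grown;
                cap *= 2;
            }
            items[count] = malloc(len + 1);
            if (items[count] == NULL) { ok = 0; break; }
            memcpy(items[count], start, len);
            items[count][len] = '\0';
            total += len + 1;                  /* address plus separator */
            ++count;
        }
        at = end;
    }

    /* Hints 3 and 4: sort, skip repeated neighbours, join with spaces. */
    if (ok) {
        result = malloc(total);
    }
    if (result != NULL) {
        char *out = result;
        qsort(items, count, sizeof *items, cmp_str);
        for (i = 0; i < count; ++i) {
            size_t len = strlen(items[i]);
            if (i > 0 && strcmp(items[i], items[i - 1]) == 0) {
                continue;
            }
            if (out != result) {
                *out++ = ' ';
            }
            memcpy(out, items[i], len);
            out += len;
        }
        *out = '\0';
    }
    for (i = 0; i < count; ++i) {
        free(items[i]);
    }
    free(items);
    return result;
}
```

Sorting costs O(n log n) comparisons but makes duplicate removal a single pass over neighbouring entries, which should scale better than searching the whole result for every new address.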


Could you tell if you are restricted to C ? If you can use C++, there are several easier ways of doing this.
Terminator4

ASKER

Yes this is restricted to C, can you guys make functional code out of this?
It's a one-liner in Perl   :)

It's kinda hard to do in C, as it involves sorting.

Also you need to think about equivalent names:  Joe_Blow@AOL.com  is equivalent to JOE_BLOW@Mail.AOL.Com

That's hard to do as it involves DNS lookups.



As I said : this sounds like an assignment (correct me if I'm wrong), so if you can give it a try with the hints I gave, then we'll be glad to help you along with any problems/questions you might still have.
THIS IS NOT A HOMEWORK

char* returnEmails(char* httpPage){
char* array;
for(int cpt = 0;cpt<?,cpt++){
     //strrchr to look for the '@' character :
     //get character before and after?
     //put found (lower caps)email in array
}    
return array;
}
Sorry, I think I got carried away because I thought that being homework or not has nothing to do with the programming question.
Here's how you COULD do it :

#include <stdio.h>
#include <stdlib.h>
#include <string.h>  /* strchr, strspn, strncpy, strstr, strlen, strcpy */

#define DOMAIN_CHARS "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
#define LOCAL_CHARS  "!#$%&\'*+-./0123456789=?ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz{|}~"

char *returnEmails(char *httpPage) {
  char *httpPtr = httpPage;
  char *addresses = (char*) calloc(1, sizeof(char));
  addresses[0] = '\0';
  while (httpPtr = strchr(httpPtr, '@')) {
    char *new_address = 0;
    int total_len = 0;
    int domain_len = strspn(httpPtr + 1, DOMAIN_CHARS);
    int local_len = 0;
    while (--httpPtr >= httpPage) {
      char c = *httpPtr;
      if ((0x20 <= c) && (c < 0x7F)) {
        switch (c) {
          case ' ' :
          case '\"' :
          case '(' :
          case ')' :
          case ',' :
          case ':' :
          case ';' :
          case '<' :
          case '>' :
          case '@' :
          case '[' :
          case '\\' :
          case ']' :
            break;
          default :
            ++local_len;
            continue;
        }
      }
      break;
    }
    ++httpPtr;
    total_len = local_len + 1 + domain_len;
    new_address = (char*) calloc(total_len + 1, sizeof(char));
    strncpy(new_address, httpPtr, total_len);
    new_address[total_len] = '\0';

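    if (!strstr(addresses, new_address)) {
      int cur_len = strlen(addresses);
      /* room for the old list, a separating space, the new address and the terminating '\0' */
      addresses = (char*) realloc(addresses, cur_len + 1 + total_len + 1);
      addresses[cur_len] = ' ';
      strcpy(addresses + cur_len + 1, new_address);
    }
    free(new_address);  /* the temporary copy is no longer needed */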

    httpPtr += total_len;
  }
  return addresses;
}


int main(void) {
  char *httpPage = "test@yahoo.com sdjrgfjhxb and test2@yahoo-2.com fg 45621; <>te+!&st3@yahoo.com(test@yahoo.com)";
  char *emails = returnEmails(httpPage);
  fprintf(stdout, "%s", emails);  /* never pass extracted text as the format string */
  fprintf(stdout, "\n");
  free(emails);

  system("PAUSE");
  return 0;
}


Take a look, and see if you understand what it's doing. Feel free to ask questions if you want.

You'll probably have to fine-tune it to fit your exact needs, but this can at least be a start.

Note that I wrote this code mostly for clarity (to clearly show what's happening), and not for speed in the first place - although it will be quite fast.
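One detail worth fine-tuning: the duplicate check in the code above uses strstr, which matches substrings, so an address like b@x.org would be treated as already present once ab@x.org is in the list. A possible whole-token check is sketched below; list_contains is a hypothetical helper name, and it assumes the list is separated by single spaces as in the code above.

```c
#include <string.h>

/* Returns 1 if `addr` appears as a complete space-separated token in `list`,
 * 0 otherwise. */
int list_contains(const char *list, const char *addr) {
    size_t len = strlen(addr);
    const char *p = list;

    if (len == 0) {
        return 0;
    }
    while ((p = strstr(p, addr)) != NULL) {
        /* The hit only counts if it is bounded by a space or the string ends. */
        int starts_token = (p == list) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token) {
            return 1;
        }
        ++p;  /* keep searching after a partial hit */
    }
    return 0;
}
```

It could replace the strstr test with something like if (!list_contains(addresses, new_address)).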
um, in the general case, e-mail addresses can be a whole lot more complex than you're assuming.

There can be parts in quotes, in angle brackets, and in parentheses.

I checked RFC 2822 on it ... I made some minor simplifications (that shouldn't matter), and as I said, probably the algorithm will have to be refined. But it will extract e-mail addresses from an HTML page. Did you notice something basic that I overlooked ?

>> There can be parts in quotes, in angle brackets, and in parentheses.
btw, I know that the local part of an e-mail address can be in quotes (which is very rarely done), but have never heard of e-mail addresses with angle brackets or parentheses
>> Did you notice something basic that I overlooked ?
Did you forget to convert the emails to lower case before putting them in the result array, so that there are fewer duplicates?
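A minimal sketch of that lower-casing step, to be applied to each address before the duplicate check. The helper name lowercase_in_place is hypothetical. Note the assumption involved: domain names are case-insensitive, but the local part may strictly speaking be case-sensitive, so lower-casing the whole address is a practical simplification rather than something the standard guarantees.

```c
#include <ctype.h>

/* Converts the string to lower case in place.
 * The cast to unsigned char avoids undefined behaviour in tolower
 * for characters with negative values. */
void lowercase_in_place(char *s) {
    for (; *s != '\0'; ++s) {
        *s = (char)tolower((unsigned char)*s);
    }
}
```

With this, Joe_Blow@AOL.com and joe_blow@aol.com compare equal; addresses that differ in the host name, such as the Mail.AOL.Com example earlier in the thread, would still count as different.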