Terminator4


Fastest httpstring emails extractor

Hi, I need to extract email addresses from a big string, then remove the duplicates from the results.
The function then returns a string with the emails separated by whitespace.
I will give all the points to the person who makes this function work the fastest.

In C language


char* returnEmails(char* httpPage){




}

input: a Web page

output: "johndoe@hotmail.com johnsmith@gmail.com"

_Stilgar_

I would use a regex for parsing, put each match in a vector / array, and then output them however I want.

Stilgar.
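
A rough sketch of the regex idea in C. Standard C has no regex library, so this assumes a POSIX system that provides `<regex.h>`. The pattern is a deliberately simplified one (not a full RFC 2822 matcher), and `find_emails` and `EMAIL_PATTERN` are hypothetical names chosen for illustration.

```c
#include <regex.h>
#include <string.h>

/* Simplified pattern (an assumption, not RFC-complete):
   local part, '@', then a domain containing at least one dot. */
#define EMAIL_PATTERN "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+"

/* Hypothetical helper: copies up to max matches (each truncated to
   255 characters) into out. Returns the number of matches stored,
   or -1 if the pattern fails to compile. */
int find_emails(const char *text, char out[][256], int max) {
  regex_t re;
  regmatch_t m;
  int count = 0;

  if (regcomp(&re, EMAIL_PATTERN, REG_EXTENDED) != 0) {
    return -1;
  }
  while (count < max && regexec(&re, text, 1, &m, 0) == 0) {
    size_t len = (size_t)(m.rm_eo - m.rm_so);
    if (len > 255) {
      len = 255;
    }
    memcpy(out[count], text + m.rm_so, len);
    out[count][len] = '\0';
    ++count;
    text += m.rm_eo;  /* resume the search right after this match */
  }
  regfree(&re);
  return count;
}
```

The matches still need de-duplicating and joining afterwards; the regex only replaces the scanning step.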
Infinity08

In case this is homework, I'll just give you some hints (since you mention C and not C++, I'll stick to C) :

1) you can use strchr to look for the '@' character :

      http://www.cplusplus.com/reference/clibrary/cstring/strchr.html

2) you can use strtok with ' ' as separator to extract the whole email address. Possibly you'll have to add separators other than ' ', like '(', ')', etc.

      http://www.cplusplus.com/reference/clibrary/cstring/strtok.html

    note that strtok modifies the array. So, an alternative could be to use strncpy with a limited length :

      http://www.cplusplus.com/reference/clibrary/cstring/strncpy.html

3) you can store the e-mail addresses you find in an array of strings, then sort that array, and remove duplicates (after sorting, identical addresses are adjacent)

4) finally, just create the result string, by placing all strings from the array into it, separated by spaces. You could use sprintf for that :

      http://www.cplusplus.com/reference/clibrary/cstdio/sprintf.html


Could you tell us whether you are restricted to C ? If you can use C++, there are several easier ways of doing this.
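
Hints 3) and 4) can be sketched roughly as follows. This is only an illustration under assumptions: `join_unique` and `cmp_str` are hypothetical names, the addresses are assumed to be already extracted into an array of strings, and `memcpy` is used to build the result where the hint suggests `sprintf`.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of char* */
static int cmp_str(const void *a, const void *b) {
  return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Hypothetical helper: sorts list[0..n-1] in place, skips adjacent
   duplicates, and returns a newly allocated space-separated string.
   The caller frees the result. Returns NULL if allocation fails. */
char *join_unique(char **list, size_t n) {
  size_t i;
  size_t total = 1;  /* room for the terminating '\0' */
  char *out;
  char *p;

  qsort(list, n, sizeof *list, cmp_str);

  /* Upper bound on the size: every string plus one separator each. */
  for (i = 0; i < n; ++i) {
    total += strlen(list[i]) + 1;
  }
  out = malloc(total);
  if (out == NULL) {
    return NULL;
  }

  p = out;
  for (i = 0; i < n; ++i) {
    size_t len;
    if (i > 0 && strcmp(list[i], list[i - 1]) == 0) {
      continue;  /* same as the previous entry: a duplicate */
    }
    if (p != out) {
      *p++ = ' ';
    }
    len = strlen(list[i]);
    memcpy(p, list[i], len);
    p += len;
  }
  *p = '\0';
  return out;
}
```

Sorting first means each address only has to be compared with its neighbour, instead of with every address already collected.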
Terminator4

ASKER

Yes, this is restricted to C. Can you guys make functional code out of this?
grg99

It's a one-liner in Perl   :)

It's kinda hard to do in C, as it involves sorting.

Also you need to think about equivalent names:  Joe_Blow@AOL.com  is equivalent to JOE_BLOW@Mail.AOL.Com

That's hard to do as it involves DNS lookups.



Infinity08

As I said : this sounds like an assignment (correct me if I'm wrong), so if you can give it a try with the hints I gave, then we'll be glad to help you along with any problems/questions you might still have.
Terminator4

ASKER

THIS IS NOT HOMEWORK

char* returnEmails(char* httpPage){
char* array;
for(int cpt = 0;cpt<?,cpt++){
     //strrchr to look for the '@' character :
     //get character before and after?
     //put found (lower caps)email in array
}    
return array;
}
Terminator4

ASKER

Sorry
Terminator4

ASKER

I think I got carried away because I thought being homework of not has nothing to do with the programming question.
Terminator4

ASKER

lol, the last sentence, change "of" for "or".
Infinity08

Here's how you COULD do it :

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strchr, strspn, strncpy, strstr, strlen, strcpy */

#define DOMAIN_CHARS "-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
#define LOCAL_CHARS  "!#$%&\'*+-./0123456789=?ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz{|}~"

char *returnEmails(char *httpPage) {
  char *httpPtr = httpPage;
  char *addresses = (char*) calloc(1, sizeof(char));
  addresses[0] = '\0';
  while ((httpPtr = strchr(httpPtr, '@')) != NULL) {
    char *new_address = 0;
    int total_len = 0;
    int domain_len = strspn(httpPtr + 1, DOMAIN_CHARS);
    int local_len = 0;
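    /* walk backwards from the '@' over the local part, without ever
       stepping before the start of the buffer */
    while (httpPtr > httpPage) {
      char c = *(httpPtr - 1);
      if ((0x20 <= c) && (c < 0x7F)) {
        switch (c) {
          case ' ' :
          case '\"' :
          case '(' :
          case ')' :
          case ',' :
          case ':' :
          case ';' :
          case '<' :
          case '>' :
          case '@' :
          case '[' :
          case '\\' :
          case ']' :
            break;
          default :
            ++local_len;
            --httpPtr;
            continue;
        }
      }
      break;
    }
    /* httpPtr now points at the first character of the local part */
    if (local_len == 0 || domain_len == 0) {
      /* a lone '@', or a missing local/domain part : not an address */
      httpPtr += local_len + 1;
      continue;
    }
    total_len = local_len + 1 + domain_len;
    new_address = (char*) calloc(total_len + 1, sizeof(char));
    strncpy(new_address, httpPtr, total_len);
    new_address[total_len] = '\0';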

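    /* note : strstr is a substring test, so an address that is contained
       in a longer, already stored address counts as a duplicate */
    if (!strstr(addresses, new_address)) {
      int cur_len = strlen(addresses);
      /* room for : existing text, a separator, the new address, and '\0' */
      addresses = (char*) realloc(addresses, cur_len + 1 + total_len + 1);
      if (cur_len > 0) {
        addresses[cur_len++] = ' ';   /* no leading space before the first address */
      }
      strcpy(addresses + cur_len, new_address);
    }
    free(new_address);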

    httpPtr += total_len;
  }
  return addresses;
}


int main(void) {
  char *httpPage = "test@yahoo.com sdjrgfjhxb and test2@yahoo-2.com fg 45621; <>te+!&st3@yahoo.com(test@yahoo.com)";
  char *emails = returnEmails(httpPage);
  fprintf(stdout, "%s\n", emails);   /* never pass extracted text as the format string */
  free(emails);

  system("PAUSE");
  return 0;
}


Take a look, and see if you understand what it's doing. Feel free to ask questions if you want.

You'll probably have to fine-tune it to fit your exact needs, but this can at least be a start.

Note that I wrote this code mostly for clarity (to clearly show what's happening), and not for speed in the first place - although it will be quite fast.
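
One point worth checking when fine-tuning: the `strstr` duplicate test in the code above is a substring match, so if `mytest@yahoo.com` is already in the result, `test@yahoo.com` would be treated as a duplicate and dropped. A minimal sketch of a whole-token check, assuming the result string stays space-separated; `contains_address` is a hypothetical helper, not part of the code above.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if addr appears in list as a complete
   space-delimited token, 0 otherwise. */
int contains_address(const char *list, const char *addr) {
  size_t len = strlen(addr);
  const char *p = list;

  if (len == 0) {
    return 0;
  }
  while ((p = strstr(p, addr)) != NULL) {
    /* the hit only counts if it is bounded by a space or a string end */
    int starts = (p == list) || (p[-1] == ' ');
    int ends = (p[len] == '\0') || (p[len] == ' ');
    if (starts && ends) {
      return 1;
    }
    p += 1;  /* partial hit inside a longer address: keep searching */
  }
  return 0;
}
```

It could stand in for the `strstr` call in the duplicate check, as in `if (!contains_address(addresses, new_address))`.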
grg99

um, in the general case, e-mail addresses can be a whole lot more complex than you're assuming.

There can be parts in quotes, in angle brackets, and in parentheses.

Infinity08

I checked RFC 2822 on it ... I made some minor simplifications (that shouldn't matter), and as I said, probably the algorithm will have to be refined. But it will extract e-mail addresses from an HTML page. Did you notice something basic that I overlooked ?

>> There can be parts in quotes, in angle brackets, and in parentheses.
btw, I know that the local part of an e-mail address can be in quotes (which is very rarely done), but I have never heard of e-mail addresses with angle brackets or parentheses.
Terminator4

ASKER

>> Did you notice something basic that I overlooked ?
Did you forget to convert the emails in the result array to lower case, so that there are fewer duplicates?
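
A minimal sketch of that lower-casing step, assuming plain ASCII addresses; `lowercase_in_place` is a hypothetical helper name. It would be applied to each extracted address before the duplicate check. Strictly speaking only the domain part is guaranteed to be case-insensitive, so folding the local part as well is a simplification that matches how most providers behave.

```c
#include <ctype.h>

/* Hypothetical helper: lower-cases an ASCII string in place. */
void lowercase_in_place(char *s) {
  for (; *s != '\0'; ++s) {
    /* the cast avoids undefined behavior for negative char values */
    *s = (char)tolower((unsigned char)*s);
  }
}
```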
ASKER CERTIFIED SOLUTION
Infinity08