
Seeking the optimal way to parse an email sent to my PHP script via a pipe

bitt3n asked
Last Modified: 2012-06-27
I am currently piping mail to a script that does a bit of parsing as per this article:

http://www.devarticles.com/c/a/PHP/Incoming-Mail-and-PHP/1/

Email piped to the script gets recorded as $email. The script then parses it like this:

http://www.filefarmer.com/2/bitt3n/example.html
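For readers who cannot open the links, here is a minimal sketch of how a piped script might collect the raw message into $email. The helper name read_raw_message() is hypothetical and not taken from the article; it takes any open stream so it can be tried without a mail server.

```php
<?php
// Read a complete raw message from an already-open stream.
// read_raw_message() is a hypothetical helper name used for illustration.
function read_raw_message($stream) {
    $email = '';
    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if ($chunk === false) {
            break; // read error: return what has been read so far
        }
        $email .= $chunk;
    }
    return $email;
}

// In the piped script itself this would be called as:
// $email = read_raw_message(STDIN);
```

Reading in chunks rather than line by line keeps binary attachment data intact until a MIME decoder gets to it.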

I want to parse $email more completely. I want to isolate fully the message body and email addresses, and ideally also isolate the message from any text that comprises a message to which the message body is a reply. (Ultimately I hope to handle attachments.)

I assume I will either need to use regular expressions or some existing function that must exist somewhere for parsing mail. If the former, then I will need help devising robust expressions that handle all possibilities. If the latter, I may need some help getting the function operating properly.

I’ll consider my question completely answered if I can get my script to handle everything except saving attachments (one thing at a time).

In a perfect world, some function would take $email as its argument and return the variables $to, $from, $subject, $message, $num_attachments, and $attachments_array, with the array indicating the name, type, size and filepath of each attachment saved by the function (which would save each attachment to a specified directory if it meets the size, type and number criteria).

Is there some library I can install that has such a function? Searching around I found some software called ripMIME that looks interesting (http://www.pldaniels.com/ripmime/), but I am not sure that is what I want and it doesn’t appear to come with any documentation. Also I’ve never needed to install a library before.

Thanks for your help.

ripMIME looks like it may do what you want - however, it's written in C and will need to be compiled (unless you can find a binary for it).

If you don't have the ability to compile ripMIME then I'm aware of at least one MIME decoder written in PHP - PEAR's Mail_mimeDecode:

http://pear.php.net/manual/en/package.mail.mail-mime.php

It can decode e-mails including headers, message body and attachments (including all the base64/quoted-printable stuff). If you need an example let me know and I'll try and dig one out. There are probably other PHP-based decoders out there too, but in general they'll all be slower than something like ripMIME.

Removing the original message that the current message is replying to is quite tricky - you could just naively remove any line beginning with > but this could break quite easily - clueless people typing in the > area; broken mail clients which mangle the original message; clients configured to use alternative characters like colons; HTML e-mails.
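The naive approach could look roughly like the sketch below. strip_quoted_lines() is a made-up name, and the function has exactly the weaknesses listed above: it only knows about the ">" marker and works on plain text only.

```php
<?php
// Drop lines that start with the conventional ">" quote marker.
// Deliberately naive: it also drops ">" lines the sender typed themselves,
// and it does nothing for HTML mail or other quote characters.
function strip_quoted_lines($body) {
    $kept = array();
    foreach (preg_split('/\r\n|\r|\n/', $body) as $line) {
        if (preg_match('/^\s*>/', $line)) {
            continue; // treated as quoted text from the earlier message
        }
        $kept[] = $line;
    }
    return implode("\n", $kept);
}
```

Attribution lines such as "On Friday, X wrote:" are left in place, since their wording varies by mail client.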

Hope some of that helps...

Commented:
I have a short script that can do this, but it doesn't handle pipes, because piping is the most inefficient way to decode complex mail messages! My script gets mail by logging into the POP, IMAP or HTTP mail server and processes it using smart memory handling, only downloading what it needs.

Here is what my script can do out of the box!

1. login with authorization
2. keep track of messages already processed if you don't delete after processing
3. decodes (headers, bodies, attachments)
4. can save attachments based on attachment type. $allow = 'zip|rar|jpg|jpeg|txt|gif|png';
5. can save message bodies, both html and text, or only the html or text part (inline attachments / images are auto linked to the file, if saved)
6. saves messages and attachments in directory structure starting at a base directory!
7. directory mail store information can be saved in PHP files or in a database.
8. uses memory buffers, dumping the core if memory usage reaches the max allowed.
9. can scan attachments for viruses as they are being decoded. (needs some virus software, CLAM, McAfee)
10. can use a database to handle multi accounts.
11. full detailed logging, 2 levels => normal, debug
12. script can run via a Cron Type Task, Win/Scheduled, or via the browser.

What you need to run this script!

PHP IMAP extension.
PHP 4.3 or higher

Optional....

PHP Magic Mime extension (for a more secure way dealing with many different attachment types) [validating file types]
CLAM, McAfee for virus scanning

If something like this will work for you, tell me and I'll post a link to the download! The script/class is only 6 KB in size; it's lightweight but very efficient.

ms!

Author

Commented:
@sjohnstone1234: thanks, I would be interested in seeing an example. I am not familiar with object-oriented programming, and have never previously required PEAR modules. I am reading about them presently through your link.

@mensuck: I am interested in learning more about your approach. The reason I set up the mail parsing as a PIPE is the fact that I need to parse and respond to mail promptly upon receipt, and I have no need to retain a copy of the received message. I imagine I could set a cron job to run your script every 30 seconds and automatically delete parsed mail?

The process would be:
1) parse mail
2) store jpeg and gif attachments under the max size limit and max # attachment limit
3) notify user if an attachment was not saved
4) delete mail

Would using your script this way be more efficient than PIPE? Just out of curiosity, do you happen to know how fast your method could parse, say, 100 emails each with a 200k jpeg attachment? Thanks for your help.
Hi bitt3n,

I can only repeat my last post:

Use http://www.smilingsouls.net/Mail_IMAP in your PHP-Script, 'cause it handles attachments, multi-part mails, text-only mails very well.
You open a connection through http://php.net/imap_open and start a new class with IMAP_v2($yoursock).

Now you can do all the things you need.

Friendly Regards
Henning Möllendorf
Okay, assuming you've downloaded Mail_mimeDecode itself, here's an example. It took around 30 seconds to process 100 e-mails, each with a 200k JPEG attachment (on a fairly slow machine):

#!/usr/bin/php
<?php

include_once('Mail/mimeDecode.php');

// get the message from a file; in practice you would want to read from
// stdin as in the original article
$message = file_get_contents('message');

$decoder = new Mail_mimeDecode($message);
$decoded = $decoder->decode(array(
    'include_bodies' => TRUE,
    'decode_bodies' => TRUE,
    'decode_headers' => TRUE
));

// we need to process e-mails recursively as they can contain any number
// of nested parts
process_message($decoded);

function process_message($decoded) {
    // headers are in $decoded->headers
    print_r($decoded->headers);

    if(isset($decoded->parts)) {
        // process each part of a multipart message
        foreach($decoded->parts as $part) {
            process_message($part);
        }
    } else {
        // not multipart - do something interesting here, like checking
        // for a particular content type
        if($decoded->ctype_primary == 'image'
        and $decoded->ctype_secondary == 'jpeg') {
            // $decoded->body contains the image data - we could write it
            // to a file or into a database, but for now just output its size
            print 'Found a jpeg of size: '.strlen($decoded->body)."\n";
        }
    }
}

?>

Confused yet by all the methods? :)

Author

Commented:
OK. I am still figuring out the code (I am a PHP newb), but I set $message equal to a cut-and-pasted unparsed email and ran the script locally (mimeDecode.php was already installed with my PHP 5.0.5). The result is printed below. It looks like it worked, but it isn't obvious to me how to pull the from/to/subject/body fields out of it. The [from] field has the sender's name rather than the sender's email address, and I cannot find the message body, which should be "message_contents".

This is the unparsed message (everything after 'MESSAGE UNPARSED: '):

http://www.findmoby.com/e_exchange_mail_example.htm

This is the result of the script when I plug in the unparsed message. I don't see the message body and I'm not sure how to reference the other pieces properly.

Array
(
    [from bitt3n@gmail.com fri feb 03 14] => 58:47 2006
    [received] => Array
        (
            [0] => from [64.233.162.204] (helo=zproxy.gmail.com) by athena.pronameservice.net with esmtp (Exim 4.52) id 1F5811-0006Pv-Kj for bitt3n@findmoby.com; Fri, 03 Feb 2006 14:58:47 -0600
            [1] => by zproxy.gmail.com with SMTP id s1so721951nze for ; Fri, 03 Feb 2006 12:58:54 -0800 (PST)
            [2] => by 10.65.206.2 with SMTP id i2mr1268341qbq; Fri, 03 Feb 2006 12:58:49 -0800 (PST)
            [3] => by 10.65.38.11 with HTTP; Fri, 3 Feb 2006 12:58:49 -0800 (PST)
        )
    [domainkey-signature] => a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:mime-version:content-type; b=Br7JFbDR6spMTWcL/83vK59+T6xmzh10RV36PRH7fpb/dOtbRdhoHpmDc2q7Irp7nCt7cfW2XatHc9vnqETYxXYGfLKUipBnVF4qMGEO+3hCqCmcYpvWzVn9fMGjarimyUjkLJ5z+b2VXHC2jdSUUC2DNgB7yC7e4t5040Xkzh8=
    [message-id] =>
    [date] => Fri, 3 Feb 2006 15:58:49 -0500
    [from] => D G
    [to] => bitt3n@findmoby.com
    [subject] => unparsed
    [mime-version] => 1.0
    [content-type] => multipart/mixed; boundary="----=_Part_11440_8988137.1139000329456"
)
Array
(
    [content-type] => multipart/alternative; boundary="----=_Part_11441_1568128.1139000329456"
)
Array
(
    [content-type] => text/plain; charset=ISO-8859-1
    [content-transfer-encoding] => quoted-printable
    [content-disposition] => inline
)
Array
(
    [content-type] => text/html; charset=ISO-8859-1
    [content-transfer-encoding] => quoted-printable
    [content-disposition] => inline
)
The subject line can be accessed using $decoded->headers['subject'] - the other headers can be accessed in the same way, e.g. $decoded->headers['to'], $decoded->headers['from'], etc.

If you look at the original message you'll notice the sender's e-mail address is surrounded by < >. This is perfectly valid BUT if you run the script from a web browser then it will be mistaken for an HTML tag and you won't see it. If you click "View Source" you should see the e-mail address along with the rest of the output (should be more readable too!).
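If the script has to be viewed in a browser, one way around the disappearing addresses is to escape the output first. This is only a sketch; headers_for_browser() is a made-up helper name and the sample address is illustrative.

```php
<?php
// Render a header array so that "<address>" survives in a browser.
// htmlspecialchars() turns "<" and ">" into "&lt;" and "&gt;".
function headers_for_browser(array $headers) {
    return '<pre>' . htmlspecialchars(print_r($headers, true)) . '</pre>';
}

// Illustrative data only; in the script this would be $decoded->headers.
echo headers_for_browser(array('from' => 'Test User <user@example.com>'));
```

The pre tag also keeps the line breaks that print_r() produces, which makes nested arrays easier to read.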

As for the message body - the basic text should be in $decoded->body, but this will probably be empty for all but the simplest of e-mails due to mixed HTML/plain text parts, attachments, etc. If you just want to extract any plain text in the message then you could do it with this function:

print extract_text($decoded);

function extract_text($decoded) {
    $text = '';

    if(isset($decoded->parts)) {
        foreach($decoded->parts as $part) {
            $text .= extract_text($part);
        }
    } else {
        if($decoded->ctype_primary == 'text'
        and $decoded->ctype_secondary == 'plain') {
            $text .= $decoded->body;
        }
    }

    return $text;
}

Anything beyond that (such as finding the HTML parts and parsing them) will take more work.

Author

Commented:
hm.. I tried the raw message from here:

http://www.findmoby.com/e_exchange_mail_example.htm

but the decoder says "No JPEG attachments found."

I'll take a look at the code now and report back soon.
I think that message is corrupt - there should be another boundary at the end (the bit that looks like this: ------=_Part_11440_8988137.1139000329456--).
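A quick way to check for this before decoding might be something like the following sketch. has_closing_boundary() is a hypothetical helper; the boundary value would come from the message's Content-Type header.

```php
<?php
// True if the raw message contains the closing delimiter for $boundary,
// i.e. "--" + boundary + "--" on a line of its own.
function has_closing_boundary($raw, $boundary) {
    $pattern = '/^--' . preg_quote($boundary, '/') . '--\s*$/m';
    return preg_match($pattern, $raw) === 1;
}
```

A multipart message nests boundaries, so a thorough check would repeat this for each boundary declared in the message.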

Author

Commented:
hm.. I am having a basic problem -- maybe you can give me some advice.

I modified my existing script with your code and put it on my server, but I think I have the includes path wrong for the Mail package. When I try to run the script, I get this error:

Fatal error: Class 'Mail_mimeDecode' not found in /home/bitt3n/incomingmailscriptsj.php on line 19

I believe that the right package exists on the server, because the shell command "pear remote-list" returns among the installed packages the package

Mail_Mime                       1.3.1

which I assume includes the Mail_mimeDecode::decode() function.

I checked the include path in phpinfo() (if that is even relevant) and it says:

.:/usr/lib/php:/usr/local/lib/php

I have tried to find the file mimeDecode.php on the server, and experimented with various file paths for the include without success. Can you tell me how to figure out what the correct path is? Thanks.

"pear remote-list" returns a list of packages on the download server, i.e. packages that are available to install (they might not be installed on your server). Use "pear list" to see what is actually installed.

So you'll need to install Mail_Mime (with something like "pear install Mail_Mime").
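Once the package is installed, a small check like the one below can confirm whether PHP can actually reach the file through its include path. It is a sketch only; locate_on_include_path() is a made-up wrapper around the built-in stream_resolve_include_path(), which needs PHP 5.3.2 or later.

```php
<?php
// Return the full path PHP would load for $relative, or false if the file
// is not reachable through the current include_path.
function locate_on_include_path($relative) {
    return stream_resolve_include_path($relative);
}

echo "include_path: " . get_include_path() . "\n";
$found = locate_on_include_path('Mail/mimeDecode.php');
echo $found === false
    ? "Mail/mimeDecode.php is not on the include path\n"
    : "Mail/mimeDecode.php found at $found\n";
```

If the file exists somewhere else on the server, set_include_path() can add that directory at the top of the script.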

Author

Commented:
yeah apparently my hosting provider has to install it because it's a shared server.. I am waiting for them to do it and will report back as soon as they have done so. thanks again.

Author

Commented:
ok the module is installed and the script is basically working. You can see it in action by sending a message to bitt3n@findmoby.com. The script will email back the resulting parsed message. The time taken to parse one message with a 192k attachment was about 0.011 seconds. That seems surprisingly fast.

The only thing that isn't working is the fact that if you include multiple jpegs, only the name of the first one gets returned, and so far I cannot figure out what I am doing wrong, but I am still working on it.

Here is the exact script I am using, into which I have incorporated your code:

http://www.findmoby.com/incomingmailscript.zip

Also I realized something interesting -- apparently if someone sends my script a message with a bad reply-to address, then a mailer-daemon bounces the reply, and then my script bounces the bounce etc. Is there an easy way to stop this? For example, I see that the bounce alerts are coming from Mailer-Daemon@athena.pronameservice.net. I assume if I automatically do not respond to any address beginning with Mailer-Daemon I should be OK?

It would be great to start saving the jpegs in a directory. What I want to do in plain English is:

If there are attachments of the right type, cycle through them as follows:
if the max attachments per message has not been reached, and the current attachment is of the right type, and is less than maxsize, and is less than max dimensions, save it to a specific directory, and add the filename to an array.
If one or more attachments were not saved, notify the sender.
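The rule described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name, the shape of each attachment (an array with 'name', 'type' and 'data' keys) and the limit parameters. The dimension check is left out to keep the sketch short.

```php
<?php
// Split attachments into those to save and those to reject.
// Each attachment is assumed to be an array with 'name', 'type' and 'data'
// keys; that shape and the limit parameters are illustrative assumptions.
function select_attachments(array $attachments, array $allowedTypes, $maxCount, $maxBytes) {
    $accepted = array();
    $rejected = array();
    foreach ($attachments as $attachment) {
        $ok = count($accepted) < $maxCount
            && in_array($attachment['type'], $allowedTypes, true)
            && strlen($attachment['data']) < $maxBytes;
        if ($ok) {
            $accepted[] = $attachment;
        } else {
            $rejected[] = $attachment;
        }
    }
    return array($accepted, $rejected);
}
```

If the second returned array is not empty, at least one attachment was skipped and the sender can be notified.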

If that's not completely trivial I would be happy to award points for this question and open a new question regarding storing the jpegs. I have a basic understanding of how to do it but I am not completely clear on how the arrays work and how you cycle through them to store the file. The parsing solution you provided for my original question is exactly what I wanted.
Oops, my mistake - the beginning of extract_jpegs should look like this:

function extract_jpegs(&$decoded) {
    $attachments = array();

    if(isset($decoded->parts)) {
        foreach($decoded->parts as $part) {
            $attachments = array_merge($attachments, extract_jpegs($part));
        }
etc...

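The difference is using array_merge instead of just adding the arrays together: with +, entries whose numeric keys already exist in the left-hand array are dropped, which would explain why only the first JPEG showed up.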
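A tiny standalone illustration of the two behaviours, using made-up file names:

```php
<?php
$first  = array(array('name' => 'one.jpg'));
$second = array(array('name' => 'two.jpg'));

// "+" keeps the left-hand value for any key present in both arrays.
// Both arrays use key 0 here, so two.jpg is silently dropped.
$added = $first + $second;

// array_merge() renumbers numeric keys and appends, so both survive.
$merged = array_merge($first, $second);

echo count($added) . ' vs ' . count($merged) . "\n";
```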

As for ignoring bounces - you could just ignore anything from Mailer-Daemon but a more effective way would be extracting the "envelope sender" (also known as the return path) as this will catch vacation responses and other weird things.

How to do this will depend on your mail server software - it might appear in the headers (so look in the variable $decoded->headers['return-path']) or it might be in an environment variable called SENDER (so look in $_ENV['SENDER']).

If you manage to extract the envelope sender then you are looking for either a blank string "" or the null sender "<>" - either of those would indicate that the message is a bounce, and so your script should ignore it.
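As a sketch, the check itself can be a few lines. is_bounce() is a hypothetical name; it treats a sender that could not be determined at all (null) as "not a bounce", which is an assumption you may want to reverse.

```php
<?php
// True when the envelope sender marks the message as a bounce or other
// automatic notice: an empty string or the null sender "<>".
// null means "sender unknown" and is NOT treated as a bounce here.
function is_bounce($envelopeSender) {
    if ($envelopeSender === null) {
        return false;
    }
    $sender = trim((string) $envelopeSender);
    return $sender === '' || $sender === '<>';
}
```

The script would call this before sending any reply and simply exit when it returns true.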

For the image processing, I see you are already looping through the images (it should work if you fix the bug in my code!); to find the dimensions of the image you can use imagecreatefromstring() and then imagesx() and imagesy() (look in the PHP online manual - should be fairly straightforward though). Then if the file matches your criteria, write the file using fopen(), fwrite() and fclose(). Most of these functions return FALSE if an error occurs, so you can notify the sender.
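Here is one possible shape for those two steps. Note the swaps: it uses getimagesizefromstring() (PHP 5.4 or later) instead of imagecreatefromstring() with imagesx()/imagesy(), since it reads the dimensions without decoding the whole image, and file_put_contents() instead of fopen()/fwrite()/fclose(). Both helper names are made up for this sketch.

```php
<?php
// Return array(width, height) for raw image bytes, or false if the data
// is not a recognisable image.
function image_dimensions($data) {
    $info = @getimagesizefromstring($data);
    if ($info === false) {
        return false;
    }
    return array($info[0], $info[1]);
}

// Write $data into $directory under a sanitised file name.
// basename() strips any directory part supplied by the sender.
// Returns the path written, or false on failure.
function save_attachment($directory, $name, $data) {
    $path = rtrim($directory, '/') . '/' . basename($name);
    return file_put_contents($path, $data) === false ? false : $path;
}
```

A false return from either function is the cue to tell the sender that the attachment was not saved.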

Hope that helps!

Author

Commented:
OK it was $_ENV['SENDER'] and I have modified the script to ignore return paths of "" or "<>". I tested it with a vacation response and it appears to work. It seems odd that the return path wouldn't be in the header though ($decoded->headers['return-path'] didn't work). I assume it has to be in there somewhere, and I am wondering if using $_ENV['SENDER'] might break my script if I move to another mail server at some point. Is there any way around this?

I will read up on the functions you mentioned for handling the attachments and see how far I get. If I need more tips, I will open up a new question regarding that specific topic.

Thanks for your help, this was very useful.
Hi bitt3n, glad you got it working.

I'm not sure whether $_ENV['SENDER'] works with other MTAs; I just noticed you were using Exim which I know supports it. I have a feeling most (e.g. qmail, Postfix) apart from Sendmail support it, or something similar. If you are cursed with Sendmail on your server then you might be able to use something like Procmail to filter out bounces.

As for whether the Return-Path header should appear - normally it is only added when a message is delivered to its final destination (i.e. a mailbox file on disk), so whether or not you see it when a message is piped to your script is probably MTA dependent too (e.g. with Exim it can be toggled using the return_path_add option in the pipe transport). Sorry...
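To make the script a little less dependent on one MTA, it could try both places in turn. This is a sketch under the assumptions discussed above (an environment variable named SENDER, or a Return-Path header); find_envelope_sender() is a made-up name.

```php
<?php
// Try the places an envelope sender might turn up, in order.
// Returns the sender string, or null if none of the sources has it.
// $env is passed in (e.g. $_ENV or $_SERVER) so the lookup is easy to test.
function find_envelope_sender(array $headers, array $env) {
    if (isset($env['SENDER'])) {
        return $env['SENDER'];
    }
    if (isset($headers['return-path'])) {
        return $headers['return-path'];
    }
    return null; // unknown: the caller decides how to treat this
}

// In the piped script this might be called as:
// $sender = find_envelope_sender($decoded->headers, $_ENV);
```

Returning null for "not found" keeps that case distinct from an empty sender, which has its own meaning.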

Author

Commented:
OK I am tinkering around with the file saving and I can create the file on the server and save $attachment['data'] to it, but for some reason, this only saves a 0 into the file. Is $attachment['data'] the wrong field to be saving? Am I doing something idiotic? Here is the code:

if (count($attachments) > 0) {
    foreach ($attachments as $attachment) {
        if (isset($attachment['name'])) {
            $attachment_names .= $attachment['name'] . ' ';
            // basename() strips any directory part supplied by the sender
            $file = basename($attachment['name']);
            $file_handle = fopen($file, 'wb');
            if ($file_handle === false) {
                $file_status = "Cannot open file";
            } else {
                // fwrite() returns FALSE on failure; 0 only means empty data
                if (fwrite($file_handle, $attachment['data']) === false) {
                    $file_status = "Cannot write to file";
                }
                fclose($file_handle);
            }
        } else {
            $attachment_names .= 'noname_';
        }
        $attachment_sizes .= $attachment['size'];
    }
} else {
    $attachment_names = 'No JPEG attachments found. ';
}
No - I did something idiotic (again) in my code...

In the extract_jpegs function, the line which sets $attachment['data'] should read:

$attachment['data'] = $decoded->body;

Hope that helps

Author

Commented:
yes, that fixed the problem, thanks.

is there some simple function that can extract the e-mail address from: 'Test User <bitt3n@findmoby.com>'? When I receive messages I need to compare the 'to' and 'from' address with addresses in a database, so I need the address to be isolated if there is additional text surrounding it.

I suspect the answer is ereg_replace, parsing the string into 'text up to <', 'text@text' and '>' (where the first and third parts may be empty) but I hope the answer is not that involved.
I'm not aware of a built-in function which does it; you could use something like this:

function parse_emails($data) {
    $parsed = array();

    foreach(explode(',', $data) as $address) {
        if(preg_match('/<?([^ ]+@[^ >]+)>?/', $address, $matches) > 0) {
            $parsed[] = $matches[1];
        }
    }

    return $parsed;
}

It will parse a list of addresses (separated by commas) and return an array containing the address parts of each. The regular expression isn't entirely accurate in that it will allow e-mail addresses which aren't strictly valid.

And yes I have tested this one (quickly) and it seems to work :) let me know if you have any problems though.
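One case that splitting on every comma does not cover is a quoted display name that itself contains a comma, such as "Doe, John". A variant that skips commas inside double quotes might look like this sketch. parse_address_list() is a made-up name, the lowercasing is an assumption to make database comparisons easier, and it is still a heuristic rather than a full RFC 5322 parser.

```php
<?php
// Split an address list on commas that are outside double quotes, then
// pull the address part out of each entry.
function parse_address_list($data) {
    $parsed = array();
    // The lookahead only allows a split where an even number of quotes follows.
    $entries = preg_split('/,(?=(?:[^"]*"[^"]*")*[^"]*$)/', $data);
    foreach ($entries as $entry) {
        if (preg_match('/[^\s<>"]+@[^\s<>"]+/', $entry, $matches)) {
            $parsed[] = strtolower($matches[0]);
        }
    }
    return $parsed;
}
```

For example, parse_address_list('Test User <bitt3n@findmoby.com>') gives an array holding just bitt3n@findmoby.com.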