sara_bellum asked:

python retrieval of email message body/text fails

I'm using Python file globbing to read through a series of email messages and store each part of each message in a database table. So far I can retrieve every header field, but I can't isolate the body of the message, which should be simple but isn't.
   
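FILESPEC = "/path-to-eml-files/*.eml"
files = glob(FILESPEC)  # expand the pattern into a list of file paths
for f in files:
    with open(f) as gg:  # close each file once it has been read
        text = gg.read()
    head = message_from_string(text)  # parse the raw text into a message object
    message_id = head['Message-ID']
    # etc. for all portions of the email header...
But how can the body of the email be retrieved independently of the header for storage?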

I tried readlines(), but that fails too (the output is chaotic, so I probably have flow-control issues here):

for f in files:
    jj = open(f, 'r')
    text = jj.readlines()
    for i, line in enumerate(text):
        if i >= 8:  # body starts on line 8 (current eml format)
            #print(line)
            body = ''.join(line)
    body_text = message_from_string(body)
    print('start body', body_text)
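
A note on the attempt above: `''.join(line)` runs once per loop iteration on a single line, so `body` ends up holding only the last line of the file rather than the accumulated text. A minimal sketch of the slicing approach, assuming the body really does start at index 8 as in the current .eml format (the `body_from_lines` helper and the sample lines are hypothetical, not part of the original script):

```python
def body_from_lines(lines, start=8):
    """Join every line from index `start` onward into one string."""
    return ''.join(lines[start:])

# Hypothetical stand-in for what readlines() returns for one .eml file:
# seven header lines, one blank separator line, then two body lines.
lines = ['Header-%d: x\n' % i for i in range(7)] + ['\n', 'first\n', 'second\n']

body = body_from_lines(lines)
print(body)
```

A fixed line offset like this is fragile, since the number of header lines usually varies from one message to the next.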
sara_bellum

ASKER

It may be useful if I post the modules I'm importing, FYI:
import pymysql
from glob import glob
from email import message_from_string
from datetime import datetime, timedelta
from email.utils import parsedate_tz, mktime_tz
pepr

Can you attach one of your typical .eml files?  (You can create a dummy one.  I just want to see how complex the .eml file is.)

My initial guess is that you should not parse the content of the file yourself at all.  That should be done by a parser, probably one from the email module.  You probably only need to open the file and pass the file object to the parser.  (I do not have first-hand experience with the subject, but I dare to try if you attach the .eml ;)
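
To illustrate the idea of handing a file object straight to the parser, here is a minimal sketch using `email.message_from_file` from the standard library. The in-memory `sample` object and its header values are hypothetical stand-ins for a real opened .eml file:

```python
import io
from email import message_from_file

# Hypothetical in-memory stand-in for an opened .eml file;
# with real data this would be the object returned by open(path).
sample = io.StringIO(
    "Subject: test\n"
    "Message-ID: <1@example.invalid>\n"
    "\n"
    "Hello\n"
)

# The parser itself finds the blank line that separates headers from body.
msg = message_from_file(sample)
print(msg['Subject'])
print(msg['Message-ID'])
```

With this approach there is no need to count header lines by hand, because the parser does the splitting.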
Thanks for writing, pepr! For this drill I looked at my inbox and formatted some samples, making sure they all use the same format. There's no guarantee that another set of emails would share that format, of course, but extracting header data has proven much simpler than capturing the body text, which has no field name and can span any number of lines. I'd also like to strip unnecessary empty lines from the message body. So here's an .eml sample:

Subject: Re: [WinEdt] Email-Mode
From: Roger Mudd <roger.mudd@gmx.net>
Reply-To: <winedt+list@wsg.net>
Date: Fri, 18 Nov 2011 12:35:19 +0200
To: WinEdt Mailing List <winedt+list@wsg.net>
Message-ID: <200805051111.33339999>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv: 1.0.0) Gecko/20020530

On 8/6/2002 7:35 AM, Robert W. Kuhn wrote:

The world will end on this date: 2002-02-02 18:12:00
ASKER CERTIFIED SOLUTION
pepr
Thanks, pepr, you made my day!! This is genius: if I'd thought to research the email module, I would have found the get_payload() function myself! But I didn't think of it; it's too easy to miss important points when trying to learn many things at once.

Now that you've answered the question I should just close it, but I'll wait until tomorrow in case I think of something I failed to understand. Thanks again!
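
The accepted answer's code is not shown in this thread, so the following is only a sketch of how the get_payload() method mentioned above might be used on a message shaped like the sample, including the blank-line stripping asked about earlier. The `extract_body` helper and the shortened `raw` message are hypothetical, and the multipart branch assumes the first text/plain part is the one wanted:

```python
from email import message_from_string

# Shortened copy of the sample message from this thread.
raw = (
    "Subject: Re: [WinEdt] Email-Mode\n"
    "Message-ID: <200805051111.33339999>\n"
    "\n"
    "On 8/6/2002 7:35 AM, Robert W. Kuhn wrote:\n"
    "\n"
    "The world will end on this date: 2002-02-02 18:12:00\n"
)

def extract_body(msg):
    """Return the text body of a parsed message (hypothetical helper)."""
    if msg.is_multipart():
        # Assumption: the first text/plain part is the body we want.
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                return part.get_payload()
        return ''
    return msg.get_payload()

msg = message_from_string(raw)
body = extract_body(msg)

# Drop lines that are empty or contain only whitespace.
compact = '\n'.join(line for line in body.splitlines() if line.strip())
print(compact)
```

Note that get_payload() without arguments returns the payload as stored, so a body sent as base64 or quoted-printable would still be encoded; passing `decode=True` returns decoded bytes instead.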
Thanks!!