Vladimir Buzalka (Czechia) asked:

Encoded email headers how to decode in Python

Dear Experts

I am reading eml files in Python; this is the format in which I keep my archived emails. I am just starting with Python, so I often get stuck on something that must be quite easy.

Now I am stuck on an international header, specifically From:

This string appears in the email file right after the keyword From: =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= <helpdesk@active24.cz>

It is simply the name of an operator at a big Czech company providing internet services. Displayed correctly, the name is "Martin Bočan ACTIVE24 <helpdesk@active24.cz>".

Is there a way I can decode that string and work with it?

Many thanks

Vladimir
Norie

Vladimir

How are you currently reading the files?
Vladimir Buzalka (ASKER)

Dear Norie

I opened the file in binary mode and asked the re module to find "From":
import re
x=open('message.eml','rb')
y=x.read()
# y is bytes, so the pattern has to be a bytes pattern as well
r1=re.findall(br"^From:(.*)",y,re.M)
print(r1)

And I got =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= + real email

Can you advise?

Thanks
V
=?UTF-8?   indicates that the text that follows uses the UTF-8 charset.
?Q?        means quoted-printable style encoding (?B? = base64 encoded).
String     is the encoded text itself.
?=         marks the end of the encoded part.

So in this case you need to decode quoted-printable:   Martin_Bo=C4=8Dan_ACTIVE24

In quoted-printable, every =XX, where XX is a pair of hexadecimal digits, needs to be replaced with the byte that has that hex value.

This may be helpful: https://docs.python.org/2/library/quopri.html
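
The structure described above can be sketched in a few lines. This is only an illustration, not a full decoder: the helper names split_encoded_word and replace_hex_escapes are made up for this example, and the code performs only the =XX replacement step described in this answer, nothing more.

```python
import re

# One encoded word looks like: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def split_encoded_word(word):
    """Return (charset, encoding, payload) for a single encoded word."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError("not an encoded word: %r" % word)
    charset, encoding, payload = match.groups()
    return charset.lower(), encoding.upper(), payload


def replace_hex_escapes(payload):
    """Replace every =XX escape with the byte whose hex value is XX."""
    data = payload.encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes.fromhex(m.group(1).decode("ascii")),
        data,
    )


charset, encoding, payload = split_encoded_word(
    "=?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?="
)
raw = replace_hex_escapes(payload)
print(charset, encoding, payload)
print(raw)                  # bytes, still to be decoded with the charset
print(raw.decode(charset))  # text
```

The standard library's quopri module does the =XX step for you, so the manual version is only useful for seeing what happens to the bytes.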
Hi noci,

Thanks a lot for the advice. I tried the email.parser module from the Python standard library:
from email import policy
from email.parser import BytesParser
with open('message.eml', 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
print('From:', msg['from'])

This gave me the fully correct From name: Martin Bočan ACTIVE24 <helpdesk@active24.cz>
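
If the goal is to work with the name and the address separately, a header parsed with policy.default also exposes them as attributes. A minimal sketch, assuming a message built from an in-memory byte string instead of message.eml (the Subject line and body here are placeholders, not taken from the real file):

```python
from email import policy
from email.parser import BytesParser

# Stand-in for the contents of message.eml; only the From line is real.
raw = (
    b"From: =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= <helpdesk@active24.cz>\r\n"
    b"Subject: placeholder\r\n"
    b"\r\n"
    b"placeholder body\r\n"
)

msg = BytesParser(policy=policy.default).parsebytes(raw)

# With policy.default, address headers carry a tuple of Address objects.
sender = msg["from"].addresses[0]
print(sender.display_name)  # decoded display name
print(sender.addr_spec)     # bare email address
```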

But I am still in the dark regarding the decoding.

When I followed your advice, i.e. quoted-printable, I decoded via
import quopri
retezec2=b'Martin_Bo=C4=8Dan_ACTIVE24'
retezec2=quopri.decodestring(retezec2)
print(retezec2)
print(retezec2.decode('utf-8'))
print(retezec2.decode('windows-1250'))

I got
b'Martin_Bo\xc4\x8dan_ACTIVE24'
Martin_Bočan_ACTIVE24
Martin_BoÄŤan_ACTIVE24

You can see that UTF-8 decoding works OK; however, I am completely confused by the underscores, for example the one after the letter "n". Why are they still kept in the string returned by retezec2.decode('utf-8'), while in the previous email.parser example they are not there? In Outlook they are not there either.

Many thanks

Vladimir
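
On the underscore question: the "Q" encoding used in mail headers (RFC 2047) is close to quoted-printable but not identical, and inside an encoded word an underscore stands for a space. quopri.decodestring has a header argument for exactly this case, and email.header can decode the whole header value. A minimal sketch, reusing the strings from this thread:

```python
import quopri
from email.header import decode_header, make_header

payload = b"Martin_Bo=C4=8Dan_ACTIVE24"

# Plain quoted-printable: underscores are left untouched.
plain = quopri.decodestring(payload).decode("utf-8")

# Header ("Q") mode: underscores are decoded as spaces.
as_header = quopri.decodestring(payload, header=True).decode("utf-8")

print(plain)
print(as_header)

# Decoding the complete From value, encoded word and address together.
full = "=?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= <helpdesk@active24.cz>"
decoded = str(make_header(decode_header(full)))
print(decoded)
```

This would explain why email.parser and Outlook show spaces: they treat the text as a header encoded word, while quopri.decodestring without header=True treats it as ordinary quoted-printable body text.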
Here is the eml file we are discussing, in case you want to try it: message.eml