Vladimir Buzalka
asked on
Encoded email headers how to decode in Python
Dear Experts
I am reading eml files in Python; this is the format in which I keep my archived emails. I am just starting with Python, so I often get stuck on something that must be quite easy.
Now I am stuck on an international header, specifically From:
In the email file, right after the keyword From:, there is this string: =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= <helpdesk@active24.cz>
It is simply the name of an operator at a big Czech company providing internet services. Displayed correctly, the name is "Martin Bočan ACTIVE24 <helpdesk@active24.cz>".
Is there a way I can decode that string and work with it?
Many thanks
Vladimir
ASKER
Dear Norie
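I opened the file in binary mode and asked the re module to find "From":
import re
x=open('message.eml','rb')
y=x.read()
# bytes pattern, because the file was opened in binary mode
r1=re.findall(rb"^From:(.*)",y,re.M)
print(r1)
And I got =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= plus the real email address.
Can you advise?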
Thanks
V
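One possible way to turn the raw value found by the regular expression into readable text is the standard library's email.header module. The following is only a minimal sketch: the raw_message bytes are an assumed stand-in for the contents of message.eml, and the regular expression is an illustration rather than a complete header parser (it ignores folded multi-line headers).

```python
import re
from email.header import decode_header, make_header

# Assumed sample standing in for the bytes read from message.eml.
raw_message = (
    b"From: =?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?= <helpdesk@active24.cz>\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"body\r\n"
)

# Capture the From: value without the trailing carriage return.
match = re.search(rb"^From:[ \t]*(.*?)\r?$", raw_message, re.M)
raw_from = match.group(1).decode("ascii")

# decode_header splits the value into (bytes, charset) parts;
# make_header joins them back into one readable string.
decoded_from = str(make_header(decode_header(raw_from)))
print(decoded_from)
```

For real files, parsing the whole message with the email package is usually more robust than a regular expression, because headers may be folded across several lines.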
=?UTF-8 marks the start of an encoded word and says that a UTF-8 string follows.
?Q means quoted-printable (?B = base64 encoded).
Then comes the encoded string itself.
?= marks the end of the encoded word.
So in this case you need to decode quoted-printable: Martin_Bo=C4=8Dan_ACTIVE24
In quoted-printable, every =XX, where XX is two hexadecimal digits, needs to be replaced with the byte that has that hex value.
This may be helpful: https://docs.python.org/2/library/quopri.html
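The parts described above can be taken apart by hand. This is a rough sketch, assuming a single well-formed encoded word; decode_encoded_word is a hypothetical helper name, not a standard library function. It relies on the header=True flag of quopri.decodestring, which treats an underscore as an encoded space, as is the convention inside headers.

```python
import base64
import quopri


def decode_encoded_word(word: str) -> str:
    """Decode one '=?charset?encoding?payload?=' word (hypothetical helper)."""
    if not (word.startswith("=?") and word.endswith("?=")):
        raise ValueError("not an encoded word")
    # Strip the '=?' and '?=' markers, then split into the three fields.
    charset, encoding, payload = word[2:-2].split("?", 2)
    if encoding.upper() == "Q":
        # header=True makes quopri turn '_' into a space.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    elif encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        raise ValueError("unknown encoding: " + encoding)
    return raw.decode(charset)


print(decode_encoded_word("=?UTF-8?Q?Martin_Bo=C4=8Dan_ACTIVE24?="))
```

In practice email.header.decode_header does the same job and also copes with several encoded words in one header, so the helper is mainly useful for seeing what each field means.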
ASKER
Hi noci
Thanks a lot for the advice. I tried the email.parser module from the Python standard library:
from email import policy
from email.parser import BytesParser
with open('message.eml', 'rb') as fp:
    msg = BytesParser(policy=policy.default).parse(fp)
print('From:', msg['from'])
This gave me the fully correct name: From: Martin Bočan ACTIVE24 <helpdesk@active24.cz>
But I am still in a fog as regards the decoding.
When I followed your advice, i.e. quoted-printable, I decoded via
import quopri
retezec2=b'Martin_Bo=C4=8Dan_ACTIVE24'
retezec2=quopri.decodestring(retezec2)
print(retezec2)
print(retezec2.decode('utf-8'))
print(retezec2.decode('windows-1250'))
I got
b'Martin_Bo\xc4\x8dan_ACTIVE24'
Martin_Bočan_ACTIVE24
Martin_BoÄŤan_ACTIVE24
You can see that UTF-8 decoding works OK, but I am completely confused by the underscore after the letter "n". Why is it still kept in the string from retezec2.decode('utf-8'), while in the previous email.parser example it is not there? And in Outlook it is not there either.
Many thanks
Vladimir
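A likely explanation for the underscore: the "Q" encoding used in headers (RFC 2047) is not exactly the quoted-printable used in message bodies. In a header, an underscore stands for a space, which is why email.parser and Outlook show a space there. quopri.decodestring decodes in body style by default and leaves the underscore alone; passing header=True switches it to the header convention. A small sketch reusing the bytes from the post above (the variable names are only for illustration):

```python
import quopri

encoded = b"Martin_Bo=C4=8Dan_ACTIVE24"

# Default (body-style) decoding keeps underscores as they are.
body_style = quopri.decodestring(encoded).decode("utf-8")

# Header-style decoding treats '_' as an encoded space.
header_style = quopri.decodestring(encoded, header=True).decode("utf-8")

print(body_style)    # Martin_Bočan_ACTIVE24
print(header_style)  # Martin Bočan ACTIVE24
```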
ASKER
Here is the eml file we are discussing, just in case you want to try it: message.eml
How are you currently reading the files?