Solved

Python 2.7 - French characters

Posted on 2016-09-17
Last Modified: 2016-10-02
Hi there,

1.html contains French characters (éèÉç etc...)

In Python 2.7, I need to print the file content with the proper French characters.

Thanks for your help,
Rene

f = open('1.html', 'r')
file_contents = f.read()
print (file_contents)
f.close()

Question by:ReneGe
6 Comments
 
LVL 28

Accepted Solution

by:
pepr earned 500 total points
ID: 41803395
In Python 2.x, the open() function returns a file object that looks as if it yields text, but it does not apply any encoding: read() returns the raw bytes of the file in a str object. In Python 2, a str is a string of bytes. The only reliable way to handle national alphabets is to use unicode strings, that is, u'prefixed' string literals or byte strings that have been decoded into unicode.
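
To make the bytes-versus-unicode distinction concrete, here is a small sketch. The sample bytes are made up for illustration (the French word for "summer" as it would be stored in a cp1252 file), and the code is written so that it should run unchanged on both Python 2.7 and Python 3.

```python
# Hypothetical sample: three bytes as they would sit in a cp1252-encoded file.
raw = b'\xe9t\xe9'

# Decoding turns the bytes into a unicode string of three characters.
text = raw.decode('cp1252')

print(len(raw))                   # 3 bytes
print(len(text))                  # 3 characters
print(len(text.encode('utf-8')))  # 5 bytes: UTF-8 needs two bytes per e-acute
```

The same three characters occupy a different number of bytes depending on the encoding, which is why the bytes alone are not enough and the encoding has to be named when reading the file.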

In Python 2.x you can use the codecs.open() function from the standard codecs module. It differs from open() in its encoding argument, which tells how the bytes from the file should be converted to a unicode string.

When a unicode string is printed to the console, it is usually converted to the console's encoding automatically.

In Python 3.x, the str type is what u'string' was in Python 2, and the built-in open() does what codecs.open() did in Python 2.

import codecs
with codecs.open('1.html', 'r', encoding='utf-8') as f:
    content = f.read()
    print content
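
An alternative worth knowing: io.open() is available from Python 2.6 onward and is the same function as the built-in open() in Python 3, so code written with it should carry over unchanged. The sketch below first writes a small sample file so that it is self-contained; the file name sample_fr.html and its contents are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only so the sketch can run on its own.
path = os.path.join(tempfile.gettempdir(), 'sample_fr.html')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'<p>\xe9l\xe8ve, fran\xe7ais</p>')

# io.open() decodes while reading, just as codecs.open() does.
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(repr(content))  # content is a unicode string, not bytes
os.remove(path)
```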

 
LVL 10

Author Comment

by:ReneGe
ID: 41803408
Hi pepr,

Thanks for your prompt response, explanation, and code.

I tried your code and this is what I got.

Traceback (most recent call last):
  File "1.py", line 3, in <module>
    content = f.read()
  File "C:\Python27\lib\codecs.py", line 674, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 480, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1246: invalid start byte


Cheers
 
LVL 28

Assisted Solution

by:pepr
pepr earned 500 total points
ID: 41803942
The error means the file is not valid UTF-8, so you have to pass its actual encoding instead. If the file was generated on Windows, you should probably use 'cp1252'. If it was created on a Unix-based system, it may be 'iso-8859-15'.
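
If the encoding is not known in advance, one pragmatic approach is to try strict UTF-8 first and fall back to cp1252. The helper below is only a sketch under that assumption: the name decode_with_fallback and the candidate list are made up for illustration. The order matters, because accented text in a single-byte encoding almost always fails strict UTF-8 decoding, whereas cp1252 accepts nearly any byte sequence and would mask a UTF-8 file if tried first.

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first encoding that decodes data cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit: %r' % (encodings,))

# Hypothetical sample containing byte 0xab, the opening guillemet in cp1252.
raw = b'\xabfran\xe7ais\xbb'
text, used = decode_with_fallback(raw)
print(used)
```

Byte 0xab is the opening guillemet in cp1252, a character common in French text, which is consistent with the traceback above. A successful fallback decode does not prove the guess is right, so it is still worth checking the output by eye.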

If the HTML was constructed properly, you can find the encoding declared near the beginning, in a meta tag in the head section.
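
Looking for that declaration can be automated. The following is a rough sketch, not a full HTML parser: the function name sniff_html_charset, the 1024-byte limit, and the sample markup are assumptions made for illustration. It reads the raw bytes and searches for charset=... inside a meta tag, which covers both the short HTML5 form and the older http-equiv form.

```python
import re

# Matches charset=... inside a <meta> tag (HTML5 form or http-equiv form).
META_CHARSET = re.compile(
    br'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_html_charset(raw, limit=1024):
    """Return the declared charset name as text, or None if none is found."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode('ascii') if match else None

# Hypothetical head section of a Windows-generated page.
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=windows-1252"></head>')
print(sniff_html_charset(sample))
```

The returned name can then be passed as the encoding argument when opening the file. Python accepts 'windows-1252' as an alias for 'cp1252'. A declaration can be wrong or missing, so treat the result as a hint.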
 
LVL 28

Expert Comment

by:pepr
ID: 41825667
Hi ReneGe. Have you found a solution?
 
LVL 10

Author Comment

by:ReneGe
ID: 41825669
Hi pepr,

Sorry for taking so long to reply.

cp1252 worked :)

Thank you so much for your help :)

Cheers mate!
 
LVL 10

Author Comment

by:ReneGe
ID: 41825670
Thanks