Solved

Python 2.7 - French characters

Posted on 2016-09-17
Last Modified: 2016-10-02
Hi there,

1.html contains French characters (éèÉç etc...)

In Python 2.7, I need to print the file content with the proper French characters.

Thanks for your help,
Rene

f = open('1.html', 'r')
file_contents = f.read()
print (file_contents)
f.close()

Question by:ReneGe
6 Comments
 
LVL 29

Accepted Solution

by:
pepr earned 500 total points
ID: 41803395
In Python 2.x, the open() function returns a file object that looks as if it yields text, but it does not apply any encoding: read() returns the raw bytes of the file in a str value. A Python 2 str is really a string of bytes. The only thing that helps you reliably with national alphabets is unicode strings (the u'prefixed' string literals, and byte strings that have been decoded to unicode).

In Python 2.x you can use the codecs.open() function of the standard codecs module. It differs from open() in its encoding argument, which tells how the bytes from the file should be converted to a unicode string.

When the unicode string is printed to the console, it is likely to be converted to the encoding the console expects.

In Python 3.x, the str type is what u'string' is in Python 2, and the built-in open() does what codecs.open() did in Python 2.

import codecs
with codecs.open('1.html', 'r', encoding='utf-8') as f:  # decode the bytes as UTF-8 while reading
    content = f.read()  # content is a unicode string, not a byte string
    print content
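
The difference between the bytes and the decoded text can be seen in a small sketch. It assumes io.open(), which is available from Python 2.6 on and is the built-in open() in Python 3, so the same lines run on both. The file name demo_fr.html and the sample text are made up for illustration and are written to a temporary directory so that the sketch is self-contained.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo_fr.html')
sample = u'\xe9\xe8\xc9\xe7'  # the characters e-acute, e-grave, E-acute, c-cedilla

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Binary mode: the raw bytes exactly as stored on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

# Text mode with an explicit encoding: decoded (unicode) text.
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

# Each of these characters takes two bytes in UTF-8.
print('%d bytes, %d characters' % (len(raw), len(content)))
```

The encoding passed when reading has to match the one the file was really saved with; 'utf-8' here is only correct because the sketch wrote the file that way.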

 
LVL 10

Author Comment

by:ReneGe
ID: 41803408
Hi pepr,

Thanks for your prompt response, explanation, and code.

I tried your code and this is what I got.

Traceback (most recent call last):
  File "1.py", line 3, in <module>
    content = f.read()
  File "C:\Python27\lib\codecs.py", line 674, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 480, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1246: invalid start byte


Cheers
 
LVL 29

Assisted Solution

by:pepr
pepr earned 500 total points
ID: 41803942
If your file uses a different encoding, you have to pass that encoding instead of UTF-8. If the file was generated on Windows, you should probably use 'cp1252'. If it was created on a Unix-based system, it may be ISO-8859-15.
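
To illustrate why the traceback complains about byte 0xab: in cp1252 that byte is the opening French quotation mark, but a lone 0xab is not a valid start byte in UTF-8. This is a minimal sketch that assumes the file contains such quotation marks; the sample bytes are made up and are not taken from 1.html.

```python
# Bytes as a cp1252-encoded file might store a quoted French word.
raw = b'\xab Bonjour \xbb'

# Decoding as UTF-8 fails, because 0xab cannot start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode('cp1252')

print(utf8_ok)
print(repr(text))
```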

If the HTML was constructed properly, you can find the encoding declared near the beginning of the file, in the head section (the charset value of a meta tag).
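
Looking up the declared encoding can be sketched roughly as follows. The helper name sniff_html_charset, the 2048-byte window, and the 'cp1252' fallback are my assumptions, not part of any library, and the regular expression is a simplification rather than a real HTML parser. A declared charset can also be wrong, so treat the result as a hint.

```python
import re

def sniff_html_charset(raw_bytes, default='cp1252'):
    """Hypothetical helper: return the charset declared near the start of an HTML file."""
    head = raw_bytes[:2048]  # the declaration is expected early in the file
    match = re.search(br'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
                      head, re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default  # nothing declared: fall back to a guess

# Made-up sample bytes; a real script would read them with open('1.html', 'rb').
raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=windows-1252"></head>'
       b'<body>\xab caf\xe9 \xbb</body></html>')

encoding = sniff_html_charset(raw)
text = raw.decode(encoding)
print(encoding)
```

The name found this way can then be passed as the encoding argument of codecs.open().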

 
LVL 29

Expert Comment

by:pepr
ID: 41825667
Hi ReneGe. Have you found a solution?
 
LVL 10

Author Comment

by:ReneGe
ID: 41825669
Hi pepr,

Sorry for taking so long to reply.

cp1252 worked :)

Thank you so much for your help :)

Cheers mate!
 
LVL 10

Author Comment

by:ReneGe
ID: 41825670
Thanks
