• Status: Solved

Python unicode problem

I'm experimenting with Scrapy, but the Chinese characters in the stored result appear as \u4e00\u8d77\u6e38\u5427(\u5168\u7403\u65c5\u884c\u8d34\u8eab\u4f34\u4fa3... What should I do? Thanks.
Asked by: fxp007
2 Solutions
 
peprCommented:
What do you expect to be stored? Where is the problem?
 
fxp007Author Commented:
I want to see the actual Chinese characters instead of \u....\u...
 
peprCommented:
How do you display the result? I guess that you print it to a console window that cannot display the characters, so it prints their symbolic representation as Unicode escape sequences instead. Try the following code, which stores the value in an HTML file, and then open the resulting file in a browser:

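import codecs

# The value from the question, written with Unicode escape sequences.
s = u'\u4e00\u8d77\u6e38\u5427(\u5168\u7403\u65c5\u884c\u8d34\u8eab\u4f34\u4fa3'

# Escaped form as plain ASCII text. Unlike repr(s), this gives the same
# result in Python 2 and Python 3.
escaped = s.encode('unicode_escape').decode('ascii')

# Write the page as UTF-8; the charset in the meta tag must match.
with codecs.open('test.html', 'w', encoding='utf-8') as f:
    f.write('''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Chinese test</title>
</head>
<body>
<p>The value displayed via HTML browser: ''')
    f.write(s)
    f.write('''</p>
<p>Representation of the same value using escape sequences: ''')
    f.write(escaped)
    f.write('''</p>
</body>
</html>''')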



In my case it displays:

[Image: snapshot of the browser window content]

In other words, your problem may actually be no problem. The stored representation can be fine; probably only your output device is not capable of displaying the Chinese characters.
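To illustrate that the escaped form and the readable form are the same value, here is a minimal sketch. It assumes the stored result is JSON, which the question does not actually say; the variable names are only for illustration.

```python
import json

# First four characters of the value from the question.
s = u'\u4e00\u8d77\u6e38\u5427'

# By default json.dumps escapes every non-ASCII character.
escaped = json.dumps(s)

# ensure_ascii=False keeps the characters themselves in the output.
readable = json.dumps(s, ensure_ascii=False)

# Both forms decode back to the same four-character string.
same_value = json.loads(escaped) == json.loads(readable) == s

print(escaped)
print(same_value)
```

If the file really is JSON, any JSON reader will turn the \u sequences back into the Chinese characters, so nothing has been lost in storage.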
test.zip
 
jpg526Commented:
Chinese characters cannot be stored directly; they must be converted into Unicode first. You can use one of the available tools to look up the Unicode value of each Chinese character (e.g. http://weber.ucsd.edu/~dkjordan/resources/unicodemaker.html).

The solution from pepr should be useful for you.
 
peprCommented:
There are also encodings older than Unicode, so there are more ways to store Chinese characters in a file. However, the Unicode way should be preferred these days.

In the Unicode standard, each character (Chinese or any other) is assigned one unambiguous numeric value, called a code point. That is, each character is tied to a concrete number, and that number is a purely abstract integer. When you want to store Unicode text in a file, you have to choose how those integers are stored as bytes. UTF-8 is one of several possible ways.
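A small sketch of that distinction, using the first two characters from the question. The variable names are only for illustration, and UTF-16 is shown just as a second example of an encoding.

```python
# First two characters of the value from the question.
s = u'\u4e00\u8d77'

# The abstract integers (code points) assigned to the characters.
code_points = [ord(ch) for ch in s]

# Two different ways of storing the same integers as bytes.
utf8_bytes = s.encode('utf-8')         # three bytes per character here
utf16_bytes = s.encode('utf-16-be')    # two bytes per character here

# Decoding with the matching encoding gives back the original text.
round_trip_ok = (utf8_bytes.decode('utf-8') == s and
                 utf16_bytes.decode('utf-16-be') == s)

print([hex(cp) for cp in code_points])
print(len(utf8_bytes), len(utf16_bytes), round_trip_ok)
```

The text is the same in both cases; only the bytes written to the file differ, which is why the reader of the file must know which encoding was used.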

Have a look at Chapter 4, "Strings", by Mark Pilgrim (http://diveintopython3.org/strings.html). It starts with the problems of encodings and continues with an explanation of Unicode and the Unicode encodings.
 
fxp007Author Commented:
Thanks pepr. I'm reading that.
Question has a verified solution.