


Posted on 2006-05-01
Medium Priority
Last Modified: 2010-04-16
From what I understand when working with Python sockets, specifically the recv and sendall calls, it is String data that is either received or sent.

My question is simple: what encoding is the data sent in?

If it assumes an encoding such as UTF-8, does this mean I cannot write Python socket code which communicates in UTF-16?

If it does not assume an encoding, how does it go about generating String data from the received bytes?

Seems like a catch-22 to me.
Question by:derekl
LVL 17

Expert Comment

ID: 16582349
As far as I can tell, sockets send and receive "strings". Encoding is not an issue.
LVL 14

Expert Comment

ID: 16584739
Python strings can be either 8-bit or Unicode:

>>> s = "Hello"
>>> type(s)
<type 'str'>
>>> u = u"Hello"
>>> type(u)
<type 'unicode'>

8-bit strings are simply sequences of bytes with values from 0 to 255.  They know nothing of encodings.  Unicode strings are pure Unicode, in that each character is a Unicode character.  If you want to output a Unicode string to a device that expects an 8-bit string (as sockets do), either you or Python must encode it into an 8-bit string first.  If you send a Unicode string down a socket without explicitly encoding it, it will be encoded using the default system encoding, which is 'ascii' for out-of-the-box Python, so any non-ASCII character raises a UnicodeEncodeError.  Best practice is to explicitly encode Unicode strings before sending them:

>>> sock.send(u.encode('utf-8'))

For receiving, the answer is that reading a socket returns an 8-bit string.  If you need to interpret this as Unicode, you need to decode it:

>>> s = read_my_socket()
>>> u = s.decode('utf-8')

If you don't decode it but leave it as an 8-bit string, and then use it in a context that expects a Unicode string, Python will decode it according to the default encoding.
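
To make the encode-on-send, decode-on-receive pattern concrete, here is a minimal sketch. It is written for Python 3, where str is the Unicode type and bytes is the 8-bit type (an assumption; the snippets above are Python 2). The roundtrip helper is hypothetical, and socket.socketpair() stands in for a real network connection. It also suggests that the socket does not care which encoding you pick, so UTF-16 works just as well as UTF-8:

```python
import socket


def roundtrip(text, encoding):
    """Send text over a connected socket pair and decode it on the other end."""
    sender, receiver = socket.socketpair()
    try:
        # Explicitly turn the Unicode text into bytes before sending.
        sender.sendall(text.encode(encoding))
        # Signal end-of-data so the receiving loop sees an empty read.
        sender.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            chunk = receiver.recv(4096)  # recv returns raw bytes, never text
            if not chunk:
                break
            chunks.append(chunk)

        # Decode only once all the bytes have arrived.
        return b"".join(chunks).decode(encoding)
    finally:
        sender.close()
        receiver.close()
```

Both ends have to agree on the encoding by some other means (for example, the protocol specifies it), because the bytes themselves do not say how they were encoded.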

(There are two main consequences of Python's choice of having a default encoding, and choosing ascii for that encoding.  One is that most things, at least in the English-speaking world, just work.  The other is that often a piece of software which has been working happily for weeks will fall over with a UnicodeError because it's been presented with an accented character for the first time.)

Author Comment

ID: 16586375
Thanks for the excellent answer, Richie.  I was sort of converging on the fact that normal Python strings can be treated more or less as byte buffers, but your answer cleared it up for me.  If you don't mind, could you tell me what's wrong with the following code?

    string_builder.append(u"GEORGIAN - &#4317;&#4307;&#4308;&#4321;&#4304;&#4330; &#4315;&#4321;&#4317;&#4324;&#4314;&#4312;&#4317;&#4321; &#4323;&#4320;&#4311;&#4312;&#4320;&#4311;&#4317;&#4305;&#4304; &#4321;&#4323;&#4320;&#4321;, &#4312;&#4306;&#4312; Unicode-&#4312;&#4321; &#4308;&#4316;&#4304;&#4310;&#4308; &#4314;&#4304;&#4318;&#4304;&#4320;&#4304;&#4313;&#4317;&#4305;&#4321;")

I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-19: character maps to <undefined>


Author Comment

ID: 16586389
The above was a bunch of Georgian Unicode characters before I submitted them to the server and they were escaped.
LVL 14

Expert Comment

ID: 16590414
Your problem isn't with the code, but with printing the results.  Just like writing a Unicode string to a socket without encoding it first, writing a Unicode string to the terminal (e.g. using "print") will try to encode it using the default encoding.  Printing the result of your code will try to encode it as ascii (or whatever your default encoding is; the 'charmap' in your error message suggests a code-page encoding rather than ascii).  Try encoding it, e.g. as UTF-8, before printing it.
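
A small sketch of that failure mode, written for Python 3 (an assumption; the thread itself is Python 2). The thread does not say which code page was in use, so cp1252 here is only an illustrative guess, and can_encode and safe_encode are hypothetical helper names. The sample text is the first Georgian word from the question, written with escapes:

```python
# "\u10dd\u10d3\u10d4\u10e1\u10d0\u10ea" is six Georgian letters.
GEORGIAN = "\u10dd\u10d3\u10d4\u10e1\u10d0\u10ea"


def can_encode(text, encoding):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def safe_encode(text, encoding):
    """Encode text, substituting '?' for characters the encoding lacks."""
    return text.encode(encoding, errors="replace")
```

UTF-8 can represent any Unicode character, so encoding to it never fails for this reason; a code page or ascii can only represent its own small repertoire. Using errors="replace" trades the exception for lost characters, which may or may not be acceptable.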

Author Comment

ID: 16590546
I'm not trying to print it, simply join the two, but I'm assuming the same rule still applies.  Is it me or is Python's Unicode handling hideous?  How do I guard against people potentially passing normal strings to functions which expect Unicode, and vice versa?
LVL 14

Accepted Solution

RichieHindle earned 500 total points
ID: 16590671
The rule may well apply when you use a plain string ('') to join two Unicode strings... we're reaching the limits of my knowledge.  8-)
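
For what it's worth, one way to sidestep the question is to convert everything to Unicode before joining, so that no implicit conversion is needed. In Python 2, mixing the two types in a join makes Python decode the 8-bit strings with the default encoding; in Python 3, mixing bytes and str raises a TypeError instead. A minimal Python 3 sketch, where to_text and join_text are hypothetical names and UTF-8 is an assumed encoding for incoming bytes:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_text(parts, sep=""):
    """Join a mix of bytes and text after normalising every part to text."""
    return sep.join(to_text(part) for part in parts)
```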

As to guarding against programmers abusing your APIs, that's what assert is for!  From my current project:

def my_unicode_api(u):
    assert isinstance(u, unicode), "You must pass Unicode strings"
    # ...
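
A Python 3 flavoured variant of the same guard, offered as a sketch rather than a recommendation. It assumes Python 3, where the text type is str and the 8-bit type is bytes. The name require_text is hypothetical. It raises TypeError rather than using assert, because assert statements are skipped when Python runs with the -O option:

```python
def require_text(value, name="value"):
    """Return value unchanged if it is text; otherwise raise TypeError."""
    if not isinstance(value, str):
        raise TypeError(
            "%s must be text (str), got %s" % (name, type(value).__name__)
        )
    return value


def my_text_api(u):
    u = require_text(u, "u")
    # ... work with u, knowing it is text ...
    return len(u)
```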

Python's Unicode handling is in transition.  Ideally all strings would be Unicode, just like in Java, and there would be a separate byte array type.  That's what's planned for Python 3.0.  The problem is that large parts of the standard library, and large numbers of third-party modules, wouldn't work under a model where all strings were Unicode and you needed to explicitly decode data on the way in and encode it on the way out.  What we have now is a compromise, which works pretty well for people living in blissful ignorance of Unicode, and very well if you take the time to learn its ins and outs.  (And yes, sometimes I think it's hideous too.  8-)
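
Decoding on the way in has one wrinkle with sockets that is worth a sketch: a single recv call may return only part of a multi-byte character, so decoding each chunk on its own can fail. The standard codecs module provides incremental decoders that hold back incomplete bytes until the rest arrives. The example below assumes Python 3, and decode_chunks is a hypothetical helper that takes a list of chunks standing in for successive recv results:

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks that may split a character across chunk boundaries."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder raise if bytes are still left over.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)
```

An alternative is to collect all the bytes first and decode once at the end, which is simpler when the whole message fits comfortably in memory.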

Author Comment

ID: 16609737
I like the idea of type checking the strings at run time.  I'm going to add that to all of my methods which take strings.  Thanks again!


Question has a verified solution.
