Posted on 2006-05-01
Last Modified: 2010-04-16
From what I understand when working with Python sockets, specifically the recv and sendall calls, it is String data that is either received or sent.

My question is simple: what encoding is the data sent in?

If it assumes an encoding such as UTF-8, does this mean I cannot write Python socket code which communicates in UTF-16?

If it does not assume an encoding, how does it go about generating String data from the received bytes?

Seems like a catch-22 to me.
Question by:derekl

    Expert Comment

    As far as I can tell sockets send-receive "strings". Encoding is not an issue.

    Expert Comment

    Python strings can be either 8-bit or Unicode:

    >>> s = "Hello"
    >>> type(s)
    <type 'str'>
    >>> u = u"Hello"
    >>> type(u)
    <type 'unicode'>

    8-bit strings are simply sequences of bytes with values from 0 to 255; they know nothing of encodings.  Unicode strings are pure Unicode, in that each character is a Unicode character.  If you want to output a Unicode string to a device that expects an 8-bit string (as sockets do), either you or Python must encode it into an 8-bit string first.  If you send a Unicode string down a socket without explicitly encoding it, it will be encoded using the default system encoding, which is 'ascii' for out-of-the-box Python.  Best practice is to explicitly encode Unicode strings before sending them:

    >>> sock.sendall(u.encode('utf-8'))

    For receiving, the answer is that reading a socket returns an 8-bit string.  If you need to interpret this as Unicode, you need to decode it:

    >>> s = read_my_socket()
    >>> u = s.decode('utf-8')

    If you don't decode it but leave it as an 8-bit string, and then use it in a context that expects a Unicode string, Python will decode it according to the default encoding.
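
    To make the encode-before-send and decode-after-receive steps concrete, here is a minimal sketch that round-trips text through a local socket pair in UTF-16.  The helper names (send_text, recv_text, round_trip) are made up for illustration, and it assumes socket.socketpair() is available on your platform:

```python
import socket


def send_text(sock, text, encoding):
    # Sockets carry bytes, so encode explicitly before sending.
    sock.sendall(text.encode(encoding))


def recv_text(sock, encoding):
    # Read until the peer closes its side, then decode the whole buffer
    # at once so a character split across recv() calls cannot break decoding.
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks).decode(encoding)


def round_trip(text, encoding):
    # Hypothetical helper: send text one way through a connected pair.
    a, b = socket.socketpair()
    try:
        send_text(a, text, encoding)
        a.shutdown(socket.SHUT_WR)  # signal end of data to the reader
        return recv_text(b, encoding)
    finally:
        a.close()
        b.close()
```

    The socket never sees the encoding, only the bytes, so UTF-16 works as well as UTF-8 provided both ends agree on it.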

    (There are two main consequences of Python's choice of having a default encoding, and choosing ascii for that encoding.  One is that most things, at least in the English-speaking world, just work.  The other is that often a piece of software which has been working happily for weeks will fall over with a UnicodeError because it's been presented with an accented character for the first time.)
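
    The "works for weeks, then falls over" behaviour can be reproduced by writing the implicit step out by hand.  This is only a sketch: to_wire is a made-up name, and the explicit encoding argument stands in for the default encoding that Python would otherwise apply behind your back:

```python
def to_wire(text, encoding="ascii"):
    # Explicit stand-in for the implicit default-encoding conversion.
    return text.encode(encoding)


plain = to_wire(u"Hello")                    # fine: pure ASCII
accented = to_wire(u"caf\u00e9", "utf-8")    # fine: UTF-8 can represent it

try:
    to_wire(u"caf\u00e9")                    # first accented character arrives
    failed = False
except UnicodeEncodeError:
    failed = True                            # 'ascii' cannot encode u'\xe9'
```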

    Author Comment

    Thanks for the excellent answer, Richie.  I was sort of converging on the fact that normal Python strings can be treated more or less as byte buffers, but your answer cleared it up for me.  If you don't mind, could you tell me what's wrong with the following code?

        string_builder.append(u"GEORGIAN - &#4317;&#4307;&#4308;&#4321;&#4304;&#4330; &#4315;&#4321;&#4317;&#4324;&#4314;&#4312;&#4317;&#4321; &#4323;&#4320;&#4311;&#4312;&#4320;&#4311;&#4317;&#4305;&#4304; &#4321;&#4323;&#4320;&#4321;, &#4312;&#4306;&#4312; Unicode-&#4312;&#4321; &#4308;&#4316;&#4304;&#4310;&#4308; &#4314;&#4304;&#4318;&#4304;&#4320;&#4304;&#4313;&#4317;&#4305;&#4321;")

    I get the following error:

    UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-19: character maps to <undefined>

    Author Comment

    The above was a bunch of Georgian Unicode characters before I submitted them to the server and they were escaped.

    Expert Comment

    Your problem isn't with the code, but with printing the results.  Just like writing a Unicode string to a socket without encoding it first, writing a Unicode string to the terminal (e.g. using "print") will try to encode it using the default encoding.  Printing the result of your code will try to encode it as ascii (or whatever your default encoding is).  Try encoding it, e.g. as UTF-8, before printing it.
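
    As a rough illustration of that advice, the sketch below encodes the first Georgian word from the sample explicitly.  It assumes the 'charmap' in the error came from a code page such as cp1252, which is a guess about the poster's setup, and safe_bytes is a made-up helper name:

```python
# First Georgian word from the escaped sample (&#4317;&#4307;&#4308;...).
georgian = u"\u10dd\u10d3\u10d4\u10e1\u10d0\u10ea"


def safe_bytes(text, encoding):
    # Encode explicitly; if the target encoding cannot represent a
    # character, fall back to backslash escapes instead of raising.
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return text.encode(encoding, "backslashreplace")


utf8_bytes = safe_bytes(georgian, "utf-8")     # UTF-8 can represent everything
cp1252_bytes = safe_bytes(georgian, "cp1252")  # escaped, because cp1252 cannot
```

    Whether the escaped fallback is acceptable depends on where the bytes are going; for a log it may be fine, for data sent to another program it probably is not.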

    Author Comment

    I'm not trying to print it, simply join the two, but I'm assuming the same rule still applies.  Is it me or is Python's Unicode handling hideous?  How do I guard against people potentially handing in normal strings to functions which expect Unicode and vice versa?

    Accepted Solution

    The rule may well apply when you use a plain string ('') to join two Unicode strings... we're reaching the limits of my knowledge.  8-)

    As to guarding against programmers abusing your APIs, that's what assert is for!  From my current project:

    def my_unicode_api(u):
        assert isinstance(u, unicode), "You must pass Unicode strings"
        # ...
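
    A variation on the same guard, offered only as a sketch: it raises TypeError instead of using assert, since assert statements are skipped when Python runs with -O.  The names text_type, require_text and join_text are invented for this example, and the version check is there so the snippet also runs on interpreters where the text type is called str:

```python
import sys

# The Unicode text type: `unicode` on Python 2, `str` on Python 3.
text_type = unicode if sys.version_info[0] < 3 else str  # noqa: F821


def require_text(value, name="value"):
    # Reject byte strings early rather than letting an implicit
    # decode fail somewhere far from the caller's mistake.
    if not isinstance(value, text_type):
        raise TypeError("%s must be a Unicode string, got %s"
                        % (name, type(value).__name__))
    return value


def join_text(parts, sep=u""):
    # Join only after checking that the separator and every part are text.
    require_text(sep, "sep")
    return sep.join(require_text(p, "part") for p in parts)
```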

    Python's Unicode handling is in transition.  Ideally all strings would be Unicode, just like in Java, and there would be a separate byte array type.  That's what's planned for Python 3.0.  The problem is that large parts of the standard library, and large numbers of third-party modules, wouldn't work under a model where all strings were Unicode and you needed to explicitly decode data on the way in and encode it on the way out.  What we have now is a compromise, which works pretty well for people living in blissful ignorance of Unicode, and very well if you take the time to learn its ins and outs.  (And yes, sometimes I think it's hideous too.  8-)

    Author Comment

    I like the idea of type checking the strings at run time.  I'm going to add that to all of my methods which take strings.  Thanks again!
