derekl

Sockets

From what I understand when working with Python sockets, specifically the recv and sendall calls, it is String data that is either received or sent.

My question is simple: what encoding is the data sent in?

If it assumes an encoding such as UTF-8, does this mean I cannot write Python socket code which communicates in UTF-16?

If it does not assume an encoding, how does it go about generating String data from the received bytes?

Seems like a catch-22 to me.
ramrom

As far as I can tell, sockets send and receive "strings". Encoding is not an issue.
RichieHindle

In Python 2, strings can be either 8-bit (str) or Unicode (unicode):

>>> s = "Hello"
>>> type(s)
<type 'str'>
>>> u = u"Hello"
>>> type(u)
<type 'unicode'>

8-bit strings are simply sequences of bytes with values from 0 to 255. They know nothing of encodings. Unicode strings are pure Unicode, in that each character is a Unicode character. If you want to output a Unicode string to a device that expects an 8-bit string (as sockets do), either you or Python must encode it into an 8-bit string first. If you send a Unicode string down a socket without explicitly encoding it, it will be encoded using the default system encoding, which is 'ascii' for out-of-the-box Python. Best practice is to explicitly encode Unicode strings before sending them:

>>> sock.send(u.encode('utf-8'))
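To make the UTF-16 part of the question concrete, here is a small sketch of how the chosen encoding decides which bytes go on the wire. It assumes Python 3, where the 8-bit type is spelled bytes and the Unicode type is spelled str; the variable names are only for illustration.

```python
text = "Hello"

# The same text becomes different bytes depending on the encoding you pick.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark

print(utf8_bytes)   # b'Hello'
print(utf16_bytes)  # b'H\x00e\x00l\x00l\x00o\x00'

# Either form decodes back to the same text, as long as both ends
# agree on which encoding was used.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

The socket never sees the encoding at all, only the bytes, so UTF-16 works just as well as UTF-8 provided the receiver decodes with the same codec.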

For receiving, the answer is that reading a socket returns an 8-bit string.  If you need to interpret this as Unicode, you need to decode it:

>>> s = read_my_socket()
>>> u = s.decode('utf-8')
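Putting the send and receive halves together, a round trip might look roughly like the sketch below. It assumes Python 3 and uses socket.socketpair() as a stand-in for a real connection. The helpers send_text, recv_text and round_trip are hypothetical names, and the receiver is assumed to know the byte count in advance, which a real protocol would have to communicate somehow.

```python
import socket


def send_text(sock, text, encoding="utf-8"):
    """Encode explicitly, then send every byte."""
    sock.sendall(text.encode(encoding))


def recv_text(sock, nbytes, encoding="utf-8"):
    """Collect exactly nbytes, then decode once at the end."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)  # may return fewer bytes than asked for
        if not chunk:
            raise ConnectionError("socket closed before all bytes arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks).decode(encoding)


def round_trip(text, encoding):
    """Send text over a local socket pair and read it back."""
    a, b = socket.socketpair()
    try:
        nbytes = len(text.encode(encoding))
        send_text(a, text, encoding)
        return recv_text(b, nbytes, encoding)
    finally:
        a.close()
        b.close()


print(round_trip("Hello", "utf-16"))
```

Decoding only after all the bytes have arrived matters because a single recv call can stop in the middle of a multi-byte character.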

If you don't decode it but leave it as an 8-bit string, and then use it in a context that expects a Unicode string, Python will decode it according to the default encoding.

(There are two main consequences of Python's choice of having a default encoding, and choosing ascii for that encoding.  One is that most things, at least in the English-speaking world, just work.  The other is that often a piece of software which has been working happily for weeks will fall over with a UnicodeError because it's been presented with an accented character for the first time.)
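The "works for weeks, then falls over" behaviour can be reproduced by encoding with ascii on purpose. This is a sketch in Python 3 (which no longer encodes implicitly, so the ascii codec is named explicitly); can_encode is a hypothetical helper.

```python
def can_encode(text, encoding):
    """Return True if every character of text fits in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("cafe", "ascii"))       # True
print(can_encode("caf\u00e9", "ascii"))  # False: the accented e is not ASCII
print(can_encode("caf\u00e9", "utf-8"))  # True
```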
derekl

ASKER

Thanks for the excellent answer, Richie.  I was sort of converging on the fact that normal Python strings can be treated more or less as byte buffers, but your answer cleared it up for me.  If you don't mind, could you tell me what's wrong with the following code?

    string_builder.append("str")
    string_builder.append(u"GEORGIAN - &#4317;&#4307;&#4308;&#4321;&#4304;&#4330; &#4315;&#4321;&#4317;&#4324;&#4314;&#4312;&#4317;&#4321; &#4323;&#4320;&#4311;&#4312;&#4320;&#4311;&#4317;&#4305;&#4304; &#4321;&#4323;&#4320;&#4321;, &#4312;&#4306;&#4312; Unicode-&#4312;&#4321; &#4308;&#4316;&#4304;&#4310;&#4308; &#4314;&#4304;&#4318;&#4304;&#4320;&#4304;&#4313;&#4317;&#4305;&#4321;")
    ''.join(string_builder)

I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-19: character maps to <undefined>
derekl

ASKER

The above was a bunch of Georgian Unicode characters before I submitted them to the server and they were escaped.
Your problem isn't with the code, but with printing the results.  Just like writing a Unicode string to a socket without encoding it first, writing a Unicode string to the terminal (e.g. using "print") makes Python encode it for you.  Python uses the terminal's encoding if it knows it and otherwise falls back to the default encoding (ascii); the 'charmap' in your error message suggests a code-page style encoding that has no mapping for those characters.  Try encoding it explicitly, e.g. as UTF-8, before printing it.
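One way to keep output from blowing up is to encode explicitly with an error handler before the text reaches the stream. The sketch below assumes Python 3; printable is a hypothetical helper, and cp1252 is used only as an example of a code page that has no Georgian letters.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # Use the stream's encoding if known, otherwise fall back to ASCII.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# U+10D0 is a Georgian letter; cp1252 cannot represent it, so it is escaped.
print(printable("Georgian: \u10d0", "cp1252"))
```

The escaped form loses nothing, since the original code point is still readable in the output, whereas errors="replace" would turn each unencodable character into a question mark.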
derekl

ASKER

I'm not trying to print it, simply join the two, but I'm assuming the same rule still applies.  Is it me or is Python's unicode handling hideous?  How do I guard against people potentially handing in normal strings to functions which expect unicode and vice versa?
ASKER CERTIFIED SOLUTION
RichieHindle

derekl

ASKER

I like the idea of type checking the strings at run time.  I'm going to add that to all of my methods which take strings.  Thanks again!
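The accepted answer's text isn't visible in this thread, so the following is only a guess at what a run-time check of this kind could look like. It assumes Python 3, where the check is against str (in Python 2 it would be against unicode); require_text and join_text are hypothetical names.

```python
def require_text(value, name="value"):
    """Raise TypeError unless value is a text (Unicode) string."""
    if not isinstance(value, str):
        raise TypeError(
            "%s must be text (str), got %s" % (name, type(value).__name__)
        )
    return value


def join_text(parts):
    """Join parts, rejecting anything that is not already decoded text."""
    return "".join(require_text(part, "part") for part in parts)


print(join_text(["str", " and more text"]))

try:
    join_text(["text", b"raw bytes"])
except TypeError as exc:
    print(exc)  # part must be text (str), got bytes
```

Checking at the function boundary means a byte string is rejected where it enters, with a message naming the argument, instead of failing later wherever the value happens to be encoded or printed.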