zeinth asked:

Python: lxml: encoding

================================= testxml.py ====================

from lxml import etree

somexmldata = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <res xmlns="http://www.abcxp.com">
            <jx xmlns="" xsi:type="typens:output">
                <fx xsi:type="name:Fields">
                    <FA xsi:type="name:ArrayOfField">
                        <Field xsi:type="name:Field">
                            <Name>machine1</Name>
                            <Type>xp</Type>
                            <Length>4</Length>
                            <foreignchar>3¿me Arrondissement</foreignchar>
                        </Field>
                        <Field xsi:type="name:Field">
                            <Name>IDFNDFIELD</Name>
                            <Type>win7</Type>
                            <Length>10</Length>
                            <foreignchar>20ème Arrondissement P</foreignchar>
                        </Field>
                    </FA>
                </fx>
            </jx>
        </res>
    </soap:Body>
</soap:Envelope> """


root = etree.fromstring(somexmldata)
print (etree.tostring.root)

=============================== Script testxml.py ==============

When I run the testxml.py script above, I get this error:
"ValueError: Unicode strings with encoding declaration are not supported."

How can I pass XML that is a Unicode string with an encoding declaration to the lxml XML parser?


Thanks!
pepr

Are you using Python 3 or Python 2?
Unless you have a strong reason to use lxml, you can use the built-in xml.etree.ElementTree. Fix your last command: it should be etree.tostring(root), not etree.tostring.root. You should also write the result to a file, because your console may not be able to display some characters:
#!python3
import xml.etree.ElementTree as ET

somexmldata = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <res xmlns="http://www.abcxp.com">
            <jx xmlns="" xsi:type="typens:output">
                <fx xsi:type="name:Fields">
                    <FA xsi:type="name:ArrayOfField">
                        <Field xsi:type="name:Field">
                            <Name>machine1</Name>
                            <Type>xp</Type>
                            <Length>4</Length>
                            <foreignchar>3¿me Arrondissement</foreignchar>
                        </Field>
                        <Field xsi:type="name:Field">
                            <Name>IDFNDFIELD</Name>
                            <Type>win7</Type>
                            <Length>10</Length>
                            <foreignchar>20ème Arrondissement P</foreignchar>
                        </Field>
                    </FA>
                </fx>
            </jx>
        </res>
    </soap:Body>
</soap:Envelope> """


root = ET.fromstring(somexmldata)

with open('output.xml', 'w', encoding='utf-8') as f:
    # encoding='unicode' makes tostring() return a Unicode string (str)
    f.write(ET.tostring(root, encoding='unicode'))

# encoding='ascii' makes tostring() return a stream of bytes (bytes type)
print(ET.tostring(root, encoding='ascii'))


I have lxml installed only for Python 2.7, so here is the same code for Python 2 and lxml:
#!python2
# -*- coding: utf-8 -*-
from lxml import etree

somexmldata = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <res xmlns="http://www.abcxp.com">
            <jx xmlns="" xsi:type="typens:output">
                <fx xsi:type="name:Fields">
                    <FA xsi:type="name:ArrayOfField">
                        <Field xsi:type="name:Field">
                            <Name>machine1</Name>
                            <Type>xp</Type>
                            <Length>4</Length>
                            <foreignchar>3¿me Arrondissement</foreignchar>
                        </Field>
                        <Field xsi:type="name:Field">
                            <Name>IDFNDFIELD</Name>
                            <Type>win7</Type>
                            <Length>10</Length>
                            <foreignchar>20ème Arrondissement P</foreignchar>
                        </Field>
                    </FA>
                </fx>
            </jx>
        </res>
    </soap:Body>
</soap:Envelope> """


root = etree.fromstring(somexmldata)

with open('output.xml', 'w') as f:
    f.write(etree.tostring(root, encoding='utf-8'))
    
print etree.tostring(root, encoding='ascii')


zeinth

ASKER

Sorry for the late reply, and thanks for the help, pepr. My machine actually has Python 3 and lxml installed, and I am looking for a solution that uses the lxml parser.

I tried to run your lxml code on my machine (which has Python 3), and I get this error:

============= Error message from machine ========
 root = etree.fromstring(somexmldata)
  File "lxml.etree.pyx", line 2969, in lxml.etree.fromstring (src\lxml\lxml.etree.c:61729)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91131)
ValueError: Unicode strings with encoding declaration are not supported.
===========================================

Given this error, can you suggest how to fix my code? Thanks!

pepr

As I cannot reproduce it exactly, I can only guess that you should remove the first line, <?xml version="1.0" encoding="utf-8"?>. This is the line that declares the encoding. The error makes sense for .fromstring(): you passed it a Unicode string, which is already decoded, so an encoding declaration has no meaning there.

The declaration does make sense when the XML content is stored in a file. In that case, call:
root = etree.parse("myfile.xml")


The parser then reads the file as bytes, so the encoding declaration inside it is meaningful.
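
Another option, not mentioned above, is to keep the declaration and give the parser bytes instead of a Unicode string. The sketch below shows both approaches on a shortened, made-up fragment of the data. The names xml_text, xml_bytes and without_decl are only illustrative, and the fallback to xml.etree.ElementTree is an assumption added so the sketch also runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the parser discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library so the sketch runs without lxml.
    import xml.etree.ElementTree as etree

# Shortened, illustrative fragment (not the full SOAP envelope from the question).
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<Field>
    <Name>IDFNDFIELD</Name>
    <foreignchar>20ème Arrondissement P</foreignchar>
</Field>"""

# Option 1: encode the string with the same encoding the declaration names,
# so the parser receives bytes and the declaration is meaningful again.
xml_bytes = xml_text.encode("utf-8")
root = etree.fromstring(xml_bytes)
foreign = root.find("foreignchar").text
print(foreign)

# Option 2: strip the XML declaration and parse the Unicode string directly.
without_decl = re.sub(r"^\s*<\?xml[^>]*\?>\s*", "", xml_text, count=1)
root2 = etree.fromstring(without_decl)
print(root2.find("Name").text)
```

Option 1 changes the input less, because the document stays exactly as it was received. Option 2 matches the suggestion to remove the first line.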
zeinth

ASKER

I have now tried this code:
==========================
f = open("somexmldata.xml", "w")
f.write(somexmldata)
f.close()
tree = etree.parse("somexmldata.xml")


The code above now gives me this error:
========================================
Traceback (most recent call last):
  File "C:\test2.py", line 38, in <module>
    f.write(somexmldata)
  File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 709: character maps to <undefined>


I think I need to write the XML file to disk in UTF-8 encoding:
f.write(somexmldata)  # how do I write the XML file to disk in UTF-8 encoding?

If that works, I think root = etree.parse("myfile.xml") will work in Python 3.


Thanks!
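
The traceback names cp1252 because, in Python 3, open() in text mode uses the platform's default encoding when no encoding argument is given, and cp1252 has no mapping for '\ufffd'. A minimal sketch of the difference follows. The sample string and the temporary directory are assumptions made for illustration, not the asker's actual data or paths.

```python
import os
import tempfile

# Illustrative string containing U+FFFD, the character named in the traceback.
text = "3\ufffdme Arrondissement"

# cp1252 has no mapping for U+FFFD, so encoding fails, as in the traceback.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing encoding='utf-8' to open() makes the write independent of the
# platform default; UTF-8 can encode every Unicode character.
path = os.path.join(tempfile.mkdtemp(), "somexmldata.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back as raw bytes to see what was stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(cp1252_ok)
print(raw)
```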
ASKER CERTIFIED SOLUTION
pepr

zeinth

ASKER

Wonderful, now the code is working, thanks!

============================================
with open('somexmldata.xml', 'w', encoding='utf-8') as f:
     f.write(somexmldata)
     f.close()
     
tree = etree.parse("somexmldata.xml")
print (tree)
===================================

A side note: I suggest to get used to the with construct...
>> I remember this suggestion .....


Thanks!

pepr

When you use the with construct, f.close() is called automatically when the block ends, so the explicit f.close() inside the block is redundant. This is the reason the construct was introduced (not only for files; it is generally used for objects of classes that implement that kind of finalisation).

You can still use the open/close pair of functions (without with), but that is more error-prone.
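
What the with construct guarantees can be sketched like this. The file name and the temporary directory are illustrative assumptions. The second half shows a roughly equivalent try/finally form, which is what you would have to write by hand to get the same safety without with.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.xml")

# With the construct: the file is closed when the block ends,
# even if an exception is raised inside it.
with open(path, "w", encoding="utf-8") as f:
    f.write("<root/>")
    closed_inside = f.closed   # still open inside the block
closed_after = f.closed        # closed automatically after the block

# Without the construct: close() must be called explicitly, and try/finally
# is needed so that it also happens when an error occurs.
g = open(path, "r", encoding="utf-8")
try:
    content = g.read()
finally:
    g.close()

print(closed_inside, closed_after, content)
```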