Link to home
Create AccountLog in
Avatar of tradinfo
tradinfo

asked on

What is the most efficient memory-size algorithm for building and manipulating an XML-like data structure in memory ?

I have to read an XML file on the fly, forward only, and create a data structure as small and efficient as possible in memory.
Example:
<mydevice>
    <light AccessMode="ReadWrite" Address="0x1234" Color="red">
            off             <!-- Default value -->
    </light>
    <sensor AccessMode="ReadOnly" Address="0x1234" ByteToRead="8">
            12             <!-- Default value -->
    </sensor>
    <motor AccessMode="ReadWrite" Address="0x34">
            15             <!-- Default value -->
           <light AccessMode="ReadOnly" Address="0x45">
            off             <!-- Default value -->
           </light>
    </motor>
</mydevice>

My first thoughts was that an XML file can be represented as a tree, and as I am constraint to go forward only, .

      mydevice
     /       |      \|
light  motor   sensor
                        /
                    light

Many problems here.
- My tree is not binary, but is a n-ary tree. Each node can have many sons => many pointers to keep.
- I am forced to parse my XML file and build an associated data structure with a depth-first strategy.
- I must minimize the size of the data (tree) structure as smallest as possible (memory constrained).
- I must keep the "XML tree structure" to know where I am when parsing my structure, in order to add XPath capabilities to parse it.

My question is not on how can I optimize the content of each node, and create non-redundant data in memory, like having only one time "AccessMode" or "light" data in memory as they are present several times in the XML file. I already solved this problem with specific arrays, 'dictionaries", and pointers (array index) in the node.


The problem is "how can I built an optimal data structure to keep the structure as smallest as possible, and as efficient as possible, as I have to parse my structure, and insert, or remove nodes (one or more)  at runtime, and I can't to recreate my structure in memory to do such simple things...

What is the most efficient way (more than just a n-ary tree), to create a memory efficient and easy data manipulation data structure for such a problem ?

Algorithm of creation, parsing, addition/deletion are welcome. Or ANSI C source code too !

Best regards,
David.?
Avatar of Infinity08
Infinity08
Flag of Belgium image

>> - My tree is not binary, but is a n-ary tree. Each node can have many sons => many pointers to keep.

Just as many as there are children for the current node ... So, if your XML spec lists 3 sub-tags for a given tag, then your tree will list 3 children for the node that corresponds to that tag.


>> - I am forced to parse my XML file and build an associated data structure with a depth-first strategy.

Why ? You can parse you XML file from beginning to end, and insert the data in the tree as you read it from the XML. No depth-first is needed. It looks like it slightly, but that shouldn't be a problem.


>> - I must minimize the size of the data (tree) structure as smallest as possible (memory constrained).

Then only use pointers wherever flexibility is required (optional fields etc.).
If the entire XML is "fixed" (there are no optional fields, or repeated fields), then you don't even need a tree to store the data in - you can simply store everything in a flat data structure. All depends on how the XML is defined. A tree just makes it easier to parse or create the XML to/from the data, but if memory limitations are more important, then in certain cases, you can go with a solution that doesn't use a tree.

Also choose data types that minimally can contain the values needed. Ie. if you have a value that can be 0 to 100, use an unsigned char (1 byte) - do not use an int.


>> - I must keep the "XML tree structure" to know where I am when parsing my structure, in order to add XPath capabilities to parse it.

I'm not sure what you mean here ... If you have an XPath, you can simply start at the root of the tree, and traverse the tree according to the XPath, until you arrive at the data the XPath refers to.


>> "how can I built an optimal data structure to keep the structure as smallest as possible, and as efficient as possible, as I have to parse my structure, and insert, or remove nodes (one or more)  at runtime, and I can't to recreate my structure in memory to do such simple things...

The standard solution as you know is to use a tree. That approach allows you to easily remove/modify specific fields/values without any problem. It also gives you a very straight-forward encoding/decoding approach (data to XML and vice versa).

However, if you are dealing with extreme memory requirements, then, as I said earlier, you might be better off using a flat data structure if your XML spec allows that. It will make the processing of the data slightly more complicated, but could potentially decrease the memory usage by 50% or more.


I often use a mix of the tree approach and the flat data structure approach like this (using your XML as example, assuming that the sensor is optional) :

    typedef enum {
        AM_ReadWrite,
        AM_ReadOnly
    } AccessMode;

    /* definitions for Color and State */

    typedef struct Light {
        AccessMode accessMode;
        void *address;
        Color color;
        State state;
    } Light;

    /* definitions for Sensor and Motor */

    typedef struct Device {
        Light light;
        Sensor *sensor;    /* OPTIONAL */
        Motor motor;
    } Device;


When creating the XML from the data, you'd use several functions like this one for the Device struct :

    void enc_Device(XMLstring *buf, Device *device) {
        appendToXML(buf, "<mydevice>");
        enc_Light(buf, &(device->light));
        if (device->sensor) {    /* OPTIONAL */
            enc_Sensor(buf, device->sensor);
        }
        enc_Motor(buf, &(device->motor));
        appendToXML(buf, "</mydevice>");
    }

with XMLstring your own datatype that has an append function that takes care of memory allocation.
Avatar of tradinfo
tradinfo

ASKER

Well, I just missed to tell you some important points...

1. I don't know the XML file I'll have to read. Neither the structure of the XML file, nor its content is known. So I can't predict if my tree will have at most 3 sons or more, I must be very flexible. And I can't predefine the content of my node. So I can't predefine enums or Light structs... I can possibly read any XML file.

2. I am forced to read my XML file forward only, because I am forced to use a StAX parser on a stream.
No choice, forward only...

3. I am searching a way to minimise the size of my tree structure, as I must keep the level/nodes informations.

At this time, my (non optimal) solution is the following:

1. Creating my XML tree structure via a linked list, on the fly. Size Problem: Each node must embed all pointers to its sons.

2. I have 2 other linked lists, to memorize only once all "attributes", all "XML markers" found in the XML file read.

3. My structure node embeds a list of pointers to attributes and data value associated.

I hoped to find a way to reduce the size of the linked list for my n-ary tree, may be with a trick not to store all sons of a node to each node ? Or with some other optimizations, or with another data structure without linked lists...


>> I don't know the XML file I'll have to read.

Heh, that changes everything indeed :)


>> I hoped to find a way to reduce the size of the linked list for my n-ary tree

A linked list is not at all memory-optimal, so I'd get rid of the linked lists, and use a dynamic array instead - so, you just have the overhead of storing one pointer and one integer (for the length of the array).


>> may be with a trick not to store all sons of a node to each node ?

I don't see how this would save memory. You'd have to store the relations between the nodes in a different way, and that will have to use less than the size of a pointer. Tricky.


>> Or with some other optimizations, or with another data structure without linked lists...

That what I would do - something like :

    typedef struct Node {
        int attrs_len;              /* <--- number of attributes for this node */
        Attribute *attrs;         /* <--- dynamic array of the attrs_len attributes */
        int children_len;         /* <--- number of child nodes for this node */
        Node *children;         /* <--- dynamic array of the children_len child nodes */
        /* any other data, like the value, name of the node, etc. */
    } Node;

    typedef struct Attribute {
        char *name;
        char *value;
    } Attribute;

You can dynamically add/remove/modify any attribute or node, without too much memory overhead.

Also take a look at this free XML library :

        http://xmlsoft.org/

It is very complete, and does exactly what you want to do. If you don't use it, you can probably get some inspiration from it by looking at how they did it.
ASKER CERTIFIED SOLUTION
Avatar of grg99
grg99

Link to home
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
See answer
tradinfo, do you have a good reason to want to decrease the memory usage in such an extreme way ? (especially when using a scheme like grg99's)
Infinity08: => I have a very good reason to decrease my memory usage as smallest as possible: I have to embed my Ansi C structure in a micro device which has only 96Kb RAM available.
Every byte used counts !

Choosing to create a tree when having so few memory is a choice I can assume if I can win on the two sides : memory size efficiency (no data duplication in memory, smallest use of pointers...) and usage efficiency (parsing, adding/deleting a node easily, ...).

Hard to achieve, but interesting to work on it ! Help and Ansi C tricks welcome !

Then do look into grg99's approach to pointers. The only problem (as he also mentioned) is that the computing overhead can become quite big. So, if you also need to keep the computing time as small as possible, this might be a problem.

And you might consume more memory for the code part of your executable than you won for the data part :) Then again, you might not, but be aware that it's a possibility.

To avoid data duplication, you can make use of lookup tables in the same way. For example, if a certain string occurs more than once in the XML (for example the "ReadWrite" or "Address" from your example), then you can save (a lot of) memory by putting that string in a lookup table once, and use its index (see grg99's post on how to do indexing optimally) to retrieve it when needed.


I'm sure you realize that these are very extreme measures to save memory, and (what follows is just an estimate for one possible situation) for a 1kB XML string, it might bring your memory usage down from 1.5kB to around 1kB, or even less if there's a lot of duplicate data in the XML.

Usually, memory usage for XML data is temporary - ie. it remains in memory only for the duration of the processing of the XML. So, even on low-memory environments, it might still be acceptable to have a temporary memory "peak".
In one particular real-world application, applying those tricks cut down memory usage to 0.4% of the original size.  Saved the company $millions.  Your mileage may vary.

If data size is crucial and code speed is completely irrelevant, just don't build a meta-structure at all, hold only the xml in memory and write algorithms to walk back and forth.

If this is too slow, improve the back/forth searching speed by holding an array of start/end positions for the tags.

If this array gets too big, use a fixed-length array and hold a usage count for discarding.

something like:
__________1111111111222222222233333333334444444444
01234567890123456789012345678901234567890123456789
<tag1><tag2>Hello</tag2><tag2>Goodbye</tag2></tag1>

Table:
0,42
6,16
22,35

Paul
What do you want the data structure to represent?
If you can know what nodes you want to insert or delete on the fly, reading forward only,
then you may be able to write out the data as you read it and only need to keep track of  the path to the current node, which you may be able to keep on the call stack without need of building a new data structure