Algorithm to reduce a list of hierarchically-named items

jdpipe
jdpipe used Ask the Experts™
on
Hi all,

I think this is a pretty interesting algorithm problem so I'll post a link under 'programming' as well. But I'm hoping for a python solution, so it's going to live here.

I have a list of items like this, *as strings*.

C.x
C.p
C.T
C.delta[1]
C.delta[2]
C.delta[3]
C.sat.x
C.sat.p
C.h
C.delta[5]

and I would like to convert that into something more readable, like

C: x, p, T, delta[1-3,5], h
C.sat: x, p

* values in brackets can be numbers, in which case they can be assumed integers, and sequences of integers should be condensed like [1,2,3,5] to [1-3,5]
* values in brackets can also be 'symbols', in which case they are just listed out, eg [symbol1, symbol2, symbol3].
* brackets can act as a suffix for any node, not just the final one
* nodes are either array member nodes like x in "A[1]", or symbol nodes, like "A[symbol1]" or named nodes, like "A.x".
* the algorithm should zipper together as much as possible, as shown

Optional: it would be cool if

A.x[1].p
A.x[2].p

could be output as "A: x[1-2].p". Obviously this makes the problem quite a bit harder.

I thought that it might be possible to tackle this using python's list.reduce(func) method. But I haven't had any luck yet. The only way this is going to be solvable seems to be to re-construct the 'tree' then reduce it somehow with a top-down approach of some sort.

Cheers
JP



Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
I can offer a suggestion as to how you'd do this... first and foremost, you'll want to loop entirely through each string and create an entry for it... I'd suggest having a switch statement that catches any numerical value and sets it aside into that category.. then letter values and sets them aside in that category... then catch the open bracket [ and anything up until the close bracket ] and repeat the process for each term, and even separate the terms by . so for example have a data structure with a sub1,2,3,4,5,etc and bkt1,2,3,4,5,etc and num1,2,3,4,5,etc.
We could take this:
C.x[12].pqzed.o
and as it looped through, it would be stored as follows:
C -> it's a letter, store it in sub1 then check next value
. -> it's a dot, so we'll take it as a separation
x -> it's a letter, so store it in sub2 and check next value
[ -> ooh it's a bracket, so we store not only the letter before it but also the element(s) inside it as well as the number itself in
] -> end the bracket
. -> next term
p -> letter, check next...
q -> letter, check next...
z -> letter, check next...
e -> letter, check next...
d -> letter, check next...
. -> terminate the last string and start new...
o -> letter, end of term...
Archive the term and start fresh for the next one...
Once you've achieved this, you'll want to do data comparisons...
Check each of the sub-terms to see if they're the same or at least similar, differing only by numbers... and if so, order the numbers...
As for the 1-7, that's a simple counting algorithm... compare the numerical values, and increment to see if they're in sequence... if so, continue until that sequence is broken, and once it's broken you have the starting term and the term that breaks the cycle(minus one) and just separate them with a dash... then, continue for the rest of them... I could probably write this up in a day or two although I write in C/C++ more than Python...

I hope this helps.
-aNthrax

Author

Commented:
Hi anthrax,

You've explained the issue of string parsing the lines here but that's really not the part I have a problem with. I can do this with simple regexps or DParser or something. The tough part is the "data comparisons" part, and that is the only part I am struggling to conceptualise properly. I really think that one needs to construct a tree and then traverse it in a clever way. But this is the part that I'm looking for input on.

Cheers
JP
Well, as a native C++ programmer, I use strstr() or strcmp() for string comparisons... I'm not sure as to how to do it through Python... I'll do a little research and get back to you some time this weekend.
Exploring SharePoint 2016

Explore SharePoint 2016, the web-based, collaborative platform that integrates with Microsoft Office to provide intranets, secure document management, and collaboration so you can develop your online and offline capabilities.

Author

Commented:
Cool, thanks for this. Hope you find it interesting, too.
Commented:
from itertools import groupby
from operator import itemgetter

def fold(data):
    """ fold sorted numeric sequence data into ranged representation:
    >>> fold([1,  4,5,6, 10, 15,16,17,18, 22, 25,26,27,28])
    '[1,4-6,10,15-18,22,25-28]'
    """
    folded = []
    for k, g in groupby(enumerate(data), lambda (i,x):i-x):
        seq = map(itemgetter(1), g)
        if len(seq) > 1:
            x = '%s-%s' % (seq[0], seq[-1])
        else:
            x = str(seq[0])
        folded.append(x)
    return folded and '[%s]' % ','.join(folded) or ''

def reduce(names):
    """reduce a list of items nto something more readable:
    >>> data = 'C.x C.p C.T C.delta[1] C.delta[2] C.delta[3] C.sat.x C.sat.p C.h C.delta[5]'.split()
    >>> res = reduce(data)
    >>> for k in sorted(res):
    ...   print '%s: %s' % (k, res[k])
    C: T, delta[1-3,5], h, p, x
    C.sat: p, x
    """
    data = sorted([n.split('.') for n in names], key=len)
    res = {}
    for k, g in groupby(data, lambda x: len(x)):
        item = g.next()
        assert len(item) == k
        key = '.'.join(item[:-1])
        indexed = {}
        seq = set(get(indexed, item))
        for item in g:
            seq.add(get(indexed, item))
        res[key] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
    return res

def get(indexed, item):
    item = item[-1]
    if item.endswith(']'):
        item, idx = item[:-1].split('[')
        indexed.setdefault(item, []).append(int(idx))
    return item

Enjoy!

Author

Commented:
Hi Mish33, thanks very much, a very nicely-written solution!

Author

Commented:
Hi Mish,

Have just been trying your suggestion out now. I have a minor problem though, mostly due to my example not being general enough. In some cases, my list will look like this:
Sl.phirdelta_r2[49]
Sl.phirdelta_r2[50]
Sl.phirdelta_r2[51]
Sl.dPSI_ddelta[56]
Sl.DELTA[56]
Sl.tau
rhof
rhol
T
Sl.delta
Sl.d1

As you can see, there are multiple root nodes in this list, eg 'rhof', 'rhol', 'T'. It seems to be a possibility that isn't handled by your code, and the code ends up splitting strings into characters in this case. Any suggestions for a quick fix on this?

Cheers
JP

Commented:
Sure. Change two lines inside reduce for loop to:

        key = '.'.join(item[:-1]) or ''
        seq = set([get(indexed, item)])

Global names will be grouped under ''. Put whatever name you want (eg. 'GLOBAL') instead of ''.

Author

Commented:
Cool, that works well, thanks

JP

Author

Commented:
Hmm seems that there is still a bug actually. Tried it with the following:

S2.d1
S2.rho
S2.t1
S2.phi0tau
S2.phirtau_r2[8]
S2.phirtau_r2[9]
S2.phirtau_r2[10]
S2.phirtau_r2[11]
S2.phirtau_r2[12]
S2.phirtau_r2[13]
S2.phirtau_r2[14]
S2.phirtau_r2[15]
S2.phirtau_r2[16]
S2.phirtau_r2[17]
S2.phirtau_r2[18]
S2.phirtau_r2[19]
S2.phirtau_r2[20]
S2.phirtau_r2[21]
S2.phirtau_r2[22]
S2.phirtau_r2[23]
S2.phirtau_r2[24]
S2.phirtau_r2[25]
S2.phirtau_r2[26]
S2.phirtau_r2[27]
S2.phirtau_r2[28]
S2.phirtau_r2[29]
S2.phirtau_r2[30]
S2.phirtau_r2[31]
S2.phirtau_r2[32]
S2.phirtau_r2[33]
S2.phirtau_r2[34]
S2.phirtau_r2[35]
S2.phirtau_r2[36]
S2.phirtau_r2[37]
S2.phirtau_r2[38]
S2.phirtau_r2[39]
S2.phirtau_r2[40]
S2.phirtau_r2[41]
S2.phirtau_r2[42]
S2.phirtau_r2[43]
S2.phirtau_r2[44]
S2.phirtau_r2[45]
S2.phirtau_r2[46]
S2.phirtau_r2[47]
S2.phirtau_r2[48]
S2.phirtau_r2[49]
S2.phirtau_r2[50]
S2.phirtau_r2[51]
S2.phirdelta_r2[9]
S2.phirdelta_r2[10]
S2.phirdelta_r2[11]
S2.phirdelta_r2[12]
S2.phirdelta_r2[13]
S2.phirdelta_r2[14]
S2.phirdelta_r2[15]
S2.phirdelta_r2[16]
S2.phirdelta_r2[17]
S2.phirdelta_r2[18]
S2.phirdelta_r2[19]
S2.phirdelta_r2[20]
S2.phirdelta_r2[21]
S2.phirdelta_r2[22]
S2.phirdelta_r2[23]
S2.phirdelta_r2[24]
S2.phirdelta_r2[25]
S2.phirdelta_r2[26]
S2.phirdelta_r2[27]
S2.phirdelta_r2[28]
S2.phirdelta_r2[29]
S2.phirdelta_r2[30]
S2.phirdelta_r2[31]
S2.phirdelta_r2[32]
S2.phirdelta_r2[33]
S2.phirdelta_r2[34]
S2.phirdelta_r2[35]
S2.phirdelta_r2[36]
S2.phirdelta_r2[37]
S2.phirdelta_r2[38]
S2.phirdelta_r2[39]
S2.phirdelta_r2[40]
S2.phirdelta_r2[41]
S2.phirdelta_r2[42]
S2.phirdelta_r2[43]
S2.phirdelta_r2[44]
S2.phirdelta_r2[45]
S2.phirdelta_r2[46]
S2.phirdelta_r2[47]
S2.phirdelta_r2[48]
S2.phirdelta_r2[49]
S2.phirdelta_r2[50]
S2.phirdelta_r2[51]
S2.r3_b1[52]
S2.r3_b1[53]
S2.r3_b1[54]
S2.PSI[55]
S2.dPSI_dtau[55]
S2.dPSI_ddelta[55]
S2.theta[55]
S2.dDELTA_ddelta[55]
S2.DELTA[55]
S2.dDELTAbi_ddelta[55]
S2.dDELTAbi_dtau[55]
S2.theta[56]
S2.dDELTA_ddelta[56]
S2.DELTA[56]
S2.dDELTAbi_ddelta[56]
S2.dDELTAbi_dtau[56]
S2.PSI[56]
S2.dPSI_dtau[56]
S2.dPSI_ddelta[56]
T_ho
C.DT_2
S2.phirdelta
C.LMTD
q
C.h_h2
S2.phirdelta_r2[8]
S2.phirtau
S2.tau
S2.delta

and I got the entry 'LMTD' showing up under 'S2'. Any further thoughts?

Commented:
Sorry about that. Replace fold() and reduce() with

def fold(data):
    """ fold sorted numeric sequence data into ranged representation:
    >>> fold([1,  4,5,6, 10, 15,16,17,18, 22, 25,26,27,28])
    '[1,4-6,10,15-18,22,25-28]'
    """
    folded = []
    for k, g in groupby(enumerate(sorted(data)), lambda (i,x):i-x):
        seq = map(itemgetter(1), g)
        if len(seq) > 1:
            x = '%s-%s' % (seq[0], seq[-1])
        else:
            x = str(seq[0])
        folded.append(x)
    return folded and '[%s]' % ','.join(folded) or ''

def reduce(names):
    """reduce a list of items nto something more readable:
    >>> data = 'C.x C.p C.T C.delta[1] C.delta[2] C.delta[3] C.sat.x C.sat.p C.h C.delta[5]'.split()
    >>> res = reduce(data)
    >>> for k in sorted(res):
    ...   print '%s: %s' % (k, res[k])
    C: T, delta[1-3,5], h, p, x
    C.sat: p, x
    """
    data = sorted([n.split('.') for n in sorted(names)], key=len)
    res = {}
    for k, g in groupby(data, lambda x: len(x)):
        if k == 1:
            indexed = {}
            seq = set(get(indexed, item) for item in g)
            res['[global]'] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
        else:
            for key, g1 in groupby(g, lambda x: '.'.join(x[:-1])):
                indexed = {}
                seq = set(get(indexed, item) for item in g1)
                res[key] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
    return res

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial