Algorithm to reduce a list of hierarchically-named items

Hi all,

I think this is a pretty interesting algorithm problem so I'll post a link under 'programming' as well. But I'm hoping for a python solution, so it's going to live here.

I have a list of items like this, *as strings*.

C.x
C.p
C.T
C.delta[1]
C.delta[2]
C.delta[3]
C.sat.x
C.sat.p
C.h
C.delta[5]

and I would like to convert that into something more readable, like

C: x, p, T, delta[1-3,5], h
C.sat: x, p

* values in brackets can be numbers, in which case they can be assumed integers, and sequences of integers should be condensed like [1,2,3,5] to [1-3,5]
* values in brackets can also be 'symbols', in which case they are just listed out, eg [symbol1, symbol2, symbol3].
* brackets can act as a suffix for any node, not just the final one
* nodes are either array member nodes like x in "A[1]", or symbol nodes, like "A[symbol1]" or named nodes, like "A.x".
* the algorithm should zipper together as much as possible, as shown

Optional: it would be cool if

A.x[1].p
A.x[2].p

could be output as "A: x[1-2].p". Obviously this makes the problem quite a bit harder.

I thought that it might be possible to tackle this using python's list.reduce(func) method. But I haven't had any luck yet. The only way this is going to be solvable seems to be to re-construct the 'tree' then reduce it somehow with a top-down approach of some sort.

Cheers
JP



LVL 7
jdpipeAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

anthrax759Commented:
I can offer a suggestion as to how you'd do this... first and foremost, you'll want to loop entirely through each string and create an entry for it... I'd suggest having a switch statement that catches any numerical value and sets it aside into that category.. then letter values and sets them aside in that category... then catch the open bracket [ and anything up until the close bracket ] and repeat the process for each term, and even separate the terms by . so for example have a data structure with a sub1,2,3,4,5,etc and bkt1,2,3,4,5,etc and num1,2,3,4,5,etc.
We could take this:
C.x[12].pqzed.o
and as it looped through, it would be stored as follows:
C -> it's a letter, store it in sub1 then check next value
. -> it's a dot, so we'll take it as a separation
x -> it's a letter, so store it in sub2 and check next value
[ -> ooh it's a bracket, so we store not only the letter before it but also the element(s) inside it as well as the number itself in
] -> end the bracket
. -> next term
p -> letter, check next...
q -> letter, check next...
z -> letter, check next...
e -> letter, check next...
d -> letter, check next...
. -> terminate the last string and start new...
o -> letter, end of term...
Archive the term and start fresh for the next one...
Once you've achieved this, you'll want to do data comparisons...
Check each of the sub-terms to see if they're the same or at least similar, differing only by numbers... and if so, order the numbers...
As for the 1-7, that's a simple counting algorithm... compare the numerical values, and increment to see if they're in sequence... if so, continue until that sequence is broken, and once it's broken you have the starting term and the term that breaks the cycle(minus one) and just separate them with a dash... then, continue for the rest of them... I could probably write this up in a day or two although I write in C/C++ more than Python...

I hope this helps.
-aNthrax
jdpipeAuthor Commented:
Hi anthrax,

You've explained the issue of string parsing the lines here but that's really not the part I have a problem with. I can do this with simple regexps or DParser or something. The tough part is the "data comparisons" part, and that is the only part I am struggling to conceptualise properly. I really think that one needs to construct a tree and then traverse it in a clever way. But this is the part that I'm looking for input on.

Cheers
JP
anthrax759Commented:
Well, as a native C++ programmer, I use strstr() or strcmp() for string comparisons... I'm not sure as to how to do it through Python... I'll do a little research and get back to you some time this weekend.
Starting with Angular 5

Learn the essential features and functions of the popular JavaScript framework for building mobile, desktop and web applications.

jdpipeAuthor Commented:
Cool, thanks for this. Hope you find it interesting, too.
mish33Commented:
from itertools import groupby
from operator import itemgetter

def fold(data):
    """ fold sorted numeric sequence data into ranged representation:
    >>> fold([1,  4,5,6, 10, 15,16,17,18, 22, 25,26,27,28])
    '[1,4-6,10,15-18,22,25-28]'
    """
    folded = []
    for k, g in groupby(enumerate(data), lambda (i,x):i-x):
        seq = map(itemgetter(1), g)
        if len(seq) > 1:
            x = '%s-%s' % (seq[0], seq[-1])
        else:
            x = str(seq[0])
        folded.append(x)
    return folded and '[%s]' % ','.join(folded) or ''

def reduce(names):
    """reduce a list of items nto something more readable:
    >>> data = 'C.x C.p C.T C.delta[1] C.delta[2] C.delta[3] C.sat.x C.sat.p C.h C.delta[5]'.split()
    >>> res = reduce(data)
    >>> for k in sorted(res):
    ...   print '%s: %s' % (k, res[k])
    C: T, delta[1-3,5], h, p, x
    C.sat: p, x
    """
    data = sorted([n.split('.') for n in names], key=len)
    res = {}
    for k, g in groupby(data, lambda x: len(x)):
        item = g.next()
        assert len(item) == k
        key = '.'.join(item[:-1])
        indexed = {}
        seq = set(get(indexed, item))
        for item in g:
            seq.add(get(indexed, item))
        res[key] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
    return res

def get(indexed, item):
    item = item[-1]
    if item.endswith(']'):
        item, idx = item[:-1].split('[')
        indexed.setdefault(item, []).append(int(idx))
    return item

Enjoy!

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
jdpipeAuthor Commented:
Hi Mish33, thanks very much, a very nicely-written solution!
jdpipeAuthor Commented:
Hi Mish,

Have just been trying your suggestion out now. I have a minor problem though, mostly due to my example not being general enough. In some cases, my list will look like this:
Sl.phirdelta_r2[49]
Sl.phirdelta_r2[50]
Sl.phirdelta_r2[51]
Sl.dPSI_ddelta[56]
Sl.DELTA[56]
Sl.tau
rhof
rhol
T
Sl.delta
Sl.d1

As you can see, there are multiple root nodes in this list, eg 'rhof', 'rhol', 'T'. It seems to be a possibility that isn't handled by your code, and the code ends up splitting strings into characters in this case. Any suggestions for a quick fix on this?

Cheers
JP
mish33Commented:
Sure. Change two lines inside reduce for loop to:

        key = '.'.join(item[:-1]) or ''
        seq = set([get(indexed, item)])

Global names will be grouped under ''. Put whatever name you want (eg. 'GLOBAL') instead of ''.
jdpipeAuthor Commented:
Cool, that works well, thanks

JP
jdpipeAuthor Commented:
Hmm seems that there is still a bug actually. Tried it with the following:

S2.d1
S2.rho
S2.t1
S2.phi0tau
S2.phirtau_r2[8]
S2.phirtau_r2[9]
S2.phirtau_r2[10]
S2.phirtau_r2[11]
S2.phirtau_r2[12]
S2.phirtau_r2[13]
S2.phirtau_r2[14]
S2.phirtau_r2[15]
S2.phirtau_r2[16]
S2.phirtau_r2[17]
S2.phirtau_r2[18]
S2.phirtau_r2[19]
S2.phirtau_r2[20]
S2.phirtau_r2[21]
S2.phirtau_r2[22]
S2.phirtau_r2[23]
S2.phirtau_r2[24]
S2.phirtau_r2[25]
S2.phirtau_r2[26]
S2.phirtau_r2[27]
S2.phirtau_r2[28]
S2.phirtau_r2[29]
S2.phirtau_r2[30]
S2.phirtau_r2[31]
S2.phirtau_r2[32]
S2.phirtau_r2[33]
S2.phirtau_r2[34]
S2.phirtau_r2[35]
S2.phirtau_r2[36]
S2.phirtau_r2[37]
S2.phirtau_r2[38]
S2.phirtau_r2[39]
S2.phirtau_r2[40]
S2.phirtau_r2[41]
S2.phirtau_r2[42]
S2.phirtau_r2[43]
S2.phirtau_r2[44]
S2.phirtau_r2[45]
S2.phirtau_r2[46]
S2.phirtau_r2[47]
S2.phirtau_r2[48]
S2.phirtau_r2[49]
S2.phirtau_r2[50]
S2.phirtau_r2[51]
S2.phirdelta_r2[9]
S2.phirdelta_r2[10]
S2.phirdelta_r2[11]
S2.phirdelta_r2[12]
S2.phirdelta_r2[13]
S2.phirdelta_r2[14]
S2.phirdelta_r2[15]
S2.phirdelta_r2[16]
S2.phirdelta_r2[17]
S2.phirdelta_r2[18]
S2.phirdelta_r2[19]
S2.phirdelta_r2[20]
S2.phirdelta_r2[21]
S2.phirdelta_r2[22]
S2.phirdelta_r2[23]
S2.phirdelta_r2[24]
S2.phirdelta_r2[25]
S2.phirdelta_r2[26]
S2.phirdelta_r2[27]
S2.phirdelta_r2[28]
S2.phirdelta_r2[29]
S2.phirdelta_r2[30]
S2.phirdelta_r2[31]
S2.phirdelta_r2[32]
S2.phirdelta_r2[33]
S2.phirdelta_r2[34]
S2.phirdelta_r2[35]
S2.phirdelta_r2[36]
S2.phirdelta_r2[37]
S2.phirdelta_r2[38]
S2.phirdelta_r2[39]
S2.phirdelta_r2[40]
S2.phirdelta_r2[41]
S2.phirdelta_r2[42]
S2.phirdelta_r2[43]
S2.phirdelta_r2[44]
S2.phirdelta_r2[45]
S2.phirdelta_r2[46]
S2.phirdelta_r2[47]
S2.phirdelta_r2[48]
S2.phirdelta_r2[49]
S2.phirdelta_r2[50]
S2.phirdelta_r2[51]
S2.r3_b1[52]
S2.r3_b1[53]
S2.r3_b1[54]
S2.PSI[55]
S2.dPSI_dtau[55]
S2.dPSI_ddelta[55]
S2.theta[55]
S2.dDELTA_ddelta[55]
S2.DELTA[55]
S2.dDELTAbi_ddelta[55]
S2.dDELTAbi_dtau[55]
S2.theta[56]
S2.dDELTA_ddelta[56]
S2.DELTA[56]
S2.dDELTAbi_ddelta[56]
S2.dDELTAbi_dtau[56]
S2.PSI[56]
S2.dPSI_dtau[56]
S2.dPSI_ddelta[56]
T_ho
C.DT_2
S2.phirdelta
C.LMTD
q
C.h_h2
S2.phirdelta_r2[8]
S2.phirtau
S2.tau
S2.delta

and I got the entry 'LMTD' showing up under 'S2'. Any further thoughts?

mish33Commented:
Sorry about that. Replace fold() and reduce() with

def fold(data):
    """ fold sorted numeric sequence data into ranged representation:
    >>> fold([1,  4,5,6, 10, 15,16,17,18, 22, 25,26,27,28])
    '[1,4-6,10,15-18,22,25-28]'
    """
    folded = []
    for k, g in groupby(enumerate(sorted(data)), lambda (i,x):i-x):
        seq = map(itemgetter(1), g)
        if len(seq) > 1:
            x = '%s-%s' % (seq[0], seq[-1])
        else:
            x = str(seq[0])
        folded.append(x)
    return folded and '[%s]' % ','.join(folded) or ''

def reduce(names):
    """reduce a list of items nto something more readable:
    >>> data = 'C.x C.p C.T C.delta[1] C.delta[2] C.delta[3] C.sat.x C.sat.p C.h C.delta[5]'.split()
    >>> res = reduce(data)
    >>> for k in sorted(res):
    ...   print '%s: %s' % (k, res[k])
    C: T, delta[1-3,5], h, p, x
    C.sat: p, x
    """
    data = sorted([n.split('.') for n in sorted(names)], key=len)
    res = {}
    for k, g in groupby(data, lambda x: len(x)):
        if k == 1:
            indexed = {}
            seq = set(get(indexed, item) for item in g)
            res['[global]'] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
        else:
            for key, g1 in groupby(g, lambda x: '.'.join(x[:-1])):
                indexed = {}
                seq = set(get(indexed, item) for item in g1)
                res[key] = ', '.join(i+fold(indexed.get(i, [])) for i in sorted(seq))
    return res
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Python

From novice to tech pro — start learning today.