
How to identify duplicates in a list and update?

Hello all,

I'm trying to figure out how to identify duplicate entries in a list and rename them appropriately so that I have unique names across the list (all while maintaining the same order). I found a wonderful reference that nearly gets me there:

http://stackoverflow.com/questions/17202444/python-how-to-find-duplicates-in-a-list-and-update-these-duplicate-items-by-re

The actual code looks like this:

from collections import Counter
from string import ascii_uppercase as letters

def gen(L):
    c = Counter(L)
    for elt, count in c.items():
        if count == 1:
            yield elt
        else:
            for letter in letters[:count]:
                yield elt + letter


And input/output looks like this:

>>> L = ['T1','T2','T2','T2','T2','T3','T3']
>>> list(gen(L))
['T2A', 'T2B', 'T2C', 'T2D', 'T3A', 'T3B', 'T1']


What I'm trying to understand is why it's re-arranging the order of the list based on number of occurrences (I'm guessing it's doing a LIFO type of list insertion based on the for loop?). What I would like to do is maintain the original order of the list while these updates are applied. Has anyone tried anything like this before? I'm pretty new to Python but I'm getting up to speed pretty quickly and getting to be a fan of the language. Any help is appreciated!

Thanks in advance,
Glen

gelonida

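Dicts are unordered (in Python versions before 3.7), and the documentation describes Counter as a subclass of dict, so iterating over c.items() is not guaranteed to follow the order of your list.
For your exact requirement you might therefore have to roll your own class.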

Instead of giving you an answer right away, I'd like to ask you some questions first, as they are
important for finding the best solution for your exact use case.

Will the list always be alphabetically sorted?
If yes, then you could adapt the code above by changing line 6 from

    for elt, count in c.items():

to

    for elt, count in sorted(c.items()):

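For reference, here is a minimal sketch of the generator with that one-line change applied. It assumes the input list is already sorted; the name gen_sorted is only illustrative and not from the original answer.

```python
from collections import Counter
from string import ascii_uppercase as letters

def gen_sorted(L):
    # Count how often each name occurs.
    c = Counter(L)
    # Visit the names in sorted order instead of dict order.
    for elt, count in sorted(c.items()):
        if count == 1:
            yield elt
        else:
            # Only 26 suffixes (A..Z) exist here; with more than 26
            # duplicates the surplus entries would silently be dropped.
            for letter in letters[:count]:
                yield elt + letter

L = ['T1', 'T2', 'T2', 'T2', 'T2', 'T3', 'T3']
print(list(gen_sorted(L)))
# ['T1', 'T2A', 'T2B', 'T2C', 'T2D', 'T3A', 'T3B']
```

This only restores the original order when the input itself is sorted, because the duplicates are still emitted group by group.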

If not:
Let's assume
L = ['T2','T2','T3','T2','T2','T1','T3']
Would the following output be acceptable to you?

L = ['T2','T2A','T3','T2B','T2C','T1','T3A']


or do you insist on

L = ['T2A','T2B','T3A','T2C','T2D','T1','T3B']

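The second variant (every member of a duplicated group gets a letter, positions unchanged) could be sketched roughly like this. The function name rename_all_duplicates is made up for illustration, and the sketch assumes no name occurs more than 26 times.

```python
from collections import Counter
from string import ascii_uppercase as letters

def rename_all_duplicates(names):
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how often each name has occurred so far
    result = []
    for name in names:
        if totals[name] == 1:
            # Unique names are kept as they are.
            result.append(name)
        else:
            # Assumption: at most 26 occurrences per name;
            # a 27th occurrence would raise IndexError here.
            result.append(name + letters[seen[name]])
            seen[name] += 1
    return result

L = ['T2', 'T2', 'T3', 'T2', 'T2', 'T1', 'T3']
print(rename_all_duplicates(L))
# ['T2A', 'T2B', 'T3A', 'T2C', 'T2D', 'T1', 'T3B']
```

The list is walked in its original order, so the Counter is only used for lookups and its own iteration order never matters.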

Could L ever be something like:

L = ['T2','T2','T3','T2','T2','T1','T3', 'T2A']


How would you like to handle this situation? (Making T2 unique would produce T2A, which would mean that T2A now exists twice.)
Should the last 'T2A' become 'T2AA'?

jisoo411

ASKER

Hi Gelonida,

Thanks for replying. To answer your questions, the list could indeed look like:

L = ['T2','T2A','T3','T2B','T2C','T1','T3A']


What matters most is being able to distinguish between the duplicated column names and keep all columns in the same order. Regarding the last question, I don't think I would see a column name coming in as 'T2A'. But if it did, 'T2AA' or something similar to distinguish it would work.

Thanks,
Glen
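A rough sketch of what those requirements could look like in code: the first occurrence keeps its name, later duplicates get the next free suffix, and a name that collides with an already generated one (such as an incoming 'T2A') gets a further suffix. This is not the accepted solution from the thread; make_unique and suffixes are hypothetical names chosen for this example.

```python
from itertools import count, product
from string import ascii_uppercase

def suffixes():
    # Endless stream of suffixes: A..Z, AA..ZZ, AAA.. and so on.
    for length in count(1):
        for combo in product(ascii_uppercase, repeat=length):
            yield ''.join(combo)

def make_unique(names):
    used = set()   # every name handed out so far
    result = []
    for name in names:
        candidate = name
        if candidate in used:
            # Try name+A, name+B, ... until a free one turns up.
            for suffix in suffixes():
                candidate = name + suffix
                if candidate not in used:
                    break
        used.add(candidate)
        result.append(candidate)
    return result

print(make_unique(['T2', 'T2', 'T3', 'T2', 'T2', 'T1', 'T3']))
# ['T2', 'T2A', 'T3', 'T2B', 'T2C', 'T1', 'T3A']
print(make_unique(['T2', 'T2', 'T3', 'T2', 'T2', 'T1', 'T3', 'T2A']))
# ['T2', 'T2A', 'T3', 'T2B', 'T2C', 'T1', 'T3A', 'T2AA']
```

One consequence of this design: a column that really arrives as 'T2A' can end up renamed because an earlier duplicate of 'T2' already took that name. Whether that is acceptable depends on the data.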

ASKER CERTIFIED SOLUTION

pepr

SOLUTION

gelonida

SOLUTION

pepr

gelonida

@pepr:
Perhaps I misread, but as far as I know a Python for loop can have an else clause, which is executed whenever the loop completes without hitting a break statement.

( http://stackoverflow.com/questions/9979970/why-does-python-use-else-after-for-and-while-loops )
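To illustrate the for/else behaviour being discussed, here is a small self-contained sketch. The helper first_free_name and its parameters are invented for this example and are not taken from either solution in the thread.

```python
def first_free_name(base, used, suffixes):
    for suffix in suffixes:
        candidate = base + suffix
        if candidate not in used:
            break  # found a free name, so the else block is skipped
    else:
        # Runs only when the loop ended without break,
        # i.e. every suffix was tried and none was free.
        raise ValueError('no free suffix left for %r' % base)
    return candidate

print(first_free_name('T2', {'T2', 'T2A'}, 'ABC'))
# T2B
```

Calling first_free_name('T2', {'T2', 'T2A'}, 'A') raises ValueError instead, because the loop runs out of suffixes without ever reaching break.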

Checking whether the resulting list is really unique is just a defensive practice that avoids having to code a fully generic, foolproof solution. (I like to code as little as possible, but verify that my result is correct. If I get exceptions, I start coding the more complete solution.
A silently non-unique list that causes an error somewhere else entirely might be dangerous or difficult to detect.)

I prefer an exception to a non-unique list or to a potentially overcomplicated solution.

Your suggestion with an infinite iterator would of course solve the problem.

Instead of the else clause on the for loop, one could also check between lines 18 and 19 of your original code that len(used) == len(result), and raise an error if not.
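Since the code being referred to is not visible here, the following is only a guess at what such a check might look like. The names used and result are borrowed from the comment above; the helper check_unique is hypothetical.

```python
def check_unique(result):
    # A set drops duplicates, so the lengths differ
    # exactly when some name occurs more than once.
    used = set(result)
    if len(used) != len(result):
        raise ValueError('names are not unique: %r' % (result,))
    return result

print(check_unique(['T2', 'T2A', 'T3']))
# ['T2', 'T2A', 'T3']
```

A list such as ['T2', 'T2A', 'T2A'] would raise ValueError right at the check instead of causing trouble further downstream.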

pepr

@gelonida: Oh, I see. I thought you meant the else of the if. The reason is that I am not used to the else clause of the for loop (or of the while loop). It seems strange to me and difficult to understand, as it is not used in other programming languages. One has to think hard about what it means. I would bet that people could not work out how it behaves without reading the documentation, and even many of those who know it exists have to look it up again when they next run into it. But you are right...

Errors should never pass silently.
Unless explicitly silenced.

gelonida

Fully agree, this else syntax is bizarre, and it's probably better to avoid it, as too many people (sometimes myself included ;-) ) have to look it up in the docs.