Solved

How to identify duplicates in a list and update?

Posted on 2014-04-22
8
294 Views
Last Modified: 2014-04-24
Hello all,

I'm trying to figure out how to identify duplicate entries in a list and rename them appropriately so that I have unique names across the list (all while maintaining the same order).  I found a wonderful reference that nearly gets me there:

http://stackoverflow.com/questions/17202444/python-how-to-find-duplicates-in-a-list-and-update-these-duplicate-items-by-re

The actual code looks like this:
from collections import Counter
from string import ascii_uppercase as letters

def gen(L):
    c = Counter(L)
    for elt, count in c.items():
        if count == 1:
            yield elt
        else:
            for letter in letters[:count]:
                yield elt + letter

Open in new window


And input/output looks like this:

>>> L = ['T1','T2','T2','T2','T2','T3','T3']
>>> list(gen(L))
['T2A', 'T2B', 'T2C', 'T2D', 'T3A', 'T3B', 'T1']

Open in new window


What I'm trying to understand is why it's re-arranging the order of the list based on number of occurrences (I'm guessing it's doing a LIFO type of list insertion based on the for loop?).  What I would like to do is maintain the original order of the list while these updates are applied.  Has anyone tried anything like this before?  I'm pretty new to Python but I'm getting up to speed pretty quickly and getting to be a fan of the language.  Any help is appreciated!

Thanks in advance,
Glen
0
Comment
Question by:jisoo411
  • 4
  • 3
8 Comments
 
LVL 16

Expert Comment

by:gelonida
Comment Utility
dicts are unordered and in the documentation Counter is marked to be a subclass of dict.
So for your exact requirement you might have to roll your own class.

Insteead of giving you an answer I'd like to ask you some questions first, as they are
 important to find the best solution for your exact use case.

will the list always be alphabetically sorted?
If yes, then you could change above code by changing line 6 from
    for elt, count in c.items():

Open in new window

to
    for elt, count in sorted(c.items()):

Open in new window


If not:
Let s assume
L = ['T2','T2','T3','T2','T2','T1','T3']
Would following output  be satisfying for you?
L = ['T2','T2A','T3','T2B','T2C','T1','T3A']

Open in new window

or do you insist on
L = ['T2A','T2B','T3A','T2C','T2D','T1','T3B']

Open in new window




Could L ever be something like  =
L = ['T2','T2','T3','T2','T2','T1','T3', 'T2A']

Open in new window

How would you like to handle this situation?  (Unification of T2 would result in T2A and this would mean, that T2A would now exist twice.
should the last 'T2A' become 'T2AA'?
0
 

Author Comment

by:jisoo411
Comment Utility
Hi Gelonida,

Thanks for replying.  To answer your questions, the list could indeed look like:

L = ['T2','T2A','T3','T2B','T2C','T1','T3A']

Open in new window


What matters most is being able to distinguish between the duplicated column names and keep all columns in the same order.  In reference to the last question, I don't think I would see a column name coming in with 'T2A'.  But if it did 'T2AA' or something to distinguish it would work.  

Thanks,
Glen
0
 
LVL 28

Accepted Solution

by:
pepr earned 333 total points
Comment Utility
Try the following function. It can also be converted to generator, but you probably want to insert it to the CSV file or the like:
#!python3

def make_unique(lst):
    used = set()             # set of the names used earlier
    result = []              # the modified list
    for name in lst:         # check every input name
        if name not in used:
            result.append(name)   # it was not used yet, so left it as is
            used.add(name)        # but it was used now
        else:
            # We have to modify the earlier used name. Say to add
            # one of the capital letters. Try from A and check.
            for extension in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
                xname = name + extension     # the modified name
                if xname not in used:
                    break                    # this one was not used
            result.append(xname)  # append to the result           
            used.add(xname)       # and remember as used
    return result        
        

L = ['T1','T2','T2','T2','T2','T3','T3']
LL = make_unique(L)
print(L)
print(LL)

Open in new window

It prints on my console
c:\_Python\jisoo411\Q_28417944>a.py
['T1', 'T2', 'T2', 'T2', 'T2', 'T3', 'T3']
['T1', 'T2', 'T2A', 'T2B', 'T2C', 'T3', 'T3A']

Open in new window

0
 
LVL 16

Assisted Solution

by:gelonida
gelonida earned 167 total points
Comment Utility
minor suggestion in order to detect potential (in real world probably non existing)
failures to uniqify your list:


add following after line 16:
        else:
            print "failed to uniqify list" 
           raise UniqifyException()

Open in new window


and add following lines to the beginning of the file

class UniqifyException(Exception):
    """ custom exception """"

Open in new window


The code could be sped up slightly if you used a dict with counter values instead of a set, but I think for most real world cases this will not matter and the current code is easier to read.
0
Highfive + Dolby Voice = No More Audio Complaints!

Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

 
LVL 28

Assisted Solution

by:pepr
pepr earned 333 total points
Comment Utility
Well, in the case that 26 characters would not be enough to solve the duplicates, I would rather change the extension method to generate say the '_1', '_2', ..., '_156', etc.

There cannot be else after the line 16. The problem of repeating say T2Z (i.e. detection of the unsolved case) must be solved after the for loop finished via testing against used set (again). If used, then another extension method could be used - say generating 'AA', 'AB', 'AC', etc.

Even better solution would be to define a generator of the 'A', 'B', ..., 'Z', 'AA', ..., 'AZ', 'BA', etc. sequence instead of the 'ABCD...Z' string.
0
 
LVL 16

Expert Comment

by:gelonida
Comment Utility
@pepr:
Perhaps I misread, but I thought python for loops can have an else statement being entered whenever the for loop is completed without having hit a break statement

( http://stackoverflow.com/questions/9979970/why-does-python-use-else-after-for-and-while-loops )


Checking whether the resulting list is really unique is just a defensive practice without trying to code a fully generic fool proof solution. (I like to code as little as possible, but verify that my result is correct. If I get exceptions I start coding the more complete solution.
Having silently a non unique list causing an error somewhere completely else might be dangerous or difficult to detect)

I prefer an exception to a non unique list or to a potentially too complicated solution.

Your suggestion with an infinite iterator would of course solve the problem.


Instead of the else statement in the for loop one could also check between line 18 and 19 of your original code, that, len(used) == len(result) and if not raise an error.
0
 
LVL 28

Expert Comment

by:pepr
Comment Utility
@gelonida: Oh, I see. Well, I thought you mean the else of the if. The reason is that I am not used to the else clause of the for loop (or of the while loop). It seems strange to me, difficult to understand as it is not used in other programming languages. One have to thing hard what does it mean. I could bet, people would not find how it works without reading a documentation. And even many the people who know it exists have to look to the doc again if they meet the situation again. But you are right...
Errors should never pass silently.
Unless explicitly silenced.
0
 
LVL 16

Expert Comment

by:gelonida
Comment Utility
Fully agree, this else syntax is bizarre and probably it's better to avoid it as too many people (sometimes myself included ;-) ) have to look in the doc .
0

Featured Post

Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

Join & Write a Comment

Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
The purpose of this article is to demonstrate how we can upgrade Python from version 2.7.6 to Python 2.7.10 on the Linux Mint operating system. I am using an Oracle Virtual Box where I have installed Linux Mint operating system version 17.2. Once yo…
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now