How can I use Python to find almost identical CSV entries?

Posted on 2014-08-19
Last Modified: 2014-08-27
Hello there!

I am fiddling with Python because I have a CSV file with a lot of almost identical rows.

The CSV is ordered by name...

I need to find rows that are identical in every field but one, like these two:

['name', 'value', 'value', '1', 'value', 'value', 'value', 'value', 'value', 'value']
['name', 'value', 'value', '-1', 'value', 'value', 'value', 'value', 'value', 'value']

Whenever I find such a pair I need to delete both rows, and keep going until the end of the file.
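
The rule described above (two rows that match in every field except one) can be sketched as a small helper. This is only an illustration under assumptions: it targets Python 3, it takes the differing field to be at index 3 as in the two sample rows, and the name `differs_only_at` is hypothetical, not something from the original post.

```python
def differs_only_at(row_a, row_b, index=3):
    """Return True if the rows differ at `index` and match in every other field."""
    # Rows of different length, or too short to have the field, never pair up
    if len(row_a) != len(row_b) or index >= len(row_a):
        return False
    # The one field that is allowed to differ must actually differ
    if row_a[index] == row_b[index]:
        return False
    # Every other field must be equal
    return all(a == b
               for i, (a, b) in enumerate(zip(row_a, row_b))
               if i != index)


row_1 = ['name', 'value', 'value', '1', 'value', 'value', 'value', 'value', 'value', 'value']
row_2 = ['name', 'value', 'value', '-1', 'value', 'value', 'value', 'value', 'value', 'value']
print(differs_only_at(row_1, row_2))
```

With the two sample rows from the question the helper reports a match; two fully identical rows do not count as a pair, because the field at `index` has to differ.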

I've been trying a bit but I can't find a solution for the problem...

I can read the data from the CSV and sort it with no problems. But once I have it as a list of lists, I can't work out how to build the function that should do the job:

import csv
import operator

def checkFile(file_csv):
    # read all rows and sort them by the ninth column (index 8)
    file_to_check = csv.reader(open(file_csv))
    file_to_check_ordered = sorted(file_to_check, key=operator.itemgetter(8), reverse=False)

    record_to_elaborate = []

    for field in file_to_check_ordered:
        record_to_elaborate.append(field) # copy the sorted rows into a working list
    previous_record = [0,0,0,0,0,0,0,0,0,0] # dummy record to start (10 fields, like the real rows)
    for record in record_to_elaborate:
        for number in range (0,10):
            if record[number] == previous_record[number]:
                print("equal record 1: " + str(record) + " equal record 2: " + str(previous_record))
            elif record[3] != previous_record[3]: # this is where I compare the possibly different field
                print("different record 1: " + str(record) + " different record 2: " + str(previous_record))
        previous_record = record


I am doing it wrong...

Can anybody provide help?

Question by:ltpitt
    LVL 20

    Expert Comment

    by:Mark Brady
    Can you post a sample of the data? I'll write something to parse it properly for you
    LVL 20

    Expert Comment

    by:Mark Brady
    Also send me the rules. Are these data records always in the same format, and are you only looking for records that are identical except for the 1 and -1? Give me a bit more to work with and I will help you. Cheers
    LVL 1

    Author Comment

    Sure indeed, Mark and thanks for your kind help!

    I'll post data as soon as I'm back home.
    LVL 20

    Expert Comment

    by:Mark Brady
    Sounds great!
    LVL 1

    Author Comment

    Here I am!

    This is the file before transformation:

    And this is the file after I've worked it manually:

    When two rows are identical (except for the 1 and -1 field), both must be deleted.
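
The deletion rule stated above could be sketched as follows. This is not the code from this thread, just one possible approach under assumptions: Python 3, the 1 / -1 field at index 3, every row having the same number of fields, and a hypothetical function name `drop_cancelling_pairs`. The sample data is made up for the demonstration. Instead of relying on the sort order, it groups rows by all the other fields and pairs each row with an earlier unmatched row whose value in that field differs.

```python
import csv
import io


def drop_cancelling_pairs(rows, index=3):
    """Return the rows left after deleting pairs that differ only at `index`."""
    rows = list(rows)   # accept any iterable, e.g. a csv.reader
    pending = {}        # key (row without the field) -> positions still unmatched
    deleted = set()     # positions of rows that have been paired off
    for pos, row in enumerate(rows):
        key = tuple(row[:index]) + tuple(row[index + 1:])
        waiting = pending.setdefault(key, [])
        # Look for an earlier unmatched row with a different value at `index`
        partner = next((p for p in waiting if rows[p][index] != row[index]), None)
        if partner is None:
            waiting.append(pos)
        else:
            waiting.remove(partner)
            deleted.update((partner, pos))
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Made-up sample data, not the file attached to the original post
sample = io.StringIO(
    "alice,a,b,1,c\n"
    "alice,a,b,-1,c\n"
    "bob,a,b,1,c\n"
    "carol,a,b,1,c\n"
    "carol,a,b,1,c\n"
)
kept = drop_cancelling_pairs(csv.reader(sample))
for row in kept:
    print(row)
```

Each row is used in at most one pair, so three rows with values 1, -1, 1 leave the third one in place. Exact duplicates (like the two made-up "carol" rows) are kept, since the rule only removes rows whose 1 / -1 field differs.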
    LVL 1

    Accepted Solution

    OMG looks like now it works!

    It looks very bad, and I would really appreciate your comments / corrections. Please consider it as the solution :)

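    import csv
    import operator

    def controllaFile(file_csv):
        # read all rows and sort them by the ninth column (index 8)
        file_da_controllare = csv.reader(open(file_csv))
        file_ordinato_da_controllare = sorted(file_da_controllare, key=operator.itemgetter(8), reverse=False)
        record_da_elaborare = []
        for campo in file_ordinato_da_controllare:
            record_da_elaborare.append(campo) # assumed loop body: copy the sorted rows
        #print record_da_elaborare
        record_precedente = []
        record_elaborati = []
        record_da_cancellare = []
        for record in record_da_elaborare:
            if record_precedente == []:
                record_precedente = record
            # fields 0, 1, 2 and 4 are equal and field 3 differs: the two rows cancel out
            if record[0] == record_precedente[0] and record[1] == record_precedente[1] and record[2] == record_precedente[2] and record[3] != record_precedente[3] and record[4] == record_precedente[4]:
                print("cancello:") # "cancello" = "deleting"
                print(record)
                print(record_precedente)
                # assumed: mark both rows of the pair for deletion
                record_da_cancellare.append(record_precedente)
                record_da_cancellare.append(record)
                record_precedente = []
                continue # a deleted row must not be paired with the next one
            record_precedente = record
        for record in record_da_cancellare:
            record_da_elaborare.remove(record) # assumed loop body: drop the marked rows
        record_elaborati = record_da_elaborare
        print(record_elaborati)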

    LVL 20

    Assisted Solution

    by:Mark Brady
    That code doesn't look bad at all. I would add some comments to it so someone reading it can see what is supposed to happen but other than that, it looks ok and if it works then great!
    LVL 1

    Author Comment


    I am the proudest man ever! :D

    Thanks for the kind help.
    LVL 1

    Author Closing Comment

    It simply works :)
