How can I use Python to find almost identical CSV entries?

Hello there!

I am fiddling with Python because I have a CSV file with a lot of almost identical rows.

The CSV is ordered by name.

I need to find rows that are identical in all fields but one, like these two:

['name', 'value', 'value', '1', 'value', 'value', 'value', 'value', 'value', 'value']
['name', 'value', 'value', '-1', 'value', 'value', 'value', 'value', 'value', 'value']

Whenever I find such a pair, I need to delete both rows, and keep going until the end of the file.

I've been trying for a while, but I can't find a solution to the problem...

I can read the data from the CSV and sort it with no problems. But once I have it as a list of lists, I can't work out how to build the function that should do the job:

import csv
import operator

def checkFile(file_csv):
    # Read all rows and sort them by the name column (index 8)
    file_to_check = csv.reader(open(file_csv))
    file_to_check_ordered = sorted(file_to_check, key=operator.itemgetter(8))

    # Keep only the ten columns that take part in the comparison
    record_to_elaborate = []
    for field in file_to_check_ordered:
        record_to_elaborate.append([field[8],field[9],field[10],field[11],field[14],field[15],field[19],field[20],field[21],field[33]])

    previous_record = [0,0,0,0,0,0,0,0,0,0] # dummy record to start (ten fields)
    for record in record_to_elaborate:
        # Compare the current record with the previous one, field by field
        for number in range(0,10):
            if record[number] == previous_record[number]:
                print("equal record 1: " + str(record) + " equal record 2: " + str(previous_record))
            elif record[3] != previous_record[3]: # this is where I compare the possibly different field
                print("different record 1: " + str(record) + " different record 2: " + str(previous_record))
        previous_record = record

I am doing it wrong...

Can anybody provide help?

Thanks
Asked by ltpitt

Mark Brady (Principal Data Engineer) commented:
Can you post a sample of the data? I'll write something to parse it properly for you.

Mark Brady (Principal Data Engineer) commented:
Please also send me the rules. Are these data records always in the same format, and are you only looking for records that are identical except for the 1 and -1? Give me a bit more to work with and I will help you. Cheers

ltpitt (author) commented:
Sure, Mark, and thanks for your kind help!

I'll post data as soon as I'm back home.

Mark Brady (Principal Data Engineer) commented:
Sounds great!

ltpitt (author) commented:
Here I am!

This is the file before transformation:

https://dl.dropboxusercontent.com/u/3900156/before.xls

And this is the file after I've processed it manually:

https://dl.dropboxusercontent.com/u/3900156/after.xls

When two rows are identical (except for the 1 and -1 field), both must be deleted.

ltpitt (author) commented:
OMG, it looks like it works now!

It looks very bad, and I would really appreciate your comments and corrections. Please consider it as the solution :)

import csv
import operator

def controllaFile(file_csv):
    # Read all rows and sort them by the name column (index 8)
    file_da_controllare = csv.reader(open(file_csv))
    file_ordinato_da_controllare = sorted(file_da_controllare, key=operator.itemgetter(8))

    # Keep only the columns to compare; column 11 holds the 1 / -1 value
    record_da_elaborare = []
    for campo in file_ordinato_da_controllare:
        record_da_elaborare.append([campo[8],campo[9],campo[10],campo[11],campo[14]])

    record_precedente = []      # previous record, still waiting for a partner
    record_da_cancellare = []   # records to delete
    for record in record_da_elaborare:
        if record_precedente == []:
            record_precedente = record
            continue
        # A pair matches when every field is equal except the 1 / -1 field
        if record[0] == record_precedente[0] and record[1] == record_precedente[1] and record[2] == record_precedente[2] and record[3] != record_precedente[3] and record[4] == record_precedente[4]:
            print("deleting:")
            print(record)
            print(record_precedente)
            record_da_cancellare.append(record)
            record_da_cancellare.append(record_precedente)
            record_precedente = []
        else:
            record_precedente = record
    # Remove every record that was part of a matching pair
    for record in record_da_cancellare:
        record_da_elaborare.remove(record)
    record_elaborati = record_da_elaborare  # what is left after the deletions
    print(record_elaborati)
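For readers on Python 3, the same adjacent-pair check can be sketched roughly as follows. This is only an illustration, assuming the rows are already sorted and reduced to the compared columns; the names `differs_only_at` and `drop_adjacent_pairs` are made up for this example and do not appear in the thread.

```python
def differs_only_at(a, b, index):
    """Return True when rows a and b differ at `index` and nowhere else."""
    if len(a) != len(b) or a[index] == b[index]:
        return False
    return all(x == y for i, (x, y) in enumerate(zip(a, b)) if i != index)


def drop_adjacent_pairs(rows, index):
    """Remove neighbouring rows that differ only in column `index`."""
    kept = []
    i = 0
    while i < len(rows):
        if i + 1 < len(rows) and differs_only_at(rows[i], rows[i + 1], index):
            i += 2  # skip both rows of the matching pair
        else:
            kept.append(rows[i])
            i += 1
    return kept


# Made-up sample rows; column 3 plays the role of the 1 / -1 field.
rows = [
    ["anna", "x", "y", "1", "z"],
    ["anna", "x", "y", "-1", "z"],
    ["bob", "x", "y", "1", "z"],
]
cleaned = drop_adjacent_pairs(rows, index=3)
print(cleaned)  # [['bob', 'x', 'y', '1', 'z']]
```

Like the code above, this only catches pairs that sit directly next to each other after sorting.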

Mark Brady (Principal Data Engineer) commented:
That code doesn't look bad at all. I would add some comments to it so that someone reading it can see what is supposed to happen, but other than that it looks OK, and if it works then great!

ltpitt (author) commented:
Really?

I am the proudest man ever! :D

Thanks for the kind help.

ltpitt (author) commented:
It simply works :)
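One caveat with comparing neighbours only: the rows are sorted by name alone, so when several rows share a name, a 1 row and its -1 partner may not end up next to each other. A possible order-independent variant is sketched below in Python 3. It is an assumption-laden illustration, not the accepted solution: the function names, the `key_columns` and `sign_index` parameters, and the output path are all hypothetical, and it assumes the sign field holds exactly "1" or "-1".

```python
import csv
from collections import defaultdict


def remove_cancelling_rows(rows, key_columns, sign_index):
    """Drop pairs of rows that agree on `key_columns` while one row holds
    "1" and the other "-1" in column `sign_index`. Pairs need not be adjacent."""
    needed = max(max(key_columns), sign_index)
    # For each key, positions of rows still waiting for a partner, per sign
    pending = defaultdict(lambda: {"1": [], "-1": []})
    to_drop = set()
    for pos, row in enumerate(rows):
        if len(row) <= needed:
            continue  # too short to compare: keep it untouched
        sign = row[sign_index].strip()
        if sign not in ("1", "-1"):
            continue  # any other value: keep the row
        key = tuple(row[c] for c in key_columns)
        opposite = "-1" if sign == "1" else "1"
        if pending[key][opposite]:
            # A partner is waiting: drop both rows
            to_drop.add(pos)
            to_drop.add(pending[key][opposite].pop())
        else:
            pending[key][sign].append(pos)
    return [row for pos, row in enumerate(rows) if pos not in to_drop]


def clean_csv(src_path, dst_path, key_columns, sign_index):
    """Read `src_path`, drop cancelling pairs, write the rest to `dst_path`."""
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(
            remove_cancelling_rows(rows, key_columns, sign_index))
```

With the column layout used in the thread, a call might look like `clean_csv("before.csv", "after.csv", key_columns=(8, 9, 10, 14), sign_index=11)`, where both file names are placeholders. Unlike the printed result above, this keeps the full original rows and their original order.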

Question has a verified solution.