Solved

How can I use Python to find almost identical CSV entries?

Posted on 2014-08-19
Last Modified: 2014-08-27
Hello there!

I am fiddling with Python because I have a CSV file with a lot of almost identical rows.

The CSV is sorted by name.

I need to find rows that are identical in every field but one, like these two:

['name', 'value', 'value', '1', 'value', 'value', 'value', 'value', 'value', 'value']
['name', 'value', 'value', '-1', 'value', 'value', 'value', 'value', 'value', 'value']

Whenever I find such a pair, I need to delete both rows, and keep doing so until the end of the file.
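The "identical in every field but one" test can be sketched as a small predicate. This is only an illustration: the helper name differ_only_at is made up here, and it assumes each row is a plain list of strings as returned by csv.reader.

```python
def differ_only_at(row_a, row_b, index):
    """Return True when the rows are equal everywhere except at `index`."""
    if len(row_a) != len(row_b):
        return False
    # At `index` the values must differ; at every other position they must match.
    return all(
        (a != b) if i == index else (a == b)
        for i, (a, b) in enumerate(zip(row_a, row_b))
    )


# The two sample rows from the question: only field 3 differs.
first = ['name', 'value', 'value', '1', 'value', 'value', 'value', 'value', 'value', 'value']
second = ['name', 'value', 'value', '-1', 'value', 'value', 'value', 'value', 'value', 'value']
print(differ_only_at(first, second, 3))
```

Two identical rows do not count as a match, because the field at `index` has to be different.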

I've been trying a bit but I can't find a solution for the problem...

I can read the data from the CSV and sort it with no problems. But once I have it as a list of lists, I can't work out how to build the function that should do the job:

def checkFile(file_csv):
    file_to_check = csv.reader(file(file_csv))
    file_to_check_ordered = sorted(file_to_check, key=operator.itemgetter(8), reverse=False)

    record_to_elaborate = []

    for field in file_to_check_ordered:
        record_to_elaborate.append([field[8],field[9],field[10],field[11],field[14],field[15],field[19],field[20],field[21],field[33]])
    
    previous_record = [0,0,0,0,0,0,0,0,0] # dummy record to start
    for record in record_to_elaborate:
        for number in range (0,10):
            if record[number] == previous_record[number]:
                print "equal record 1: " + str(record) + " equal record 2: " + str(previous_record)
            elif record[3] != previous_record[3]: # this is where I compare the possibly different field
                print "different record 1: " + str(record) + " different record 2: " + str(previous_record)
        previous_record = record     

I am doing it wrong...

Can anybody provide help?

Thanks
Question by:ltpitt
  • 5
  • 4
9 Comments
 
LVL 20

Expert Comment

by:Mark Brady
ID: 40271615
Can you post a sample of the data? I'll write something to parse it properly for you
0
 
LVL 20

Expert Comment

by:Mark Brady
ID: 40271617
Also send me the rules. Are these data records always in the same format, and are you only looking for records that are identical except for the 1 and -1? Give me a bit more to work with and I will help you. Cheers
0
 
LVL 1

Author Comment

by:ltpitt
ID: 40272974
Sure, Mark, and thanks for your kind help!

I'll post data as soon as I'm back home.
0
 
LVL 20

Expert Comment

by:Mark Brady
ID: 40273445
Sounds great!
0
 
LVL 1

Author Comment

by:ltpitt
ID: 40277278
Here I am!

This is the file before transformation:

https://dl.dropboxusercontent.com/u/3900156/before.xls

And this is the file after I've worked it manually:

https://dl.dropboxusercontent.com/u/3900156/after.xls

When two rows are identical (except for the 1 and -1 field), both must be deleted.
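One way to apply this rule without relying on the matching rows sitting next to each other is to group rows by every field except the 1 / -1 one, then cancel the 1 rows against the -1 rows inside each group. This is a sketch of a different technique from comparing neighbours, not something posted in this thread. The function name drop_cancelling_rows is hypothetical, and it assumes the field holds exactly the strings '1' or '-1'.

```python
from collections import defaultdict


def drop_cancelling_rows(rows, sign_index):
    """Remove every '1' row that has a '-1' twin (and the twin itself)."""
    # Map "all fields except the sign field" -> positions of '1' and '-1' rows.
    groups = defaultdict(lambda: {'1': [], '-1': []})
    for pos, row in enumerate(rows):
        value = row[sign_index].strip()
        if value in ('1', '-1'):  # assumption: other values never cancel
            key = tuple(row[:sign_index]) + tuple(row[sign_index + 1:])
            groups[key][value].append(pos)

    # In each group, cancel as many '1' rows against '-1' rows as possible.
    to_delete = set()
    for bucket in groups.values():
        pairs = min(len(bucket['1']), len(bucket['-1']))
        to_delete.update(bucket['1'][:pairs])
        to_delete.update(bucket['-1'][:pairs])

    return [row for pos, row in enumerate(rows) if pos not in to_delete]


# Made-up sample data: the two 'anna' rows cancel even though they are not adjacent.
rows = [
    ['anna', 'x', '1'],
    ['bob', 'y', '1'],
    ['anna', 'x', '-1'],
    ['carl', 'z', '-1'],
    ['bob', 'y', '1'],
]
print(drop_cancelling_rows(rows, 2))
```

The two 'bob' rows survive because both carry '1', so there is nothing to cancel them against.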
0
 
LVL 1

Accepted Solution

by:
ltpitt earned 0 total points
ID: 40277700
OMG, it looks like it works now!

It looks very bad, and I would really appreciate your kind comments / corrections, and consider it as the solution :)

import csv
import operator

def controllaFile(file_csv):
    # Read every row and sort the rows by the name column (index 8)
    file_da_controllare = csv.reader(open(file_csv))
    file_ordinato_da_controllare = sorted(file_da_controllare, key=operator.itemgetter(8))

    # Keep only the fields that take part in the comparison
    record_da_elaborare = []
    for campo in file_ordinato_da_controllare:
        record_da_elaborare.append([campo[8], campo[9], campo[10], campo[11], campo[14]])

    # print(record_da_elaborare)

    record_precedente = []
    record_elaborati = []
    record_da_cancellare = []
    for record in record_da_elaborare:
        # Nothing to compare against yet: remember this record and move on
        if record_precedente == []:
            record_precedente = record
            continue
        # A pair matches when every field is equal except field 3 (the 1 / -1 field)
        if (record[0] == record_precedente[0]
                and record[1] == record_precedente[1]
                and record[2] == record_precedente[2]
                and record[3] != record_precedente[3]
                and record[4] == record_precedente[4]):
            print("deleting:")
            print(record)
            print(record_precedente)
            record_da_cancellare.append(record)
            record_da_cancellare.append(record_precedente)
            # Reset so that a deleted record cannot be paired a second time
            record_precedente = []
        else:
            record_precedente = record
    # Remove both rows of every matched pair
    for record in record_da_cancellare:
        record_da_elaborare.remove(record)
    record_elaborati = record_da_elaborare
    print(record_elaborati)
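The same neighbour-comparison idea can also be sketched so that it keeps the whole rows and writes the survivors back out as CSV, instead of printing only the compared fields. The function name remove_adjacent_pairs, its parameters, and the three-column sample data are assumptions for illustration. With the real file, the key columns would presumably be 8, 9, 10 and 14 and the differing column 11, as in the code above.

```python
import csv
import io


def remove_adjacent_pairs(rows, key_columns, differing_column):
    """Drop two neighbouring rows that match on key_columns but differ in differing_column."""
    kept = []
    previous = None
    for row in rows:
        if (previous is not None
                and all(row[c] == previous[c] for c in key_columns)
                and row[differing_column] != previous[differing_column]):
            kept.pop()       # discard the previous row; the current one is never added
            previous = None  # a deleted row must not be paired again
            continue
        kept.append(row)
        previous = row
    return kept


# Made-up CSV text standing in for the real file.
source = io.StringIO("anna,x,1\nanna,x,-1\nbob,y,1\n")
rows = sorted(csv.reader(source), key=lambda r: r[0])
survivors = remove_adjacent_pairs(rows, key_columns=[0, 1], differing_column=2)

# Write the surviving rows back out in CSV form.
output = io.StringIO()
csv.writer(output, lineterminator="\n").writerows(survivors)
print(output.getvalue())
```

Like the code above, this only catches pairs that end up next to each other after sorting, so it may be worth sorting on all the key columns rather than on the name alone.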

0
 
LVL 20

Assisted Solution

by:Mark Brady
Mark Brady earned 400 total points
ID: 40277735
That code doesn't look bad at all. I would add some comments to it so someone reading it can see what is supposed to happen but other than that, it looks ok and if it works then great!
0
 
LVL 1

Author Comment

by:ltpitt
ID: 40278520
Really?

I am the proudest man ever! :D

Thanks for the kind help.
0
 
LVL 1

Author Closing Comment

by:ltpitt
ID: 40287413
It simply works :)
0
