Ingesting and analysing file

Suppose I have 2 text files

Name, Age
A,3
B,5
C,7

Name,School
B,QQ
C,WW
D,EE

Using Python 3, I’d like to produce a file like this.

Name,Age,School
A,3,NA
B,5,QQ
C,7,QQ
D,NA,EE

The NA value is because A’s school is not in the second file and D’s age is not in the first file.
AlHal2Asked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

gelonidaCommented:
for this simple example you could read the lines and split them by ','
However I suggest to take the python module csv and use a csv reader. so that it will also handle more complicated boundary cases.

I assume, that the text files don't contain millions of lines and that you can afford to read  the first and second file completely into memory.

In which order do you want to write the output lines?
In alphabetical order of Name?

I assume also, that only one field ('Name') is common between the two files and that no 'Name' exists twice in the same file.

Are my assumptions right?
0
gelonidaCommented:
OK here a generic solution with my above assumptions:

I populated the columns, that should contain the text 'NA' manually.

However the DictWriter from the csv module detects automatically undefined colums and can populate them with 'NA'
Therefore there are four lines, that you could comment out. The code would still be working.

I added some print statements, so that it is easier to understand how the code is working.
Just remove or comment them after testing

there's a minor modification to do in case you want to be able to handle a completely empty file1 or file2

from csv import DictReader
from csv import DictWriter

NA_TEXT = "NA" # text to fill into undefined columns

def read_file_to_dict(fname, common_key):
    """ reads a file into a dict and returns the dict
        and the rows existing in that file

        each file must contain at least one line containing 
        the names of the columns
    """
    rows = {}

    with open(fname) as fin:
        reader = DictReader(fin)
        for row in reader:
            print(row)
            rows[row[common_key]] = row
    
    return rows, reader.fieldnames


def combine_csv(fname1, fname2, rslt_fname, common_key='Name'):
    """ reads rows from two csv files and combines them """

    file1_rows, field_names1 = read_file_to_dict(fname1, common_key)
    file2_rows, field_names2 = read_file_to_dict(fname2, common_key)

    # list of field names for the N/A cases
    na_fields1 = [name for name in field_names1 if name != common_key]
    print("fields to set to NA if row not in file1", na_fields1)

    na_fields2 = [name for name in field_names2 if name != common_key]
    print("fields to set to NA if row not in file2", na_fields2)

    # This are the fields of the destination file
    rslt_fields = field_names1 + na_fields2
    print("fields to write to result file", rslt_fields)

    # dicts for the NA case
    na_dict1 = dict([(name, NA_TEXT) for name in na_fields1])
    print("na_dict1", na_dict1)
    na_dict2 = dict([(name, NA_TEXT) for name in na_fields2])
    print("na_dict2", na_dict2)

    rows = {} # dict storing the results
    # process all rows from file1
    for key, row in file1_rows.items():
        if key in file2_rows: # key exists in both files?
            row2 = file2_rows[key]
            print("merge", row, "and", row2)
            row.update(row2) 
            file2_rows.pop(key) # remove row from file2_rows
        else:  # the else and the next two lines could be removed
            row.update(na_dict2)
            print("take", row, "from file1")
        rows[key] = row

    # process all rows from file2 that were not already merged
    for key, row in file2_rows.items():
        row.update(na_dict1) # this line could be removed
        rows[key] = row

    # create result file
    with open(rslt_fname, "w") as fout:
        writer = DictWriter(fout, fieldnames=rslt_fields, restval=NA_TEXT)
        for key, row in sorted(rows.items()):
            print("writing", row)
            writer.writerow(row)

if __name__ == '__main__':
    combine_csv('file1.csv', 'file2.csv', 'rslt.csv')

Open in new window

0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
AlHal2Author Commented:
Thanks for this.  How do I print the dictionary to a text file?
0
Cloud Class® Course: C++ 11 Fundamentals

This course will introduce you to C++ 11 and teach you about syntax fundamentals.

gelonidaCommented:
I don't understand your question?

the result (comma separated values)
should be in the file 'rslt.csv' or whatevert you pass to the parameter rslt_fname.
0
gelonidaCommented:
if you really want to print the dict to another file, then you could either dump the dict into a json file or redirect the print statement to a file (redirecting print output to a file  varies slightly between python 2 and python 3).

Just tell me what you want exactly?
0
AlHal2Author Commented:
Thanks very much.  I had neglected to scroll through the code.
0
gelonidaCommented:
you're welcome.
I hope the code is clear enough.
I tried to put enough print statements to easier be able to follow what the code is doing.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
python3

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.