Start Free Trial

asked on

looking up values of 1 year into other years

Sample csv file:

X        year
120        2010
120        2011
120        2012
121        2010
121        2011
132        2010
132        2012
145        2010
145        2011
145        2012

We have some column named X in a csv file. There are same values of X for 3 years, some are missing from either of the year like 121 is not for 2012. I want to check if all values in 2010 are there in other years. If no, then I wanted to list those X values. Here, the output should be

X year
121 2012
132 2011

Subodh Tiwari (Neeraj)

There maybe a better way to accomplish this but here is what I tried and that worked.

Assuming you have columns X and Year (columns names are case sensitive, change them in the code if required) in the csv file called data.csv then the following code will print the missing values on the console and save a csv file called missingvalues.csv with the missing values in the data.

import csv
import pandas as pd

df = pd.read_csv('data.csv')
x = list(df.X.unique())
year = list(df.Year.unique())

all_possible = {(a, b) for a in x  
          for b in year if a != b}

original = set(list(zip(df.X, df.Year)))

missing = list(all_possible.difference(original))

print(missing)

with open("missingvalues.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(missing)

Open in new window

Here is the output when ran on Jupyter Notebook...

User generated image

Here is the output in the missingvalues.csv file...

User generated image

@tech

How big are these csv files? (how many lines, how many bytes)
How many columns are in the first/original file?
How many columns are in the second/matching candidate file?

It appears, from what you posted, that both files are sorted in x, year order. It it a valid assumption that the files ARE sorted?

@tech

Do you require a pandas solution?

Your most recent comment is a slightly different problem than your original problem statement. Please clarify.

...to see if there is any x in 2015 which is not there in other years

Shouldn't the second file determine which (x, year) pair you are seeking? This phrasing of the problem (above quote) seems to indicate that if we encounter an (x, year) pair, then we need to look for all (x, year) values in our first file for unique years in the first file.

Or is this only subsequent year values?

Both performance and memory might be considerations in the solution domain.

For instance, if x = 145 and year = 2010, we might match all three years from the first file. When we encounter x = 145 and year = 2011, we would repeat two of the three matches we did for (145, 2010). So, there could be a lot of unnecessary work we would want to avoid.

As your first file grows each year, it may encroach on the limits of your physical RAM. Don't know if/when that might occur.

So, is there one CSV file or two (or more)?

thanks

This may seem like a weird question, but is the data well behaved?

For example, would we ever see something like the following? In this example, 120 has five rows, but the high year is 2015, not 2014. I'm thinking of possible typos in the data.

Open in new window

Maybe that is why you wanted to add two parameters, limiting the years being processed.

Please test this. You will need to change the parameter values, especially the csv file path.

"""
Created on Wed Dec 11 15:56:15 2019

@author: aikimark
@EE problem: 29166343
"""
import csv

def findMissingYears(startYear, 
                     endYear,
                     csvFilePath):
    uniqueX = set()
    presentXSubsequentYears = {}
    subsequentYears = set([x for x in range(startYear+1, endYear+1)])
    subsequentYearCount = len(subsequentYears)
    
    with open(csvFilePath, 'r') as f:
        d = csv.DictReader(f)
        '''
           get all the X values for the starting year
        '''
        for line in d:
            if int(line['year']) == startYear:
                uniqueX.add(line["X"])

        f.seek(0)
        d = csv.DictReader(f)
        '''
           populate a dictionary object for matching X values in the following
           years.  X is the key and the years will be the value
        '''
        for line in d:
            if startYear+1 <= int(line['year']) <= endYear:
                if line["X"] in presentXSubsequentYears:
                    presentXSubsequentYears[line["X"]].add(int(line['year']))
                else:
                    presentXSubsequentYears[line["X"]] = set([int(line['year'])])

    '''
       for each item in the dictionary, compare the count of unique year values
       against the parameter-defined range.  If count not equal, print the
       difference set along with the key (X)
    '''
    for k, years in presentXSubsequentYears.items():
        if len(years) != subsequentYearCount:
            print(k, subsequentYears.difference(years))

findMissingYears(2010, 2014,
                 r'c:\users\new user\downloads\Q_29166343.csv')

Open in new window

What is the 0..7 column? That wasn't in your initial post. I didn't test my code against data that looked like that.

Please post a small csv file that is a representative sample of what you are actually using. It doesn't have to be big, but does need to represent the data.

What were the first two parameter values you supplied to the findMissingYears function?

but none of the data you posted is in that range, just 2010-2012

The problem is that I don't have any data that includes those years or an X value of 12460.

I think I can tweak the code, but I need to be able to test the code.

Also, you are working with a CSV file, but you're posting some formatted listing of the data, which is space/tab separated. I created a csv file from your initial post, but I can't be sure it is an accurate transformation.

ASKER

for this data,

X      year
13750      2018
15143      2018
15161      2017
15161      2018
15024      2018
19358      2017
9817      2016
1908      2015
1908      2016

output is

13750 {2016, 2017}
15143 {2016, 2017}
15161 {2016}
15024 {2016, 2017}
19358 {2016, 2018}
9817 {2017, 2018}
1908 {2017, 2018}

output should be
1908 {2017, 2018}

as only 1908 is in 2015.

ASKER CERTIFIED SOLUTION

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

my test for start/end years = 2015,2018 produced the following:

1908 {2017, 2018}

Open in new window

which is what you stated was the desired output.

ASKER

I have checked your code and it seems to work fine. Thanks a million for your effort.
I needed help with few things in this.

I wanted to delete all unwanted records. Like what we are printing now are the missing values from the subsequent years which I wanted to delete and come up with a new csv file with records for all years which are in start year.

like for this data

X year
13750 2015
15143 2018
15161 2015
15161 2016
15024 2018
19358 2017
9817 2016
9817 2017
9817 2018
1908 2015
1908 2016
1908 2017
1908 2018

should give this output

1908 2015
1908 2016
1908 2017
1908 2018

as only 1908 is in 2015 which has X for all other years.

Since the matching problem has been solved, probably best to open a new question about deleting records/lines.

ASKER

ok. Thanks. Can you please take a look at the new question?

https://www.experts-exchange.com/questions/29166962/Delete-unwanted-records.html#questionAdd

will do

basically, iterate the years difference.

    for k, years in presentXSubsequentYears.items():
        if len(years) != subsequentYearCount:
            for yr in subsequentYears.difference(years):
                print(k, yr)

Open in new window