Link to home
Start Free TrialLog in
Avatar of tech
tech

asked on

looking up values of 1 year into other years

Sample csv file:

X            year
120           2010
120           2011
120           2012
121           2010
121           2011
132           2010
132           2012
145           2010
145           2011
145           2012


We have some column named X in a csv file. There are same values of X for 3 years, some are missing from either of the year like 121 is not for 2012. I want to check if all values in 2010 are there in other years. If no, then I wanted to list those X values. Here, the output should be

X      year
121   2012
132  2011
Avatar of Subodh Tiwari (Neeraj)
Subodh Tiwari (Neeraj)
Flag of India image

There maybe a better way to accomplish this but here is what I tried and that worked.

Assuming you have columns X and Year (columns names are case sensitive, change them in the code if required) in the csv file called data.csv then the following code will print the missing values on the console and save a csv file called missingvalues.csv with the missing values in the data.

import csv
import pandas as pd

df = pd.read_csv('data.csv')
x = list(df.X.unique())
year = list(df.Year.unique())

all_possible = {(a, b) for a in x  
          for b in year if a != b}

original = set(list(zip(df.X, df.Year)))

missing = list(all_possible.difference(original))

print(missing)

with open("missingvalues.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(missing)

Open in new window


Here is the output when ran on Jupyter Notebook...
User generated image
Here is the output in the missingvalues.csv file...
User generated image
@tech

How big are these csv files? (how many lines, how many bytes)
How many columns are in the first/original file?
How many columns are in the second/matching candidate file?
It appears, from what you posted, that both files are sorted in x, year order. It it a valid assumption that the files ARE sorted?
@tech

Do you require a pandas solution?

Your most recent comment is a slightly different problem than your original problem statement.  Please clarify.
...to see if there is any x in 2015 which is not there in other years

Shouldn't the second file determine which (x, year) pair you are seeking?  This phrasing of the problem (above quote) seems to indicate that if we encounter an (x, year) pair, then we need to look for all (x, year) values in our first file for unique years in the first file.  

Or is this only subsequent year values?
Both performance and memory might be considerations in the solution domain.

For instance, if x = 145 and year = 2010, we might match all three years from the first file.  When we encounter x = 145 and year = 2011, we would repeat two of the three matches we did for (145, 2010).  So, there could be a lot of unnecessary work we would want to avoid.

As your first file grows each year, it may encroach on the limits of your physical RAM.  Don't know if/when that might occur.
So, is there one CSV file or two (or more)?
thanks
This may seem like a weird question, but is the data well behaved?

For example, would we ever see something like the following?  In this example, 120 has five rows, but the high year is 2015, not 2014.  I'm thinking of possible typos in the data.
X	year
120	2010
120	2011
120	2012
120	2013
120	2015
121	2010
121	2011
121	2012
121	2013
121	2014

Open in new window

Maybe that is why you wanted to add two parameters, limiting the years being processed.
Please test this.  You will need to change the parameter values, especially the csv file path.
"""
Created on Wed Dec 11 15:56:15 2019

@author: aikimark
@EE problem: 29166343
"""
import csv

def findMissingYears(startYear, 
                     endYear,
                     csvFilePath):
    uniqueX = set()
    presentXSubsequentYears = {}
    subsequentYears = set([x for x in range(startYear+1, endYear+1)])
    subsequentYearCount = len(subsequentYears)
    
    with open(csvFilePath, 'r') as f:
        d = csv.DictReader(f)
        '''
           get all the X values for the starting year
        '''
        for line in d:
            if int(line['year']) == startYear:
                uniqueX.add(line["X"])

        f.seek(0)
        d = csv.DictReader(f)
        '''
           populate a dictionary object for matching X values in the following
           years.  X is the key and the years will be the value
        '''
        for line in d:
            if startYear+1 <= int(line['year']) <= endYear:
                if line["X"] in presentXSubsequentYears:
                    presentXSubsequentYears[line["X"]].add(int(line['year']))
                else:
                    presentXSubsequentYears[line["X"]] = set([int(line['year'])])

    '''
       for each item in the dictionary, compare the count of unique year values
       against the parameter-defined range.  If count not equal, print the
       difference set along with the key (X)
    '''
    for k, years in presentXSubsequentYears.items():
        if len(years) != subsequentYearCount:
            print(k, subsequentYears.difference(years))

findMissingYears(2010, 2014,
                 r'c:\users\new user\downloads\Q_29166343.csv')

Open in new window

What is the 0..7 column?  That wasn't in your initial post.  I didn't test my code against data that looked like that.

Please post a small csv file that is a representative sample of what you are actually using.  It doesn't have to be big, but does need to represent the data.
What were the first two parameter values you supplied to the findMissingYears function?
but none of the data you posted is in that range, just 2010-2012
The problem is that I don't have any data that includes those years or an X value of 12460.

I think I can tweak the code, but I need to be able to test the code.
Also, you are working with a CSV file, but you're posting some formatted listing of the data, which is space/tab separated.  I created a csv file from your initial post, but I can't be sure it is an accurate transformation.
Avatar of tech
tech

ASKER

for this data,

X      year
13750      2018
15143      2018
15161      2017
15161      2018
15024      2018
19358      2017
9817      2016
1908      2015
1908      2016


output is

13750 {2016, 2017}
15143 {2016, 2017}
15161 {2016}
15024 {2016, 2017}
19358 {2016, 2018}
9817 {2017, 2018}
1908 {2017, 2018}

output should be
1908 {2017, 2018}

as only 1908 is in 2015.
ASKER CERTIFIED SOLUTION
Avatar of aikimark
aikimark
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
my test for start/end years = 2015,2018 produced the following:
1908 {2017, 2018}

Open in new window

which is what you stated was the desired output.
Avatar of tech

ASKER

I have checked your code and it seems to work fine. Thanks a million for your effort.
I needed help with few things in this.

I wanted to delete all unwanted records. Like what we are printing now are the missing values from the subsequent years which I wanted to delete and come up with a new csv file with records for all years which are in start year.

like for this data

X              year
13750      2015
15143      2018
15161      2015
15161      2016
15024      2018
19358      2017
9817        2016
9817        2017
9817        2018
1908        2015
1908        2016
1908        2017
1908        2018

should give this output

1908       2015
1908       2016
1908       2017
1908       2018

as only 1908 is in 2015 which has X for all other years.
Since the matching problem has been solved, probably best to open a new question about deleting records/lines.
Avatar of tech

ASKER

ok. Thanks. Can you please take a look at the new question?

https://www.experts-exchange.com/questions/29166962/Delete-unwanted-records.html#questionAdd
will do
basically, iterate the years difference.
    for k, years in presentXSubsequentYears.items():
        if len(years) != subsequentYearCount:
            for yr in subsequentYears.difference(years):
                print(k, yr)

Open in new window