tech
asked on
looking up values of 1 year into other years
Sample csv file:
X year
120 2010
120 2011
120 2012
121 2010
121 2011
132 2010
132 2012
145 2010
145 2011
145 2012
We have some column named X in a csv file. There are same values of X for 3 years, some are missing from either of the year like 121 is not for 2012. I want to check if all values in 2010 are there in other years. If no, then I wanted to list those X values. Here, the output should be
X year
121 2012
132 2011
X year
120 2010
120 2011
120 2012
121 2010
121 2011
132 2010
132 2012
145 2010
145 2011
145 2012
We have some column named X in a csv file. There are same values of X for 3 years, some are missing from either of the year like 121 is not for 2012. I want to check if all values in 2010 are there in other years. If no, then I wanted to list those X values. Here, the output should be
X year
121 2012
132 2011
@tech
How big are these csv files? (how many lines, how many bytes)
How many columns are in the first/original file?
How many columns are in the second/matching candidate file?
How big are these csv files? (how many lines, how many bytes)
How many columns are in the first/original file?
How many columns are in the second/matching candidate file?
It appears, from what you posted, that both files are sorted in x, year order. It it a valid assumption that the files ARE sorted?
@tech
Do you require a pandas solution?
Your most recent comment is a slightly different problem than your original problem statement. Please clarify.
Do you require a pandas solution?
Your most recent comment is a slightly different problem than your original problem statement. Please clarify.
...to see if there is any x in 2015 which is not there in other years
Shouldn't the second file determine which (x, year) pair you are seeking? This phrasing of the problem (above quote) seems to indicate that if we encounter an (x, year) pair, then we need to look for all (x, year) values in our first file for unique years in the first file.
Or is this only subsequent year values?
Both performance and memory might be considerations in the solution domain.
For instance, if x = 145 and year = 2010, we might match all three years from the first file. When we encounter x = 145 and year = 2011, we would repeat two of the three matches we did for (145, 2010). So, there could be a lot of unnecessary work we would want to avoid.
As your first file grows each year, it may encroach on the limits of your physical RAM. Don't know if/when that might occur.
For instance, if x = 145 and year = 2010, we might match all three years from the first file. When we encounter x = 145 and year = 2011, we would repeat two of the three matches we did for (145, 2010). So, there could be a lot of unnecessary work we would want to avoid.
As your first file grows each year, it may encroach on the limits of your physical RAM. Don't know if/when that might occur.
So, is there one CSV file or two (or more)?
thanks
This may seem like a weird question, but is the data well behaved?
For example, would we ever see something like the following? In this example, 120 has five rows, but the high year is 2015, not 2014. I'm thinking of possible typos in the data.
For example, would we ever see something like the following? In this example, 120 has five rows, but the high year is 2015, not 2014. I'm thinking of possible typos in the data.
X year
120 2010
120 2011
120 2012
120 2013
120 2015
121 2010
121 2011
121 2012
121 2013
121 2014
Maybe that is why you wanted to add two parameters, limiting the years being processed.
Please test this. You will need to change the parameter values, especially the csv file path.
"""
Created on Wed Dec 11 15:56:15 2019
@author: aikimark
@EE problem: 29166343
"""
import csv
def findMissingYears(startYear,
endYear,
csvFilePath):
uniqueX = set()
presentXSubsequentYears = {}
subsequentYears = set([x for x in range(startYear+1, endYear+1)])
subsequentYearCount = len(subsequentYears)
with open(csvFilePath, 'r') as f:
d = csv.DictReader(f)
'''
get all the X values for the starting year
'''
for line in d:
if int(line['year']) == startYear:
uniqueX.add(line["X"])
f.seek(0)
d = csv.DictReader(f)
'''
populate a dictionary object for matching X values in the following
years. X is the key and the years will be the value
'''
for line in d:
if startYear+1 <= int(line['year']) <= endYear:
if line["X"] in presentXSubsequentYears:
presentXSubsequentYears[line["X"]].add(int(line['year']))
else:
presentXSubsequentYears[line["X"]] = set([int(line['year'])])
'''
for each item in the dictionary, compare the count of unique year values
against the parameter-defined range. If count not equal, print the
difference set along with the key (X)
'''
for k, years in presentXSubsequentYears.items():
if len(years) != subsequentYearCount:
print(k, subsequentYears.difference(years))
findMissingYears(2010, 2014,
r'c:\users\new user\downloads\Q_29166343.csv')
What is the 0..7 column? That wasn't in your initial post. I didn't test my code against data that looked like that.
Please post a small csv file that is a representative sample of what you are actually using. It doesn't have to be big, but does need to represent the data.
Please post a small csv file that is a representative sample of what you are actually using. It doesn't have to be big, but does need to represent the data.
What were the first two parameter values you supplied to the findMissingYears function?
but none of the data you posted is in that range, just 2010-2012
The problem is that I don't have any data that includes those years or an X value of 12460.
I think I can tweak the code, but I need to be able to test the code.
I think I can tweak the code, but I need to be able to test the code.
Also, you are working with a CSV file, but you're posting some formatted listing of the data, which is space/tab separated. I created a csv file from your initial post, but I can't be sure it is an accurate transformation.
ASKER
for this data,
X year
13750 2018
15143 2018
15161 2017
15161 2018
15024 2018
19358 2017
9817 2016
1908 2015
1908 2016
output is
13750 {2016, 2017}
15143 {2016, 2017}
15161 {2016}
15024 {2016, 2017}
19358 {2016, 2018}
9817 {2017, 2018}
1908 {2017, 2018}
output should be
1908 {2017, 2018}
as only 1908 is in 2015.
X year
13750 2018
15143 2018
15161 2017
15161 2018
15024 2018
19358 2017
9817 2016
1908 2015
1908 2016
output is
13750 {2016, 2017}
15143 {2016, 2017}
15161 {2016}
15024 {2016, 2017}
19358 {2016, 2018}
9817 {2017, 2018}
1908 {2017, 2018}
output should be
1908 {2017, 2018}
as only 1908 is in 2015.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
my test for start/end years = 2015,2018 produced the following:
1908 {2017, 2018}
which is what you stated was the desired output.
ASKER
I have checked your code and it seems to work fine. Thanks a million for your effort.
I needed help with few things in this.
I wanted to delete all unwanted records. Like what we are printing now are the missing values from the subsequent years which I wanted to delete and come up with a new csv file with records for all years which are in start year.
like for this data
X year
13750 2015
15143 2018
15161 2015
15161 2016
15024 2018
19358 2017
9817 2016
9817 2017
9817 2018
1908 2015
1908 2016
1908 2017
1908 2018
should give this output
1908 2015
1908 2016
1908 2017
1908 2018
as only 1908 is in 2015 which has X for all other years.
I needed help with few things in this.
I wanted to delete all unwanted records. Like what we are printing now are the missing values from the subsequent years which I wanted to delete and come up with a new csv file with records for all years which are in start year.
like for this data
X year
13750 2015
15143 2018
15161 2015
15161 2016
15024 2018
19358 2017
9817 2016
9817 2017
9817 2018
1908 2015
1908 2016
1908 2017
1908 2018
should give this output
1908 2015
1908 2016
1908 2017
1908 2018
as only 1908 is in 2015 which has X for all other years.
Since the matching problem has been solved, probably best to open a new question about deleting records/lines.
ASKER
ok. Thanks. Can you please take a look at the new question?
https://www.experts-exchange.com/questions/29166962/Delete-unwanted-records.html#questionAdd
https://www.experts-exchange.com/questions/29166962/Delete-unwanted-records.html#questionAdd
will do
basically, iterate the years difference.
for k, years in presentXSubsequentYears.items():
if len(years) != subsequentYearCount:
for yr in subsequentYears.difference(years):
print(k, yr)
Assuming you have columns X and Year (columns names are case sensitive, change them in the code if required) in the csv file called data.csv then the following code will print the missing values on the console and save a csv file called missingvalues.csv with the missing values in the data.
Open in new window
Here is the output when ran on Jupyter Notebook...
Here is the output in the missingvalues.csv file...