Solved

Loop in SAS

Posted on 2011-02-23
38
1,106 Views
Last Modified: 2013-11-16
Hi,

I have a dataset that is structured this way:

Acct_Nmbr            Date                   Count
        1              01/01/2011                3
        1              01/01/2011                3
        1              01/04/2011                -
        1              01/05/2011                4
        2              01/01/2011                6
        2              01/02/2011                -
        2              01/02/2011                -
        2              01/03/2011                -
        2              01/05/2011                7
        2              01/06/2011                7

Whenever the count is blank it is either equal to the last count, or the next future count.

For instance for acct 2:
on 01/02/2011 it is almost sure that count should be 6.

On the other hand for acct 1, it is more likely that count on 01/04/2011 is 4 and not 3 because the blank count is closer in time to the count 4 record.

That being said here is what I need to achieve for each account number:
1-fill in all the blanks with the last or next non blank count
2-preferably make the count take the closest value in time

If 2- is not possible then use the last value to fill in the blank

If someone knows how to do that that's awesome.

Thanks.

A
0
Comment
Question by:appc
  • 18
  • 15
  • 3
  • +1
38 Comments
 
LVL 40

Expert Comment

by:Sharath
ID: 34963957
Can you check this?
SELECT t1.Acct_Nmbr,t1.Date1,ISNULL(t1.Count1,t2.Count1) Count1 
  FROM your_table AS t1 
       LEFT JOIN (SELECT t1.Acct_Nmbr,t2.Date1,t1.Count1 
                    FROM (SELECT *, 
                                 CASE 
                                   WHEN INTCK(DAY,Pre_Date,Date1) < INTCK(DAY,Date1,Next_Date) THEN Pre_Date
                                   ELSE Next_Date 
                                 END Near_Date 
                            FROM (SELECT *, 
                                         (SELECT MAX(Date1) 
                                            FROM your_table t2 
                                           WHERE t1.Acct_Nmbr = t2.Acct_Nmbr 
                                                 AND t2.Date1 < t1.Date1 
                                                 AND t2.Count1 IS NOT NULL) Pre_Date, 
                                         (SELECT MIN(Date1) 
                                            FROM your_table t2 
                                           WHERE t1.Acct_Nmbr = t2.Acct_Nmbr 
                                                 AND t2.Date1 > t1.Date1 
                                                 AND t2.Count1 IS NOT NULL) Next_Date 
                                    FROM your_table t1 
                                   WHERE Count1 IS NULL) t1) AS t2 
                         JOIN your_table t1 
                           ON t2.Acct_Nmbr = t1.Acct_Nmbr 
                              AND t2.Near_Date = t1.Date1) AS t2 
         ON t1.Acct_Nmbr = t2.Acct_Nmbr 
            AND t1.Date1 = t2.Date1

Open in new window

0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 34975449
Hi there A,

Without using proc SQL you can code this using data sets and proc sort.  Proc sorts are (very) quick so that it  doesnt matter much doing 2 sorst.  If you have a really big input table then instead of sorting you can put an index on the table.

As you have set out the rules, it is a bit ad-hoc, trying to get a value, but it there could be the case (of a string of missing counts) where you may find that your suggested fill in method gives misleading information.  There is probably scope here for using statistical techniques for imputation which aims at deriving unbiases answers the questions you may need answers for from datasets with missing values.

All that said, the attached script does what you say, but note that can be problems for you if an account has a string of missing values or a missing value for the first or last row.

You could put the code into a macro and run it multiple times, but that will show you how much you are streteching the information you already have (and in biased ways).

Ian


EE26842472.sas.txt
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 34994972
Hi there A,

Here is the SQL code from Sharath_123:, put into proc SQL.  In the first part the code is unchanged except for formatting and allowance for the fact that SAS does not have the ISNULL function and the INTCK (interval arithemetic) requires the first argument to be a value not a symbolic name.

This produces the same results (well almost) as the previous solution using data setps and proc sorts.  The difference is due to the SQL "joins" where you have 2 records for acct 2 on 2nd Jan 2011 where the SQL code assumes that there is only 1 combination of each account and day.

Because the SQL gets a little involves (especially with all the multiplt uses of t1 and t2, I have broken it into parts so that you can see hat is going on.  The parts will fit back, each  section

        (
        select ....
        )    

being replaced by
        <our temp table name>

and the  <temp table name> being defined in a previous create clause.

Ian


defined table being
EE26842472-SQL.sas.txt
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 34994991
Sent incorrect version of SQL for SAS code, works OK but without final documentation.
New version attached

EE26842472-SQL.sas.txt
0
 
LVL 7

Assisted Solution

by:d507201
d507201 earned 20 total points
ID: 34999546
I don't use SQL for SAS very much, so here's a solution that uses DATA steps.  The first DATA step reads the data set twice, the second time starting with the second observation so as to move both the data and count up 1 to get the 'next' date and count.  The lag function is then used to get the previous date and count.  Then a second step calculates the date interval and assigns a newCount variable.  

A tricky part was what to do with sequential days of the same date where both have missing Count.  The solution I used was to use Proc Sort to retain only the first instance of the date.

I've left all the interim variables in the final data set but you can clean those up when you're comfortable with the solution and use it.

data test; infile cards;
input @9 acct 3. @24 when mmddyy10. @50 count 3.;
cards;
        1              01/01/2011                3
        1              01/01/2011                3
        1              01/04/2011                
        1              01/05/2011                4
        2              01/01/2011                6
        2              01/02/2011                
        2              01/02/2011                
        2              01/03/2011                
        2              01/05/2011                7
        2              01/06/2011                7
;
run;

proc sort data=test;
      by acct when;
run;

data new; merge test test(firstObs=2 keep=when count rename=(when=nextDay count=nextCount) );
      previousDay=lag1(when);
      previousCount=lag1(count);
      format when previousDay nextDay mmddyy10.;
run;

proc sort data=new nodupkey dupout=dups;
      by acct when;
run;

data final;
      retain acct when count newCount nextDay nextCount previousDay previousCount daysPast daysForward;
      set new;      
      by acct when;
      if count ne . then newCount=count;
      else do;
            daysPast=when-previousDay;
            daysForward=nextDay=when;

            if daysPast=daysForward then newCount=previousCount;
            else if daysPast > daysForward then newCount=nextCount;
            else newCount=previousCount;
      end;
run;

title 'Final Data Set for comparing Count and New Count';
proc print data=final;
run;
0
 
LVL 7

Expert Comment

by:d507201
ID: 34999601
Rats, I see an error in my solution.  It doesn't work when I fix it.  daysForward=nextDay=when should be daysForward=nextDay-when.
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35001835
Dear appc,

Please confirm that the count is only within account numbers and that the count does not accumulating across account numbers.  That is, if for a particular account number the first row (for that account number) in the table has a missing count then it is incorrect to consider the count of the last day of the previous account number.  Similarly if the count of the last day of a particular account number is missing then we should NOT consider the count of the first day of the next account number.

The SQL solution by "Sharath_123:" and the data step/sort solution by myself both assume that accounts form a block and the Count does not run from one account to the next.  (As well as my reworking of the SQL solution into SAS PROC SQL).  The solution by "d507201:" assumes a cumulative count going across accounts.

Hey d507201:, nice technique to merge the same dataset twice using the second instance starting from day 2. I like it!

Ian
0
 

Author Comment

by:appc
ID: 35003675
Guys,

thanks a lot for your help.

I haven't really been able to look at the proposed answers because my boss has given a higher priority project so I am a little hesitant as to what to do.

Ian,

the count is not cumulative accross account numbers. A count for a specific acc is completetely independent from the count of another account. And the count of an account can only increase by increments of 1 over time.

Thanks
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35003854
appc,

1/
"the count is not cumulative across account numbers".
This means that the "d507201" solution needs a slight adjustment to detect the change in account numbers.  (In the data step for new, when assigning previous and next count).

2/
"the count of an account can only increase by increments of 1 over time".
Does this means that the count can go up by either 0 or 1 each day?
Because not all days are shown this could mean that count can increase by 2 or 3 (say)  when there is a gap (in days) between records. Correct?

If this is the case then you can get better estimates for the missing counts - in some cases exact values - and in other cases just intelligent estimates.  However, depending on the importance of any summary calculations you are performing on the "final" dataset, there are statistical imputation techniques that will give you better estimates as well as a measure of the accuracy of the derived results.

Ian
0
 

Author Comment

by:appc
ID: 35003900
Ian,

2/

The increment can only be 1.  Short dates gaps are probably due to weekends, but longer gaps mean that there has been a change in count.
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35003946
appc,

does that mean that if (for each account) you were to remove all the records with blank counts that what is left will be a sequence (with repeats) but with no missing counts?

That is
3 3 4 5 5 5 5 6 6 6 6 7 8 9 10 10 11 12
would be possible, but
 3 3 4 5 5 5 5 6 6 6 6 7 9 9 10 12 12    (missing 8 and 11)
could not occur.
(Maybe not with those number of repeats)

Ian

0
 

Author Comment

by:appc
ID: 35033475
That's right Ian
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35033737
Appc,

Then the best way to fill in missing counts would be to first estimate the rate of rise in the count. Then use that and both backward and forward distances (imeasured in production days) to fill in values.  Current rules are close to that ideal for a lone missing value (count data exists on BOTH sides).
However for
a string (more than one in a row) of missing counts; or
when the first or last in an account number grouping is missing the count
it will be sub-optimal.

It all depends of how much importance needs to be attached to the data. and number of the more complicated cases.

Ian
0
 

Author Comment

by:appc
ID: 35033759
I really don't want to estimate anything.

All I want is to fill in the blanks with the nearest(in time) previous or next value.

If we can't account for time in the filling process then I am fine making a call like first we will fill in the blanks with the previous counts

So let's say we have:
2
2
.
.
.
3
3
3
3

The blanks would become 2.

But then if I have :

.
.
2
2
.
.
.
3
3

All the blanks would be 3

And if we had
2
2
2
.
.
.
3
3
3
.
.

It would 2s in the first string of blanks, 3s in the second one.

Please let me which code to use!

Thanks!!!
0
 
LVL 7

Expert Comment

by:d507201
ID: 35038189
I'm thinking that this might take multiple passes thru the data.  I presume that this code is for a repeating production process.  How many observations do you estimate will be in your largest data set?  Will the data contain variables other account number, date, and count?
0
 

Author Comment

by:appc
ID: 35038835
About 10,000 account numbers
0
 

Author Comment

by:appc
ID: 35038846
For each account number maybe between 20 and 100 observations
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35066081
appc,

Sorry for the break, the weekend got in thwe way.

Answer to question on what method works.

If we assume (for this display below using your second example) that each rows increment by exactly one day, then I calculate that we will have the following results.

If we represent by
A  the of Sharath_123 modified for proc SQL.
B  the data step and proc sort version I did
C  the data step version thermoduric did after modification to allow for changing account numbers.

then we will have the following values -

So let's say we have:
          A          B          C
2
2
.         2          2           2
.         2          2           .
.         3          3           3
3
3
3
3

The blanks would become 2.

But then if I have :
           A             B           C
.          2             .            .
.          2             2           2
2
2
.          2             2             2
.          2             2             .
.          3             3             3
3
3

All the blanks would be 3

And if we had
             A          B            C
2
2
2
.           2            2            2
.           2            2            .
.           3            3            3
3
3
3
.            3           3            3
.            3           3            .

It would 2s in the first string of blanks, 3s in the second one.

Another example

             A          B            C
2
2
2
.           2            2            2
.           2            2            .
.           2            .             .
.           3            .            .
.           3            3            3
3
3

Unless I am missing something vital, your suggested fill in values dont follow the rules, and none of the solutions would reproduce your given solution.

Solution A always gives the value closest in time irrespective of number of consecutive missing values.  It doesn't get the answers you set out, but I think the answers obeys the (bent) rules. Proc SQL can sometimes be very CPU intensive, especially with multiple joins.


Solution B gives the value just before or just after determined by number of days difference, with extra proviso that if it cant get an answer that way it takes the preceding value - as per rules.

Solutions B and C can easily be modified to continue to carry forward the last value, and with only a little extra effort solution B can carry backward the next known count values to always have a "nearest" answer except in the case of no count values present at all for a given account number. Solutions B & C are closest to the "rules" you specified.
And can be easily modified to always have a non-missing value with no extra compute time.

Both B & C should run reasonably quickly even with 1,000,000 records.

B & C closely follow the rules you set out, solution A took a gamble on what perhaps you wanted and doesn't follow the rules.  The difference only occurs when there is a sequence of more than 2 days with missing Count values.

Ian
0
How to improve team productivity

Quip adds documents, spreadsheets, and tasklists to your Slack experience
- Elevate ideas to Quip docs
- Share Quip docs in Slack
- Get notified of changes to your docs
- Available on iOS/Android/Desktop/Web
- Online/Offline

 
LVL 8

Expert Comment

by:ShannonEE
ID: 35066154
Hold your horses!,

Solution C was by    d507201  not  thermoduric.

Sorry there guys.

Ian
0
 

Author Comment

by:appc
ID: 35069452
From what I see solution A is what I want! As long as the blanks are filled in with a minimum of logic that's perfect.

I can do a remote submit on a server so I don't mind if the answer CPU sensitive.

So I can use A as it is and it will do what you described?
0
 

Author Comment

by:appc
ID: 35073692
Here is what I get in my log when I run solution A:

  proc sql;
261  create table remove as
262  SELECT a.Acct_Nmbr,a.Date_maint,ISNULL(a.Count2,b.Count2) as Count2
263    FROM Selection AS a
264         LEFT JOIN (SELECT a.Acct_Nmbr,b.Date_maint,a.Count2
265                      FROM (SELECT *,
266                                   CASE
267                                     WHEN INTCK(DAY,Pre_Date,Date_maint) <
267! INTCK(DAY,Date_maint,Next_Date) THEN Pre_Date
268                                     ELSE Next_Date
269                                   END as Near_Date
270                              FROM (SELECT *,
271                                           (SELECT MAX(Date_maint)
272                                              FROM Selection b
273                                             WHERE a.Acct_Nmbr = b.Acct_Nmbr
274                                                   AND b.Date_maint < a.Date_maint
275                                                   AND b.Count2 IS NOT NULL) as Pre_Date,
276                                           (SELECT MIN(Date_maint)
277                                              FROM Selection b
278                                             WHERE a.Acct_Nmbr = b.Acct_Nmbr
279                                                   AND b.Date_maint > a.Date_maint
280                                                   AND b.Count2 IS NOT NULL) as Next_Date
281                                      FROM Selection a
282                                     WHERE Count2 IS NULL) a) AS b
283                           JOIN Selection a
284                             ON b.Acct_Nmbr = a.Acct_Nmbr
285                                AND b.Near_Date = a.Date_maint) AS b
286           ON a.Acct_Nmbr = b.Acct_Nmbr
287              AND a.Date_maint = b.Date_maint
288  ;
ERROR: Function ISNULL could not be located.
ERROR: The following columns were not found in the contributing tables: DAY.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
289  quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.09 seconds
      cpu time            0.04 seconds
0
 

Author Comment

by:appc
ID: 35073926
Here is my code so that it is easier to work with.

Thanks for your help!
proc sql;
create table remove as
SELECT a.Acct_Nmbr,a.Date_maint,ISNULL(a.Count2,b.Count2) as Count2 
  FROM Selection a 
       LEFT JOIN (SELECT a.Acct_Nmbr,b.Date_maint,a.Count2 
                    FROM (SELECT *, 
                                 CASE 
                                   WHEN INTCK(DAY,Pre_Date,Date_maint) < INTCK(DAY,Date_maint,Next_Date) THEN Pre_Date
                                   ELSE Next_Date 
                                 END as Near_Date 
                            FROM (SELECT *, 
                                         (SELECT MAX(Date_maint) 
                                            FROM Selection b 
                                           WHERE a.Acct_Nmbr = b.Acct_Nmbr 
                                                 AND b.Date_maint < a.Date_maint 
                                                 AND b.Count2 IS NOT NULL) as Pre_Date, 
                                         (SELECT MIN(Date_maint) 
                                            FROM Selection b 
                                           WHERE a.Acct_Nmbr = b.Acct_Nmbr 
                                                 AND b.Date_maint > a.Date_maint 
                                                 AND b.Count2 IS NOT NULL) as Next_Date 
                                    FROM Selection a 
                                   WHERE Count2 IS NULL) a) AS b 
                         JOIN Selection a 
                           ON b.Acct_Nmbr = a.Acct_Nmbr 
                              AND b.Near_Date = a.Date_maint) b 
         ON a.Acct_Nmbr = b.Acct_Nmbr 
            AND a.Date_maint = b.Date_maint 
;
quit;

Open in new window

0
 

Author Comment

by:appc
ID: 35077170
I have made some adjustments but here is what I get now:



195  proc sql;
196  create table remove as
197  SELECT a.Acct_Nmbr,
198          a.Date_maint,
199              case when a.count2 in ('') then b.count2
200                  else a.count2
201              end as count2
202    FROM Selection a
203         LEFT JOIN (SELECT a.Acct_Nmbr,b.Date_maint,a.Count2
204                      FROM (SELECT *,
205                                   CASE
206                                     WHEN INTCK('DAY',Pre_Date,Date_maint) <
206! INTCK('DAY',Date_maint,Next_Date) THEN Pre_Date
207                                     ELSE Next_Date
208                                   END as Near_Date
209                              FROM (SELECT *,
210                                           (SELECT MAX(Date_maint)
211                                              FROM Selection b
212                                             WHERE a.Acct_Nmbr = b.Acct_Nmbr
213                                                   AND b.Date_maint < a.Date_maint
214                                                   AND b.Count2 IS NOT NULL) as Pre_Date,
215                                           (SELECT MIN(Date_maint)
216                                              FROM Selection b
217                                             WHERE a.Acct_Nmbr = b.Acct_Nmbr
218                                                   AND b.Date_maint > a.Date_maint
219                                                   AND b.Count2 IS NOT NULL) as Next_Date
220                                      FROM Selection a
221                                     WHERE Count2 IS NULL) a) AS b
222                           JOIN Selection a
223                             ON b.Acct_Nmbr = a.Acct_Nmbr
224                                AND b.Near_Date = a.Date_maint) b
225           ON a.Acct_Nmbr = b.Acct_Nmbr
226              AND a.Date_maint = b.Date_maint
227  ;
ERROR: Expression using equals (=) has components that are of different data types.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
228  quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.15 seconds
      cpu time            0.00 seconds

0
 

Author Comment

by:appc
ID: 35077511
Can someone help me?
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35078387
appc,

1/ The following code is essentially the code I posted above (ref 28/02/11 06:56 PM, ID: 34994991).  At the time I said that the SQL solution had an error which shows up for repeated days. The "proc sql" solution corrected that error, and fixed up the incompatable null test <<count2 in ('') >>.

I have taken the liberty of changing the table name alias from a, b, etc to meanginful names so you have a slightly better chance of knowing what is going on.

2/ There is much to be said for breaking the solution down into pieces (as I did before) and creating sub-tables, however that may reduce the chance of the code being optimised.  However if instead of creating a series of tables and doing the solution in bits, it would still be possible to get the benefits of a structured approach but keep run-time efficiencies by using views.  I will post that code next.

3/ I have added an "order by" clause because there is no gaurentee that the dataset will
come out in the same order as it is imput.

4/ I have added the "as" keyword to make it easier to read and less prone to errors for the able alias.


Code is
=====================================================

*****************************************
* Create a new table from the example 	*
* data to match the required table and	*
* column names.							*
****************************************; 
data	Selection;
	set	DataSet(rename=(Date=Date_maint Count=Count2));
run;

*****************************************
* Now run the SQL.						*
****************************************;
proc sql;


create table remove as
SELECT	base.Acct_Nmbr,
 		base.Date_maint,
		case	when base.count2 = . then bestCount.count2
				else base.count2
		end as count2
FROM	Selection	as	base

LEFT JOIN
		(
		SELECT	unique
				baseForMatching.Acct_Nmbr,
				bestDate.Date_maint,
				baseForMatching.Count2
		FROM	(
				SELECT	*,
						CASE	WHEN	INTCK('DAY',Pre_Date,Date_maint) <
										INTCK('DAY',Date_maint,Next_Date) THEN
											Pre_Date
								ELSE		Next_Date
						END as Near_Date
				FROM	(
						SELECT	*,
								(
								SELECT	MAX(Date_maint)
								FROM	Selection as pre
								WHERE	baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
										pre.Date_maint < baseNull.Date_maint AND
										pre.Count2 IS NOT NULL
								) as Pre_Date,
								(
								SELECT	MIN(Date_maint)
								FROM	Selection as post
								WHERE	baseNull.Acct_Nmbr = post.Acct_Nmbr AND
										post.Date_maint > baseNull.Date_maint AND
										post.Count2 IS NOT NULL
								) as Next_Date
						FROM	Selection as baseNull
 						WHERE	Count2 IS NULL
						) as preAndNextDate
				) AS bestDate
		JOIN	Selection as baseForMatching
		ON		bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
				bestDate.Near_Date = baseForMatching.Date_maint
		) as bestCount

ON		base.Acct_Nmbr = bestCount.Acct_Nmbr AND
		base.Date_maint = bestCount.Date_maint

order by
		base.Acct_Nmbr, base.Date_maint;

quit;

Open in new window

================================================
0
 
LVL 8

Accepted Solution

by:
ShannonEE earned 480 total points
ID: 35078670
appc,

Neater code with comments interspersed, which uses views.

======================================================

proc sql;

*********************************************
* First get all the previous and next dates	*
* for all rows with missing counts			*
********************************************;
create	view	preAndNextDate as
SELECT	*,
		(
		SELECT	MAX(Date_maint)
		FROM	Selection as pre
		WHERE	baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
				pre.Date_maint < baseNull.Date_maint AND
				pre.Count2 IS NOT NULL
		) as Pre_Date,
		(
		SELECT	MIN(Date_maint)
		FROM	Selection as post
		WHERE	baseNull.Acct_Nmbr = post.Acct_Nmbr AND
				post.Date_maint > baseNull.Date_maint AND
				post.Count2 IS NOT NULL
		) as Next_Date
FROM	Selection as baseNull
WHERE	Count2 IS NULL;
				
*****************************************
* Next choose the closest of the before	*
* and after date.						*
****************************************;
create	view	bestDate	as
SELECT	*,
		CASE	WHEN	INTCK('DAY',Pre_Date,Date_maint) <
						INTCK('DAY',Date_maint,Next_Date) THEN
							Pre_Date
				ELSE		Next_Date
		END as Near_Date
FROM	preAndNextDate;

*****************************************
* From best date find the associated	*
* count value.							*
* Note that for each account/date pair	*
* we only want one row regardless of	*
* the number of repeats in the source	*
* data in the "selection" table. Hence	*
* we specify "unique" at the top of the	*
* selection	list.						*
****************************************;
create	view	bestCount	as
SELECT	unique
		baseForMatching.Acct_Nmbr,
		bestDate.Date_maint,
		baseForMatching.Count2
FROM	bestDate
JOIN	Selection as baseForMatching
ON		bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
		bestDate.Near_Date = baseForMatching.Date_maint;

*****************************************
* Now create the final table putting in	*
* the best count values we found above	*
* for all the missing values.			*
*****************************************
* All the views created above will be	*
* put into the query for optimisation	*
* before the code is run.				*
****************************************;
create table remove as
SELECT	base.Acct_Nmbr,
 		base.Date_maint,
		case	when base.count2 = . then bestCount.count2
				else base.count2
		end as count2

FROM	Selection	as	base
LEFT JOIN
		bestCount

ON		base.Acct_Nmbr = bestCount.Acct_Nmbr AND
		base.Date_maint = bestCount.Date_maint

order by
		base.Acct_Nmbr, base.Date_maint;

quit;

Open in new window

============================================================

The code is formatted with TABs, which should display OK in the SAS editor.

Ian
0
 

Author Comment

by:appc
ID: 35078867
Look at the code I used: it is almost perfect, but some blanks are still in the dataset.


106
107  proc sql;
108
109
110  create table remove as
111  SELECT  base.Acct_Nmbr,
112          base.Date_maint,
113          case    when base.count2 = . then bestCount.count2
114                  else base.count2
115          end as count2
116  FROM    Selection   as  base
117
118  LEFT JOIN
119          (
120          SELECT  unique
121                  baseForMatching.Acct_Nmbr,
122                  bestDate.Date_maint,
123                  baseForMatching.Count2
124          FROM    (
125                  SELECT  *,
126                          CASE    WHEN    INTCK('DAY',Pre_Date,Date_maint) <
127                                          INTCK('DAY',Date_maint,Next_Date) THEN
128                                              Pre_Date
129                                  ELSE        Next_Date
130                          END as Near_Date
131                  FROM    (
132                          SELECT  *,
133                                  (
134                                  SELECT  MAX(Date_maint)
135                                  FROM    Selection as pre
136                                  WHERE   baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
137                                          pre.Date_maint < baseNull.Date_maint AND
138                                          pre.Count2 IS NOT NULL
139                                  ) as Pre_Date,
140                                  (
141                                  SELECT  MIN(Date_maint)
142                                  FROM    Selection as post
143                                  WHERE   baseNull.Acct_Nmbr = post.Acct_Nmbr AND
144                                          post.Date_maint > baseNull.Date_maint AND
145                                          post.Count2 IS NOT NULL
146                                  ) as Next_Date
147                          FROM    Selection as baseNull
148                          WHERE   Count2 IS NULL
149                          ) as preAndNextDate
150                  ) AS bestDate
151          JOIN    Selection as baseForMatching
152          ON      bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
153                  bestDate.Near_Date = baseForMatching.Date_maint
154          ) as bestCount
155
156  ON      base.Acct_Nmbr = bestCount.Acct_Nmbr AND
157          base.Date_maint = bestCount.Date_maint
158
159  order by
160          base.Acct_Nmbr, base.Date_maint;
NOTE: Invalid (or missing) arguments to the INTCK function have caused the function to return a
      missing value.
NOTE: Invalid argument to function INTCK. Missing values may be generated.
NOTE: Table WORK.REMOVE created, with 343428 rows and 3 columns.

161
162  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           22:50.61
      cpu time            22:46.92


proc sql;


create table remove as
SELECT	base.Acct_Nmbr,
 		base.Date_maint,
		case	when base.count2 = . then bestCount.count2
				else base.count2
		end as count2
FROM	Selection	as	base

LEFT JOIN
		(
		SELECT	unique
				baseForMatching.Acct_Nmbr,
				bestDate.Date_maint,
				baseForMatching.Count2
		FROM	(
				SELECT	*,
						CASE	WHEN	INTCK('DAY',Pre_Date,Date_maint) <
										INTCK('DAY',Date_maint,Next_Date) THEN
											Pre_Date
								ELSE		Next_Date
						END as Near_Date
				FROM	(
						SELECT	*,
								(
								SELECT	MAX(Date_maint)
								FROM	Selection as pre
								WHERE	baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
										pre.Date_maint < baseNull.Date_maint AND
										pre.Count2 IS NOT NULL
								) as Pre_Date,
								(
								SELECT	MIN(Date_maint)
								FROM	Selection as post
								WHERE	baseNull.Acct_Nmbr = post.Acct_Nmbr AND
										post.Date_maint > baseNull.Date_maint AND
										post.Count2 IS NOT NULL
								) as Next_Date
						FROM	Selection as baseNull
 						WHERE	Count2 IS NULL
						) as preAndNextDate
				) AS bestDate
		JOIN	Selection as baseForMatching
		ON		bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
				bestDate.Near_Date = baseForMatching.Date_maint
		) as bestCount

ON		base.Acct_Nmbr = bestCount.Acct_Nmbr AND
		base.Date_maint = bestCount.Date_maint

order by
		base.Acct_Nmbr, base.Date_maint;

quit;

Open in new window

0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35079207
appc,

Try the following on your raw data.  I suspect that there is something funny in the file.

***********************************************
* Find out where there is invalid dates *
***********************************************;
data test1;    
    set  Selection;
   row = _N_;
   if Date_maint = . then output;
run;
proc print data=test1 noobs; run;
*******************************************
* Find out account number groups *
* with no valid count data              *
*****************************************;
proc sql;

    create table test2 as
    select   Acct_Nmbr ,
                 count(Count2) as N,
                 sum(Count2 = .) as invalid
    from  Selection
    group  Acct_Nmbr;

    create table test3 as
    select  Acct_Nmbr
    from  Test2
    where invalid = N;

quit;

proc print data=test3 noobs; run;

Open in new window


==========================


Ian

0
 

Author Comment

by:appc
ID: 35086183
Actually I am fine with I got; some accounts have blanks because they had no record with a count, which is fine.

There is just one problem that seems to have affected a lot of accounts:

Acct           Date           count
1               05/25             4
1                05/26            4
4               05/25             6
4                05/26            6
4               05/27             6
4                05/28            6
4               05/29             .
7               05/27             8
7                05/28            8
7               05/29             .

For some reason accounts with blanks while having at least one count not blank all have the blank with a date that is exactly May 29 2010.

The rest is fine.

0
 

Author Comment

by:appc
ID: 35087535
Ian:

Here is what I got from your code:



64   *******************************************
65   * Find out account number groups *
66   * with no valid count data              *
67   *****************************************;
68   proc sql;
69
70       create table test2 as
71       select   Acct_Nmbr ,
72                    count(Count2) as N,
73                    sum(Count2 = .) as invalid
74       from  Selection
75       group  Acct_Nmbr;
NOTE: Table WORK.TEST2 created, with 10203 rows and 3 columns.

76
77       create table test3 as
78       select  Acct_Nmbr
79       from  Test2
80       where invalid = N;
NOTE: Table WORK.TEST3 created, with 208 rows and 1 columns.

81
82   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.18 seconds
      cpu time            0.17 seconds


83
84   proc print data=test3 noobs; run;

NOTE: There were 208 observations read from the data set WORK.TEST3.
NOTE: The PROCEDURE PRINT printed pages 1-4.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

0
 

Author Comment

by:appc
ID: 35088471
To be more precise, the unwanted blanks are on the records with the latest available date by acct number.

Example:

Acct           Date           count
1               05/25             4
1                05/26            4
4               05/25             6
4                05/26            6
4               05/27             6
4                05/28            6
4               05/29             .
7               05/27             8
7                05/28            .
7               05/28             .
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35089407
Hi there appc,

Yes I admit that the SQL code sometimes gets it wrong.

However please try the slight changes to the dataset/sort solution that II noted above. It now carries backward and forward values to the maximum extent.  It will only produce missing values if all counts are missing.  It also checks for missing dates which your tests show do occur.

Note that the rules are slightly different to those you first said, but are what I believe are what you now want.


First run the code below colplete on the example dataset, then if you are satisfied hook it up to your large dataset. - Also check the time taken.

On a production run you might want to change the statement on line 277

	out=final(keep=Acct_Nmbr Date_maint Count2 inputCount);

Open in new window


to

	out=final(keep=Acct_Nmbr Date_maint Count2);

Open in new window


Meanwhile I will check out the SQL and see where the problem is.
Ian
((Note code uses TAB for indenting. The tab spacing is set for SAS enhanced editor)).


/* ExpertsExchange26842472.sas */


*************************************************
* The data as supplied uses a minus sign (-)	*
* to mark missing values rather than the usual	*
* SAS value of period (.)						*
* Need to make up a format for inputting these	*
* values.										*
************************************************;
proc format;
	invalue	funnyCount
	'.' = .
	'-' = .
	0-hi	= best20.
	;
	**********************************************
	** This sets . and - to the usual missing	**
	** and all other values it uses normal SAS	**
	** format processing.						**
	** The new format is called  "funnyCount"	**
	*********************************************;
run;

*************************
* Read in our test data	*
************************;
data Selection;

	length	Acct_Nmbr
			Date_maint
			Count2		8;

	informat	Date_maint	mmddyy10.
				Count2		funnyCount.;

	format		Date_maint	date9.;

	input	Acct_Nmbr Date_maint Count2;

cards;
        1              01/01/2011                3
        1              01/01/2011                3
        1              01/04/2011                -
        1              01/05/2011                4
        2              01/01/2011                6
        2              01/02/2011                -
        2              01/02/2011                -
        2              01/03/2011                -
        2              01/05/2011                7
        2              01/06/2011                7
       11              05/28/2011                3
       11              05/28/2011                -
       11              05/28/2011                -
       11              05/29/2011                -
       12              05/24/2011                6
       12              05/25/2011                -
       12              05/26/2011                -
       12              05/27/2011                -
       12              05/28/2011                -
       12              05/29/2011                -
       13              05/28/2011                -
       13              05/28/2011                -
       13              05/28/2011                -
       13              05/29/2011               59
       14              05/24/2011                -
       14              05/25/2011                -
       14              05/26/2011                -
       14              05/27/2011                -
       14              05/28/2011                -
       14              05/29/2011              100
       16              05/28/2011                -
       16              05/28/2011               12
       16              05/28/2011                -
       16              05/29/2011                -
       17              05/24/2011                -
       17              05/25/2011                -
       17              05/26/2011                -
       17              05/27/2011               10
       17              05/28/2011                -
       17              05/29/2011                -
       18              05/25/2011                3
       18              05/27/2011                -
       18              05/28/2011                -
       18              05/28/2011                4
       18              05/29/2011                -
       19              05/24/2011                -
       19              05/25/2011               17
       19              05/26/2011                -
       19              05/27/2011                -
       19              05/28/2011                -
       19              05/29/2011               18
       21              05/25/2011                3
       21              05/27/2011                3
       21              05/28/2011                4
       21              05/28/2011                4
       21              05/29/2011                -
       22              05/24/2011                -
       22              05/25/2011                -
       22              05/26/2011                -
       22              05/27/2011                -
       22              05/28/2011                -
       22              05/29/2011                -
;;;;
run;


*********************************************
* Begin the processing.						*
* This assumes that the data set is already	*
* sorted by Acct_Nmbr and Date				*
* We get the Count and date from the last	*
* row with a valid date and count by using	*
* the retain statement to keep the value	*
* across rows. For this to produce useful	*
* results data set must be sorted by date.	*
* By asserting "by	Acct_Nmbr Date_maint"	*
* we can be sure the dataset is sorted.		*
********************************************;
data	backVals;
	set	Selection;
	by	Acct_Nmbr Date_maint;
	retain	prev_Count prev_Date;
	format	prev_Date	date9.;

	*********************************************
	* DO NOT copy date from previous account.	*
	********************************************;
	if first.Acct_Nmbr then
		do;
		prev_Count = .;
		prev_Date = .;
		end;

	*************************************
	* Put in a value to sort by.		*
	* This is necessary if there is two	*
	* records with the same account		*
	* number and date.					*
	* Otherwise we could just sort by	*
	* the account number and date.		*
	************************************;
	sortPosition = _N_;
	inputCount = Count2;

	*************************************
	* _N_ is a sas automatic variable	*
	* which gives input record number	*
	************************************;

	*************************************
	* Write out the record here before	*
	* updating "prev" values, which are	*
	* for subsequent records read in	*
	************************************;
	output;

	*********************************
	* Update our "prev" valid count	*
	* information, provided this	*
	* row has valid date and count.	*
	********************************;
	if	(Date_maint ^= .) and
		(Count2 ^= .) then
		do;
		prev_Date = Date_maint;
		prev_Count = Count2;
		end;

run;

*********************************************
* Next do a backward sort to get the values	*
* from the next row.						*
********************************************;

proc sort	data=backVals	out=backSorted;
	by	descending	sortPosition;
run;

*********************************************
* Now do the same as above when the data	*
* was sorted forward. because the dataset	*
* is now sorted backward we can pick up the	*
* the values form the "next" row.			*
********************************************;

data	filledInVals;
	set	backSorted;
	by	descending	Acct_Nmbr descending Date_maint descending sortPosition;

	retain next_Date next_Count;
	format		next_Date	date9.;

	*****************************************
	** BY statement not necessary here but	*
	* useful to check on correct sort		*
	****************************************;

	*********************************************
	* DO NOT copy date from NEXT account.		*
	********************************************;
	if first.Acct_Nmbr then
		do;
		next_Count = .;
		next_Date = .;
		end;

	*********************************
	* Now fill in any missing Count	*
	* using smallest day difference	*
	*********************************
	* Cant do calcs if Date missing	*
	********************************;
	if	(Count2 = .) and
		(Date_maint ^= .) then
		do;
		*****************************************
		* Dont strictly need the do - end pair	*
		* but with nested "if"s it is good for	*
		* you to have do - end so that you can	*
		* exactly see what the "else" applies	*
		* to.									*
		****************************************;


		*****************************************
		* Copy next_Count over if				*
		* it is available and is the shortest	*
		* interval								*
		* or if prev_Count is not available		*
		****************************************;

		if	(next_Date = .) or
			(next_Count = .) then	Count2 = prev_Count;
		else

		if	(prev_Date = .) or
			(prev_Count = .) then	Count2 = next_Count;
		else

		if	((next_Date - Date_maint) < (Date_maint - prev_Date)) then
					Count2 = next_Count;
		else		Count2 = prev_Count;

		end;

	*************************************
	* Write out the record here before	*
	* updating "next" values, which are	*
	* for subsequent records read in	*
	************************************;
	output;


	*********************************
	* Update our "next" valid count	*
	* information, provided this	*
	* row has valid date and count.	*
	********************************;
	if	(Date_maint ^= .) and
		(inputCount ^= .) then
		do;
		next_Count = Count2;
		next_Date = Date_maint;
		end;

run;

*************************************
* We now need the data sorted in a	*
* forward direction to put it back	*
* in original order.				*
************************************;

proc sort	data=filledInVals
			out=final(keep=Acct_Nmbr Date_maint Count2 inputCount);
	by	sortPosition;
run;


title"Final data set";
proc print data=final noobs;
by Acct_Nmbr;
run;

Open in new window

0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35091428
Hi there appc,

SQL code which appears to work.

Code uses views but is equivalent to all-in-one query.

Changes to definition of "Near Date" when only one od "Pre date" and "Next Date" are available.

Changed action if pre and next date are equi-distant.

Changed placement of "unique" to better reflect requirement if there are successive missing values.

Added code for situation where multiple rows had the same date.

Added checks for missing dates.

The run time using tables instead of views may be better/worse  (I'm not sure). You could try on your big data set.

With the views query structure it is equivalent to previous proc SQL solution.

Ian


/* EE26842472_sql.sas */

*********************************************
* SQL solution cast into proc SQL, using	*
* views to help debug and understand the	*
* process.									*
********************************************;


proc sql;
*********************************************
* First get all the previous and next dates	*
* for all dates with missing counts			*
********************************************;
create	view	preAndNextDate as
SELECT	unique
		*,
		(
		SELECT	MAX(Date_maint)
		FROM	Selection as pre
		WHERE	baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
				pre.Date_maint <= baseNull.Date_maint AND
				pre.Count2 ^= .
		) format=date9. as Pre_Date,
		(
		SELECT	MIN(Date_maint)
		FROM	Selection as post
		WHERE	baseNull.Acct_Nmbr = post.Acct_Nmbr AND
				post.Date_maint >= baseNull.Date_maint AND
				post.Count2 ^= .
		) format=date9. as Next_Date
FROM	Selection as baseNull
WHERE	Count2 = .;
				
*****************************************
* Next choose the closest of the before	*
* and after date.						*
****************************************;
create	view	bestDate	as
SELECT	*,
		CASE	WHEN	Pre_Date = .	then	Next_Date
				when	Next_Date = .	then	Pre_Date
				when	INTCK('DAY', Pre_Date, Date_maint) <=
						INTCK('DAY', Date_maint, Next_Date)
										THEN	Pre_Date
				ELSE							Next_Date
		END format=date9. as Near_Date
FROM	preAndNextDate;

*****************************************
* From best date find the associated	*
* count value.							*
* Note that for each account/date pair	*
* we only want one row regardless of	*
* the number of repeats in the source	*
* data in the "selection" table. First	*
* we reduce the rows by specify Count	*
* is not to be missing and then take	*
* the minimum value in case there is 2	*
* or more rows with the same date and a	*
* valid count.							*
****************************************;
create	view	bestCount	as
SELECT	baseForMatching.Acct_Nmbr,
		bestDate.Date_maint,
		min(baseForMatching.Count2) as Count2
FROM	bestDate
full JOIN
		Selection as baseForMatching
on		bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
		bestDate.Near_Date = baseForMatching.Date_maint
where	baseForMatching.Count2 ^= .
group by
		baseForMatching.Acct_Nmbr, bestDate.Date_maint;

*****************************************
* Now create the final table putting in	*
* the best count values we found above	*
* for all the missing values.			*
*****************************************
* All the views created above will be	*
* put into the query for optimisation	*
* before the code is run.				*
****************************************;
create table remove as
SELECT	base.Acct_Nmbr,
 		base.Date_maint,
		case	when base.count2 = . then	bestCount.count2
				else						base.count2
		end as count2,
		base.Count2 as initialCount

FROM	Selection	as	base
LEFT JOIN
		bestCount

ON		base.Acct_Nmbr = bestCount.Acct_Nmbr AND
		base.Date_maint = bestCount.Date_maint

order by
		base.Acct_Nmbr, base.Date_maint;

quit;

title "final data";
proc print data=remove noobs; by Acct_Nmbr; run;

Open in new window


0
 

Author Comment

by:appc
ID: 35095568
Thanks a lot for your help guys!

It worked. I used the following code and I just made a little adjustment.

Thanks again!


proc sql;

*********************************************
* First get all the previous and next dates	*
* for all rows with missing counts			*
********************************************;
create	view	preAndNextDate as
SELECT	*,
		(
		SELECT	MAX(Date_maint)
		FROM	Selection as pre
		WHERE	baseNull.Acct_Nmbr = pre.Acct_Nmbr AND
				pre.Date_maint < baseNull.Date_maint AND
				pre.Count2 IS NOT NULL
		) as Pre_Date,
		(
		SELECT	MIN(Date_maint)
		FROM	Selection as post
		WHERE	baseNull.Acct_Nmbr = post.Acct_Nmbr AND
				post.Date_maint > baseNull.Date_maint AND
				post.Count2 IS NOT NULL
		) as Next_Date
FROM	Selection as baseNull
WHERE	Count2 IS NULL;
				
*****************************************
* Next choose the closest of the before	*
* and after date.						*
****************************************;
create	view	bestDate	as
SELECT	*,
		CASE	WHEN    Next_date=. then Pre_date
				when	INTCK('DAY',Pre_Date,Date_maint) <
						INTCK('DAY',Date_maint,Next_Date) THEN
							Pre_Date
				ELSE		Next_Date
		END as Near_Date
FROM	preAndNextDate;

*****************************************
* From best date find the associated	*
* count value.							*
* Note that for each account/date pair	*
* we only want one row regardless of	*
* the number of repeats in the source	*
* data in the "selection" table. Hence	*
* we specify "unique" at the top of the	*
* selection	list.						*
****************************************;
create	view	bestCount	as
SELECT	unique
		baseForMatching.Acct_Nmbr,
		bestDate.Date_maint,
		baseForMatching.Count2
FROM	bestDate
JOIN	Selection as baseForMatching
ON		bestDate.Acct_Nmbr = baseForMatching.Acct_Nmbr AND
		bestDate.Near_Date = baseForMatching.Date_maint;

*****************************************
* Now create the final table putting in	*
* the best count values we found above	*
* for all the missing values.			*
*****************************************
* All the views created above will be	*
* put into the query for optimisation	*
* before the code is run.				*
****************************************;
create table testi.remove as
SELECT	base.Acct_Nmbr,
		base.system_completion_code,
		base.agent_completion_code,
 		base.Date_maint,
		base.event_time,
		case	when base.count2 = . then bestCount.count2
				else base.count2
		end as count2,
		base.balance,
		base.count1

FROM	Selection	as	base
LEFT JOIN
		bestCount

ON		base.Acct_Nmbr = bestCount.Acct_Nmbr AND
		base.Date_maint = bestCount.Date_maint

order by
		base.Acct_Nmbr, base.Date_maint;

quit;

Open in new window

0
 

Author Comment

by:appc
ID: 35095581
FYI I just added stuff on line 32.

Antoine
0
 

Author Closing Comment

by:appc
ID: 35095628
Thanks again
0
 
LVL 8

Expert Comment

by:ShannonEE
ID: 35100687
Antoine,

Happy the solution meets your requirements.

Did you also try the solution in post
ID: 35089407?  I believe it should run substantially faster (nowhere like 22 minutes).

If you plan to put this into production or run it again then you should consider either

the data step or latest SQL version because they both protect against

1/  First day for an account number having a missing count.
CASE	WHEN	Pre_Date = .		then	Next_Date
	when	Next_Date = .	then	Pre_Date

Open in new window


2/ Rows having the same date,
     rows with the same data having a mix of missing count and valid count

/3 better treats the situation where a missing count is exactly midway between the
two nearest valid counts

================

Ian
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

A short article about problems I had with the new location API and permissions in Marshmallow
Many companies are looking to get out of the datacenter business and to services like Microsoft Azure to provide Infrastructure as a Service (IaaS) solutions for legacy client server workloads, rather than continuing to make capital investments in h…
An introduction to basic programming syntax in Java by creating a simple program. Viewers can follow the tutorial as they create their first class in Java. Definitions and explanations about each element are given to help prepare viewers for future …
Video by: Steve
Using examples as well as descriptions, step through each of the common simple join types, explaining differences in syntax, differences in expected outputs and showing how the queries run along with the actual outputs based upon a simple set of dem…

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now