Link to home
Start Free TrialLog in
Avatar of fb1990
fb1990

asked on

Cohort Analysis

Hello EE,

Can someone please help me with a SQL code to generate a cohort analysis from the Oracle database?  I have attached a sample data.  I want to see calculate what % of the customers acquired in a specified month are still active in Month1, Month2, Month3, etc. The goal is to take the data to excel for final rollup and visualization, but I need the code to model the data
Cohort-Dataset.xlsx
Avatar of awking00
awking00
Flag of United States of America image

Can you take the first 20 records for example and show the results you would expect?

I ran this with the first 19 records and tested for customers acquired in January, 2017 and only showed three months, but you can expand it to meet your requirements -
with cte as
(select trunc(registrationdt,'mm') registrationdt, lastactivedt, ceil(months_between(lastactivedt, last_day(registrationdt))) monthsactive,
 count(*) over (order by 1) tot
 from yourtable
 where trunc(registrationdt,'mm') = to_date('201701','YYYYMM')),--Select the date of the month to check
cte2 as
(select trunc(registrationdt,'mm') month,
 sum(case when monthsactive >= 1 then 1 else 0 end) as month1,
 sum(case when monthsactive >= 2 then 1 else 0 end) as month2,
 sum(case when monthsactive >= 3 then 1 else 0 end) as month3
 from cte
 group by trunc(registrationdt,'mm')
 order by trunc(registrationdt,'mm'))
select distinct cte2.month,
to_char(round(cte2.month1/cte.tot,2)*100)||'%' month1pct,
to_char(round(cte2.month2/cte.tot,2)*100)||'%' month2pct,
to_char(round(cte2.month3/cte.tot,2)*100)||'%' month3pct
from cte inner join cte2
on trunc(cte.registrationdt,'mm') = cte2.month;

Avatar of fb1990
fb1990

ASKER

Thanks, awking00.  I really appreciate your help.  Can you please help add record count to the resultset to show the number of records in the cohort?
My example produced the following result fro customers acquired in January, 2017:
MONTH     MONTH1P MONTH2P MONTH3P
--------- ------- ------- -------
01-JAN-17 100%    100%    40%

Open in new window

There were five records that were acquired in January, 2017. So, would you like to see something like this?
MONTH     MONTH1P MONTH2P MONTH3P   COHORTS
--------- ------- ------- -------   -------
01-JAN-17 100%    100%    40%             5

Open in new window


If that is what you want change the final query removing the distinct keyword, adding a count(*) and group by statements -
select cte2.month, -- Distinct keyword removed
to_char(round(cte2.month1/cte.tot,2)*100)||'%' month1pct,
to_char(round(cte2.month2/cte.tot,2)*100)||'%' month2pct,
to_char(round(cte2.month3/cte.tot,2)*100)||'%' month3pct,
count(*) cohorts
from cte inner join cte2
on trunc(cte.registrationdt,'mm') = cte2.month
group by cte2.month,
to_char(round(cte2.month1/cte.tot,2)*100)||'%',
to_char(round(cte2.month2/cte.tot,2)*100)||'%',
to_char(round(cte2.month3/cte.tot,2)*100)||'%';
Avatar of fb1990

ASKER

Thanks again, awking00 . The result should look like this for 201701 and 201702 cohorts

 
CohortCountMonth1Month2Month3Month4Month5Month6Month7Month8Month9Month10Month11
201701422100%91%84%76%69%61%54%45%39%29%21%
201702342100%94%88%80%74%65%57%49%40%31%23%

and the raw numbers should look like this for the number of customers who are still active each month from the original cohort
 
CohortCountMonth1Month2Month3Month4Month5Month6Month7Month8Month9Month10Month11
20170142242038635332029325922819216412490
20170234234232230127425222419616613810578
ASKER CERTIFIED SOLUTION
Avatar of awking00
awking00
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
To get headings exactly as you wanted -
to get raw numbers
with cte as
(select trunc(registrationdt,'mm') registrationdt, count(*) over (partition by trunc(registrationdt,'mm') order by 1) tot, lastactivedt,
 ceil(months_between(lastactivedt, last_day(registrationdt))) monthsactive
 from yourtable
 where trunc(registrationdt,'mm') in (to_date('201701','YYYYMM'),to_date('201707','YYYYMM')))
select to_char(registrationdt,'YYYYMM') cohort, tot as count,
 sum(case when monthsactive >= 1 then 1 else 0 end) as month1,
 sum(case when monthsactive >= 2 then 1 else 0 end) as month2,
 sum(case when monthsactive >= 3 then 1 else 0 end) as month3,
 sum(case when monthsactive >= 4 then 1 else 0 end) as month4,
 sum(case when monthsactive >= 5 then 1 else 0 end) as month5,
 sum(case when monthsactive >= 6 then 1 else 0 end) as month6,
 sum(case when monthsactive >= 7 then 1 else 0 end) as month7,
 sum(case when monthsactive >= 8 then 1 else 0 end) as month8,
 sum(case when monthsactive >= 9 then 1 else 0 end) as month9,
 sum(case when monthsactive >= 10 then 1 else 0 end) as month10,
 sum(case when monthsactive >= 11 then 1 else 0 end) as month11
 from cte
 group by to_char(registrationdt,'YYYYMM'),tot
 order by to_char(registrationdt,'YYYYMM'),tot;

Open in new window

to get percentages
with cte as
(select trunc(registrationdt,'mm') registrationdt, lastactivedt, ceil(months_between(lastactivedt, last_day(registrationdt))) monthsactive,
 count(*) over (order by 1) tot
 from yourtable
 where trunc(registrationdt,'mm') in (to_date('201701','YYYYMM'),to_date('201707','YYYYMM'))),
cte2 as
(select trunc(registrationdt,'mm') month,
 sum(case when monthsactive >= 1 then 1 else 0 end) as month1,
 sum(case when monthsactive >= 2 then 1 else 0 end) as month2,
 sum(case when monthsactive >= 3 then 1 else 0 end) as month3,
 sum(case when monthsactive >= 4 then 1 else 0 end) as month4,
 sum(case when monthsactive >= 5 then 1 else 0 end) as month5,
 sum(case when monthsactive >= 6 then 1 else 0 end) as month6,
 sum(case when monthsactive >= 7 then 1 else 0 end) as month7,
 sum(case when monthsactive >= 8 then 1 else 0 end) as month8,
 sum(case when monthsactive >= 9 then 1 else 0 end) as month9,
 sum(case when monthsactive >= 10 then 1 else 0 end) as month10,
 sum(case when monthsactive >= 11 then 1 else 0 end) as month11
 from cte
 group by trunc(registrationdt,'mm')
 order by trunc(registrationdt,'mm'))
select to_char(cte2.month,'YYYYMM') cohort,count(*) count,
to_char(round(cte2.month1/cte.tot,2)*100)||'%' month1pct,
to_char(round(cte2.month2/cte.tot,2)*100)||'%' month2pct,
to_char(round(cte2.month3/cte.tot,2)*100)||'%' month3pct,
to_char(round(cte2.month4/cte.tot,2)*100)||'%' month4pct,
to_char(round(cte2.month5/cte.tot,2)*100)||'%' month5pct,
to_char(round(cte2.month6/cte.tot,2)*100)||'%' month6pct,
to_char(round(cte2.month7/cte.tot,2)*100)||'%' month7pct,
to_char(round(cte2.month8/cte.tot,2)*100)||'%' month8pct,
to_char(round(cte2.month9/cte.tot,2)*100)||'%' month9pct,
to_char(round(cte2.month10/cte.tot,2)*100)||'%' month10pct,
to_char(round(cte2.month11/cte.tot,2)*100)||'%' month11pct
from cte inner join cte2
on trunc(cte.registrationdt,'mm') = cte2.month
group by cte2.month,
to_char(round(cte2.month1/cte.tot,2)*100)||'%',
to_char(round(cte2.month2/cte.tot,2)*100)||'%',
to_char(round(cte2.month3/cte.tot,2)*100)||'%',
to_char(round(cte2.month4/cte.tot,2)*100)||'%',
to_char(round(cte2.month5/cte.tot,2)*100)||'%',
to_char(round(cte2.month6/cte.tot,2)*100)||'%',
to_char(round(cte2.month7/cte.tot,2)*100)||'%',
to_char(round(cte2.month8/cte.tot,2)*100)||'%',
to_char(round(cte2.month9/cte.tot,2)*100)||'%',
to_char(round(cte2.month10/cte.tot,2)*100)||'%',
to_char(round(cte2.month11/cte.tot,2)*100)||'%';

Open in new window

Avatar of fb1990

ASKER

Hello awking00, Can check this part of the code for me:

count(*) over (partition by trunc(registrationdt,'mm') order by 1)
and  
count(*) over (order by 1) tot

The numbers are higher than the ones displayed in the example cohort that I gave above.

Thanks for all your help
Sorry about that. My first attempt took the position that you would only be entering one month to check. Try changing that to count(*) over (partition by trunc(registrationdt,'mm') order by 1) tot.
Avatar of fb1990

ASKER

Hello  awking00,

Yes, I am checking one month at a time. But the numbers are still high thereby creating lower active % in outer periods. 
I may be wrong about changing that count statement. Can you send me insert statements for the cohorts that I can use to test to see if we're coming up with the same results. I don't know why the active percentages would be lower and need to look at actual data.
I know that was a fair amount of effort, but what I would really like to see is the inserts in a text file that I could simply copy and paste in sqlplus (the code snippet shows 5:Insert ..., for example so I can't copy and paste from there). Also the inserts as you have created them show errors. There's no apostrophe after the registrationdt plus the inserts need to include the to_date function. The following format works for me:
insert into test_tab values (to_date('1/1/2017','mm/dd/yyyy'), to_date('3/22/2017','mm/dd/yyyy'));
Note - the attribute list after test_tab is optional since the table only has the two fields.
Avatar of fb1990

ASKER

Ok, thank you. Do you still want me to send the inserts?  Thanks again for your help.

Sorry I couldn't get back to you sooner. I've been out all afternoon for a medical appointment. I would still like to have those inserts to see if I get the expected results or if I also get higher numbers and, subsequently, lower percentages.

Avatar of fb1990

ASKER

Thank you so much. My hope is, you are well.  Here is the sample insert

sample cohort 201701.txt
Running the query with the data provided from your inserts returns the following results:
For the raw numbers -
COHORT      COUNT MONTH1 MONTH2 MONTH3 MONTH4 MONTH5 MONTH6 MONTH7 MONTH8 MONTH9 MONTH10 MONTH11
------ ---------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- -------
201701        401    387    348    314    288    264    229    195    168    137      98      71

Open in new window

For the percentages -
COHORT      COUNT MONT MONT MONT MONT MONT MONT MONT MONT MONT MONT MONT
------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
201701        401 97%  87%  78%  72%  66%  57%  49%  42%  34%  24%  18%

Open in new window

It seems the only thing that differs from what you indicated earlier for results comes from the total number of records where you showed 422 but there were only 401 inserts.
Avatar of fb1990

ASKER

Yes, this is better.  Can you please send me the code?  WOW

Or did use the code you already posted.
Avatar of fb1990

ASKER

Thank you so much for all your help.  I really appreciate you. Thanks.
This was an intriguing problem. Glad I could help :-)
Avatar of fb1990

ASKER

Indeed.  Thank you.