Extrapolate from range, count unique rows

Given a visitDate range, for every day covered by a search that occured in that range, count unique records.

Sample data (see end of question for SQL to create this table):
+---------+--------------+----------------+--------------+
| visitor  | visitDate     | searchStart   | searchEnd  |
+---------+--------------+----------------+--------------+
|       1   | 2006-03-30 | 2006-05-01  | 2006-05-03 |
|       2   | 2006-03-30 | 2006-04-28  | 2006-05-02 |
|       1   | 2006-03-30 | 2006-05-01  | 2006-05-03 |
|       2   | 2006-03-30 | 2006-04-30  | 2006-05-02 |
|       3   | 2006-03-31 | 2006-05-01  | 2006-05-05 |
+---------+--------------+---------------+---------------+

In the sample data, the two rows for visitor 1 are identical. They count as only one. The two rows for vistor 2 have different searchStart values. They count as two.

Given: visitDate = 2006-03-30
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-04-28 |      1   | // visitor 2's first search
|  2006-04-29 |      1   |
|  2006-04-30 |      2   | // visitor 2's two searches
|  2006-05-01 |      3   | // visitor 1's search and visitor 2's two searches
|  2006-05-02 |      3   |
|  2006-05-03 |      1   | // visitor 1's search
+---------------+--------+

Given: visitDate = 2006-03-31
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-05-01 |      1   | // all from visitor 3
|  2006-05-02 |      1   |
|  2006-05-03 |      1   |
|  2006-05-04 |      1   |
|  2006-05-05 |      1   |
+---------------+--------+

Given: visitDate BETWEEN 2006-03-30 AND 2006-03-31
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-04-28 |      1   | // the previous two results added together
|  2006-04-29 |      1   |
|  2006-04-30 |      2   |
|  2006-05-01 |      4   |
|  2006-05-02 |      4   |
|  2006-05-03 |      2   |
|  2006-05-04 |      1   |
|  2006-05-05 |      1   |
+---------------+--------+

Restrictions:
MySQL 4.0.12
Temporary tables not allowed.
Multiple queries, user variables, and use of a scripting language (PHP) are acceptable.
The table contains 500K rows.  The query (or queries) should complete in a reasonable time. 10 seconds is ok, 5 minutes is not.  To that end, table structure may be changed, indexes added, as appropriate.

SQL to create sample data:

CREATE TABLE visitData (
   visitor int,
   visitDate date,
   searchStart date,
   searchEnd date
);
INSERT INTO visitData VALUES (1, '2006-03-30', '2006-05-01', '2006-05-03');
INSERT INTO visitData VALUES (2, '2006-03-30', '2006-04-28', '2006-05-02');
INSERT INTO visitData VALUES (1, '2006-03-30', '2006-05-01', '2006-05-03');
INSERT INTO visitData VALUES (2, '2006-03-30', '2006-04-30', '2006-05-02');
INSERT INTO visitData VALUES (2, '2006-03-31', '2006-05-01', '2006-05-05');
LVL 33
snoyes_jwAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

star_trekCommented:
Query should be
select searchStart,searchEnd,count(*) as cnt from visitData where visitDate = '2006-03-30' group by
visitor,visitDate,searchStart,searchEnd;
//for the visistData between you can use the following
select searchStart,searchEnd,count(*) as cnt from visitData where visitDate >= '2006-03-30' and visitDate <= '2006-03-31'  group by  visitor,visitDate,searchStart,searchEnd;
using your data it return the following result

+-------------+------------+----------+
| searchStart | searchEnd  | cnt   |
+-------------+------------+----------+
| 2006-05-01  | 2006-05-03 |        2 |
| 2006-04-28  | 2006-05-02 |        1 |
| 2006-04-30  | 2006-05-02 |        1 |
+-------------+------------+----------+

In PHP Code, you can you can use a while loop to get the recordset, the code is something like
$dt_array = array(); //array to store the each date
while($row = mysql_fetch_array($result)) {
      $dt1 = $row[0];
      $dt2 = $row[1];
      for($p=$dt1;$p<=$dt2;) {
               $dt_array[$p] = $row[2];
               $p = date("Y-m-d",strtotime($p." +1 days"));
      }
}

//print dt_array, dt_array contains all the dates and count
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
star_trekCommented:
In above the statement
$dt_array[$p] = $row[2];

should be
$dt_array[$p] += $row[2];
0
snoyes_jwAuthor Commented:
Might work.  I was reluctant to do that at the beginning of the project because I feared it would would be too much data for our web servers to handle in a PHP array.  But it turns out I can trim some outlying data, so I won't have more than a thousand or so entries in the $dt_array, and since each is just an integer, I think we have the memory to handle that.

One concern is the query.  I'd need to change it to COUNT(DISTINCT visitor, visitDate, searchStart, searchEnd), and when I've done that, it takes longer to finish than I can allow.
0
psadacCommented:
if you create a table containing days :

CREATE TABLE days (
  day date NOT NULL,
  PRIMARY KEY  (day)
);

INSERT INTO days VALUES ('2006-04-26'), ('2006-04-27'), ('2006-04-28'), ('2006-04-29'), ('2006-04-30'), ('2006-05-01'), ('2006-05-02'), ('2006-05-03'), ('2006-05-04'), ('2006-05-05'), ('2006-05-06'), ('2006-05-07');

you could have your result with this simple query :

SELECT d.day, COUNT(DISTINCT v.visitor, v.visitdate, v.searchstart, v.searchend)
FROM days AS d
  LEFT JOIN visitdata AS v ON d.day BETWEEN v.searchstart AND v.searchend
WHERE d.day BETWEEN $begindate AND $enddate
GROUP BY d.day
ORDER BY d.day;

but you will probably have the same performance problems you encountered with COUNT(DISTINCT visitor, visitDate, searchStart, searchEnd)
i have created a visit table with 2.5 M rows with random data and it takes 3 minutes to complete, with an index on  (visitor, visitDate, searchStart, searchEnd)
0
snoyes_jwAuthor Commented:
Yeah, I've tried a solution very much like that, and as you stated, it's too slow.  I think I'm going to have to switch gears and go with some sort of summary table.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
MySQL Server

From novice to tech pro — start learning today.