?
Solved

Extrapolate from range, count unique rows

Posted on 2006-03-30
5
Medium Priority
?
1,677 Views
Last Modified: 2008-01-09
Given a visitDate range, for every day covered by a search that occured in that range, count unique records.

Sample data (see end of question for SQL to create this table):
+---------+--------------+----------------+--------------+
| visitor  | visitDate     | searchStart   | searchEnd  |
+---------+--------------+----------------+--------------+
|       1   | 2006-03-30 | 2006-05-01  | 2006-05-03 |
|       2   | 2006-03-30 | 2006-04-28  | 2006-05-02 |
|       1   | 2006-03-30 | 2006-05-01  | 2006-05-03 |
|       2   | 2006-03-30 | 2006-04-30  | 2006-05-02 |
|       3   | 2006-03-31 | 2006-05-01  | 2006-05-05 |
+---------+--------------+---------------+---------------+

In the sample data, the two rows for visitor 1 are identical. They count as only one. The two rows for vistor 2 have different searchStart values. They count as two.

Given: visitDate = 2006-03-30
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-04-28 |      1   | // visitor 2's first search
|  2006-04-29 |      1   |
|  2006-04-30 |      2   | // visitor 2's two searches
|  2006-05-01 |      3   | // visitor 1's search and visitor 2's two searches
|  2006-05-02 |      3   |
|  2006-05-03 |      1   | // visitor 1's search
+---------------+--------+

Given: visitDate = 2006-03-31
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-05-01 |      1   | // all from visitor 3
|  2006-05-02 |      1   |
|  2006-05-03 |      1   |
|  2006-05-04 |      1   |
|  2006-05-05 |      1   |
+---------------+--------+

Given: visitDate BETWEEN 2006-03-30 AND 2006-03-31
Result:
+---------------+--------+
| searchDate  | visits   |
+---------------+--------+
|  2006-04-28 |      1   | // the previous two results added together
|  2006-04-29 |      1   |
|  2006-04-30 |      2   |
|  2006-05-01 |      4   |
|  2006-05-02 |      4   |
|  2006-05-03 |      2   |
|  2006-05-04 |      1   |
|  2006-05-05 |      1   |
+---------------+--------+

Restrictions:
MySQL 4.0.12
Temporary tables not allowed.
Multiple queries, user variables, and use of a scripting language (PHP) are acceptable.
The table contains 500K rows.  The query (or queries) should complete in a reasonable time. 10 seconds is ok, 5 minutes is not.  To that end, table structure may be changed, indexes added, as appropriate.

SQL to create sample data:

CREATE TABLE visitData (
   visitor int,
   visitDate date,
   searchStart date,
   searchEnd date
);
INSERT INTO visitData VALUES (1, '2006-03-30', '2006-05-01', '2006-05-03');
INSERT INTO visitData VALUES (2, '2006-03-30', '2006-04-28', '2006-05-02');
INSERT INTO visitData VALUES (1, '2006-03-30', '2006-05-01', '2006-05-03');
INSERT INTO visitData VALUES (2, '2006-03-30', '2006-04-30', '2006-05-02');
INSERT INTO visitData VALUES (2, '2006-03-31', '2006-05-01', '2006-05-05');
0
Comment
Question by:snoyes_jw
  • 2
  • 2
5 Comments
 
LVL 11

Accepted Solution

by:
star_trek earned 1000 total points
ID: 16337038
Query should be
select searchStart,searchEnd,count(*) as cnt from visitData where visitDate = '2006-03-30' group by
visitor,visitDate,searchStart,searchEnd;
//for the visistData between you can use the following
select searchStart,searchEnd,count(*) as cnt from visitData where visitDate >= '2006-03-30' and visitDate <= '2006-03-31'  group by  visitor,visitDate,searchStart,searchEnd;
using your data it return the following result

+-------------+------------+----------+
| searchStart | searchEnd  | cnt   |
+-------------+------------+----------+
| 2006-05-01  | 2006-05-03 |        2 |
| 2006-04-28  | 2006-05-02 |        1 |
| 2006-04-30  | 2006-05-02 |        1 |
+-------------+------------+----------+

In PHP Code, you can you can use a while loop to get the recordset, the code is something like
$dt_array = array(); //array to store the each date
while($row = mysql_fetch_array($result)) {
      $dt1 = $row[0];
      $dt2 = $row[1];
      for($p=$dt1;$p<=$dt2;) {
               $dt_array[$p] = $row[2];
               $p = date("Y-m-d",strtotime($p." +1 days"));
      }
}

//print dt_array, dt_array contains all the dates and count
0
 
LVL 11

Expert Comment

by:star_trek
ID: 16338609
In above the statement
$dt_array[$p] = $row[2];

should be
$dt_array[$p] += $row[2];
0
 
LVL 33

Author Comment

by:snoyes_jw
ID: 16346022
Might work.  I was reluctant to do that at the beginning of the project because I feared it would would be too much data for our web servers to handle in a PHP array.  But it turns out I can trim some outlying data, so I won't have more than a thousand or so entries in the $dt_array, and since each is just an integer, I think we have the memory to handle that.

One concern is the query.  I'd need to change it to COUNT(DISTINCT visitor, visitDate, searchStart, searchEnd), and when I've done that, it takes longer to finish than I can allow.
0
 
LVL 14

Assisted Solution

by:psadac
psadac earned 1000 total points
ID: 16352634
if you create a table containing days :

CREATE TABLE days (
  day date NOT NULL,
  PRIMARY KEY  (day)
);

INSERT INTO days VALUES ('2006-04-26'), ('2006-04-27'), ('2006-04-28'), ('2006-04-29'), ('2006-04-30'), ('2006-05-01'), ('2006-05-02'), ('2006-05-03'), ('2006-05-04'), ('2006-05-05'), ('2006-05-06'), ('2006-05-07');

you could have your result with this simple query :

SELECT d.day, COUNT(DISTINCT v.visitor, v.visitdate, v.searchstart, v.searchend)
FROM days AS d
  LEFT JOIN visitdata AS v ON d.day BETWEEN v.searchstart AND v.searchend
WHERE d.day BETWEEN $begindate AND $enddate
GROUP BY d.day
ORDER BY d.day;

but you will probably have the same performance problems you encountered with COUNT(DISTINCT visitor, visitDate, searchStart, searchEnd)
i have created a visit table with 2.5 M rows with random data and it takes 3 minutes to complete, with an index on  (visitor, visitDate, searchStart, searchEnd)
0
 
LVL 33

Author Comment

by:snoyes_jw
ID: 16360632
Yeah, I've tried a solution very much like that, and as you stated, it's too slow.  I think I'm going to have to switch gears and go with some sort of summary table.
0

Featured Post

Get expert help—faster!

Need expert help—fast? Use the Help Bell for personalized assistance getting answers to your important questions.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this blog post, we’ll look at how ClickHouse performs in a general analytical workload using the star schema benchmark test.
In this article, I’ll talk about multi-threaded slave statistics printed in MySQL error log file.
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

601 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question