Link to home
Start Free TrialLog in
Avatar of awalkinthepark
awalkinthepark

asked on

Position of an item in a large recordset

Hope this is not too convoluted. I have shown one solution, but I question the approach.
I'm hoping that someone will have a better idea.  There is probably some repetition in the description,
I wanted to convey the idea, not optimize it to fewest words.

Problem description:  
Determine what ordinal position a single value occurs in an ordered list of all values, and, express this as a percentage.  
Essentially asking:  "what's the rating of my value in the entire set of values?"
Mean and standard deviation don't really work on this because the data set is not close to normal.


Simple example :
There are 100 rows, values 1 to 10, lots of duplicates. But containing only the distinct values 1 to 10.
Your number is 8
So your value is in the top 80% of all values.
Real case - there are million rows of real numbers. Ranging from  -75.234234 ( whatever) to +92.23423 ( again, whatever)
------------------------------------------------------------------
Keep reading...trying to make it clear. Thanks

Assume table of n rows, with  numeric column containing random numbers.  There are  duplicate values.
Problem is to  determine what position a specific value is in this set of numbers, when it's reduced to a duplicate free set.

I can get a list of distinct ordered values and the rowid like this:

select  Targetvalue  , DENSE_RANK() over (ORDER BY Targetvalue  ) as rowid  from  mydatatable

Let's say there are one million rows, now it's reduced to distinct 400,000.

I want to find the rowid, or placement of a specific value in the ordered set of all values,
then  divide by the total rows ( the max rowid will be the same value)  and get the  percentage placement of the single sought value.

This is does work, but maybe someone has a better approach:

-- Find the value, easy:
1) Make a prepared table containing the distinct ordered values and the row ids.
select  Targetvalue  , DENSE_RANK() over (ORDER BY Targetvalue  ) as rowid  into _OrderedValues  from  Mytable

2) Then just do a look up, get the rowid, or absolute placement in the set.
select rowid from _OrderedValues   where  targetvalue = X


Ok, now, what percentage placement is my target value in  the set?
I need the total number of rows, and the rowid for the sought value.
rowid / max(rowid) = %

This works but again,  I question the efficiency of the entire approach:

select rowid,targetvalue , cast(rowid as real)/ cast(maxrow as real) from
_OrderedValues  join (select  MAX(rowid)maxrow  from  _OrderedValues  )  A  on  maxrow > 0
where  targetvalue = 8.322303 ( example random value sought)

This  example returns  0.9123968 in my data.
I can state that  target value is in the top 91% of the entire set.

As these values don't change very often, I'm thinking it's better to make various lookup tables to avoid having
the full data set crunched over every time a user makes a request.

Or does anyone have a efficient way to do this on a pretty large table without having
to make some canned tables for look ups?

Hope this is clearly enough stated.
Thanks
Avatar of Ioannis Paraskevopoulos
Ioannis Paraskevopoulos
Flag of Greece image

Hi,

I really do not know if my approach would be better or more efficient, but it might get you away of some writes.

1. Use a CTE for your distinct values:

;
WITH MyDenseRankCTE AS
(
    SELECT Targetvalue  , 
                  DENSE_RANK() over (ORDER BY Targetvalue  ) as rowid  
    FROM   mydatatable
)

Open in new window


If you would select from MyDenseRankCTE (you can name it whatever else you like) it would get the exact same results as _OrderedValues table.

You may now do your calculation:

SELECT @rowid / MAX(rowid) AS Percentage 
FROM   MyDenseRankCTE 

Open in new window


Summing up:

DECLARE @rowid AS FLOAT ;
SET @rowid = 8.322303 ;

WITH MyDenseRankCTE AS
(
    SELECT Targetvalue  , 
                  DENSE_RANK() over (ORDER BY Targetvalue  ) as rowid  
    FROM   mydatatable
)

SELECT @rowid / MAX(rowid) AS Percentage 
FROM   MyDenseRankCTE ;

Open in new window


Again, i do not know if this would be more efficient. You could have a look at the execution plans maybe.

Giannis
ASKER CERTIFIED SOLUTION
Avatar of DBAduck - Ben Miller
DBAduck - Ben Miller
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
You could also create a separate table that held all unique values with a count of each value.

The advantage of this is that triggers could maintain that table dynamically.  Then, when you looked up a value, you'd just have to add predetermined totals less than the current value, not resequence the entire data set over and over.

Periodically, "just to be safe", you could do a complete rebuild of the unique values total table to make sure it accurately matches the data.
Avatar of awalkinthepark
awalkinthepark

ASKER

thanks.