hbiz
asked on
MySQL Database design and query help
I have the following tables:
users (id, first_name, last_name)
category (id, name)
rank(id, user_id, rank)
Each user can belong to several categories. And all users are in the rank table and have a value between 0.0 and 1.0, where 0 is the lowest rank and 1 is the highest. I’d like to setup additional tables to create the following webpage:
A visitor to the page (identified by either one of the recorded ids in the user table, or a numeric representation of their ip address) chooses a category and is presented with two randomly chosen users from the users table such that:
1) the visiting user_id has not seen this pairing in a period of 24 hours
2) the two users belong to the chosen category
3) the two users are within 1 rank value of each other. Let me explain that last criteria - if all the ranks were sorted, the two chosen users would have adjacent ranks.
This is a hard one and I can’t for the life of me figure it out how to design my tables and query to answer this efficiently
I truly appreciate any help on this front.
Thanks
users (id, first_name, last_name)
category (id, name)
rank(id, user_id, rank)
Each user can belong to several categories. And all users are in the rank table and have a value between 0.0 and 1.0, where 0 is the lowest rank and 1 is the highest. I’d like to setup additional tables to create the following webpage:
A visitor to the page (identified by either one of the recorded ids in the user table, or a numeric representation of their ip address) chooses a category and is presented with two randomly chosen users from the users table such that:
1) the visiting user_id has not seen this pairing in a period of 24 hours
2) the two users belong to the chosen category
3) the two users are within 1 rank value of each other. Let me explain that last criteria - if all the ranks were sorted, the two chosen users would have adjacent ranks.
This is a hard one and I can’t for the life of me figure it out how to design my tables and query to answer this efficiently
I truly appreciate any help on this front.
Thanks
ASKER
mwvisa1,
thanks for your responses but this is def not an academic question, it's for a personal project I'm working on. I have tried many variations already but they all suffer from poor efficiency. I phrased the question in that matter and omitted the details of my specific project in an effort to get to an answer to the core algorithm.
but to prove that here's my query at the moment which I don't think is efficient enough (note it is using different tables that are specific to my project but the general concept is the same as my initial question)
SELECT lowerImages.image_id AS lower_image,
max(higherImages.image_id) AS higher_image
FROM global_rank AS lowerImages, global_rank AS higherImages
WHERE lowerImages.rank < higherImages.rank
AND 1 NOT IN (select 1 from ranked_up where
lowerImages.image_id = ranked_up.image_id
AND ranked_up.user_id = $user_id
AND ranked_up.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY))
AND 1 NOT IN (
SELECT 1 from matchups where user_id = $userId
AND lower_image_id = lowerImages.image_id
AND higher_image_id = higherImages.image_id
UNION
SELECT 1 from matchups where user_id = $user_id
AND lower_image_id = higherImages.image_id
AND higher_image_id = lowerImages.image_id
)
GROUP BY 1
Hopefully this proves that I'm trying to solve a real world problem and not doing homework :)
thanks for your time.
PS I forgot to add one more criteria to the initial quesiton:
4) the visiting user is presented with two users and with the one on the left having lower rank the one on the right. The user can click on the left image and that will swap the two users' ranks (essentially ranking up user on the left). This is tracked in a table called rankedUp.
The final criteria in selecting the two random users is that the one on the left (lower rank) must not have been ranked up by the visiting user in the last 24 hours.
thanks for your responses but this is def not an academic question, it's for a personal project I'm working on. I have tried many variations already but they all suffer from poor efficiency. I phrased the question in that matter and omitted the details of my specific project in an effort to get to an answer to the core algorithm.
but to prove that here's my query at the moment which I don't think is efficient enough (note it is using different tables that are specific to my project but the general concept is the same as my initial question)
SELECT lowerImages.image_id AS lower_image,
max(higherImages.image_id)
FROM global_rank AS lowerImages, global_rank AS higherImages
WHERE lowerImages.rank < higherImages.rank
AND 1 NOT IN (select 1 from ranked_up where
lowerImages.image_id = ranked_up.image_id
AND ranked_up.user_id = $user_id
AND ranked_up.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY))
AND 1 NOT IN (
SELECT 1 from matchups where user_id = $userId
AND lower_image_id = lowerImages.image_id
AND higher_image_id = higherImages.image_id
UNION
SELECT 1 from matchups where user_id = $user_id
AND lower_image_id = higherImages.image_id
AND higher_image_id = lowerImages.image_id
)
GROUP BY 1
Hopefully this proves that I'm trying to solve a real world problem and not doing homework :)
thanks for your time.
PS I forgot to add one more criteria to the initial quesiton:
4) the visiting user is presented with two users and with the one on the left having lower rank the one on the right. The user can click on the left image and that will swap the two users' ranks (essentially ranking up user on the left). This is tracked in a table called rankedUp.
The final criteria in selecting the two random users is that the one on the left (lower rank) must not have been ranked up by the visiting user in the last 24 hours.
hbiz, thanks for sharing the query. I am not sure it is clear to me how you are storing the pairings still, but it sounds like from your comment that the 24 hour test is based on rank up and not necessarily pairing. As far as performance, here are some notes (some are best practices and not necessarily required for efficiency):
SELECT lowerImages.image_id AS lower_image,
max(higherImages.image_id) AS higher_image
-- I would use ANSI-compliant joins
-- global_rank AS lowerImages JOIN global_rank AS higherImages
FROM global_rank AS lowerImages, global_rank AS higherImages
-- this would become the ON of the JOIN above
-- the Cartesian Product based on "<" will have poorer performance, typically *[1]
WHERE lowerImages.rank < higherImages.rank
-- I would try out NOT EXISTS or JOIN
-- A JOIN to rows that are "<=" may work better *[2]
AND 1 NOT IN (select 1 from ranked_up where
lowerImages.image_id = ranked_up.image_id
AND ranked_up.user_id = $user_id
AND ranked_up.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY))
-- I see now, these are the pairings. *[3]
AND 1 NOT IN (
SELECT 1 from matchups where user_id = $userId
AND lower_image_id = lowerImages.image_id
AND higher_image_id = higherImages.image_id
UNION
SELECT 1 from matchups where user_id = $user_id
AND lower_image_id = higherImages.image_id
AND higher_image_id = lowerImages.image_id
)
-- This is VERY poor practice that you are allowed to get away with in MySQL
-- First, using column index does not work here, so this is not column 1, but literal '1'
-- MySQL lets you NOT include all columns in SELECT that are not aggregated
-- We may get rid of the need for this totally, but otherwise include ALL non-aggregate columns
GROUP BY 1
One way to escape the GROUP BY and possibly help with performance:
SELECT lo.image_id AS lower_image
-- rank should be indexed DESC or combination of rank, image_id if you truly need MAX image_id
-- sub-queries sometimes perform better, depending on the scenario.
, (SELECT hi.image_id
FROM global_rank AS hi
WHERE hi.rank > lo.rank
AND NOT EXISTS (
SELECT 1
FROM matchups AS m
WHERE m.lower_image_id IN (lo.image_id, hi.image_id)
AND m.higher_image_id IN (hi.image_id, lo.image_id)
)
ORDER BY hi.rank DESC LIMIT 1) AS higher_image
FROM global_rank AS lo
-- this works if every user will be in ranked_up
JOIN (
SELECT image_id
FROM ranked_up
WHERE user_id = $user_id
AND created_at <= DATE_SUB(NOW(), INTERVAL 1 DAY)
) AS rk ON rk.image_id = lo.image_id
;
Hopefully the combination of techniques shown helps you find the best combination for performance. I would use EXPLAIN in front of your current SELECT to get a better idea of what it is doing or not doing performance wise. Then you can try on something like I showed.
Kevin
SELECT lowerImages.image_id AS lower_image,
max(higherImages.image_id)
-- I would use ANSI-compliant joins
-- global_rank AS lowerImages JOIN global_rank AS higherImages
FROM global_rank AS lowerImages, global_rank AS higherImages
-- this would become the ON of the JOIN above
-- the Cartesian Product based on "<" will have poorer performance, typically *[1]
WHERE lowerImages.rank < higherImages.rank
-- I would try out NOT EXISTS or JOIN
-- A JOIN to rows that are "<=" may work better *[2]
AND 1 NOT IN (select 1 from ranked_up where
lowerImages.image_id = ranked_up.image_id
AND ranked_up.user_id = $user_id
AND ranked_up.created_at > DATE_SUB(NOW(), INTERVAL 1 DAY))
-- I see now, these are the pairings. *[3]
AND 1 NOT IN (
SELECT 1 from matchups where user_id = $userId
AND lower_image_id = lowerImages.image_id
AND higher_image_id = higherImages.image_id
UNION
SELECT 1 from matchups where user_id = $user_id
AND lower_image_id = higherImages.image_id
AND higher_image_id = lowerImages.image_id
)
-- This is VERY poor practice that you are allowed to get away with in MySQL
-- First, using column index does not work here, so this is not column 1, but literal '1'
-- MySQL lets you NOT include all columns in SELECT that are not aggregated
-- We may get rid of the need for this totally, but otherwise include ALL non-aggregate columns
GROUP BY 1
One way to escape the GROUP BY and possibly help with performance:
SELECT lo.image_id AS lower_image
-- rank should be indexed DESC or combination of rank, image_id if you truly need MAX image_id
-- sub-queries sometimes perform better, depending on the scenario.
, (SELECT hi.image_id
FROM global_rank AS hi
WHERE hi.rank > lo.rank
AND NOT EXISTS (
SELECT 1
FROM matchups AS m
WHERE m.lower_image_id IN (lo.image_id, hi.image_id)
AND m.higher_image_id IN (hi.image_id, lo.image_id)
)
ORDER BY hi.rank DESC LIMIT 1) AS higher_image
FROM global_rank AS lo
-- this works if every user will be in ranked_up
JOIN (
SELECT image_id
FROM ranked_up
WHERE user_id = $user_id
AND created_at <= DATE_SUB(NOW(), INTERVAL 1 DAY)
) AS rk ON rk.image_id = lo.image_id
;
Hopefully the combination of techniques shown helps you find the best combination for performance. I would use EXPLAIN in front of your current SELECT to get a better idea of what it is doing or not doing performance wise. Then you can try on something like I showed.
Kevin
ASKER CERTIFIED SOLUTION
membership
Create a free account to see this answer
Signing up is free and takes 30 seconds. No credit card required.
ASKER
seems to be working for me.
Hopefully we can help. It appears this is academic in nature, so will try my best to explain so you learn HOW for yourself. Therefore, some of my responses will be probing questions, whose answers should help you determine the table design.
1) if you need to keep pairings unique over a specific time period, you need to consider add a column that has a timestamp. Additionally, where are you storing the "pairings?"
2) how are users associated to categories?
3) after handling the other aspects of this, it would seem like you can handle the adjacent requirement by selecting candidates for pairing based on an order by rank. I guess you could use that to pick a pair, then if they have been together, then move to the next two. Remember MySQL has LIMIT which has an OFFSET. So you could skip two rows, then grab the next two easily.
Hope that helps get started.