MySQL join with group / limit / random

Ross Edwards
Ross Edwards used Ask the Experts™
on
I have two tables (simplified for clarity):

tblCustomer (id, my_brand, my_size, my_style, product_id)

tblProducts (id, size, brand, style)

I need to update tblCustomer.product_id with the id of a row from tblProducts, which matches on size, brand and style.

For example

update tblCustomers c set c.product_id =
(select p.id from tblProducts where
c.my_size = p.size and
c.my_style = p.style and
c.my_brand = p.brand
order by rand() limit1)

This works OK, but takes far too long to execute.  My tables are all correctly indexed.  There are around 500,000 cuxtomers to update, and 10,000 possible products, and for each customer I need to execute this query 30 times to populate different product options.

Is this the most efficient way of doing this query?

At the moment it takes about 20 mins for the above query, and my full process takes about 3 hours, which is not acceptable.

Many thanks!

Ross.
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Expert of the Quarter 2010
Expert of the Year 2010
Commented:
That query is not able to use indexes at all.
It will scan tblCustomers, and for each one, COMPLETELY scan tblProducts assigning a random number to EVERY row then taking the least [random] number.
Ross EdwardsTechnical Director

Author

Commented:
That might explain something then.  How to improve speed of this query then?
Commented:
You just have a lot of data to process i think it is the best way. because you have about 15000000000000 compares.
Acronis in Gartner 2019 MQ for datacenter backup

It is an honor to be featured in Gartner 2019 Magic Quadrant for Datacenter Backup and Recovery Solutions. Gartner’s MQ sets a high standard and earning a place on their grid is a great affirmation that Acronis is delivering on our mission to protect all data, apps, and systems.

Expert of the Quarter 2010
Expert of the Year 2010

Commented:
Do you have covering indexes on both sides?

c.my_size = p.size and
c.my_style = p.style and
c.my_brand = p.brand

e.g. tblcustomer (my_size, my_style, my_brand) ?

Make sure the most selective (that which narrows results down the most is put in front.  What this means is that if you have 10 sizes but 1000 brands and 4 styles, create the index thus

tblcustomer (my_brand, my_size, my_style)

That's all I can think of to make it faster
Commented:
Why are you ordering it by rand() ? If you are using a programming language (such as php or mysql stored procedures) you could create two sorted lists of the customers and products based on size, style, and brand, and then do a merge on the two lists which should result in subsecond response time on your query. Are you using a utility that could accomplish this?
I don't know if this is acceptable, (depending on the reason you are doing the update), but you can give a try to this:

create a new column for each table, called compo_key and fill it in this way:
UPDATE  tblCustomers SET compo_key=CONCAT(myBrand,myStyle,mySize);
UPDATE tblProduct SET compo_key=CONCAT(Brand,Style,Size)

The column compo_key must be VARCHAR (lenght accordingly to your data). Be sure to index both tables on compo_key;

Now your update statemente should be:

UPDATE tblCustomers c SET c.product_id =
(select p.id from tblProducts where
c.compo_key = p.compo_key  limit1)

Let me know if you try it.
Ross EdwardsTechnical Director

Author

Commented:
I will accept multiple solutions as various people contributed useful information.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial