Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
?
Solved

get 25 random rows from huge mysql table - but fast....

Posted on 2013-12-06
10
Medium Priority
?
986 Views
Last Modified: 2014-06-16
Hi

How do i retrieve 25 random rows from a big mysql table, but fast?

The table has 100 mio rows - and the queries/solutions I tried (found by googling around) took very long or gave errors about the tmp table that mysql tried to create for the query.

Any ideas how to solve this?

(Some background information: I have multiple java clients accessing the table and they process the table data in batches of 25 rows. If I don't have random rows all clients process the same rows. I need multiple clients to speed things up, just one client would have weeks to process all rows.)

Thanks!
0
Comment
Question by:netslt
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 3
  • 2
  • +1
10 Comments
 
LVL 15

Expert Comment

by:JimFive
ID: 39701418
A better solution would be to have your clients mark the rows they are working and have the clients select the next 25 unmarked rows.

Alternatively, give each client a number from 0 to numclients-1 and only allow them to get rows where the rowid mod numclients = their own number

Finally, to answer the asked question:
write a query that returns the top 25 rows ordered by a column of random numbers.
something like:
SELECT Rand(), * FROM TABLE order by 1 limit 1,25
0
 
LVL 27

Expert Comment

by:Cornelia Yoder
ID: 39701511
Is there a primary key, and especially is it numeric?

If so, use php to generate 25 random numbers (very fast) and then retrieve those 25 rows with a single MySQL query.
0
 

Author Comment

by:netslt
ID: 39701607
Thank you for the 2 comments.

I have to add that after processing each row has a field

processed = 1

as when it is not yet processed then it is

processed = 0

So I the SQL query is always "..... where processed = 0"

Over time many more rows have processed = 1, so it is difficult to access rows with a predefinded random number (yodercm solution) or with the mod numclients (JimFive solution).

So if you have another solution similar to that but that works with the "processed" flag, then please let me know....

JimFive: the query "SELECT Rand(), * FROM rowpool order by 1 limit 1,25;" gives me an error.

JimFive: I already had tried to mark the rows as soon as a client has got them, but its not fast enough, other clients can already get it while client1 gets it & updates the mark (there are a few miliseconds between retrieving and updating)

Thanks
0
Will your db performance match your db growth?

In Percona’s white paper “Performance at Scale: Keeping Your Database on Its Toes,” we take a high-level approach to what you need to think about when planning for database scalability.

 
LVL 27

Accepted Solution

by:
Cornelia Yoder earned 2000 total points
ID: 39701732
If you need to ultimately process every row, then you need a better method than randomly selecting them.  Random methods do not guarantee that a given row will ever be processed.

Suppose you are using 100 java clients to process 100 million rows.   Create a field in the table and fill it in with a random number between 1 and 100.  

That will divide your database into 100 1-million row groups.  Then each client can randomly (or better yet sequentially) process its own million rows with no overlap.

When a row is created, generate a random number from 1 to 100 (your number of clients) for that field.

You can of course change the numbering to fit the actual number of clients you have running.

That should cut your search time down quite a lot and avoid the possibility of a row never being randomly chosen for processing.
0
 
LVL 15

Expert Comment

by:JimFive
ID: 39701828
I already had tried to mark the rows as soon as a client has got them, but its not fast enough
Mark the rows first with the client id then retrieve the rows to the client.

e.g.
update table set processing = myclientid

THEN
select from table where processing = myclientid.

JimFive: the query "SELECT Rand(), * FROM rowpool order by 1 limit 1,25;" gives me an error.
Ok, I looked it up, in mysql you use:  SELECT * FROM rowpool order by RAND() limit 25
0
 
LVL 27

Expert Comment

by:Cornelia Yoder
ID: 39701918
RAND still doesn't solve the problem of a row never being selected.

And yes, fill in a client id at the time the row is created.  Let the client select it later for processing.
0
 

Author Comment

by:netslt
ID: 39701947
SELECT * FROM rowpool order by RAND() limit 25 is correct but gives me an error:

Incorrect key file for table '/tmp/#sql_2b_0.MYI'; try to repair it.

I think it tries to generate a temp table but times out because of the amount of data...



Regarding "update table set processing = myclientid" - how do I update just 25 next records that do not have processed = 1 without the other clients interfering?
0
 
LVL 25

Expert Comment

by:Tomas Helgi Johannsson
ID: 39701963
Hi!

Your N clients should process the data from first row to the last, dividing the rows into M chunks of equal size X rows.
For this you could create a sequence function (similar to Oracle's nextval ) that tells client A(i) how many rows has been processed and marks (sets the sequence to be seqno = seqno + Xrows + 1

Then each client can select their chunk of data by executing

select *
from table LIMIT secno , X

so for secno = 1000 and X = 100
you will get 100 rows starting from rowid = 1000 to 1100

Unfortunatly MySQL doesn't have this kind of sequence built in but here is a good solution that you could use as a substitute

http://www.microshell.com/database/mysql/emulating-nextval-function-to-get-sequence-in-mysql/

This will help your N clients to process X rows of your Y million row table fast and accurately and no rows will be processed twice or left out from the process.
Hope this helps.

Regards,
    Tomas Helgi
0
 
LVL 15

Expert Comment

by:JimFive
ID: 39702014
Regarding "update table set processing = myclientid" - how do I update just 25 next records that do not have processed = 1 without the other clients interfering?
UPDATE tablename
SET processing = myclientID
WHERE tablename.processed = 0 and tablename.processing =0
LIMIT 25

Open in new window


Since this is a single statement, it is atomic and will set all 25 at once with no interference from the other clients.
0
 
LVL 27

Expert Comment

by:Cornelia Yoder
ID: 39702016
>>>Regarding "update table set processing = myclientid" - how do I update just 25 next records that do not have processed = 1 without the other clients interfering?

Once you have a client id in each row, assigning that row to a particular client, each client just uses

SELECT ... WHERE processed = 0 AND clientid = (client id processing program) LIMIT 25
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo…
In part one, we reviewed the prerequisites required for installing SQL Server vNext. In this part we will explore how to install Microsoft's SQL Server on Ubuntu 16.04.
In this video, Percona Solution Engineer Dimitri Vanoverbeke discusses why you want to use at least three nodes in a database cluster. To discuss how Percona Consulting can help with your design and architecture needs for your database and infras…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

721 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question