Solved

Sorting a big table in MySQL

Posted on 2013-11-26
Medium Priority
444 Views
Last Modified: 2013-11-27
I have a table that consists purely of numeric values, but it is big in terms of both records (400K) and columns (80). The data in the table do not change at all.

Here is the structure:

id, sorting_column1, sorting_column2, ... sorting_column80

sorting_columns are decimal(5,2).

The queries I perform are like:

select id from mytable order by (sorting_column1*4 + sorting_column6*2 + sorting_column1*7) desc limit 100

However, the sorting expression is 100% dynamic, so I can't have all the possible sorting combinations precomputed in indexed columns. I do, however, have an index on every column separately, which helps with the calculation.

What is the best way to make this as fast as possible? I can store the data in memory, since they do not change at all. Should I use a search server instead of MySQL, for example?

Thank you a lot for your time!
Question by:infodigger
12 Comments
 
LVL 30

Expert Comment

by:IanTh
ID: 39676920
try

select id from mytable order by ((sorting_column1*4) + (sorting_column6*2) + (sorting_column1*7)) desc limit 100
 

Author Comment

by:infodigger
ID: 39676926
Hi IanTh and thanks for the reply,

Unfortunately, that didn't make any difference.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39677327
Where can we get the test data for this?  Can you give us a URL that will expose the test data in the form of a CSV file?  If so we can probably help you come up with the fastest queries.

Also, please tell us a little more about the application.  Where does this data come from?  How often is it updated?  What is the business purpose, etc?

Thanks, ~Ray
 

Author Comment

by:infodigger
ID: 39677408
Hi Ray,

I am attaching part of the table. The actual table is the same but with 400K records. In the attachment I have the first 1K records only.

Some sample sql queries that I run are:

select id from mytable order by ((ranking_factor_1*2) + (ranking_factor_32*4) + (ranking_factor_22*8)) desc limit 100

select id from mytable order by ((ranking_factor_8*6) + (ranking_factor_42*4) + (ranking_factor_30*2)) desc limit 100

The ORDER BY statement is generated in real time, so it's impossible to have all the combinations precomputed somehow.

What is possible, however, is to keep the whole data set in cache, but I really don't know whether that would help.

Thank you very much!
table-example.csv
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39677423
Thanks, I'll take a look at it and post back with whatever I can discern... More anon, ~Ray
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39678437
I don't think you can get much faster than what you have.  Couple of thoughts...

You said the data was DECIMAL(5,2), but the test data in the CSV is all integers.  If you really have integers, you may find that the processing is faster if the columns are defined as INT.  I shortened the column names, too; doubtful that mattered much, if at all.  I did not add any indexes to these columns.
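If you do convert, it's a one-time ALTER.  A sketch only (using your original column names; SMALLINT assumes the values fit in -32768..32767, otherwise use INT):

```sql
-- Sketch: convert the ranking columns from DECIMAL(5,2) to SMALLINT.
-- One MODIFY per column; MySQL rebuilds the table once for the whole statement.
ALTER TABLE mytable
  MODIFY sorting_column1 SMALLINT NOT NULL DEFAULT 0,
  MODIFY sorting_column2 SMALLINT NOT NULL DEFAULT 0;
  -- ...and so on for the remaining ranking columns.
```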

My query timings were between 1.2 and 1.4 milliseconds using this script.

<?php // RAY_temp_infodigger.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);


$sql = "DROP TABLE IF EXISTS EE_infodigger";
$res = $mysqli->query($sql);

// CREATING A TABLE FOR OUR TEST DATA
$sql = "CREATE TABLE EE_infodigger
( id    INT         NOT NULL AUTO_INCREMENT PRIMARY KEY";
$num = 1;
while ($num < 81)
{
    $pad = str_pad($num, 2, '0', STR_PAD_LEFT);
    $sql .= PHP_EOL . ", rf_$pad INT NOT NULL DEFAULT 0";
    $num++;
}
$sql .= ')';

// RUN THE QUERY TO CREATE THE TABLE
$res = $mysqli->query($sql);

// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res)
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// SHOW THE RESULTS OF THE QUERY
var_dump($res);


// TEST DATA, SAVED FROM THE POST AT EE
$csv = "http://laprbass.com/RAY_temp_infodigger.csv";
$fpo = fopen($csv, 'r');
if (!$fpo ) die("CRAP.  CANNOT OPEN $csv");

// GET THE FIELD NAMES FROM THE TOP OF THE CSV FILE
$top = fgetcsv($fpo);
$cnt = count($top);

// SET UP KEY NAMES FOR USE IN OUR QUERY
$query_cols = implode(',', $top);

// SET A ROW COUNTER
$counter = 0;

// KEEP TRACK OF ROWS THAT HAVE THE WRONG NUMBER OF FIELDS
$errors = array();

// LOOP THROUGH THE CSV RECORDS PERFORMING CERTAIN TESTS
while (!feof($fpo))
{
    $counter++;

    // GET A RECORD
    $csvdata = fgetcsv($fpo);

    // SKIP OVER EMPTY ROWS
    if (empty($csvdata)) continue;

    // CHECK THE NUMBER OF FIELDS
    if ($cnt != count($csvdata))
    {
        $errors[] = $counter;
        continue;
    }

    // ESCAPE THE INFORMATION FOR USE IN THE QUERY
    foreach ($csvdata as $ptr => $value)
    {
        $csvdata[$ptr] = $mysqli->real_escape_string($value);
    }

    // SET UP VALUE FIELDS
    $query_data = "'" . implode("', '", $csvdata) . "'";

    // SET UP A QUERY
    $sql = "REPLACE INTO EE_infodigger ( $query_cols ) VALUES ( $query_data )";

    if (!$res = $mysqli->query($sql))
	{
	    $err
	    = 'QUERY FAILURE:'
	    . ' ERRNO: '
	    . $mysqli->errno
	    . ' ERROR: '
	    . $mysqli->error
	    . ' QUERY: '
	    . $sql
	    ;
	    trigger_error($err, E_USER_ERROR);
	}
}


// SHOW THE NUMBER OF ROWS PROCESSED
echo "<br/>RECORDS PROCESSED $counter " . PHP_EOL;

// SHOW THE NUMBERS OF THE ROWS WITH THE WRONG NUMBER OF FIELDS
if (count($errors))
{
    echo "<br/>ROWS WITH THE WRONG NUMBER OF FIELDS: " . PHP_EOL;
    var_dump($errors);
}

// START THE TIMER AND RUN TWO QUERIES
$sw = new Stopwatch;
$sw->start('q1');
$sql = "select id from EE_infodigger order by ((rf_01*2) + (rf_32*4) + (rf_22*8)) desc limit 100";
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q1');

$sw->start('q2');
$sql = "select id from EE_infodigger order by ((rf_08*6) + (rf_42*4) + (rf_30*2)) desc limit 100";
$res = $mysqli->query($sql);
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q2');

// GET THE DISPLAY
echo PHP_EOL;
unset($sw);





class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo
          __CLASS__
        . '::'
        . __FUNCTION__
        . '() '
        ;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39678448
Hmm... Maybe a burst of inspiration.  You've got about 400X the amount of data I am testing with.  Wonder if it would be possible to reduce the size of the data set, maybe with a down-select using CREATE TEMPORARY TABLE ... ENGINE=MEMORY.
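Something along these lines, perhaps.  A sketch only (table and column names follow my test script and may need adjusting; the column list would be built per request from the factors that request actually uses):

```sql
-- Sketch: copy only the columns this request needs into an in-memory
-- temporary table, then sort the much narrower table.
CREATE TEMPORARY TABLE tmp_rank ENGINE=MEMORY
  SELECT id, rf_01, rf_22, rf_32
  FROM EE_infodigger;

SELECT id
FROM tmp_rank
ORDER BY ((rf_01*2) + (rf_32*4) + (rf_22*8)) DESC
LIMIT 100;

DROP TEMPORARY TABLE tmp_rank;
```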

Can you tell us a little more about the nature of the data and the queries?  This looks very, very academic.  If possible, tell us about the business rules, thanks.
 

Author Comment

by:infodigger
ID: 39678635
Hi Ray,

First of all, thank you for your deep interest in my question. I really appreciate the time you spent answering.

The data is used for a recommendation engine. Our algorithm populates this table once every few hours, but then the data stays the same.

Let's say you have 400,000 different items and you want to choose the best ones according to your preferences. Your preferences can be any of the 80 different ratings that we have pre-calculated for each item. So, depending on which preference you select first, second, third, and so on, we create an ORDER BY statement that sorts all items so we can select the ones best matching your personal preferences.

You are right about the columns; we left them DECIMAL because we initially thought the ratings would have decimal places, so we could make this faster by making them INT.

Do you think that getting the whole table into memory could help? I don't think we can break it into smaller pieces, since, to get the top records overall, we would again need to sort the top records from every piece.
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39678677
Using my small test data set, it looks like ENGINE=MEMORY shaved a fraction of a millisecond off the query time.  Maybe I'll try to scale the data up and see if there are more noticeable differences at scale.
 
LVL 111

Accepted Solution

by: Ray Paseur (earned 2000 total points)
ID: 39679171
I scaled the collection up to 100,000 rows.  I was unable to put the table into ENGINE=MEMORY at that size (it failed at 48,500 rows), but I could put it on disk.  Without adding indexes, the queries took 40ms to 50ms.  With indexes on the columns, the queries took 33ms to 38ms.

About all I can say is, "Wow you have a lot of data!"  I think the key to higher performance is going to be anything you can do to winnow down the 400,000 rows.
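One way to winnow, sketched under the assumption that your weights are always non-negative: pool the per-column top rows for just the factors a request uses, then rank only the pool.  This is a heuristic, not a guarantee (the true top 100 is not provably inside the pool; a larger per-column LIMIT reduces the risk), and the table/column names follow my test script:

```sql
-- Hypothetical sketch: build a candidate pool from the per-factor
-- top 2000 rows, deduplicated by the primary key on id.
CREATE TEMPORARY TABLE pool (PRIMARY KEY (id)) ENGINE=MEMORY
  SELECT id, rf_01, rf_22, rf_32
  FROM EE_infodigger ORDER BY rf_01 DESC LIMIT 2000;
INSERT IGNORE INTO pool
  SELECT id, rf_01, rf_22, rf_32
  FROM EE_infodigger ORDER BY rf_22 DESC LIMIT 2000;
INSERT IGNORE INTO pool
  SELECT id, rf_01, rf_22, rf_32
  FROM EE_infodigger ORDER BY rf_32 DESC LIMIT 2000;

-- Rank at most ~6000 rows instead of 400,000.
SELECT id FROM pool
ORDER BY ((rf_01*2) + (rf_32*4) + (rf_22*8)) DESC
LIMIT 100;
```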
 

Author Closing Comment

by:infodigger
ID: 39680025
Thank you very much for all the time spent on this question and for providing such detailed answers. As you mention in the last post, I will try to figure out some ways to split the data and make the new tables smaller!
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 39680481
Thanks for the points and thanks for using EE.  Best of luck with the project, ~Ray
