Solved

Sorting of big table in mySQL

Posted on 2013-11-26
12
418 Views
Last Modified: 2013-11-27
I have a table which consists purely of number values but it's big in terms of records (400K) and columns (80). The data on the table do not change at all.

Here is the structure:

id, sorting_column1, sorting_column2, ... sorting_column80

sorting_columns are decimal(5,2).

The queries I perform are like:

select id from mytable order by (sorting_column1*4 + sorting_column6*2 + sorting_column1*7) desc limit 100

however, the sorting statements is 100% dynamic so I can't have all the possible sorting combinations indexed on other columns. I do however have index on every column separately which helps with the calculation.

What is the best way to make this as fast as possible? I can store the data in memory since they do not change at all. Shall I use a search server instead of mysql for example?

Thank you a lot for your time!
0
Comment
Question by:infodigger
  • 7
  • 4
12 Comments
 
LVL 30

Expert Comment

by:IanTh
ID: 39676920
try

select id from mytable order by ((sorting_column1*4) + (sorting_column6*2) + (sorting_column1*7)) desc limit 100
0
 

Author Comment

by:infodigger
ID: 39676926
Hi IanTh and thanks for the reply,

Unfortunately, that didn't make any difference.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39677327
Where can we get the test data for this?  Can you give us a URL that will expose the test data in the form of a CSV file?  If so we can probably help you come up with the fastest queries.

Also, please tell us a little more about the application.  Where does this data come from?  How often is it updated?  What is the business purpose, etc?

Thanks, ~Ray
0
 

Author Comment

by:infodigger
ID: 39677408
Hi Ray,

I am attaching part of the table. The actual table is the same but with 400K records. In the attachment I have the first 1K records only.

Some sample sql queries that I run are:

select id from mytable order by ((ranking_factor_1*2) + (ranking_factor_32*4) + (ranking_factor_22*8)) desc limit 100

select id from mytable order by ((ranking_factor_8*6) + (ranking_factor_42*4) + (ranking_factor_30*2)) desc limit 100

The order by statement is generated in realtime so it's impossible to have all the combinations prefetched somehow.

However what is possible is to have the whole data in cache, but I really don't know if that could help.

Thank you very much!
table-example.csv
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39677423
Thanks, I'll take a look at it and post back with whatever I can discern... More anon, ~Ray
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39678437
I don't think you can get much faster than what you have.  Couple of thoughts...

You said the data was DECIMAL(5,2) but the test data in the CSV is all integer.  If you really have integers, you may find that the processing is faster if the columns are defined as INT.  I shortened the column names, too.  Doubtful that mattered much, if at all.  I did not add any indexes to these columns.

My query timings were between 1.2 and 1.4 milliseconds using this script.

<?php // RAY_temp_infodigger.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);


$sql = "DROP TABLE EE_infodigger";
$res = $mysqli->query($sql);

// CREATING A TABLE FOR OUR TEST DATA
$sql = "CREATE TABLE EE_infodigger
( id    INT         NOT NULL AUTO_INCREMENT PRIMARY KEY";
$num = 1;
while ($num < 81)
{
    $pad = str_pad($num, 2, '0', STR_PAD_LEFT);
    $sql .= PHP_EOL . ", rf_$pad INT NOT NULL DEFAULT 0";
    $num++;
}
$sql .= ')';

// RUN THE QUERY TO CREATE THE TABLE
$res = $mysqli->query($sql);

// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res)
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// SHOW THE RESULTS OF THE QUERY
var_dump($res);


// TEST DATA, SAVED FROM THE POST AT EE
$csv = "http://laprbass.com/RAY_temp_infodigger.csv";
$fpo = fopen($csv, 'r');
if (!$fpo ) die("CRAP.  CANNOT OPEN $csv");

// GET THE FIELD NAMES FROM THE TOP OF THE CSV FILE
$top = fgetcsv($fpo);
$cnt = count($top);

// SET UP KEY NAMES FOR USE IN OUR QUERY
$query_cols = implode(',', $top);

// SET A ROW COUNTER
$counter = 0;

// KEEP TRACK OF ROWS THAT HAVE THE WRONG NUMBER OF FIELDS
$errors = array();

// LOOP THROUGH THE CSV RECORDS PERFORMING CERTAIN TESTS
while (!feof($fpo))
{
    $counter++;

    // GET A RECORD
    $csvdata = fgetcsv($fpo);

    // SKIP OVER EMPTY ROWS
    if (empty($csvdata)) continue;

    // CHECK THE NUMBER OF FIELDS
    if ($cnt != count($csvdata))
    {
        $errors[] = $counter;
        continue;
    }

    // ESCAPE THE INFORMATION FOR USE IN THE QUERY
    foreach ($csvdata as $ptr => $value)
    {
        $csvdata[$ptr] = $mysqli->real_escape_string($value);
    }

    // SET UP VALUE FIELDS
    $query_data = "'" . implode("', '", $csvdata) . "'";

    // SET UP A QUERY
    $sql = "REPLACE INTO EE_infodigger ( $query_cols ) VALUES ( $query_data )";

    if (!$res = $mysqli->query($sql))
	{
	    $err
	    = 'QUERY FAILURE:'
	    . ' ERRNO: '
	    . $mysqli->errno
	    . ' ERROR: '
	    . $mysqli->error
	    . ' QUERY: '
	    . $sql
	    ;
	    trigger_error($err, E_USER_ERROR);
	}
}


// SHOW THE NUMBER OF ROWS PROCESSED
echo "<br/>RECORDS PROCESSED $counter " . PHP_EOL;

// SHOW THE NUMBERS OF THE ROWS WITH THE WRONG NUMBER OF FIELDS
if (count($errors))
{
    echo "<br/>ROWS WITH THE WRONG NUMBER OF FIELDS: " . PHP_EOL;
    var_dump($errors);
}

// START THE TIMER AND RUN TWO QUERIES
$sw = new Stopwatch;
$sw->start('q1');
$sql = "select id from EE_infodigger order by ((rf_01*2) + (rf_32*4) + (rf_22*8)) desc limit 100";
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q1');

$sw->start('q2');
$sql = "select id from EE_infodigger order by ((rf_08*6) + (rf_42*4) + (rf_30*2)) desc limit 100";
$res = $mysqli->query($sql);
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q2');

// GET THE DISPLAY
echo PHP_EOL;
unset($sw);





class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo
          __CLASS__
        . '::'
        . __FUNCTION__
        . '() '
        ;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

Open in new window

0
How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39678448
Hmm... Maybe a burst of inspiration.  You've got about 400X the amount of data I am testing with.  Wonder if it would be possible to reduce the size of the data set, maybe with a down-select using CREATE TEMPORARY TABLE ... ENGINE=MEMORY.

Can you tell us a little more about the nature of the data and the queries?  This looks very, very academic.  If possible, tell us about the business rules, thanks.
0
 

Author Comment

by:infodigger
ID: 39678635
Hi Ray,

First of all thank you for your deep interest into my question. I really appreciate the time you spend in answering.

The data is used for a recommendation engine. Our algorithm populates this table once every few hours, but then the data stays the same.

Let's say you have 400.000 different items that you want to choose the best ones by your preferences. You preferences can be any of the 80 different ratings that we have pre-calculated for each item. So depending on which preference you select first, second, third, etc. we create an order by statement that will sort all items so that we can select the best ones matching your personal preferences.

You are right about the columns, we have left them decimal because we initially thought that the rating will have decimal places, so we could make this faster by making them INT.

Do you think that getting the whole table into memory could help? I don't think we can break it into smaller pieces since then again to get the top records we will need to sort the top records from every piece.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39678677
Using my small test data set, it looks like ENGINE=MEMORY shaved a fraction of a millisecond off the query time.  Maybe I'll try to scale the data up and see if there are more noticeable differences at scale.
0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39679171
I scaled the collection up to 100,000 rows.  I was unable to put the table into ENGINE=MEMORY at that size (failed at 48,500 rows), but I could put it on disk.  Without adding indexes, the queries tok 40ms to 50ms.  With indexes on the columns, the queries took 33ms to 38ms

About all I can say is, "Wow you have a lot of data!"  I think the key to higher performance is going to be anything you can do to winnow down the 400,000 rows.
0
 

Author Closing Comment

by:infodigger
ID: 39680025
Thank you very much for all the time spent on this question and for providing such detailed answers. As you mention in the last post, I will try to figure out some ways to split the data and make the new tables smaller!
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39680481
Thanks for the points and thanks for using EE.  Best of luck with the project, ~Ray
0

Featured Post

Complete Microsoft Windows PC® & Mac Backup

Backup and recovery solutions to protect all your PCs & Mac– on-premises or in remote locations. Acronis backs up entire PC or Mac with patented reliable disk imaging technology and you will be able to restore workstations to a new, dissimilar hardware in minutes.

Join & Write a Comment

Popularity Can Be Measured Sometimes we deal with questions of popularity, and we need a way to collect opinions from our clients.  This article shows a simple teaching example of how we might elect a favorite color by letting our clients vote for …
Things That Drive Us Nuts Have you noticed the use of the reCaptcha feature at EE and other web sites?  It wants you to read and retype something that looks like this.Insanity!  It's not EE's fault - that's just the way reCaptcha works.  But it is …
The viewer will learn how to dynamically set the form action using jQuery.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now