Solved

Sorting of big table in mySQL

Posted on 2013-11-26
12
428 Views
Last Modified: 2013-11-27
I have a table which consists purely of number values but it's big in terms of records (400K) and columns (80). The data on the table do not change at all.

Here is the structure:

id, sorting_column1, sorting_column2, ... sorting_column80

sorting_columns are decimal(5,2).

The queries I perform are like:

select id from mytable order by (sorting_column1*4 + sorting_column6*2 + sorting_column1*7) desc limit 100

however, the sorting statements is 100% dynamic so I can't have all the possible sorting combinations indexed on other columns. I do however have index on every column separately which helps with the calculation.

What is the best way to make this as fast as possible? I can store the data in memory since they do not change at all. Shall I use a search server instead of mysql for example?

Thank you a lot for your time!
0
Comment
Question by:infodigger
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 4
12 Comments
 
LVL 30

Expert Comment

by:IanTh
ID: 39676920
try

select id from mytable order by ((sorting_column1*4) + (sorting_column6*2) + (sorting_column1*7)) desc limit 100
0
 

Author Comment

by:infodigger
ID: 39676926
Hi IanTh and thanks for the reply,

Unfortunately, that didn't make any difference.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39677327
Where can we get the test data for this?  Can you give us a URL that will expose the test data in the form of a CSV file?  If so we can probably help you come up with the fastest queries.

Also, please tell us a little more about the application.  Where does this data come from?  How often is it updated?  What is the business purpose, etc?

Thanks, ~Ray
0
Get Database Help Now w/ Support & Database Audit

Keeping your database environment tuned, optimized and high-performance is key to achieving business goals. If your database goes down, so does your business. Percona experts have a long history of helping enterprises ensure their databases are running smoothly.

 

Author Comment

by:infodigger
ID: 39677408
Hi Ray,

I am attaching part of the table. The actual table is the same but with 400K records. In the attachment I have the first 1K records only.

Some sample sql queries that I run are:

select id from mytable order by ((ranking_factor_1*2) + (ranking_factor_32*4) + (ranking_factor_22*8)) desc limit 100

select id from mytable order by ((ranking_factor_8*6) + (ranking_factor_42*4) + (ranking_factor_30*2)) desc limit 100

The order by statement is generated in realtime so it's impossible to have all the combinations prefetched somehow.

However what is possible is to have the whole data in cache, but I really don't know if that could help.

Thank you very much!
table-example.csv
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39677423
Thanks, I'll take a look at it and post back with whatever I can discern... More anon, ~Ray
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39678437
I don't think you can get much faster than what you have.  Couple of thoughts...

You said the data was DECIMAL(5,2) but the test data in the CSV is all integer.  If you really have integers, you may find that the processing is faster if the columns are defined as INT.  I shortened the column names, too.  Doubtful that mattered much, if at all.  I did not add any indexes to these columns.

My query timings were between 1.2 and 1.4 milliseconds using this script.

<?php // RAY_temp_infodigger.php
ini_set('display_errors', TRUE);
error_reporting(E_ALL);
echo '<pre>';

// DATABASE CONNECTION AND SELECTION VARIABLES - GET THESE FROM YOUR HOSTING COMPANY
$db_host = "localhost"; // PROBABLY THIS IS OK
$db_name = "??";
$db_user = "??";
$db_word = "??";

// OPEN A CONNECTION TO THE DATA BASE SERVER AND SELECT THE DB
$mysqli = new mysqli($db_host, $db_user, $db_word, $db_name);

// DID THE CONNECT/SELECT WORK OR FAIL?
if ($mysqli->connect_errno)
{
    $err
    = "CONNECT FAIL: "
    . $mysqli->connect_errno
    . ' '
    . $mysqli->connect_error
    ;
    trigger_error($err, E_USER_ERROR);
}

// ACTIVATE THIS TO SHOW WHAT THE DB CONNECTION OBJECT LOOKS LIKE
// var_dump($mysqli);


$sql = "DROP TABLE EE_infodigger";
$res = $mysqli->query($sql);

// CREATING A TABLE FOR OUR TEST DATA
$sql = "CREATE TABLE EE_infodigger
( id    INT         NOT NULL AUTO_INCREMENT PRIMARY KEY";
$num = 1;
while ($num < 81)
{
    $pad = str_pad($num, 2, '0', STR_PAD_LEFT);
    $sql .= PHP_EOL . ", rf_$pad INT NOT NULL DEFAULT 0";
    $num++;
}
$sql .= ')';

// RUN THE QUERY TO CREATE THE TABLE
$res = $mysqli->query($sql);

// IF mysqli::query() RETURNS FALSE, LOG AND SHOW THE ERROR
if (!$res)
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}

// SHOW THE RESULTS OF THE QUERY
var_dump($res);


// TEST DATA, SAVED FROM THE POST AT EE
$csv = "http://laprbass.com/RAY_temp_infodigger.csv";
$fpo = fopen($csv, 'r');
if (!$fpo ) die("CRAP.  CANNOT OPEN $csv");

// GET THE FIELD NAMES FROM THE TOP OF THE CSV FILE
$top = fgetcsv($fpo);
$cnt = count($top);

// SET UP KEY NAMES FOR USE IN OUR QUERY
$query_cols = implode(',', $top);

// SET A ROW COUNTER
$counter = 0;

// KEEP TRACK OF ROWS THAT HAVE THE WRONG NUMBER OF FIELDS
$errors = array();

// LOOP THROUGH THE CSV RECORDS PERFORMING CERTAIN TESTS
while (!feof($fpo))
{
    $counter++;

    // GET A RECORD
    $csvdata = fgetcsv($fpo);

    // SKIP OVER EMPTY ROWS
    if (empty($csvdata)) continue;

    // CHECK THE NUMBER OF FIELDS
    if ($cnt != count($csvdata))
    {
        $errors[] = $counter;
        continue;
    }

    // ESCAPE THE INFORMATION FOR USE IN THE QUERY
    foreach ($csvdata as $ptr => $value)
    {
        $csvdata[$ptr] = $mysqli->real_escape_string($value);
    }

    // SET UP VALUE FIELDS
    $query_data = "'" . implode("', '", $csvdata) . "'";

    // SET UP A QUERY
    $sql = "REPLACE INTO EE_infodigger ( $query_cols ) VALUES ( $query_data )";

    if (!$res = $mysqli->query($sql))
	{
	    $err
	    = 'QUERY FAILURE:'
	    . ' ERRNO: '
	    . $mysqli->errno
	    . ' ERROR: '
	    . $mysqli->error
	    . ' QUERY: '
	    . $sql
	    ;
	    trigger_error($err, E_USER_ERROR);
	}
}


// SHOW THE NUMBER OF ROWS PROCESSED
echo "<br/>RECORDS PROCESSED $counter " . PHP_EOL;

// SHOW THE NUMBERS OF THE ROWS WITH THE WRONG NUMBER OF FIELDS
if (count($errors))
{
    echo "<br/>ROWS WITH THE WRONG NUMBER OF FIELDS: " . PHP_EOL;
    var_dump($errors);
}

// START THE TIMER AND RUN TWO QUERIES
$sw = new Stopwatch;
$sw->start('q1');
$sql = "select id from EE_infodigger order by ((rf_01*2) + (rf_32*4) + (rf_22*8)) desc limit 100";
if (!$res = $mysqli->query($sql))
{
    $err
    = 'QUERY FAILURE:'
    . ' ERRNO: '
    . $mysqli->errno
    . ' ERROR: '
    . $mysqli->error
    . ' QUERY: '
    . $sql
    ;
    trigger_error($err, E_USER_ERROR);
}
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q1');

$sw->start('q2');
$sql = "select id from EE_infodigger order by ((rf_08*6) + (rf_42*4) + (rf_30*2)) desc limit 100";
$res = $mysqli->query($sql);
while ($row = $res->fetch_object()) { echo $row->id . ','; }
$sw->stop('q2');

// GET THE DISPLAY
echo PHP_EOL;
unset($sw);





class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo
          __CLASS__
        . '::'
        . __FUNCTION__
        . '() '
        ;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

Open in new window

0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39678448
Hmm... Maybe a burst of inspiration.  You've got about 400X the amount of data I am testing with.  Wonder if it would be possible to reduce the size of the data set, maybe with a down-select using CREATE TEMPORARY TABLE ... ENGINE=MEMORY.

Can you tell us a little more about the nature of the data and the queries?  This looks very, very academic.  If possible, tell us about the business rules, thanks.
0
 

Author Comment

by:infodigger
ID: 39678635
Hi Ray,

First of all thank you for your deep interest into my question. I really appreciate the time you spend in answering.

The data is used for a recommendation engine. Our algorithm populates this table once every few hours, but then the data stays the same.

Let's say you have 400.000 different items that you want to choose the best ones by your preferences. You preferences can be any of the 80 different ratings that we have pre-calculated for each item. So depending on which preference you select first, second, third, etc. we create an order by statement that will sort all items so that we can select the best ones matching your personal preferences.

You are right about the columns, we have left them decimal because we initially thought that the rating will have decimal places, so we could make this faster by making them INT.

Do you think that getting the whole table into memory could help? I don't think we can break it into smaller pieces since then again to get the top records we will need to sort the top records from every piece.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39678677
Using my small test data set, it looks like ENGINE=MEMORY shaved a fraction of a millisecond off the query time.  Maybe I'll try to scale the data up and see if there are more noticeable differences at scale.
0
 
LVL 110

Accepted Solution

by:
Ray Paseur earned 500 total points
ID: 39679171
I scaled the collection up to 100,000 rows.  I was unable to put the table into ENGINE=MEMORY at that size (failed at 48,500 rows), but I could put it on disk.  Without adding indexes, the queries tok 40ms to 50ms.  With indexes on the columns, the queries took 33ms to 38ms

About all I can say is, "Wow you have a lot of data!"  I think the key to higher performance is going to be anything you can do to winnow down the 400,000 rows.
0
 

Author Closing Comment

by:infodigger
ID: 39680025
Thank you very much for all the time spent on this question and for providing such detailed answers. As you mention in the last post, I will try to figure out some ways to split the data and make the new tables smaller!
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39680481
Thanks for the points and thanks for using EE.  Best of luck with the project, ~Ray
0

Featured Post

PeopleSoft Has Never Been Easier

PeopleSoft Adoption Made Smooth & Simple!

On-The-Job Training Is made Intuitive & Easy With WalkMe's On-Screen Guidance Tool.  Claim Your Free WalkMe Account Now

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
database connection error mysql stops 7 82
How difficult would it be build a cookie scanner in PHP? 21 62
Joomla Editing 3 23
Why is my select returning NaN 23 43
Does the idea of dealing with bits scare or confuse you? Does it seem like a waste of time in an age where we all have terabytes of storage? If so, you're missing out on one of the core tools in every professional programmer's toolbox. Learn how to …
Many old projects have bad code, but the budget doesn't exist to rewrite the codebase. You can update this code to be safer by introducing contemporary input validation, sanitation, and safer database queries.
The viewer will learn how to dynamically set the form action using jQuery.
The viewer will learn how to create and use a small PHP class to apply a watermark to an image. This video shows the viewer the setup for the PHP watermark as well as important coding language. Continue to Part 2 to learn the core code used in creat…

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question