Analyse This -- A Database Optimization Procedure

AID: 3836
  • Status: Published

11910 points

  • By
  • TypeResource
  • Posted on2010-09-27 at 13:11:36
Awards
  • Community Pick
  • Experts Exchange Approved
  • Editor's Choice
As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs.  This Article is intended to be a resource for such a task.


Preface


The other day, I was involved in a question by wasifg on MySQLQuestions.com (a MySQL dedicated site offered by the owners of EE, so usernames and passwords from EE can be utilized there for any interested).  The gist of the question is to determine:

An auditing tool that compares the schema definition with the data present in a table (e.g., a column is defined as VARCHAR(255), but the maximum data value is only 100 characters in size).


My original suggestion, not having previously played with the procedure analyse extension, was to query the INFORMATION_SCHEMA to build a dynamic SQL query.  This gets the job done of course, but MySQL has a potentially better way, at least one already intended for this purpose; therefore, this Article will act as a resource for my old way and the new way discovered using procedure analyse.

""]Analyze That
If you came here interested in Analyze Table syntax, then unfortunately that is not the scope of this particular article.  If you want to ensure that the most up-to-date key distribution information is stored for your table to aid in the performance of queries/joins, then please check this manual page: http://dev.mysql.com/doc/refman/5.1/en/analyze-table.html.

analyze table bookings; 
                                    
1:

Select allOpen in new window



Okay!  Back to this.


Contents


  • It's Easy…Use Dynamic SQL
  • Oh!  It's Even Easier…Use Procedure Analyse
  • Summary


So let's get started...

 
1

It's Easy…

Use Dynamic SQL.


Procedure analyse_table is born:
delimiter $$

drop procedure if exists `util`.`analyse_table`;
create procedure `util`.`analyse_table`(in schemaname varchar(255), 
                          in tablename varchar(255))
begin

-- Build dynamic SQL string
select 
group_concat(
   concat("select '", table_name, "' as tbl,'", column_name, "' as col,'"
        , data_type, "' as typ,", character_maximum_length
        , " as len, max(char_length(`", column_name
        , "`)) as len_data, max(length(`", column_name
        , "`)) as len_bytes from `", table_schema, "`.`", table_name, "`")
   separator ' union '
)
from information_schema.columns
where (schemaname is null or table_schema = schemaname)
and table_name = tablename
-- and data_type like '%char'
and not character_maximum_length is null
order by ordinal_position
into @sqlstring;

-- If no table found, skip processing.
if (not @sqlstring is null) then
   -- Execute dynamic SQL string.   
   prepare s1 from @sqlstring;
   execute s1;
   deallocate prepare s1;
end if;

end $$

delimiter ;
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:

Select allOpen in new window



How it works:

  • Focusing on the example of VARCHAR fields, the original variation of this procedure began by searching INFORMATION_SCHEMA for columns of the specified table that had data type like CHAR.  Since this is character data, char_length can be used to determine the number of characters in each value in this table.

    For example, both these symbols, representing our bookings record's transaction currency, are one character in size:
select char_length(bk_currency) as num_chars
from (
   select _utf8 '€' as bk_currency
   union select '$'
) bookings
;
                                    
1:
2:
3:
4:
5:
6:

Select allOpen in new window

Analyse-CharLength.PNG
  • 1 KB
  • Sample Char_Length Results
Sample Char_Length Results

  • Subsequently, length can be used to determine the number of bytes of data in each value.  This is significant for character data as it reveals if there are any double-byte characters used, which would make the values returned from the aforementioned functions different.

    Unlike our previous, example, notice the difference in the function results of this query:
select bk_currency, length(bk_currency) as num_bytes
from (
   select _utf8 '€' as bk_currency
   union select '$'
) bookings
; 
                                    
1:
2:
3:
4:
5:
6:

Select allOpen in new window

Analyse-Length.PNG
  • 1 KB
  • Sample Length Results
Sample Length Results


Consequently, although the euro (€) and usd ($) symbols are both only single characters, you may opt to use CHAR(3) or VARCHAR(3) as your column's data type since the storage required is 3 bytes, since some characters like € are represented by multiple bytes.

  • With both functions at our disposal, the procedure creates a dynamic query string on the requisite table that returns the max value of both alongside the literal values found in the INFORMATION_SCHEMA.

  • Using prepared statements, once executed, this procedure returns a result set for our review.


Example procedure call:
call util.analyse_table('test', 'bookings');
                                    
1:

Select allOpen in new window



Example results:
Analyse-Table-Bookings.PNG
  • 6 KB
  • Analyse_Table Bookings Results
Analyse_Table Bookings Results


Upon examination of the above, please note you would manually make recommendations based on analysis of the comparison(s) between len, len_data and len_bytes results.  Additionally, note that you would need different coding to handle data types like Set, which is highlighted, or even numerical values to determine if range of values could be better stored.

"" title=""]««Recommendations Explored»»
Here are some samples of data associated and how you might recommend as an example.  Since I mentioned the Set based by_type, I will start with that and then show the VarChar bk_customer.

1. Distinct listing of actual "bk_type" values:

bk_type
----------
Product
Service 
                                    
1:
2:
3:
4:

Select allOpen in new window


Since Sets are a special type, I won't get too deep into whether to use them or not -- I actually wasn't and converted this particular column as an example.  But if you do use them, hopefully you understand that the size of 15 that shows is because of the combination of the values in the set.  Moreover, since each individual selection within the set is 7 characters in length, it isn't really an issue in this case that our character/byte length is half that of the field's defined maximum length.  Your recommendation here would probably be to leave this "as-is" if you endorse Set data type in the first place.

2. Sample listing of actual "bk_customer" values:

bk_customer
---------------
C543423
C1234318
C89
C101160
C1237
C123419
C1234595
C10119
C123275
C1234465 
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:

Select allOpen in new window


Now, remember here our defined maximum is 10 characters; however, our highest actual data length is 8 like "C1234318".  Depending on your motives for the audit, you could certainly recommend VARCHAR(8) here.  

Adding our brains back into the equation, let's look at the variance in length a bit.  Since we are not using a fixed format like "C0000001", if you have a lot of customers or customer turnover like a job shop with no repeat business, you may opt instead to leave the extra 2 characters for future growth when you move past "C9999999".  This may intrigue you to amend the procedure above to send back minimum, standard deviation or other such additional statistical aggregates alongside our max lengths to quickly determine volatility or variability in data.



There-in lies the con to this solution: manual recommendations … so is there a better way ?

2

Oh!  It's Even Easier…

Use Procedure Analyse.


The new and improved analyse_table procedure:
delimiter $$

drop procedure if exists `util`.`analyse_table`;
create procedure `util`.`analyse_table`(in schemaname varchar(255), 
                          in tablename varchar(255),
                          in max_elems int,
                          in max_mem int)
begin

-- Build dynamic SQL string
select concat('select `'
            , group_concat(column_name separator '`,`')
            , '` from `', table_schema, '`.`', table_name
            , '` procedure analyse(',max_elems,',',max_mem,');')
from information_schema.columns
where (schemaname is null or table_schema = schemaname)
and table_name = tablename
-- and data_type like '%char'
-- and not character_maximum_length is null
order by ordinal_position
into @sqlstring;

-- If no table found, skip processing.
if (not @sqlstring is null) then
   -- Execute dynamic SQL string.   
   prepare s1 from @sqlstring;
   execute s1;
   deallocate prepare s1;
end if;

end $$

delimiter ;
                                    
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:

Select allOpen in new window



How it works:

  • Building on the prepared statement approach, we automate the much simpler call using procedure analyse which typically takes this form:
select bk_salesperson, bk_customer, bk_memo, bk_type
from bookings
procedure analyse(10,256); 
                                    
1:
2:
3:

Select allOpen in new window


Where the two parameters passed are used in the determining of ENUM data type usage: "10" represents the max number of distinct elements and "256" the max memory to use during analyzing distinct elements.  I personally like the lookup table approach instead, so would tend to not care for any recommendations of ENUM.

Note: In my testing regarding this, I found that the max_elements parameter will indeed appear not to work as asserted in the discussion on the MySQL documentation page.  What appears to be happening is that the max_memory controls this more than anything.  If you pass (1, 128) and 128 is a big enough value to analyze the distinct values in your table, ENUM will still be recommended on the column.  On the other hand, if you pass (128, 1), then ENUM will not be recommended even though you have indicated to use 128 distinct values.

  • To keep things flexible, the procedure signature was updated to include the elements and memory parameters needed in case some of you care to use the ENUM recommendations.

  • To correlate / compare the two versions of our analyse_table procedure, you will notice that this version using MySQL's procedure analyse will recommend data type size based on length in bytes, which will be consistent with using CHAR(3) versus CHAR(1) for our bk_currency example data.


Example procedure call:
call util.analyse_table('test', 'bookings', 1, 1);
                                    
1:

Select allOpen in new window



Voila !

As you will see from examination of the new results, you don't have to manually derive recommendation of type as you get optimal field type suggestions like "CHAR(8) NOT NULL".  Additionally, data types such as Set, Int, et al are all handled appropriately without requiring additional code.

With that said, since procedure analyse works with the wildcard select syntax:
select *
from bookings
procedure analyse(1,1);
                                    
1:
2:
3:

Select allOpen in new window


Our example procedure call is equivalent to the above in results; therefore, the real advantage to using the dynamic procedure call is the ability to analyze specific types of columns like character data only without having to know each column name.


Summary


In the above, we looked at a nice resource for database administrators in procedure analyse.  Please use it wisely, remembering not to get short-sighted.  For example, if your data only hold 8 character strings today, but may grow to 16 characters on future iterations of your application, the procedural code will not have this insight so as with anything else computer-related: don't check your brain at the door.  If you have the physical resources to sustain a slightly bigger size than currently needed, then it may be okay to do so, especially if the higher length doesn't require additional storage bytes.  Remember our example above where our business needs may warrant leaving flexibility for growth in data length because of the nature of the values and with VARCHAR the amount of storage required is based on the actual values and not the size specified; therefore, the added flexibility doesn't cost you anything extra today.

Therefore:
  • ensure you understand the storage requirement differences of certain data types that vary based on the specified size like CHAR as an example;

  • keep your business or experience in mind, so the ease of having most fields auto-recommended doesn't preclude you from making a smarter choice on a key, business-value column.


Thanks for reading!

Best regards and happy coding,

Kevin C. Cross, Sr. (mwvisa1)


References


Data Types > http://dev.mysql.com/doc/refman/5.1/en/data-types.html
Data Type Storage Requirements > http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Information Schema Columns Table > http://dev.mysql.com/doc/refman/5.1/en/columns-table.html
Prepared Statements > http://dev.mysql.com/doc/refman/5.1/en/sql-syntax-prepared-statements.html
Procedure Analyse > http://dev.mysql.com/doc/refman/5.1/en/procedure-analyse.html
String Functions > http://dev.mysql.com/doc/refman/5.1/en/string-functions.html


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
If you found this article helpful, please click the Yes button after the question just below. This will give me (the author) a few points and might encourage me to write more articles.

If you didn't or otherwise feel the need to vote No, please first leave a comment to give me a chance to answer and perhaps to improve this article.

Thank you!
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Asked On
    2010-09-27 at 13:11:36ID3836
    Tags

    MySQL

    ,

    db administration

    ,

    procedure analyse

    ,

    analyze

    ,

    optimal data types

    Topic

    MySQL Server

    Views
    1686

    Comments

    Add your Comment

    Please Sign up or Log in to comment on this article.

    Join Experts Exchange Today

    Gain Access to all our Tech Resources

    Get personalized answers

    Ask unlimited questions

    Access Proven Solutions

    Search 3.2 million solutions

    Read In-Depth How-To Guides

    1000+ articles, demos, & tips

    Watch Step by Step Tutorials

    Learn direct from top tech pros

    And Much More!

    Your complete tech resource

    See Plans and Pricing

    30-day free trial. Register in 60 seconds.

    Loading Advertisement...

    Top MySQL Server Experts

    1. johanntagle

      286,814

      Guru

      6,000 points yesterday

      Profile
      Rank: Sage
    2. Ray_Paseur

      216,557

      Guru

      0 points yesterday

      Profile
      Rank: Savant
    3. DaveBaldwin

      119,595

      Master

      1,400 points yesterday

      Profile
      Rank: Genius
    4. angelIII

      61,340

      Master

      0 points yesterday

      Profile
      Rank: Elite
    5. mwvisa1

      57,185

      Master

      30 points yesterday

      Profile
      Rank: Genius
    6. HainKurt

      41,850

      0 points yesterday

      Profile
      Rank: Genius
    7. ralmada

      39,250

      0 points yesterday

      Profile
      Rank: Genius
    8. Roads_Roads

      33,080

      0 points yesterday

      Profile
      Rank: Genius
    9. arnold

      29,812

      0 points yesterday

      Profile
      Rank: Genius
    10. theGhost_k8

      29,785

      0 points yesterday

      Profile
      Rank: Sage
    11. Kdo

      29,682

      0 points yesterday

      Profile
      Rank: Genius
    12. bportlock

      26,604

      0 points yesterday

      Profile
      Rank: Genius
    13. jason1178

      23,574

      0 points yesterday

      Profile
      Rank: Genius
    14. maeltar

      23,236

      0 points yesterday

      Profile
      Rank: Guru
    15. StingRaY

      21,500

      0 points yesterday

      Profile
      Rank: Wizard
    16. smadeira

      19,968

      0 points yesterday

      Profile
      Rank: Wizard
    17. fundacionrts

      18,200

      0 points yesterday

      Profile
      Rank: Master
    18. gr8gonzo

      17,019

      0 points yesterday

      Profile
      Rank: Sage
    19. ChrisStanyon

      16,964

      0 points yesterday

      Profile
      Rank: Sage
    20. pratima_mcs

      16,614

      0 points yesterday

      Profile
      Rank: Genius
    21. TempDBA

      16,400

      0 points yesterday

      Profile
      Rank: Sage
    22. Sharath_123

      16,268

      0 points yesterday

      Profile
      Rank: Genius
    23. for_yan

      16,000

      0 points yesterday

      Profile
      Rank: Genius
    24. matthewspatrick

      15,800

      0 points yesterday

      Profile
      Rank: Savant
    25. AielloJ

      13,732

      0 points yesterday

      Profile
      Rank: Wizard

    Hall Of Fame