
Analyse This -- A Database Optimization Procedure

By Kevin Cross, Chief Technology Officer
As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs. This article is intended to be a resource for such a task.


Preface

The other day, I was involved in a question by wasifg on MySQLQuestions.com (a MySQL-dedicated site offered by the owners of EE, so EE usernames and passwords can be used there by anyone interested).  The gist of the question was to find:

An auditing tool that compares the schema definition with the data present in a table (e.g., a column is defined as VARCHAR(255), but the maximum data value is only 100 characters in size).

My original suggestion, not having previously played with the procedure analyse extension, was to query the INFORMATION_SCHEMA to build a dynamic SQL query.  This gets the job done, of course, but MySQL has a potentially better way, one already intended for this purpose; therefore, this article will act as a resource for both my old way and the new way discovered using procedure analyse.

[step=""]Analyze That
If you came here interested in Analyze Table syntax, then unfortunately that is not the scope of this particular article.  If you want to ensure that the most up-to-date key distribution information is stored for your table to aid in the performance of queries/joins, then please check this manual page: http://dev.mysql.com/doc/refman/5.1/en/analyze-table.html.
analyze table bookings; 


[/step]

Okay!  Back to this.


Contents

It's Easy…Use Dynamic SQL
Oh!  It's Even Easier…Use Procedure Analyse
Summary

So let's get started...

 

1. It's Easy…

Use Dynamic SQL. The procedure analyse_table is born:
delimiter $$

drop procedure if exists `util`.`analyse_table`;
create procedure `util`.`analyse_table`(in schemaname varchar(255),
                                        in tablename varchar(255))
begin

-- Build dynamic SQL string: one aggregate SELECT per column, joined
-- with UNION. Note: the ORDER BY lives inside group_concat so the
-- per-column SELECTs concatenate in ordinal (column) order; an outer
-- ORDER BY would have no effect on an aggregated single-row result.
select
group_concat(
   concat("select '", table_name, "' as tbl,'", column_name, "' as col,'"
        , data_type, "' as typ,", character_maximum_length
        , " as len, max(char_length(`", column_name
        , "`)) as len_data, max(length(`", column_name
        , "`)) as len_bytes from `", table_schema, "`.`", table_name, "`")
   order by ordinal_position
   separator ' union '
)
from information_schema.columns
where (schemaname is null or table_schema = schemaname)
and table_name = tablename
-- and data_type like '%char'
and not character_maximum_length is null
into @sqlstring;

-- If no table found, skip processing.
if (not @sqlstring is null) then
   -- Execute dynamic SQL string.
   prepare s1 from @sqlstring;
   execute s1;
   deallocate prepare s1;
end if;

end $$

delimiter ;



How it works:

Focusing on the example of VARCHAR fields, the original variation of this procedure began by searching INFORMATION_SCHEMA for the specified table's columns with a data type like CHAR.  Since this is character data, char_length can be used to determine the number of characters in each value in the table.

For example, both these symbols, representing our bookings record's transaction currency, are one character in size:

select char_length(bk_currency) as num_chars
from ( select _utf8 '€' as bk_currency
       union
       select '$' ) bookings;

[Image: Sample char_length results]
Subsequently, length can be used to determine the number of bytes of data in each value.  This is significant for character data, as it reveals whether any multi-byte characters are present, which would make the two functions return different results.

Unlike our previous example, notice the difference in the function results of this query:

select bk_currency, length(bk_currency) as num_bytes
from ( select _utf8 '€' as bk_currency
       union
       select '$' ) bookings;

[Image: Sample length results]
Consequently, although the euro (€) and dollar ($) symbols are each only a single character, you may opt for CHAR(3) or VARCHAR(3) as the column's data type: characters like € are represented by multiple bytes (three in UTF-8), so 3 bytes of storage are required.
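To see both measurements side by side, here is a quick combined check over the same derived bookings data used above:

select bk_currency,
       char_length(bk_currency) as num_chars,
       length(bk_currency)      as num_bytes
from ( select _utf8 '€' as bk_currency
       union
       select _utf8 '$' ) bookings;
-- '€' -> 1 character, 3 bytes (UTF-8); '$' -> 1 character, 1 byte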
With both functions at our disposal, the procedure builds a dynamic query string against the requisite table that returns the maximum of each measurement alongside the literal values found in the INFORMATION_SCHEMA.
The string is then executed as a prepared statement, returning a result set for our review.
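If the prepared-statement plumbing is unfamiliar, the pattern the procedure relies on can be tried standalone (a minimal sketch; the query string here is arbitrary):

set @sqlstring = 'select count(*) as num_rows from `test`.`bookings`';
prepare s1 from @sqlstring;   -- compile the dynamic string
execute s1;                   -- run it; results return to the client
deallocate prepare s1;        -- release the statement handle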

Example procedure call:
call util.analyse_table('test', 'bookings');



Example results:
[Image: analyse_table bookings results]
Upon examination of the above, note that you must make recommendations manually, based on comparing the len, len_data and len_bytes results.  Note also that you would need different coding to handle data types like SET (highlighted in the results) and even numeric values, to determine whether the range of values could be stored more compactly.
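For a numeric column, for instance, you might compare the actual range of values against the integer type capacities.  A sketch only; bk_qty is a hypothetical column:

-- bk_qty is hypothetical; substitute a real numeric column.
select min(bk_qty) as min_val,
       max(bk_qty) as max_val
from bookings;
-- If 0 <= min_val and max_val <= 255, TINYINT UNSIGNED suffices;
-- max_val <= 65535 fits SMALLINT UNSIGNED; and so on.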

[step="" title=""]««Recommendations Explored»»
Here are some samples of the associated data and, as examples, the recommendations you might make.  Since I mentioned the SET-based bk_type, I will start with that and then show the VARCHAR bk_customer.

1. Distinct listing of actual "bk_type" values:
bk_type
----------
Product
Service


Since SET is a special type, I won't go too deep into whether or not to use it; I actually wasn't using one, and converted this particular column just for this example.  But if you do use SETs, hopefully you understand that the size of 15 shown comes from the combination of the values in the set.  Moreover, since each individual selection within the set is 7 characters long, it isn't really an issue here that our character/byte length is half the field's defined maximum.  Your recommendation would probably be to leave this "as-is", if you endorse the SET data type in the first place.
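To illustrate where the 15 comes from, consider the assumed definition of the example column and the longest value a row could hold, namely every member combined:

-- Assumed definition: bk_type set('Product','Service')
select char_length('Product,Service') as max_len;  -- 15 (7 + 1 comma + 7)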

2. Sample listing of actual "bk_customer" values:
bk_customer
---------------
C543423
C1234318
C89
C101160
C1237
C123419
C1234595
C10119
C123275
C1234465


Now, remember that our defined maximum here is 10 characters, while our longest actual value, "C1234318", is 8 characters.  Depending on your motives for the audit, you could certainly recommend VARCHAR(8) here.
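Acting on that recommendation would then be a one-line change (a sketch only; adjust the NULL-ability and character set to your schema):

alter table bookings modify bk_customer varchar(8) not null;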

Adding our brains back into the equation, let's look at the variance in length a bit.  Since we are not using a fixed format like "C0000001", if you have a lot of customers, or customer turnover like a job shop with no repeat business, you may opt instead to leave the extra 2 characters for future growth past "C9999999".  This may inspire you to amend the procedure above to return the minimum, standard deviation, or other statistical aggregates alongside the max lengths, to quickly gauge volatility or variability in the data; see the sketch below.
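As a sketch of that amendment (not part of the original procedure; schema and table names are hard-coded here so it can be run standalone), the dynamic-SQL builder extended with min and standard deviation of the character lengths:

select
group_concat(
   concat("select '", table_name, "' as tbl,'", column_name, "' as col,"
        , character_maximum_length, " as len"
        , ", max(char_length(`", column_name, "`)) as len_data"
        , ", min(char_length(`", column_name, "`)) as len_min"
        , ", round(std(char_length(`", column_name, "`)), 2) as len_std"
        , " from `", table_schema, "`.`", table_name, "`")
   order by ordinal_position
   separator ' union '
) as sqlstring
from information_schema.columns
where table_schema = 'test'
  and table_name = 'bookings'
  and not character_maximum_length is null;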
[/step]

Therein lies the con of this solution: manual recommendations. So, is there a better way?

2. Oh! It's Even Easier…

Use Procedure Analyse. The new and improved analyse_table procedure:
delimiter $$

drop procedure if exists `util`.`analyse_table`;
create procedure `util`.`analyse_table`(in schemaname varchar(255),
                                        in tablename varchar(255),
                                        in max_elems int,
                                        in max_mem int)
begin

-- Build dynamic SQL string: select every column, in ordinal order,
-- and hand the result set to procedure analyse.
select concat('select `'
            , group_concat(column_name
                           order by ordinal_position
                           separator '`,`')
            , '` from `', table_schema, '`.`', table_name
            , '` procedure analyse(', max_elems, ',', max_mem, ');')
from information_schema.columns
where (schemaname is null or table_schema = schemaname)
and table_name = tablename
-- and data_type like '%char'
-- and not character_maximum_length is null
into @sqlstring;

-- If no table found, skip processing.
if (not @sqlstring is null) then
   -- Execute dynamic SQL string.
   prepare s1 from @sqlstring;
   execute s1;
   deallocate prepare s1;
end if;

end $$

delimiter ;



How it works:

Building on the prepared-statement approach, we automate the much simpler call using procedure analyse, which typically takes this form:

select bk_salesperson, bk_customer, bk_memo, bk_type
from bookings
procedure analyse(10, 256);

The two parameters are used in determining whether to suggest the ENUM data type: 10 is the maximum number of distinct elements, and 256 is the maximum memory to use while analyzing the distinct values.  I personally prefer the lookup-table approach, so I tend to disregard any ENUM recommendations.

Note: In my testing, I found that the max_elements parameter does indeed appear not to work, as asserted in the discussion on the MySQL documentation page.  What appears to happen is that max_memory controls this more than anything.  If you pass (1, 128) and 128 is a big enough value to analyze the distinct values in your table, ENUM will still be recommended for the column.  On the other hand, if you pass (128, 1), ENUM will not be recommended even though you have allowed up to 128 distinct values.
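A quick way to reproduce that observation on the example table (the commented outcomes reflect my testing, per the note above):

select bk_type from bookings procedure analyse(1, 128);
-- -> still suggests ENUM('Product','Service') despite max_elements = 1
select bk_type from bookings procedure analyse(128, 1);
-- -> no ENUM suggestion, despite allowing up to 128 distinct elements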
To keep things flexible, the procedure signature was updated to include the elements and memory parameters, in case some of you care to use the ENUM recommendations.
Comparing the two versions of our analyse_table procedure, you will notice that this version, using MySQL's procedure analyse, recommends data type sizes based on length in bytes, consistent with using CHAR(3) rather than CHAR(1) for our bk_currency example data.

Example procedure call:
call util.analyse_table('test', 'bookings', 1, 1);



Voilà!

As you will see from examining the new results, you no longer have to derive type recommendations manually; you get optimal field type suggestions such as "CHAR(8) NOT NULL".  Additionally, data types such as SET, INT, et al. are all handled appropriately without requiring additional code.

With that said, since procedure analyse works with the wildcard select syntax:
select *
from bookings
procedure analyse(1,1);


Our example procedure call is equivalent to the above in its results; therefore, the real advantage of the dynamic procedure is the ability to analyze only specific types of columns, such as character data, without having to know each column name.  Simply re-enable the commented-out data_type filter in the procedure's WHERE clause, as sketched below.
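For example, with the filter re-enabled, the dynamic statement covers only the character columns of the table.  Here is a sketch of the relevant SELECT run outside the procedure, with schema and table names hard-coded for illustration:

select concat('select `'
            , group_concat(column_name
                           order by ordinal_position
                           separator '`,`')
            , '` from `', table_schema, '`.`', table_name
            , '` procedure analyse(10,256);') as analyse_stmt
from information_schema.columns
where table_schema = 'test'
  and table_name = 'bookings'
  and data_type like '%char';   -- char/varchar columns only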


Summary

In the above, we looked at a nice resource for database administrators in procedure analyse.  Please use it wisely, remembering not to get short-sighted.  For example, if your data only holds 8-character strings today but may grow to 16 characters in future iterations of your application, the procedural code will not have that insight; so, as with anything else computer-related, don't check your brain at the door.  If you have the physical resources to sustain a slightly bigger size than currently needed, it may be okay to keep it, especially if the greater length doesn't require additional storage bytes.  Remember our example above, where business needs may warrant leaving flexibility for growth in data length because of the nature of the values; with VARCHAR, the amount of storage required is based on the actual values, not the declared size, so the added flexibility costs you nothing extra today.

Therefore:
ensure you understand the storage requirement differences of data types whose footprint varies with the specified size, CHAR being one example;
keep your business or experience in mind, so the ease of having most fields auto-recommended doesn't preclude you from making a smarter choice on a key, business-value column.

Thanks for reading!

Best regards and happy coding,

Kevin C. Cross, Sr. (mwvisa1)


References

Data Types > http://dev.mysql.com/doc/refman/5.1/en/data-types.html
Data Type Storage Requirements > http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Information Schema Columns Table > http://dev.mysql.com/doc/refman/5.1/en/columns-table.html
Prepared Statements > http://dev.mysql.com/doc/refman/5.1/en/sql-syntax-prepared-statements.html
Procedure Analyse > http://dev.mysql.com/doc/refman/5.1/en/procedure-analyse.html
String Functions > http://dev.mysql.com/doc/refman/5.1/en/string-functions.html

