As a database administrator, you may need to audit your table(s) to determine whether the data types are optimal for your real-world data needs. This Article is intended to be a resource for such a task.
Preface
The other day, I was involved in a question by wasifg on MySQLQuestions.com (a MySQL-dedicated site offered by the owners of EE, so EE usernames and passwords can be used there by anyone interested). The gist of the question was to find:
An auditing tool that compares the schema definition with the data present in a table (e.g., a column is defined as VARCHAR(255), but the maximum data value is only 100 characters in size).
My original suggestion, not having previously played with the procedure analyse extension, was to query the INFORMATION_SCHEMA to build a dynamic SQL query. This gets the job done, of course, but MySQL has a potentially better way, or at least one already intended for this purpose; therefore, this Article will act as a resource for both my old way and the new way discovered using procedure analyse.
[step=""]
Analyze That
If you came here interested in Analyze Table syntax, then unfortunately that is not the scope of this particular article. If you want to ensure that the most up-to-date key distribution information is stored for your table to aid the performance of queries/joins, then please check this manual page: http://dev.mysql.com/doc/refman/5.1/en/analyze-table.html.
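For reference, a minimal sketch of that statement, assuming a hypothetical bookings table in a util schema:

analyze table `util`.`bookings`;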
delimiter $$

drop procedure if exists `util`.`analyse_table`;

create procedure `util`.`analyse_table`(in schemaname varchar(255), in tablename varchar(255))
begin
  -- Build dynamic SQL string
  select group_concat(
           concat("select '", table_name, "' as tbl,'", column_name, "' as col,'"
                , data_type, "' as typ,", character_maximum_length
                , " as len, max(char_length(`", column_name
                , "`)) as len_data, max(length(`", column_name
                , "`)) as len_bytes from `", table_schema, "`.`", table_name, "`")
           separator ' union ')
    from information_schema.columns
   where (schemaname is null or table_schema = schemaname)
     and table_name = tablename
  -- and data_type like '%char'
     and not character_maximum_length is null
   order by ordinal_position
    into @sqlstring;

  -- If no table found, skip processing.
  if (not @sqlstring is null) then
    -- Execute dynamic SQL string.
    prepare s1 from @sqlstring;
    execute s1;
    deallocate prepare s1;
  end if;
end $$

delimiter ;
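As a quick usage sketch (the schema and table names here are hypothetical), the call is simply:

call `util`.`analyse_table`('mydb', 'bookings');
-- or pass null for the schema to match the table name in any schema
call `util`.`analyse_table`(null, 'bookings');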
Focusing on the example of VARCHAR fields, the original variation of this procedure began by searching INFORMATION_SCHEMA for columns of the specified table with a data type like CHAR. Since this is character data, char_length can be used to determine the number of characters in each value in the table.
For example, both of these symbols, representing our bookings record's transaction currency, are one character in size:

select char_length(bk_currency) as num_chars
  from ( select _utf8 '€' as bk_currency
         union
         select '$' ) bookings;
Subsequently, length can be used to determine the number of bytes of data in each value. This is significant for character data as it reveals whether any multi-byte characters are used, which would make the results of the two functions differ.
Unlike our previous example, notice the difference in the function results of this query:

select bk_currency, length(bk_currency) as num_bytes
  from ( select _utf8 '€' as bk_currency
         union
         select '$' ) bookings;
Consequently, although the euro (€) and USD ($) symbols are both only single characters, you may opt to use CHAR(3) or VARCHAR(3) as your column's data type, since some characters like € are represented by multiple bytes and therefore require 3 bytes of storage.
With both functions at our disposal, the procedure creates a dynamic query string against the requisite table that returns the max value of both alongside the literal values found in the INFORMATION_SCHEMA. Once executed via a prepared statement, the procedure returns a result set for our review.
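To illustrate, for a hypothetical bookings table in the util schema with a VARCHAR(3) bk_currency column and a VARCHAR(10) bk_customer column, the generated @sqlstring would look roughly like this:

select 'bookings' as tbl,'bk_currency' as col,'varchar' as typ,3 as len, max(char_length(`bk_currency`)) as len_data, max(length(`bk_currency`)) as len_bytes from `util`.`bookings`
union
select 'bookings' as tbl,'bk_customer' as col,'varchar' as typ,10 as len, max(char_length(`bk_customer`)) as len_data, max(length(`bk_customer`)) as len_bytes from `util`.`bookings`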
Example results:
Upon examination of the above, please note that you would manually make recommendations based on the comparison between the len, len_data and len_bytes results. Additionally, note that you would need different coding to handle data types like SET (highlighted above), or even numerical values, to determine whether the range of values could be better stored.
[step="" title=""]
««Recommendations Explored»» Here are some samples of data associated and how you might recommend as an example. Since I mentioned the Set based by_type, I will start with that and then show the VarChar bk_customer.
Since SET is a special type, I won't get too deep into whether to use it or not -- I actually wasn't using it here, and converted this particular column just as an example. But if you do use it, hopefully you understand that the size of 15 shown comes from the combination of the values in the set. Moreover, since each individual selection within the set is 7 characters in length, it isn't really an issue in this case that our character/byte length is half the field's defined maximum length. Your recommendation here would probably be to leave this "as-is", if you endorse the SET data type in the first place.
Now, remember that here our defined maximum is 10 characters; however, our longest actual data value is 8 characters, like "C1234318". Depending on your motives for the audit, you could certainly recommend VARCHAR(8) here.
Adding our brains back into the equation, let's look at the variance in length a bit. Since we are not using a fixed format like "C0000001", if you have a lot of customers or high customer turnover, like a job shop with no repeat business, you may opt instead to leave the extra 2 characters for future growth when you move past "C9999999". This may prompt you to amend the procedure above to send back the minimum, standard deviation or other statistical aggregates alongside our max lengths, to quickly gauge the volatility or variability of the data -- see the sketch below.
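As a rough sketch of that amendment, here is the kind of per-column query the modified procedure might generate for our hypothetical bk_customer column (the aliases are my own; STD is MySQL's standard deviation aggregate):

select 'bookings' as tbl, 'bk_customer' as col,
       max(char_length(`bk_customer`)) as len_max,
       min(char_length(`bk_customer`)) as len_min,
       round(std(char_length(`bk_customer`)), 2) as len_std
  from `util`.`bookings`;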
Therein lies the con to this solution: manual recommendations... so is there a better way?
2. Oh! It's Even Easier…
Use Procedure Analyse. The new and improved analyse_table procedure:
delimiter $$

drop procedure if exists `util`.`analyse_table`;

create procedure `util`.`analyse_table`(in schemaname varchar(255), in tablename varchar(255), in max_elems int, in max_mem int)
begin
  -- Build dynamic SQL string
  select concat('select `'
              , group_concat(column_name separator '`,`')
              , '` from `', table_schema, '`.`', table_name
              , '` procedure analyse(', max_elems, ',', max_mem, ');')
    from information_schema.columns
   where (schemaname is null or table_schema = schemaname)
     and table_name = tablename
  -- and data_type like '%char'
  -- and not character_maximum_length is null
   order by ordinal_position
    into @sqlstring;

  -- If no table found, skip processing.
  if (not @sqlstring is null) then
    -- Execute dynamic SQL string.
    prepare s1 from @sqlstring;
    execute s1;
    deallocate prepare s1;
  end if;
end $$

delimiter ;
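A usage sketch, again with hypothetical schema/table names, passing the ENUM-related limits straight through to procedure analyse:

call `util`.`analyse_table`('mydb', 'bookings', 10, 256);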
Building on the prepared statement approach, we automate the much simpler call using procedure analyse, which typically takes this form:

select bk_salesperson, bk_customer, bk_memo, bk_type
  from bookings
procedure analyse(10, 256);

The two parameters passed are used in determining ENUM data type recommendations: "10" represents the maximum number of distinct elements and "256" the maximum memory to use while analyzing distinct elements. I personally prefer the lookup table approach, so I would tend not to care for any ENUM recommendations.
Note: In my testing, I found that the max_elements parameter does indeed appear not to work, as asserted in the discussion on the MySQL documentation page. What appears to be happening is that max_memory controls this more than anything. If you pass (1, 128) and 128 is a big enough value to analyze the distinct values in your table, ENUM will still be recommended for the column. On the other hand, if you pass (128, 1), then ENUM will not be recommended even though you have indicated it may use 128 distinct values.
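A sketch of that test, assuming the same bookings table as before:

-- 128 bytes is enough memory to collect the distinct bk_type values, so ENUM is still suggested
select bk_type from bookings procedure analyse(1, 128);
-- 1 byte is not, so no ENUM suggestion despite allowing up to 128 distinct elements
select bk_type from bookings procedure analyse(128, 1);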
To keep things flexible, the procedure signature was updated to include the elements and memory parameters, in case some of you care to use the ENUM recommendations.
Comparing the two versions of our analyse_table procedure, you will notice that this version, using MySQL's procedure analyse, recommends data type sizes based on length in bytes, which is consistent with using CHAR(3) versus CHAR(1) for our bk_currency example data.
As you will see from examination of the new results, you don't have to manually derive type recommendations, as you get optimal field type suggestions like "CHAR(8) NOT NULL". Additionally, data types such as SET, INT, et al. are all handled appropriately without requiring additional code.
With that said, procedure analyse also works with the wildcard select syntax:
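For example, against our bookings table:

select * from bookings procedure analyse(10, 256);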
Our example procedure call is equivalent to the above in results; therefore, the real advantage of the dynamic procedure call is the ability to analyze specific types of columns, such as character data only, without having to know each column name.
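For instance, to limit the analysis to character columns only, you could uncomment the data_type filter already present in the procedure's WHERE clause:

   where (schemaname is null or table_schema = schemaname)
     and table_name = tablename
     and data_type like '%char'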
Summary
In the above, we looked at a nice resource for database administrators in procedure analyse. Please use it wisely and remember not to get short-sighted. For example, if your data only holds 8-character strings today but may grow to 16 characters in future iterations of your application, the procedural code will not have this insight, so as with anything else computer-related: don't check your brain at the door. If you have the physical resources to sustain a slightly bigger size than currently needed, then it may be okay to do so, especially if the higher length doesn't require additional storage bytes. Remember our example above, where business needs may warrant leaving flexibility for growth in data length because of the nature of the values; with VARCHAR, the amount of storage required is based on the actual values and not the size specified, so the added flexibility doesn't cost you anything extra today.
Therefore:
ensure you understand the storage requirement differences of data types whose requirements vary based on the specified size, such as CHAR;
keep your business or experience in mind, so the ease of having most fields auto-recommended doesn't preclude you from making a smarter choice on a key, business-value column.
=-=-=-=-=-=-=-=-=-=-=-=-=-
If you found this article helpful, please click the Yes button after the question just below. This will give me (the author) a few points and might encourage me to write more articles.
If you didn't, or otherwise feel the need to vote No, please first leave a comment to give me a chance to answer and perhaps improve this article.