<

[2 days left] What’s wrong with your cloud strategy? Learn why multicloud solutions matter with Nimble Storage.Register Now

x

Robust Field Cleaning Function

Published on
5,168 Points
2,068 Views
1 Endorsement
Last Modified:
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within REPLACE functions). Over time this can be quite unreadable. It can also become a maintenance chore where this is not a single one-off effort.

This happened in a recent question here on EE. I'd like to share the solution that came out of that thread. I have written a function called udf_CleanDataFn. It can be invoke within any query as follows:
 
SELECT dbo.udf_CleanDataFn(YourColumn, @Options)

@Options is a varchar(255) field, which can accept any of the following "options":
 
  • ltrim - This will left-trim the value in YourColumn.
  • rtrim - This will right-trim the value in YourColumn.
  • trim - This will both left- and right-trim the value.
  • nocomma - This will remove any and all comma's.
  • nospace - This will remove any and all spaces.
  • nopunc - This will remove all standard punctuations (!@#$% etc)
  • nonum - This will remove all numbers (0-9).
  • noalpha - This will remove all alpha characters (A-Z and a-z).
  • alphaonly - This will remove everything except alpha characters.
  • numonly - This will remove everything except numbers.

Following are a few working examples that will demonstrate the different results . This first example takes out only the commas:
 
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma')
Result: "This ! is * a 1321 @ test"

This example removes all punctuations and numbers:
 
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nopunc, nonum')
Result: "This  is  a   test"

This example removes the alphas and spaces:
 
SELECT dbo.udf_CleanDataFn('    Spaces 123 all 456 around    ', 'noalpha, nospace')
Result: 123456

Where there is very large data, this function can be inefficient, but where small amounts of data need to be cleaned, this can be leveraged on many alpha-numeric fields. My bench tests showed that this could update about 200,000 records a minute. It is not fast compared to an elaborate SQL statement, but you get the benefits of leverage and ease of maintenance where the processing time is acceptable.

Here is the code:
 
CREATE FUNCTION udf_CleanDataFn (
    @Value      varchar(255),
    @Options    varchar(255) )
RETURNS varchar(255)
AS
BEGIN
    IF ISNULL(@Value, '') = ''
        RETURN ''

    DECLARE @DelimOptions varchar(257)

    SET @Options = ISNULL(@Options, 'trim')
    SET @DelimOptions = ',' + @Options + ','

    WHILE CHARINDEX(' ', @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, ' ', '')

    WHILE CHARINDEX(CHAR(9), @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, CHAR(9), '')

    IF CHARINDEX(',ltrim,', @DelimOptions) > 0
        SET @Value = LTRIM(@Value)

    IF CHARINDEX(',rtrim,', @DelimOptions) > 0
        SET @Value = RTRIM(@Value)

    IF CHARINDEX(',trim,', @DelimOptions) > 0
        SET @Value = RTRIM(LTRIM(@Value))

    IF CHARINDEX(',nocomma,', @DelimOptions) > 0
        SET @Value = REPLACE(@Value, ',', '')

    SET @Value = REPLACE(@Value, CHAR(9), ' ')
    WHILE CHARINDEX('  ', @Value) > 0
        SET @Value = REPLACE(@Value, '  ', ' ')

    IF CHARINDEX(',nospace,', @DelimOptions) > 0
        WHILE CHARINDEX(' ', @Value) > 0
            SET @Value = REPLACE(@Value, ' ', '')

    DECLARE @asc        tinyint,
            @chr        char(1),
            @len        tinyint,
            @ndx        smallint,
            @NewValue   varchar(255),
            @use        tinyint,
            @WkField    varchar(255)

    SET @len = LEN(@Value)
    SET @ndx = -1
    SET @NewValue = ''
    SET @WkField  = '"' + @Value + '"'

    WHILE @ndx <= @len 
    BEGIN
        SET @ndx = @ndx + 1
        SET @chr = SUBSTRING(@Value, @ndx, 1)
        SET @asc = ASCII(@chr)
        SET @use = 0

        IF @asc BETWEEN 32 AND 126
            SET @use = 1
                       
        IF CHARINDEX(',nopunc,', @DelimOptions) > 0
            IF @asc < 48 OR @asc BETWEEN 58 AND 64 OR @asc BETWEEN 91 AND 96 OR @asc > 122
                IF @asc <> 32
                    SET @use = 0

        IF CHARINDEX(',nonum,', @DelimOptions) > 0
            IF @chr LIKE '[0-9]'
                SET @use = 0

        IF CHARINDEX(',noalpha,', @DelimOptions) > 0
            IF @chr LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',alphaonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',numonly,', @DelimOptions) > 0 OR CHARINDEX(',numericonly,', @DelimOptions) > 0 OR CHARINDEX(',numberonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[0-9]'
                SET @use = 0

        IF @use = 1
            SET @NewValue = @NewValue + SUBSTRING(@Value, @ndx, 1)
    END

    RETURN @NewValue
END

Open in new window

Thank you for reading my article. Your feedback is welcome. If you liked this article or are interested in more like these, please click the Yes button for: Was this article helpful?
1
Comment
Author:dsacker
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 5
  • 3
  • 2
  • +1
11 Comments
 
LVL 53

Expert Comment

by:Ryan Chong
Hi just have some comments on this...

1. what if we having:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'ltrim, trim')

will trim be executed?

>>IF CHARINDEX('trim', @Options) > 0 AND CHARINDEX('ltrim', @Options) = 0 AND CHARINDEX('rtrim', @Options) = 0
knowing you put the extra checks for trim option is to avoid accidentally comparing with ltrim or rtrim.

2. The delimiter for @Option is also not verify, it can be any charaters, or we can simply omit it, like:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocommanonum')

will get the same result as:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma, nonum')

or

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma @@ nonum')

knowing this not really crucial but just wondering if we need to add in some validation on this?  ; )
0
 
LVL 20

Author Comment

by:dsacker
Hi Ryan. I check for trim while ensuring that ltrim/rtrim are absent, but I was not tight on commas. I've taken your comment about commas to heart (*smile*) and have added edits to delimit the options by commas. If you have time to play with this a little bit, please let me know if it looks better. Thanks for your questions and comments. They were helpful.
0
 
LVL 53

Expert Comment

by:Ryan Chong
Hi
Saw the script in your article was updated.

By adding the @DelimOptions variable, it seems to resolve and then limit the script to only allow to accept the comma as the delimiter for the options parameter. It looks fine to handle multiple values in options parameter as well, since you adding ',' + @Options + ',' and then trim the value (sorry as currently I don't have a MS SQL installed, neither work or at home! so I can only debug using my own eyes)

My only concern seems not answered on:
>>IF CHARINDEX(',trim,', @DelimOptions) > 0 AND CHARINDEX(',ltrim,', @DelimOptions) = 0 AND CHARINDEX(',rtrim,', @DelimOptions) = 0

as mentioned, if I have a select statement like:

SELECT dbo.udf_CleanDataFn('  test    This !  is *  a  13,21  @  test   test   ', 'ltrim, trim')

will this execute the 2nd parameter "trim" ? (sorry for that again... as I don't have an environment to test with)

I guess you will get: "test    This !  is *  a  13,21  @  test   test   " as a return?

Since you already adding ',' + @Options + ',' in your script, I guess for comparing 'trim' option, you can simple replace:

IF CHARINDEX(',trim,', @DelimOptions) > 0 AND CHARINDEX(',ltrim,', @DelimOptions) = 0 AND CHARINDEX(',rtrim,', @DelimOptions) = 0

to

IF CHARINDEX(',trim,', @DelimOptions) > 0

And after that... See if that works to handle statement like:

SELECT dbo.udf_CleanDataFn('  test    This !  is *  a  13,21  @  test   test   ', 'ltrim, trim')

to return: "test    This !  is *  a  13,21  @  test   test" ?

Shouldn't the script behave like this instead?

>>I've taken your comment about commas to heart (*smile*) and have added edits to delimit the options by commas
Thanks for that... perhaps in your next version, you may consider to let user to choose their delimiter as well but at this moment, I think we no need to be too extreme to make your script to handle every scenario. Just stick with what you have now : )

cheers
0
Learn how to optimize MySQL for your business need

With the increasing importance of apps & networks in both business & personal interconnections, perfor. has become one of the key metrics of successful communication. This ebook is a hands-on business-case-driven guide to understanding MySQL query parameter tuning & database perf

 
LVL 20

Author Comment

by:dsacker
I had originally checked only for trim, but then added the ltrim/rtrim later. Your comments got me to thinking about that, and I agree that it's better to check trim on its own merits. I've changed the code. Appreciate your feedback very much.

I don't think an optional delimiter for @Options is needed, as the option values are strict. :)
0
 
LVL 53

Expert Comment

by:Ryan Chong
>>I don't think an optional delimiter for @Options is needed, as the option values are strict. :)
Yea.. after think twice, I think it's unnecessarily to do so. cheers
0
 
LVL 46

Expert Comment

by:aikimark
How does the regular expression find/replace perform in SS Mgt Studio?
https://msdn.microsoft.com/en-us/library/ms177261.aspx
0
 
LVL 20

Author Comment

by:dsacker
It is a little bit limited, but works fine. You may notice I'm using a little bit in the WHILE loop via the LIKE comparison.
0
 

Expert Comment

by:Nick Sawe
Hi, can this remove carriage return char(13) and line feed char(10)?
0
 
LVL 20

Author Comment

by:dsacker
Yes. It removes those by default.
0
 

Expert Comment

by:Nick Sawe
ok thanks, i will give it a go. thanks for sharing this by the way !
0
 
LVL 20

Author Comment

by:dsacker
You're welcome. If it proves helpful, please click the "Helpful" button. :)
0

Featured Post

Visualize your virtual and backup environments

Create well-organized and polished visualizations of your virtual and backup environments when planning VMware vSphere, Microsoft Hyper-V or Veeam deployments. It helps you to gain better visibility and valuable business insights.

Join & Write a Comment

Familiarize people with the process of retrieving data from SQL Server using an Access pass-thru query. Microsoft Access is a very powerful client/server development tool. One of the ways that you can retrieve data from a SQL Server is by using a pa…
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month