Robust Field Cleaning Function

Published on
5,314 Points
1 Endorsement
Last Modified:
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within REPLACE functions). Over time this can be quite unreadable. It can also become a maintenance chore where this is not a single one-off effort.

This happened in a recent question here on EE. I'd like to share the solution that came out of that thread. I have written a function called udf_CleanDataFn. It can be invoke within any query as follows:
SELECT dbo.udf_CleanDataFn(YourColumn, @Options)

@Options is a varchar(255) field, which can accept any of the following "options":
  • ltrim - This will left-trim the value in YourColumn.
  • rtrim - This will right-trim the value in YourColumn.
  • trim - This will both left- and right-trim the value.
  • nocomma - This will remove any and all comma's.
  • nospace - This will remove any and all spaces.
  • nopunc - This will remove all standard punctuations (!@#$% etc)
  • nonum - This will remove all numbers (0-9).
  • noalpha - This will remove all alpha characters (A-Z and a-z).
  • alphaonly - This will remove everything except alpha characters.
  • numonly - This will remove everything except numbers.

Following are a few working examples that will demonstrate the different results . This first example takes out only the commas:
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma')
Result: "This ! is * a 1321 @ test"

This example removes all punctuations and numbers:
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nopunc, nonum')
Result: "This  is  a   test"

This example removes the alphas and spaces:
SELECT dbo.udf_CleanDataFn('    Spaces 123 all 456 around    ', 'noalpha, nospace')
Result: 123456

Where there is very large data, this function can be inefficient, but where small amounts of data need to be cleaned, this can be leveraged on many alpha-numeric fields. My bench tests showed that this could update about 200,000 records a minute. It is not fast compared to an elaborate SQL statement, but you get the benefits of leverage and ease of maintenance where the processing time is acceptable.

Here is the code:
    @Value      varchar(255),
    @Options    varchar(255) )
RETURNS varchar(255)
    IF ISNULL(@Value, '') = ''
        RETURN ''

    DECLARE @DelimOptions varchar(257)

    SET @Options = ISNULL(@Options, 'trim')
    SET @DelimOptions = ',' + @Options + ','

    WHILE CHARINDEX(' ', @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, ' ', '')

    WHILE CHARINDEX(CHAR(9), @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, CHAR(9), '')

    IF CHARINDEX(',ltrim,', @DelimOptions) > 0
        SET @Value = LTRIM(@Value)

    IF CHARINDEX(',rtrim,', @DelimOptions) > 0
        SET @Value = RTRIM(@Value)

    IF CHARINDEX(',trim,', @DelimOptions) > 0
        SET @Value = RTRIM(LTRIM(@Value))

    IF CHARINDEX(',nocomma,', @DelimOptions) > 0
        SET @Value = REPLACE(@Value, ',', '')

    SET @Value = REPLACE(@Value, CHAR(9), ' ')
    WHILE CHARINDEX('  ', @Value) > 0
        SET @Value = REPLACE(@Value, '  ', ' ')

    IF CHARINDEX(',nospace,', @DelimOptions) > 0
        WHILE CHARINDEX(' ', @Value) > 0
            SET @Value = REPLACE(@Value, ' ', '')

    DECLARE @asc        tinyint,
            @chr        char(1),
            @len        tinyint,
            @ndx        smallint,
            @NewValue   varchar(255),
            @use        tinyint,
            @WkField    varchar(255)

    SET @len = LEN(@Value)
    SET @ndx = -1
    SET @NewValue = ''
    SET @WkField  = '"' + @Value + '"'

    WHILE @ndx <= @len 
        SET @ndx = @ndx + 1
        SET @chr = SUBSTRING(@Value, @ndx, 1)
        SET @asc = ASCII(@chr)
        SET @use = 0

        IF @asc BETWEEN 32 AND 126
            SET @use = 1
        IF CHARINDEX(',nopunc,', @DelimOptions) > 0
            IF @asc < 48 OR @asc BETWEEN 58 AND 64 OR @asc BETWEEN 91 AND 96 OR @asc > 122
                IF @asc <> 32
                    SET @use = 0

        IF CHARINDEX(',nonum,', @DelimOptions) > 0
            IF @chr LIKE '[0-9]'
                SET @use = 0

        IF CHARINDEX(',noalpha,', @DelimOptions) > 0
            IF @chr LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',alphaonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',numonly,', @DelimOptions) > 0 OR CHARINDEX(',numericonly,', @DelimOptions) > 0 OR CHARINDEX(',numberonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[0-9]'
                SET @use = 0

        IF @use = 1
            SET @NewValue = @NewValue + SUBSTRING(@Value, @ndx, 1)

    RETURN @NewValue

Open in new window

Thank you for reading my article. Your feedback is welcome. If you liked this article or are interested in more like these, please click the Yes button for: Was this article helpful?
  • 5
  • 3
  • 2
  • +1
LVL 56

Expert Comment

by:Ryan Chong
Hi just have some comments on this...

1. what if we having:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'ltrim, trim')

will trim be executed?

>>IF CHARINDEX('trim', @Options) > 0 AND CHARINDEX('ltrim', @Options) = 0 AND CHARINDEX('rtrim', @Options) = 0
knowing you put the extra checks for trim option is to avoid accidentally comparing with ltrim or rtrim.

2. The delimiter for @Option is also not verify, it can be any charaters, or we can simply omit it, like:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocommanonum')

will get the same result as:

SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma, nonum')


SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma @@ nonum')

knowing this not really crucial but just wondering if we need to add in some validation on this?  ; )
LVL 20

Author Comment

Hi Ryan. I check for trim while ensuring that ltrim/rtrim are absent, but I was not tight on commas. I've taken your comment about commas to heart (*smile*) and have added edits to delimit the options by commas. If you have time to play with this a little bit, please let me know if it looks better. Thanks for your questions and comments. They were helpful.
LVL 56

Expert Comment

by:Ryan Chong
Saw the script in your article was updated.

By adding the @DelimOptions variable, it seems to resolve and then limit the script to only allow to accept the comma as the delimiter for the options parameter. It looks fine to handle multiple values in options parameter as well, since you adding ',' + @Options + ',' and then trim the value (sorry as currently I don't have a MS SQL installed, neither work or at home! so I can only debug using my own eyes)

My only concern seems not answered on:
>>IF CHARINDEX(',trim,', @DelimOptions) > 0 AND CHARINDEX(',ltrim,', @DelimOptions) = 0 AND CHARINDEX(',rtrim,', @DelimOptions) = 0

as mentioned, if I have a select statement like:

SELECT dbo.udf_CleanDataFn('  test    This !  is *  a  13,21  @  test   test   ', 'ltrim, trim')

will this execute the 2nd parameter "trim" ? (sorry for that again... as I don't have an environment to test with)

I guess you will get: "test    This !  is *  a  13,21  @  test   test   " as a return?

Since you already adding ',' + @Options + ',' in your script, I guess for comparing 'trim' option, you can simple replace:

IF CHARINDEX(',trim,', @DelimOptions) > 0 AND CHARINDEX(',ltrim,', @DelimOptions) = 0 AND CHARINDEX(',rtrim,', @DelimOptions) = 0


IF CHARINDEX(',trim,', @DelimOptions) > 0

And after that... See if that works to handle statement like:

SELECT dbo.udf_CleanDataFn('  test    This !  is *  a  13,21  @  test   test   ', 'ltrim, trim')

to return: "test    This !  is *  a  13,21  @  test   test" ?

Shouldn't the script behave like this instead?

>>I've taken your comment about commas to heart (*smile*) and have added edits to delimit the options by commas
Thanks for that... perhaps in your next version, you may consider to let user to choose their delimiter as well but at this moment, I think we no need to be too extreme to make your script to handle every scenario. Just stick with what you have now : )

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

LVL 20

Author Comment

I had originally checked only for trim, but then added the ltrim/rtrim later. Your comments got me to thinking about that, and I agree that it's better to check trim on its own merits. I've changed the code. Appreciate your feedback very much.

I don't think an optional delimiter for @Options is needed, as the option values are strict. :)
LVL 56

Expert Comment

by:Ryan Chong
>>I don't think an optional delimiter for @Options is needed, as the option values are strict. :)
Yea.. after think twice, I think it's unnecessarily to do so. cheers
LVL 46

Expert Comment

How does the regular expression find/replace perform in SS Mgt Studio?
LVL 20

Author Comment

It is a little bit limited, but works fine. You may notice I'm using a little bit in the WHILE loop via the LIKE comparison.

Expert Comment

by:Nick Sawe
Hi, can this remove carriage return char(13) and line feed char(10)?
LVL 20

Author Comment

Yes. It removes those by default.

Expert Comment

by:Nick Sawe
ok thanks, i will give it a go. thanks for sharing this by the way !
LVL 20

Author Comment

You're welcome. If it proves helpful, please click the "Helpful" button. :)

Featured Post

[Webinar] Improve your customer journey

A positive customer journey is important in attracting and retaining business. To improve this experience, you can use Google Maps APIs to increase checkout conversions, boost user engagement, and optimize order fulfillment. Learn how in this webinar presented by Dito.

Join & Write a Comment

Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

Keep in touch with Experts Exchange

Tech news and trends delivered to your inbox every month