<

Robust Field Cleaning Function

Published on
5,655 Points
2,555 Views
1 Endorsement
Last Modified:
Occasionally there is a need to clean table columns, especially if you have inherited legacy data. There are obviously many ways to accomplish that, including elaborate UPDATE queries with anywhere from one to numerous REPLACE functions (even within REPLACE functions). Over time this can be quite unreadable. It can also become a maintenance chore where this is not a single one-off effort.

This happened in a recent question here on EE. I'd like to share the solution that came out of that thread. I have written a function called udf_CleanDataFn. It can be invoke within any query as follows:
 
SELECT dbo.udf_CleanDataFn(YourColumn, @Options)

@Options is a varchar(255) field, which can accept any of the following "options":
 
  • ltrim - This will left-trim the value in YourColumn.
  • rtrim - This will right-trim the value in YourColumn.
  • trim - This will both left- and right-trim the value.
  • nocomma - This will remove any and all comma's.
  • nospace - This will remove any and all spaces.
  • nopunc - This will remove all standard punctuations (!@#$% etc)
  • nonum - This will remove all numbers (0-9).
  • noalpha - This will remove all alpha characters (A-Z and a-z).
  • alphaonly - This will remove everything except alpha characters.
  • numonly - This will remove everything except numbers.

Following are a few working examples that will demonstrate the different results . This first example takes out only the commas:
 
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nocomma')
Result: "This ! is * a 1321 @ test"

This example removes all punctuations and numbers:
 
SELECT dbo.udf_CleanDataFn('This !  is *  a  13,21  @  test', 'nopunc, nonum')
Result: "This  is  a   test"

This example removes the alphas and spaces:
 
SELECT dbo.udf_CleanDataFn('    Spaces 123 all 456 around    ', 'noalpha, nospace')
Result: 123456

Where there is very large data, this function can be inefficient, but where small amounts of data need to be cleaned, this can be leveraged on many alpha-numeric fields. My bench tests showed that this could update about 200,000 records a minute. It is not fast compared to an elaborate SQL statement, but you get the benefits of leverage and ease of maintenance where the processing time is acceptable.

Here is the code:
 
CREATE FUNCTION udf_CleanDataFn (
    @Value      varchar(255),
    @Options    varchar(255) )
RETURNS varchar(255)
AS
BEGIN
    IF ISNULL(@Value, '') = ''
        RETURN ''

    DECLARE @DelimOptions varchar(257)

    SET @Options = ISNULL(@Options, 'trim')
    SET @DelimOptions = ',' + @Options + ','

    WHILE CHARINDEX(' ', @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, ' ', '')

    WHILE CHARINDEX(CHAR(9), @DelimOptions) > 0
        SET @DelimOptions = REPLACE(@DelimOptions, CHAR(9), '')

    IF CHARINDEX(',ltrim,', @DelimOptions) > 0
        SET @Value = LTRIM(@Value)

    IF CHARINDEX(',rtrim,', @DelimOptions) > 0
        SET @Value = RTRIM(@Value)

    IF CHARINDEX(',trim,', @DelimOptions) > 0
        SET @Value = RTRIM(LTRIM(@Value))

    IF CHARINDEX(',nocomma,', @DelimOptions) > 0
        SET @Value = REPLACE(@Value, ',', '')

    SET @Value = REPLACE(@Value, CHAR(9), ' ')
    WHILE CHARINDEX('  ', @Value) > 0
        SET @Value = REPLACE(@Value, '  ', ' ')

    IF CHARINDEX(',nospace,', @DelimOptions) > 0
        WHILE CHARINDEX(' ', @Value) > 0
            SET @Value = REPLACE(@Value, ' ', '')

    DECLARE @asc        tinyint,
            @chr        char(1),
            @len        tinyint,
            @ndx        smallint,
            @NewValue   varchar(255),
            @use        tinyint,
            @WkField    varchar(255)

    SET @len = LEN(@Value)
    SET @ndx = -1
    SET @NewValue = ''
    SET @WkField  = '"' + @Value + '"'

    WHILE @ndx <= @len 
    BEGIN
        SET @ndx = @ndx + 1
        SET @chr = SUBSTRING(@Value, @ndx, 1)
        SET @asc = ASCII(@chr)
        SET @use = 0

        IF @asc BETWEEN 32 AND 126
            SET @use = 1
                       
        IF CHARINDEX(',nopunc,', @DelimOptions) > 0
            IF @asc < 48 OR @asc BETWEEN 58 AND 64 OR @asc BETWEEN 91 AND 96 OR @asc > 122
                IF @asc <> 32
                    SET @use = 0

        IF CHARINDEX(',nonum,', @DelimOptions) > 0
            IF @chr LIKE '[0-9]'
                SET @use = 0

        IF CHARINDEX(',noalpha,', @DelimOptions) > 0
            IF @chr LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',alphaonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[A-Za-z]'
                SET @use = 0

        IF CHARINDEX(',numonly,', @DelimOptions) > 0 OR CHARINDEX(',numericonly,', @DelimOptions) > 0 OR CHARINDEX(',numberonly,', @DelimOptions) > 0
            IF @chr NOT LIKE '[0-9]'
                SET @use = 0

        IF @use = 1
            SET @NewValue = @NewValue + SUBSTRING(@Value, @ndx, 1)
    END

    RETURN @NewValue
END

Open in new window

Thank you for reading my article. Your feedback is welcome. If you liked this article or are interested in more like these, please click the Yes button for: Was this article helpful?
1
Author:dsacker
Ask questions about what you read
If you have a question about something within an article, you can receive help directly from the article author. Experts Exchange article authors are available to answer questions and further the discussion.
Get 7 days free