Solved

Delete hidden characters in SQL 2012

Posted on 2014-04-23
20
1,168 Views
Last Modified: 2014-04-25
I have a system taking input from a scanner and inserting it in to a database, I have an issue where ISBN is being scanned and a ¿ at the end leaving this data in a bad state for the application.
I need to cut that hidden character out of any item where in this table but not all items have this umlaut. Could anyone help with an update query that will do this?

regards,
0
Comment
Question by:atorex
  • 9
  • 8
  • 2
  • +1
20 Comments
 
LVL 10

Expert Comment

by:JEaston
ID: 40017478
Try something like:

UPDATE tablename SET ISBN = REPLACE ( ISBN, '¿', '')

I've not tested this, so ensure you run on a test table / system first.  Otherwise ensure you have a backup!
0
 
LVL 65

Expert Comment

by:Jim Horn
ID: 40017481
-- Returns 191
SELECT ASCII('¿')

-- Get rid of all ¿'s in table your_table, column your_column
UPDATE your_table
SET your_column = REPLACE(your_column, '¿', '')

btw, in similar situations when I'm looking at text files that may have goofy-riffic characters like '¿', I open UltraEdit, open the file, and do a Ctrl-H to go into Hex mode and view the Hex code of EVERY character.  Very damn handy when dealing with text files generated from mainframes that frequently have these characters.

Good luck.
Jim
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40017606
The dodgy character could be a unicode character or a control character, you can use Replace as the above suggestions but further down or a few days later you may come across another rogue character or combination of control characters...
As far as I know valid ISBN values can be represented by varchar datatype so I'd use something like

Update <my_table> set ISBN = Convert(Varchar(13), ISBN);

It doesn't change the data-type used for the storage of the ISBN, just ensure anything which cannot be represented by Varchar will be removed.
0
 

Author Comment

by:atorex
ID: 40018021
updating the column to varchar 13 does drop the hidden portion however new data will come in with it as per below, it actually represents itself as a NUL.
Any ideas how I can get future data to exclude this?


9780964339330NUL]
0
 
LVL 65

Expert Comment

by:Jim Horn
ID: 40018040
>Any ideas how I can get future data to exclude this?
If you can control the source of this data, a good smacking to remove these characters will do nicely.   If they're on the fence about doing so, estimate high on how many hours it will take for you to fix, and send them a bill.

If not, then you're going to have to run the update with every feed, either by including the UPDATE in your SSIS package, or creating a staging table(s) where you do validations like this, transforms, and then write from that staging to your ultimate target.

No other choices.
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40019426
If it's ISBN values you're concerned about you can create a function like the following for validating the data:

CREATE FUNCTION [dbo].[ValidateISBN]
(  @string VARCHAR(13)
)
   RETURNS VARCHAR(13)
AS
BEGIN
DECLARE @IncorrectCharLoc SMALLINT = PATINDEX('%[^0-9,x,X]%', @string);

WHILE @IncorrectCharLoc > 0
SELECT
  @string = STUFF(@string, @IncorrectCharLoc, 1, '')
, @IncorrectCharLoc = PATINDEX('%[^0-9,x,X]%', @string);

RETURN @string
END


You can then use that function for updating your data as in following:
Update <my_table> set ISBN = ValidateISBN(ISBN);
0
 

Author Comment

by:atorex
ID: 40020937
I have pulled a production backup and found that there is data longer than 13 so updating to varchar 13 will cut all other data.

with the validation script above does it cut out the nul or does it just validate?
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40021130
You can use longer varchar if you want; I've used 13 because that's the maximum expected length for a valid ISBN. Anyhow the length doesn't play any important role for the cleanup here, so you can increase it to avoid unnecessary truncation. The important bit is the combination of PatIndex and Stuff which gets rid of all characters except for digits 1-9 and X which are the only valid characters for ISBN.

Good Luck with your ptoject.   : )
0
 

Author Comment

by:atorex
ID: 40021196
Thanks for the help, this will work changing the length to the schema length, one last question, how could I turn that function to just a script I have a java tool that will make the changes on the several production systems if I had it as a script I could add it to the tool.

regards,
Atorex
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40021221
Using the same length as the existing schema is very sensible.
The function need to be this way, i.e. as a SQL Server function, otherwise you won't be able to use it at the right side of the equal sign.  Id all production systems are on the same server, you can create the function in master db, which is ideal for this kind of utility functions, because it'll be accessible from all other DBs on the instance of the SQL Server.

Glad it was helpful. Good luck with your data cleansing.  : )
0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 

Author Comment

by:atorex
ID: 40021285
digging in to the data I found that there are scanned data represented in several ways and this function is stripping out all none numeric characters attached is a clip of the problem data, when I query the data in sql the result is
9781588342614
9780880882453
9780880882453
9781568986883
9781856697958
9781557090577
9781770850132

however the actual data is represented in the database as per below when I export it to a CSV, I would like to cut out just that last character, any ideas on how I can update the function to do just that?

9781588342614¿
9780880882453¿
9780880882453¿
9781568986883¿
9781856697958¿
9781557090577¿
9781770850132¿
0
 

Author Comment

by:atorex
ID: 40022873
I ran this update statement but the result was the same character was not removed, any other ideas on how I can remove this without damaging other data?
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022900
I don't think it's the actual data in the database but it's representation in the flat file, perhaps either the column delimiter or (if the sample in your post is actually the whole l) the row delimiter is a combination of characters that cannot be displayed properly in your text editor. you can check this by trying to export to a flat file again, but this time when defining the flat file pay attention to the col/row delimiter options; for instance use comma for col delimiter and [CR][LF] (carraige return + Line feed) for row delimiter and verify the results.
0
 

Author Comment

by:atorex
ID: 40022962
the below is how it is represented in Excel exported in to a csv.

9.7816E+12
9.78019E+12
9.78191E+12
3.32968E+12
9.7804E+12
9.7804E+12
9.78031E+12
9.78052E+12
3.32968E+12
9.78006E+12
9.78157E+12
9.78038E+12
9.78157E+12
9.78191E+12
9.78191E+12
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40022993
I see you have exported the file to Excel rather than a flat file! But that doesn't matter as the point of this exercise was to verify the data and to make sure it doesn't include any dodgy  characters, which has been achieved now! If there were any rogue characters in the data, excel wouldn't consider it as valid numeric values and wouldn't convert it to scientific notations. So rest assured the data in your DB is fine.

As a FYI, the scientific notation is because:
1. Since all ISBN values are only digits, Excel has jumped into conclusion that the data is numeric;
2. If the column width is not enough to show the whole numeric value it automatically switches to scientific notation.

In order to prevent it from doing so, you can right-click and select the format cell to be numeric with no digits after decimal point and make sure the column width are wide enough before saving the file as CSV... or as I suggested in the first place just export the data directly to a to a flat file.
0
 

Author Comment

by:atorex
ID: 40023085
Sorry the CSV I sent was from an edited table (to may of them) the table that contains the ISBN issue if I export in to a CSV and open using Excel this is the result.
9781588342614¿
9780880882453¿
9780880882453¿
9781568986883¿
9781856697958¿

if I open the same CSV with notepad ++ I get the attached png as I cant past it here.
is there a way for me to just remove the last 2 characters from a query result.
in order to get the above I run the query with a where len is 15 but the ISBM should be 13 so if I could just cut the last 2 digits I should be OK!


lastly I think I can copy the data with the length of 15 producing a result of all the bad ISBN and I can run the function above thus updating the ISBN to 13, import this new data in to a temp table. question is now how could I update the original table with the new 13 digit ISBN from the temp table?

this would be the query that would result in the correct data being pulled I would like to edit this select to then update the original table, the trans_seq is the primary key. this may be the only way I can get the data I need fixed without affecting other data the column is set to NVarchar 60 so there are many other length and data there not just ISBN and I cant cut the rest to just numeric.

Thanks,
Atorex  


ELECT [scanned_item_id] ,[trans_seq]
  FROM [dbo].[trl_sale_lineitm_tmp]  where len(scanned_item_id) = '13'
0
 
LVL 8

Expert Comment

by:ProjectChampion
ID: 40023119
There's no attachment.
0
 

Author Comment

by:atorex
ID: 40023137
Sorry, helps to click on the upload file link!
csv.png
0
 
LVL 8

Accepted Solution

by:
ProjectChampion earned 500 total points
ID: 40023299
I agree, you definitely have some data cleansing to do.
If you have already created the ValidateISBN function in my earlier post, increase the size of @string (if you haven't already), say set it to varchar (255).
Then run the following query:
Select *,ValidISBN(ISBN) FROM <my_table> WHERE ISBN <> ValidISBN(ISBN);

This should return all rows that need to be cleaned. If it doesn't return anything then let me know and I'll suggest an alternative solution. If it returns the result and you're happy for the
dodgy ISBN's to be replaced with ValidatedISBN, then just update the table using the following query in which, xx is the size of ISBN field in your table:
; With CTE As (Select *,ValidISBN(ISBN) FROM <my_table> WHERE ISBN <> ValidISBN(ISBN))
Update CTE Set ISBN = Substring(ValidatedISBN, 1, xx);

Let me know if that doesn't work and we'll find another way if needed.
0
 

Author Comment

by:atorex
ID: 40023450
I have updated the function as per below and this now gets me the updated ISBN and retains all other data as normal, since the schema created that table with the length of 60 I updated the function to that.
thanks for the help this was a learning experience.

CREATE FUNCTION [dbo].[ValidateISBN]
(  @string VARCHAR(60)
)
   RETURNS VARCHAR(60)
AS
BEGIN

DECLARE @IncorrectCharLoc SMALLINT = PATINDEX('%[^0-9,a-z,A-Z,-]%', @string);

WHILE @IncorrectCharLoc > 0
SELECT
  @string = STUFF(@string, @IncorrectCharLoc, 1, '')
, @IncorrectCharLoc = PATINDEX('%[^0-9,a-z,A-Z,-]%', @string);

RETURN @string
END
0

Featured Post

6 Surprising Benefits of Threat Intelligence

All sorts of threat intelligence is available on the web. Intelligence you can learn from, and use to anticipate and prepare for future attacks.

Join & Write a Comment

I don't know if many of you have made the great mistake of using the Cisco Thin Client model with the management software VXC. If you have then you are probably more then familiar with the incredibly clunky interface, the numerous work arounds, and …
The recent Microsoft changes on update philosophy for Windows pre-10 and their impact on existing WSUS implementations.
This tutorial will walk an individual through the steps necessary to join and promote the first Windows Server 2012 domain controller into an Active Directory environment running on Windows Server 2008. Determine the location of the FSMO roles by lo…
This tutorial will walk an individual through the process of transferring the five major, necessary Active Directory Roles, commonly referred to as the FSMO roles from a Windows Server 2008 domain controller to a Windows Server 2012 domain controlle…

759 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

22 Experts available now in Live!

Get 1:1 Help Now