Solved

Concatenating with FOR XML PATH('') while dealing with special XML characters

Posted on 2009-05-12
26
6,648 Views
Last Modified: 2013-11-05
Hello fellow experts.  I figured it was time to ask a question so here it goes.  

The goal is that for each A in table #t, there will be a single returned value which is the concatenation of all values in column b.  You will see that I am using a correlated sub-select to get the values, and everything works well for #t.a=2.  But since #t.a=1 has the & < and carriage return line feed in the data, the FOR XML PATH('') is converting them into their XML hex values respectively.  

Adding a group by will ensure that each #t.a is only returned once, that's not my concern.

So my question is this.  Come up with a creative, and efficient, way to return a single value for each #t.a that HOPEFULLY does not use a nested replace like I am doing in the third column below.

Points for effort, creativity and participation WILL be given.

Please do NOT suggest a cursor, while loop, or other RBAR based solution.  This HAS to be a single statement and PREFERABLY without the use of a UDF.

Thanks fellow experts and I look forward to your responses.
create table #t

(a   int

,b   nvarchar(20)

)

go

insert into #t values(1,'a&b')

insert into #t values(1,'c<

d')

insert into #t values(2,'wx')

insert into #t values(2,'yz')

go

select a,b,

(select b+N'' 

        from #t 

        where a=t.a

        for xml path('')

       ) 'WITH xml CHARACTERS'

,replace(

replace(

replace(

replace(

replace(

replace(

replace(

(select b+N'' 

        from #t 

        where a=t.a

        for xml path('')

       )

,'&#x20;',char(20))

,'&amp;','&')

,'&lt;','<')

,'&gt;','>')

,'&#x0A;',char(10))

,'&#x0D;',char(13))

,'&#x09',char(9)) 'With XML Replaced'
 

from #t t
 

go

drop table #t

Open in new window

0
Comment
Question by:BrandonGalderisi
  • 10
  • 8
  • 5
  • +2
26 Comments
 
LVL 37

Expert Comment

by:momi_sabag
ID: 24366565
you can solve that by casting the string into varbinary format, and then after the select statement, cast it back
look for my comment here
http://sqlservernation.com/blogs/147.aspx
it has an example of it
0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24366593
Wow.  That's kinda weird being referred to my own web site for a question I asked.
0
 
LVL 37

Expert Comment

by:momi_sabag
ID: 24366621
:-)
0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24366631
If you could, show me as a single select returning the following results.  Because for me it doesn't seem to work without assigning it to an XML variable, and using VARIABLE.value which can't be processed as a single statement select.

The desired output should be:
a          
---------- ------------
1           a&bc<
d
1           a&bc<
d
2           wxyz
2           wxyz
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24366797
Sad but true... Ideally need to replace first (put place holders there), and after. Or maybe resort to casting as varbinary first and after.

What you are currently doing is converting to string, so that XML is now varchar effectively.

One of the other subtleties / challenge with replace is that it will insert NULL characters into the string which can sometimes cause a few problems. For example replacing 4 char with 1 means three NULLS  - it is doing a replace, not rewriting the string.

Will have a think of something to replace the replace...
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24366862
That was interesting... No one was here a while ago...

0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24367396
perhaps this will explain it better with something that is easier to grasp than some fictional test data.

The "requirement":
List all objects in tempdb whose names start with "BG_Objects" and include a comma delimited list of their column names.  Do so in a single statement (CTE's OK) without looping through records to build the concatenated field.

The results for the second column should be "[a],[b],[a & b]," and not "[a],[b],[a &amp; b],".


BTW, I am perfectly capable of accepting the "you can't do that" answer if that is the answer.  Heck, I can't think of a way.  

FYI... I will be leaving this question open for 3 days, or until a working solution is found.  At which time I will be awarding points for valid participation in the discussion with more going to the first working solution provider.
use tempdb

go

create table #BG_Objects1 (a int,b int,c int)

create table #BG_Objects2 (a int,b int,[a & b] as a+b)

go

select o.name as oname,(select c.name + ',' from sys.columns c where o.object_id=c.object_id for xml path('')) cnames

 from sys.objects o

 where o.name like '#BG_Objects%'

go

drop table #BG_Objects1

drop table #BG_Objects2

Open in new window

0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24368317
well, a bit shirty, no xml (and who said I am a fan ? ) BUT, it is a single statement, it does the concat...
use tempdb

go

create table #BG_Objects1 (a int,b int,c int)

create table #BG_Objects2 (a int,b int,[a & b] as a+b)

go
 

;with cte_concat as

  (

  select (select max(ordinal_position) from information_schema.columns c where c.table_name =i.table_name) max_num,

         ordinal_position, convert(varchar(1000),column_name) as column_name, table_name 

  from information_schema.columns i where table_name like '#bg_objects%' and ordinal_position = 1
 

  union all
 

  select max_num,i.ordinal_position, convert(varchar(1000),rtrim(c.column_name) + ',' + rtrim(i.column_name)), c.table_name

  from cte_concat c

  inner join information_schema.columns i on c.ordinal_position + 1 = i.ordinal_position and c.table_name = i.table_name

  ) 

select table_name,column_name from cte_concat where ordinal_position = max_num
 

go

drop table #BG_Objects1

drop table #BG_Objects2

Open in new window

0
 
LVL 51

Accepted Solution

by:
Mark Wills earned 250 total points
ID: 24368357
Bugga, forgot the []...
use tempdb

go

create table #BG_Objects1 (a int,b int,c int)

create table #BG_Objects2 (a int,b int,[a & b] as a+b)

go

 

;with cte_concat as

  (

  select (select max(ordinal_position) from information_schema.columns c where c.table_name =i.table_name) max_num,

         ordinal_position, convert(varchar(1000),'['+column_name+']') as column_name, table_name 

  from information_schema.columns i where table_name like '#bg_objects%' and ordinal_position = 1
 

  union all
 

  select max_num,i.ordinal_position, convert(varchar(1000),rtrim(c.column_name) + ',[' + rtrim(i.column_name)+']'), c.table_name

  from cte_concat c

  inner join information_schema.columns i on c.ordinal_position + 1 = i.ordinal_position and c.table_name = i.table_name

  ) 

select table_name,column_name from cte_concat where ordinal_position = max_num

 

go

drop table #BG_Objects1

drop table #BG_Objects2

Open in new window

0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24375403
Good deal mark.  I didn't have a incrementing value like the ordinal position in my data set, so I had to fake it by first running a partitioned row_number, then a nested CTE to build the strings.

Any other suggestions from anyone?
0
 
LVL 51

Expert Comment

by:tedbilly
ID: 24382275
Just use gloves: http://thedailywtf.com/Articles/The_Complicator_0x27_s_Gloves.aspx

T-SQL has limits, you can trick it into doing some amazing things, however, that isn't it's strength.  IE: http://thedailywtf.com/articles/stupid-coding-tricks-the-tsql-madlebrot.aspx

Assuming this data is being consumed by a data access layer why don't you concatenate the data there instead of in the database?
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24382301
Well tedbilly, that certainly is very very interesting. If T-SQL can do fractals (or pretend to), then surely a simple concatenation isn't beyond the realms, just needs more thinking....

Love the gloves story. So very true, so very often...
0
 
LVL 51

Expert Comment

by:tedbilly
ID: 24382342
What I tell the team I manage (we build n-tier applications) is to make sure all SQL statements are as short as possible with minimal CPU cost.  Only do the bare minimum in the database to protect the data and move all complex programing out to the middle tiers.  The reason being is practical for a number of reasons, for one, it minimizes the risk of locking, but the big key is that it's easier to add more non-SQL hardware.

MS SQL doesn't have load balancing clusters, you are stuck with one instance of SQL, however I can add as many web or application servers as I like with a solid design.  Therefore I move as much CPU processing as I can off the SQL server onto the other servers.

Cheers
0
Get up to 2TB FREE CLOUD per backup license!

An exclusive Black Friday offer just for Expert Exchange audience! Buy any of our top-rated backup solutions & get up to 2TB free cloud per system! Perform local & cloud backup in the same step, and restore instantly—anytime, anywhere. Grab this deal now before it disappears!

 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24387093
tedbilly:

Thanks, but I think you missed the point of this exercise.  I guess it wasn't enough to say "Please do NOT suggest a cursor, while loop, or other RBAR based solution.  This HAS to be a single statement and PREFERABLY without the use of a UDF.".  I should have also said not to suggest pulling it out of the database.  

You are making an incorrect assumption that there is a front end to this component.  This is a database task that has to build a single value from a one-to-many value pair, for numerous items at one time.  I don't want to go through each item and use a variable to do the FOR XML PATH('') statement because the potential exists for there to be a high number of top level items, each having 0-X number of values to be concatenated into a single value per item.

Locking is not a concern as this is processed in indexed (non-#) temp tables on 16x servers with 96GB of RAM sitting on a multi-petabyte SAN.

B
0
 
LVL 51

Expert Comment

by:tedbilly
ID: 24392150
Based on your requirements I think you need to use a UDF.  I've seen articles using the XML PATH as a mechanism to do what you are attempting and they all have the same design flaw.  Calling REPLACE to fix the strings.

XPATH is doing exactly what it was designed to do.  Just for the fun of it, try the UDF and compare the performance on a large table against using the XML PATH without using the replace.

I am surprised that a database application like this exists without any other tiers.  The data must come from somewhere and be used somewhere else outside of the database.  I'll have to trust you on that one.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24392600
And I was expecting a bit of a debate - sheesh...

Question, are the resulting output columns predictable and consistant ?   Not meaning the input, or the source data, but the final result set ?

0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24396517
"I am surprised that a database application like this exists without any other tiers.  The data must come from somewhere and be used somewhere else outside of the database.  I'll have to trust you on that one."

Here's more of an explanation.

The process is invoked on the set of data for which a user has chosen to perform certain actions.  The action being taken, is the generation of NEW data from existing data.

So the user is presented with a list of the various data components that have been defined in their system.  For this example, "Field A" is a one-to-one field to the base objects, "Field B" is a one to many field to the base objects, "Field C" is a one to many object to the base objects and "Date A" is a one-to-many field to the base objects.

Now, you have Base object 1 and 2.

Base Object 1:
  Field          Value
  Field A:       "Ted"
  Field B:       "ABC"
                 "XYZ"
  Field C:       "Dell"
                 "HP"
                 "IBM"
  Date A:        "2009-05-15"
                 "2009-05-14"
                 "2009-05-13"
                 "2009-05-12"
                 "2009-05-11"

Base Object 2:
  Field          Value
  Field A:       "Brandon"
  Field B:       "Hello"
  Field C:       "Dell"
                 "IBM"
  Date A:        "2009-04-12"
                 "2009-04-11"
                 "2009-04-10"
                 "2009-04-09"


Base Object 3:
  Field          Value
  Field A:       "Mark"
  Field B:       "Experts-exchange.com"
                 "hire-an-expert.com"
                 "sqlservernation.com"
  Field C:       "Dell"
  Date A:        "2009-04-11"

Now let's say the user wants to create some new field.  That field is a composite field.  I will use ++ to show concatenation and "" to indicate free text that they can supply

Field A++" - "++Field B(MIN)++" - "++Field C(ALL, Text qualifier ", Delimiter ;)++" - "++Date A(max)

So the value would be:
Object 1.
     Ted - ABC - "Dell";"HP";"IBM" - 2009-05-15
Object 2.
     Brandon - Hello - "Dell";"IBM" - 2009-04-12
Object 3.
     Mark - Experts-exchange.com - Dell - 2009-04-11

Now let's pretend that there are 230K base objects, that some of the fields have 30-40 values, and that their composite field is FAR more complex with formatting options on date fields and other fields.
0
 
LVL 51

Expert Comment

by:tedbilly
ID: 24400445
How does the user interact with the database?  Is there nothing inbetween or do they fire a stored procedure directly?
0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24401084
They click a button on an asp.net page, then they log out and walk away.  The entire process that is initiated from the time they click that button can take anywhere from 30 seconds to several hours.  Depending upon the number of base objects they have chosen to include in this action.  The complexity of the "new data" generation also is a factor, but the database processing won't even be the long part of the process.  The long part will be that each base object will, after database processing has been completed, have one or more files generated in the file system for it.  

The files may be named after one of these "new data" values.  They will also contain up to 6, user defined "new data" values that must be prepared prior to the file operation starting.

Our system is a VERY database heavy application.  We do database caching of searches so that searches are available when a user gets bounced to a different web server in a web farm.  Like I said our db servers are each usually 16 cores with 96GB of ram.  A typical database is about 40GB, with them being as large as 150GB.

So again, I will ask that if you want to participate in this fun little experiment, that you abide by my requests and stop telling me what I should and should not be doing and focus on what I have asked.

Mark has already set me in the right direction and I have developed a working solution that does not use xml concatenation.  I am leaving this open for now in hopes of getting some of the great DB minds at EE to throw in their input.

Thanks,
BrandonGalderisi
0
 
LVL 51

Expert Comment

by:tedbilly
ID: 24401209
Well you are missing the point.  You see, in my mind the experiment is picking the right tool and technology for the job.  I don't want to sound arrogant I simply want you to understand my point of view.  My father was a System Analyst for IBM.  He taught me how to write code using punch cards on an IBM 360 Series in 1973.  I've been programming ever since.  That's 36 years.  I grew up talking with my father not only about what technology can do, but how to do it, about the design decision making process et cetera.

So from a high level point of view, you are using XML PATH in a way that it wasn't intended, to get a SQL database to do something it's not good at.  Isn't that a red flag?  SQL wasn't originally designed for recursion and because it's not a compiled language, code execution time will never match Java, .NET or C++.

I work at a company with over 7,000 full-time developers and am concerned a go to guy when it comes to analyzing application performance.  One SQL heavy database application was running into some scalability issues with hierarchical data.  As you know, as a RDBMS, MS SQL is not a hierarchical database.  The performance problems were due to the attempt to add recursion and reentrant functionality to SQL.  So, I talked them out of using SQL.

The've rewritten the functionality in C# and it is literally thousands of times faster.

Would you be willing to guarantee 100% that your business process could not be written using another language?  Your profile says you are "open to criticism", so I hope that is still true.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24401538
>>you are using XML PATH in a way that it wasn't intended<<

Maybe, it is one of my favourite little uses, so let me attempt to defend (?) it's use, or maybe even convert you (hehehe) ? .

If we examine the reason why For XML exists, then we quickly realise it is first and formost reason to live is to string concatenate individual rowsets. Addmittedly with tags, but also allows us not to use tags. So, is perfect as a mean of being able to convert multi-rows into a streaming string of characters... And it is very, very fast, it has been optimised beyond any recognisable code that we could create to concatenate data elements (heart of the engine type stuff).

Yes, it is a matter of correct tool for the job, and bending data is not unreasonably achieved within the database. Other databases do have facilities more suited to this type of requirement, and could reasonably argue that so does MS by using the XML function. It just needs management of special characters - there are only 5. That in itself is not neccessarily a failure of any toolset as much as a definite problem with converting information into a new format. And really if we wanted to we could export the XML with a style sheet, but then that really does become a matter of bending XML to be something else.

Reality check... Doing the replace on the special characters to ensure it formats as a string properly is likely an issue regardless of tool. Here it happens to include a few printable characters like ampersand, lessthan, greaterthan, quote and also (technically) apostrophe. There are in fact hundreds of them depending on style sheet. And then there are the unprintable characters that can also compromise a string structure - things like "bel" won't but a "htab" will (see attached).

If you look at W3.org you can see that there are entire chapters devoted to wasy of dealing with it - it is part of the XML standard : http://www.w3.org/TR/REC-xml/

If you look at MS then they too suggest a fix for escape characters : http://support.microsoft.com/kb/308060

And in reality, the list of "esccaped" characters is really significant : http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

So, it really isn't a problem with the use of For XML as a technique for concatenation as much as it is a recognised problem as to how XML manages special characters.

In that regard, really, the replace() function does work a treat (in terms of functionality).

And really someone should look at the range of XML replacement characters and build a function to either encode or decode that range. The trouble with a UDF in that case is that it will become a multi-step UDF, a scaler with a BEGIN and END and as soon as you say that, then performance gets hit big time - not so much in the execution plan, but in actual execution, the UDF suddenly becomes runtim line by line interpretative. In that regard, better to write a non-scaler CLR function... Now, if it is a table UDF, could 'join' the string as a tabletype to the table of xml characters and export an approriate replacement character - a file driven UDF does perform very well (go figure - just having a formal list of column names in the "returns" clause).

But also concede the point that sometimes a bit of code at the correct place can achieve wonders - but is that really any different to using replace ?

So, whaddya reckon ? Hopefully at least converted (or a bit of doubt maybe) to the extent that it might not be considered a corruption of the "for XML" function... ? Maybe ?
Character-Codes.pdf
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 24401565
Oh, just one more little thing... As a developer, I do agree that generating purpose built compiled code will outperform any interpretive environment. As a DBA, I also see the wisdom of getting the engine to do the hard work in reducing the rowset through purpose built "engine" functionality, then it will likely perform that function more efficiently.
 


 
0
 
LVL 60

Assisted Solution

by:chapmandew
chapmandew earned 250 total points
ID: 24404737
OK, first things first, here is the solution to the problem.  The .value portion of the xml data type is the key.

select
      ComeGetSome.value('.', 'varchar(1000)')
from
(
      select  
      cast((select b+N''
      from #t r
      where r.a=t.a
      for xml path('SQLServerNation')
      ) as xml)'ComeGetSome', ranking = DENSE_RANK() over(partition by a order by newid())
      from #t t
) x
where ranking = 1
      
As for the comment:  "you are using XML PATH in a way that it wasn't intended"

How on earth do you know?  XML PATH, like any other tool, is designed to solve problems.  I will use it at any chance I get.  It works great for this solution. In fact, I'll put it up against any RBAR solution you want to come up with.  

Love,
Tim
0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24412275
chap,
I will be trying your solution today.  As of now, I have it working with Mark's method.
0
 
LVL 39

Author Comment

by:BrandonGalderisi
ID: 24415274
Thanks chap and Mark.  Both solutions provided me with the desired outcome within the scope of the requirements.  I will be doing some various testing within my system to determine which is the best.

For various reasons, I will be unsubscribing from this thread once I have accepted the solutions.  The two of you, and the mods/admins, are aware of how to get in touch with me.

Thanks again!
0
 
LVL 39

Author Closing Comment

by:BrandonGalderisi
ID: 31580606
Thanks for thinking inside of the defined box!  :)
0

Featured Post

Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

Join & Write a Comment

by Mark Wills PIVOT is a great facility and solves many an EAV (Entity - Attribute - Value) type transformation where we need the information held as data within a column to become columns in their own right. Now, in some cases that is relatively…
In this article I will describe the Copy Database Wizard method as one possible migration process and I will add the extra tasks needed for an upgrade when and where is applied so it will cover all.
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…

708 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now