Community Pick: Many members of our community have endorsed this article.

How to write an XQuery that is the exact equivalent of a SQL OUTER JOIN

Andrew_Webster
CERTIFIED EXPERT
Published:

The Problem

How to write an Xquery that works like a SQL outer join, providing placeholders for absent data on the outer side?  I give a bit more background at the end.

The situation expressed as relational data
Let’s work through this.  I’ve mocked up some data in Access, as I can’t share the original data with you.

We have some data in tblMain, and some related data in tblSub.  We’re going to write a query with an outer join so that we can see everything in tblMain and related data in tblSub.

(An inner join would act as a constraint and limit what was displayed from tblMain to only those records with a match in tblSub.)

Here are some screenshots to illustrate.

  tblMain tblSub
MainID is the Primary Key in tblMain and the Foreign Key in tblSub.

The SQL Solution
Here's the SQL that will produce the outer join that I'm after:
 
SELECT 
                          tblMain.MainID, 
                          tblMain.MainText, 
                          tblSub.SubID, 
                          tblSub.Sub
                      FROM 
                          tblMain 
                      LEFT OUTER JOIN 
                          tblSub 
                      ON 
                          tblMain.MainID = tblSub.MainID;
                      

Open in new window


The result of this SQL is shown in the next figure.  As you can see, and this will be no surprise to you folks who work with databases, the datasheet has empty cells where there is no matching data from tblSub.
  SQL output
Now let's contrast this with the XML case.

Expressed as XML
The same data, once exported to XML, is as follows.
tblMain:
<?xml version="1.0" encoding="UTF-8"?>
                      <dataroot xmlns:od="urn:schemas-microsoft-com:officedata" generated="2010-06-18T16:45:51">
                          <tblMain>
                              <MainID>1</MainID>
                              <MainText>First</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>2</MainID>
                              <MainText>Second</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>3</MainID>
                              <MainText>Third</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>4</MainID>
                              <MainText>Fourth</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>5</MainID>
                              <MainText>Fifth</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>6</MainID>
                              <MainText>Sixth</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>7</MainID>
                              <MainText>Seventh</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>8</MainID>
                              <MainText>Eighth</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>9</MainID>
                              <MainText>Ninth</MainText>
                          </tblMain>
                          <tblMain>
                              <MainID>10</MainID>
                              <MainText>Tenth</MainText>
                          </tblMain>
                      </dataroot>
                      

Open in new window


tblSub:  
<?xml version="1.0" encoding="UTF-8"?>
                      <dataroot xmlns:od="urn:schemas-microsoft-com:officedata" generated="2010-06-18T16:46:11">
                          <tblSub>
                              <SubID>1</SubID>
                              <MainID>1</MainID>
                              <Sub>1 First Sub</Sub>
                          </tblSub>
                          <tblSub>
                              <SubID>2</SubID>
                              <MainID>1</MainID>
                              <Sub>2 First Sub</Sub>
                          </tblSub>
                          <tblSub>
                              <SubID>3</SubID>
                              <MainID>4</MainID>
                              <Sub>1 Fourth Sub</Sub>
                          </tblSub>
                          <tblSub>
                              <SubID>4</SubID>
                              <MainID>8</MainID>
                              <Sub>1 Eighth Sub</Sub>
                          </tblSub>
                      </dataroot>
                      

Open in new window


The “Standard” XQuery solution

Having trawled the internet and the available XQuery books, the standard approach to the outer join problem is as follows.

 
xquery version "1.0";
                      <results>
                      {
                      for $main in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
                      return
                      <row>
                          <MainID>{data($main/MainID)}</MainID>
                          <MainText>{data($main/MainText)}</MainText>
                          {for $sub in doc("C:\Xquery Outer Joins\tblSub.xml")/dataroot/tblSub[MainID=$main/MainID]
                          return
                          <Sub>{data($sub/Sub)}</Sub>
                          }
                      </row>
                      }
                      </results>
                      

Open in new window


This works by looping through tblMain, outputting the required nodes and testing to see if tblSub has nodes that match.

But the problem with this is that it outputs a tree structure.  If there are two matching items in tblSub for an item in tblMain then we don’t see two items from tblMain in the output.  In the world of XML, this is OK, but we’re working with a way to have XML simulate relational data, so this doesn’t work for us here.  It’s correct, but it’s not what we want.  

Moreover, where there is nothing in tblSub that matches tblMain, there is nothing output other than the nodes from tblMain.  A big fat “nothing”!  What we want is a big fat “something” that explicitly tells us that there’s no matching data.

(This is a topic for another article, and has already been a topic for many: what is a Null?  Suffice it to say that I understand Null to mean “We don’t have a value here, but we could”.)

Why it's not what I want
Take a look at the output (displayed in XML Spy).
  XQuery result - not what I want!
Here it is as raw XML.
 
<results>
                          <row>
                              <MainID>1</MainID>
                              <MainText>First</MainText>
                              <Sub>1 First Sub</Sub>
                              <Sub>2 First Sub</Sub>
                          </row>
                          <row>
                              <MainID>2</MainID>
                              <MainText>Second</MainText>
                          </row>
                          <row>
                              <MainID>3</MainID>
                              <MainText>Third</MainText>
                          </row>
                          <row>
                              <MainID>4</MainID>
                              <MainText>Fourth</MainText>
                              <Sub>1 Fourth Sub</Sub>
                          </row>
                          <row>
                              <MainID>5</MainID>
                              <MainText>Fifth</MainText>
                          </row>
                          <row>
                              <MainID>6</MainID>
                              <MainText>Sixth</MainText>
                          </row>
                          <row>
                              <MainID>7</MainID>
                              <MainText>Seventh</MainText>
                          </row>
                          <row>
                              <MainID>8</MainID>
                              <MainText>Eighth</MainText>
                              <Sub>1 Eighth Sub</Sub>
                          </row>
                          <row>
                              <MainID>9</MainID>
                              <MainText>Ninth</MainText>
                          </row>
                          <row>
                              <MainID>10</MainID>
                              <MainText>Tenth</MainText>
                          </row>
                      </results>
                      

Open in new window


As you can see, this is all perfectly correct XML, but that doesn’t make it any closer to being our solution.

What did?

The Solution

If you look back at the SQL for the outer join, it’s pretty simple.  However to get XQuery to do the same thing, we have to lead it by the hand.

We have to get the data from tblMain where it has no matches in tblSub and attach some placeholders for there the data from tblSub is empty.

Then we have to get the data from tblMain and tblSub where they match.

Next, we stick them together with a union operation, then sort them back into the original order, and output them.

Here’s the XQuery.

 
xquery version "1.0";
                      <results>
                      {
                          (: Final variable :)
                          let $result :=
                          (
                          (: First variable :)
                          let $unmatched :=
                          (for $main1 in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
                          where empty(doc("C:\Xquery Outer Joins\tblSub.xml")/dataroot/tblSub[MainID = $main1/MainID])
                          return
                          <row>
                              <ID>{data($main1/MainID)}</ID>
                              <Main>{data($main1/MainText)}</Main>
                              <Sub>{'-'}</Sub>
                          </row> )
                          
                          (: Second variable :)
                          let $matched :=
                          (for $main in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
                          for $sub in doc("C:\Xquery Outer Joins\tblSub.xml")/dataroot/tblSub[MainID = $main/MainID]
                          return
                          <row>
                              <ID>{data($main/MainID)}</ID>
                              <Main>{data($main/MainText)}</Main>
                              <Sub>{data($sub/Sub)}</Sub>
                          </row> 
                          )
                          return 
                          $unmatched|$matched 
                          
                          )
                          (: sort and output :)
                          for $sorted in $result 
                          order by $sorted/ID
                          return $sorted 
                      }
                      </results>
                      

Open in new window


Let’s step through it.

XQuerys can be as simple as an XPath expression, but here we’ll use the entire FLWOR.

We start with a Let, this assigns a set of nodes from tblMain where they have no match in tblSub to a variable, $unmatched.

 
(: First variable :)
                          let $unmatched :=
                          (for $main1 in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
                          where empty(doc("C:\Xquery Outer Joins\tblSub.xml")/dataroot/tblSub[MainID = $main1/MainID])
                          return
                          <row>
                              <ID>{data($main1/MainID)}</ID>
                              <Main>{data($main1/MainText)}</Main>
                              <Sub>{'-'}</Sub>
                          </row> 
                          )
                      

Open in new window


I had to understand two things to get this to work:

•      you can assign the output of a FLWOR to a variable
•      you can use the idea of an “empty” set as the equivalent of a SQL “WHERE blah Is Null” to get unmatched nodes.

As you can see, I output a literal value, “-“, as a placeholder for the absent data from tblSub.

Now I did this again, to get the matched data this time:

 
(: Second variable :)
                          let $matched :=
                          (for $main in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
                          for $sub in doc("C:\Xquery Outer Joins\tblSub.xml")/dataroot/tblSub[MainID = $main/MainID]
                          return
                          <row>
                              <ID>{data($main/MainID)}</ID>
                              <Main>{data($main/MainText)}</Main>
                              <Sub>{data($sub/Sub)}</Sub>
                          </row> 
                          )
                      

Open in new window


Similar idea, except that this time we simply match tblMain with tblSub on MainID – the Primary Key in tblMain and the Foreign Key in tblSub.

Now we can union them, and this is simple in XQuery.  The union operator is the vertical bar “|”.  If you trace your way carefully through the rat’s nest of brackets, you’ll see that they are unioned and assigned to the final variable $result.

 
let $result :=
                          (
                           (
                            Build $unmatched
                           )
                           (
                            Build $matched
                           )
                          return 
                          $unmatched|$matched 
                          
                          )
                      

Open in new window


Now we’ve got this final variable, with all the data in it that we want, all that remains is to hand that over to a For so that we can sort it.  

Remember, a Let assigns a sequence, a set of nodes, to a variable, and it just takes care of them for a while.  A For is a looping mechanism, so that’s where sorting can take place.

 
(: sort and output :)
                          for $sorted in $result 
                          order by $sorted/ID
                          return $sorted
                      

Open in new window


We don’t have to use any constructors here, as all that work has already been done.

Let’s look at the results, first in XML Spy.

  Xquery result - what I do want.  WIN!
See how this isn’t a tree, it’s a table?  That’s what we want.

Here's the output again as raw XML:

 
<results>
                          <row>
                              <ID>1</ID>
                              <Main>First</Main>
                              <Sub>1 First Sub</Sub>
                          </row>
                          <row>
                              <ID>1</ID>
                              <Main>First</Main>
                              <Sub>2 First Sub</Sub>
                          </row>
                          <row>
                              <ID>10</ID>
                              <Main>Tenth</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>2</ID>
                              <Main>Second</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>3</ID>
                              <Main>Third</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>4</ID>
                              <Main>Fourth</Main>
                              <Sub>1 Fourth Sub</Sub>
                          </row>
                          <row>
                              <ID>5</ID>
                              <Main>Fifth</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>6</ID>
                              <Main>Sixth</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>7</ID>
                              <Main>Seventh</Main>
                              <Sub>-</Sub>
                          </row>
                          <row>
                              <ID>8</ID>
                              <Main>Eighth</Main>
                              <Sub>1 Eighth Sub</Sub>
                          </row>
                          <row>
                              <ID>9</ID>
                              <Main>Ninth</Main>
                              <Sub>-</Sub>
                          </row>
                      </results>
                      

Open in new window


And that XML, when run through the gizmo (which I can’t show you, sorry, intellectual property and all that) produces perfect tabular CSV files as required.

But, as you probably gathered, getting there included some good lessons in some of the tricks that are possible with XQuery.
Now, if you’re reading this, and you’re an XQuery ninja, and you’re reading this and thinking “What a klutz!” then please comment and show me the neat way to do this.  I’d love it!

In the meantime, as an old SQL hand, I’m willing to say that I still think XQuery sucks.  Ok, I know, like any technology, it’s good at what it’s good at.  But XML and its related technologies are still the new kids on the block, and they sure as heck don’t get close to the mature capabilities of relational database systems when it comes to dealing with relational data!
Funny, that.

What’s next?

I’ll be using this to create queries that have outer joins with multiple tables, so I’ll be extending this technique.  If I learn anything, I’ll post it here.

Summary
To create an outer join Xquery that provides placeholders where it’s SQL equivalent would provide nulls:

1

Create a FLWOR with its where clause using the XQuery empty function to create a sequence of nodes from the main document that are unmatched in the outer document. Add in placeholder literal elements where the outer data “should” be.

2

Assign the output to a variable.  If you need to add a where clause to filter the data, do so.

3

Create a second FLWOR with its where clause using a straight equality restriction to create a sequence of nodes from both tables: the equivalent of an inner join.  Again, assign the output to a variable.  If the first FLWOR had a restrictive where clause, duplicate it here too.

4

Combine the two variables using the union operator and assign to a third variable.  This is now basically the data you’re after.

5

To order it, use a final FLWOR to loop through the last variable, and use an order by clause to sort it as required.
Done.

Background

I’ve inherited a project where a very large amount of relational data has been extracted as XML for data retention purposes.  As a data nerd, this is not a choice I’ve have made, as for all XML’s wonderful qualities, replacing the capabilities of a full-blown relational database isn’t one of them!  But it’s the situation I’m dealing with.

The XML has been imported into an XML database, which stores the data in libraries (the equivalent of a database) that contains documents (the equivalent of tables).  The XML schema simply matches the original table’s schema, but all metadata, information about indexes, relational integrity etc. has been lost.

The retained data will have to be queried once in a while, and the XML database has a full implementation of XQuery 1.0.
XQuery is one of those languages that looks simple enough, but in practice, it’s a brute!  Maybe it’s just that the grooves in my head have been worn deep with SQL, so it’s hard to wear some new ones with XQuery, but although all the literature blithely states that if you’re a dab hand with SQL, then XQuery will be a snap for you.  Pffft….!

The next layer to this is that there’s  a utility that’s part of the whole framework to be used, that will run an XQuery, display the result in a browser, and give the user the option to output it as CSV.  Once written, the queries can be stored and run by the users.

For the most part, this all works a treat.  However, we found that when dealing with data that needed a query using an outer join, that we had a problem.  While an outer join in SQL produces meaningful nulls to fill in the gaps, XML processors don’t work like that: in XML if it’s missing then it plain ain’t there!  

When run through the gizmo that produced our CSV output, this meant that we were getting ragged structures, and data was showing up in the wrong columns.  This wasn’t acceptable, obviously.

So I wanted to write an Xquery that would somehow replace these missing values with a placeholder and force the gizmo to maintain the data integrity.  This is counter to the mindset for XML: missing is missing!  I needed missing is meaningful.
0
5,924 Views
Andrew_Webster
CERTIFIED EXPERT

Comments (2)

CERTIFIED EXPERT

Author

Commented:
I posted this to the x-query.com mailing list, and so far I've had two really great responses.

First up, from David Carlisle is this:
 
I don't think you need build the matched an unmatched cases separately and then sort them back. Also sorting on $sorted/ID will be a string sort and so (on saxon at least) 10 comes before 2 (I think it depends on your systems default collation

I'd just do something like

xquery version "1.0";
<results>
{
  for $main in doc("tblMain.xml")/dataroot/tblMain
  let $subs :=  doc("tblSub.xml")/dataroot/tblSub[MainID = $main/MainID]
   return
   if (exists($subs))
   then
       for $sub in $subs
       return
       <row>
       <ID>{data($main/MainID)}</ID>
       <Main>{data($main/MainText)}</Main>
       <Sub>{data($sub/Sub)}</Sub>
       </row>
   else
   <row>
       <ID>{data($main/MainID)}</ID>
       <Main>{data($main/MainText)}</Main>
       <Sub>-</Sub>
   </row>
}
</results>

Open in new window


Second up, from Martin Probst, is this simple solution:
 
Both of these illustrate some great lessons in XQuery.  For me, they are a lesson in how I think in SQL, and can visualize SQL solutions, but still really struggle to do so with XQuery.  However, I've been developing with XQuery for fifteen years and with XQuery for three months, so maybe that's not a surprise.

I've got to write queries based on what I've learned here that will operate on XML data representing millions of rows of relational data.  I'll try all three approaches and check them for performance, then I'll post the results here.  
CERTIFIED EXPERT

Author

Commented:
Here's the code for Martin's solution - it didn't make it to the last comment!
<results>
{
for $main in doc("C:\Xquery Outer Joins\tblMain.xml")/dataroot/tblMain
let $subs := doc("C:\Xquery Outer
Joins\tblSub.xml")/dataroot/tblSub[MainID=$main/MainID]/Sub
let $actual := if ($subs) then $subs else '-'
for $sub in $actual
return
<row>
   <MainID>{data($main/MainID)}</MainID>
   <MainText>{data($main/MainText)}</MainText>
   <Sub>{data($sub)}</Sub>
</row>
}
</results>

Open in new window

Have a question about something in this article? You can receive help directly from the article author. Sign up for a free trial to get started.