We have XML data in a SQL Server database that we want to search efficiently, but are not sure of the best way to get the results we want. We need to find specific values in certain XML nodes- this will involve finding node values that match given substrings. Check out this tip to learn more.
We will be working with the AdventureWorks2012 sample database - querying the Demographics XML column in the Person.Person table. To see what the Demographics XML data looks like, let's pull a sample Demographics field by selecting a random record:
--ordering by NEWID() function randomizes the sort order SELECT TOP 1 Demographics FROM AdventureWorks2012.Person.Person ORDER BY NEWID() GO
If we expand the XML contents of this field, we see:
It's evident that we are dealing with typed XML, since we can see a namespace declaration (xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey") inside the root node. This means that the XML data is associated with an XML schema that enforces data type and validation rules. We want to get a count of the surveys where the individual's education is listed as �Bachelors'. How should we do this? We can try a rather primitive method that finds the existence of substrings anywhere in the XML column by using the CHARINDEX function:
--filtering XML data using CHARINDEX function SELECT COUNT(1) FROM Person.Person WHERE CHARINDEX('Bachelors',CONVERT(VARCHAR(MAX),Demographics),1) > 0
This does return a count of the number of times the word �Bachelors' appears. However, it's quite clumsy - and worse yet, doesn't guarantee that the text �Bachelors' is a real XML value in the table - it could also be part of a node name. A much cleaner solution is to use XQuery.
XQuery is a language designed for querying XML data, and is not proprietary to SQL Server - it is used by many relational database management systems (RDBMS). A simple implementation of an XQuery solution is as follows:
--filtering XML data using XQuery ;WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns) SELECT COUNT(1) FROM Person.Person WHERE Demographics.exist('(/ns:IndividualSurvey/ns:Education[.="Bachelors"])')=1 GO
This may look a little complicated, but really isn't too hard to decipher. The WITH XMLNAMESPACES statement is needed to explicitly declare the namespace for in order to reference any XML node element or attribute, which we did in the WHERE clause (the Education node element). Another way to declare the namespace is to put the declaration in-line with the actual XQuery method that is being used:
--filtering XML data using XQuery with in-line namespace declaration SELECT COUNT(1) FROM Person.Person WHERE Demographics.exist('declare namespace ns= "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"; (/ns:IndividualSurvey/ns:Education[.="Bachelors"])')=1 GO
In both examples, we declared a namespace and aliased it as �ns', which is used as a prefix for each path step or node reference. We listed the node path steps from root node (IndividualSurvey) to destination node (Education).
The exist() method is used in our query as the WHERE criterion to filter the XML data, finding XML records where the Education node's value is equal to the text �Bachelors'. The exist() method returns a bit value of 1 if the method finds a non-empty node whose element or attribute value matches the given criteria. Upon running the query, however, we get a record count of zero:
What happened? We've assumed too much - that the value we are looking for in the Education node will exactly match the text �Bachelors'. If we look at the sample Demographics XML fragment closely, we see that there is a space after the word. So, if we modify our query and run it again, we get the following:
--filtering XML data using XQuery - adding a space to the string ;WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns) SELECT COUNT(1) FROM Person.Person WHERE Demographics.exist('(/ns:IndividualSurvey/ns:Education[.="Bachelors "])')=1 GO
How can we prevent this oversight from happening again? XQuery in SQL Server makes use of a contains() function that finds a string as a whole substring in a node value. The contains() function requires two arguments:
Using the contains() function, we don't have to be concerned with matching the string exactly:
--filtering XML data using the XQuery contains() function ;WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns) SELECT COUNT(1) FROM Person.Person WHERE Demographics.exist('(/ns:IndividualSurvey/ns:Education[contains(.,"Bachelors")])')=1 GO
We can easily expand our filter by additionally specifying that only records where the Gender node contains �F' are returned:
--filtering XML data using the XQuery contains() function in multiple WHERE criteria ;WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns) SELECT COUNT(1) FROM Person.Person WHERE Demographics.exist('(/ns:IndividualSurvey/ns:Education[contains(.,"Bachelors")])')=1 AND Demographics.exist('(/ns:IndividualSurvey/ns:Gender[contains(.,"F")])')=1 GO
Our contains() function instances use what is called a self node axis abbreviation (�.') as their first argument. This specifies that we want to search in the current node context (Gender or Education). The second argument is for the actual string we are searching for (�Bachelors' or �F').
We assume that our query is returning the records that meet our criteria, but what if we want to see the actual values to verify this? Another XQuery method we can use for this is value(). The syntax for the method is as follows:
We'll employ this method to return node values for the Education and Gender nodes (only the first 3 records), along with the non-XML LastName field:
;WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey' AS ns) SELECT TOP 3 LastName, Demographics.value('(/ns:IndividualSurvey/ns:Education)[1]', 'varchar(50)') AS Education, Demographics.value('(/ns:IndividualSurvey/ns:Gender)[1]', 'char(1)') AS Gender FROM Person.Person WHERE Demographics.exist('(/ns:IndividualSurvey/ns:Education[contains(.,"Bachelors")])')=1 AND Demographics.exist('(/ns:IndividualSurvey/ns:Gender[contains(.,"F")])')=1 GO
The value() method requires an XQuery path argument like the exist() method, but also needs a SQL Server data type argument. We are using �varchar(50)' and �char(1)' as the SQL Server data types for the Education and Gender node values.
The value() method also enforces a singleton rule - requiring that the node element or attribute instance in a sequence be explicitly specified, even if there is only one. For example, there is only one Education node in our XML fragment, but we still have to designate a singleton by placing the �[1]' at the end of the value() method, thus indicating that we want the first Education node.
We've looked briefly at some basic ways to query XML data using XQuery. We introduced the exist() method as an effective XQuery method to use in the WHERE clause. We also experimented with the XQuery contains()function, using it to count XML records having a precise node string value match. Finally, we used a very common XQuery method - the value() method - to return XML values.