<CharlieDigital/> Programming, Politics, and uhh…pineapples

26Jul/13Off

SQL Query for Multi-Values In An Encoded String

Consider a table with an textual column that encodes multi-values like so:

1|US;2|CA;3|MX

How can we query for all rows using an OR criteria?

For a single value, it's quite easy by searching for the string (in this case, an ISO 2 code).  But what if we need to search for the occurrence of one of n strings?

The following query achieves this using the Microsoft SQL XML data type, the nodes() function, and CROSS APPLY:

-- Create data source; this is just for demo purposes
DECLARE @contacts TABLE
(
    id int,
    contact nvarchar(100),
    countries nvarchar(100) 
)

-- Insert sample test data.
INSERT INTO @contacts VALUES (1, 'Charles', '1|US;2|CA;3|MX') -- US, CA, MX
INSERT INTO @contacts VALUES (2, 'Steven', '1|US;3|MX;2|CA') -- US, MX, CA
INSERT INTO @contacts VALUES (3, 'Arturo', '3|MX') -- MX
INSERT INTO @contacts VALUES (4, 'Silvia', '4|FR') -- FR
INSERT INTO @contacts VALUES (5, 'John', '2|CA;1|US') -- CA, US
INSERT INTO @contacts VALUES (5, 'Johan', '5|DE') -- DE

-- Query for all contacts in US OR MX OR CA (Charles, Steven, Arturo, John)
SELECT
    DISTINCT T1.id,
    T1.contact
FROM (
    SELECT 
        id,
        contact,        
        CAST('<a><c>' + REPLACE(countries, ';','</c><c>') + '</c></a>' AS XML) AS countriesXml
    FROM @contacts  
    ) AS T1
CROSS APPLY T1.countriesXml.nodes('/a/c') T2(c)
WHERE CAST(T2.c.query('string(.)') AS varchar(max)) IN ('1|US', '3|MX', '2|CA')

This should yield the values Charles, Steven, Arturo, and John by first converting the delimited values into XML by simply using string replacement.  Next, the XML is "shredded" using nodes().  For each base row, the shredding generates one row per node (for example, for Charles, we would have one row for US, one row for CA, and one row for MX).

Here is the result of the inner sub-select:

1	Charles	<a><c>1|US</c><c>2|CA</c><c>3|MX</c></a>
2	Steven	<a><c>1|US</c><c>3|MX</c><c>2|CA</c></a>
3	Arturo	<a><c>3|MX</c></a>
4	Silvia	<a><c>4|FR</c></a>
5	John	<a><c>2|CA</c><c>1|US</c></a>
5	Johan	<a><c>5|DE</c></a>

And here is the resultset after shredding:

1	Charles	1|US
1	Charles	2|CA
1	Charles	3|MX
2	Steven	1|US
2	Steven	3|MX
2	Steven	2|CA
3	Arturo	3|MX
4	Silvia	4|FR
5	John	2|CA
5	John	1|US
5	Johan	5|DE

You can see the intermediate resultset using this query:

SELECT
	T1.id,
	T1.contact,
	T2.c.query('string(.)')
FROM (
	SELECT 
		id,
		contact,		
		CAST('<a><c>' + REPLACE(countries, ';','</c><c>') + '</c></a>' AS XML) AS countriesXml
	FROM @contacts 	
	) AS T1
CROSS APPLY T1.countriesXml.nodes('/a/c') T2(c)

Finally, the DISTINCT clause collapses the resultset once again.

From my own testing, better performance can be achieved by creating a table variable with the target values and using a JOIN instead of an IN (about 1/5 of the time). For 100k records, using IN takes about 13.430s. Using a JOIN to a table variable takes about 2.293s.

Posted by Charles Chen

Comments (0) Trackbacks (0)

Sorry, the comment form is closed at this time.

Trackbacks are disabled.