SQL Query for Multi-Values In An Encoded String
Consider a table with an textual column that encodes multi-values like so:
1 |
1|US;2|CA;3|MX |
How can we query for all rows using an OR criteria?
For a single value, it’s quite easy by searching for the string (in this case, an ISO 2 code). But what if we need to search for the occurrence of one of n strings?
The following query achieves this using the Microsoft SQL XML data type, the nodes() function, and CROSS APPLY:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
-- Create data source; this is just for demo purposes DECLARE @contacts TABLE ( id int, contact nvarchar(100), countries nvarchar(100) ) -- Insert sample test data. INSERT INTO @contacts VALUES (1, 'Charles', '1|US;2|CA;3|MX') -- US, CA, MX INSERT INTO @contacts VALUES (2, 'Steven', '1|US;3|MX;2|CA') -- US, MX, CA INSERT INTO @contacts VALUES (3, 'Arturo', '3|MX') -- MX INSERT INTO @contacts VALUES (4, 'Silvia', '4|FR') -- FR INSERT INTO @contacts VALUES (5, 'John', '2|CA;1|US') -- CA, US INSERT INTO @contacts VALUES (5, 'Johan', '5|DE') -- DE -- Query for all contacts in US OR MX OR CA (Charles, Steven, Arturo, John) SELECT DISTINCT T1.id, T1.contact FROM ( SELECT id, contact, CAST('<a><c>' + REPLACE(countries, ';','</c><c>') + '</c></a>' AS XML) AS countriesXml FROM @contacts ) AS T1 CROSS APPLY T1.countriesXml.nodes('/a/c') T2(c) WHERE CAST(T2.c.query('string(.)') AS varchar(max)) IN ('1|US', '3|MX', '2|CA') |
This should yield the values Charles, Steven, Arturo, and John by first converting the delimited values into XML by simply using string replacement. Next, the XML is “shredded” using nodes(). For each base row, the shredding generates one row per node (for example, for Charles, we would have one row for US, one row for CA, and one row for MX).
Here is the result of the inner sub-select:
1 2 3 4 5 6 |
1 Charles <a><c>1|US</c><c>2|CA</c><c>3|MX</c></a> 2 Steven <a><c>1|US</c><c>3|MX</c><c>2|CA</c></a> 3 Arturo <a><c>3|MX</c></a> 4 Silvia <a><c>4|FR</c></a> 5 John <a><c>2|CA</c><c>1|US</c></a> 5 Johan <a><c>5|DE</c></a> |
And here is the resultset after shredding:
1 2 3 4 5 6 7 8 9 10 11 |
1 Charles 1|US 1 Charles 2|CA 1 Charles 3|MX 2 Steven 1|US 2 Steven 3|MX 2 Steven 2|CA 3 Arturo 3|MX 4 Silvia 4|FR 5 John 2|CA 5 John 1|US 5 Johan 5|DE |
You can see the intermediate resultset using this query:
1 2 3 4 5 6 7 8 9 10 11 12 |
SELECT T1.id, T1.contact, T2.c.query('string(.)') FROM ( SELECT id, contact, CAST('<a><c>' + REPLACE(countries, ';','</c><c>') + '</c></a>' AS XML) AS countriesXml FROM @contacts ) AS T1 CROSS APPLY T1.countriesXml.nodes('/a/c') T2(c) |
Finally, the DISTINCT clause collapses the resultset once again.
From my own testing, better performance can be achieved by creating a table variable with the target values and using a JOIN instead of an IN (about 1/5 of the time). For 100k records, using IN takes about 13.430s. Using a JOIN to a table variable takes about 2.293s.