Wednesday, June 23, 2010

SQL Server T-SQL error: XML parsing: line X, character Y, illegal xml character

A quickie:
If you happen to stumble on this error, probably you have some XML stored in a varchar/nvarchar column and you try to do a convert and filter on it with XQuery. At least this happened to me.

XML parsing: line X, character Y, illegal xml character

This tells us that at the specified char, there is an illegal XML character according to XML RFC. (dig it up on Google).

There are many tricks out there, but none of them are able to identify correctly the problematic chars in the according rows, unless you use some regular expression to filter it.

Oh, SQL Server has scarce support for regexes (at least to my knowledge), so I had to create a .Net CLR C# user defined function. (create a new clr project, rename the assembly with the name of the project, add a user defined function and paste the code, then press deploy on the dbname of your choice.).

[Microsoft.SqlServer.Server.SqlFunction]
public static SqlBoolean RegexIsMatch(SqlString regex, SqlString text)
{
if (regex.IsNull || regex.Value == null || text == null || text.Value == null)
return new SqlBoolean(false);
return new SqlBoolean(new Regex(regex.Value).IsMatch(text.Value));
}
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString RegexReplace(SqlString regex, SqlString originaltext, SqlString texttoinsert)
{
if (regex.IsNull || regex.Value == null)
throw new Exception("First parameter, regex, must be not null.");
if (texttoinsert.IsNull || texttoinsert.Value == null)
throw new Exception("Third parameter, texttoinsert, must be not null.");
if (originaltext.IsNull || originaltext.Value == null)
return originaltext;
return new SqlString(new Regex(regex.Value).Replace(originaltext.Value, texttoinsert.Value));
}



Now the SELECT:
SELECT * FROM YourTable
WHERE YourDB.dbo.RegexIsMatch('.*[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F].*', ProblematicXMLNvarcharColumn)=1

You can test the method with:
SELECT YourDB.dbo.RegexIsMatch('.*[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F].*', 'cosmin'+CONVERT(varchar(100), 0x02)+'bla')

The REGEX identifies characters not in the allowed range according to XML RFC.

Of course on large XMLs it has a performance penalty (it took for me 25 minutes to parse 5 million records), but this is not supposed for day-to-day operation, rather than a cleanup before converting the column to an xml column.

PS: Remember to ALTER the definitions of the functions after deployment, and replace nvarchar(4000) parameters and return values with nvarchar(max). This will allow you to manipulate bigger XMLs.
PS2: I included also a regex replace method, just for fun.