Dealing with Diacritics in Your Data

This post has been republished via RSS; it originally appeared at: SQL Server Support articles.

First published on MSDN on Apr 15, 2013

Question : My customer wants to store data other than English in his table, using the Latin alphabet. For example, he has Spanish, German, Portuguese and English data. He wants to be able to return all the variations of the same character with one query, regardless of whether it has a diacritic mark (accent). For example, if he queries for last name like '%pena%', he wants to get both “Arpeña” and “Arpena” returned. He wants to ensure queries using Tempdb, linked servers, and full-text search work fine as well.

Answer : There could be a few approaches to this, but the easiest solution is to use Accent-Insensitive collation to store this data. What this means is that regardless of accent or diacritic mark added to the character, SQL Server will treat them all as if they do not have one (or as if they all have the same one). Here is an example:

use AccentInsensitiveTest
go
drop table t1
create table t1 (c1 int, c2 nvarchar(2000) collate SQL_Latin1_General_CP1_CI_AI) --Can use Latin1_General_CI_AI also

--note that we are inserting two values one with the ñ and one using n

insert into t1 values (1, 'Arpena'), (2, 'Arpeña')

--now select the data out using either version of the character

select * from t1 where c2 like '%pen%'

go

select * from t1 where c2 like '%peñ%'

Results – as you can see regardless of diacritic mark (accent), SQL Server returns both rows:

c1          c2
----------- -----------------
1           Arpena
2           Arpeña

(2 row(s) affected)

c1          c2
----------- -----------------
1           Arpena
2           Arpeña

(2 row(s) affected)

TempDb Considerations

If your SQL Server collation is different from your database or table collation, then you will encounter a collation error 468. Here is an example, extending the above.

create table #temp (c2 nvarchar(2000))

insert into #temp values ('Arpena'), ('Arpeña')

select * from t1 join #temp on t1.c2 = #temp.c2

Msg 468, Level 16, State 9, Line 3
Cannot resolve the collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "SQL_Latin1_General_CP1_CI_AI" in the equal to operation.

So based on this MSDN article , here is how you deal with this problem:

Create the temp table either with the explicit accent-insensitive collation, or with the collation used in your user database (if your entire database uses accent-insensitive collation).

--Option 1: this will work in all cases

create table #temp (c2 nvarchar(2000) collate SQL_Latin1_General_CP1_CI_AI)

--Option 2: works only if entire user database uses a case-insensitive collation

create table #temp (c2 nvarchar(2000) collate database_default)

Full-Text Search Considerations

Full-Text search is built to locate words in a sentence and then discover the stem of each word. Thus, full-text search is “smart” in recognizing that for example ‘el’ is different from ‘él’ in Spanish. Therefore, Full-text search cannot be “fooled” to return both words at the same time with one query or one clause. Two different clauses would be necessary. In the specific case of querying for last name, full-text is not a solution because names are not really “proper” words in a language (unless the name are in fact language nouns like “joy”).

Linked Servers Considerations

Running a linked server query against the accent-insensitive table, should be pretty transparent.

select * from openquery ([mySQLServer\sql2008r2], 'select * from [AccentInsensitiveTest].dbo.t1 where c2 like ''%peñ%''')

select * from [mySQLServer\sql2008r2].[AccentInsensitiveTest].dbo.t1 where c2 like '%peñ%'

For any additional linked server considerations, review this MSDN article

Namaste!

Joseph

Leave a Reply Cancel reply