-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL Server UTF8 collations #25798
Comments
Can you post a runnable code sample which shows the corruption occurring? |
Sure, |
Note for triage: this is because the Not using static class Program
{
static void Main()
{
using (var connection = new SqlConnection(@"Server=localhost;Database=master;Trusted_Connection=True"))
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = @"DROP DATABASE [UTF8];";
command.ExecuteNonQuery();
command.CommandText = @"CREATE DATABASE [UTF8];";
command.ExecuteNonQuery();
}
connection.Close();
}
using (var connection = new SqlConnection(@"Server=localhost;Database=UTF8;Trusted_Connection=True"))
{
connection.Open();
using (var command = connection.CreateCommand())
{
command.CommandText = @"
CREATE TABLE [MyTable] (
[Id] bigint NOT NULL IDENTITY,
[Param] varchar(255) COLLATE Latin1_General_100_BIN2_UTF8 NULL,
CONSTRAINT [PK_MyTable] PRIMARY KEY ([Id]) );";
command.ExecuteNonQuery();
command.CommandText = @"INSERT INTO [MyTable] ([Param]) VALUES (@p0);";
var parameter = command.CreateParameter();
parameter.DbType = DbType.AnsiString;
parameter.Size = 255;
parameter.ParameterName = "p0";
parameter.Value = "Как дела?";
command.Parameters.Add(parameter);
command.ExecuteNonQuery();
command.CommandText = @"SELECT TOP(1) [m].[Id], [m].[Param] FROM [MyTable] AS [m] ORDER BY [m].[Id] DESC";
using (var reader = command.ExecuteReader())
{
reader.Read();
var value = reader.GetFieldValue<string>(reader.GetOrdinal("Param"));
if (value != "Как дела?")
{
Console.WriteLine($"Wrote {parameter.Value}, read {value}.");
}
}
}
connection.Close();
}
}
} |
Notes from triage:
|
Thanks @ajcvickers
Thanks, but isn't it a solution, to see if the collation name contains or ends with _utf8, if so, all string operations must be prepended with N like in N'как лела?' |
@egbertn that could be a fix, but it would "require type mappings to understand collations, which is not trivial" as @ajcvickers wrote above. It's unfortunately too late to go into something like this for 6.0. |
Another direction for a proper fix would be to recognize the UTF8 collation in SQL Server's migration generator, and based on that create a varchar column instead of nvarchar, even if Unicode is true. This would allow keeping collations out of type mapping, and keep them in migrations which is where they currently live. This would also mean users continue running with Unicode=true, which feels more correct (after all, UTF8 is unicode). |
👍👍 |
Also there are definite performance disadvantages when using utf8 collation and |
THat is off topic. We are dealing here with EF supporting just another collation type. |
See also #7172 when considering collations in the type mapping. |
We are hitting this issue as well and didn't realize our non ASCII characters were getting corrupted. Is there a work around that doesn't involve manual editing migrations? I tried the following but EF still sends varchar parameters.
|
Looking into it further, UTF8 collations seem quite problematic. My understanding is that UTF8 columns should be declared as varchar but it looks like unicode characters only remain intact if the parameter is of type nvarchar. This will incur a conversion by sql server which can impact performance and SARGability as @mburbea mentioned. IMO this is quite relevant and important to consider this issue. So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither We were exciting to use UTF8 collations to lower IO and storage size but I think we'll stick to UTF16 for now... |
As suggested in this issue, the current workaround, is to do as EF is supposed to work, declare(keep) it nvarchar create a migration, and modify the migration script to use varchar, but the collation, of course, must be set to utf8. I did it and saw no problems. |
I was hoping for a work that doesn't involve manually editing migration. Also I don't think that #25798 (comment) is off topic. Using a nvarchar will cause a conversion to a varchar which will impact performance and SARGability. |
The collation isn't something that gets specified on a parameter, but rather on the column (or at the database level for all columns); see our docs for more info on this. Aside from that, as @egbertn wrote above, a workaround exists but requiring editing the migration to change the type to varchar (doing something better is what this issue tracks). There's no reason to avoid editing the migration file - it's perfectly fine (and frequently recommended) to customize migration code after generating it, see our docs. I definitely wouldn't avoid UTF8 just because it requires a one-time edit to migration code. |
Right, I understand that collation are applied to columns but one can also specify a collation on a varchar literal with the COLLATE keyword. Regarding editing migrations, the reason we avoid it is that we have to squash migrations every now and then because too many migrations causes very slow builds. Unfortunately since squashing migrations is a manual process that involves deleting existing migrations and creating a new one, any manual edits on migrations are lost. So, as a general practice, we attempt to apply all configurations via configuration code instead of custom migration code. I hope that makes sense. |
You can do that with EF.Collate if that's what you want, but that typically causes indexes to not be used, since they use the column's collation. So I'd really recommend against this.
Yeah, that does make sense, though I'd check exactly why it is that you need to squash frequently. The number one reason we see for this is users putting lots of data (or big data) in seeding, which is something we explicitly recommend against (seeding was not designed for that). Assuming you're not doing that, squashing should only be necessary quite rarely. More generally, editing migrations really is absolutely necessary in various cases. The typical one is when you rename a property: EF typically cannot know whether you want a rename (conserving data) or a drop and a creation of a new column, and so assumes the latter. If a rename is intended, the migration needs to be edited to reflect that (see this in our docs). So I'd advise not trying to avoid migration customization, since it's a normal, required part of the migration process. I'd rather concentrate on how to make sure customizations are recorded somewhere, so that they can be reapplied after squashing etc. |
Thanks @roji for your comments. Beside having a lot of migrations, there are other reasons for us to avoid manually editing migrations. I understand your comment regarding renamed properties and EF assuming it's a new property. We're ok editing the code in that case, because we know that if we squash the migration, a new table with the latest column name would be created anyway. So, I guess there are some type of manuals edits which don't really matter once migrations are squashed. On the other hand, modifying the migration script to use varchar instead of nvarchar is something that would be lost/overridden when squashing migrations. Having the DB schema be a simple reflection of the code is a nice property which gives us confidence to change our db contexts and not be worried about breaking things. Sorry, we got off topic a bit with the original question. |
No worries, it's an interesting discussion. I do have it on my plate to propose something better for 7.0 (which wouldn't require editing migrations), so this will hopefully go away soon. |
I'm looking forward to it 😀 |
Design proposal: tl;dr allow users to configure UTF8 by explicitly setting both the column type to protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.Entity<Blog>()
.Property(b => b.Name)
.HasColumnType("varchar(max)")
.IsUnicode(true)
.UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
} We can add a sugar method which does the above: modelBuilder.Entity<Blog>()
.Property(b => b.Name)
.UseUTF8("LATIN1_GENERAL_100_CI_AS_SC_UTF8"); Notes:
Global model configurationThe default database collation can already be set via protected override void OnModelCreating(ModelBuilder modelBuilder)
{
modelBuilder.UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
} All string properties can be configured to be UTF8 by default via pre-convention model configuration: protected override void ConfigureConventions(ModelConfigurationBuilder configurationBuilder)
{
configurationBuilder.DefaultTypeMapping<string>(b => b.HasColumnType("varchar(max)").IsUnicode(true));
} We could also add a |
@roji is there support for this in db first scaffolding? |
I hope so - at least the "decomposed version" (i.e. with HasColumnType+IsUnicode+UseCollation), if not the nicer UseUTF8 version... I'll see when I get to implementing it... |
Design decisions:
|
This does not behave as documented and expected.
If I have a entity for which I use Fluent API to define properties...
The SQL field (in my example) is varchar(255) using collation Latin1_General_100_BIN2_UTF8
in EF defined as
p.Property(prop => prop.Param).IsUnicode(false).UseCollation("Latin1_General_100_BIN2_UTF8").HasMaxLength(255);
However, unicode chars get's corrupted anyway on SQL both on Azure as on 2019 express.
Document Details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.
The text was updated successfully, but these errors were encountered: