Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL Server UTF8 collations #25798

Closed
egbertn opened this issue Aug 28, 2021 · 25 comments · Fixed by #27634
Closed

SQL Server UTF8 collations #25798

egbertn opened this issue Aug 28, 2021 · 25 comments · Fixed by #27634
Assignees
Labels
area-sqlserver area-type-mapping closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. customer-reported type-enhancement
Milestone

Comments

@egbertn
Copy link

egbertn commented Aug 28, 2021

This does not behave as documented and expected.

If I have a entity for which I use Fluent API to define properties...
The SQL field (in my example) is varchar(255) using collation Latin1_General_100_BIN2_UTF8
in EF defined as
p.Property(prop => prop.Param).IsUnicode(false).UseCollation("Latin1_General_100_BIN2_UTF8").HasMaxLength(255);

However, unicode chars get's corrupted anyway on SQL both on Azure as on 2019 express.


Document Details

Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

@roji
Copy link
Member

roji commented Aug 28, 2021

Can you post a runnable code sample which shows the corruption occurring?

@egbertn
Copy link
Author

egbertn commented Aug 29, 2021

Can you post a runnable code sample which shows the corruption occurring?

Sure,
there you go. Thanks for your commitment.

https://github.com/egbertn/EFCore-5-UTF-8-support

@roji roji self-assigned this Aug 29, 2021
@ajcvickers ajcvickers transferred this issue from dotnet/EntityFramework.Docs Aug 31, 2021
@ajcvickers
Copy link
Member

Note for triage: this is because the DbType is set to AnsiString when IsUnicode is false. Below is a repro without EF Core. Changing the DbType to String makes it work.

Not using IsUnicode(false) makes the test pass, but creates an nvarchar(255) column, which apparently isn't desirable for SQL Server's squirrely UTF8 support.

static class Program
{
	static void Main()
	{
		using (var connection = new SqlConnection(@"Server=localhost;Database=master;Trusted_Connection=True"))
		{
			connection.Open();

			using (var command = connection.CreateCommand())
			{
				command.CommandText = @"DROP DATABASE [UTF8];";
				command.ExecuteNonQuery();
				
				command.CommandText = @"CREATE DATABASE [UTF8];";
				command.ExecuteNonQuery();
			}
			
			connection.Close();
		}

		using (var connection = new SqlConnection(@"Server=localhost;Database=UTF8;Trusted_Connection=True"))
		{
			connection.Open();

			using (var command = connection.CreateCommand())
			{
				command.CommandText = @"
      CREATE TABLE [MyTable] (
      [Id] bigint NOT NULL IDENTITY,
      [Param] varchar(255) COLLATE Latin1_General_100_BIN2_UTF8 NULL,
      CONSTRAINT [PK_MyTable] PRIMARY KEY ([Id]) );";

				command.ExecuteNonQuery();

				command.CommandText = @"INSERT INTO [MyTable] ([Param]) VALUES (@p0);";
				
				var parameter = command.CreateParameter();
				parameter.DbType = DbType.AnsiString;
				parameter.Size = 255;
				parameter.ParameterName = "p0";
				parameter.Value = "Как дела?";
				command.Parameters.Add(parameter);

				command.ExecuteNonQuery();

				command.CommandText = @"SELECT TOP(1) [m].[Id], [m].[Param] FROM [MyTable] AS [m] ORDER BY [m].[Id] DESC";
				using (var reader = command.ExecuteReader())
				{
					reader.Read();
					var value = reader.GetFieldValue<string>(reader.GetOrdinal("Param"));

					if (value != "Как дела?")
					{
						Console.WriteLine($"Wrote {parameter.Value}, read {value}.");
					}
				}
			}
			
			connection.Close();
		}
	}
}

@ajcvickers
Copy link
Member

Notes from triage:

  • It's not clear how to fix this; it will likely require type mappings to understand collations, which is not trivial.
  • For now, the workaround is to use IsUnicode(true) (which is the default) and to edit the migration to generate a varchar column.

@egbertn
Copy link
Author

egbertn commented Sep 1, 2021

Thanks @ajcvickers

Notes from triage:

* It's not clear how to fix this; it will likely require type mappings to understand collations, which is not trivial.

* For now, the workaround is to use `IsUnicode(true)` (which is the default) and to edit the migration to generate a `varchar` column.

Thanks, but isn't it a solution, to see if the collation name contains or ends with _utf8, if so, all string operations must be prepended with N like in N'как лела?'

@roji
Copy link
Member

roji commented Sep 1, 2021

@egbertn that could be a fix, but it would "require type mappings to understand collations, which is not trivial" as @ajcvickers wrote above. It's unfortunately too late to go into something like this for 6.0.

@roji roji changed the title Collation and Entity Framework Core does not behave Using SQL Server UTF8 collations Sep 2, 2021
@roji
Copy link
Member

roji commented Sep 2, 2021

Another direction for a proper fix would be to recognize the UTF8 collation in SQL Server's migration generator, and based on that create a varchar column instead of nvarchar, even if Unicode is true. This would allow keeping collations out of type mapping, and keep them in migrations which is where they currently live. This would also mean users continue running with Unicode=true, which feels more correct (after all, UTF8 is unicode).

@egbertn
Copy link
Author

egbertn commented Sep 2, 2021

Another direction for a proper fix would be to recognize the UTF8 collation in SQL Server's migration generator, and based on that create a varchar column instead of nvarchar, even if Unicode is true. This would allow keeping collations out of type mapping, and keep them in migrations which is where they currently live. This would also mean users continue running with Unicode=true, which feels more correct (after all, UTF8 is unicode).

👍👍

@mburbea
Copy link

mburbea commented Sep 15, 2021

Also there are definite performance disadvantages when using utf8 collation and nvarchar parameters, as the query is then non-sargable.

@egbertn
Copy link
Author

egbertn commented Sep 15, 2021

Also there are definite performance disadvantages when using utf8 collation and nvarchar parameters, as the query is then non-sargable.

THat is off topic. We are dealing here with EF supporting just another collation type.

@ajcvickers
Copy link
Member

See also #7172 when considering collations in the type mapping.

@clement911
Copy link
Contributor

clement911 commented Mar 6, 2022

We are hitting this issue as well and didn't realize our non ASCII characters were getting corrupted.

Is there a work around that doesn't involve manual editing migrations?

I tried the following but EF still sends varchar parameters.

pty.SetIsUnicode(true);
pty.SetColumnType("varchar");

@clement911
Copy link
Contributor

Looking into it further, UTF8 collations seem quite problematic.

My understanding is that UTF8 columns should be declared as varchar but it looks like unicode characters only remain intact if the parameter is of type nvarchar. This will incur a conversion by sql server which can impact performance and SARGability as @mburbea mentioned. IMO this is quite relevant and important to consider this issue.

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

We were exciting to use UTF8 collations to lower IO and storage size but I think we'll stick to UTF16 for now...

@egbertn
Copy link
Author

egbertn commented Mar 7, 2022

Looking into it further, UTF8 collations seem quite problematic.

My understanding is that UTF8 columns should be declared as varchar but it looks like unicode characters only remain intact if the parameter is of type nvarchar. This will incur a conversion by sql server which can impact performance and SARGability as @mburbea mentioned. IMO this is quite relevant and important to consider this issue.

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

We were exciting to use UTF8 collations to lower IO and storage size but I think we'll stick to UTF16 for now...

As suggested in this issue, the current workaround, is to do as EF is supposed to work, declare(keep) it nvarchar create a migration, and modify the migration script to use varchar, but the collation, of course, must be set to utf8. I did it and saw no problems.

@clement911
Copy link
Contributor

I was hoping for a work that doesn't involve manually editing migration.

Also I don't think that #25798 (comment) is off topic. Using a nvarchar will cause a conversion to a varchar which will impact performance and SARGability.

@roji
Copy link
Member

roji commented Mar 7, 2022

@clement911

So I believe ideally we would want to pass a varchar parameter and also indicates that the collation of the parameter is Latin1_General_100_BIN2_UTF8 (or whatever other actually UTF8 collation was used). The problem I see is that neither Microsoft.Data.SqlClient.SqlParameter nor sp_executesql allows passing the collation name of given parameters.

The collation isn't something that gets specified on a parameter, but rather on the column (or at the database level for all columns); see our docs for more info on this.

Aside from that, as @egbertn wrote above, a workaround exists but requiring editing the migration to change the type to varchar (doing something better is what this issue tracks). There's no reason to avoid editing the migration file - it's perfectly fine (and frequently recommended) to customize migration code after generating it, see our docs. I definitely wouldn't avoid UTF8 just because it requires a one-time edit to migration code.

@clement911
Copy link
Contributor

Right, I understand that collation are applied to columns but one can also specify a collation on a varchar literal with the COLLATE keyword.

Regarding editing migrations, the reason we avoid it is that we have to squash migrations every now and then because too many migrations causes very slow builds. Unfortunately since squashing migrations is a manual process that involves deleting existing migrations and creating a new one, any manual edits on migrations are lost. So, as a general practice, we attempt to apply all configurations via configuration code instead of custom migration code. I hope that makes sense.

@roji
Copy link
Member

roji commented Mar 7, 2022

Right, I understand that collation are applied to columns but one can also specify a collation on a varchar literal with the COLLATE keyword.

You can do that with EF.Collate if that's what you want, but that typically causes indexes to not be used, since they use the column's collation. So I'd really recommend against this.

Regarding editing migrations [...]

Yeah, that does make sense, though I'd check exactly why it is that you need to squash frequently. The number one reason we see for this is users putting lots of data (or big data) in seeding, which is something we explicitly recommend against (seeding was not designed for that). Assuming you're not doing that, squashing should only be necessary quite rarely.

More generally, editing migrations really is absolutely necessary in various cases. The typical one is when you rename a property: EF typically cannot know whether you want a rename (conserving data) or a drop and a creation of a new column, and so assumes the latter. If a rename is intended, the migration needs to be edited to reflect that (see this in our docs).

So I'd advise not trying to avoid migration customization, since it's a normal, required part of the migration process. I'd rather concentrate on how to make sure customizations are recorded somewhere, so that they can be reapplied after squashing etc.

@clement911
Copy link
Contributor

Thanks @roji for your comments.
FYI we don't use seeding at all.

Beside having a lot of migrations, there are other reasons for us to avoid manually editing migrations.
Sometimes we split a context into multiple contexts, or merge multiple context into ones, or move tables from one context to another (our project is pretty big). When that happens, EF might generate new migrations to drop / create tables that were moved to a different context. But we don't actually run these migrations because we don't really want to drop or create tables (they already exist from another context). After we do such changes, we find it's cleaner to regenerate an initial migration (squash) for each context so that each developer can create a fresh DB on their environment.

I understand your comment regarding renamed properties and EF assuming it's a new property. We're ok editing the code in that case, because we know that if we squash the migration, a new table with the latest column name would be created anyway. So, I guess there are some type of manuals edits which don't really matter once migrations are squashed. On the other hand, modifying the migration script to use varchar instead of nvarchar is something that would be lost/overridden when squashing migrations. Having the DB schema be a simple reflection of the code is a nice property which gives us confidence to change our db contexts and not be worried about breaking things.

Sorry, we got off topic a bit with the original question.

@roji
Copy link
Member

roji commented Mar 7, 2022

Sorry, we got off topic a bit with the original question.

No worries, it's an interesting discussion. I do have it on my plate to propose something better for 7.0 (which wouldn't require editing migrations), so this will hopefully go away soon.

@clement911
Copy link
Contributor

I'm looking forward to it 😀

@roji
Copy link
Member

roji commented Mar 8, 2022

Design proposal:

tl;dr allow users to configure UTF8 by explicitly setting both the column type to char/varchar and IsUnicode to true:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<Blog>()
        .Property(b => b.Name)
        .HasColumnType("varchar(max)")
        .IsUnicode(true)
        .UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
}

We can add a sugar method which does the above:

modelBuilder.Entity<Blog>()
    .Property(b => b.Name)
    .UseUTF8("LATIN1_GENERAL_100_CI_AS_SC_UTF8");

Notes:

  • Today, explicitly setting the column to char/varchar also sets DbType=AnsiString (note that Unicode still remains true in the type mapping - not ideal).
  • Explicitly setting to Unicode to true currently has no effect if the store type is set to char/varchar, i.e. DbType is still AnsiString.
  • We can allow the user to explicitly set varchar(max) and IsUnicode=true - this would opt into UTF8. The column type is exactly what it should be in migrations (and also in the query pipline etc.), and IsUnicode tells us to send DbType.String instead of DbType.AnsiString.
  • Note that this isn't enough: a collation is needed as well (and it has to be explicit). We can add model validation that for all UTF8 properties (char/varchar with Unicode=true), to check for a UTF8-compatible collation (ends with _UTF8).
  • Scaffolding: look into doing this reliably. The combination of a char/varchar property with a collation ending with UTF8 (including at the database level) should lead to the correct UTF8 property being scaffolded (i.e. with either UseUTF8 or IsUnicode(true)`)

Global model configuration

The default database collation can already be set via modelBuilder.UseCollation():

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.UseCollation("LATIN1_GENERAL_100_CI_AS_SC_UTF8");
}

All string properties can be configured to be UTF8 by default via pre-convention model configuration:

protected override void ConfigureConventions(ModelConfigurationBuilder configurationBuilder)
{
    configurationBuilder.DefaultTypeMapping<string>(b => b.HasColumnType("varchar(max)").IsUnicode(true));
}

We could also add a ConfigureUTF8() extension method to do the above.

@ErikEJ
Copy link
Contributor

ErikEJ commented Mar 8, 2022

@roji is there support for this in db first scaffolding?

@roji
Copy link
Member

roji commented Mar 8, 2022

I hope so - at least the "decomposed version" (i.e. with HasColumnType+IsUnicode+UseCollation), if not the nicer UseUTF8 version... I'll see when I get to implementing it...

@roji
Copy link
Member

roji commented Mar 11, 2022

Design decisions:

roji added a commit to roji/efcore that referenced this issue Mar 13, 2022
@roji roji added closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. and removed needs-design area-migrations labels Mar 13, 2022
roji added a commit that referenced this issue Mar 17, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0, 7.0.0-preview3 Mar 31, 2022
@ajcvickers ajcvickers changed the title Using SQL Server UTF8 collations SQL Server UTF8 collations Apr 18, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0-preview3, 7.0.0 Nov 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-sqlserver area-type-mapping closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. customer-reported type-enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants