
Bulk load data conversion error (truncation) for concept_synonym_name #1047

stpatnoe opened this issue Sep 5, 2024 · 4 comments

stpatnoe commented Sep 5, 2024

While running the SQL server BULK INSERT script for the CONCEPT_SYNONYM table for the latest vocabulary (v5.0 30-AUG-24), I'm getting the following error message for multiple rows:

Msg 4863, Level 16, State 1, Line 85
Bulk load data conversion error (truncation) for row 420061, column 2 (concept_synonym_name).

The row referenced in the error message above is for concept_id 1251539. It looks like the concept_synonym_name field was cut at 1000 characters but then continues on a second line?

The bulk insert script I'm using is the same format as the one provided below from this link.

TRUNCATE TABLE CONCEPT_SYNONYM;
BULK INSERT CONCEPT_SYNONYM
FROM 'C:\CDMV5VOCAB\CONCEPT_SYNONYM.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\CDMV5VOCAB\CONCEPT_SYNONYM.bad',
TABLOCK
);


AlaikseiKatyshou commented Sep 6, 2024

Hi @stpatnoe,

I think the error in this case stems from the fact that the concept_synonym_name column in the concept_synonym table is typed VARCHAR(1000), while concept 1251539 has a synonym name truncated to 1000 characters, and the name itself contains Unicode characters such as Ξ.
A VARCHAR(1000) column is designed to store non-Unicode text of up to 1000 bytes, where each character takes 1 byte.
A 1000-character string containing Unicode will therefore not fit in that column, because a Unicode character can occupy more than one byte.
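The byte-versus-character distinction can be seen directly in T-SQL. A minimal sketch (not from the thread), using LEN to count characters and DATALENGTH to count bytes:

```sql
-- Sketch: LEN counts characters, DATALENGTH counts bytes.
-- NVARCHAR stores 2 bytes per character (UCS-2), so the Greek letter Ξ
-- yields a byte count twice its character count.
DECLARE @synonym NVARCHAR(10) = N'Ξ';
SELECT LEN(@synonym)        AS character_count,  -- 1
       DATALENGTH(@synonym) AS byte_count;       -- 2
```

A 1000-character synonym containing such characters therefore needs more than 1000 bytes in any single-byte or UTF-8 VARCHAR column.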

As a solution to this issue, I would suggest changing the data type of the concept_synonym_name field from VARCHAR(1000) to NVARCHAR(1000), since NVARCHAR is designed to work with Unicode text.
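The suggested change could be sketched as follows (assuming the standard CDM table and column names used elsewhere in this thread):

```sql
-- Sketch of the proposed fix: widen the column to Unicode-capable NVARCHAR.
ALTER TABLE CONCEPT_SYNONYM
    ALTER COLUMN concept_synonym_name NVARCHAR(1000) NOT NULL;
```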
@clairblacketer


cgreich commented Sep 6, 2024

Somebody in the vocab team should remove that "synonym" altogether.

Alexdavv (Member) commented, quoting the above:

> Somebody in the vocab team should remove that "synonym" altogether.

Maybe, but it wouldn't solve the DDL issue.

stpatnoe (Author) commented

Switching to NVARCHAR(1000) did not resolve the issue for me: it looks like there are 54 concepts with vocabulary_id 'OMOP Invest Drug' whose synonym names exceed 1000 characters. I was able to run the bulk insert after altering the concept_synonym_name column to NVARCHAR(MAX):

ALTER TABLE CONCEPT_SYNONYM ALTER COLUMN concept_synonym_name NVARCHAR(MAX) NOT NULL;

TRUNCATE TABLE CONCEPT_SYNONYM;
BULK INSERT CONCEPT_SYNONYM
FROM 'C:\CDMV5VOCAB\CONCEPT_SYNONYM.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '0x0a',
ERRORFILE = 'C:\CDMV5VOCAB\CONCEPT_SYNONYM.bad',
TABLOCK
);
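For reference, the over-long synonyms mentioned above could be listed with a query like this (a sketch; it assumes the CONCEPT table is already loaded and the bulk insert above succeeded):

```sql
-- Sketch: find synonyms longer than 1000 characters in 'OMOP Invest Drug'.
SELECT cs.concept_id, LEN(cs.concept_synonym_name) AS name_length
FROM CONCEPT_SYNONYM cs
JOIN CONCEPT c ON c.concept_id = cs.concept_id
WHERE c.vocabulary_id = 'OMOP Invest Drug'
  AND LEN(cs.concept_synonym_name) > 1000
ORDER BY name_length DESC;
```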
