RuntimeError: Unable to find column name '年齢' among names ['pclass', 'sex', '__', 'fare', 'embarked']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. #774

dc-aichara · 2021-11-08T09:02:50Z

I get RuntimeError: Unable to find column name "Some_Japanese_word" among names [columns with some Japanese characters in column names] when I try to convert a scikit-learn object (ColumnTransformer) into an ONNX object with convert_sklearn(model, initial_types=initial_type).

After looking into the code, I found that the function

sklearn-onnx/skl2onnx/common/_topology.py

Line 919 in fab5a97

def _generate_unique_name(seed, existing_names):

replaces every Japanese character with _, so the sanitized name no longer matches the original column name, which causes the error.

Why does it allow only ASCII letters and digits? Is there any risk in using characters from other languages in data column names?

Note: Replacing

sklearn-onnx/skl2onnx/common/_topology.py

Line 932 in fab5a97

seed = re.sub('[^0-9a-zA-Z]', '_', seed)

with seed = re.sub('[^\\w+]', '_', seed) fixes the Japanese character issue.

Please run the following code to reproduce this issue.
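To make the difference between the two patterns concrete, here is a minimal sketch that applies each of them to the column name from the reproduction. The sanitize helper and the OLD_PATTERN / NEW_PATTERN names are assumptions made for illustration only; they are not part of skl2onnx, and the real _generate_unique_name does more than this single substitution.

```python
import re

# Pattern currently used on line 932: keeps only ASCII letters and digits.
OLD_PATTERN = '[^0-9a-zA-Z]'
# Proposed pattern: keeps Unicode word characters and a literal '+'.
NEW_PATTERN = '[^\\w+]'


def sanitize(seed, pattern):
    """Hypothetical helper: replace every character matched by pattern with '_'."""
    return re.sub(pattern, '_', seed)


old_name = sanitize('年齢', OLD_PATTERN)
new_name = sanitize('年齢', NEW_PATTERN)

print(old_name)  # '__', the name that shows up in the error message
print(new_name)  # '年齢', unchanged, so the column lookup can succeed

# Characters outside the allowed set are still replaced by the new pattern.
print(sanitize('fare-usd', NEW_PATTERN))  # 'fare_usd'
```

In Python 3, \w in a str pattern matches Unicode word characters by default, which is why the Japanese name survives. Note that the + inside the character class is a literal plus sign, so the proposed pattern also leaves + untouched.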

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType, StringTensorType, Int64TensorType

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
# Replace "age" with its Japanese translation for this test.
data.rename(columns={"age": "年齢"}, inplace=True)

X = data.drop('survived', axis=1)
y = data['survived']


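cols = ['embarked', 'sex', 'pclass', '年齢', 'fare']
X = X[cols].copy()

# Fill missing categorical values; assign back rather than relying on
# inplace=True on a column selection, which may not modify X.
for cat in ['embarked', 'sex', 'pclass']:
    X[cat] = X[cat].fillna('missing')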

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

numeric_features = ['年齢', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ])

X_train = preprocessor.fit_transform(X_train)

initial_type = [
                 ('pclass', Int64TensorType(shape=[None, 1])),
                 ('sex', StringTensorType(shape=[None, 1])),
                 ('年齢', FloatTensorType(shape=[None, 1])),
                 ('fare', FloatTensorType(shape=[None, 1])),
                 ('embarked', StringTensorType(shape=[None, 1]))
               ]

onnx_object = convert_sklearn(preprocessor, initial_types=initial_type)


xadupre · 2021-11-18T10:13:47Z

I made a fix with the suggested change. However, onnxruntime does not support such characters yet, so it is difficult to test that the predictions are the same in that case.
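Until the runtime accepts such names, one possible workaround is to give non-ASCII columns ASCII aliases before fitting and converting. The sketch below is only an illustration under that assumption: ascii_aliases is a hypothetical helper, not part of skl2onnx or onnxruntime, and the col_<index> naming scheme is arbitrary.

```python
def ascii_aliases(names):
    """Hypothetical helper: map each column name to an ASCII-only alias.

    ASCII names are kept as they are; any other name becomes 'col_<index>',
    with a numeric suffix added if that alias is already taken.
    """
    used = {name for name in names if name.isascii()}
    aliases = {}
    for index, name in enumerate(names):
        if name.isascii():
            aliases[name] = name
            continue
        candidate = f"col_{index}"
        suffix = 0
        while candidate in used:
            suffix += 1
            candidate = f"col_{index}_{suffix}"
        used.add(candidate)
        aliases[name] = candidate
    return aliases


mapping = ascii_aliases(['embarked', 'sex', 'pclass', '年齢', 'fare'])
print(mapping['年齢'])  # 'col_3'
```

The resulting dictionary could be passed to DataFrame.rename(columns=mapping) before the pipeline is fitted, with the same aliases then used in numeric_features and initial_types.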

xadupre · 2021-11-26T10:01:16Z

I'll close the issue; feel free to reopen it.

xadupre mentioned this issue Nov 18, 2021

Support non ascii characters in variable names #784

Merged

xadupre closed this as completed Nov 26, 2021
