RuntimeError: Unable to find column name '年齢' among names ['pclass', 'sex', '__', 'fare', 'embarked']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. #774

Closed
dc-aichara opened this issue Nov 8, 2021 · 2 comments

I get RuntimeError: Unable to find column name "Some_Japanese_word" among names [column names, some containing Japanese characters] when I try to convert a scikit-learn object (ColumnTransformer) to an ONNX object with convert_sklearn(model, initial_types=initial_type).

After looking into the code, I found that the function

def _generate_unique_name(seed, existing_names):
replaces every Japanese character with _, which causes the error.

Why does it support only ASCII letters and digits? Is there any risk in allowing characters from other languages in data column names?

Note: Replacing

seed = re.sub('[^0-9a-zA-Z]', '_', seed)
with seed = re.sub('[^\\w+]', '_', seed) fixes the Japanese character issue.
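
To make the difference between the two patterns concrete, here is a minimal standalone sketch. It does not call skl2onnx; the helper names sanitize_ascii and sanitize_unicode are made up for illustration and only wrap the two re.sub calls quoted above.

```python
import re


def sanitize_ascii(seed):
    # Original pattern: every character outside 0-9, a-z, A-Z becomes '_'.
    return re.sub('[^0-9a-zA-Z]', '_', seed)


def sanitize_unicode(seed):
    # Suggested pattern: in Python 3, \w on str matches Unicode word
    # characters, so Japanese characters are kept. Because the '+' sits
    # inside the character class, a literal '+' is kept as well.
    return re.sub('[^\\w+]', '_', seed)


for name in ['fare', '年齢']:
    print(name, sanitize_ascii(name), sanitize_unicode(name))
```

With the original pattern, '年齢' collapses to '__', which matches the '__' entry in the names list of the error message. With the suggested pattern, it stays '年齢'.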


Please run the following code to reproduce this issue.


import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import (
    FloatTensorType, StringTensorType, Int64TensorType)

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
# Replace age with its Japanese translation for this test.
data.rename(columns={"age": "年齢"}, inplace=True)

X = data.drop('survived', axis=1)
y = data['survived']


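cols = ['embarked', 'sex', 'pclass', '年齢', 'fare']
# Copy so the fillna assignments below modify X itself, not a view of data.
X = X[cols].copy()

for cat in ['embarked', 'sex', 'pclass']:
    # Assign back instead of chained inplace fillna, which may not update X.
    X[cat] = X[cat].fillna('missing')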

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

numeric_features = ['年齢', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ])

X_train = preprocessor.fit_transform(X_train)

initial_type = [
                 ('pclass', Int64TensorType(shape=[None, 1])),
                 ('sex', StringTensorType(shape=[None, 1])),
                 ('年齢', FloatTensorType(shape=[None, 1])),
                 ('fare', FloatTensorType(shape=[None, 1])),
                 ('embarked', StringTensorType(shape=[None, 1]))
               ]

onnx_object = convert_sklearn(preprocessor, initial_types=initial_type)

xadupre commented Nov 18, 2021

I made a fix with the suggested change. However, onnxruntime does not support such characters yet, so it is difficult to test that the predictions are the same in that case.
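
One possible workaround while the runtime does not accept such names is to give non-ASCII columns ASCII aliases before fitting and converting. This is only a sketch under that assumption, not something confirmed in this thread; the helper name ascii_aliases and the col_N naming scheme are hypothetical.

```python
def ascii_aliases(columns):
    """Map each column name to an ASCII-only alias (hypothetical helper).

    ASCII names are kept as-is; any other name becomes 'col_<position>'.
    Assumes no existing column is already named 'col_<position>'.
    """
    aliases = {}
    for position, name in enumerate(columns):
        aliases[name] = name if name.isascii() else f"col_{position}"
    return aliases


cols = ['embarked', 'sex', 'pclass', '年齢', 'fare']
mapping = ascii_aliases(cols)
print(mapping)
```

The resulting mapping could then be passed to data.rename(columns=mapping), with the aliased names used both in the ColumnTransformer and in initial_types.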


xadupre commented Nov 26, 2021

I'll close the issue; feel free to reopen it.

@xadupre xadupre closed this as completed Nov 26, 2021