added an optional parameter to specify the character encoding of a source #78

Merged (1 commit), Dec 13, 2022

Conversation

jze
Contributor

@jze jze commented Dec 13, 2022

Overview

I have added an optional parameter to specify the character encoding of a source. The test TableEncodingTests::createTableFromIso8859() passes.

However, I'm not at all sure that my approach makes sense in the context of the rest of the code/framework. Somehow we would have to pass the encoding property from the Tabular Data Resource through to the Table.fromSource method invocation.
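
To make the idea of an optional encoding parameter concrete, here is a minimal, self-contained sketch. It is not the code from this PR and does not reproduce the library's `Table.fromSource` signature; the class `EncodedSourceReader` and its `readLines` overloads are hypothetical names used only for illustration. The overload without a charset falls back to UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the library: shows an optional charset
// parameter implemented as a pair of overloads.
final class EncodedSourceReader {
    private EncodedSourceReader() {}

    // No encoding given: assume UTF-8.
    static List<String> readLines(InputStream source) {
        return readLines(source, StandardCharsets.UTF_8);
    }

    // Encoding given: decode the bytes with exactly that charset.
    static List<String> readLines(InputStream source, Charset encoding) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(source, encoding))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            // Rethrow unchecked so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading ISO-8859-1 bytes through the one-argument overload would decode them as UTF-8 and garble non-ASCII characters, which is the situation the optional parameter is meant to avoid.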

Closes #77


Please preserve this line to notify @iSnow (lead of this repository)

@iSnow iSnow changed the base branch from main to issue78-charsets December 13, 2022 09:55
@iSnow iSnow merged commit b8ddbd3 into frictionlessdata:issue78-charsets Dec 13, 2022
@iSnow
Contributor

iSnow commented Dec 15, 2022

Thanks for the PR, much appreciated! For now, I pulled it into a new branch to get the API right.

It's not a huge problem to parse the encoding from the resource definition and hand it down to the Table factory method. Reading Resources from files should be easy, and the API for Tables and Datapackages would match. What's giving me headaches is whether to set an encoding on a URL-based Table in the first place. Usually, the web server should give us that.

In the context of Table, it should be fine to trust the web server and therefore not have an encoding param in the Table.fromSource method, but in the context of a Datapackage, there might be an encoding defined on a Resource which may or may not match the encoding of the individual URLs.

Essentially, it comes down to:

  • File based Table:
    • no encoding -> use UTF-8.
    • Encoding given -> use it
  • URL based Table:
    • no encoding -> use whatever the web server gives us OR use UTF-8 (?).
    • Encoding given -> might clash with what the web server gives us, if the Table API allows specifying one.
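
The cases above can be sketched as a small resolution function. This shows one possible policy only, not a decision made in this thread: an explicitly declared encoding wins, a URL-based source without one falls back to the `charset` parameter of the server's `Content-Type` header, and everything else defaults to UTF-8. The class `EncodingPolicy` and its methods are hypothetical and not part of the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical policy class, for illustration only.
final class EncodingPolicy {
    private EncodingPolicy() {}

    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/csv; charset=ISO-8859-1". Empty if absent or unknown.
    static Optional<Charset> charsetFromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as not given.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    // declared:    encoding given by the caller or the resource, or null.
    // contentType: Content-Type header for URL sources, null for files.
    static Charset resolve(Charset declared, String contentType) {
        if (declared != null) {
            return declared; // assumption: an explicit encoding always wins
        }
        return charsetFromContentType(contentType).orElse(StandardCharsets.UTF_8);
    }
}
```

For a file-based source, `contentType` is simply `null`, so the two file cases reduce to "declared encoding or UTF-8". Whether a declared encoding should really override the server's header for URL sources is the open question here; this sketch just picks one answer.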

@roll How do other implementations handle this? Or what's your opinion?

@roll
Member

roll commented Dec 23, 2022

Hi, in Python we infer the encoding from a byte sample (buffer) if it's not provided. We tried to use the encoding from the HTTP headers, but it's often misleading, so we stopped using it.
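
For comparison, inferring an encoding from a byte sample could look roughly like this in Java. This is a deliberately simplified heuristic and not the detection the Python implementation uses: it checks for a byte order mark, then tests whether the sample is valid UTF-8, and otherwise assumes ISO-8859-1. The class `EncodingSniffer` and the ISO-8859-1 fallback are assumptions made for this sketch. A real implementation would more likely rely on a statistical detector.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class EncodingSniffer {
    private EncodingSniffer() {}

    // Guesses the charset of a source from its first bytes.
    static Charset guess(byte[] sample) {
        // 1. A byte order mark is the strongest hint (UTF-32 BOMs are ignored here).
        if (startsWith(sample, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(sample, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(sample, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // 2. Valid UTF-8 (including plain ASCII) is reported as UTF-8.
        // 3. Anything else: assumed single-byte fallback.
        return isValidUtf8(sample) ? StandardCharsets.UTF_8
                                   : StandardCharsets.ISO_8859_1;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    private static boolean isValidUtf8(byte[] sample) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        // endOfInput=false: a multi-byte sequence cut off at the end of the
        // sample is not counted as an error.
        CoderResult result = decoder.decode(
            ByteBuffer.wrap(sample),
            CharBuffer.allocate(sample.length + 1),
            false);
        return !result.isError();
    }
}
```

The guess is only as good as the sample: a file whose first non-ASCII byte appears after the sampled prefix would be reported as UTF-8.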

Successfully merging this pull request may close these issues.

Character set of a Table's DataSourceFormat can't be set