Fix character decoding issues with text-like files #19

Merged
gagb merged 2 commits into microsoft:main from the fix/18 branch on Dec 16, 2024

brc-dd (Contributor) commented Dec 14, 2024

fixes #18

Non-UTF-8 HTML pages are still not supported. Supporting them will require detecting the character encoding from the Content-Type header or from meta tags such as <meta charset="utf-8" /> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, and falling back to auto-detection as a last resort.

The Bing, Wikipedia, and YouTube converters don't need this change, as their pages are always UTF-8 encoded.
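
The detection order described above for HTML pages (Content-Type header, then meta tags, then auto-detection) could look roughly like the following. This is only a minimal sketch, not the code in this PR: the function names `charset_from_content_type`, `charset_from_meta`, and `decode_html` are hypothetical, the 4096-byte scan limit is an arbitrary assumption, and `charset_normalizer` is treated as an optional dependency with a lossy UTF-8 fallback when it is not installed.

```python
import re
from typing import Optional

try:
    # Optional third-party detector; the linked issue suggests using it.
    from charset_normalizer import from_bytes
except ImportError:  # assumption: degrade gracefully if it is missing
    from_bytes = None

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)
_HEADER_CHARSET_RE = re.compile(
    r"""charset\s*=\s*["']?([^\s;"']+)""",
    re.IGNORECASE,
)


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if any."""
    if not header:
        return None
    match = _HEADER_CHARSET_RE.search(header)
    return match.group(1) if match else None


def charset_from_meta(body: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Return the charset declared in a meta tag near the start of the page."""
    match = _META_CHARSET_RE.search(body[:scan_limit])
    return match.group(1).decode("ascii") if match else None


def decode_html(body: bytes, content_type: Optional[str] = None) -> str:
    """Decode HTML bytes: header first, then meta tag, then auto-detection."""
    declared = (charset_from_content_type(content_type), charset_from_meta(body))
    for candidate in declared:
        if not candidate:
            continue
        try:
            return body.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrong declared charset: try the next source.
            continue
    if from_bytes is not None:
        best = from_bytes(body).best()
        if best is not None:
            return str(best)  # str() of a match yields the decoded text
    # Last resort: lossy UTF-8 so the caller always gets a string back.
    return body.decode("utf-8", errors="replace")


if __name__ == "__main__":
    page = b'<html><head><meta charset="iso-8859-1"></head><body>caf\xe9</body></html>'
    print(decode_html(page))
```

A declared charset is tried before auto-detection because statistical detection can guess wrong on short pages, while a wrong declaration simply fails to decode and falls through to the next step.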

brc-dd (Contributor Author) commented Dec 14, 2024

@microsoft-github-policy-service agree

brc-dd force-pushed the fix/18 branch 2 times, most recently from d44f35f to 112c276 (December 14, 2024 08:10)
gagb self-requested a review (December 14, 2024 09:16)

gagb (Contributor) commented Dec 14, 2024

@brc-dd can you please expand the tests? Or at least share an example CSV file that triggers this error?

brc-dd (Contributor Author) commented Dec 14, 2024

Updated the tests.

gagb (Contributor) commented Dec 14, 2024

Can you please run the pre-commit checks? I think it should be pre-commit run --all-files.

brc-dd (Contributor Author) commented Dec 15, 2024

Formatted.

gagb merged commit ed91e8b into microsoft:main on Dec 16, 2024
3 checks passed
Development

Successfully merging this pull request may close these issues.

use charset_normalizer
2 participants