Fix character decoding issues with text-like files #19

brc-dd · 2024-12-14T08:04:49Z

fixes #18

Non-UTF-8 HTML pages are still not supported. Supporting them will require detecting the encoding (not the language) from the Content-Type header or from meta tags such as <meta charset="utf-8" /> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, and only as a last resort falling back to auto-detection.

The Bing, Wikipedia and YouTube converters don't need this, as their pages are always UTF-8 encoded.
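
The detection order described above (Content-Type header, then meta tags, then auto-detection) could be sketched roughly as follows. This is not code from this PR: the function names `charset_from_header`, `charset_from_meta` and `decode_html` are hypothetical, and the final fallback (try UTF-8, then cp1252 with replacement characters) is a simplifying assumption standing in for real auto-detection, which a dedicated detection library would handle better.

```python
import codecs
import re
from typing import Optional

# Matches "charset=..." inside a Content-Type header value.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches both <meta charset="..."> and the http-equiv form in raw bytes.
_META_RE = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type header, if any."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Return the charset declared in a meta tag near the start of the page."""
    match = _META_RE.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def _is_known_codec(name: str) -> bool:
    """Check that Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def decode_html(body: bytes, content_type: Optional[str] = None) -> str:
    """Decode HTML bytes: header first, then meta tag, then a crude fallback."""
    for candidate in (charset_from_header(content_type), charset_from_meta(body)):
        if candidate and _is_known_codec(candidate):
            try:
                return body.decode(candidate)
            except UnicodeDecodeError:
                continue  # the declared charset was wrong; try the next hint
    # Assumed fallback, not real auto-detection: UTF-8, then cp1252.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return body.decode("cp1252", errors="replace")
```

For example, `decode_html(page_bytes, response_headers.get("Content-Type"))` would prefer the server-declared charset and consult the page's own meta tag only if the header is missing or wrong; `page_bytes` and `response_headers` are placeholder names here.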

brc-dd · 2024-12-14T08:05:38Z

@microsoft-github-policy-service agree

gagb · 2024-12-14T09:23:06Z

@brc-dd can you please expand the tests? Or at least share an example CSV file that triggers this error?

brc-dd · 2024-12-14T09:37:40Z

Updated the tests.

gagb · 2024-12-14T22:51:55Z

Can you please run the pre-commit checks? I think it should be pre-commit run --all-files.

brc-dd · 2024-12-15T05:08:41Z

Formatted.

brc-dd force-pushed the fix/18 branch 2 times, most recently from d44f35f to 112c276, on December 14, 2024 08:10

gagb self-requested a review on December 14, 2024 09:16

brc-dd force-pushed the fix/18 branch from 112c276 to d1797a1 on December 14, 2024 09:36

brc-dd force-pushed the fix/18 branch from d1797a1 to 1bfde0c on December 14, 2024 10:17

brc-dd force-pushed the fix/18 branch from 1bfde0c to da1f518 on December 15, 2024 05:07

Fix character decoding issues with text-like files

52b7237

brc-dd force-pushed the fix/18 branch from da1f518 to 52b7237 on December 15, 2024 05:08

gagb approved these changes Dec 15, 2024

brc-dd commented Dec 15, 2024

tests/test_markitdown.py (resolved)

Merge branch 'main' into fix/18

aeff2cb

gagb merged commit ed91e8b into microsoft:main Dec 16, 2024
3 checks passed

kentaroy47 mentioned this pull request Dec 17, 2024

trouble with writing out markdown file #78

Open
