Switch LGPL'd chardet for MIT licensed charset_normalizer #5797

Merged - 1 commit - Jul 6, 2021

Conversation

ashb
Contributor

@ashb ashb commented Apr 22, 2021

At least for Python 3 -- charset_normalizer doesn't support Python 2, so chardet is still used there -- which means the "have chardet" path is also still tested.

While using the (non-vendored) chardet library is fine for requests itself, the story is a lot less clear for downstream projects that pick up an LGPL dependency, particularly ones that might like to bundle requests (and thus chardet) into a single binary -- think something similar to what docker-compose is doing. By including an LGPL'd module, it is no longer clear whether the resulting artefact must also be LGPL'd.

By changing out this dependency for one under MIT we remove all license ambiguity.

As an "escape hatch" I have made the code use chardet first if it is installed, but we no longer depend upon it directly, although there is a new extra added, requests[lgpl]. This should minimize the impact to users and give them an escape hatch if charset_normalizer turns out to be not as good. (In my non-exhaustive tests it detects the same encoding as chardet in every case I threw at it.)
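
To make the mechanism concrete, here is a minimal sketch of the kind of fallback described above (not necessarily the exact code in the diff):

# Prefer chardet if the user installed it (e.g. via the new requests[lgpl] extra),
# otherwise fall back to the MIT-licensed charset_normalizer.
try:
    import chardet
except ImportError:
    import charset_normalizer as chardet  # exposes a chardet-compatible detect()

def guess_encoding(content: bytes):
    # Both libraries return a dict along the lines of {'encoding': ..., 'confidence': ...}
    return chardet.detect(content)["encoding"]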

I've read #4115, #3389, and chardet/chardet#36 (comment) so I'm aware of the history, but I hope that the approach in this PR will allow this to be merged, as right now, the Apache Software Foundation doesn't allow projects to depend upon LGPL'd code (this is something I'm trying to get changed, but it is a very slow process)

@potiuk
Contributor

potiuk commented Apr 22, 2021

That looks like an easy fix to a long-standing licensing problem. It would be really great if this change got merged and released.

For the maintainers of requests: I know you have limited time for testing, but if you think there is any need for extra testing, I am happy to help get it approved.

@ashb ashb force-pushed the charset_normalizer branch from ea671c4 to d8cf70e on April 22, 2021 09:16
@potiuk
Contributor

potiuk commented Apr 26, 2021

Dear maintainers of requests library.

Would it be possible to hear what you think about this idea? It's super important for us - basically for the whole Apache Software Foundation, but for the Apache Airflow and Apache Liminal projects particularly. It would be great to know whether this or a similar change could be accepted by you, or whether we should try some alternatives.

Having requests drop its LGPL-bound dependency would be by far the easiest way to proceed for us, but in case that's not possible we will have to start thinking about alternatives before we release the next version, because this problem effectively blocks us from releasing our software.

This is just a kind request for feedback and comments rather than a demand to merge it immediately; we just need to know what our options are.

@sigmavirus24
Contributor

The library is under a rather restrictive (for the maintainers' own sanity) feature freeze at the moment, and unfortunately changing a dependency which can have a ripple effect on the usage of the library is very unlikely to be accepted - regardless of licensing.

@potiuk
Contributor

potiuk commented Apr 26, 2021

The library is under a rather restrictive (for the maintainers' own sanity) feature freeze at the moment, and unfortunately changing a dependency which can have a ripple effect on the usage of the library is very unlikely to be accepted - regardless of licensing.

Is there anything we can help with to make this happen? Maybe we can somehow step in and help you with testing/changes and step up gradually to become maintainers eventually? I think we could probably try to bring it to the attention of more experienced devs who could help with that.

One alternative I can think of is to maintain a fork of requests in the Apache-owned organization (along with any library we use that depends on requests), but obviously this is not something we'd love - it might be a good short-term, tactical approach, but not really great as a long-term maintenance/strategic solution.

Is there any timeline for the feature freeze you mentioned?

@kaxil

kaxil commented Apr 26, 2021

The library is under a rather restrictive (for the maintainers' own sanity) feature freeze at the moment, and unfortunately changing a dependency which can have a ripple effect on the usage of the library is very unlikely to be accepted - regardless of licensing.

Can we also know the reason for the feature freeze? Like Ash & Jarek said, we would be happy to help with any testing and maintenance if that is one of the main concerns. This change should help most downstream projects, which are used at a lot of organizations that have strict licensing policies.

@ashb
Contributor Author

ashb commented Apr 26, 2021

The library is under a rather restrictive (for the maintainers' own sanity) feature freeze at the moment, and unfortunately changing a dependency which can have a ripple effect on the usage of the library is very unlikely to be accepted - regardless of licensing.

@sigmavirus24 I can appreciate that. Is this a feature freeze "for ever", or is there some planned time frame?

@Ousret
Contributor

Ousret commented May 5, 2021

Hi everyone,

Like @sigmavirus24 said, this is highly unlikely to happen. The absence of any kind of answer is a good indicator.

I have worked in several companies that have a license-check mechanism, sometimes used correctly, sometimes not. In those companies, I proposed a really dirty way to fix this: using their internal proxy to swap chardet for charset-normalizer via a hacky package. Not a good way to go, but it worked.

The only concern I would have regarding this PR is the deployment scale: although this alternative package is stable and has been used by many, it is no match for requests' usage. But we have to weigh this against the actual usage of the 'apparent_encoding' property.
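
For context, 'apparent_encoding' is roughly the only place the detector is consulted: Response.text falls back to it when the server declares no charset. A small illustration (the URL is just an example):

import requests

resp = requests.get("https://example.org")
print(resp.encoding)           # charset taken from the Content-Type header, if any
print(resp.apparent_encoding)  # detector's guess (chardet today, charset_normalizer with this PR)
print(resp.text[:60])          # decoded with resp.encoding, falling back to apparent_encoding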

This concern can be addressed with many solutions at the maintainers' disposal, like using the 'pre-release' mark on PyPI, using a separate branch automatically synced with the master one, etc.
We are available if you (the maintainers) need to verify whether any regression shows up.

If this project has a future major release, I would argue in favor of removing charset detection from it, as it is not 'HTTP client' related - like httpx did sometime this year.

Otherwise, technically - correct me if I am wrong - I feel confident saying that charset-normalizer is way more reliable than chardet, and this can be proven easily (e.g. i. chardet's poor charset coverage; ii. it sometimes returns an encoding that cannot decode the given bytes).

Maintainer @nateprewitt already took a risk when bumping chardet from 3.x to 4.0 earlier this year (within ~80 hours after the chardet 4.0 tag was published to GitHub), so one could argue that such a change is not impossible.

In case this PR fails, I like @potiuk's idea of having a fork in the Apache org with this patch, synced with upstream at all times.

Regards,

@potiuk
Contributor

potiuk commented May 7, 2021

@sigmavirus24 - what do you think? Any chance we can avoid forking? I think we will need to solve it before the next release of Airflow which is likely going to happen next week, so we do not have a lot of time.

@sethmlarson
Member

@potiuk This is unlikely to be accepted and released within a week.

@ashb
Contributor Author

ashb commented May 7, 2021

Within a week is tight, yeah 🙂 but even just a "yes, we are working to accept this" is probably enough for us to not have to fork it in the short term (which we'd clearly like to avoid.)

@potiuk
Contributor

potiuk commented May 9, 2021

Hey @sigmavirus24. FYI: we have started to discuss this in the LEGAL JIRA of the Apache Software Foundation. It seems that in the ASF we now have ~50 projects that use the requests library, and the discussion came to the conclusion that basically we have to migrate off the requests library because of the chardet dependency.

Here is the discussion: https://issues.apache.org/jira/browse/LEGAL-572

We are discussing the solution. Forking requests and asking others to use the fork is one option.

One more thing that came out in the discussion is to propose that you donate requests to the ASF (and go through the incubation process). I believe that if you have problems with the feature freeze / maintenance effort needed, becoming part of an established organisation might be a path you would be willing to take.

What do you think? Would you be willing to reconsider getting rid of the LGPL dependency and merging the PR? I think we are really close to starting a bigger effort of not only converting 50 ASF projects but also asking a number of 3rd-party libraries to switch to the non-LGPL fork we are planning to make.

We really want to play it nicely, please don't treat it as a hostile move, but I think we have no other choice here.

@sigmavirus24
Contributor

Is this a feature freeze "for ever", or is there some planned time frame?

This started with the release of 2.0. The idea was to slow down development to a reasonable pace. Lots of features were being thrown over the wall at us and the library's reach was beginning to sprawl way too much. It's been a feature freeze for quite a few years.

Maybe we can somehow step in and help you with testing/changes and step up gradually to become maintainers eventually? I think we could probably try to bring it to the attention of more experienced devs who could help with that.

To be clear, you're suggesting becoming maintainers to stop a feature freeze created to keep the surface area minimal and keep churn down to provide a stable library, not because we're inexperienced as developers or maintainers. I'm sure, of course, you weren't trying to call us inexperienced just because we have a default policy of "No" for things that don't belong in Requests.

That said, this appears to have been sent as a PR with a sudden urgency that feels manufactured, as chardet has been a dependency of this library almost as long as the library has existed. It was also created with the message of "We can do testing if needed" - as in, you did no testing and expected us to just merge it because the ASF needed it, which defeats the purpose of designing for stability.

One more thing that came out in the discussion is to propose that you donate requests to the ASF (and go through the incubation process). I believe that if you have problems with the feature freeze / maintenance effort needed, becoming part of an established organisation might be a path you would be willing to take.

So, requests isn't mine to donate by any stretch of the imagination.

The PSF is fairly established but doesn't force things.

Would you be willing to reconsider getting rid of the LGPL dependency and merging the PR?

As I said, I don't trust the PR. Y'all have given me 0 confidence that this actually doesn't break things for users. So that's a hard pass for me on merging this.

Now that Pip is off of Python 2, however, I think a Requests 3 that's Python 3 only is well within sight and just dropping the character detection altogether is the right choice. I don't think it works well either way with any dependency in particular.

As for when Requests 3 might ship, who knows. I don't particularly have a lot of time for that and neither does Nate. Also it probably won't be requests 3. But that's a separate thing altogether

@Ousret
Contributor

Ousret commented May 10, 2021

Hi @sigmavirus24

I am truly amazed by the situation. Lots of things learned in the past month.

Do you even realize that the frustration felt by the community is born out of these kinds of weird situations?
How much would it have cost to answer that at the beginning? How many PRs would already be closed if there were fewer unsaid things? This debate could have been avoided in the first place; it appears that the maintainers had already made their decision the second this PR showed up.

  • You said that maintainers' time is valuable without considering the time of your fellows.
  • You said that 'requests' is SO critical that no sudden changes are tolerated, YET chardet 4.0 was adopted in less time than it took you to answer a hard NO to this. How did the maintainers verify that that release was safe? Meanwhile, the latest idna major is denied.
  • When answering a PR about the PSF CoC and clearly dismissing others, how would you expect reasonable people to react?
  • And that is a small portion of what there is to say, unfortunately. Lots of contradictions.

Open source can only grow and evolve by confronting each other's opinions and being as truthful as possible; only together will we succeed.

I am truly convinced that honesty is gold, even if brutal and cold, as long as it is said cordially.
No one said that the current maintainers are inexperienced; the contributions speak for themselves. I was so inspired by what requests brought that I started composing things on my own. So when others reach out to propose help in any form, it should not be considered hostile.
Today I feel a little bit disappointed in requests/maintainers.

As for when Requests 3 might ship, who knows. I don't particularly have a lot of time for that and neither does Nate. Also it probably won't be requests 3. But that's a separate thing altogether

Yes, maybe requests3 is meant to be created by others. Who knows.

As I said, I don't trust the PR. Y'all have given me 0 confidence that this actually doesn't break things for users. So that's a hard pass for me on merging this.
Now that Pip is off of Python 2, however, I think a Requests 3 that's Python 3 only is well within sight and just dropping the character detection altogether is the right choice. I don't think it works well either way with any dependency in particular.

That is AN opinion, feel free to check out actual facts.

Finally, thank you for giving us a response even if negative. Others are still hoping.

Hopefully, there is a better future for requests and for us all.
Regards,

@sigmavirus24
Contributor

How much would it have cost to answer that at the beginning? How many PRs would already be closed if there were fewer unsaid things? This debate could have been avoided in the first place; it appears that the maintainers had already made their decision the second this PR showed up.

Contrary to the story you seem to have constructed about my opinion (and only mine, I don't speak for Nate), I was keeping an open mind that someone would say "We did these tests to verify this library is easily swapped for chardet, we think this is safe as an alternative". The whole conversation hasn't been about backwards compatibility but instead about a license which has been present since before Requests 1.0.

You said that 'requests' is SO critical that no sudden changes are tolerated, YET chardet 4.0 was adopted in less time than it took you to answer a hard NO to this. How did the maintainers verify that that release was safe? Meanwhile, the latest idna major is denied.

chardet has had consistently top-notch quality releases. It's much like we upgrade urllib3 pretty quickly. I feel like I'm missing something in this conversation though since you seem very hung up on this. 4.0, to the best of my knowledge (which is likely incomplete), didn't cause any issues. Dropping Python 2 will get us onto the latest idna, but once again, no one has done any testing to indicate that it's backwards compatible. Like this PR, folks just send it and expect it to get merged or expect us to do that testing ourselves. If we were to write that kind of testing into our CI, we'd get low-quality bug reports from Linux distro maintainers about those tests talking to the open internet or, even worse, failing as they package incompatible versions together. Just smashing merge doesn't save us any time and only irritates users who are broken by those changes.

That is AN opinion, feel free to check out actual facts.

I'm genuinely confused by this. Are you arguing that the automatic character detection works well? I have years of issues, Stack Overflow questions, emails, and blog posts indicating that it's terrible for a great number of people. Maybe not 100% of users, but without any kind of telemetry I can only look at the data available to me and make the determination that users are liable to be less confused by Requests' behaviour if something like chardet wasn't used. That's also orthogonal to the only concern the ASF seems to have, which is the license.

@Ousret
Contributor

Ousret commented May 10, 2021

Text is not a good medium for this exchange; too much gets misunderstood.

Contrary to the story you seem to have constructed about my opinion (and only mine, I don't speak for Nate), I was keeping an open mind.

I don't doubt your good intentions. What I am saying today is that this could have been handled in a better way. Many of the things you hold against this PR could have been said earlier. Not admitting that is not the way to go, I think.

chardet has had consistently top-notch quality releases.

Comparing chardet to urllib3 is a bit of a stretch. Many open-source projects are released well. That holds for urllib3 but not for chardet: I have studied the code and it does not live up to the "top quality" claim.
urllib3's release process is, in fact, a well-oiled machine.

didn't cause any issues.

Future tense. You took the risk and waited. It could have been a disaster. As you said, I am hung up on that. The release timeline was abrupt, so it is reasonable to raise the question.

Like this PR, folks just send it and expect it to get merged or expect us to do that testing ourselves.

I think one of the main issues is your feeling regarding those PRs. A contribution could be met halfway; if you wish nothing to be done on your end, that is okay. Be a mentor and guide contributors. Let them do the actual work, share your knowledge, expand your horizons with them. If you wish to be the sole reviewer/merger, that is okay for everyone.

Just smashing merge doesn't save us any time and only irritates users who are broken by those changes.

Everyone knows that and agrees with it. You saw only that in people's messages, but let me assure you that it wasn't just about that.

Are you arguing that the automatic character detection works well?

No, I never said that. Some solutions bring more stability than others, for sure. Chardet is far behind cchardet/uchardet; one would have to be completely blind not to see it. I am as well placed as you to speak on this matter.

but without any kind of telemetry I can only look at the data available to me

You really need to do more research. There is more data than "what you have". You would know that if you looked at what has been done.

Otherwise, I insist that if you had opened up earlier, no one would have insisted for weeks...

but instead about a license which has been present since before Requests 1.0.

Yes, this is true. But the alternatives were not there until recently, and it would have been ridiculous to propose this sooner; they needed more time to gain maturity through usage. cchardet is excluded due to binding/build constraints.

Regards,

@sigmavirus24
Contributor

@Ousret I think you've interpreted things and added context to my actions that simply isn't there and made assumptions that have led you to believe I've wasted time or disrespected others' time.

You mention the CoC PR. I did all of that while on my phone picking up groceries or walking my dog because it was easy and didn't need nuanced communication. This issue, however, I have wanted to provide clear communication about and doing that from my phone is nearly impossible. I had time today at a computer, but genuinely most of my time at my computer these days is not spent on Open Source or even this website. What little time my personal computer is on it is in search of something I or my family needs or some other task that needs to be done and then I'm away from it rather immediately.

You don't get to see that. This 'social' platform doesn't give you that context. Some things I can handle quickly without needing to communicate clearly, others not so much. Early on, I misread that folks were planning to do more testing. Only later, on a re-read after the umpteenth direct ping about this, did I realize I'd misread it and that it was an afterthought on the author's part. Any contribution I've made to another project I've ensured was tested, and I've tried my best to test it. I've even reached out to the authors in advance of sending the PR to see if they have advice or if my tests would be sufficient.

Chardet is far behind cchardet/uchardet; one would have to be completely blind not to see it.

"Far behind" and the three, from the last several times I tested it returned wildly different results. Swapping out dependencies that produce different results is not backwards compatible. But apparently I'm blind because I have constraints that those projects haven't bothered to research or consider or even talk to me about before declaring themselves the clearly superior project in all cases, especially this one.

@ashb
Contributor Author

ashb commented May 11, 2021

@sigmavirus24 I'm sorry for all the raised tempers that this PR has provoked; this was far from my intent.

My understanding then is that the requests library is in critical-fixes-only mode and that you don't view this as a critical fix, so I'll close this PR as it's clear you won't accept it, because you don't want to break code for existing users - which is something I can entirely appreciate!

If there is any chance you would accept this PR I will do the work to make it happen, but I don't think it'll happen. Please tell me if I'm mistaken.

@sigmavirus24
Contributor

@ashb I'm sorry for the miscommunications and how little time I have to test this.

One thing I can think of: could the ASF run a test against some of the most popular websites and determine, for the ones that don't declare an encoding, whether this dependency and chardet agree?

There are open lists of these websites; sadly I expect most of them to actually declare their encoding, but it's worth a shot as a sample. This is one of the tests I've run in the past, but I have neither sponsored hardware to use nor time to orchestrate it.

@ashb ashb reopened this May 11, 2021
@ashb
Contributor Author

ashb commented May 11, 2021

@sigmavirus24 No problem, I understand the pressures of being an open-source maintainer!

Perhaps another test we can do is to capture some sites in various encodings and strip out the encoding headers/meta tags?
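
As a rough illustration of what such a comparison could look like - just a sketch that assumes both libraries are importable and ignores whatever charset the site declares:

import requests
import chardet
import charset_normalizer

def detections(url):
    payload = requests.get(url, timeout=10).content   # raw bytes; declared charset is ignored
    old = chardet.detect(payload)["encoding"]
    match = charset_normalizer.from_bytes(payload).best()
    new = match.encoding if match is not None else None
    return old, new

for url in ("https://www.wikipedia.org", "https://www.amazon.co.jp"):
    print(url, *detections(url))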

@sigmavirus24
Contributor

Yes. The main thing we need is reasonable confidence this is mostly backwards compatible (I don't expect full backwards compatibility because I know chardet's algorithm is far from perfect and I know the alternatives are also far from perfect). I do expect something greater than roughly 75% compatibility. The last time I tested things the compatibility was roughly 60%, even accounting for the fact that chardet reports some encodings differently (because the models are a little more dated). Sadly, when Rackspace killed their F/OSS credit program I lost that data because I hadn't backed it up elsewhere.

@potiuk
Contributor

potiuk commented May 11, 2021

Yes. The main thing we need is reasonable confidence this is mostly backwards compatible (I don't expect full backwards compatibility because I know chardet's algorithm is far from perfect and I know the alternatives are also far from perfect). I do expect something greater than roughly 75% compatibility. The last time I tested things the compatibility was roughly 60%, even accounting for the fact that chardet reports some encodings differently (because the models are a little more dated). Sadly, when Rackspace killed their F/OSS credit program I lost that data because I hadn't backed it up elsewhere.

It's great that we can now discuss how we can help with testing :). I would be very happy to help, especially since I spent the better part of yesterday and today implementing and testing a replacement of requests with httpx for Apache Airflow (we need it in order to release).

Actually, one other thing the ASF projects can do - if we have an alpha/beta release of requests with an optional chardet dependency - is to ask all the projects that are using requests to test whether it works for them with chardet removed.

@da1910

da1910 commented May 11, 2021

I ran a quick test using both the Alexa top 50 domains and a list of 500 domains I grabbed from moz.com. The results were fairly positive, though it would be worth actually grabbing the full Alexa list for a few countries (US, JP, RU etc.) to repeat the test.

Quick and dirty script is here: https://gist.github.com/da1910/79c168294a8dfe2957a8cbc61daa1710

The reported encoding and chardet's and charset-normalizer's determinations are included in the CSV files, and the tables below show the results where the two packages differ.

Alexa Top 50 (2 timed out, 12 reported no encoding, 3 had a different result):

| URL | Chardet | Charset_Normalizer |
| --- | --- | --- |
| Wikipedia.org | Windows-1254 | utf_8 |
| Twitch.tv | Windows-1252 | utf_8 |
| Amazon.co.jp | SHIFT_JIS | cp932 |

all_results_50.txt

Moz's top 500 domains (17 timed out, 92 reported no encoding, 11 had different results):

| URL | Chardet | Charset_Normalizer |
| --- | --- | --- |
| brandbucket.com | ISO-8859-1 | utf_8 |
| cdc.gov | utf-8 | utf_8 |
| nasa.gov | ISO-8859-1 | utf_8 |
| youronlinechoices.com | Windows-1252 | iso8859_10 |
| amazon.co.jp | SHIFT_JIS | cp932 |
| twitch.tv | Windows-1252 | utf_8 |
| m.wikipedia.org | Windows-1254 | utf_8 |
| wikipedia.org | Windows-1254 | utf_8 |
| wn.com | Windows-1254 | utf_8 |
| gutenberg.org | Windows-1252 | utf_8 |
| photos1.blogger.com | Windows-1252 | None |

Note: photos1.blogger.com returned a gif for me, so it's most definitely not windows-1252...

all_results_500.txt

@potiuk
Contributor

potiuk commented May 11, 2021

I will try to run more tests tomorrow with more sites.

Just for @sigmavirus24 and others reading this - so that you are aware of why this is important for us and what problem we are trying to address. Maybe you are simply not aware of the extent of the problem.

For Airflow - we are just about to release 2.1, and I just finished the PR (it took me more than a day) where I removed requests as a core dependency (basically we replaced requests with httpx, we vendored in connexion (we have an API built with connexion) and replaced requests with httpx for it) and changed the HTTP provider (and its derived classes) to use httpx. Here is the PR: apache/airflow#15781. This PR still needs to be reviewed, corrected after review, thoroughly tested, and merged. This is a monster change: 86 files, +7,035 −684 lines of code.

[Screenshot from 2021-05-11 showing the PR's diff statistics]

But as the PMC of Airflow we have no choice now - we are obliged to do it to release a new version (now that we are aware of the problem with chardet, keeping it as a mandatory requirement would be consciously violating ASF policy).

Luckily we could do it rather "quickly" because we've already split Airflow into core and optional parts, and it is OK under ASF policy to have an optional dependency on LGPL code. But we cannot have a mandatory one.

But this does not even touch all the optional dependencies and transitive ones. We have 43 3rd-party packages that use requests (https://issues.apache.org/jira/browse/LEGAL-572?focusedCommentId=17341767&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17341767) - GCP/Azure/Docker/Kubernetes; they are optional, but it's hard to imagine anyone using Airflow without those.

Unfortunately some other ASF projects use Airflow as a dependency (for example Apache Liminal), and they require the Docker and Kubernetes optional parts of Airflow. So they are in a much worse situation, because for them pretty much a whole chain of dependencies would have to be updated. So for now Liminal is pretty much blocked.

Those ~50 ASF projects could either do the same as I just did, or test requests with charset_normalizer when it is ready for testing in alpha/beta.

@Ousret
Contributor

Ousret commented May 12, 2021

Hi,

First of all, I would like to thank @sigmavirus24.

Then @ashb
I created two test playgrounds using docker-compose with Python 2.7 and 3.8, using your branch. Sources are available at https://github.com/Ousret/requests-x-charset_normalizer

The results are also fairly positive. The only remaining problem is that you missed filtering out the warning about "trying to detect from..", as @da1910 did.

Here is what I got from running playground27 and playground38 (diff only).

I took into account that ISO-8859-7 == cp1253, as it is the same code page under a different name.

We have a firm 78% exact-match backward compatibility, based on 400+ files. And these numbers increase if we tolerate different encodings that produce equal Unicode output - even more if we tolerate minor differences.

file;chardet;charset_normalizer
/raw/utf-8-sig/bom-utf-8.srt;UTF-8-SIG;utf_8
/raw/windows-1250-slovak/_ude_1.txt;Windows-1254;latin_1
/raw/TIS-620/opentle.org.xml;TIS-620;iso8859_11
/raw/iso-8859-2-hungarian/hirtv.hu.xml;ISO-8859-1;cp1250
/raw/Johab/iyagi-readme.txt;None;johab
/raw/windows-1250-czech/_ude_2.txt;Windows-1252;cp850
/raw/iso-8859-2-hungarian/shamalt.uw.hu.mv.xml;ISO-8859-1;cp1250
/raw/windows-1250-hungarian/bbc.co.uk.hu.forum.xml;ISO-8859-1;cp1250
/raw/windows-1250-slovak/_ude_3.txt;Windows-1254;latin_1
/raw/iso-8859-2-hungarian/shamalt.uw.hu.mk.xml;ISO-8859-1;cp1250
/raw/utf-8/sample.3.ar.srt;UTF-8-SIG;utf_8
/raw/TIS-620/pharmacy.kku.ac.th.centerlab.xml;TIS-620;iso8859_11
/raw/IBM866/blog.mlmaster.com.xml;IBM866;cp1125
/raw/iso-8859-2-hungarian/auto-apro.hu.xml;ISO-8859-1;cp1250
/raw/GB2312/jjgod.3322.org.xml;GB2312;gb18030
/raw/windows-1254-turkish/_chromium_windows-1254_with_no_encoding_specified.html;ISO-8859-1;hp_roman8
/raw/GB2312/cnblog.org.xml;GB2312;gb18030
/raw/GB2312/cindychen.com.xml;GB2312;gb18030
/raw/windows-1252/_mozilla_bug421271_text.html;ISO-8859-1;cp437
/raw/windows-1256-arabic/sample.2.ar.srt;MacCyrillic;cp1256
/raw/GB2312/chen56.blogcn.com.xml;GB2312;gb18030
/raw/TIS-620/_mozilla_bug488426_text.html;TIS-620;iso8859_11
/raw/iso-8859-2-czech/_ude_1.txt;ISO-8859-1;iso8859_15
/raw/iso-8859-2-hungarian/shamalt.uw.hu.mr.xml;ISO-8859-1;cp1250
/raw/TIS-620/pharmacy.kku.ac.th.analyse1.xml;TIS-620;iso8859_11
/raw/IBM866/newsru.com.xml;IBM866;cp1125
/raw/GB2312/pda.blogsome.com.xml;GB2312;gb18030
/raw/GB2312/luciferwang.blogcn.com.xml;GB2312;gb18030
/raw/IBM866/music.peeps.ru.xml;IBM866;cp1125
/raw/GB2312/softsea.net.xml;GB2312;gb18030
/raw/windows-1252/_ude_2.txt;Windows-1252;cp437
/raw/windows-1250-slovene/_ude_1.txt;Windows-1252;cp850
/raw/GB2312/_chromium_gb18030_with_no_encoding_specified.html.xml;GB2312;euc_jis_2004
/raw/iso-8859-2-hungarian/cigartower.hu.xml;ISO-8859-1;cp1250
/raw/windows-1256-arabic/sample.1.ar.srt;MacCyrillic;cp1256
/raw/iso-8859-2-hungarian/honositomuhely.hu.xml;ISO-8859-1;cp1250
/raw/GB2312/_mozilla_bug171813_text.html;GB2312;big5hkscs
/raw/IBM866/aug32.hole.ru.xml;IBM866;cp1125
/raw/IBM866/aif.ru.health.xml;IBM866;cp1125
/raw/iso-8859-2-hungarian/shamalt.uw.hu.xml;ISO-8859-1;cp1250
/raw/IBM866/kapranoff.ru.xml;IBM866;cp1125
/raw/windows-1256-arabic/_chromium_windows-1256_with_no_encoding_specified.html;MacCyrillic;cp1256
/raw/windows-1250-slovak/_ude_2.txt;Windows-1254;hp_roman8
/raw/windows-1250-hungarian/_ude_2.txt;Windows-1252;iso8859_10
/raw/iso-8859-2-hungarian/escience.hu.xml;ISO-8859-1;cp1250
/raw/Johab/hlpro-readme.txt;None;johab
/raw/utf-8/_ude_3.txt;utf-8;None
/raw/GB2312/godthink.blogsome.com.xml;GB2312;gb18030
/raw/IBM866/intertat.ru.xml;IBM866;cp1125
/raw/SHIFT_JIS/_ude_1.txt;SHIFT_JIS;shift_jis_2004
/raw/utf-8-sig/_ude_4.txt;UTF-8-SIG;utf_8
/raw/iso-8859-2-hungarian/ugyanmar.blogspot.com.xml;ISO-8859-1;cp1250
/raw/iso-8859-6-arabic/_chromium_ISO-8859-6_with_no_encoding_specified.html;MacCyrillic;iso8859_6
/raw/windows-1252/_ude_1.txt;Windows-1252;cp850
/raw/IBM866/money.rin.ru.xml;IBM866;cp1125
/raw/windows-1250-hungarian/objektivhir.hu.xml;ISO-8859-1;cp1250
/raw/windows-1252/github_bug_9.txt;Windows-1252;cp437
/raw/windows-1250-hungarian/bbc.co.uk.hu.pressreview.xml;Windows-1252;cp1250
/raw/IBM866/forum.template-toolkit.ru.4.xml;IBM866;cp1125
/raw/iso-8859-9-turkish/divxplanet.com.xml;ISO-8859-1;cp1254
/raw/SHIFT_JIS/_ude_4.txt;SHIFT_JIS;shift_jis_2004
/raw/IBM866/janulalife.blogspot.com.xml;IBM866;cp1125
/raw/EUC-KR/_chromium_windows-949_with_no_encoding_specified.html;EUC-KR;gb2312
/raw/iso-8859-7-greek/disabled.gr.xml;windows-1253;iso8859_7
/raw/iso-8859-2-polish/_ude_1.txt;ISO-8859-1;hp_roman8
/raw/iso-8859-2-hungarian/saraspatak.hu.xml;ISO-8859-1;cp1250
/raw/windows-1250-hungarian/bbc.co.uk.hu.xml;Windows-1252;cp1250
/raw/IBM866/forum.template-toolkit.ru.8.xml;IBM866;cp1125
/raw/GB2312/coverer.com.xml;GB2312;gb18030
/raw/windows-1250-romanian/_ude_1.txt;Windows-1252;iso8859_15
/raw/IBM866/_ude_1.txt;IBM866;cp1125
/raw/iso-8859-1/_ude_1.txt;ISO-8859-1;hp_roman8
/raw/TIS-620/trickspot.boxchart.com.xml;TIS-620;iso8859_11
/raw/IBM866/forum.template-toolkit.ru.6.xml;IBM866;cp1125
/raw/windows-1254-turkish/_ude_1.txt;Windows-1252;iso8859_15
/raw/windows-1250-polish/_ude_1.txt;Windows-1252;hp_roman8
/raw/IBM866/susu.ac.ru.xml;IBM866;cp1125
/raw/GB2312/w3cn.org.xml;GB2312;gb18030
/raw/EUC-TW/_ude_euc-tw1.txt;EUC-TW;gb18030
/raw/IBM866/greek.ru.xml;IBM866;cp1125
/raw/IBM866/forum.template-toolkit.ru.9.xml;IBM866;cp1125
/raw/GB2312/cappuccinos.3322.org.xml;GB2312;gb18030
/raw/windows-1250-czech/_ude_1.txt;Windows-1254;cp850
/raw/windows-1250-hungarian/_ude_1.txt;Windows-1252;mac_latin2
/raw/windows-1250-croatian/_ude_1.txt;Windows-1252;cp850
/raw/IBM866/forum.template-toolkit.ru.1.xml;IBM866;cp1125
/raw/Johab/mdir-doc.txt;None;johab
/raw/TIS-620/pharmacy.kku.ac.th.healthinfo-ne.xml;TIS-620;iso8859_11
/raw/GB2312/eighthday.blogspot.com.xml;GB2312;gb18030
/raw/ascii/_mozilla_bug638318_text.html;ascii;None
/raw/windows-1250-hungarian/bbc.co.uk.hu.learningenglish.xml;ISO-8859-1;cp1250
/raw/GB2312/14.blog.westca.com.xml;GB2312;gb18030
/raw/utf-8-sig/sample-english.bom.txt;UTF-8-SIG;utf_8
/raw/iso-8859-1/_ude_6.txt;ISO-8859-1;cp1250
/raw/EUC-JP/_mozilla_bug431054_text.html;EUC-JP;cp1252
/raw/windows-1250-hungarian/torokorszag.blogspot.com.xml;ISO-8859-1;cp1250
/raw/windows-1256-arabic/sample.4.ar.srt;MacCyrillic;cp1256

There is the question of UTF-8-SIG vs UTF-8: charset_normalizer returns 'UTF-8'.

You may find my JSON outputs:

To reproduce your own outputs, run check_compat.py after generating the required JSONs in the ./results directory.
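
check_compat.py itself is not reproduced here, but the "equal Unicode output" tolerance mentioned above boils down to something like this sketch (the file path is one of the corpus entries listed above, and the codec names are Python's):

def same_unicode_output(raw: bytes, enc_a: str, enc_b: str) -> bool:
    # Treat two differently named encodings as compatible when they decode
    # the same bytes to the same Unicode string.
    try:
        return raw.decode(enc_a) == raw.decode(enc_b)
    except (LookupError, UnicodeDecodeError, TypeError):
        return False

with open("raw/IBM866/newsru.com.xml", "rb") as fh:
    raw = fh.read()
print(same_unicode_output(raw, "ibm866", "cp1125"))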

Regards,

@potiuk
Contributor

potiuk commented Jul 11, 2021

Hey @sigmavirus24 @nateprewitt ,

As promised, I personally reached out to all the Apache Software Foundation projects that are affected and informed them about the upcoming release. I've also tried to recruit them to help with testing.

If we could have a 2.26.0 release candidate (I noticed you do not publish release candidates through PyPI), they might help with testing (some have already confirmed that they can help).

Here is the full list of projects/Issues I created (some of them via JIRA issues so they are not automatically linked):

and last-but-not least

Please just let me know if there is a release candidate that we can test, and I will follow up with the maintainers of the projects.

@nateprewitt
Member

nateprewitt commented Jul 13, 2021

@potiuk Requests 2.26.0 should now be generally available on PyPI with these changes.

@potiuk
Contributor

potiuk commented Jul 13, 2021

Thanks @nateprewitt. I've already informed everyone! Thanks again for being responsive to the ASF's needs!

@potiuk
Contributor

potiuk commented Jul 13, 2021

Some test results are already coming in, with thumbs up! apache/trafficcontrol#6011 (comment)

potiuk added a commit to potiuk/airflow that referenced this pull request Jul 13, 2021
Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.
potiuk added a commit to apache/airflow that referenced this pull request Jul 13, 2021
)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.
jonfung-scale pushed a commit to jonfung-scale/jonfung-requests that referenced this pull request Jul 14, 2021
Although using the (non-vendored) chardet library is fine for requests
itself, but using a LGPL dependency the story is a lot less clear
for downstream projects, particularly ones that might like to bundle
requests (and thus chardet) in to a single binary -- think something
similar to what docker-compose is doing. By including an LGPL'd module
it is no longer clear if the resulting artefact must also be LGPL'd.

By changing out this dependency for one under MIT we remove all
license ambiguity.

As an "escape hatch" I have made the code so that it will use chardet
first if it is installed, but we no longer depend upon it directly,
although there is a new extra added, `requests[lgpl]`. This should
minimize the impact to users, and give them an escape hatch if
charset_normalizer turns out to be not as good. (In my non-exhaustive
tests it detects the same encoding as chartdet in every case I threw at
it)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
@hiranya911

Some of our unit tests that create mock HTTP responses were affected by this. Here's a minimal repro:

from requests import models

import io

data = '{}' # Empty JSON body
resp = models.Response()
resp.raw = io.BytesIO(data.encode())
print(resp.json())

This used to work, but with the latest release it raises the following error:

Traceback (most recent call last):
  File "/usr/local/google/home/hkj/Projects/firebase-admin-python/actions/firebase-admin-python/foo.py", line 9, in <module>
    print(resp.json())
  File "/usr/local/google/home/hkj/Projects/firebase-admin-python/actions/py3/lib/python3.9/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Setting resp.encoding = 'utf-8' resolved the issue for us.
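
For completeness, the workaround applied to the repro above is simply to pin the encoding before reading the body (a sketch):

import io
from requests import models

resp = models.Response()
resp.raw = io.BytesIO(b'{}')
resp.encoding = 'utf-8'  # pin the encoding so no charset detection is attempted
print(resp.json())       # -> {}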

@Ousret
Contributor

Ousret commented Jul 14, 2021

@hiranya911

Someone already reported this defect. jawah/charset_normalizer#58
Will be fixed soon.

Thanks for your vigilance,

josh-fell pushed a commit to josh-fell/airflow that referenced this pull request Jul 19, 2021
…che#16974)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.
potiuk added a commit to potiuk/airflow that referenced this pull request Aug 2, 2021
…che#16974)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.

(cherry picked from commit c46e841)
potiuk added a commit to potiuk/airflow that referenced this pull request Aug 5, 2021
…che#16974)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.

(cherry picked from commit c46e841)
kaxil pushed a commit to apache/airflow that referenced this pull request Aug 17, 2021
)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.

(cherry picked from commit c46e841)
jhtimmins pushed a commit to apache/airflow that referenced this pull request Aug 17, 2021
)

Following merging the psf/requests#5797
and requests 2.26.0 release without LGPL chardet dependency,
we can now bring back http as pre-installed provider as it does
not bring chardet automatically any more.

(cherry picked from commit c46e841)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2021