Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate out of org.json library #26

Closed
wetneb opened this issue Aug 13, 2018 · 52 comments
Closed

Migrate out of org.json library #26

wetneb opened this issue Aug 13, 2018 · 52 comments

Comments

@wetneb
Copy link

wetneb commented Aug 13, 2018

Using org.json as a JSON library is problematic, as the licence of this library includes an additional clause "The Software shall be used for Good, not Evil."

This is regarded as a non-free licence (it is non OSI-compliant).

That makes it impossible for projects relying on this library (such as OpenRefine) to be OSI-compliant in turn.

@rufuspollock
Copy link

@georgeslabreche can you take a look at this as this is a blocker for integrations.

@lwinfree
Copy link
Member

The action item for this issue is to replace the org.json library in the datapackage-java code as it has a non OSI-compliant license. We should use a library with an open license such as MIT, Apache 2.0, or others listed on the OSI website.

@rufuspollock
Copy link

@lwinfree 👏 👏 it would be great to get this sorted as obviously usage in openrefine is 👍 💯

This should be pretty fast - it is just finding an OSI compliant java json lib and then refactoring to use that. If @roll can't sort this maybe someone at @datopian can help.

@iSnow
Copy link
Contributor

iSnow commented Oct 30, 2019

Based on usage stats, it's a choice between either GSON or Jackson, the others are either too old or small projects with relatively little use. For those two projects, long-term survival is not an issue.

I would question whether exposing methods tying the external API to one specific JSON provider is a good design choice, though. I'd think it should use one lib internally and only communicate via JSON strings, not JSON objects. This way, a client is free in choosing their JSON provider.

@lwinfree
Copy link
Member

Thanks @iSnow! Do you think you would have time to work on this? Or do you think you could help me estimate how much work this would be? I don't know any Java, so it is hard for me to estimate how long it would take to fix this.

@iSnow
Copy link
Contributor

iSnow commented Oct 30, 2019

Sure, @lwinfree.

I am currently in discussion with @roll about an issue I created that deals with streamlining the datapackage API. Part of that is getting the dependency on org.json classes out of the interface so people could use datapackage in their projects and not be tied to one JSON provider.

If that is accepted and the interface changes are landed, I would work on switching over to another JSON provider internally as well - it's not a super small task as the org.json classes are used for a lot of things, among them conversion to CSV. But it's totally doable.

Working on both in parallel is a bit too much for me, even if we split off into a feature branch - since we also have to redo JSON handling in tableschema-java, there would be too many moving parts.

@iSnow iSnow self-assigned this Oct 30, 2019
@lwinfree
Copy link
Member

@iSnow awesome. It is really great to have your input and help!

@rufuspollock
Copy link

@iSnow any update here? It would be so great to get the licensing issue resolved here. This would unblock getting this back in OpenRefine.

@iSnow
Copy link
Contributor

iSnow commented Dec 4, 2019

I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together as the json.org code is woven into the whole logic. Since the organization I work for can live with the strange json.org license, other issues like more rigid parsing, security and decent test coverage have a higher prio. And, unfortunately, working on this in parallel isn't really feasible, too many moving parts.

@lwinfree
Copy link
Member

Thanks for the update @iSnow! That is really helpful information, and I understand that you have higher priorities. To try and get some additional help on the issue, we are going to post this on bountysource (if you end up having the time to work on it, please accept the bounty yourself :-) ).

@lwinfree
Copy link
Member

@wetneb
Copy link
Author

wetneb commented Dec 10, 2019

It's always hard to put a price tag on an issue, but in this case the bounty seems pretty small compared to the task. (just to be clear: I am not trying to bargain, I have no intention to work on this myself)

I checked what it would take to migrate, but it is not looking good - my guesstimate would be a solid 3-4 weeks for both Java libs together as the json.org code is woven into the whole logic.

I agree - for instance the licenses in a Data Package are stored as JSON objects directly, using the underlying library, as the example in the README shows. It would probably be healthier to change this (by introducing the appropriate License class to store this data, with appropriate serialization and deserialization).

Given that migrating out of org.json will probably break a lot of interfaces, it might be the appropriate time to also tackle more profound architectural changes (for instance those suggested in #29). But of course that means more work - it could well mean rewriting most of the code!

So in terms of effort estimate, I would rate this on the same sort of order of magnitude as what it took to write this library in the first place, IMHO.

@iSnow iSnow removed their assignment Dec 11, 2019
@iSnow
Copy link
Contributor

iSnow commented Dec 11, 2019

Thanks for the write-up, @wetneb - that's exactly it.

I am at liberty to spend some time on improving the code on company costs, but not on topics that don't matter to the org (like this one) and not the amount of hours that would be needed for refactoring both libraries to a different JSON provider. I also like working on it in my free time, but fixing licensing issues is not that interesting, tbh.

What I did with the tableschema-java repo is that I removed all API exposure of json.org classes (replaced with Java types where easily possible and String where not), which is the first step towards moving to Jackson (preferably) or GSON. I'll most likely do this with the datapackages-java code as well, as I wrote in my comment above but that's where my commitment ends.

Consequently I removed myself from this issue - if maybe someone drops by who wants to work on this, please go ahead, but check back here first, so I can give some guidance (eg. start with tableschema-java which is more mature and has better tests right now)

@lwinfree
Copy link
Member

That sounds good @iSnow, and thanks for the feedback @wetneb. We're going to look at our budget and increase the bounty to better reflect the amount of work.

@lwinfree
Copy link
Member

lwinfree commented Jan 7, 2020

Hi @iSnow & others - we've updated the bounty amount based on your feedback! Again, thanks for the comments :-)

@devmindyg
Copy link

hello @lwinfree and others. I would like to work on this issue, if you guys say it's ok.
I have a free time and can dedicate some hours per day for this project (and need extra money ;) ).

The points raised by @iSnow and @wetneb could be a start
What do you guys say?

@iSnow
Copy link
Contributor

iSnow commented Jan 14, 2020

Sounds great, @devmindyg.

Please use Jackson as a replacement JSON library as I am already relying on it in a small way. Please start with the Tableschema-java project (https://github.com/frictionlessdata/tableschema-java) as this is the more mature project and the Datapackage-java project depends on it. Please check back in with me before tackling the Datapackage-java project, as this needs more work and we need to plan and align our work packages.

To clarify: this is not about just dropping out the json.org library and replacing it with Jackson code, but about improving the library to some extend. The projects were built around taking a JSON object, storing it, taking properties from it, and writing it back out. This leads to a lot of ugly code and switch/case statements as it disregards Java type safety. I changed a lot to make it more idiomatic Java, meaning object inheritance and generic types. Please continue in this spirit.

You might want to start with moving the tests to Jackson so you know you have working tests and then switch the library code to Jackson. Jackson features the ObjectMapper that allows you to throw in a JSON string and get back a Java object, not just a Map<String, Object>.

It probably would be best if you (and I) worked on feature branches and committed to trunk once a work package is done.

@wetneb
Copy link
Author

wetneb commented Jan 14, 2020

Jackson would indeed be a sensible choice to migrate to. It has some support for JSON schema, so that could potentially replace the current "Everit" implementation: https://github.com/FasterXML/jackson-module-jsonSchema

@iSnow
Copy link
Contributor

iSnow commented Jan 14, 2020

To add to the comment by @wetneb the Everit JSON Schema parser library is specific to json.org JSON objects, it cannot be used with Jackson. So the schema validation logic needs to be refactored to use a different JSON Schema validator. The Jackson module mentioned is mostly for generating a JSON Schema from a POJO (which is something tableschema-java supports), and there's also https://github.com/networknt/json-schema-validator for JSON Schema validation.

@devmindyg
Copy link

@iSnow @wetneb points taken!
@iSnow do you want to work on feature branches directly here on this repo?
or can we move on with fork/PR for this refactoring?
Today i'm working on tests with jackson on my fork
https://github.com/devmindyg/datapackage-java

@iSnow
Copy link
Contributor

iSnow commented Jan 14, 2020

Forking/PR is fine, @devmindyg, but please keep in mind you absolutely should start with tableschema-java as it is the lower-level library, it has decent test coverage, and is in a relatively stable state. datapackage-java (the one you cloned) must come later, it is currently in no state for changing the JSON lib.

The bounty isn't really clear about this, but if you only switch the JSON library for datapackage-java, it will still have a transitive dependency on json.org code via tableschema-java, and OpenRefine will still not be able to use datapackage-java.

@lwinfree take note, the scope of the bounty hasn't been clearly defined, if @devmindyg claims the bounty for just refactoring this here library, you do not get what you are looking for.

@devmindyg
Copy link

Thanks for the clarification @iSnow
i'll start with tableschema-java

@iSnow
Copy link
Contributor

iSnow commented Jan 23, 2020

Hey @devmindyg thanks for working on the library. I absolutely like what I see, very good work, and please go ahead and submit a PR.

A couple of small things:

  • The tests are a mixture of JUnit 4.x and JUnit 5.x (Jupiter). All files with tests I created are 5.x, and the ones I inherited are 4.x, even the individual tests I added there. I know this is not optimal, but I focused on getting the functionality into shape over migrating all tests to 5.x. I see that in DocumentationCases.java, you imported org.junit.Assert.assertEquals, which is 4.x. Could you kindly switch it to `org.junit.jupiterAssertions.assertEquals? And watchout which files are 4.x and which are 5.x - everything with a @DiplayName annotation is 5.x. If you feel like it, you are free to migrate tests to 5.x, of course, that would be highly appreciated.

  • Please do not forget to pull in the changes before submitting a PR. There shouldn't be conflicts, but I am not 100% sure as I pushed a couple of chances since you cloned the repo. I'll hold off pushing further commits till you submitted the PR. I pulled all changes to from master to branch integration, so you can submit against that branch. If you'd rather I create a feature branch, tell me what you'd like to have it named.

  • While I applaud your efforts to make the code more readable and conforming to Java standards (spaces before curly braces, tabs vs. spaces and so on), please don't go overboard as I have to mentally ignore all those lines. I'd be happier if you separated PRs with code changes from PRs with formatting changes - it drastically cuts down on lines I have to go through as AFAIK the Github diff viewer doesn't allow me to ignore whitespace changes (or does it?). If you feel strongly about code formatting on the other hand, please go ahead and ignore this comment.

So again thanks and you have green light to submit a PR :)

@devmindyg
Copy link

Hi @iSnow
sorry for the formatting changes, my editor was set to auto format code :(
i'll be more careful regarding formatting.
I changed newly added assert statements to use jupiter api and submitted a PR to the integration branch.

@devmindyg
Copy link

Hello again,
@iSnow i started working on jackson refactoring in tableschema-java. There are some explicit type checks for org.json.JSONArray instances, I replaced the type checks for NodeArray from jackson, but as you stated here #26 (comment) . It may be a good idea to check for generic ArrayList (or may be another collection?). We can still rely on jackson NodeArray or NodeObject to do the marshalings that we need and easy convert to the desired ArrayList. What do you think?
Also, is it ok to keep the communication centralized here?

@iSnow
Copy link
Contributor

iSnow commented Jan 30, 2020

Yes, let's keep this the main communication channel.

I'd rather we completely parse the JSON to Java collections/objects and then use type-safe methods to access their properties. Currently, things like Schema have fromJson() methods like initFromSchemaJson(String json) that does a lot of heavy-lifting. Ideally, it could be replaced by methods in ObjectMapper like Schema schema = mapper.readValue(jsonString, Schema.class) - conversely the Schema serialization via getJson() should be replaced by String jsonString = mapper.writeValueAsString(schema);

I am not 100% sure how to replace the fromJson() method in Schema though, because the matching subclass for each field is decided by the forType(String type, String name)-method in Field.java which does some reflection magic. Maybe it would be possible to serialize the field-array via Jackson MixIns. If that fails, using the StdDeserializer should work.

What I'd like to ask you is to either research this and come up with a better idea or to create a branch and to prototypically implement the Schema de-serialization with the help of MixIns or custom Deserializers so that we can see whether that's a route we should take considering effort and code size. The foreign keys aren't the important part, what is essential is the List<Field> fields needs to be the same as for the current de-serialization.

Oh, and please pull in the changes I made in the last two days.

@wetneb
Copy link
Author

wetneb commented Jan 30, 2020

I haven't had a close look but the forType method sounds like something that could be replaced by a Jackson TypeIdResolver. Its use is explained here:
https://www.baeldung.com/jackson-advanced-annotations#jsontypeidresolver

Or if the list of subclasses is fixed (and not meant to be extensible by users), then annotations could do it too:
https://www.baeldung.com/jackson-inheritance#2-per-class-annotations

@iSnow
Copy link
Contributor

iSnow commented Jan 30, 2020

Thanks, @wetneb, very helpful. Yes, the list of subclasses is fixed and is not user-extensible as the list of subclasses is tied to the schema definition by frictionlessdata.

@rufuspollock
Copy link

rufuspollock commented Mar 16, 2020

@wetneb @iSnow @lwinfree any update on progress here? What's needed to get this finished? It would be great to this in ... 😄

@iSnow
Copy link
Contributor

iSnow commented Mar 20, 2020

@rufuspollock due to COVID-19, I am currently busy

@Amraneze
Copy link

Amraneze commented Apr 8, 2020

@iSnow, @rufuspollock do you want me to start on tableschema-java ?

@wetneb
Copy link
Author

wetneb commented Apr 8, 2020

It might be worth checking with @devmindyg first, to see if they got anywhere near completing this.

@iSnow
Copy link
Contributor

iSnow commented Apr 9, 2020

@Amraneze You'll have to wait for @devmindyg to declare whether they are still working on it.

@rufuspollock
Copy link

@devmindyg could you let us know in the next few days if you are still working on this? If not we are planning to see if someone else could take it on.

@rufuspollock
Copy link

@lwinfree could you state in the description of the issue the details of the bounty as the bountysource link is giving an error atm.

@jdbranham
Copy link
Contributor

Sounds like an interesting task.
I recently lost some work due to covid, so looking to pick up some side jobs.
I'd like to take a shot at it - if @devmindyg is not done.

@rufuspollock
Copy link

@jdbranham please go ahead - we have not heard from @devmindyg for a while. /cc @lwinfree @lauragift21

@lwinfree
Copy link
Member

Thanks for your interest @jdbranham! @devmindyg thanks for your work on this, but we will need to move ahead here.

@jdbranham
Copy link
Contributor

jdbranham commented Apr 21, 2020

There were no existing conflicts, so I merged master -> integration and working on the changes in my fork of the integration branch.
Thanks team!

@jdbranham
Copy link
Contributor

@iSnow @wetneb - I've got a working prototype using Jackson's @JsonTypeInfo that should simplify the deserialization process in general.
I'll check my progress in later this evening.

@jdbranham
Copy link
Contributor

@iSnow @rufuspollock @lwinfree
I've got [de]serialization working, but have a question about schema validation.

According to the everit library -

if you use Jackson to handle JSON in Java code, then java-json-tools/json-schema-validator is obviously a better choice, since it uses Jackson
if you want to use the org.json API then this library is the better choice

Should I implement schema validation with this library?
https://github.com/java-json-tools/json-schema-validator

Is the license acceptable ASL2.0 + LGPL?

@jdbranham
Copy link
Contributor

Actually this one is active, Apache 2.0, and lightweight -
https://github.com/networknt/json-schema-validator

I'll go ahead with this, but I've abstracted the schema validation, so it will be easier to switch if necessary.

@iSnow
Copy link
Contributor

iSnow commented Apr 23, 2020

Sounds great, @jdbranham. It's clear that we have to move away from the everit library, as this is too tightly bound to the org.json API. I don't have a strong opinion on which JSON-schema lib to use, the one you found seems like a very good choice.

From what I read, you are on a good path using the Jackson features to read/write JSON, so please carry on :)

@jdbranham
Copy link
Contributor

Thanks @iSnow -
I ran into a snag with the CSV read/write. The jackson library doesn't support complex objects =\

Workaround for the deserialization process - when the table row is parsed, we can replace the 'speech' quotes the previous library used, with regular double quotes. So it should be backwards compatible.

For serialization, it was a little tricker. Basically we change the Json ArrayNode into a list of JsonNode objects, and for the ObjectNode sub-class, we just serialize the object as a string. All the tests pass, so I guess it's good =)
References:
FasterXML/jackson-dataformat-csv#9
FasterXML/jackson-dataformats-text#97

Question about one test - when we infer the schema, one of the tests is expecting the field title to be present, but I don't see it set anywhere. I could make the field title match the name during deserialization, but I wanted to make sure this didn't cause any unwanted side-effects.

Let me know what you guys think about this - if it's cool, then I'll fix it, and start on the datapackage-java updates.

frictionlessdata/tableschema-java@master...savantly-net:integration

@jdbranham
Copy link
Contributor

Actually the title isn't considered in equals - it was the constraint initialization.
Problem fixed.

https://travis-ci.org/github/frictionlessdata/tableschema-java/builds/679147569

@jdbranham
Copy link
Contributor

jdbranham commented Apr 26, 2020

PR opened for datapackage-java
I set the dependency for tableschema-java to work with local development, so we'll need to update the artifact coordinates to point to the correct version [once the tableschema-java PR is merged].

datapackage-java PR -
#35
tableschema-java PR -
frictionlessdata/tableschema-java#55

@lwinfree
Copy link
Member

lwinfree commented May 1, 2020

Hi @iSnow + @jdbranham! Thanks for all the work here! Has frictionlessdata/tableschema-java#55 closed this issue? Or are there other parts remaining?

@iSnow
Copy link
Contributor

iSnow commented May 3, 2020

hi @lwinfree - not fully, I need to review the PR for datapackage-java, which I will do in the coming days, then it should be good to close the issue and pay the bounty.

@jdbranham
Copy link
Contributor

Do I need to do anything else? =)

@lwinfree
Copy link
Member

lwinfree commented Jun 1, 2020

Hi @jdbranham thank you for all your work and for pinging me. I'm going to close this so we can pay you the bounty. Cheers!
Fixed via #35

@lwinfree lwinfree closed this as completed Jun 1, 2020
@lwinfree
Copy link
Member

lwinfree commented Jun 2, 2020

hey @jdbranham can you claim the bounty on bounty source? Let me know if you have problems!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants