Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized deserialization for enum values #95

Closed
joelittlejohn opened this issue Jun 23, 2013 · 12 comments
Closed

Optimized deserialization for enum values #95

joelittlejohn opened this issue Jun 23, 2013 · 12 comments

Comments

@joelittlejohn
Copy link
Owner

Original author: sak...@gmail.com (March 13, 2013 12:48:52)

The way enums are currently generated is not as effective as it could be, especially when it comes to parsing them from their string values. Check out a quick example I wrote at https://gist.github.com/akutz/5151745. The reason I used both a pattern string and a map was to gain the efficiency of a case-insensitive comparison for testing the validity of the text as an enum vs always transforming the tested string. There is no reason to create a new String in memory for the comparison if we can handle it up-front with a case-insensitive regex pattern.

Original issue: http://code.google.com/p/jsonschema2pojo/issues/detail?id=95

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 13:50:49
Hi sakutz, if I understand correctly, you're talking about this generated code:

    @JsonCreator
    public static MyType.MyEnum fromValue(String value) {
        for (MyType.MyEnum c: MyType.MyEnum.values()) {
            if (c.value.equals(value)) {
                return c;
            }
        }
        throw new IllegalArgumentException(value);
    }

When you refer to creating a String in memory, are you talking about calling something like value.toUpperCase() before passing the value into this method?

Could we not solve this problem by simply changing the above code snippet to:

    @JsonCreator
    public static MyType.MyEnum fromValue(String value) {
        for (MyType.MyEnum c: MyType.MyEnum.values()) {
            if (c.value.equalsIgnoreCase(value)) {
                return c;
            }
        }
        throw new IllegalArgumentException(value);
    }

Or maybe I have misunderstood what the problem is or the proposed solution?

@joelittlejohn
Copy link
Owner Author

From sak...@gmail.com on March 13, 2013 13:54:01
I'm saying it's more efficient to pre-create a pattern and map for the enum that supports fast lookups than to perform an iteration over the entire list of values each time. Since an enum is initialized as a singleton and the pattern and map are created just once for the life of the JVM it means you incur a very early penalty and then all other membership checks or parse attempts are against a regex or cached map of enums.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 16:37:09
Ah I see. Sorry, I misunderstood and thought you were specifically talking about case-insensitivity here.

I think we do have a penalty at runtime here. It looks to me like we're trading off iteration for regex matching and hash calculation. The question is, how many enum values are required before the latter becomes more efficient (and is the cost ever big enough to warrant this extra complexity).

I'd be interested to see some stats on what the performance characteristics are for each approach as the number of enum constants grows.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 16:38:11
(sorry, by 'at runtime' I actually mean 'after initialization')

@joelittlejohn
Copy link
Owner Author

From sak...@gmail.com on March 13, 2013 16:38:18
Except these values are all calculated once when the enum class is loaded. So it's a one-time penalty versus a penalty for every membership check or parse attempt.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 16:53:42
Just to be clear, when we look up a value we are making the following trade-off:

Current strategy: iteration

New strategy: regex matching, hash calculation

For every string we parse, we need to perform a regex match and calculate a hash.

When the number of constants in the enum is small, iteration may be faster than a regex match plus hash calculation (particularly as the JIT may flatten this iteration into a single set of comparisons). When the number of enum constants is large, I'm sure that iteration will be more costly than regex match plus hash calculation.

I don't have a good idea what these performance characteristics would be, but I'd be interested to see the stats as the number of enum constants grows.

@joelittlejohn
Copy link
Owner Author

From sak...@gmail.com on March 13, 2013 17:11:49
I think you're leaving out an important detail regarding the repetition of the iteration:

Current strategy: repeated iteration of enum values for every membership check or parse attempt (you don't currently have a membership check other than the thrown exception)

Proposed strategy: one time iteration of enum values to build regex and calculate hashes

The current strategy has O(n), or linear, order of complexity. The proposed strategy has O(1), or constant, order of complexity since it uses hashes.

Yes, there is a penalty for the proposed strategy of O(n), but it is a one-time penalty. And unless you're concerned with how quickly your JVM loads your classes, it's not really a problem at all. In fact, unless you're in the business of writing an application that let's users load plug-ins at runtime where you've probably written your own class loader or use a IoC container like Spring or Giuce to manage the class loading for you, I cannot imagine where the proposed strategy isn't the preferred one.

No doubt it would be interesting to see how long the JVM takes to load enum with 1000 values under this new strategy. But even if it takes a full minute, if you're writing Java for a webapp or a server app, the start time is usually sacrificed without concern for faster runtime operations.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 17:20:28
Big O defines asymptotic behaviour, which is why for small n we must weight this trade-off.

I don't think I'm leaving out any detail, but I don't know how I can rephrase this another time. Maybe I'm misunderstanding your pseudo-code? The 'Proposed' strategy performs the following on parsing a value:

  1. regex match:
    https://gist.github.com/akutz/5151745#file-gistfile1-java-L47
  2. hash calculation:
    https://gist.github.com/akutz/5151745#file-gistfile1-java-L52

These are not one-time penalties, correct?

To repeat the above:

"When the number of constants in the enum is small, iteration may be faster than a regex match plus hash calculation (particularly as the JIT may flatten this iteration into a single set of comparisons). When the number of enum constants is large, I'm sure that iteration will be more costly than regex match plus hash calculation."

@joelittlejohn
Copy link
Owner Author

From sak...@gmail.com on March 13, 2013 17:22:41
Ah, I misunderstood. You meant the regex execution and calculation of the hash to match and lookup the values. I understand now.

Yes, these are not one-time penalties.

I'm not sure either. I'll test it when I get a chance and upload the test project to github.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 13, 2013 17:31:45
An interesting micro-benchmark problem.

I guess there may be situations in which mechanically generated schemas produce a very large number of enum constants, and this iteration time would become significant. In this case there would be no real workaround other than to replace the enum with a hand-coded version.

We should probably avoid replacing this with a case-insensitive mapping, this would not be compatible with the schema definition for enum and would potentially allow bad json data to be bound. This should simplify the implementation though, so if you do come to test this it would be good to test without the regex step too.

@joelittlejohn
Copy link
Owner Author

From joelittl...@gmail.com on March 24, 2013 21:37:32
I've done a few small benchmarks on a hash-based deserialize method versus an iteration based one. The hash-based lookup (with no regex) was quicker for even a small number of enum values so I've implemented this for 0.3.6.

Thanks for raising this sakutz, and please feel free to build a snapshot from the current master and try the new implementation.

@joelittlejohn
Copy link
Owner Author

From sak...@gmail.com on April 02, 2013 01:50:18
That's great! I got super busy at work and wasn't able to run the tests. Thanks for doing so!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant