Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filling out the functionality of defs. #426

Merged
merged 6 commits into from
Oct 6, 2021

Conversation

haberman
Copy link
Member

@haberman haberman commented Oct 5, 2021

This is a large-ish PR that fills in several previously missing features from upb defs.

The major changes here are:

  1. Implemented reflection for methods and services (these were previously ignored).
  2. Implemented options (including custom options) for all def types. These options are now vended from each type of def (eg. google_protobuf_MessageOptions for upb_msgdef, etc).
  3. Defs now follow the lexical structure of .proto files. upb used to store all defs in a flat array per file for simplicity. However existing APIs like Python expose the nested structure of .proto files, and require (for example) that you can easily iterate over all of the messages nested inside a given message.

There are a few other more minor changes here:

  1. We allow json_name duplicates if a special flag is set on the symtab. This is very inadvisable, since it makes JSON input ambiguous, but some existing code depends on this behavior.
  2. We support the legacy behavior that field type/label can be omitted and will use the default.
  3. There are new, simpler iteration APIs for upb_table that allow deletion of elements while iterating. We use these to support backing out a upb_symtab_add() operation if there is an error while building the defs.

This unfortunately regresses the ads descriptor loading benchmark a bit (understandable, as options support does require fundamentally more work than we were doing before). We will have to try reclaiming this later. Code size also grows significantly in upb/def.c: this is an expected result of the new functionality.

name                                      old time/op  new time/op  delta
ArenaInitialBlockOneAlloc                 7.04ns ± 0%  6.31ns ± 0%  -10.29%  (p=0.000 n=11+10)
ArenaOneAlloc                             21.5ns ± 2%  21.2ns ± 1%   -1.35%  (p=0.000 n=12+11)
LoadAdsDescriptor_Proto2                  14.8ms ± 3%  15.1ms ±11%     ~     (p=0.525 n=12+11)
LoadAdsDescriptor_Upb                     3.51ms ± 1%  3.74ms ± 2%   +6.67%  (p=0.000 n=11+12)
LoadDescriptor_Proto2                      249µs ± 0%   248µs ± 1%     ~     (p=0.552 n=9+11)
LoadDescriptor_Upb                        48.1µs ± 3%  46.8µs ± 1%   -2.63%  (p=0.002 n=10+11)
Parse_Proto2<FileDesc,InitBlock,Copy>     17.6µs ±17%  18.1µs ±16%     ~     (p=0.068 n=12+12)
Parse_Proto2<FileDesc,NoArena,Copy>       31.0µs ±11%  30.7µs ±11%     ~     (p=0.514 n=12+12)
Parse_Proto2<FileDescSV,InitBlock,Alias>  17.7µs ±19%  18.4µs ±14%   +4.26%  (p=0.033 n=12+12)
Parse_Proto2<FileDesc,UseArena,Copy>      19.7µs ± 4%  19.9µs ± 4%     ~     (p=0.151 n=11+11)
Parse_Upb_FileDesc<InitBlock,Alias>       8.35µs ± 0%  8.60µs ± 1%   +2.97%  (p=0.000 n=10+12)
Parse_Upb_FileDesc<InitBlock,Copy>        10.0µs ± 6%   9.9µs ± 1%     ~     (p=0.059 n=10+12)
Parse_Upb_FileDesc<UseArena,Alias>        8.75µs ± 1%  8.96µs ± 1%   +2.36%  (p=0.000 n=10+10)
Parse_Upb_FileDesc<UseArena,Copy>         10.4µs ± 1%  10.4µs ± 1%     ~     (p=0.387 n=10+11)
SerializeDescriptor_Proto2                5.60µs ± 3%  5.53µs ± 2%     ~     (p=0.201 n=11+9)
SerializeDescriptor_Upb                   11.7µs ± 0%  11.6µs ± 0%   -0.28%  (p=0.004 n=11+12)


    FILE SIZE        VM SIZE    
 --------------  -------------- 
   +23% +4.71Ki   +26% +4.61Ki    upb/def.c
    [NEW] +2.41Ki  [NEW] +2.37Ki    resolve_msgdef
     +25%    +832   +26%    +832    create_msgdef
    [NEW]    +450  [NEW]    +408    symtab_resolveany
     +20%    +400   +21%    +400    create_fielddef
     +21%    +288   +22%    +288    create_enumdef
    +4.7%    +240  +4.8%    +240    _upb_symtab_addfile
    [NEW]    +228  [NEW]    +192    opt_default
    [NEW]    +202  [NEW]    +160    count_exts_in_msg
    [NEW]     +99  [NEW]     +56    upb_msgdef_options
     +13%     +24   +18%     +24    upb_symtab_lookupext2
    +2.8%     +16  +3.1%     +16    _upb_symtab_loaddefinit
    +5.7%     +16  +6.7%     +16    upb_symtab_new
     +12%     +14  [ = ]       0    upb_oneofdef_field
    -6.2%      -7  [ = ]       0    upb_oneofdef_issynthetic
    -5.1%      -7  [ = ]       0    upb_oneofdef_itof
    -5.2%      -8  [ = ]       0    upb_oneofdef_name
    -5.8%     -16  -6.7%     -16    check_ident
    -3.7%     -64  -3.8%     -64    resolve_fielddef
    [DEL]    -100  [DEL]     -56    upb_msgdef_mapentry
    [DEL]    -243  [DEL]    -200    count_types_in_msg
  +3.8%    +383  +3.6%    +328    upb/table.c
    [NEW]    +216  [NEW]    +168    upb_strtable_removeiter
    [NEW]    +203  [NEW]    +160    upb_strtable_next2
    -2.7%      -7  [ = ]       0    upb_inttable_begin
    -3.8%      -7  [ = ]       0    upb_strtable_begin
    -5.2%      -8  [ = ]       0    upb_strtable_done
    -6.7%     -14  [ = ]       0    upb_strdup2
  +5.2%    +368  +5.2%    +368    [section .rodata]
    +5.2%    +362  +5.2%    +362    [section .rodata]
     +30%      +6   +30%      +6    parse_proto.msg
  +8.1%     +47  [ = ]       0    [section .strtab]
     +12%     +40  [ = ]       0    [section .strtab]
     +26%      +7  [ = ]       0    _GLOBAL_OFFSET_TABLE_
  +2.5%     +24  +2.5%     +24    [section .dynsym]
  +2.9%     +24  +2.9%     +24    [section .rela.plt]
    +2.9%     +24  +2.9%     +24    _GLOBAL_OFFSET_TABLE_
  +1.6%     +24  [ = ]       0    [section .symtab]
  +2.8%     +16  +2.8%     +16    [section .plt]
  +2.6%      +8  +2.6%      +8    [section .got.plt]
    +2.6%      +8  +2.6%      +8    _GLOBAL_OFFSET_TABLE_
  +1.4%      +7  +1.4%      +7    [section .dynstr]
  +0.0%      +5  [ = ]       0    upb/decode.c
    +2.1%      +5  [ = ]       0    decode_isdonefallback
  +1.2%      +4  +1.2%      +4    [section .hash]
  +2.4%      +2  +2.4%      +2    [section .gnu.version]
  +0.0%      +2  [ = ]       0    bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/descriptor.upb.c
    +8.6%      +8  [ = ]       0    google_protobuf_OneofDescriptorProto_msginit
    +6.8%      +7  [ = ]       0    google_protobuf_DescriptorProto_ExtensionRange_msginit
    +7.5%      +7  [ = ]       0    google_protobuf_FieldDescriptorProto_msginit
    +7.4%      +7  [ = ]       0    google_protobuf_MethodDescriptorProto_msginit
    +8.0%      +7  [ = ]       0    google_protobuf_SourceCodeInfo_msginit
    -1.9%      -2  [ = ]       0    google_protobuf_ServiceDescriptorProto_msginit
    -6.5%      -6  [ = ]       0    google_protobuf_ServiceOptions_msginit
    -7.6%      -7  [ = ]       0    google_protobuf_OneofOptions_msginit
    -6.8%      -7  [ = ]       0    google_protobuf_SourceCodeInfo_Location_msginit
   -12.5%     -12  [ = ]       0    google_protobuf_FileOptions_msginit
  -0.3%     -10  [ = ]       0    upb/msg.c
    -4.5%     -10  [ = ]       0    _upb_map_new
  -9.0%     -13  -9.0%     -13    [LOAD #2 [RX]]
 -60.5% -1.37Ki  [ = ]       0    [Unmapped]
  +2.5% +4.22Ki  +4.0% +5.36Ki    TOTAL

Biggest/key changes:
1. Defs are now nested per the .proto file syntax.
2. Options are parsed and vended.
@haberman haberman requested review from fowles and esorot October 5, 2021 07:16
upb/def.c Outdated
return c >= low && c <= high;
}

static bool upb_isletter(char c) {
return upb_isbetween(c, 'A', 'Z') || upb_isbetween(c, 'a', 'z') || c == '_';
return upb_isbetween(c | 0x20, 'a', 'z') || c == '_';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to look up the ascii table to realize that | 0x20 is equivalent to lowercasing a letter. I think that c | ('a' - 'A') is a bit more obvious, but perhaps just a comment would be best.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

upb/def.c Outdated
}

const upb_fielddef *upb_msgdef_nestedext(const upb_msgdef *m, int i) {
UPB_ASSERT(i >= 0 && i < m->nested_ext_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below you are consistent with 0 <= i && i < max, so I think you should update these as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -978,7 +1272,7 @@ const upb_fielddef *upb_symtab_lookupext2(const upb_symtab *s, const char *name,
return unpack_def(v, UPB_DEFTYPE_FIELD);
case UPB_DEFTYPE_MSG: {
const upb_msgdef *m = unpack_def(v, UPB_DEFTYPE_MSG);
return m->message_set_ext; /* May be NULL if not in MessageeSet. */
return m->in_message_set ? &m->nested_exts[0] : NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't &m->nested_exts[0] equivalent to m->nested_exts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I find &m->nested_exts[0] more semantically clear here; we are returning the address of the first thing in an array. This mirrors the usage in eg. upb_msgdef_nestedext().

}
}

const char *last_dot = strrchr(name, '.');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always worry about string apis that take a const char* and don't have a size associated with it... feels like a buffer overflow waiting to happen

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely avoid APIs like strcpy() that have a precondition that a target buffer is big enough. But strrchr() is no more dangerous than strlen(): the only danger is that the buffer isn't NULL-terminated, which will show up quickly in tests.

upb/def.c Outdated
@@ -1480,6 +1842,9 @@ static char* makejsonname(symtab_addctx *ctx, const char* name) {
return json_name;
}

/* Adds a symbol to the symtab. The def's pointer to upb_filedef* must be set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what def? is this ctx->def or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, v is a packed def pointer.

upb/def.c Outdated

// Resolve subdef by type name, if necessary.
switch ((int)f->type_) {
case 0: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does 0 not have a better named constant here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

upb/def.c Outdated

if (!ctx.arena) {
if (!ctx.arena && !ctx.tmp_arena) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be || not &&?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good catch.

@@ -263,6 +263,38 @@ bool upb_strtable_resize(upb_strtable *t, size_t size_lg2, upb_arena *a);

/* Iterators ******************************************************************/

/* New-style iterators. Much simpler, iterator state is held in size_t.
*
* intptr_t iter = UPB_INTTABLE_BEGIN;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment says state is a size_t but code uses intptr_t.

Also the "new" and "much simpler" don't really add much to the comment (and will become out of date as soon as these have settled for a while).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// proto (descriptor.proto) so we don't worry about it.
const protobuf::Descriptor* max32 = nullptr;
const protobuf::Descriptor* max64 = nullptr;
for (auto message : this_file_messages) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto* in general the const and * tokens are very high value for readers

@haberman haberman merged commit e5c1583 into protocolbuffers:master Oct 6, 2021
copybara-service bot pushed a commit that referenced this pull request Aug 28, 2022
This function was introduced in #426 but it appears it was never used.  I am not sure what the purpose was, but in any case it is not needed.

With this function removed, we no longer need to tag pointers for the DefPool "files" table.

PiperOrigin-RevId: 470565730
copybara-service bot pushed a commit that referenced this pull request Aug 28, 2022
This function was introduced in #426 but it appears it was never used.  I am not sure what the purpose was, but in any case it is not needed.

With this function removed, we no longer need to tag pointers for the DefPool "files" table.

PiperOrigin-RevId: 470567000
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants