Introduce symbol table API into pad #22850

happy-barney · 2024-12-09T09:45:46Z

Changes summary

For purpose of this changes let's distinguish between:

sigil - expression type representation in source code
symbol type - internal identification of symbol type

For implementation of sigil-less symbols we will still need their symbol type, eg:

attributes
named patterns
data contracts
...

replace literals representing type of symbol stored in pad with explicit tokens

For example, '&' in code can mean also:

subroutine symbol type
subroutine sigil
code-block prototype
predefined variable name

This set of changes replaces its usage as symbol type with C token Perl_Symbol_Table_Code.

Extend padname API with "split" macros

Instead of relying on layout of string content explicit macros providing symbol type and symbol name are introduced, eg:

// current approach:
*PadnamePV (pn)
PadnamePV (pn)[0]

// new approach
Padname_Symbol_Table (pn)

change allows to change underlying data type and values (eg to index to registry array)
using explicit token expresses intention and makes it easier to search code base

Provide padname functions with split symbol type and symbol name parameters

Alternatives to existing functions but with two parameters.
Example:

// current
pad_add_name_pvs("$(self)", 0, NULL, NULL);

// new approach
pad_add_symbol_pvs (Perl_Symbol_Table_Scalar, "(self)", 0, NULL, NULL);

Possible follow-ups

replace more characters literals with macros, eg:
- Perl_Var_Match
- Perl_Prototype_Code
provide builtin symbol table registry
use this approach also with global symbols

Open questions

what (if anything) to put into perldelta ?
mark old functions as deprecated ?

…fication Using dedicated C tokens instead of literal values will improve readability of source code by separating intention from other usages of literal characters: - sigil (as language structure) - predefined variable name - prototype It will also make it feasible to add functionality, like - new sigil-less symbol tables (attributes, patterns, contracts, ...) - add support for custom symbol tables

…m symbol table Returns true value when provided `PADNAME *` represents symbol belonging to provided symbol table.

By wrapping implementation into single intent expressing token it can evolve without affecting its usage.

Using macro instead of direct access: - identifies code where such value is used / required - hides any future changes from code which uses it, eg - type change from 'char' to 'U32'

Returns name without symbol type so it will properly work when we will change symbol table id type or when we will add new symbol types. For backward compatibility user of this macro can rely on -1 being valid index.

stored in pad, excluding symbol table id (sigil) eg: - `$` => 0 - `$_` => 1 - `$self` => 4

…ous symbol

…tifier

into handy macros to avoid code duplication as well as simplify future changes.

eg: scalar, array, and hash are "variable"

Move printf format and proper expansion of parameters, into macros so when code evolves, it will be isolated from usage.

…sigil

extracting common part of condition and using `Perl_Symbol_Table_*` constants

…t symbol table

… symbol table

…nts first Common problem is when macro expands as multiple arguments. Without intermediate step allowing to expand arguments before macro call there will be an error due treating such expansion as single argument.

… symbol table

…ymbol table

to improve readability of what is actually tested

There were two instances of same condition. From code looking like 3-state dispatch I guess it was meant to be `1`

…tested

…it symbol table

… symbol table

…it symbol table

… symbol table

jkeenan · 2024-12-10T12:52:57Z

What is the point of this very large change? AFAICT, it's not a bug fix or a response to an open issue. Does it respond to some previously expressed need?

happy-barney · 2024-12-10T13:03:43Z

@jkeenan already discussed that on IRC, this PR is quite huge, it in fact combines parts of 3 changesets (in my eyes, maybe more by someone else)

I will extract first two changes into separated PRs, each announced first for discussion on mailing list.

bulk88 · 2024-12-14T14:07:20Z

// current
pad_add_name_pvs("$(self)", 0, NULL, NULL);

// new approach
pad_add_symbol_pvs (Perl_Symbol_Table_Scalar, "(self)", 0, NULL, NULL);

If we are making brand new functions/APIs, YYYY_ANY_XXX_pvs("") should be taking C lit str "self" not "(self)". The paren chars shouldn't be parsed/stripped/copied/re-null-termed, at end user runtime, long long after CC time/XS/Core binaries created time. C lit str "self" in theory could be fed into printf() as an err message const char * arg. This style C lits seem unusable for XS/interp core core, croak_XXX("caller screwed up %s %s %s" ,"(self)","(record)","(update_type)");.

"foo" can also go through SV_COW/HEK(hash key cache/method cache)/C const RO Static/CC const string lit-dedup APIs easily.

"()" is unlikely to have any reuse in libperl or xs__.so or reuse in B::CC/B::CC-like "op structs to C" compilers.

bulk88 · 2024-12-14T15:33:45Z

pad.c

+    Newxz(alloc2,
+          STRUCT_OFFSET(struct padname_with_str, xpadn_str[0]) + len + 1,
+          char);
+    alloc = (struct padname_with_str *)alloc2;


Why Newxz() when 4 lines below Copy() will write over the null bytes? No need for interp backend or libc to do a redundant memset() .

it is current implementation modified with treating symbol table and name as separated arguments.

bulk88 · 2024-12-14T15:51:13Z

enum Perl_Symbol_Table {
    Perl_Symbol_Table_Array  = '@',
    Perl_Symbol_Table_Code   = '&',
    Perl_Symbol_Table_Hash   = '%',
    Perl_Symbol_Table_Scalar = '$',
};

&/@/%/$ are easier on the eyes, and seeing #2-#4 in C/XS code, instantly means the C statement is interacting with a PP-concept or the PP-type system. Tokens _Array and _Code don't visually separate/abstract anything. No idea if someone looks at the C statement/identifier _Array/_Code, is it "C in C" or "P5/PP in C" or "JSON in C"?

Perl C/XS uses AV/HV/CV tokens everywhere. Short to type. 2 char tokens won't instantly blow past 80 char EOL/justify/format. Short tokens are a historical nice feature of P5 C API. The very long "Perl_Symbol_Table_*" prefix feels like another prog lang/framework.

The proposed ident "Perl_Symbol_Table_Scalar" can only fit 3x before IDE EOL overflow @80ch.

Even tokens SVt_PVLV (SVt_NULL ????)/SVt_PVAV/SVt_PVHV/SVt_PVCV like in
GV *Perl_gv_add_by_type(pTHX_ GV *gv, svtype type); would be more readable and IDE grep-able than the proposed Enum tokens. PDNt_PVHV or PDNt_HV are possible spellings if SVt_PVHV is unusable. But SVt_PVHV and rest of SVt_* fit into a U8. Example of unusable, you want 0-based values for some optimization reason, but, SVt_PVHV is 13.

old

padname *
Perl_newPADNAMEpvn(
    const char *s,
    unsigned __int64 len
);

new

PADNAME *
Perl_new_padname_symbol_pvn(
    perl_symbol_table_id symbol_table,
    const char *         name,
    STRLEN               name_len
);

I really don't like that this code is adding a 3rd arg, when the old API was just 2. And this new API forever adds runtime overhead at all the callers vs the current API which has an overhead of 1 byte!!! for the sigil (embedded into C lit string).

BA 01 00 00 00          mov     edx, 1          ; len
48 8D 0D 58 84 1C 00    lea     rcx, asc_1802E0C04 ; "&"
48 8B F0                mov     rsi, rax
E8 EC 27 00 00          call    Perl_newPADNAMEpvn

On -O1 AMD64, function arg constant int 0x1 has 5 bytes of overhead. If "0x1" or "0x24" the ""argument"" gets embedded in the C lit, its overhead is 1 byte, not 5 bytes (separate C arg), assuming "$self" vs "self" didn't change the CC/linkers's dedup logic since the CPAN XS .so/.dll hypothetically has exactly one C reference to "$self" or "self". ARM/RISC will probably be 4 bytes instead of x64's 5 bytes.

A better way to do it, this example might have flaws, don't copy paste it without throughly thinking it over.

old

PADOFFSET
Perl_pad_add_name_pvn(pTHX_ const char *namepv, STRLEN namelen,
                U32 flags, HV *typestash, HV *ourstash)
{
    PADOFFSET offset;
    PADNAME *name;

    PERL_ARGS_ASSERT_PAD_ADD_NAME_PVN;

    if (flags & ~(padadd_OUR|padadd_STATE|padadd_NO_DUP_CHECK|padadd_FIELD))
        Perl_croak(aTHX_ "panic: pad_add_name_pvn illegal flag bits 0x%" UVxf,
                   (UV)flags);

    name = newPADNAMEpvn(namepv, namelen);

new/proposed

PERL_CALLCONV PADOFFSET
Perl_pad_add_symbol_pvn(pTHX_ perl_symbol_table_id symbol_table, const char *namepv, STRLEN namelen, U32 flags, HV *typestash, HV *ourstash);
#define PERL_ARGS_ASSERT_PAD_ADD_SYMBOL_PVN     \
        assert(namepv); assert(!typestash || SvTYPE(typestash) == SVt_PVHV); \
        assert(!ourstash || SvTYPE(ourstash) == SVt_PVHV)

I see U32 flags.

currrent blead, and meaning of U32 flags

#define padadd_STALEOK		0x08	   /* allow stale lexical in active
                                            * sub, but only one level up */
#define padadd_FIELD            0x10       /* set PADNAMEt_FIELD */
#define padfind_FIELD_OK        0x20       /* pad_findlex is permitted to see fields */

bits 0xF0 or 0xE0 are unassigned. 2 bits can be taken to encode $%&@. Alot better than 5 bytes at every caller.

Adding 4 functions Perl_PDNM_add_AV() Perl_PDNM_add_CV() Perl_PDNM_add_HV() Perl_PDNM_add_SV(), which encode the sigil into the C func ptr, is another way to do it. Attention must be paid, so that the 4 are truly extern exported symbols and don't get inlined by the CC, voided the point of doing it Perl_PDNM_add_AV() style.

This branch would be better off as a "xs_english.h" with "#define PERL_XS_ENGLISH" vs the ABI changes/runtime bloat changes done, to "improve" code readability, when I find its less readable b/c of very long tokens, and bluring what C statements/sub expressions are "C as C" vs "P5 as C".

/* CAT2:
 *	This macro concatenates 2 tokens together.
 */
/* STRINGIFY:
 *	This macro surrounds its token with double quotes.
 */
#if 42 == 1
#define CAT2(a,b)	a/**/b
#define STRINGIFY(a)	"a"
#endif
#if 42 == 42
#define PeRl_CaTiFy(a, b)	a ## b
#define PeRl_StGiFy(a)	#a
#define CAT2(a,b)	PeRl_CaTiFy(a,b)
#define StGiFy(a)	PeRl_StGiFy(a)
#define STRINGIFY(a)	PeRl_StGiFy(a)
#endif
#if 42 != 1 && 42 != 42
#include "Bletch: How does this C preprocessor concatenate tokens?"
#endif

If someone thinks "$"/"&"/"@"/"%" are unreadable as "Perl in C", CAT2() can be used to glue #define Perl_Symbol_Table_Array "@" onto the PP padname identifier (C str lit) at interp CC time, not end user interp runtime. No ABI overhead since after the C compiler the ABI is the same (C lit strs with sigils inlined/prefixed).

Even though I offer improvements/suggestions above, my suggestions are generic talking or abstract talking. The proposed padname API has too many flaws on ASCII src code level (future hacking/maint/reading), and ABI (after C compiler) level. I don't think its fixable.

happy-barney · 2024-12-15T15:10:48Z

@bulk88
re (self) see original code, it was $(self) ... my change here only separates symbol type from symbol name

re constants for names:
let me answer with examples.
currently, running ack --noheading "case '\\$'" returns lines:

toke.c:4638:          case '$':
toke.c:5280:    case '$':
toke.c:5375:        case '$': TOKEN (PERLY_DOLLAR);
toke.c:9440:    case '$':

class.c:732:                case '$':
class.c:1029:            case '$': optype = OP_PADSV; break;
class.c:1138:        case '$':

regcomp.c:3646:            case '$':           /* (?$...) */
regcomp.c:5622:    case '$':
regcomp.c:6374:                case '$':

mg.c:1231:    case '$': /* $$ */
mg.c:3516:    case '$': /* $$ */

ext/XS-APItest/APItest.xs:929:            case '$': {

ext/XS-APItest/APItest.c:938:            case '$': {

ext/re/re_comp.c:3646:            case '$':           /* (?$...) */
ext/re/re_comp.c:5622:    case '$':
ext/re/re_comp.c:6374:                case '$':

op.c:14490:            case '$':
op.c:14584:                    case '$':

builtin.c:345:            case '$':

gv.c:2428:        case '$':		/* $$ */

vms/vms.c:8761:    case '$':

with my change (fully extended, will be another PR), some these lines will look like (rest I didn't identify properly yet):

class.c:732:                case Perl_Symbol_Type:
class.c:1029:            case Perl_Symbol_Type: optype = OP_PADSV; break;
class.c:1138:        case Perl_Symbol_Type:

mg.c:1231:    case Perl_Var_Process_ID: /* $$ */
mg.c:3516:    case Perl_Var_Process_ID: /* $$ */

op.c:14490:            case Perl_Prototype_Scalar:
op.c:14584:                    case Perl_Prototype_Scalar:

builtin.c:345:            case Perl_Sigil_Scalar:

gv.c:2428:        case Perl_Var_Process_ID:		/* $$ */

happy-barney force-pushed the hpb/symbol-table branch from 51e4140 to 9719855 Compare December 9, 2024 23:05

Branislav Zahradník added 29 commits December 10, 2024 11:51

Clean trailing whitespaces

0cb83e4

[pad] Perl_Symbol_Table_Scalar - identify usage of '$' as symbol type

b45a669

[pad] Perl_Symbol_Table_Code - identify usage of '&' as symbol table

80a78c2

[pad] Perl_Symbol_Table_Array - identify usage of '@' as symbol type

f26f41d

[pad] Perl_Symbol_Table_Hash - identify usage of '%' as symbol type

2ae5f3a

[pad] Padname_Is_Symbol_Table - whether PADNAME represents symbol fro…

ef9707b

…m symbol table Returns true value when provided `PADNAME *` represents symbol belonging to provided symbol table.

[pad] Padname_Is_Symbol_Table_Scalar - use macro

cabc797

By wrapping implementation into single intent expressing token it can evolve without affecting its usage.

[pad] Padname_Is_Symbol_Table_Code - use macro

784b654

[pad] Padname_Is_Symbol - factor out duplicated predicate

581c4fa

[pad] Padname_Symbol_Table - extract symbol table id

675f75f

Using macro instead of direct access: - identifies code where such value is used / required - hides any future changes from code which uses it, eg - type change from 'char' to 'U32'

[pad] Padname_Symbol_Name - extract symbol name

cda643a

Returns name without symbol type so it will properly work when we will change symbol table id type or when we will add new symbol types. For backward compatibility user of this macro can rely on -1 being valid index.

[pad] Padname_Symbol_Name_Length - length of symbol name

113c640

stored in pad, excluding symbol table id (sigil) eg: - `$` => 0 - `$_` => 1 - `$self` => 4

[pad] Padname_Symbol_Is_Anonymous - whether PADNAME represents anonym…

0db3048

…ous symbol

[pad] Perl_Sigil_To_Symbol_Table - convert sigil to symbol table iden…

f117281

…tifier

[pad] Perl_Symbol_Table_Title_lc/ucfirst - factor out symbol table title

8af996b

into handy macros to avoid code duplication as well as simplify future changes.

[pad] Padname_Symbol_Table_Title_* - factor out symbol table title

a09287f

eg: scalar, array, and hash are "variable"

[pad] Padname_Symbol_Printf_Format and Params

36d413a

Move printf format and proper expansion of parameters, into macros so when code evolves, it will be isolated from usage.

[pad] Padname_Symbol_Printf_Format/Params - use dedicated format for …

5bbe791

…sigil

[pad] refactor if/else sequences into switch/case (symbol type)

fc1e956

extracting common part of condition and using `Perl_Symbol_Table_*` constants

[pad] perl_symbol_table_id - type alias of symbol table id

3a2302c

[pad] new_padname_symbol_pvn - newPADNAMEpvn alternative with explici…

8b67580

…t symbol table

[pad] new_padname_symbol_pvn - replace usage of newPADNAMEpvn

e6001bd

[pad] pad_add_symbol_pvn - pad_add_name_pvn alternative with explicit…

7a98c55

… symbol table

[pad] pad_add_symbol_pvn - replace usage of pad_add_name_pvn

1b237ce

[pad] pad_add_symbol_pvs - pad_add_name_pvs alternative with explicit…

b3e2973

… symbol table

[pad] pad_add_symbol_pvs - replace usage of pad_add_name_pvs

213328d

[pad] pad_add_symbol_pv - pad_add_name_pv alternative with explicit s…

c51ac70

…ymbol table

Branislav Zahradník added 12 commits December 10, 2024 11:51

[pad] pad_add_symbol_sv - pad_add_name_sv alternative with explicit s…

c37bc9c

…ymbol table

[pad] pad_add_symbol_sv - replace usage of pad_add_name_sv

f57c569

[pad] pad_findlex - make function work with explicit symbol table

cdb5627

[XS-APItest] [pad_scalar] Export tested function id as constant

0d38172

to improve readability of what is actually tested

[XS-APItest] [pad_scalar] Possible bug fix

db133fc

There were two instances of same condition. From code looking like 3-state dispatch I guess it was meant to be `1`

[XS-APItest] [pad_scalar] Add few assert messages to express what is …

9520792

…tested

[pad] pad_find_my_symbol_pvn - pad_findmy_pvn alternative with explic…

328acf2

…it symbol table

[pad] pad_find_my_symbol_pvn - replace usage of pad_findmy_pvn

543056c

[pad] pad_find_my_symbol_pv - pad_findmy_pv alternative with explicit…

04606c2

… symbol table

[pad] pad_find_my_symbol_pvs - pad_findmy_pvs alternative with explic…

e41bdd0

…it symbol table

[pad] pad_find_my_symbol_sv - pad_findmy_sv alternative with explicit…

8dc2fc1

… symbol table

[pad] pad_find_my_symbol_sv - replace usage of pad_findmy_sv

7f7f7c7

happy-barney force-pushed the hpb/symbol-table branch from 9719855 to 7f7f7c7 Compare December 10, 2024 10:52

bulk88 reviewed Dec 14, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce symbol table API into pad #22850

Introduce symbol table API into pad #22850

happy-barney commented Dec 9, 2024

jkeenan commented Dec 10, 2024

happy-barney commented Dec 10, 2024

bulk88 commented Dec 14, 2024

bulk88 Dec 14, 2024

happy-barney Dec 15, 2024

bulk88 commented Dec 14, 2024

happy-barney commented Dec 15, 2024

Introduce symbol table API into pad #22850

Are you sure you want to change the base?

Introduce symbol table API into pad #22850

Conversation

happy-barney commented Dec 9, 2024

Changes summary

replace literals representing type of symbol stored in pad with explicit tokens

Extend padname API with "split" macros

Provide padname functions with split symbol type and symbol name parameters

Possible follow-ups

Open questions

jkeenan commented Dec 10, 2024

happy-barney commented Dec 10, 2024

bulk88 commented Dec 14, 2024

bulk88 Dec 14, 2024

Choose a reason for hiding this comment

happy-barney Dec 15, 2024

Choose a reason for hiding this comment

bulk88 commented Dec 14, 2024

happy-barney commented Dec 15, 2024