Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce symbol table API into pad #22850

Open
wants to merge 41 commits into
base: blead
Choose a base branch
from

Conversation

happy-barney
Copy link

Changes summary

For purpose of this changes let's distinguish between:

  • sigil - expression type representation in source code
  • symbol type - internal identification of symbol type

For implementation of sigil-less symbols we will still need their symbol type, eg:

  • attributes
  • named patterns
  • data contracts
  • ...

replace literals representing type of symbol stored in pad with explicit tokens

For example, '&' in code can mean also:

  • subroutine symbol type
  • subroutine sigil
  • code-block prototype
  • predefined variable name

This set of changes replaces its usage as symbol type with C token Perl_Symbol_Table_Code.

Extend padname API with "split" macros

Instead of relying on layout of string content explicit macros providing symbol type and symbol name are introduced, eg:

// current approach:
*PadnamePV (pn)
PadnamePV (pn)[0]

// new approach
Padname_Symbol_Table (pn)
  • change allows to change underlying data type and values (eg to index to registry array)
  • using explicit token expresses intention and makes it easier to search code base

Provide padname functions with split symbol type and symbol name parameters

Alternatives to existing functions but with two parameters.
Example:

// current
pad_add_name_pvs("$(self)", 0, NULL, NULL);

// new approach
pad_add_symbol_pvs (Perl_Symbol_Table_Scalar, "(self)", 0, NULL, NULL);

Possible follow-ups

  • replace more characters literals with macros, eg:
    • Perl_Var_Match
    • Perl_Prototype_Code
  • provide builtin symbol table registry
  • use this approach also with global symbols

Open questions

  • what (if anything) to put into perldelta ?
  • mark old functions as deprecated ?

Branislav Zahradník added 29 commits December 10, 2024 11:51
…fication

Using dedicated C tokens instead of literal values will improve readability
of source code by separating intention from other usages of literal characters:
- sigil (as language structure)
- predefined variable name
- prototype

It will also make it feasible to add functionality, like
- new sigil-less symbol tables (attributes, patterns, contracts, ...)
- add support for custom symbol tables
…m symbol table

Returns true value when provided `PADNAME *` represents symbol belonging
to provided symbol table.
By wrapping implementation into single intent expressing token
it can evolve without affecting its usage.
Using macro instead of direct access:
- identifies code where such value is used / required
- hides any future changes from code which uses it, eg
  - type change from 'char' to 'U32'
Returns name without symbol type so it will properly work
when we will change symbol table id type or when we will add new
symbol types.

For backward compatibility user of this macro can rely on -1 being
valid index.
stored in pad, excluding symbol table id (sigil)

eg:
- `$` => 0
- `$_` => 1
- `$self` => 4
into handy macros to avoid code duplication as well as
simplify future changes.
eg: scalar, array, and hash are "variable"
Move printf format and proper expansion of parameters,
into macros so when code evolves, it will be isolated
from usage.
extracting common part of condition and using `Perl_Symbol_Table_*` constants
…nts first

Common problem is when macro expands as multiple arguments.
Without intermediate step allowing to expand arguments before
macro call there will be an error due treating such expansion as
single argument.
@jkeenan
Copy link
Contributor

jkeenan commented Dec 10, 2024

What is the point of this very large change? AFAICT, it's not a bug fix or a response to an open issue. Does it respond to some previously expressed need?

@happy-barney
Copy link
Author

@jkeenan already discussed that on IRC, this PR is quite huge, it in fact combines parts of 3 changesets (in my eyes, maybe more by someone else)

I will extract first two changes into separated PRs, each announced first for discussion on mailing list.

@bulk88
Copy link
Contributor

bulk88 commented Dec 14, 2024

// current
pad_add_name_pvs("$(self)", 0, NULL, NULL);

// new approach
pad_add_symbol_pvs (Perl_Symbol_Table_Scalar, "(self)", 0, NULL, NULL);

If we are making brand new functions/APIs, YYYY_ANY_XXX_pvs("") should be taking C lit str "self" not "(self)". The paren chars shouldn't be parsed/stripped/copied/re-null-termed, at end user runtime, long long after CC time/XS/Core binaries created time. C lit str "self" in theory could be fed into printf() as an err message const char * arg. This style C lits seem unusable for XS/interp core core, croak_XXX("caller screwed up %s %s %s" ,"(self)","(record)","(update_type)");.

"foo" can also go through SV_COW/HEK(hash key cache/method cache)/C const RO Static/CC const string lit-dedup APIs easily.

"()" is unlikely to have any reuse in libperl or xs__.so or reuse in B::CC/B::CC-like "op structs to C" compilers.

Newxz(alloc2,
STRUCT_OFFSET(struct padname_with_str, xpadn_str[0]) + len + 1,
char);
alloc = (struct padname_with_str *)alloc2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Newxz() when 4 lines below Copy() will write over the null bytes? No need for interp backend or libc to do a redundant memset() .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is current implementation modified with treating symbol table and name as separated arguments.

@bulk88
Copy link
Contributor

bulk88 commented Dec 14, 2024

enum Perl_Symbol_Table {
    Perl_Symbol_Table_Array  = '@',
    Perl_Symbol_Table_Code   = '&',
    Perl_Symbol_Table_Hash   = '%',
    Perl_Symbol_Table_Scalar = '$',
};

&/@/%/$ are easier on the eyes, and seeing #2-#4 in C/XS code, instantly means the C statement is interacting with a PP-concept or the PP-type system. Tokens _Array and _Code don't visually separate/abstract anything. No idea if someone looks at the C statement/identifier _Array/_Code, is it "C in C" or "P5/PP in C" or "JSON in C"?

Perl C/XS uses AV/HV/CV tokens everywhere. Short to type. 2 char tokens won't instantly blow past 80 char EOL/justify/format. Short tokens are a historical nice feature of P5 C API. The very long "Perl_Symbol_Table_*" prefix feels like another prog lang/framework.

The proposed ident "Perl_Symbol_Table_Scalar" can only fit 3x before IDE EOL overflow @80ch.

Even tokens SVt_PVLV (SVt_NULL ????)/SVt_PVAV/SVt_PVHV/SVt_PVCV like in
GV *Perl_gv_add_by_type(pTHX_ GV *gv, svtype type); would be more readable and IDE grep-able than the proposed Enum tokens. PDNt_PVHV or PDNt_HV are possible spellings if SVt_PVHV is unusable. But SVt_PVHV and rest of SVt_* fit into a U8. Example of unusable, you want 0-based values for some optimization reason, but, SVt_PVHV is 13.

old

padname *
Perl_newPADNAMEpvn(
    const char *s,
    unsigned __int64 len
);

new

PADNAME *
Perl_new_padname_symbol_pvn(
    perl_symbol_table_id symbol_table,
    const char *         name,
    STRLEN               name_len
);

I really don't like that this code is adding a 3rd arg, when the old API was just 2. And this new API forever adds runtime overhead at all the callers vs the current API which has an overhead of 1 byte!!! for the sigil (embedded into C lit string).

BA 01 00 00 00          mov     edx, 1          ; len
48 8D 0D 58 84 1C 00    lea     rcx, asc_1802E0C04 ; "&"
48 8B F0                mov     rsi, rax
E8 EC 27 00 00          call    Perl_newPADNAMEpvn

On -O1 AMD64, function arg constant int 0x1 has 5 bytes of overhead. If "0x1" or "0x24" the ""argument"" gets embedded in the C lit, its overhead is 1 byte, not 5 bytes (separate C arg), assuming "$self" vs "self" didn't change the CC/linkers's dedup logic since the CPAN XS .so/.dll hypothetically has exactly one C reference to "$self" or "self". ARM/RISC will probably be 4 bytes instead of x64's 5 bytes.

A better way to do it, this example might have flaws, don't copy paste it without throughly thinking it over.

old

PADOFFSET
Perl_pad_add_name_pvn(pTHX_ const char *namepv, STRLEN namelen,
                U32 flags, HV *typestash, HV *ourstash)
{
    PADOFFSET offset;
    PADNAME *name;

    PERL_ARGS_ASSERT_PAD_ADD_NAME_PVN;

    if (flags & ~(padadd_OUR|padadd_STATE|padadd_NO_DUP_CHECK|padadd_FIELD))
        Perl_croak(aTHX_ "panic: pad_add_name_pvn illegal flag bits 0x%" UVxf,
                   (UV)flags);

    name = newPADNAMEpvn(namepv, namelen);

new/proposed

PERL_CALLCONV PADOFFSET
Perl_pad_add_symbol_pvn(pTHX_ perl_symbol_table_id symbol_table, const char *namepv, STRLEN namelen, U32 flags, HV *typestash, HV *ourstash);
#define PERL_ARGS_ASSERT_PAD_ADD_SYMBOL_PVN     \
        assert(namepv); assert(!typestash || SvTYPE(typestash) == SVt_PVHV); \
        assert(!ourstash || SvTYPE(ourstash) == SVt_PVHV)

I see U32 flags.

currrent blead, and meaning of U32 flags

#define padadd_STALEOK		0x08	   /* allow stale lexical in active
                                            * sub, but only one level up */
#define padadd_FIELD            0x10       /* set PADNAMEt_FIELD */
#define padfind_FIELD_OK        0x20       /* pad_findlex is permitted to see fields */

bits 0xF0 or 0xE0 are unassigned. 2 bits can be taken to encode $%&@. Alot better than 5 bytes at every caller.

Adding 4 functions Perl_PDNM_add_AV() Perl_PDNM_add_CV() Perl_PDNM_add_HV() Perl_PDNM_add_SV(), which encode the sigil into the C func ptr, is another way to do it. Attention must be paid, so that the 4 are truly extern exported symbols and don't get inlined by the CC, voided the point of doing it Perl_PDNM_add_AV() style.

This branch would be better off as a "xs_english.h" with "#define PERL_XS_ENGLISH" vs the ABI changes/runtime bloat changes done, to "improve" code readability, when I find its less readable b/c of very long tokens, and bluring what C statements/sub expressions are "C as C" vs "P5 as C".

/* CAT2:
 *	This macro concatenates 2 tokens together.
 */
/* STRINGIFY:
 *	This macro surrounds its token with double quotes.
 */
#if 42 == 1
#define CAT2(a,b)	a/**/b
#define STRINGIFY(a)	"a"
#endif
#if 42 == 42
#define PeRl_CaTiFy(a, b)	a ## b
#define PeRl_StGiFy(a)	#a
#define CAT2(a,b)	PeRl_CaTiFy(a,b)
#define StGiFy(a)	PeRl_StGiFy(a)
#define STRINGIFY(a)	PeRl_StGiFy(a)
#endif
#if 42 != 1 && 42 != 42
#include "Bletch: How does this C preprocessor concatenate tokens?"
#endif

If someone thinks "$"/"&"/"@"/"%" are unreadable as "Perl in C", CAT2() can be used to glue #define Perl_Symbol_Table_Array "@" onto the PP padname identifier (C str lit) at interp CC time, not end user interp runtime. No ABI overhead since after the C compiler the ABI is the same (C lit strs with sigils inlined/prefixed).

Even though I offer improvements/suggestions above, my suggestions are generic talking or abstract talking. The proposed padname API has too many flaws on ASCII src code level (future hacking/maint/reading), and ABI (after C compiler) level. I don't think its fixable.

@happy-barney
Copy link
Author

@bulk88
re (self) see original code, it was $(self) ... my change here only separates symbol type from symbol name

re constants for names:
let me answer with examples.
currently, running ack --noheading "case '\\$'" returns lines:

toke.c:4638:          case '$':
toke.c:5280:    case '$':
toke.c:5375:        case '$': TOKEN (PERLY_DOLLAR);
toke.c:9440:    case '$':

class.c:732:                case '$':
class.c:1029:            case '$': optype = OP_PADSV; break;
class.c:1138:        case '$':

regcomp.c:3646:            case '$':           /* (?$...) */
regcomp.c:5622:    case '$':
regcomp.c:6374:                case '$':

mg.c:1231:    case '$': /* $$ */
mg.c:3516:    case '$': /* $$ */

ext/XS-APItest/APItest.xs:929:            case '$': {

ext/XS-APItest/APItest.c:938:            case '$': {

ext/re/re_comp.c:3646:            case '$':           /* (?$...) */
ext/re/re_comp.c:5622:    case '$':
ext/re/re_comp.c:6374:                case '$':

op.c:14490:            case '$':
op.c:14584:                    case '$':

builtin.c:345:            case '$':

gv.c:2428:        case '$':		/* $$ */

vms/vms.c:8761:    case '$':

with my change (fully extended, will be another PR), some these lines will look like (rest I didn't identify properly yet):

class.c:732:                case Perl_Symbol_Type:
class.c:1029:            case Perl_Symbol_Type: optype = OP_PADSV; break;
class.c:1138:        case Perl_Symbol_Type:

mg.c:1231:    case Perl_Var_Process_ID: /* $$ */
mg.c:3516:    case Perl_Var_Process_ID: /* $$ */

op.c:14490:            case Perl_Prototype_Scalar:
op.c:14584:                    case Perl_Prototype_Scalar:

builtin.c:345:            case Perl_Sigil_Scalar:

gv.c:2428:        case Perl_Var_Process_ID:		/* $$ */

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants