Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature proposal: Provide "extended" package matching support #662

Closed
Pfoerd opened this issue Sep 2, 2021 · 11 comments · Fixed by #896
Closed

Feature proposal: Provide "extended" package matching support #662

Pfoerd opened this issue Sep 2, 2021 · 11 comments · Fixed by #896

Comments

@Pfoerd
Copy link
Contributor

Pfoerd commented Sep 2, 2021

Story

As an ArchUnit test developer, I would like to be able to use an extended syntax for package identifiers, since the existing one based on PackageMatcher is very limited. At least, I would like to be able to define alternatives such as (a|b). This allows to support architectures that for some reasons need a namespace for some components by using different packages even though they logically belong to the same architectural entity like a layer, domain, etc. This is especially useful if your code needs to satisfy some "overarching" structural requirements that come from outside your actual project, e.g. from an enterprise architecture.

Sample Use Case

You have a structure org.my.app.subdomain1 and org.my.app.subdomain2 and want to logically place them in the same top-level domain domain1 in your ArchUnit tests, but for certain reasons you are not allowed to introduce another package level.

Known Alternatives:

Most ArchUnit APIs we use offer an alternative to the "straightforward" variant of just providing a packageIdentifier String e.g. by using custom Predicates oder SliceAssignments. But this makes the tests very verbose, complex and hard to understand.

API proposals:

  • Extend the existing PackageMatcher in a backward compatible way
  • Alternatively a new "PackgeMatcherExt" (name WIP 😉) could be introduced, together with corresponding SliceAssignements and Predicates. The advantage of sending something like this upstream is that it could share the basic matching algorithms for the existing syntax with the existing PackageMatcher. This is currently not possible due to its current design (final monolithic class without interfaces). In addition, special API extensions could be introduced for using the "extended" package identifier syntax, e.g. something like layer("domain").definedByExtendedPkgId("org.my.app.(subdomain1|2)..") (name WIP 😉).

Further ideas

Further "nice-to-haves" that could be achieved by such a solutions are non-capturing group alternatives like (?:a|b) and a "not" operator like (!a|b).(?:!a)

If you're interested, let me know, because we have an existing solution and would be happy to share the relevant parts with you and send it upstream.

P.S.: I really appreciate your work on ArchUnit, it's a really cool tool 😃!

@codecholeric
Copy link
Collaborator

Hey, sorry that I'm only getting back here now, task pipeline congestion 😉 I agree, that this is very similar to #685, and could probably be tackled together.
I'm just pondering what the best way is, as this can of course get nicely complex 😂
We could add a|b, but then the case described in #685 would be something like ..(application|domain.api).(*).. and we would actually need non-capturing groups, because the first one must be ignored to only consider the wildcard capture 🤔
Also it seems there are two use cases in here, one is simple package matching (classes in package matching) and one is grouping (assign classes to capturing groups of package pattern).

One thing I'm starting to wonder is, if there wouldn't be a way simpler and way more powerful alternative: If the package matcher syntax is not good enough for your case, you can use

PackageMatcher.of("regex|.*\\.foo?\\..*")

(or any other identifier prefix to switch the mode). And there you can supply any regex you want. You don't have the .. feature out of the box then, but on the other hand any fancy feature of regex is already supported 🤔
And it should be easy to add because at the moment the simplified package identifier ..foo.. is converted to a regex internally anyway...
What would you think of that? You could then do things like

classes().that().resideInAPackage("regex|.*\\.(?:foo|bar\\.baz)\\.(subdomain1|subdomain2)\\..*")

out of the box. One advantage would be a lot of power while it is very cheap to implement 🤔 And maybe readability is in comparison not that much worse than extending the existing package matcher with features that look very closely to regex anyway 🤔

I'm still not sure, if this is good enough to solve problems like #685 across all APIs for all sorts of users though 😂 Might still be necessary to support multiple patterns in the slices API 🤷‍♂️

@hankem
Copy link
Member

hankem commented Oct 24, 2021

I like the idea! 💙 Just one thought:

PackageMatcher.of("regex|.*\\.foo?\\..*")
(or any other identifier prefix to switch the mode)

Maybe an explicit method like PackageMatcher.ofRegEx(".*\\.foo?\\..*")? 😉

classes().that().resideInAPackage("regex|.*\\.(?:foo|bar\\.baz)\\.(subdomain1|subdomain2)\\..*")

I could also imagine to have an additional power-user API ClassesThat.resideInAPackage(PackageMatcher) (similarly to other pairs of methods, where the simple API e.g. takes a String and the power-user API e.g. a DescribedPredicate). Then you wouldn't have to introduce a mode switching prefix, which might also be confusing to users who aren't aware of this power-user feature. (They say: "Explicit is better than implicit"... 😜)

@codecholeric
Copy link
Collaborator

Yes, I agree about the explicit part, I was mainly thinking how it could be implemented with as little changes as possible 😂 And explicit means we have to extend every method that takes package identifier strings for a second version 😉 (while implicitly I know that under the hood it always ends up in PackageMatcher.of(..) at the moment, no matter the entry point)
But probably it's better to invest the work to obtain a more guiding API instead of taking the shortcut 👍

@codecholeric
Copy link
Collaborator

PS: My first idea was PackageMatcher.fromRegex(..) 😂 But I let it go, because I thought that it requires changes of all package API methods 🤷‍♂️

@Pfoerd
Copy link
Contributor Author

Pfoerd commented Nov 9, 2021

I really like your approaches. I wonder if there is a way we can bring them all together, because I think all approaches have their pros and cons:

  • add minor syntax extensions to support, for example, simple alternatives like en.foo.(a|b) and perhaps en.(foo.a|bar). Since support for the latter can be considered a compatible extension to the former, it could be added incrementally
  • Add API support to provide a full regex-based PackageMatcher. For this case, I would also prefer an explicit solution by adding a corresponding foo(packageMatcher : PackageMatcher) function to all APIs that get a string-based package identifier today.

@codecholeric
Copy link
Collaborator

Sounds good! Would you want to create a PR for those minor extensions to start with? I would really only support the or-case via pipe then inside of brackets. Probably supporting dots inside of these alternatives does not add much complexity 🙂

@Pfoerd
Copy link
Contributor Author

Pfoerd commented Dec 6, 2021

Okay i will provide a PR for the minor extensions! Just to make sure that I'm on the right track: after some prototyping I ended up just adding some replacements in the convertToRegex method:

private String convertToRegex(String packageIdentifier) {
    return packageIdentifier.
            replaceAll("(\\([\\w.*]*)(?:\\|)([\\w.*]*\\))", "$1%OR%$2"). // replacing all capturing altivernatives '(a|b)' with '(a%OR%b)'
            replaceAll("([\\w*]*)(?:\\|)([\\w*]*)", "(?:$1|$2)"). // replacing all non-capturing alternatives 'a|b' with '(?:a|b)'
            replace("%OR%", "|"). // replacing all '(a%OR%b)' with '(a|b)'
            replace(TWO_STAR_CAPTURE_LITERAL, TWO_STAR_REGEX_MARKER).
            replace("*", "\\w+").
            replace(".", "\\.").
            replace(TWO_STAR_REGEX_MARKER, TWO_STAR_CAPTURE_REGEX).
            replace("\\.\\.", TWO_DOTS_REGEX);
}

results:

"..a|b.pk*..        , some.a.pkg.whatever            , true",
"..c|b.pk*..        , some.a.pkg.whatever            , false",
"..a|b*.pk*..       , some.bitrary.pkg.whatever      , true",
"..a|b*.pk*..       , some.a.pkg.whatever            , true",
"..a|b*.pk*..       , some.arbitrary.pkg.whatever    , false",
"..*c|d*.pk*..      , some.arbitrary.pkg.whatever    , false",

"..(a|b).pk*.(c|d).. , some.a.pkg.d.whatever , a:d",
"..a|b.pk*.(c|d).. , some.a.pkg.d.whatever , d",
"..(a|b*).pk*.(**).end , some.bitrary.pkg.in.between.end , bitrary:in.between",
"..(a.b|c.d).pk* , some.a.b.pkg , a.b",

Some basic tests confirm that this should do the job. Do you think a refactored variant of this should match the requirements?

EDIT: I made a very small change that enables dots inside a capturing group alternative. If we want to support non-capturing alternatives with dots inside imho we would have to introduce a new control-char to mark the scope of a non-capturing-alternative e.g. foo.[a.b|c.d].bar. What do you think?

@codecholeric
Copy link
Collaborator

What we should definitely do is double check if we would solve #685 (comment) with this 🤔
I guess in this case it would then be (?:..application.(*).*..)|(?:..domain.api|logic.(*)..)? Because if you want to demand one subpackage of application you can't pull it together, no? What I'm wondering then is, will users really understand this? Except for 0.01% power users that dream of regex at night? 🤔 Or am I missing something?
Of course it could also be that we should just support both, multiple patterns in the Slices API and the extended syntax 🤷‍♂️ Then this wouldn't need to easily solve #685

@Pfoerd
Copy link
Contributor Author

Pfoerd commented Dec 8, 2021

You are right, i think if it should ensure subpackages for application and infrastructure but not for domain the expression is getting very complex in any case.

That's why i suggested to support both: minor syntax extensions for simple alternatives (like my issue here) and API support to provide a full regex-based PackageMatcher (this could be used for the usecase of #685 ).

EDIT: maybe a good limitation could be to not allow "nested" alternations to keep the syntax of package identifiers simple and readable? If one needs complex matching like nested alternations he would have to use the full regex-based approach (e.g. PackageMatcher.fromRegex("..."))

Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Mar 1, 2022
Issue: TNG#662
Signed-off-by: e.solutions <info@esolutions.de>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
@Pfoerd
Copy link
Contributor Author

Pfoerd commented Mar 1, 2022

@codecholeric maybe you can check the commit I just pushed, it adds the syntax support for simple alternatives:

  • pack.[a.c|b*].d matches pack.a.c.d or pack.bar.d, but neither pack.a.d nor pack.b.c.d
  • ..service.(a|b*).. matches a.service.bar.more and group 1 would be bar
  • Nested alternatives like (?:..application.(*).*..)|(?:..domain.api|logic.(*)..) are not allowed

This should be sufficient for the use cases mentioned in this issue. If you like it I can create a pull request.

Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Mar 1, 2022
Issue: TNG#662
Signed-off-by: e.solutions <info@esolutions.de>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
@Pfoerd
Copy link
Contributor Author

Pfoerd commented May 24, 2022

@codecholeric, do you have time for some feedback on the suggested solution? Thanks!

Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Jun 22, 2022
Issue: TNG#662
Signed-off-by: e.solutions <info@esolutions.de>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Jun 22, 2022
Issue: TNG#662
Signed-off-by: e.solutions <info@esolutions.de>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Jun 22, 2022
Issue: TNG#662
Signed-off-by: e.solutions <17569373+Pfoerd@users.noreply.github.com>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
Pfoerd added a commit to Pfoerd/ArchUnit that referenced this issue Jul 8, 2022
Issue: TNG#662
Signed-off-by: e.solutions <17569373+Pfoerd@users.noreply.github.com>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
codecholeric pushed a commit to Pfoerd/ArchUnit that referenced this issue Jul 11, 2022
Issue: TNG#662
Signed-off-by: e.solutions <17569373+Pfoerd@users.noreply.github.com>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
codecholeric pushed a commit to Pfoerd/ArchUnit that referenced this issue Jul 11, 2022
This change will allow defining alternations within `()` and `[]`. I.e. it is now possible to match both `some.a.pkg` and `some.b.pkg` with a single pattern `some.[a|b].pkg` or capture it via `some.(a|b).pkg`. The demand for this comes from users dealing with legacy package structures, or where the package structure is dictated from the outside (e.g. via enterprise architecture). They might sometimes need to "collect the slice" from different parts of the package tree, e.g. `..[application|domain].(*)`. While this is already possible using the `SliceAssignment` API this is not very straight forward and also demands quite some boilerplate.

Issue: TNG#662
Signed-off-by: e.solutions <17569373+Pfoerd@users.noreply.github.com>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
Signed-off-by: Peter Gafert <peter.gafert@tngtech.com>
codecholeric pushed a commit to Pfoerd/ArchUnit that referenced this issue Jul 11, 2022
This change will allow defining alternations within `()` and `[]`. I.e. it is now possible to match both `some.a.pkg` and `some.b.pkg` with a single pattern `some.[a|b].pkg` or capture it via `some.(a|b).pkg`. The demand for this comes from users dealing with legacy package structures, or where the package structure is dictated from the outside (e.g. via enterprise architecture). They might sometimes need to create the slices from different parts of the package tree, e.g. `..[application|domain].(*)`. While this is already possible using the `SliceAssignment` API this is not very straight forward and also demands quite some boilerplate.

We apply the following constraints to the new API to keep it simple to implement and easy and consistent for the user:
1) No nested groups via `[]` or `()`
  1.1) `()` within `()` would just be confusing and there does not seem to be any valid use case for it
  1.2) `()` within `[]` would be ambiguous and hard to implement (e.g. within `[first.(*)|second.(*)]` the second `(*)` would be capturing group 2 within the regex, even though group 1 and 2 are mutually exclusive)
  1.3) `[]` within `[]` would make the implementation a lot more complex, since we would now need to match the correct brackets of nested groups when creating the regex
  1.4) `[]` within `()` would be easy to implement, but make the API inconsistent ("why can I nest it like this, but not the other way around?")
2) No toplevel alternations. This goes together with 1), since it would make the implementation more complex to prevent nested capturing groups

Issue: TNG#662
Signed-off-by: e.solutions <17569373+Pfoerd@users.noreply.github.com>
on-behalf-of: @e-solutions-GmbH <info@esolutions.de>
Signed-off-by: Peter Gafert <peter.gafert@tngtech.com>
codecholeric added a commit that referenced this issue Jul 12, 2022
Adds the ability to define alternatives such as `(a|b)` in package identifiers. This allows to support architectures that for some reasons need a namespace for some components by using different packages even though they logically belong to the same architectural entity like a layer, domain, etc. This is especially useful if your code needs to satisfy some "overarching" structural requirements that come from outside your actual project, e.g. from an enterprise architecture.

* `pack.[a.c|b*].d` matches `pack.a.c.d` or `pack.bar.d`, but neither `pack.a.d` nor `pack.b.c.d`
* `..service.(a|b*)..` matches `a.service.bar.more` and group 1 would be `bar`
* Nested alternatives like `(?:..application.(*).*..)|(?:..domain.api|logic.(*)..)` are not allowed to keep the syntax of package identifiers simple and readable.

Resolves: #662
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants