Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement grouped conv interface #80870

Closed
wants to merge 8,927 commits into from

Conversation

srcarroll
Copy link
Contributor

No description provided.

@banach-space banach-space self-requested a review February 7, 2024 19:22
dsandersllvm and others added 29 commits June 7, 2024 14:01
This reverts commit 2ec122d.
Casting the result of `Section.getAddressWithOffset()` goes wrong if we
are on a 32-bit platform whose addresses are regarded as signed; in that
case, just doing
```
(uint64_t)Section.getAddressWithOffset(...)
```
or
```
reinterpret_cast<uint64_t>(Section.getAddressWithOffset(...))
```
will result in sign-extension.

We use these expressions when constructing branch stubs, which is before
we know the final load address, so we can just switch to the
`Section.getLoadAddressWithOffset(...)` method instead.

Doing that is also more consistent, since when calculating relative
offsets for relocations, we use the load address anyway, so the code
currently only works because `Section.Address` is equal to
`Section.LoadAddress` at this point.

Fixes llvm#94478.
…s on LA64 (llvm#93813)

Materializing constants on LoongArch is simpler if the constant is sign
extended from i32. By default i32 constant operands of phis are zero
extended.
    
This patch adds a hook to allow LoongArch to override this for i32. We
have an existing isSExtCheaperThanZExt, but it operates on EVT which we
don't have at these places in the code.
Implements fmaxf16 and fminf16, which are two missing functions listed
here: llvm#93566
This patch make all errors start with a lowercase letter and removes
trailing periods and newlines. This fixes inconsistencies between error
messages and facilitate concatenating them.
…ges (llvm#94259)

This patch changes the crashlog image loading default behaviour to not
only load images from the crashed thread but also for the application
specific backtrace thread.

This patch also move the Application Specific Backtrace / Last Exception
Backtrace tag from the thread queue field to the thread name.

rdar://128276576

Signed-off-by: Med Ismail Bennani <ismail@bennani.ma>
Following of llvm#86912

The motivation of the patch series is that, for a module interface unit
`X`, when the dependent modules of `X` changes, if the changes is not
relevant with `X`, we hope the BMI of `X` won't change. For the specific
patch, we hope if the changes was about irrelevant declaration changes,
we hope the BMI of `X` won't change. **However**, I found the patch
itself is not very useful in practice, since the adding or removing
declarations, will change the state of identifiers and types in most
cases.

That said, for the most simple example,

```
// partA.cppm
export module m:partA;

// partA.v1.cppm
export module m:partA;
export void a() {}

// partB.cppm
export module m:partB;
export void b() {}

// m.cppm
export module m;
export import :partA;
export import :partB;

// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
    b();
}
```

the BMI of `onlyUseB` will change after we change the implementation of
`partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new
identifiers and types (the function prototype).

So in this patch, we have to write the tests as:

```
// partA.cppm
export module m:partA;
export int getA() { ... }
export int getA2(int) { ... }

// partA.v1.cppm
export module m:partA;
export int getA() { ... }
export int getA(int) { ... }
export int getA2(int) { ... }

// partB.cppm
export module m:partB;
export void b() {}

// m.cppm
export module m;
export import :partA;
export import :partB;

// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
    b();
}
```

so that the new introduced declaration `int getA(int)` doesn't introduce
new identifiers and types, then the BMI of `onlyUseB` can keep
unchanged.

While it looks not so great, the patch should be the base of the patch
to erase the transitive change for identifiers and types since I don't
know how can we introduce new types and identifiers without introducing
new declarations. Given how tightly the relationship between
declarations, types and identifiers, I think we can only reach the ideal
state after we made the series for all of the three entties.

The design of the patch is similar to
llvm#86912, which extends the
32-bit DeclID to 64-bit and use the higher bits to store the module file
index and the lower bits to store the Local Decl ID.

A slight difference is that we only use 48 bits to store the new DeclID
since we try to use the higher 16 bits to store the module ID in the
prefix of Decl class. Previously, we use 32 bits to store the module ID
and 32 bits to store the DeclID. I don't want to allocate additional
space so I tried to make the additional space the same as 64 bits. An
potential interesting thing here is about the relationship between the
module ID and the module file index. I feel we can get the module file
index by the module ID. But I didn't prove it or implement it. Since I
want to make the patch itself as small as possible. We can make it in
the future if we want.

Another change in the patch is the new concept Decl Index, which means
the index of the very big array `DeclsLoaded` in ASTReader. Previously,
the index of a loaded declaration is simply the Decl ID minus
PREDEFINED_DECL_NUMs. So there are some places they got used
ambiguously. But this patch tried to split these two concepts.

As llvm#86912 did, the change will
increase the on-disk PCM file sizes. As the declaration ID may be the
most IDs in the PCM file, this can have the biggest impact on the size.
In my experiments, this change will bring 6.6% increase of the on-disk
PCM size. No compile-time performance regression observed. Given the
benefits in the motivation example, I think the cost is worthwhile.
…vm#92746)

This patch add support of intrinsics GNU extension GETCWD
llvm#84203. Some usage info and
example has been added to `flang/docs/Intrinsics.md`. The patch contains
both the lowering and the runtime code and works on both Windows and
Linux.


|   System   |   Implmentation  |
|-----------|--------------------|
| Windows | _getcwd               |
| Linux       |getcwd                  |
…86512)

This patch implements a `__is_bitwise_cloneable` builtin in clang.

The builtin is used as a guard to check a type can be safely bitwise
copied by memcpy. It's functionally similar to
`__is_trivially_copyable`, but covers a wider range of types (e.g.
classes with virtual functions). The compiler guarantees that after
copy, the destination object has the same object representations as the
source object. And it is up to user to guarantee that program semantic
constraints are satisfied.

Context:
https://discourse.llvm.org/t/extension-for-creating-objects-via-memcpy
…lvm#93814)

Although i32 type is illegal in the backend, LA64 has pretty good
support for i32 types by using W instructions.

By adding n32 to the DataLayout string, middle end optimizations will
consider i32 to be a native type. One known effect of this is enabling
LoopStrengthReduce on loops with i32 induction variables. This can be
beneficial because C/C++ code often has loops with i32 induction
variables due to the use of `int` or `unsigned int`.

If this patch exposes performance issues, those are better addressed by
tuning LSR or other passes.
This commit enhances the docsting of `translateModuleToLLVMIR` as a
followup to llvm#94445
As the comment already indicates, only replacement with undef
is problematic, as it introduces an additional use of undef.
Use the correct ValueTracking helper.
If we're only checking for undef, then also only look for undef
elements in the vector (rather than undef and poison).
…#91715)

- There is no restriction on a loop with controlled convergent
operations when
  the relevant tokens are defined and used within the loop.

- When a token defined outside a loop is used inside (also called a loop
convergence heart), unrolling is allowed only in the absence of
remainder or
  runtime checks.

- When a token defined inside a loop is used outside, such a loop is
said to be
"extended". This loop can only be unrolled by also duplicating the
extended part
  lying outside the loop. Such unrolling is disabled for now.

- Clean up loop hearts: When unrolling a loop with a heart, duplicating
the
heart will introduce multiple static uses of a convergence control token
in a
cycle that does not contain its definition. This violates the static
rules for
tokens, and needs to be cleaned up into a single occurrence of the
intrinsic.

- Spell out the initializer for UnrollLoopOptions to improve
readability.


Original implementation [D85605] by Nicolai Haehnle
<nicolai.haehnle@amd.com>.
…lvm#93806)

The m_ZExtOrSelf() family of matchers currently incorrectly calls
std::forward twice on the same value. However, just removing those causes
other complications, because then template arguments get incorrectly
inferred to const references instead of the underlying value types.
Things become a mess.

Instead, just completely remove the use of std::forward and rvalue
references from SDPatternMatch. I don't think they really provide value
in this context, especially as they're not used consistently in the
first place.
…Y` are known signed/unsigned

Several transforms:
    1) If known `Y < 0`:
        - slt -> ult: https://alive2.llvm.org/ce/z/9zt2iK
        - sle -> ule: https://alive2.llvm.org/ce/z/SPoPNF
        - sgt -> ugt: https://alive2.llvm.org/ce/z/IGNxAk
        - sge -> uge: https://alive2.llvm.org/ce/z/joqTvR
    2) If known `Y >= 0`:
        - `(X & PosY) s> X --> X s< 0`
            - https://alive2.llvm.org/ce/z/7e-5BQ
        - `(X & PosY) s> X --> X s< 0`
            - https://alive2.llvm.org/ce/z/jvT4Gb
    3) If known `X < 0`:
        - `(NegX & Y) s> NegX --> Y s>= 0`
            - https://alive2.llvm.org/ce/z/ApkaEh
        - `(NegX & Y) s<= NegX --> Y s< 0`
            - https://alive2.llvm.org/ce/z/oRnfHp

Closes llvm#94417
…UEs (llvm#94458)

`SelectionDAGBuilder::handleDebugValue` has a parameter `Order` which
represents the insert-at position for the new DBG_VALUE. Prior to this patch
`SelectionDAGBuilder::SDNodeOrder` is used instead of the `Order` parameter.

The only code-paths where `Order != SDNodeOrder` are the two calls calls to
`handleDebugValue` from `salvageUnresolvedDbgValue`.
`salvageUnresolvedDbgValue` is called from `resolveOrClearDbgInfo` and
`dropDanglingDebugInfo`. The former is called after SelectionDAG completes one
block.

Some dbg.values can't be lowered to DBG_VALUEs right away. These get recorded
as 'dangling' - their order-number is saved - and get salvaged later through
`dropDanglingDebugInfo`, or if we've still got dangling debug info once the
whole block has been emitted, through `resolveOrClearDbgInfo`. Their saved
order-number is passed to `handleDebugValue`.

Prior to this patch, DBG_VALUEs inserted using these functions are inserted at
the "current" `SDNodeOrder` rather than the intended position that is passed to
the function.

Fix and add test.
davemgreen and others added 29 commits June 7, 2024 14:02
Change the target triple to remove some unnecessary instructions.
This change is an implementation of
llvm#87367 investigation on
supporting IEEE math operations as intrinsics.
Which was discussed in this RFC:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

This PR is just for Tan.

Now that x86 tan backend landed:
llvm#90503 we can add other
backends since the shared pieces are in tree now.

Changes:
- `llvm/include/llvm/Analysis/VecFuncs.def` - vectorization of tan for
arm64 backends.
- `llvm/lib/Target/AArch64/AArch64FastISel.cpp` - Add tan to the libcall
table
- `llvm/lib/Target/AArch64/AArch64ISelLowering.cpp` - Add tan expansion
for f128, f16, and vector\neon operations
- `llvm/lib/Target/AArch64/GISel/AArch64LegalizerInfo.cpp` define
`G_FTAN` as a legal arm64 instruction

resolves llvm#94755
Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native` among other utilities in clang. However, these
rely on the GPU drivers to query the features. In certain cases these
drivers can become locked up, which will lead to indefinate hangs on any
compiler jobs running in the meantime.

This patch adds a ten second timeout period for these utilities before
it kills the job and errors out.
This PR depends on llvm#90260

We changed the order in which functions are outlined in Machine
Outliner.

The formula for priority is found via a black-box Bayesian optimization
toolbox. Using this formula for sorting consistently reduces the
uncompressed size of large real-world mobile apps. We also ran a few
benchmarks using LLVM test suites, and showed that sorting by priority
consistently reduces the text segment size.

|run (CTMark/)   |baseline (1)|priority (2)|diff (1 -> 2)|
|----------------|------------|------------|-------------|
|lencod          |349624      |349264      |-0.1030%     |
|SPASS           |219672      |219480      |-0.0874%     |
|kc              |271956      |251200      |-7.6321%     |
|sqlite3         |223920      |223708      |-0.0947%     |
|7zip-benchmark  |405364      |402624      |-0.6759%     |
|bullet          |139820      |139500      |-0.2289%     |
|consumer-typeset|295684      |290196      |-1.8560%     |
|pairlocalalign  |72236       |72092       |-0.1993%     |
|tramp3d-v4      |189572      |189292      |-0.1477%     |

This is part of an enhanced version of machine outliner -- see
[RFC](https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-1-fulllto-part-2-thinlto-nolto-to-come/78732).
Parameter "Version" is confusing in deserializeV012 and deserializeV3
because we also have member variable "Version".  Fortunately,
parameter "Version" and member variable "Version" always have the same
value because IndexedMemProfReader::deserialize initializes the member
variable and passes it to deserializeV012 and deserializeV3.

This patch removes the parameter.
This patch integrates CallStackRadixTreeBuilder into the V3 format,
reducing the profile size to about 27% of the V2 profile size.

- Serialization: writeMemProfCallStackArray just needs to write out
  the radix tree array prepared by CallStackRadixTreeBuilder.
  Mappings from CallStackIds to LinearCallStackIds are moved by new
  function CallStackRadixTreeBuilder::takeCallStackPos.

- Deserialization: Deserializing a call stack is the same as
  deserializing an array encoded in the obvious manner -- the length
  followed by the payload, except that we need to follow a pointer to
  the parent to take advantage of common prefixes once in a while.
  This patch teaches LinearCallStackIdConverter to how to handle those
  pointers.
The "Emulated" sub-directories under "ArmSVE" and
"ArmSME" have been removed. Associated tests
have been moved up a directory and now include
the "REQUIRES" constraint for the arm-emulator.
Allow KnownBits to represent "always poison" values via conflict.

close: llvm#94436
…#94646)

These tests pass on 64-bit. They were fixed by 5fdd094 on 32-bit.
So XFAIL only for 32-bit before clang 19.
If we are extracting the even lanes and the odd lanes and adding them, we can
use an addp instruction.
llvm#94550)

For regex patterns that produce zero-length matches, there is one
(imaginary) match in-between every character in the sequence being
searched (as well as before the first character and after the last
character). It's easiest to demonstrate using replacement:
`std::regex_replace("abc"s, "!", "")` should produce `!a!b!c!`, where
each exclamation mark makes a zero-length match visible.

Currently our implementation doesn't correctly set the prefix of each
zero-length match, "swallowing" the characters separating the imaginary
matches -- e.g. when going through zero-length matches within `abc`, the
corresponding prefixes should be `{'', 'a', 'b', 'c'}`, but before this
patch they will all be empty (`{'', '', '', ''}`). This happens in the
implementation of `regex_iterator::operator++`. Note that the Standard
spells out quite explicitly that the prefix might need to be adjusted
when dealing with zero-length matches in
[`re.regiter.incr`](http://eel.is/c++draft/re.regiter.incr):
> In all cases in which the call to `regex_search` returns `true`,
`match.prefix().first` shall be equal to the previous value of
`match[0].second`... It is unspecified how the implementation makes
these adjustments.

[Reproduction example](https://godbolt.org/z/8ve6G3dav)
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string str = "abc";
  std::regex empty_matching_pattern("");

  { // The underlying problem is that `regex_iterator::operator++` doesn't update
    // the prefix correctly.
    std::sregex_iterator i(str.begin(), str.end(), empty_matching_pattern), e;
    std::cout << "\"";
    for (; i != e; ++i) {
      const std::ssub_match& prefix = i->prefix();
      std::cout << prefix.str();
    }
    std::cout << "\"\n";
    // Before the patch: ""
    // After the patch: "abc"
  }

  { // `regex_replace` makes the problem very visible.
    std::string replaced = std::regex_replace(str, empty_matching_pattern, "!");
    std::cout << "\"" << replaced << "\"\n";
    // Before the patch: "!!!!"
    // After the patch: "!a!b!c!"
  }
}
```

Fixes llvm#64451

rdar://119912002
Re-apply llvm#87550 with fixes.

Details:
Some tests in fuchsia failed because of the newly added assertion.
This was because `GetExceptionBreakpoint()` could be called before
`g_dap.debugger` was initted.

The fix here is to just lazily populate the list in
GetExceptionBreakpoint() rather than assuming it's already been initted.
(There is some nuisance here because we can't simply just populate it in
DAP::DAP(), which is a global ctor and is called before
`SBDebugger::Initialize()` is called. )
This patch reverts 9b832b7 (llvm#87111):
- [libc++] Deprecated `shared_ptr` Atomic Access APIs as per P0718R2
- [libc++] Implemented P2869R3: Remove Deprecated `shared_ptr` Atomic Access APIs from C++26

As explained in [1], the suggested replacement in P2869R3 is `__cpp_lib_atomic_shared_ptr`,
which libc++ does not yet implement. Let's not deprecate the old way of doing things before
the new way of doing things exists.

[1]: llvm#87111 (comment)
…rep expression

(and remove an unused argument)
Add SHAPE runtime API (will be used for assumed-rank, lowering is
generating other cases inline).

I tried to make it in a way were there is no dynamic allocation in the
runtime/deallocation expected to be inserted by inline code for arrays
that we know are small (lowering will just always stack allocate a rank
15 array to avoid dynamic stack allocation or heap allocation).
)

Summary:
AMDGPU supports a `target-id` feature which is used to qualify targets
with different incompatible features. These are both rules and target
features. Currently, we pass `-target-cpu` twice when offloading to
OpenMP, and do not pass the target-id features at all. The effect was
that passing something like `--offload-arch=gfx90a:xnack+` would show up
as `-target-cpu=gfx90a:xnack+ -target-cpu=gfx90a`. Thus ignoring the
xnack completely and passing it twice. This patch fixes that to pass it
once and then separate it like how HIP does.
…m#94592)

As discussed in llvm#94443, this PR changes the wording to be more correct.
Otherwise, older copies of LLD may not understand the latest bitcode
versions (for example, if we increase
`ModuleSummaryIndex::BitCodeSummaryVersion`)

Related to
llvm#90692 (comment)
@srcarroll srcarroll closed this Jun 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.