[Release] v0.17.0 Release Candidate Notes #17178

Closed
ysh329 opened this issue Jul 20, 2024 · 0 comments

Introduction

The TVM community has worked since the v0.16.0 release to deliver the following exciting new improvements in v0.17.0!

The main categories are listed below (bold text marks areas with the most progress):

  • Community, RFCs
  • AOT, Hexagon, OpenCL & CLML, Web, Metal
  • Relax, Dlight, Disco
  • TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.17.dev0...v0.17.0.rc0.

Community

  • #17018 - New committer: Balint Cristian

RFCs

This new RFC adds a frontend for NNEF, an open, standardized format for neural network exchange developed by the Khronos Group since 2018 (https://www.khronos.org/nnef). NNEF aims to enable deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors.

  • #108 - [RFC] Add NNEF frontend

AOT

  • #17077 - Correctly calculate workspace for vector types

Adreno

  • #16927 - [SCRIPT]Fix in build config for adreno

BYOC

  • #16895 - Add layout check and update shape check for cublas FP8 BYOC

BugFix

  • #17138 - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
  • #17132 - Restrict CopyOnWrite to _type_final
  • #17096 - Update FAttrsGetter to return Map<String, ObjectRef>
  • #17078 - [NCCL] Release NCCL thread_local resources in destructor
  • #17044 - [Support] Fix copy constructor for support::OrderedSet
  • #17000 - [MSC] split name_string with index by colon from the right
  • #16923 - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
  • #16924 - [Fix] Fix SSA conversion for SizeVar retention
  • #16903 - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
  • #16901 - rocm shared memory issue on MI250
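Among the fixes above, #16923 concerns the log-sum-exp reduction in Dlight's GeneralReduction rule. The key numerical issue with log-sum-exp is that a naive `log(sum(exp(x)))` overflows for large inputs; the standard fix is to subtract the running maximum first. As a plain-Python sketch of that stabilization (a concept illustration, not TVM's scheduled kernel):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)).

    Subtracting the maximum keeps every exp() argument <= 0, so the
    sum cannot overflow; the maximum is added back at the end.
    """
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

For example, `logsumexp([1000.0, 1000.0])` returns `1000.0 + log(2)`, whereas the naive formulation would overflow at `exp(1000)`.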

CI

  • #17055 - [SME][Test] Add additional conv2d tests for asymmetric parameters
  • #17007 - [TOPI][Testing] Enable conv2d NHWC fp16 topi testing for arm_cpu
  • #16930 - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
  • #16948 - Update image tag to 20240428-060115-0b09ed018
  • #16931 - Use LLVM17 for tests on ci_cpu
  • #16942 - Enable Conda setup v3
  • #16939 - Upgrade CUDA to 12.4

CRT

  • #17097 - [Bugfix]Return error code on error from ModuleGetFunction

Disco

  • #17035 - [QoL] Implement broadcast/scatter methods for Session
  • #16992 - [Bugfix]Handle NDArray larger than OS buffer for pipe
  • #16978 - Implement num_workers property for disco.Session
  • #16989 - Treat hangup of disco worker process as kShutdown
  • #16993 - Allow allocation that only exists on worker0
  • #16979 - Expose disco.Session.shutdown through the python API
  • #16919 - Improve error message for CallPacked


Dlight

  • #17082 - Use 16x32 spatial x reduction thread extents in GEMV scheduling
  • #17052 - Skip GEMV rules when more than one vector
  • #17026 - Perf improvement for low_batch_gemv on Metal
  • #17016 - Update Adreno GEMV Rules
  • #16972 - [GPU] Enhance opencl thread limit for schedules
  • #16973 - [GPU] Improved gemv outer fallback schedule
  • #16958 - Check for target in function attributes
  • #16894 - Enhance vectorization for gpu matmul
  • #16884 - Add check for matmul dtype and fix reduction rule

Docs

  • #17146 - [DOC] Fix typo for the "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Reley."
  • #17015 - [DOC] Update Model Links to Include Commit

Hexagon

  • #17123 - Add support for v75

LLVM

  • #17046 - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
  • #16966 - [SVE] Add support for representing and creating buffer-level predicates
  • #17001 - [SVE] Use only powers of two as possible vscale values
  • #16962 - [SVE] Add codegen support for vscale_range() function attribute
  • #16968 - Stringref API deprecation fixes
  • #16965 - [SVE] Add get_active_lane_mask builtin
  • #16899 - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for arm_cpu
  • #16893 - [SVE] Check for SVE target in VectorizeLoop
  • #16862 - [SVE] Support splitting by vscale in tir::split and te::split
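Several of the SVE items above (notably #16965) revolve around predicated vectorization: instead of emitting a scalar epilogue for loop tails that don't fill a whole vector, the compiler emits a lane mask. The semantics of LLVM's `llvm.get.active.lane.mask` intrinsic, which #16965 exposes as a builtin, can be sketched in plain Python:

```python
def get_active_lane_mask(base, limit, vl):
    """Semantics of llvm.get.active.lane.mask: lane i of a vl-wide
    vector is active iff base + i < limit. Vector operations guarded
    by this mask only touch in-bounds elements, so the tail of a
    vectorized loop needs no scalar epilogue."""
    return [base + i < limit for i in range(vl)]
```

For a loop of 16 iterations processed 8 lanes at a time, the second iteration calls `get_active_lane_mask(8, 16, 8)` (all lanes active), while a trip count of, say, 12 would yield a half-active mask for the tail.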

MetaSchedule

  • #17012 - [BugFix]MultiLevelTilingTensorCore generates inconsistent thread-binding sketch for batched matmul
  • #17066 - [BugFix]Fix TensorIntrin ‘dot_4x4_i8i8s32_sdot’ is not registered

Metal

  • #17059 - Enable Debug Label
  • #17025 - Support metal device profiling

OpenCL & CLML

  • #16933 - [CLML] Fix in clml pattern check condition
  • #16929 - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy

ROCm

  • #17141 - [Backend]Fix error when building TVM with LLVM 19

Relax

  • #17139 - Fix cublas dispatch for corner cases
  • #17127 - [KVCache] Support fork in sliding window sink part
  • #17115 - Support input_axis_separator to allow 2D to 1D conversion
  • #17119 - [Bugfix]Set purity=false for LazySetOutput
  • #17118 - [VM] Improved error messages for mismatched parameter count
  • #17110 - Alloc BYOC workspace with R.builtin.alloc_tensor
  • #17089 - [ONNX] Add support for HardSigmoid
  • #17100 - [KVCache] Unlimited depth blocks
  • #17075 - [Transform] Modify FuseTIR pass to propagate buffer attributes
  • #17088 - [ONNX] Add support for HardSwish
  • #17085 - [PyTorch] Add support for torch.nn.Hardsigmoid
  • #17083 - [TVMScript]Preserve tir.SizeVar through TVMScript round-trip
  • #17086 - Ignore dynamic parameters in RewriteDataflowReshape
  • #17084 - [PyTorch] Add support for torch.nn.Hardswish
  • #17074 - [KVCache][Test] Fix TIR attn kernels for uncommon group size
  • #17067 - Add missing white spaces in error messages
  • #17061 - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
  • #17033 - [Bugfix] Apply FuseOps to nested DataflowBlock
  • #17032 - [Bugfix] Annotate ComputePrimValue output as host function
  • #17034 - [Bugfix] Bind symbolic variables in R.match_cast
  • #16960 - [UnitTest] Validate IRModule with multiple targets
  • #16995 - [KVCache] Support KVCache decode from forked sequence and pop more tokens
  • #16959 - [Transform] Handle identical PrimFunc with distinct VDevice
  • #16589 - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
  • #16988 - [KVCache] Fix the aux data syncing order of paged KV cache
  • #16922 - [BugFix]change FuseOpsByPattern strategy to pattern-match maximal subgraph
  • #16982 - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
  • #16955 - Implement relax.op.view
  • #16971 - Support nested ModuleList in nn.Module
  • #16826 - Express dynamic arguments of strided_slice as arguments
  • #16476 - [Unity][Cutlass] Fix C source generation of dense operation
  • #16940 - Allow PrimValue as index in relax.op.take
  • #16934 - [TIR] Introduce new cumsum op for gpu
  • #16859 - [QoL]Use SeqExpr in IR types when SeqExpr is required
  • #16904 - Prevent to generate duplicate func in dispatch_sort_scan
  • #16905 - [Bugfix]Raise exception for OOM allocation
  • #16827 - Handle binary operations between Tensor and PrimValue
  • #16902 - Allow specifying entry_funcs for BYOC
  • #16860 - [QoL]Infer StructInfo for relax::Tuple on construction
  • #16861 - [QoL]Return well-formed IR from relax::Function::CreateEmpty
  • #16886 - [Frontend] Fix sort, argsort and topk in nn module
  • #16883 - Stabilize relax pass mutation order

Relay

  • #16983 - [BugFix]skip leaf args when matching 'path' part for dominator pattern
  • #16996 - fixed to make TupleGetItem inherits the previous span

Runtime

  • #17057 - Stateless interface of PagedKVCache leaf node commit
  • #17049 - Support PagedKVCache with tree attention
  • #17045 - Fix PagedKVCache for PopN and enhance tests
  • #16998 - Compatibility with dmlc::Stream API changes
  • #17037 - [ROCm] Enable ROCm host memory support
  • #17036 - Use preferred host memory (pinned memory) in KV cache
  • #16994 - Allow query of available device memory through DeviceAPI
  • #16997 - [Disco] Restore checks for hangup of disco pipe
  • #16938 - Allow offset to be specified in NDArray::CreateView
  • #16890 - [VULKAN] Support total_global_memory
  • #16880 - Implemented Datatype.itemsize()
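Much of the runtime work above centers on the PagedKVCache (#17057, #17049, #17045): attention keys/values are stored in fixed-size pages referenced through a block table, so cache memory grows in page granularity and trailing tokens can be popped cheaply. The following is a toy concept sketch of that idea, not TVM's PagedKVCache API (names like `ToyPagedKVCache` and `popn` are illustrative only):

```python
PAGE_SIZE = 4  # tokens per page (toy value)

class ToyPagedKVCache:
    """Toy paged KV cache: per-sequence tokens live in fixed-size
    pages indexed via a block table, mimicking the paged-attention
    memory layout in spirit only."""

    def __init__(self):
        self.pages = []          # physical page storage
        self.block_table = {}    # seq_id -> list of page indices
        self.lengths = {}        # seq_id -> number of live tokens

    def append(self, seq_id, kv):
        table = self.block_table.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:            # current page full: allocate a new one
            self.pages.append([None] * PAGE_SIZE)
            table.append(len(self.pages) - 1)
        self.pages[table[-1]][n % PAGE_SIZE] = kv
        self.lengths[seq_id] = n + 1

    def popn(self, seq_id, n):
        """Drop the last n tokens and release now-empty trailing pages
        from the block table (cf. the PopN handling fixed in #17045)."""
        self.lengths[seq_id] -= n
        pages_needed = -(-self.lengths[seq_id] // PAGE_SIZE)  # ceil division
        del self.block_table[seq_id][pages_needed:]

    def tokens(self, seq_id):
        table = self.block_table[seq_id]
        return [self.pages[table[i // PAGE_SIZE]][i % PAGE_SIZE]
                for i in range(self.lengths[seq_id])]
```

The block-table indirection is what makes features like forked sequences and tree attention tractable: multiple sequences can reference shared pages without copying.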

TIR

  • #17134 - [Schedule] Remove @type_check for set_axis_separator
  • #17112 - [DLight] Enable SimdGroup op for Metal
  • #17098 - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
  • #17039 - Fix Bug in VectorizeLoop
  • #17030 - Fix Shuffle rewrite
  • #16947 - Support narrow dtype for let binding
  • #16952 - Enhance CLZ intrinsic support
  • #16945 - [Compute-at] Make compute-ated block simple when the predicate could be merged
  • #16879 - Make T.reinterpret nop when dtype is the same

TOPI

  • #17091 - Add dense schedule for fp16 and fp32 using gemm
  • #17048 - [SME]Add conv2d NHWC SME fp16->fp32 schedule
  • #17040 - Fix SME conv2d schedule import and intrin argument
  • #17003 - [SME]Add conv2d NHWC SME fp32 schedule
  • #16977 - Remove blockIdx.z in topi sort
  • #16951 - Revert unification of conv2d NHWC hybrid scheduling for arm_cpu targets

TVMScript

  • #17107 - Better Type Annotation for TIR OP
  • #16967 - Fix error reporting inside Macro func
  • #16916 - Support T.launch_thread with i64 dtype
  • #16876 - Optionally use ruff format instead of black
  • #16877 - [Bug] Add test case for missing symbolic bounds

cuda & cutlass & tensorrt

  • #16980 - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

web

  • #17031 - Fix string to uint8 array for special characters
  • #17028 - Add dtype and offset for CreateView in runtime
  • #16910 - Support string[] in setPackedFunc() and exceptionally long arrays

Misc

  • #17135 - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
  • #17125 - [Utils] Define line-length for "ruff format"
  • #17152 - GraphExecutor: Fix wild pointer assign when input and output are reshape
  • #17150 - [WebGPU] Fall back to 256MB for maxBufferSize if needed
  • #17128 - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
  • #16976 - [WebGPU] Implement tir.dp4a with WGSL built-in function dot4I8Packed
  • #17124 - [WebGPU] Add tir.dp4a
  • #17113 - [CudaGraph] Handle exceptions thrown while capturing cuda graph
  • #17094 - [Utility][Container] Support non-nullable types in Array::Map
  • #17101 - [RPC] Raise error if server process terminated
  • #17092 - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible
  • #17054 - [SME] Utilize predication in fp32 matmul and conv2d schedules
  • #17079 - [CMake] Show NVCC include directories in compile_commands.json
  • #17076 - [SME] Extract gemm block correctly when fused with bias
  • #17071 - [WebGPU] Translate int8x4 into u32
  • #17065 - [FP8][Codegen] Add make_fp8 vector constructors
  • #17064 - Add docs of v0.15.0 and v0.16.0
  • #16985 - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
  • #17058 - Introduce outer reduction for metal
  • #17051 - Use adapter.info when available instead of requestAdapterInfo
  • #16981 - [SME] Add scalable fp16->fp32 dense schedule
  • #17029 - [Contrib] Implement NDArray cache update
  • #17027 - [picojson] Let objects be ordered when serializing
  • #17021 - [WebGPU] Update error messages to be more user-friendly
  • #17010 - Support multinomial_from_uniform dispatch
  • #16999 - [USMP] add missing const specifier for global_const_workspace
  • #17005 - [WebGPU] Handle device OOM in createBuffer
  • #16921 - [SME] Introduce scalable fp32 dense schedule
  • #16957 - chore: remove repetitive words
  • #16909 - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
  • #16987 - [JVM] Automatic Compatibility of JVM AttachCurrentThread
  • #16974 - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
  • #16896 - [CUBLAS] Enable offloading of R.matmul + R.dequantize
  • #16956 - Add script for testing release package
  • #16908 - Overriding the StructuralEqual() for easy usage
  • #16932 - Enable gemv schedule for adreno
  • #16935 - [3rdparty] Bump FlashInfer for sampling functions
  • #16937 - [Thrust] Increase static workspace size
  • #16915 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 2
  • #16741 - Restore "pytest.mark.gpu" for RELAX tests
  • #16914 - [CMAKE] Make LOG_BEFORE_THROW explicit
  • #16913 - Enhance Release Note Script and Remove Useless File
  • #16907 - [Upd] Fixed lld search in rocm
  • #16900 - [CMAKE] Misc improvment of Util
  • #16897 - [Target] Don't register AArch64 target tags without LLVM compiler support
  • #16892 - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
  • #16888 - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
  • #16887 - [Contrib] Enable fp16 for thrust sort
  • #16881 - [release][Dont Squash] Update version to 0.16.0 and 0.17.0.dev on main branch
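The WebGPU items #17124 and #16976 above add `tir.dp4a` and lower it to WGSL's `dot4I8Packed` built-in: a u32 is treated as four packed signed 8-bit lanes, the two operands are dotted lane-wise, and the result is accumulated. A pure-Python sketch of those semantics (illustrative only, not TVM's codegen):

```python
def dp4a(a_u32, b_u32, acc):
    """dp4a semantics: unpack each u32 into four signed 8-bit lanes
    (little-endian lane order), dot the lanes, add to the accumulator."""
    def lanes(x):
        out = []
        for i in range(4):
            v = (x >> (8 * i)) & 0xFF
            out.append(v - 256 if v >= 128 else v)  # sign-extend int8
        return out
    return acc + sum(ai * bi for ai, bi in zip(lanes(a_u32), lanes(b_u32)))
```

For instance, `dp4a(0x04030201, 0x01010101, 10)` dots lanes `[1, 2, 3, 4]` with `[1, 1, 1, 1]` and returns `20`. Packing four int8 multiply-accumulates into one instruction is what makes int8 GEMM kernels fast on hardware exposing this primitive.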
ysh329 closed this as completed Jul 25, 2024