Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HS_FLAG_UTF8 flag doesn't seem to work as expected on aarch64 platforms #135

Closed
iavjkae opened this issue Oct 31, 2022 · 8 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@iavjkae
Copy link

iavjkae commented Oct 31, 2022

Here are my test result on aarch64

pattern: 星{2}
scan text: 星星点灯
flags: HS_FLAG_UTF8 | HS_FLAG_SINGLEMATCH
expect: matched
actual: unmatched

on linux x86_64 platform this works fine


env

  • vectorscan 5.4.8
  • gcc 9.5.0
  • ragel 6.10
  • boost 1.67.0
  • Linux hostname-PC 4.19.0-arm64-desktop #5214 SMP Tue Sep 6 16:40:43 CST 2022 aarch64 GNU/Linux

test code

  std::string pattern = u8"星{2}";
  hs_compile_error_t *error = nullptr;
  hs_database_t *db         = nullptr;
  int flags = HS_FLAG_UTF8 | HS_FLAG_SINGLEMATCH;

  hs_error_t ret = hs_compile(pattern.c_str(), flags, HS_MODE_BLOCK,
                              nullptr,&db,&error);
  if (ret != HS_SUCCESS) {
      std::cout << "hs_compile error msg: " << error->message << ". expression: " << error->expression << std::endl;
      hs_free_compile_error(error);
      error = nullptr;
  }
  ASSERT_EQ(ret, HS_SUCCESS);

  hs_scratch_t *scratch = nullptr;
  ret = hs_alloc_scratch(db, &scratch);
  ASSERT_EQ(ret, HS_SUCCESS);

  std::string scanText = u8"星星点灯";
  static int matchCount = 0;
  ret = hs_scan(db, scanText.c_str(), scanText.size(), 0, scratch,
                [](unsigned int id, unsigned long long from,
                   unsigned long long to, unsigned int flags,
                   void *context) -> int {
        printMatchedInfo(id, from, to, flags, context);
        matchCount++;
        return 0;
      },
      nullptr);

  EXPECT_EQ(ret, HS_SUCCESS);
  EXPECT_EQ(matchCount,1);

  // cleanup
  hs_free_database(db);
  hs_free_scratch(scratch);
  matchCount = 0;

cmake file

cmake_minimum_required(VERSION 3.16)
cmake_policy(SET CMP0074 NEW)
cmake_policy(SET CMP0115 NEW)
if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.24.0")
    cmake_policy(SET CMP0135 NEW)
endif()
project(utf8-test)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
include(FetchContent)

FetchContent_Declare(vectorscan URL https://github.com/VectorCamp/vectorscan/archive/refs/tags/vectorscan/5.4.8.tar.gz)
FetchContent_MakeAvailable(vectorscan)
FetchContent_GetProperties(vectorscan)

FetchContent_Declare(googletest URL https://github.com/google/googletest/archive/refs/tags/release-1.12.1.tar.gz)
FetchContent_MakeAvailable(googletest)
find_package(Boost REQUIRED)

file(GLOB TEST_SRCS tests/*.cc)
add_executable(utf8-test ${TEST_SRCS})
target_include_directories(utf8-test PRIVATE ${vectorscan_SOURCE_DIR}/src)
target_link_libraries(utf8-test PUBLIC hs gtest gmock)

steps to reproduce

# on project dir
mkdir build
cd build
cmake ..
cmake --build .
./utf8-test

cmake output

-- The C compiler identification is GNU 9.5.0
-- The CXX compiler identification is GNU 9.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/local/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/local/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Deprecation Warning at build/_deps/vectorscan-src/CMakeLists.txt:1 (cmake_minimum_required):
Compatibility with CMake < 2.8.12 will be removed from a future version of
CMake.

Update the VERSION argument value or use a ... suffix to tell
CMake that the project does not need compatibility with older versions.

-- Performing Test ARCH_X86_64
-- Performing Test ARCH_X86_64 - Failed
-- Performing Test ARCH_IA32
-- Performing Test ARCH_IA32 - Failed
-- Performing Test ARCH_AARCH64
-- Performing Test ARCH_AARCH64 - Success
-- Performing Test ARCH_ARM32
-- Performing Test ARCH_ARM32 - Failed
-- Performing Test ARCH_PPC64EL
-- Performing Test ARCH_PPC64EL - Failed
-- Default build type 'Release with debug info'
-- using release build
-- Boost version: 1.67.0
-- Found Python: /usr/bin/python3.7 (found version "3.7.3") found components: Interpreter
-- Build date: 2022-10-31
-- Building static libraries
-- gcc version 9.5.0
CMake Warning at build/_deps/vectorscan-src/CMakeLists.txt:184 (message):
Something went wrong determining gcc tune: -mtune=armv8-a not valid,
falling back to -mtune=native

-- ARCH_C_FLAGS :
-- ARCH_CXX_FLAGS :
-- g++ version 9.5.0
-- Looking for include file unistd.h
-- Looking for include file unistd.h - found
-- Looking for C++ include arm_neon.h
-- Looking for C++ include arm_neon.h - found
-- Looking for posix_memalign
-- Looking for posix_memalign - found
-- Looking for _aligned_malloc
-- Looking for _aligned_malloc - not found
-- Performing Test HAS_C_HIDDEN
-- Performing Test HAS_C_HIDDEN - Success
-- Performing Test HAS_CXX_HIDDEN
-- Performing Test HAS_CXX_HIDDEN - Success
-- Looking for _LIBCPP_VERSION
-- Looking for _LIBCPP_VERSION - not found
-- generator is Unix Makefiles
-- Performing Test HAS_C_ATTR_IFUNC
-- Performing Test HAS_C_ATTR_IFUNC - Success
-- Performing Test HAVE_NEON
-- Performing Test HAVE_NEON - Success
-- Performing Test HAVE_CC_BUILTIN_ASSUME_ALIGNED
-- Performing Test HAVE_CC_BUILTIN_ASSUME_ALIGNED - Success
-- Performing Test HAVE_CXX_BUILTIN_ASSUME_ALIGNED
-- Performing Test HAVE_CXX_BUILTIN_ASSUME_ALIGNED - Success
-- Performing Test HAVE__BUILTIN_CONSTANT_P
-- Performing Test HAVE__BUILTIN_CONSTANT_P - Success
-- Performing Test C_FLAG_Wvla
-- Performing Test C_FLAG_Wvla - Success
-- Performing Test C_FLAG_Wpointer_arith
-- Performing Test C_FLAG_Wpointer_arith - Success
-- Performing Test C_FLAG_Wstrict_prototypes
-- Performing Test C_FLAG_Wstrict_prototypes - Success
-- Performing Test C_FLAG_Wmissing_prototypes
-- Performing Test C_FLAG_Wmissing_prototypes - Success
-- Performing Test CXX_FLAG_Wvla
-- Performing Test CXX_FLAG_Wvla - Success
-- Performing Test CXX_FLAG_Wpointer_arith
-- Performing Test CXX_FLAG_Wpointer_arith - Success
-- Performing Test CC_SELF_ASSIGN
-- Performing Test CC_SELF_ASSIGN - Failed
-- Performing Test CXX_SELF_ASSIGN
-- Performing Test CXX_SELF_ASSIGN - Failed
-- Performing Test CC_PAREN_EQUALITY
-- Performing Test CC_PAREN_EQUALITY - Failed
-- Performing Test CXX_UNUSED_CONST_VAR
-- Performing Test CXX_UNUSED_CONST_VAR - Success
-- Performing Test CXX_IGNORED_ATTR
-- Performing Test CXX_IGNORED_ATTR - Success
-- Performing Test CXX_REDUNDANT_MOVE
-- Performing Test CXX_REDUNDANT_MOVE - Success
-- Performing Test CXX_WEAK_VTABLES
-- Performing Test CXX_WEAK_VTABLES - Failed
-- Performing Test CXX_MISSING_DECLARATIONS
-- Performing Test CXX_MISSING_DECLARATIONS - Success
-- Performing Test CXX_UNUSED_LOCAL_TYPEDEFS
-- Performing Test CXX_UNUSED_LOCAL_TYPEDEFS - Success
-- Performing Test CXX_WUNUSED_VARIABLE
-- Performing Test CXX_WUNUSED_VARIABLE - Success
-- Performing Test CC_STRINGOP_OVERFLOW
-- Performing Test CC_STRINGOP_OVERFLOW - Success
-- Building for current host CPU: -march=armv8-a -mtune=native
-- Looking for mmap
-- Looking for mmap - found
-- Doxygen not found, unable to generate API reference
-- Sphinx not found, unable to generate developer reference
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29")
-- Checking for module 'libpcre>=8.41'
-- No package 'libpcre' found
-- PCRE version 8.41 or above not found
-- PCRE 8.41 or above not found
-- Could not find libpcap - some examples will not be built
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found Boost: /usr/include (found version "1.67.0")
-- Configuring done
-- Generating done
-- Build files have been written to: /home/skywo/workzone/vectorscan-utf8-test/build

result snapshot

image

@markos markos self-assigned this Nov 1, 2022
@markos markos added the bug Something isn't working label Nov 1, 2022
@markos
Copy link

markos commented Nov 1, 2022

I confirm this. I was able to replicate it on my arm system and even on latest develop branch.

@usmannisar
Copy link

Can you check if compiling it with the -fsigned-char flag works? The C standard doesn't specify the signedness of char. On x86 char is signed by default while on Arm it is unsigned by default. Compiling with -fsigned-char will force char to be signed on Arm

@iavjkae
Copy link
Author

iavjkae commented Nov 2, 2022

Can you check if compiling it with the -fsigned-char flag works? The C standard doesn't specify the signedness of char. On x86 char is signed by default while on Arm it is unsigned by default. Compiling with -fsigned-char will force char to be signed on Arm

thank you for your suggestion.
I recompiled with -fsigned-char flag and now it works. thanks.
@markos Should it be compiled with the -fsigned-char flag by default on aarch64 platforms?

@markos
Copy link

markos commented Nov 2, 2022

It's probably a revisit of #98. Reopening that and I will try to do a proper fix and replace char with explicit int8_t/uint8_t where appropriate.

@dmbaggett
Copy link

dmbaggett commented Jan 23, 2023

I have a bit more to report on this, in case it helps anyone else. On Graviton3 (AWS arm64 CPU with SVE extensions) running Ubuntu 22, building vectorscan from the 5.4.7 tag with GCC 11 -O2 or higher produces a library that crashes on startup when used with rspamd; it seems like GCC just generates bad code for this arch for some reason. (I did not attempt to debug what exactly it was doing.)

Building with clang-14 works much better and supports arbitrary optimization properly (I used -O3 -march=armv8.4-a+crc+crypto+sve) but this surfaces the issue mentioned above about signedness of char. I was able to address this by patching the Ragel inputs (.rl files) to add

alphtype unsigned char;

after each machine declaration. I also modifed ragel.cmake to call ragel with -G0 to force Ragel to emit a goto-based tokenizer instead of a table-based one, to get around char signedness mismatch compiler warnings.

Despite all that, I was still only able to get a working Vectorscan on Graviton3 with 5.4.7 code. When I build from 5.4.8 or current head code, rspamd just aborts at startup in hs_compile_multi. I didn't try to debug this either.

Finally, I noticed that HS_PATCH is still defined as 0 even though it seems like it should be 7 (for version 5.4.7).

@markos
Copy link

markos commented Mar 21, 2023

@dmbaggett Apologies for the delay. I can confirm that adding this line in Parser.rl this fixes the test on aarch64. I will add them to all the .rl files as you suggested. Regarding the failure for rspamd with 5.4.8, this is probably: #140. A new release will be done shortly with all the fixes.

@markos
Copy link

markos commented Mar 21, 2023

yeah, a small update, this breaks the x86 tests, sigh, I'll have to find a way to solve it for both platforms.

markos added a commit that referenced this issue Mar 23, 2023
…ar-on-arm

Set Ragel.rl char type to unsigned, #135
@markos
Copy link

markos commented Mar 23, 2023

We can confirm this is fixed in #141 . A new release is going to happen very soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants