SX-Aurora/veda-tensorflow

VEDA TensorFlow

VEDA TensorFlow is a library to add device support for the NEC SX-Aurora TSUBASA into TensorFlow using the Pluggable Device API.

Release Notes

Version    Comment
v7
  • Added TF v2.13.* support
  • Added TF v2.12.* support
  • Fixed support for TF versions earlier than v2.10.*
v6
  • Added TF v2.11.* support
  • Added TF v2.10.* support
  • Upgraded to VEDA CPP API
v5
  • Added TF v2.9.* support
v4
  • Added BroadcastTo operation
  • Increased host_memory_allocate alignment to 64, as lower values kept failing in isAligned()
v3
  • Bugfixes for loss functions
  • Added missing optimizers: SGD, Adadelta, Adagrad, Adam, and Adamax
  • Fixed possible segfault in PluggableDevice host_memory_allocate
v2
  • Minor changes to enable TF v2.7.1 and v2.8.0
  • Fixed vedaInit error checking to ignore if already initialized
v1
  • Initial release

F.A.Q.

I get the error message 'Internal: platform is already registered with name: "NEC_SX_AURORA"'

This error occurs when the RH-Python38 package is used together with a VirtualEnv. Because TensorFlow does not properly check for symlinks, the device support library gets loaded and initialized twice, which triggers this error message.

Until the bug is resolved in TensorFlow, you can use the following workaround:

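# BEGIN BUGFIX
import sys
import os

# Resolve symlinks so that each directory appears only once in sys.path.
# dict.fromkeys() drops duplicates while preserving the search order
# (a plain set() would shuffle the import order).
sys.path = list(dict.fromkeys(os.path.realpath(p) for p in sys.path))

import site

# Wrap site.getsitepackages() to filter out the 'lib64' directories.
getsitepackages = site.getsitepackages
def getsitepackages_(prefixes=None):
    return [p for p in getsitepackages(prefixes) if 'lib64' not in p]
site.getsitepackages = getsitepackages_
# END BUGFIX

# Import TensorFlow only after the paths have been cleaned up.
import tensorflow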
...
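
To illustrate why resolving paths helps, here is a minimal, self-contained sketch. It is not part of VEDA TensorFlow: the helper name `dedupe_real_paths` and the temporary `lib`/`lib64` directories are made up for this example, and the assumption is that the VirtualEnv contains a `lib64` symlink pointing to `lib`. The two entries differ as strings but collapse to a single directory once resolved with `os.path.realpath`.

```python
import os
import tempfile


def dedupe_real_paths(paths):
    """Resolve symlinks and drop duplicates, keeping the first-seen order."""
    return list(dict.fromkeys(os.path.realpath(p) for p in paths))


with tempfile.TemporaryDirectory() as root:
    real_dir = os.path.join(root, "lib")
    link_dir = os.path.join(root, "lib64")
    os.mkdir(real_dir)
    os.symlink(real_dir, link_dir)  # hypothetical layout: lib64 -> lib

    entries = [real_dir, link_dir]
    resolved = dedupe_real_paths(entries)

    # A naive string comparison sees two different search paths ...
    n_before = len(set(entries))
    # ... but both point to the same directory on disk.
    n_after = len(resolved)
    print(n_before, "distinct strings ->", n_after, "distinct directory")
```

A loader that compares only the path strings would visit the same plugin directory twice in such a layout, which matches the double initialization described above.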

I get the error message "tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid Device id '1' but visible device count is 1"

This is a known problem within TF (reported in the issue titled: TF throws: "'visible_device_list' listed an invalid Device id" when using non-GPU PluggableDevices). It occurs when CUDA and VE devices are used at the same time: the VE devices get added to the list of GPUs, ultimately creating invalid device indices.

You can either manually patch your TF installation (see the TF issue), or disable one of the two device types: use VEDA_VISIBLE_DEVICES=100 to disable the VE devices, or CUDA_VISIBLE_DEVICES= (empty value) to disable the CUDA devices.
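
If you prefer to set these variables from within Python instead of the shell, a minimal sketch could look like the following. The helper name `hide_device_family` and its arguments are hypothetical; only the two environment variables and their values are taken from the text above. It is assumed that the variables have to be set before `tensorflow` is imported, since the device plugins are expected to be initialized at import time.

```python
import os


def hide_device_family(family, env=os.environ):
    """Set the environment variable that hides the VE or the CUDA devices."""
    if family == "ve":
        # Value taken from the F.A.Q. text; assumed to hide the VE devices.
        env["VEDA_VISIBLE_DEVICES"] = "100"
    elif family == "cuda":
        # An empty value hides the CUDA devices.
        env["CUDA_VISIBLE_DEVICES"] = ""
    else:
        raise ValueError(f"unknown device family: {family!r}")


# Hide the CUDA devices so that only the VE devices remain visible.
hide_device_family("cuda")

# import tensorflow  # import only after the environment has been adjusted
```

Passing a plain dict as `env` lets you check the effect without touching the real process environment.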