Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Accept list for __setitem__ #156

Closed
christian2022 opened this issue Dec 9, 2021 · 10 comments
Closed

Feature: Accept list for __setitem__ #156

christian2022 opened this issue Dec 9, 2021 · 10 comments

Comments

@christian2022
Copy link

Set multiple indices provided as a list, e.g.

bitarray[[1,2,5]] = True

Today only bitarray[1:5] = True works.

@ilanschnell
Copy link
Owner

Thanks for your suggestion. I guess this means the __setitem__ method should be able to take any iterable whose elements can be used as indicies. I don't think such a feature would find wide enough use for introducing the additional complexity. Just use:

for i in multiple_indices:
    a[i] = True

@christian2022
Copy link
Author

I'd like to see it compatible to numpy's advanced indexing. Not sure if they support iterable in general or only tuple and list.

@ilanschnell
Copy link
Owner

Ah, thanks. I'm not sure how useful advanced indexing features would be for a one dimensional array. As I wish to keep bitarray low-level and simple, maybe it would make sense to create a (potentially separate) library which allows creating N-dimensional bitarrays (with all the numpy features, and relying on the existing one dimensional bitarray).

@christian2022
Copy link
Author

I agree, n dimensional would be an additional beast, but just setting a list of indices for one-dimensional would help a lot. Because looping in python is slow.

@ilanschnell
Copy link
Owner

I don't want to add more functionality to the core bitarray libarray. Also Python's bytearray object does not allow list (or iterables) as an argument on the __setitem__ method. Therefore, I'm thinking about adding a utility function bitarray.util.setitems() which takes an iterable. The signature and docstring would look like this:

setitems(bitarray, iterable, value)

Iterate over indices and sets corresponding elements of bitarray to value.

So your initial example bitarray[[1,2,5]] = True could be written as:

from bitarray.util import setitems

setitems(a, [1, 2, 5], True)

What do you think?

@christian2022
Copy link
Author

I personally love the elegance of numpy's syntax - but the decision is up to you. I'm fine with your proposal.

What I wonder more is if some efficiency can be gained above a simple loop in C. My indices will be in random order (hashes for bloom filter) and this might result in many pages swaps.

@ilanschnell
Copy link
Owner

I would expect the number of indices in the Bloom filter to be rather small, not more than maybe 10. One could sort those indices first, but I'm not sure if it's worth it.

For a Bloom filter, a function which checks if all of the hash indices are True should also be fast. In https://github.com/ilanschnell/bitarray/blob/master/examples/bloom.py this done by:

all(self.array[i] for i in self._hashes(key))

Having a getitems() function (which returns a bitarray of values corresponding to indices) would not be efficient, because as soon as an element is False, one cat terminate the loop. So having a C function which does the equivalent would probably help:

allitems(bitarray, iterable_of_indices)

Iterate over indices and return False as soon as corresponding element in bitarray is 0.

@christian2022
Copy link
Author

The number of hash values per item is low, but if you add thousands of items in a batch to a Bloom filter, you will have many assignents and then swapping of memory pages might become a limiting factor. But no need to provide this in setitems, as the caller can also take care for this.
Not directly related, but why processing a sorted array is faster might be an interesting read.

For getitems my use case is to identify which items from a batch of items (each with multiple hashes per item) have (probably) already been added. So getting the bitarray

[all(self.array[i] for i in self._hashes(item)) for item in batch]

which returns a mask for each item in the batch.

Related to getitems: As the number of hashes per item is low (at least for my Bloom filter use case), I fear more performance is lost by checking for early breaking instead of running branchless till the end.

allitems: A better signature for my use case would be allitems(bitarray, iterable_of_iterable_of_indices.
And even better for my use case: a function that returns a mask, which items have been added:

result = bitarray(len(batch))
for i in range(len(batch)):
    for j in self._hashes(batch[i]):
        result[i] |= ~self.array[j]
        self.array[j] = True
return result

@vsraptor
Copy link

vsraptor commented Apr 2, 2022

here is what I do for 2D bitarray (.mx is 1D) using Cython and it is pretty fast... you can redo it for 1D will be much simpler .. cause you use only one slice.


	def __getitem__(self, slices not None):
		cdef IXTYPE ix, nix
		cdef CTYPE r,c, rng
		cdef int i = 0, j = 0

		assert slices[0] is not None
		assert slices[1] is not None

		# exact [r,c]
		# if type0 is int and type1 is int :
		if isinstance(slices[0], (int,np.integer)) and isinstance(slices[1], (int,np.integer)) :
			ix = slices[0] * self.ncols + slices[1]
			return getbit(self.mx,ix)
			# return self.mx[ix]

		if   isinstance(slices[0], slice) : rixs = range(*slices[0].indices(self.nrows)) 
		elif isinstance(slices[0], (list,np.ndarray)) : rixs = slices[0]
		elif isinstance(slices[0], (int,np.integer)) : rixs = [slices[0]]

		if   isinstance(slices[1], slice) : cixs = range(*slices[1].indices(self.ncols))
		elif isinstance(slices[1], (list,np.ndarray)) : cixs = slices[1]
		elif isinstance(slices[1], (int,np.integer)) : cixs = [slices[1]]

		#cant use clen for range()
		new = BinaryMatrix((len(rixs), len(cixs)))

		# print(rixs, cixs)
		for i,r in enumerate(rixs) :
			for j,c in enumerate(cixs) :
				ix = r * self.ncols + c
				nix = i * new.ncols + j
				# print(f'{ix},{nix},i:{i},j:{j},{r},{c}')
				# new.mx[nix] = self.mx[ix]
				setbit( new.mx, nix, getbit(self.mx, ix) )

		return new		


	def __setitem__(self, slices not None, value not None):
		cdef IXTYPE ix
		cdef CTYPE r,c
		cdef int i = 0, j = 0

		assert slices[0] is not None
		assert slices[1] is not None

		if isinstance(slices[0], (int,np.integer)) and isinstance(slices[1], (int,np.integer)) :
			ix = slices[0] * self.ncols + slices[1]
			setbit(self.mx,ix, value)
			# self.mx[ix] = 1
			return

		if   isinstance(slices[0], slice) : rixs = range(*slices[0].indices(self.nrows)) 
		elif isinstance(slices[0], (list,np.ndarray)) : rixs = slices[0]
		elif isinstance(slices[0], (int,np.integer)) : rixs = [slices[0]]

		if   isinstance(slices[1], slice) : cixs = range(*slices[1].indices(self.ncols))
		elif isinstance(slices[1], (list,np.ndarray)) : cixs = slices[1]
		elif isinstance(slices[1], (int,np.integer)) : cixs = [slices[1]]

		vtype = type(value)

		if vtype is int :
			for r in rixs :
				for c in cixs :
					ix = r * self.ncols + c
					setbit( self.mx, ix, value )

		elif vtype is list or vtype is np.ndarray :
			ncols = len(value[0])
			for i,r in enumerate(rixs) :
				for j,c in enumerate(cixs) :
					ix = r * self.ncols + c
					# print(f'{ix},i:{i},j:{j},{r},{c}')
					setbit( self.mx, ix, value[i][j] )

@ilanschnell
Copy link
Owner

#204 has been merged, such that the original feature request

bitarray[[1,2,5]] = True

is working now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants