
Faster way to get all address balances #264

Closed
simonohanlon101 opened this issue Apr 23, 2019 · 13 comments
Labels
question Question related to using BlockSci

Comments

@simonohanlon101

System Information (if applicable)

AWS: r5.2xlarge
BlockSci version: 0.5
Using AMI: no
Total memory: 64GB

My use-case is that I would like to calculate the current balance of all addresses on the blockchain. I may want to run this query quite a lot, and possibly for historical block heights. My current way to do this would be:

import blocksci

chain = blocksci.Blockchain('/data/blocksci')
balances = chain.addresses(address_type=blocksci.address_type.pubkey).balance()

Is there a more efficient (faster) way to do this? I understand BlockSci is optimised for iterating over transactions rather than addresses. Is there therefore a better way to achieve the desired result?

Many thanks,

Simon

@simonohanlon101
Author

simonohanlon101 commented Apr 24, 2019

I managed to come up with a faster solution using the map-reduce framework: iterate over all the blocks, assign the values of unspent outputs to their addresses, then reduce these per-block results across all blocks.

import blocksci
from collections import Counter

def calculateBlockBals(block):
    """Sum the values of a block's still-unspent outputs, keyed by address."""
    balances = Counter()
    addresses = list(block.outputs.unspent.address)
    values = list(block.outputs.unspent.value)

    for address, value in zip(addresses, values):
        addr_type = int(address.type)

        if addr_type in (0, 6):
            # skip these address types
            continue
        elif addr_type == 5:
            # multisig: use the string of constituent addresses
            addr = str(address.addresses)
        else:
            addr = address.address_string

        balances[addr] += value / 1e8  # satoshi -> BTC; Counter defaults missing keys to 0

    return balances

def calculateAllBals(chain):
    def mapFunc(block):
        return [calculateBlockBals(block)]

    def reduceFunc(x, y):
        return x + y  # list concatenation: collects one Counter per block

    return chain.mapreduce_block_ranges(mapFunc, reduceFunc, cpu_count=14)

%time data = calculateAllBals(chain)

# CPU times: user 1min 39s, sys: 9.78 s, total: 1min 49s
# Wall time: 3min 32s

This seems to be pretty performant. If this isn't the correct way, or if there is a better way, I'd still be interested to hear about it.
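One detail worth noting: because mapFunc wraps each block's Counter in a list and reduceFunc concatenates with +, the returned data is a list of per-block Counters rather than a single merged mapping. A minimal sketch of folding those into one overall balance mapping (hypothetical addresses and values):

```python
from collections import Counter

# hypothetical per-block results of the map step above
parts = [Counter({"addr_a": 1.5, "addr_b": 0.3}), Counter({"addr_a": 0.5})]

# Counter addition sums values per key, so this folds all parts into one mapping
total = sum(parts, Counter())
# total["addr_a"] is now 2.0
```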

Thanks!

@jpzk

jpzk commented Apr 24, 2019

Looks legit. :) BlockSci team, can you integrate this implementation into the API? Also, will you support more pubkey-related API methods in the future? Thanks!

@maltemoeser
Member

@simonohanlon101 in v0.5 I think your multithreaded approach is probably the most performant. In v0.6 the calculateBlockBals implementation could be replaced with:

return Counter(block.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
))

That might be interesting to compare performance-wise with just running

chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
)

I'll see if I have the time to do some evaluations of these.

@jpzk: what do you mean by supporting more API methods? Is there something missing with respect to accessing address data?

@maltemoeser
Member

Somewhat related to #192

@maltemoeser maltemoeser added the question Question related to using BlockSci label Apr 25, 2019
@maltemoeser
Member

@jpzk The next version of BlockSci (v0.6) will have a generic group_by function, which should be exactly what you're looking for. In terms of aggregation functions, there's currently support for sum, min and max. So the code above could simply be replaced by

address_balances = chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
)

And because the aggregation is happening in C++ (and not in Python as in the map_reduce example), it's also pretty fast (some benchmarks).
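For readers without v0.6 at hand, the semantics of group_by with a sum aggregation can be illustrated in plain Python (toy data, not the BlockSci API): bucket the unspent outputs by address, then sum each bucket's values.

```python
from collections import defaultdict

# toy stand-ins for unspent outputs: (address, value-in-satoshi) pairs
unspent = [("addr_a", 5_000_000_000), ("addr_b", 1_000_000_000), ("addr_a", 2_500_000_000)]

# group_by(key, sum): group by address, aggregate each group with sum
grouped = defaultdict(int)
for address, value in unspent:
    grouped[address] += value
# grouped["addr_a"] is now 7_500_000_000
```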

@simonohanlon101
Author

> @jpzk The next version of BlockSci (v0.6) will have a generic group_by function, which should be exactly what you're looking for. In terms of aggregation functions, there's currently support for sum, min and max. So the code above could simply be replaced by
>
>     address_balances = chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
>         lambda output: output.address,
>         lambda outputs: outputs.value.sum
>     )
>
> And because the aggregation is happening in C++ (and not in Python as in the map_reduce example), it's also pretty fast (some benchmarks).

Wow, this is great, thank you. I'll try running it on v0.6 as well.

Thank you.

@maltemoeser
Member

Awesome, I've added this issue to the FAQ. Closing for now, feel free to reopen if you have follow-up questions.

@joequant

What's the fastest way of getting historical balances?

@maltemoeser
Member

maltemoeser commented Jul 17, 2019

@joequant addr.balance(height) takes an optional parameter for a (historical) block height at which you'd like to get the balance. For the examples above, simply choosing a block range up to the point you're interested in should be sufficient.
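Translated to the map-reduce approach above, "choosing a block range up to the point you're interested in" amounts to filtering by height before aggregating. A plain-Python sketch of the idea, using hypothetical (height, address, value) tuples rather than the BlockSci API:

```python
from collections import Counter

# hypothetical unspent outputs as (block_height, address, value) tuples
outputs = [(100, "addr_a", 5.0), (200, "addr_b", 2.0), (300, "addr_a", 1.0)]

def balances_at(outputs, height):
    """Balances considering only outputs created at or below the target height."""
    balances = Counter()
    for h, addr, value in outputs:
        if h <= height:
            balances[addr] += value
    return balances

# balances_at(outputs, 200) counts the first two outputs only
```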

@jiagengliu

> @joequant addr.balance(height) takes an optional parameter for a (historical) block height at which you'd like to get the balance. For the examples above, simply choosing a block range up to the point you're interested in should be sufficient

Is it possible to pass a list of block heights and compute the balances faster than computing individually? @maltemoeser

@maltemoeser
Member

@jiagengliu nope, though you could probably write such a function easily based on the existing one.
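One way such a function could work: sort the requested heights and snapshot a running Counter in a single pass over the outputs, instead of recomputing from scratch for each height. A sketch using hypothetical (height, address, value) tuples, not the BlockSci API:

```python
from collections import Counter

def balances_at_heights(outputs, heights):
    """Snapshot running balances at each requested height in one pass over the outputs."""
    snapshots = {}
    balances = Counter()
    sorted_outputs = sorted(outputs)   # ascending by block height
    idx = 0
    for h in sorted(heights):
        # consume all outputs created at or below height h
        while idx < len(sorted_outputs) and sorted_outputs[idx][0] <= h:
            _, addr, value = sorted_outputs[idx]
            balances[addr] += value
            idx += 1
        snapshots[h] = Counter(balances)  # copy, so later updates don't mutate it
    return snapshots

outputs = [(100, "addr_a", 5.0), (200, "addr_b", 2.0), (300, "addr_a", 1.0)]
snaps = balances_at_heights(outputs, [200, 300])
# snaps[200]["addr_a"] is 5.0; snaps[300]["addr_a"] is 6.0
```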

@ixnuhs

ixnuhs commented Apr 9, 2020

@maltemoeser I ran the script provided by @simonohanlon101 on the same AWS instance type with a 1TB HDD. It took 3min of CPU time but 40min of wall time. I am wondering what the reason could be for the much longer wall time?

@maltemoeser
Member

@shun61 most likely the HDD. I'd recommend using an SSD.
