
Faster way to get all address balances #264

Closed
simonohanlon101 opened this issue Apr 23, 2019 · 13 comments
Labels
question Question related to using BlockSci

Comments

@simonohanlon101

System Information (if applicable)

AWS: r5.2xlarge
BlockSci version: 0.5
Using AMI: no
Total memory: 64GB

My use-case is that I would like to calculate the current balance of all addresses on the blockchain. I may want to run this query quite a lot, and possibly for historical block heights. My current way to do this would be:

import blocksci

chain = blocksci.Blockchain('/data/blocksci')
balances = chain.addresses(address_type=blocksci.address_type.pubkey).balance()

Is there a more efficient (faster) way to do this? I understand BlockSci is optimised for iterating over transactions rather than addresses. Is there therefore a better way to achieve the desired result?

Many thanks,

Simon

@simonohanlon101
Author

simonohanlon101 commented Apr 24, 2019

I managed to come up with a faster solution using the map-reduce framework: iterate over all the blocks, assign the values of unspent outputs to their addresses, then reduce these per-block results across all blocks.

import blocksci
from collections import Counter

def calculateBlockBals(block):
    """Sum the values of a block's still-unspent outputs, keyed by address."""
    balances = Counter()
    addresses = list(block.outputs.unspent.address)
    values = list(block.outputs.unspent.value)

    for address, value in zip(addresses, values):
        addr_type = int(address.type)

        if addr_type in (0, 6):
            # skip these address types
            continue
        elif addr_type == 5:
            # multisig: use the string of constituent addresses
            addr = str(address.addresses)
        else:
            addr = address.address_string

        balances[addr] += value / 1e8  # satoshi -> BTC; Counter defaults missing keys to 0

    return balances

def calculateAllBals(chain):
    def mapFunc(block):
        return [calculateBlockBals(block)]

    def reduceFunc(x, y):
        return x + y  # list concatenation: collects one Counter per block

    return chain.mapreduce_block_ranges(mapFunc, reduceFunc, cpu_count=14)

%time data = calculateAllBals(chain)

# CPU times: user 1min 39s, sys: 9.78 s, total: 1min 49s
# Wall time: 3min 32s

This seems to be pretty performant. If this isn't the correct way, or if there is a better way, I'd still be interested to hear about it.
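One detail worth noting: because mapFunc wraps each block's Counter in a list and reduceFunc concatenates with +, the returned data is a list of per-block Counters rather than a single merged mapping. A minimal sketch of folding those into one overall balance mapping (hypothetical addresses and values):

```python
from collections import Counter

# hypothetical per-block results of the map step above
parts = [Counter({"addr_a": 1.5, "addr_b": 0.3}), Counter({"addr_a": 0.5})]

# Counter addition sums values per key, so this folds all parts into one mapping
total = sum(parts, Counter())
# total["addr_a"] is now 2.0
```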

Thanks!

@jpzk

jpzk commented Apr 24, 2019

Looks legit. :) BlockSci team, can you integrate this implementation into the API? Also, will you support more pubkey-related API methods in the future? Thanks!

@maltemoeser
Member

@simonohanlon101 in v0.5 I think your multithreaded approach is probably the most performant. In v0.6 the calculateBlockBals implementation could be replaced with:

return Counter(block.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
))

That might be interesting to compare performance-wise with just running

chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
)

I'll see if I have the time to do some evaluations of these.

@jpzk: what do you mean by supporting more API methods? Is there something missing with respect to accessing address data?

@maltemoeser
Member

Somewhat related to #192

@maltemoeser maltemoeser added the question Question related to using BlockSci label Apr 25, 2019
@maltemoeser
Member

@jpzk The next version of BlockSci (v0.6) will have a generic group_by function, which should be exactly what you're looking for. In terms of aggregation functions, there's currently support for sum, min and max. So the code above could simply be replaced by

address_balances = chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
    lambda output: output.address,
    lambda outputs: outputs.value.sum
)

And because the aggregation is happening in C++ (and not in Python as in the map_reduce example), it's also pretty fast (some benchmarks).
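For readers without v0.6 at hand, the semantics of group_by with a sum aggregation can be illustrated in plain Python (toy data, not the BlockSci API): bucket the unspent outputs by address, then sum each bucket's values.

```python
from collections import defaultdict

# toy stand-ins for unspent outputs: (address, value-in-satoshi) pairs
unspent = [("addr_a", 5_000_000_000), ("addr_b", 1_000_000_000), ("addr_a", 2_500_000_000)]

# group_by(key, sum): group by address, aggregate each group with sum
grouped = defaultdict(int)
for address, value in unspent:
    grouped[address] += value
# grouped["addr_a"] is now 7_500_000_000
```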

@simonohanlon101
Author

> @jpzk The next version of BlockSci (v0.6) will have a generic group_by function, which should be exactly what you're looking for. In terms of aggregation functions, there's currently support for sum, min and max. So the code above could simply be replaced by
>
>     address_balances = chain.blocks.outputs.where(lambda o: ~o.is_spent).group_by(
>         lambda output: output.address,
>         lambda outputs: outputs.value.sum
>     )
>
> And because the aggregation is happening in C++ (and not in Python as in the map_reduce example), it's also pretty fast (some benchmarks).

Wow, this is great, thank you. I'll try running it on v0.6 as well.

Thank you.

@maltemoeser
Member

Awesome, I've added this issue to the FAQ. Closing for now, feel free to reopen if you have follow-up questions.

@joequant

What's the fastest way of getting historical balances?

@maltemoeser
Member

maltemoeser commented Jul 17, 2019

@joequant addr.balance(height) takes an optional parameter for a (historical) block height at which you'd like to get the balance. For the examples above, simply choosing a block range up to the point you're interested in should be sufficient.
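Translated to the map-reduce approach above, "choosing a block range up to the point you're interested in" amounts to filtering by height before aggregating. A plain-Python sketch of the idea, using hypothetical (height, address, value) tuples rather than the BlockSci API:

```python
from collections import Counter

# hypothetical unspent outputs as (block_height, address, value) tuples
outputs = [(100, "addr_a", 5.0), (200, "addr_b", 2.0), (300, "addr_a", 1.0)]

def balances_at(outputs, height):
    """Balances considering only outputs created at or below the target height."""
    balances = Counter()
    for h, addr, value in outputs:
        if h <= height:
            balances[addr] += value
    return balances

# balances_at(outputs, 200) counts the first two outputs only
```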

@jiagengliu

> @joequant addr.balance(height) takes an optional parameter for a (historical) block height at which you'd like to get the balance. For the examples above, simply choosing a block range up to the point you're interested in should be sufficient

Is it possible to pass a list of block heights and compute the balances faster than computing individually? @maltemoeser

@maltemoeser
Member

@jiagengliu nope, though you could probably write such a function easily based on the existing one.
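One way such a function could work: sort the requested heights and snapshot a running Counter in a single pass over the outputs, instead of recomputing from scratch for each height. A sketch using hypothetical (height, address, value) tuples, not the BlockSci API:

```python
from collections import Counter

def balances_at_heights(outputs, heights):
    """Snapshot running balances at each requested height in one pass over the outputs."""
    snapshots = {}
    balances = Counter()
    sorted_outputs = sorted(outputs)   # ascending by block height
    idx = 0
    for h in sorted(heights):
        # consume all outputs created at or below height h
        while idx < len(sorted_outputs) and sorted_outputs[idx][0] <= h:
            _, addr, value = sorted_outputs[idx]
            balances[addr] += value
            idx += 1
        snapshots[h] = Counter(balances)  # copy, so later updates don't mutate it
    return snapshots

outputs = [(100, "addr_a", 5.0), (200, "addr_b", 2.0), (300, "addr_a", 1.0)]
snaps = balances_at_heights(outputs, [200, 300])
# snaps[200]["addr_a"] is 5.0; snaps[300]["addr_a"] is 6.0
```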

@ixnuhs

ixnuhs commented Apr 9, 2020

@maltemoeser I ran the script provided by @simonohanlon101 on the same AWS instance type with a 1TB HDD. It took 3min of CPU time but 40min of wall time. I am wondering what the reason could be for the much longer wall time?

@maltemoeser
Member

@shun61 most likely the HDD. I'd recommend using an SSD.
