A toolkit for using Apache Spark to efficiently query Bitcoin blockchain data.

JeremyRubin/BTCSpark

Bitcoin Spark Framework (BTCSpark)

What is BTCSpark?

BTCSpark is a layer for accessing the Bitcoin Blockchain from Apache Spark.

The goal of BTCSpark is to offer high quality, easy to use, performant, and free software to Bitcoin developers and analysts.

NOTE: BTCSpark is currently unmaintained. BlockSci is a similar project with better performance, available here.

Benchmarks

The following benchmark computes the Transaction Output Amount Distribution (TOAD). On a 6-node AWS m3.large cluster (5 slaves, 1 master), with the blockchain stored in Hadoop on ephemeral storage, it takes 8.4 minutes to run using the nativ_lazy_blockchain implementation.

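    # sb is assumed to be an already-initialized BTCSpark handle (setup not shown).
    block_objs = sb.fetch_chain()
    # Blocks, transactions, and outputs are lazy: calling one parses it.
    unlazy = lambda x: x()
    txns = block_objs.map(unlazy)\
                     .flatMap(lambda b: b.txns)\
                     .map(unlazy)
    # Round each output value down to a multiple of 2**14, then count per bucket.
    txns.flatMap(lambda txn:
                 map(lambda txo: ((txo.value >> 14) << 14, 1),
                     txn.tx_outs.map(unlazy)))\
        .reduceByKey(lambda x, y: x + y)\
        .saveAsTextFile("txouts_values")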
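
The right shift followed by a left shift by 14 applied to `txo.value` clears the low 14 bits, so each amount is rounded down to a multiple of 2^14 = 16384 before it is counted. The following is a minimal plain-Python sketch of that bucketing and counting step, without Spark. It assumes output values are integer satoshi amounts. The `bucket` and `toad` helpers and the sample values are hypothetical, are not part of BTCSpark, and are shown only for illustration.

```python
from collections import Counter

BUCKET_BITS = 14  # same shift width as in the benchmark above


def bucket(value, bits=BUCKET_BITS):
    """Round value down to a multiple of 2**bits by clearing its low bits."""
    return (value >> bits) << bits


def toad(values):
    """Count values per bucket (a local stand-in for reduceByKey)."""
    return Counter(bucket(v) for v in values)


# Hypothetical sample amounts in satoshis, for illustration only.
sample = [0, 16383, 16384, 20000, 50000, 100000000]
histogram = toad(sample)
```

Widening the shift produces coarser buckets and fewer distinct keys to reduce. Narrowing it keeps more detail at the cost of a larger result.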

Finding the blocks whose coinbase script contains "BIP100" takes 5.0 minutes on the same cluster.

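    # The first transaction of a block is the coinbase; search its first input's script.
    # result_name is a helper that is not shown here.
    block_objs.map(unlazy)\
              .map(lambda b: b.txns[0]().tx_ins[0]().signature_script)\
              .filter(lambda f: "BIP100" in f)\
              .saveAsTextFile(result_name("BIP100_Blocks"))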
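
The query above relies on lazily parsed objects: each block, transaction, and input is a zero-argument callable that returns the parsed value when called. The following plain-Python sketch shows that access pattern on toy data. The `TxIn`, `Txn`, and `Block` tuples, the `lazy`, `make_block`, and `coinbase_script` helpers, and the sample scripts are all hypothetical stand-ins; BTCSpark's real classes may look different.

```python
from collections import namedtuple

# Hypothetical stand-ins for parsed objects; only the field names used above are modelled.
TxIn = namedtuple("TxIn", ["signature_script"])
Txn = namedtuple("Txn", ["tx_ins"])
Block = namedtuple("Block", ["txns"])


def lazy(value):
    """Wrap a value in a zero-argument callable, mimicking a lazily parsed object."""
    return lambda: value


def make_block(script):
    """Build a toy block whose only transaction has one input carrying script."""
    coinbase = Txn(tx_ins=[lazy(TxIn(signature_script=script))])
    return Block(txns=[lazy(coinbase)])


def coinbase_script(block):
    """Return the signature script of the first input of the first transaction."""
    return block.txns[0]().tx_ins[0]().signature_script


# Two toy blocks with made-up coinbase scripts.
blocks = [lazy(make_block("/BIP100/pool-a/")), lazy(make_block("/pool-b/"))]
scripts = [coinbase_script(b()) for b in blocks]
tagged = [s for s in scripts if "BIP100" in s]
```

Only the first transaction and its first input are forced here, which is why a query like this can skip parsing the rest of each block.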

Note: Unless you have a lot of memory, or you have substantially reduced the working set, caching is not recommended, because the overhead of re-parsing is modest.

License

BTCSpark is released under the terms of the AGPL license. See COPYING for more information. A non-free license may also be purchased from Jeremy Rubin by organizations that are unable to use AGPL-licensed software.
