utdemir / distributed-dataset Public

A distributed data processing framework in Haskell.

BSD-3-Clause license

distributed-dataset

A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

An example: /examples/gh/Main.hs
API documentation: https://utdemir.github.io/distributed-dataset/
Introduction blogpost: https://utdemir.com/posts/ann-distributed-dataset.html

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is heavily inspired by Apache Spark.

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.

distributed-dataset-opendatasets

Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.

Running the example

Clone the repository.

$ git clone https://github.com/utdemir/distributed-dataset
$ cd distributed-dataset

Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:
```
$ aws configure
```
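
If you are unsure whether credentials are already in place, a rough check can be sketched as below. The function name `have_aws_credentials` is hypothetical, and the check only covers the two most common sources (environment variables and the shared credentials file written by `aws configure`); other sources such as SSO or instance roles are not detected.

```shell
#!/usr/bin/env bash
# Return 0 if AWS credentials appear to be configured, 1 otherwise.
# Assumption: only environment variables and the shared credentials
# file are inspected; this is a heuristic, not a validity check.
have_aws_credentials() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    return 0
  fi
  # -s: the file exists and is not empty
  [ -s "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]
}

if have_aws_credentials; then
  echo "AWS credentials found"
else
  echo "no AWS credentials found; run: aws configure"
fi
```

This only tells you that something is configured. To confirm that the credentials actually work, `aws sts get-caller-identity` asks AWS which identity they belong to.
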
Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:
```
$ aws s3api create-bucket --bucket my-s3-bucket
```
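
The command above works as written when your default region is us-east-1. In any other region, `aws s3api create-bucket` also needs a matching `LocationConstraint`. A minimal sketch that prints the right command for a given region follows; the helper name `create_bucket_cmd` and the example region are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print the `aws s3api create-bucket` command line for a bucket and region.
# Buckets outside us-east-1 need an explicit LocationConstraint, while
# us-east-1 must not be given one.
create_bucket_cmd() {
  local bucket="$1" region="${2:-us-east-1}"
  if [ "$region" = "us-east-1" ]; then
    echo "aws s3api create-bucket --bucket $bucket --region $region"
  else
    echo "aws s3api create-bucket --bucket $bucket --region $region --create-bucket-configuration LocationConstraint=$region"
  fi
}

# Print the command for review; run it by hand once it looks right.
create_bucket_cmd my-s3-bucket eu-west-1
```

The sketch only prints the command instead of running it, so you can check the bucket name and region before anything is created.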

Build and run the example:

If you use Nix on Linux:

(Recommended) Use my binary cache on Cachix to reduce compilation times:

$ nix-env -i cachix # or your preferred installation method
$ cachix use utdemir

Then:

$ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket

If you use stack (requires Docker, works on Linux and macOS):

$ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

To develop distributed-dataset, you can use:
- On Linux: Nix, cabal-install or stack.
- On macOS: stack with Docker.

Use ormolu to format source code.

Nix

You can use my binary cache on Cachix so that you don't recompile half of Hackage.
nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer, along with all required Haskell and system dependencies. You can use cabal new-* commands there.
The easiest way to get a development environment is to run sos at the top-level directory inside a nix-shell.

Stack

Make sure that you have Docker installed.
Use stack as usual; it will automatically use a Docker image.
Run ./make.sh stack-build before you send a PR to test different resolvers.

Related Work

Papers

Towards Haskell in the Cloud by Jeff Epstein, Andrew P. Black, Simon L. Peyton Jones
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing by Matei Zaharia, et al.

Projects

Apache Spark.
Sparkle: Run Haskell on top of Apache Spark.
HSpark: Another attempt at porting Apache Spark to Haskell.
