3. Prepare Input Dataset

JOHN THORPE edited this page Apr 2, 2021 · 1 revision

Chapter 3. Prepare Input Dataset

Previous page: 2. Build Dorylus | Next page: 4. Setup Lambda Functions | Home: Home

Our system takes an input graph, an initial features file, and a labels file. First prepare these raw inputs as text files. Then use our tools under the inputs/ folder to convert them into binary snaps (for efficient loading) and to partition the graph.

Get Your Raw Inputs Ready

Prepare your raw input graph in a text file (for example small.graph). Vertices are numbered starting from 0. Each line of the file represents one edge from SourceVID to DestinationVID. The input graph is treated as directed by default. An example looks like:

# Example raw graph, with 6 vertices and 5 directed edges.
0 1
0 2
1 3
2 4
3 5
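
The ./prepare command described later on this page needs the vertex count, so it can help to scan the raw graph before converting it. The following is a minimal sketch, not part of Dorylus: the function name scan_edge_list is hypothetical, and it assumes that blank lines and lines starting with # can be skipped and that the highest-numbered vertex appears in at least one edge.

```python
import io

def scan_edge_list(lines):
    """Return (num_vertices, num_edges) for lines of 'SourceVID DestinationVID'."""
    max_vid = -1
    num_edges = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored here;
        # the page does not say whether the real converter accepts them.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 'SourceVID DestinationVID'")
        src, dst = int(fields[0]), int(fields[1])
        if src < 0 or dst < 0:
            raise ValueError(f"line {lineno}: vertex IDs must start from 0")
        max_vid = max(max_vid, src, dst)
        num_edges += 1
    # Vertices are numbered from 0, so the count is the largest ID plus one.
    return max_vid + 1, num_edges

# The example graph from above.
example = io.StringIO("0 1\n0 2\n1 3\n2 4\n3 5\n")
num_vertices, num_edges = scan_edge_list(example)
print(num_vertices, num_edges)  # 6 5
```

For a real file, pass an open file object instead of the io.StringIO stand-in. If the graph has isolated vertices with IDs above the largest one used in an edge, the count returned here will be too low.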

Prepare your initial features in a text file (for example features). The i-th line in the file (starting from 0) holds the initial feature values for the vertex with VID = i. Values in a line should be separated by commas, and every line must contain the same number of values (corresponding to the feature dimension of the input layer). An example looks like:

# Example initial features for the above graph, with input dimension = 4.
0.3, 0.2, 0.5, 0.7
0.7, 0.5, 0.5, 0.3
0.1, 0.2, 0.8, 0.7
0.3, 0.4, 0.5, 0.1
0.3, 0.4, 0.2, 0.1
0.3, 0.6, 0.5, 0.8
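
Since every line must have the same number of values, a quick consistency check before conversion can catch malformed rows early. This is a sketch under assumptions, not a Dorylus tool: scan_features is a hypothetical name, and skipping blank lines and # comment lines is an assumption about the file format.

```python
import io

def scan_features(lines):
    """Return (num_rows, dim) after checking that all rows have equal length."""
    dim = None
    num_rows = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are not data rows.
        if not line or line.startswith("#"):
            continue
        # float() tolerates the space that follows each comma.
        values = [float(v) for v in line.split(",")]
        if dim is None:
            dim = len(values)
        elif len(values) != dim:
            raise ValueError(
                f"line {lineno}: got {len(values)} values, expected {dim}"
            )
        num_rows += 1
    return num_rows, dim

# The example features from above: 6 vertices, input dimension 4.
example = io.StringIO(
    "0.3, 0.2, 0.5, 0.7\n"
    "0.7, 0.5, 0.5, 0.3\n"
    "0.1, 0.2, 0.8, 0.7\n"
    "0.3, 0.4, 0.5, 0.1\n"
    "0.3, 0.4, 0.2, 0.1\n"
    "0.3, 0.6, 0.5, 0.8\n"
)
num_rows, dim_features = scan_features(example)
print(num_rows, dim_features)  # 6 4
```

The row count should match the number of vertices in the graph, and the dimension is the value to pass as the feature dimension when converting.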

Prepare your labels in a text file (for example labels). The i-th line in the file (starting from 0) holds the label value for the vertex with VID = i. Each line should contain exactly one unsigned integer. Label values should range from 0 up to, but not including, the dimension of features in the output layer. Our engine one-hot encodes them, so the first value of the output features corresponds to the similarity to label 0, and so on. An example looks like:

# Example labels file for the above graph, with output dimension = 4 (so labels can only be 0 - 3).
0
3
2
1
2
3
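
To make the one-hot encoding concrete, here is a small illustration of how a label maps to a vector of the output dimension. It is a sketch of the general technique only; the function name one_hot is hypothetical and this is not the engine's actual code.

```python
def one_hot(label, num_kinds):
    """Return a list of length num_kinds with a 1.0 at index `label`."""
    if not 0 <= label < num_kinds:
        raise ValueError(f"label {label} is outside 0..{num_kinds - 1}")
    return [1.0 if k == label else 0.0 for k in range(num_kinds)]

# The example labels from above, with output dimension 4.
labels = [0, 3, 2, 1, 2, 3]
encoded = [one_hot(label, 4) for label in labels]
print(encoded[1])  # [0.0, 0.0, 0.0, 1.0]
```

A label equal to or larger than the output dimension has no slot in the vector, which is why the upper bound is excluded.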

Build the Data Preparation Tools

We now use the Gemini system's partitioner. This system needs to be reworked.

Convert into Binary Snaps & Partitioning

After successfully compiling the utilities, generate the binary snaps and partition the graph with a single command:

Local$ cd inputs/
Local$ ./prepare <PathToRawGraph> <Undirected? (0/1)> <NumVertices> <NumPartitions> <PathToRawFeatures> <DimFeatures> <PathToRawLabels> <LabelKinds>

The number of partitions (<NumPartitions>) should be set to exactly the number of data servers (graph servers) in your EC2 cluster, and must be greater than 1. Each node grabs its own partition to process. The graph is normally directed (<Undirected> = 0).
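
With eight positional arguments, the order is easy to get wrong. The helper below is a hypothetical sketch that assembles the command line in the order shown above and enforces the partition constraint. It assumes that <LabelKinds> is the number of distinct label values, which this page does not state explicitly, and the file names in the demo are the illustrative ones used earlier.

```python
import shlex

def build_prepare_command(graph, undirected, num_vertices, num_partitions,
                          features, dim_features, labels, label_kinds):
    """Return the ./prepare argument list in the documented order."""
    if num_partitions <= 1:
        raise ValueError("NumPartitions must be greater than 1")
    return [
        "./prepare",
        graph,
        "1" if undirected else "0",  # <Undirected? (0/1)>
        str(num_vertices),
        str(num_partitions),
        features,
        str(dim_features),
        labels,
        str(label_kinds),  # assumption: number of distinct label values
    ]

# Small example graph: directed, 6 vertices, 2 partitions, 4 features, 4 labels.
cmd = build_prepare_command("small.graph", False, 6, 2,
                            "features", 4, "labels", 4)
print(shlex.join(cmd))  # ./prepare small.graph 0 6 2 features 4 labels 4
```

The resulting list can be handed to subprocess.run(cmd, cwd="inputs", check=True) once the utilities are built.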

When done, you should have a folder named data/ under inputs/. This will be the dataset ready for battle. Its directory tree looks like:

<Name>
|-- features.bsnap
|-- graph.bsnap                 # File names should be exactly the same.
|-- parts_3
    |-- graph.bsnap.comm
    |-- graph.bsnap.edges
    |-- graph.bsnap.parts
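
Because the file names must match exactly, it may be worth checking the output folder before copying it anywhere. This sketch lists whichever of the files in the tree above are missing. The function name is hypothetical, and it assumes that the partition folder is named parts_<NumPartitions> (parts_3 in the tree above, presumably for three partitions).

```python
import tempfile
from pathlib import Path

def missing_dataset_files(dataset_dir, num_partitions):
    """Return the expected dataset files that are absent, as relative paths."""
    root = Path(dataset_dir)
    # Assumption: the partition folder is named parts_<NumPartitions>.
    parts = root / f"parts_{num_partitions}"
    expected = [
        root / "features.bsnap",
        root / "graph.bsnap",
        parts / "graph.bsnap.comm",
        parts / "graph.bsnap.edges",
        parts / "graph.bsnap.parts",
    ]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]

# Demo on a throwaway directory that holds only the two top-level files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("features.bsnap", "graph.bsnap"):
        (Path(tmp) / name).touch()
    missing = missing_dataset_files(tmp, 3)
print(missing)
```

An empty list means every file shown in the tree is present.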

Put your dataset on the NFS server

To start, take the prepared and partitioned dataset and put it on an NFS server so that all nodes can access it. Mount the NFS server on each of the workers under a directory called /filepool. Dorylus looks for datasets there, so make sure it is mounted under exactly this directory.
