
Overflow error on 3D datasets #38

Closed
pierre-guillou opened this issue May 17, 2021 · 6 comments


@pierre-guillou

Hello and thank you for your work!

I am trying to use Ripser to compute persistence diagrams of rather large 3D datasets, coming from the Open-SciVis-Datasets website.

However, Ripser throws a std::overflow_error when I try to compute persistence pairs up to dimension 2. If I understand the C++ code correctly, this exception comes from the computation of binomial coefficients, which depends on the size of the input distance matrix and the maximum requested dimension.

This issue might be related to #25 and #32, since they feature similar problems. I already opened an issue in the scikit-tda Python wrapper (I wanted to use Ripser via this Python API), but since this concerns the C++ code, it seems to belong here.

Below, you will find the Python script I used to trigger this overflow error. It takes a raw file from Open-SciVis-Datasets and iterates over the edges of the cubical complex to generate a sparse matrix in sparse triplet format, which is fed to the Ripser executable. Although the diagram is computed as expected with the smallest datasets (nucleon, marschner_lobb, silicium, up to 120k vertices), the error occurs with fuel, neghip (more than 250k vertices) and every larger dataset.

Is there a way to circumvent the computation of these binomial coefficients to handle large datasets (up to 1M vertices)?

Thanks in advance for your help,
Best regards,
Pierre Guillou

import argparse
import subprocess
import time

import numpy as np


def load_raw(input_raw):
    extension = input_raw.split(".")[-1]
    if extension != "raw":
        raise TypeError("Need a .raw file")

    # detect extent and data type from file name
    extent, dtype = input_raw.split(".")[0].split("_")[-2:]
    extent = [int(dim) for dim in extent.split("x")]

    dtype_np = {
        "uint8": np.uint8,
        "int16": np.int16,
        "uint16": np.uint16,
        "float32": np.float32,
        "float64": np.float64,
    }
    dtype = dtype_np[dtype]

    # open in binary mode: np.fromfile reads raw bytes
    with open(input_raw, "rb") as src:
        data = np.fromfile(src, dtype=dtype)
        return data.reshape(extent)


def compute_persistence(sparse_triplets, output_diagram, ripser_executable):
    proc = subprocess.Popen(
        [ripser_executable, "--format", "sparse", "--dim", "2"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    out, _ = proc.communicate(sparse_triplets)

    with open(output_diagram, "w") as dst:
        dst.write(out)


def build_sparse_triplets(data):
    triplets = []
    # loop over every edge of the cubical complex to get the sparse matrix triplets
    with np.nditer(data, flags=["multi_index"]) as it:
        for v in it:
            i = np.ravel_multi_index(it.multi_index, data.shape)
            for d in range(data.ndim):
                for step in (-1, 1):
                    coords = list(it.multi_index)
                    neighbor = coords[d] + step
                    if neighbor < 0 or neighbor >= data.shape[d]:
                        continue
                    coords[d] = neighbor
                    j = np.ravel_multi_index(coords, data.shape)
                    if i < j:  # upper triangle distance matrix
                        triplets.append([i, j, max(data.flat[j], v)])

    return "\n".join([f"{i} {j} {v}" for i, j, v in triplets])


def main(input_raw, output_diagram, ripser_executable):
    data = load_raw(input_raw)
    start = time.time()
    sparse_triplets = build_sparse_triplets(data)
    print(f"Build sparse triplets in {time.time() - start:.2f}s")
    compute_persistence(sparse_triplets, output_diagram, ripser_executable)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Wrapper around Ripser C++")
    parser.add_argument("input_raw", help="Input raw file from OpenSciVis datasets")
    parser.add_argument(
        "-o",
        "--output_diagram",
        help="Output diagram in Ripser format",
        default="out.ripser",
    )
    parser.add_argument(
        "-e",
        "--ripser_executable",
        help="Path to the Ripser executable",
        default="ripser/ripser",
    )
    args = parser.parse_args()

    main(args.input_raw, args.output_diagram, args.ripser_executable)
@MassEast

Can you also use Ripser on high dimensional data, like 10D?

@ulupo

ulupo commented Nov 10, 2021

Homology dimension k computations for a dataset of size N require that it be possible in principle to enumerate all (k+1)-simplices of the full simplex with N vertices. In your case (k=2) this means enumerating all 3-simplices (which are 4-tuples of vertices), and this requires that the binomial coefficient (N choose 4) not exceed 2^63 - 1 (the maximum index for signed 64-bit integers). With N=250000, N choose 4 is approximately 1.6e20, while 2^63 is about 9.2e18.
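To make the bound concrete, here is a small Python check using the sizes mentioned in this issue (the 250k figure for neghip and the 2^63 - 1 limit described above):

```python
import math

N = 250_000        # roughly the size of the neghip dataset
limit = 2**63 - 1  # largest simplex index representable with signed 64-bit integers

# homology in dimension 2 requires enumerating 3-simplices (4-tuples of
# vertices), so indices up to C(N, 4) must be representable
n_choose_4 = math.comb(N, 4)
print(f"C({N}, 4) = {n_choose_4:.2e}")  # about 1.6e20
print(f"2^63 - 1  = {limit:.2e}")       # about 9.2e18
print("overflow" if n_choose_4 > limit else "fits")  # prints "overflow"
```

Running the same check with N = 120_000 (the largest size reported to work above) gives roughly 8.6e18, which still fits below the limit, consistent with the behavior reported in this issue.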

[Edit: I may be off by 1 in some of the above, but the main point remains.]

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

@ulupo

ulupo commented Nov 10, 2021

@MassEast the dimensionality of the data is not an obstacle: Ripser only looks at the matrix of pairwise distances between all pairs of points in the dataset. This matrix takes slightly longer to compute in higher dimensions, but only marginally so, especially compared to the time it takes to compute persistence.
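As a quick illustration (a hypothetical sketch with random points, not data from this issue), the distance matrix has shape (N, N) regardless of the ambient dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100

for ambient_dim in (3, 10):
    points = rng.random((n_points, ambient_dim))
    # pairwise Euclidean distances via broadcasting
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # the matrix size depends only on the number of points, not the dimension
    print(ambient_dim, dist.shape)  # (100, 100) in both cases
```

Only the cost of filling in each entry grows with the ambient dimension; everything downstream in the persistence computation sees the same (N, N) input.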

@ubauer ubauer closed this as completed Apr 27, 2022
@julien-tierny

julien-tierny commented Apr 25, 2024

After modifying the backend to work with 128-bit integers, you should be able to run Ripser on your dataset. Both @ubauer and @MonkeyBreaker have tried this at some point AFAIK. It can be done!

Thanks @ulupo for the information.
@ubauer @MonkeyBreaker would it be possible for you to post the diff to get Ripser running with 128-bit integers? (It doesn't seem like a one-line edit.) Thanks!

@ubauer
Member

ubauer commented Apr 28, 2024

There is a branch at https://github.com/Ripser/ripser/tree/128bit implementing this. It should be mostly up to date, and it is actually pretty much a one-line edit. Since 128-bit integers are not standardized in C++, this might not work out of the box with some compilers.

@julien-tierny

Thanks!
