Skip to content
This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Flush profiles to disk per row groups #486

Merged

Conversation

simonswine
Copy link
Collaborator

This PR is flushes out profiles to a temporary on disk profile parquet file once a row group has been filled.

Further more it avoids using parquet.Buffer for sorting and rather uses a binary insert into the memory slice.

@simonswine simonswine force-pushed the 20230119_flush_profiles_row_groups_to_disk branch from 349a4bf to 74a34de Compare January 24, 2023 13:37
@simonswine
Copy link
Collaborator Author

With the latest changes I am also writing out the SeriesFingerprint to the parquet file. I need to do this to later renumber the SeriesIndexes. I think this also could help recover from an unclean shutdown.

I think this is ready for taking a first look @cyriltovena

@simonswine simonswine force-pushed the 20230119_flush_profiles_row_groups_to_disk branch from 5788815 to 37a6335 Compare January 25, 2023 10:11
@cyriltovena
Copy link
Collaborator

I think for the read path we should narrow down column to what we need per type of request instead of exposing the whole profile object, the will help leverage parquet goodness by only pulling requested columns.

Request type are basically based on the querier interface

type Querier interface {
...
	SelectMatchingProfiles(ctx context.Context, params *ingestv1.SelectProfilesRequest) (iter.Iterator[Profile], error)
	MergeByStacktraces(ctx context.Context, rows iter.Iterator[Profile]) (*ingestv1.MergeProfilesStacktracesResult, error)
	MergeByLabels(ctx context.Context, rows iter.Iterator[Profile], by ...string) ([]*typesv1.Series, error)
	MergePprof(ctx context.Context, rows iter.Iterator[Profile]) (*profile.Profile, error)
...
}

@simonswine
Copy link
Collaborator Author

@cyriltovena I would like to get another round of reviews, this afternoon I felt brave enough to deploy briefly to dev, but that resulted in all queries timing out, I will do look a bit more into this tomorrow

@cyriltovena
Copy link
Collaborator

Looking good.

@simonswine simonswine marked this pull request as ready for review February 3, 2023 14:22
@simonswine simonswine force-pushed the 20230119_flush_profiles_row_groups_to_disk branch from 2c1a6e1 to 555c2f0 Compare February 7, 2023 13:13
This ensure the readLock only needs to be hold as short as necssary.
* Ingest same profile series concurrenctly
* Test Flush as well
Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think we resolve symbols multiple times for the same block, but that can be improved later.

@simonswine
Copy link
Collaborator Author

simonswine commented Feb 8, 2023

I think we resolve symbols multiple times for the same block, but that can be improved later.

I don't think it matters *much as long all of them are in memory

@simonswine simonswine merged commit 0dbcdf8 into grafana:main Feb 8, 2023
simonswine added a commit to simonswine/pyroscope that referenced this pull request Jun 30, 2023
…profiles_row_groups_to_disk

Flush profiles to disk per row groups
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants