feat(query): table meta optimize #11015
Test environment
Memory: 32G
Test Script
#!/bin/bash
echo "start create table"
MYSQL_HOST="127.0.0.1"
MYSQL_USER="root"
MYSQL_PORT="3311"
mysql -h $MYSQL_HOST -u $MYSQL_USER -P $MYSQL_PORT -e "
DROP TABLE IF EXISTS hits;
CREATE TABLE hits
(
WatchID BIGINT NOT NULL,
JavaEnable SMALLINT NOT NULL,
Title TEXT NOT NULL,
GoodEvent SMALLINT NOT NULL,
EventTime TIMESTAMP NOT NULL,
EventDate Date NOT NULL,
CounterID INTEGER NOT NULL,
ClientIP INTEGER NOT NULL,
RegionID INTEGER NOT NULL,
UserID BIGINT NOT NULL,
CounterClass SMALLINT NOT NULL,
OS SMALLINT NOT NULL,
UserAgent SMALLINT NOT NULL,
URL TEXT NOT NULL,
Referer TEXT NOT NULL,
IsRefresh SMALLINT NOT NULL,
RefererCategoryID SMALLINT NOT NULL,
RefererRegionID INTEGER NOT NULL,
URLCategoryID SMALLINT NOT NULL,
URLRegionID INTEGER NOT NULL,
ResolutionWidth SMALLINT NOT NULL,
ResolutionHeight SMALLINT NOT NULL,
ResolutionDepth SMALLINT NOT NULL,
FlashMajor SMALLINT NOT NULL,
FlashMinor SMALLINT NOT NULL,
FlashMinor2 TEXT NOT NULL,
NetMajor SMALLINT NOT NULL,
NetMinor SMALLINT NOT NULL,
UserAgentMajor SMALLINT NOT NULL,
UserAgentMinor VARCHAR(255) NOT NULL,
CookieEnable SMALLINT NOT NULL,
JavascriptEnable SMALLINT NOT NULL,
IsMobile SMALLINT NOT NULL,
MobilePhone SMALLINT NOT NULL,
MobilePhoneModel TEXT NOT NULL,
Params TEXT NOT NULL,
IPNetworkID INTEGER NOT NULL,
TraficSourceID SMALLINT NOT NULL,
SearchEngineID SMALLINT NOT NULL,
SearchPhrase TEXT NOT NULL,
AdvEngineID SMALLINT NOT NULL,
IsArtifical SMALLINT NOT NULL,
WindowClientWidth SMALLINT NOT NULL,
WindowClientHeight SMALLINT NOT NULL,
ClientTimeZone SMALLINT NOT NULL,
ClientEventTime TIMESTAMP NOT NULL,
SilverlightVersion1 SMALLINT NOT NULL,
SilverlightVersion2 SMALLINT NOT NULL,
SilverlightVersion3 INTEGER NOT NULL,
SilverlightVersion4 SMALLINT NOT NULL,
PageCharset TEXT NOT NULL,
CodeVersion INTEGER NOT NULL,
IsLink SMALLINT NOT NULL,
IsDownload SMALLINT NOT NULL,
IsNotBounce SMALLINT NOT NULL,
FUniqID BIGINT NOT NULL,
OriginalURL TEXT NOT NULL,
HID INTEGER NOT NULL,
IsOldCounter SMALLINT NOT NULL,
IsEvent SMALLINT NOT NULL,
IsParameter SMALLINT NOT NULL,
DontCountHits SMALLINT NOT NULL,
WithHash SMALLINT NOT NULL,
HitColor CHAR NOT NULL,
LocalEventTime TIMESTAMP NOT NULL,
Age SMALLINT NOT NULL,
Sex SMALLINT NOT NULL,
Income SMALLINT NOT NULL,
Interests SMALLINT NOT NULL,
Robotness SMALLINT NOT NULL,
RemoteIP INTEGER NOT NULL,
WindowName INTEGER NOT NULL,
OpenerName INTEGER NOT NULL,
HistoryLength SMALLINT NOT NULL,
BrowserLanguage TEXT NOT NULL,
BrowserCountry TEXT NOT NULL,
SocialNetwork TEXT NOT NULL,
SocialAction TEXT NOT NULL,
HTTPError SMALLINT NOT NULL,
SendTiming INTEGER NOT NULL,
DNSTiming INTEGER NOT NULL,
ConnectTiming INTEGER NOT NULL,
ResponseStartTiming INTEGER NOT NULL,
ResponseEndTiming INTEGER NOT NULL,
FetchTiming INTEGER NOT NULL,
SocialSourceNetworkID SMALLINT NOT NULL,
SocialSourcePage TEXT NOT NULL,
ParamPrice BIGINT NOT NULL,
ParamOrderID TEXT NOT NULL,
ParamCurrency TEXT NOT NULL,
ParamCurrencyID SMALLINT NOT NULL,
OpenstatServiceName TEXT NOT NULL,
OpenstatCampaignID TEXT NOT NULL,
OpenstatAdID TEXT NOT NULL,
OpenstatSourceID TEXT NOT NULL,
UTMSource TEXT NOT NULL,
UTMMedium TEXT NOT NULL,
UTMCampaign TEXT NOT NULL,
UTMContent TEXT NOT NULL,
UTMTerm TEXT NOT NULL,
FromTag TEXT NOT NULL,
HasGCLID SMALLINT NOT NULL,
RefererHash BIGINT NOT NULL,
URLHash BIGINT NOT NULL,
CLID INTEGER NOT NULL
)
CLUSTER BY (CounterID, EventDate, UserID, EventTime, WatchID);
COPY INTO hits FROM 'https://repo.databend.rs/hits/hits_1m.tsv.gz' FILE_FORMAT=(type=TSV compression=AUTO);
"
for i in {1..6}
do
echo "start insert into for the $i time"
mysql -h $MYSQL_HOST -u $MYSQL_USER -P $MYSQL_PORT -e "insert into hits select * from hits;"
echo "end insert into for the $i time"
done
echo "insert data over, ready to compact segment"
mysql -h $MYSQL_HOST -u $MYSQL_USER -P $MYSQL_PORT -e "optimize table hits compact segment;"
# Run the first query and assign the result to a variable
snapshot_id=$(mysql -N -h $MYSQL_HOST -u $MYSQL_USER -P $MYSQL_PORT -e "SELECT snapshot_id FROM FUSE_SNAPSHOT('default', 'hits') limit 1;")
# Run the second query using the result from the first query
mysql -h $MYSQL_HOST -u $MYSQL_USER -P $MYSQL_PORT -e "SELECT * FROM FUSE_SEGMENT('default', 'hits', '$snapshot_id');"
Test Result
We also tested
|
This test seems to have been run on the local fs. It's worth a bench on |
Good idea. I will test that next! :D |
For bench |
I tried this dataset before, but it ran slowly on my machine at the time, so I chose the method above. I'll use this test later! |
Rest LGTM.
- I think we should compress the `snapshot` in this PR as well; let's keep `v3` as complete as possible.
- We don't need `message_pack`, `snap`, and `json` by default. Maybe it's better to make them optional dependencies behind feature gates, which would reduce the compile time.
BTW, do we have a tool or functions to decode the compressed bincode meta file? Sometimes we want to inspect the metadata in it, and with JSON that is easy to do. |
Command-line tool name: databend-meta-decoder
Usage:
Description:
Options:
Examples:
databend-meta-decoder -i data.bin
Version: x
Encoding: y
Compression: z
Blocks Size: p bytes
Summary Size: q bytes
databend-meta-decoder -i data.bin -j -o output.json
Note:
How about this design? |
We already have a service called `databend-meta`, and it has some tools. Here, to avoid confusion, the name |
The location of the snapshot that flashed back seems to be broken.
|
@dantengsky The reason is that SnapshotIO uses the version of the root snapshot to read the previous snapshot. If the previous snapshot was written with a different version than the root snapshot, the problem occurs. |
@zhyass Thanks a lot, I have merged branch 'main' into segment_compress. |
Fix purge with older version bug
Add more test
refactor: remove `segemnt::version()` and add compat test cases
test cases have been added: which will
Please have a look. If there are any concerns, or test cases that should be added, please let me know. Thanks. |
LGTM
Thanks a lot for your contributions to this PR!! @dantengsky @zhyass |
I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/
Summary
This PR aims to improve the serialization and deserialization efficiency of metadata in Databend and to reduce storage space consumption. It also adds support for partial reads.
Design
Metadata is serialized with the Bincode binary format and compressed with Zstd. This can significantly improve read speed and reduce the stored file size. A custom file format is used as well:
The file starts with a header that stores the version number of the segment, followed by the encoding format (bincode) and the compression method, each stored as an enum value, and then the lengths of the serialized data (blocks and summary). After the header, the data is serialized in sequence, then compressed and stored.
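To make the layout above concrete, here is a minimal sketch in Python using only the standard library. It is not the code from this PR: the field widths, the enum tag values, and the function names are all assumptions, `json` stands in for Bincode, and `zlib` stands in for Zstd. It only illustrates how recording the two lengths in the header allows the summary to be read without decoding the blocks.

```python
import json
import struct
import zlib

# Hypothetical enum tags. The real format would record bincode and zstd here;
# this sketch records the stand-ins it actually uses.
ENCODING_STANDIN_JSON = 0
COMPRESSION_STANDIN_ZLIB = 0

# Assumed header layout (little-endian, no padding):
# version u64, encoding u8, compression u8, blocks length u64, summary length u64.
HEADER = struct.Struct("<QBBQQ")


def encode_segment(version, blocks, summary):
    """Serialize and compress blocks and summary, then prepend the header."""
    blocks_bytes = zlib.compress(json.dumps(blocks).encode("utf-8"))
    summary_bytes = zlib.compress(json.dumps(summary).encode("utf-8"))
    header = HEADER.pack(
        version,
        ENCODING_STANDIN_JSON,
        COMPRESSION_STANDIN_ZLIB,
        len(blocks_bytes),
        len(summary_bytes),
    )
    return header + blocks_bytes + summary_bytes


def read_header(buf):
    """Return (version, encoding, compression, blocks_len, summary_len)."""
    return HEADER.unpack_from(buf, 0)


def read_blocks(buf):
    """Decode only the blocks section."""
    _, _, _, blocks_len, _ = read_header(buf)
    start = HEADER.size
    return json.loads(zlib.decompress(buf[start:start + blocks_len]))


def read_summary_only(buf):
    """Partial read: skip over the blocks section using its recorded length."""
    _, _, _, blocks_len, summary_len = read_header(buf)
    start = HEADER.size + blocks_len
    return json.loads(zlib.decompress(buf[start:start + summary_len]))
```

With this layout a reader that only needs the summary seeks past `blocks_len` bytes and decompresses just the summary section, which is one way the partial read mentioned in the Summary could work.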
Closes #10265