Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cacheMissPercentage metric #18208

Merged
merged 9 commits into from
Oct 27, 2023
Merged

Conversation

juanjuan2
Copy link

@juanjuan2 juanjuan2 commented Sep 25, 2023

What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.
modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations.
#16945

Why are the changes needed?

Because of the incorrect calculation results of this metric, it may produce exception value.

Does this PR introduce any user facing changes?

No

@alluxio-bot
Copy link
Contributor

Thank you for your pull request.
In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement (CLA).
It's all electronic and will take just a few minutes. Please download CLA form here, sign, and e-mail back to cla@alluxio.org

@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title must not end with punctuation
  • Commits associated with Github account: PASS

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

@Kai-Zhang Kai-Zhang changed the title Fix cacheMissPercentage metric. Fix cacheMissPercentage metric Sep 25, 2023
@alluxio-bot
Copy link
Contributor

Automated checks report:

  • PR title follows the conventions: PASS
  • Commits associated with Github account: PASS

All checks passed!

@Kai-Zhang Kai-Zhang added the type-bug This issue is about a bug label Sep 25, 2023
@@ -70,6 +83,19 @@ public PagedUfsBlockReader(UfsManager ufsManager,
mInitialOffset = offset;
mLastPage = ByteBuffer.allocateDirect((int) mPageSize);
mPosition = offset;
UfsManager.UfsClient ufsClient = mUfsManager.get(mUfsBlockOptions.getMountId());
mUfsBytesRead = mUfsBytesReadMetrics.computeIfAbsent(
new BytesReadMetricKey(ufsClient.getUfsMountPointUri(), mUfsBlockOptions.getUser()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be put in UfsReadableChannel so that you won't need to throw the checked exceptions in the constructor of PagedUfsBlockReader.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the constructor of PagedUfsBlockReader use mUfsManager.get(mUfsBlockOptions.getMountId()) which has exceptions output.

@@ -70,6 +83,19 @@ public PagedUfsBlockReader(UfsManager ufsManager,
mInitialOffset = offset;
mLastPage = ByteBuffer.allocateDirect((int) mPageSize);
mPosition = offset;
UfsManager.UfsClient ufsClient = mUfsManager.get(mUfsBlockOptions.getMountId());
mUfsBytesRead = mUfsBytesReadMetrics.computeIfAbsent(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why put in a map? Can you directly instantiate this metric key?

@@ -47,6 +56,9 @@ public class PagedUfsBlockReader extends BlockReader {
private long mLastPageIndex = -1;
private boolean mClosed = false;
private long mPosition;
private final ConcurrentMap<BytesReadMetricKey, Counter> mUfsBytesReadMetrics =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elements are only inserted into this map, but never queried

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map is used by mUfsBytesReadMetrics.computeIfAbsent().

// cluster cache hit and miss
long bytesReadTotal = bytesReadLocal + bytesReadRemote + bytesReadDomainSocket;
long bytesReadTotal = bytesReadLocal + bytesReadCache + bytesReadUfs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the metrics are updated in page store only. If the user uses the block store, then these bytesReadCache + bytesReadUfs will be zero. Can you also update the metrics in the block store?

Comment on lines 105 to 107
int bytesReadFromCache = buf.writeBytes(mLocalFileChannel, buf.writableBytes());
MetricsSystem.counter(MetricKey.WORKER_BYTES_READ_CACHE.getName()).inc(bytesReadFromCache);
return bytesReadFromCache;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocalFileBlockReader is also used by LocalFileDataReader which is client side implementation that handles short circuit reads. If this worker specific metric is added here, then short circuit read will also cause worker cache read metrics to be updated.

The recommended place to inject this metric is in alluxio.worker.block.TieredBlockStore#createBlockReader. You can wrap the returned block reader object in some decorator class that's a subclass of BlockReader, and udpates the metrics in that decorator.

/**
* An reader class with metrics.
*/
public class DeStoreBlockReader extends BlockReader {
Copy link
Contributor

@dbw9580 dbw9580 Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricAccountingLocalBlockReader is a better name.

* A decorator of BlockReader.
* @param deblockReader block reader
*/
public DeStoreBlockReader(BlockReader deblockReader) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requiring the reader to be a BlockReader is too general: UnderFileSystemBlockReader is also a BlockReader, but it's obviously wrong to use that with the metric MetricKey.WORKER_BYTES_READ_CACHE.

Since we are dealing with the bytes-read-cache metric here, requiring the argument to be of type LocalFileBlockReader makes sense.


@Override
public ByteBuffer read(long offset, long length) throws IOException {
return mDeBlockReader.read(offset, length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update the metric here

@Override
public int transferTo(ByteBuf buf) throws IOException {
int bytesReadFromCache = mDeBlockReader.transferTo(buf);
MetricsSystem.counter(MetricKey.WORKER_BYTES_READ_CACHE.getName()).inc(bytesReadFromCache);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytesReadFromCache may be -1 at EOF.

}

@Override
public boolean isClosed() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also override close() and close the wrapped reader


@Override
public ReadableByteChannel getChannel() {
return mDeBlockReader.getChannel();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the channel also needs to be wrapped to add the metric updating logic.
something looks like

return new ReadableByteChannel() {
  private final ReadableByteChannel mDelegate = mDeBlockReader.getChannel();

  @Override
  public int read(ByteBuffer dst) throws IOException {
    int bytesRead = mDelegate.read(dst);
    // update metric here
    return bytesRead;
  }
}

Copy link
Contributor

@Kai-Zhang Kai-Zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the contribution!

@Kai-Zhang
Copy link
Contributor

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit b37a96d into Alluxio:master-2.x Oct 27, 2023
19 checks passed
alluxio-bot added a commit that referenced this pull request Nov 8, 2023
### What changes are proposed in this pull request?
Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR.

We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release.

#17747 is not cherry picked because tencent cloud EMR doc is removed
#17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x
#17758 is not cherry picked because MonoBlockStore has been removed in 3.x
#17641 is not cherry picked because the PR has already been in main
#17781 is not cherry picked because the PR has already been in main
#17722 is not cherry picked because the alluxio-fuse command has been changed a lot
#17489 is not cherry picked because audit log on master is no longer in 3.x
#17865 is not cherry picked because replication on job service is no longer in 3.x
#17858 is not cherry picked because it is already in main
#18090 is not cherry picked because generate-tarball has been rewritten in 3.x
#18091 is not cherry picked because the change is already in main
#17474 is not cherry picked because reconfiguration feature is not defined in 3.x
#17735 is not cherry picked because MonoBlockStore is no longer in 3.x
#18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x
#17910 is not cherry picked because I prefer to do that manually
#17983 is not cherry picked because the web UI has been reworked
#17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x
#18103 is not cherry picked because worker cache metrics have been reworked in 3.x
#18185 is not cherry picked because the report command has been reworked in 3.x
#18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x
#18143 is not cherry picked because the change is already in main
#18303 is not cherry picked because the change is already in main
#18208 is not cherry picked because cache metrics have been reworked in 3.x
#17002 is not cherry picked because the owner has been notified separately
#18334 is not cherry picked because the bash scripts have been reworked in 3.x
#18326 is not cherry picked because the owner has been notified separately

			pr-link: #18397
			change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
maobaolong pushed a commit to maobaolong/alluxio that referenced this pull request Jan 3, 2024
### What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.
modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations.
Alluxio#16945

### Why are the changes needed?

Because of the incorrect calculation results of this metric, it may produce exception value.

### Does this PR introduce any user facing changes?

No
			pr-link: Alluxio#18208
			change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
Xenorith pushed a commit to Xenorith/alluxio that referenced this pull request Apr 25, 2024
### What changes are proposed in this pull request?

Please outline the changes and how this PR fixes the issue.
modify the calculation method for cacheMissPercentage metric, ensuring it comprehensively accounts for the influence of job worker read operations.
Alluxio#16945

### Why are the changes needed?

Because of the incorrect calculation results of this metric, it may produce exception value.

### Does this PR introduce any user facing changes?

No
			pr-link: Alluxio#18208
			change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
alluxio-bot pushed a commit that referenced this pull request Apr 25, 2024
Cherry-pick of existing commit.
orig-pr: #18208
orig-commit: b37a96d
orig-commit-author: juanjuan2 <45665083+juanjuan2@users.noreply.github.com>

			pr-link: #18589
			change-id: cid-d24a6f324f7148cab4a30fbbf819510a1a314ebc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug This issue is about a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants