Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add direct access option to avoid caching certain paths #18326

Merged
merged 7 commits into from
Nov 6, 2023

Conversation

yuzhu
Copy link
Contributor

@yuzhu yuzhu commented Oct 25, 2023

What changes are proposed in this pull request?

Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations.

Why are the changes needed?

for correct functionality using iceberg

Does this PR introduce any user facing changes?

one user property that is by default empty, alluxio.user.file.direct.access

@yuzhu
Copy link
Contributor Author

yuzhu commented Oct 25, 2023

@dbw9580 @apc999 as I am adding more to this PR, I find that I am changing all the entry point that path-specific configuration is changing. I am not sure that this change is any less complex than path-specific config when it is fully done.

.setCommonOptions(FileSystemOptionsUtils.commonDefaults(conf)
.toBuilder().setSyncIntervalMs(0)).build();
// may not be necessary
options = options.toBuilder().setReadType(ReadPType.NO_CACHE).build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NO_CACHE means to not cache the block into the worker's storage if it's not already cached. It says nothing about skipping reading from the cache if it's already cached.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a new read type SKIP_CACHE to make the worker to always read from the UFS, regardless of whether the block is already cached or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we refresh/invalidate the cache in combination with NO_CACHE? i think that is effectively skip_CACHE no?

@yuzhu yuzhu added the type-feature This issue is a feature request label Oct 31, 2023
@yuzhu yuzhu requested review from dbw9580 and apc999 October 31, 2023 02:51
Comment on lines 171 to 174
if (!getConf().isSet(PropertyKey.USER_FILE_DIRECT_ACCESS)) {
return false;
}
List<String> pathList = getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS) can be cached as this is called for every call and can be harmful to performance otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* 2. inner configuration
*/
@ThreadSafe
public class NestedAlluxioConfiguration implements AlluxioConfiguration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest overlay configuration as the name

Suggested change
public class NestedAlluxioConfiguration implements AlluxioConfiguration {
public class OverlayConfiguration implements AlluxioConfiguration {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 76 to 84
@Override
public Set<PropertyKey> keySet() {
return mInnerConf.keySet();
}

@Override
public Set<PropertyKey> userKeySet() {
return mOuterConf.keySet();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keySet and userKeySet have different sources? Should it be a union of both layers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was set this way because I wanted Outer to completely override inner but that does not seem like the right way.

@Override
public AlluxioProperties copyProperties() {
AlluxioProperties properties = mInnerConf.copyProperties();
for (PropertyKey key : keySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AlluxioConfiguration.keySet includes all properties defined in the PropertyKey class, even if it's not defined in this particular AlluxioProperties object. What you need is probably for (Entry<PropertyKey, Object> entry : properties.entrySet()).

}

private AlluxioConfiguration conf(PropertyKey key) {
return mOuterConf.isSet(key) ? mOuterConf : mInnerConf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean a default value in the outer layer gets to override a value explicitly set by the user in the inner layer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

take the specific example of a path specific configuration as inner and instancedConfiguration as outer. path specific will have cluster default as the default and pathspecific as the user set properties. InstancedConfiguration will have the system default as the default and user set as user set configurations.

/**
* Integration tests for path level configurations.
*/
public class DirectAccessIntegrationTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inherit from BaseIntegrationTest to get timeout and logging facilities.

Suggested change
public class DirectAccessIntegrationTest {
public class DirectAccessIntegrationTest extends BaseIntegrationTest {

.build();
private FileSystem mFileSystem;

private final String mLocalUfsPath = Files.createTempDir().getAbsolutePath();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use AlluxioTestDirectory which automatically cleans up the directory after the test is finished. Take this as an example:

put(PropertyKey.MASTER_MOUNT_TABLE_ROOT_UFS, AlluxioTestDirectory
.createTemporaryDirectory("FileSystemMasterTest").getAbsolutePath());

try (FileInStream is = mFileSystem.openFile(uri)) {
IOUtils.copy(is, ByteStreams.nullOutputStream());
}
FileSystemUtils.waitForAlluxioPercentage(mFileSystem, uri, shouldCache ? 100 : 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of shouldCacheBefore == false and shouldCache == false,
waitForAlluxioPercentage returns immediately when the expected cached percentage is 0. It takes time for this percentage to increase to above 0, so if the worker incorrectly caches the data which it wasn't supposed to cache, this test will fail to detect this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better alternative is to look directly into the block master to see if the file has any associated blocks.

* 2. inner configuration
*/
@ThreadSafe
public class NestedAlluxioConfiguration implements AlluxioConfiguration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit test for this class?

Copy link
Contributor Author

@yuzhu yuzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dbw9580 addressed comments, PTAL

Comment on lines 171 to 174
if (!getConf().isSet(PropertyKey.USER_FILE_DIRECT_ACCESS)) {
return false;
}
List<String> pathList = getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* 2. inner configuration
*/
@ThreadSafe
public class NestedAlluxioConfiguration implements AlluxioConfiguration {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

private AlluxioConfiguration conf(PropertyKey key) {
return mOuterConf.isSet(key) ? mOuterConf : mInnerConf;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

take the specific example of a path specific configuration as inner and instancedConfiguration as outer. path specific will have cluster default as the default and pathspecific as the user set properties. InstancedConfiguration will have the system default as the default and user set as user set configurations.

Comment on lines 76 to 84
@Override
public Set<PropertyKey> keySet() {
return mInnerConf.keySet();
}

@Override
public Set<PropertyKey> userKeySet() {
return mOuterConf.keySet();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was set this way because I wanted Outer to completely override inner but that does not seem like the right way.

@dbw9580
Copy link
Contributor

dbw9580 commented Nov 3, 2023

@yuzhu I think you forgot to push the latest changes that addresses the comments

@yuzhu yuzhu requested a review from dbw9580 November 6, 2023 04:24
* determines the value.
*/
@ThreadSafe
public class OverlayConfiguration implements AlluxioConfiguration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add unit test for this class?

Copy link
Contributor

@dbw9580 dbw9580 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM

@yuzhu
Copy link
Contributor Author

yuzhu commented Nov 6, 2023

alluxio-bot, merge this please

@alluxio-bot alluxio-bot merged commit 045a511 into Alluxio:master-2.x Nov 6, 2023
19 checks passed
@Xenorith
Copy link
Contributor

Xenorith commented Nov 6, 2023

alluxio-bot, cherry-pick this to branch-2.10 please

@alluxio-bot
Copy link
Contributor

Auto cherry-pick to branch branch-2.10 successfully opened PR: #18384

alluxio-bot pushed a commit that referenced this pull request Nov 6, 2023
### What changes are proposed in this pull request?

Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations. 

### Why are the changes needed?

for correct functionality using iceberg

### Does this PR introduce any user facing changes?

one user property that is by default empty, alluxio.user.file.direct.access

			pr-link: #18326
			change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
@jiacheliu3
Copy link
Contributor

@yuzhu Do we need this in main? Thanks!

@yuzhu
Copy link
Contributor Author

yuzhu commented Nov 8, 2023

Not needed in main, will require a very different implementation.

yuzhu added a commit to yuzhu/alluxio that referenced this pull request Nov 8, 2023
### What changes are proposed in this pull request?

Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations. 

### Why are the changes needed?

for correct functionality using iceberg

### Does this PR introduce any user facing changes?

one user property that is by default empty, alluxio.user.file.direct.access

			pr-link: Alluxio#18326
			change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
alluxio-bot added a commit that referenced this pull request Nov 8, 2023
### What changes are proposed in this pull request?
Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR.

We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release.

#17747 is not cherry picked because tencent cloud EMR doc is removed
#17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x
#17758 is not cherry picked because MonoBlockStore has been removed in 3.x
#17641 is not cherry picked because the PR has already been in main
#17781 is not cherry picked because the PR has already been in main
#17722 is not cherry picked because the alluxio-fuse command has been changed a lot
#17489 is not cherry picked because audit log on master is no longer in 3.x
#17865 is not cherry picked because replication on job service is no longer in 3.x
#17858 is not cherry picked because it is already in main
#18090 is not cherry picked because generate-tarball has been rewritten in 3.x
#18091 is not cherry picked because the change is already in main
#17474 is not cherry picked because reconfiguration feature is not defined in 3.x
#17735 is not cherry picked because MonoBlockStore is no longer in 3.x
#18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x
#17910 is not cherry picked because I prefer to do that manually
#17983 is not cherry picked because the web UI has been reworked
#17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x
#18103 is not cherry picked because worker cache metrics have been reworked in 3.x
#18185 is not cherry picked because the report command has been reworked in 3.x
#18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x
#18143 is not cherry picked because the change is already in main
#18303 is not cherry picked because the change is already in main
#18208 is not cherry picked because cache metrics have been reworked in 3.x
#17002 is not cherry picked because the owner has been notified separately
#18334 is not cherry picked because the bash scripts have been reworked in 3.x
#18326 is not cherry picked because the owner has been notified separately

			pr-link: #18397
			change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
alluxio-bot pushed a commit that referenced this pull request Nov 14, 2023
Cherry-pick of existing commit.
orig-pr: #18326
orig-commit: 045a511
orig-commit-author: David Zhu <david@alluxio.com>

			pr-link: #18399
			change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
maobaolong pushed a commit to maobaolong/alluxio that referenced this pull request Jan 3, 2024
### What changes are proposed in this pull request?

Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations.

### Why are the changes needed?

for correct functionality using iceberg

### Does this PR introduce any user facing changes?

one user property that is by default empty, alluxio.user.file.direct.access

			pr-link: Alluxio#18326
			change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature This issue is a feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants