-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add direct access option to avoid caching certain paths #18326
Conversation
.setCommonOptions(FileSystemOptionsUtils.commonDefaults(conf) | ||
.toBuilder().setSyncIntervalMs(0)).build(); | ||
// may not be necessary | ||
options = options.toBuilder().setReadType(ReadPType.NO_CACHE).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NO_CACHE
means to not cache the block into the worker's storage if it's not already cached. It says nothing about skipping reading from the cache if it's already cached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a new read type SKIP_CACHE
to make the worker to always read from the UFS, regardless of whether the block is already cached or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we refresh/invalidate the cache in combination with NO_CACHE? i think that is effectively skip_CACHE no?
if (!getConf().isSet(PropertyKey.USER_FILE_DIRECT_ACCESS)) { | ||
return false; | ||
} | ||
List<String> pathList = getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS)
can be cached as this is called for every call and can be harmful to performance otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* 2. inner configuration | ||
*/ | ||
@ThreadSafe | ||
public class NestedAlluxioConfiguration implements AlluxioConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest overlay configuration as the name
public class NestedAlluxioConfiguration implements AlluxioConfiguration { | |
public class OverlayConfiguration implements AlluxioConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Override | ||
public Set<PropertyKey> keySet() { | ||
return mInnerConf.keySet(); | ||
} | ||
|
||
@Override | ||
public Set<PropertyKey> userKeySet() { | ||
return mOuterConf.keySet(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why keySet
and userKeySet
have different sources? Should it be a union of both layers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was set this way because I wanted Outer to completely override inner but that does not seem like the right way.
@Override | ||
public AlluxioProperties copyProperties() { | ||
AlluxioProperties properties = mInnerConf.copyProperties(); | ||
for (PropertyKey key : keySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AlluxioConfiguration.keySet
includes all properties defined in the PropertyKey
class, even if it's not defined in this particular AlluxioProperties
object. What you need is probably for (Entry<PropertyKey, Object> entry : properties.entrySet())
.
} | ||
|
||
private AlluxioConfiguration conf(PropertyKey key) { | ||
return mOuterConf.isSet(key) ? mOuterConf : mInnerConf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean a default value in the outer layer gets to override a value explicitly set by the user in the inner layer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
take the specific example of a path specific configuration as inner and instancedConfiguration as outer. path specific will have cluster default as the default and pathspecific as the user set properties. InstancedConfiguration will have the system default as the default and user set as user set configurations.
/** | ||
* Integration tests for path level configurations. | ||
*/ | ||
public class DirectAccessIntegrationTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inherit from BaseIntegrationTest
to get timeout and logging facilities.
public class DirectAccessIntegrationTest { | |
public class DirectAccessIntegrationTest extends BaseIntegrationTest { |
.build(); | ||
private FileSystem mFileSystem; | ||
|
||
private final String mLocalUfsPath = Files.createTempDir().getAbsolutePath(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AlluxioTestDirectory
which automatically cleans up the directory after the test is finished. Take this as an example:
alluxio/dora/core/server/master/src/test/java/alluxio/master/file/FileSystemMasterTestBase.java
Lines 154 to 155 in 6dc8ed2
put(PropertyKey.MASTER_MOUNT_TABLE_ROOT_UFS, AlluxioTestDirectory | |
.createTemporaryDirectory("FileSystemMasterTest").getAbsolutePath()); |
try (FileInStream is = mFileSystem.openFile(uri)) { | ||
IOUtils.copy(is, ByteStreams.nullOutputStream()); | ||
} | ||
FileSystemUtils.waitForAlluxioPercentage(mFileSystem, uri, shouldCache ? 100 : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of shouldCacheBefore == false
and shouldCache == false
,
waitForAlluxioPercentage
returns immediately when the expected cached percentage is 0. It takes time for this percentage to increase to above 0, so if the worker incorrectly caches the data which it wasn't supposed to cache, this test will fail to detect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better alternative is to look directly into the block master to see if the file has any associated blocks.
* 2. inner configuration | ||
*/ | ||
@ThreadSafe | ||
public class NestedAlluxioConfiguration implements AlluxioConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add unit test for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbw9580 addressed comments, PTAL
if (!getConf().isSet(PropertyKey.USER_FILE_DIRECT_ACCESS)) { | ||
return false; | ||
} | ||
List<String> pathList = getConf().getList(PropertyKey.USER_FILE_DIRECT_ACCESS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* 2. inner configuration | ||
*/ | ||
@ThreadSafe | ||
public class NestedAlluxioConfiguration implements AlluxioConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
private AlluxioConfiguration conf(PropertyKey key) { | ||
return mOuterConf.isSet(key) ? mOuterConf : mInnerConf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
take the specific example of a path specific configuration as inner and instancedConfiguration as outer. path specific will have cluster default as the default and pathspecific as the user set properties. InstancedConfiguration will have the system default as the default and user set as user set configurations.
@Override | ||
public Set<PropertyKey> keySet() { | ||
return mInnerConf.keySet(); | ||
} | ||
|
||
@Override | ||
public Set<PropertyKey> userKeySet() { | ||
return mOuterConf.keySet(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was set this way because I wanted Outer to completely override inner but that does not seem like the right way.
@yuzhu I think you forgot to push the latest changes that addresses the comments |
* determines the value. | ||
*/ | ||
@ThreadSafe | ||
public class OverlayConfiguration implements AlluxioConfiguration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add unit test for this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly LGTM
alluxio-bot, merge this please |
alluxio-bot, cherry-pick this to branch-2.10 please |
Auto cherry-pick to branch |
### What changes are proposed in this pull request? Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations. ### Why are the changes needed? for correct functionality using iceberg ### Does this PR introduce any user facing changes? one user property that is by default empty, alluxio.user.file.direct.access pr-link: #18326 change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
@yuzhu Do we need this in |
Not needed in main, will require a very different implementation. |
### What changes are proposed in this pull request? Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations. ### Why are the changes needed? for correct functionality using iceberg ### Does this PR introduce any user facing changes? one user property that is by default empty, alluxio.user.file.direct.access pr-link: Alluxio#18326 change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
### What changes are proposed in this pull request? Merge missing commits from master-2.x to main. The commits in 2023/07/01~2023/11/08 from main...master-2.x will be included by this PR. We do this merge to catch missing fixes from `master-2.x` and catch the train before `main` cuts a release. #17747 is not cherry picked because tencent cloud EMR doc is removed #17755 is not cherry picked because DistLoadCliRunner has been removed in 3.x #17758 is not cherry picked because MonoBlockStore has been removed in 3.x #17641 is not cherry picked because the PR has already been in main #17781 is not cherry picked because the PR has already been in main #17722 is not cherry picked because the alluxio-fuse command has been changed a lot #17489 is not cherry picked because audit log on master is no longer in 3.x #17865 is not cherry picked because replication on job service is no longer in 3.x #17858 is not cherry picked because it is already in main #18090 is not cherry picked because generate-tarball has been rewritten in 3.x #18091 is not cherry picked because the change is already in main #17474 is not cherry picked because reconfiguration feature is not defined in 3.x #17735 is not cherry picked because MonoBlockStore is no longer in 3.x #18133 is not cherry picked because the issue is about master metadata and no longer relevant in 3.x #17910 is not cherry picked because I prefer to do that manually #17983 is not cherry picked because the web UI has been reworked #17984 is not cherry picked because Mount/Unmount commands have been reworked in 3.x #18103 is not cherry picked because worker cache metrics have been reworked in 3.x #18185 is not cherry picked because the report command has been reworked in 3.x #18222 is not cherry picked because Mount/Unmount operations have been reworked in 3.x #18143 is not cherry picked because the change is already in main #18303 is not cherry picked because the change is already in main #18208 is not cherry picked because cache metrics have been reworked in 3.x #17002 is not cherry picked because the owner has been notified separately #18334 is not cherry picked because the bash scripts have been reworked in 3.x #18326 is not cherry picked because the owner has been notified separately pr-link: #18397 change-id: cid-dbf8cbb2d9e721a5a0a1e5028a3c9577438a2ac0
### What changes are proposed in this pull request? Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations. ### Why are the changes needed? for correct functionality using iceberg ### Does this PR introduce any user facing changes? one user property that is by default empty, alluxio.user.file.direct.access pr-link: Alluxio#18326 change-id: cid-25a1ee3c2876a4a4126727113342c24d40b06126
What changes are proposed in this pull request?
Datalake formats such as Iceberg requires frequent changes to certain files, it is better not to cache them at all to avoid frequent invalidations.
Why are the changes needed?
for correct functionality using iceberg
Does this PR introduce any user facing changes?
one user property that is by default empty, alluxio.user.file.direct.access