Releases: GoogleCloudDataproc/hadoop-connectors
2018-08-09 (GCS 1.9.5, BQ 0.13.5)
Changelog
Cloud Storage connector:
- Improve build configuration (pom.xmls) compatibility with the Maven release plugin. Changes the version string from 1.9.5-hadoop2 to hadoop2-1.9.5.
- Update Maven plugin versions.
- Do not send a batch request when performing operations (rename, delete, copy) on a single object.
- Add the fs.gs.performance.cache.dir.metadata.prefetch.limit (default: 1000) configuration property to control the number of objects whose metadata is prefetched in the same directory by PerformanceCachingGoogleCloudStorage. To disable metadata prefetching, set the property to 0. To prefetch metadata for all objects in a directory, set the property to -1.
- Add configuration properties to control batching of copy operations separately from other operations (see the configuration sketch after this list):
  fs.gs.copy.max.requests.per.batch (default: 30)
  fs.gs.copy.batch.threads (default: 0)
- Fix RejectedExecutionException during parallel execution of GCS batch requests.
- Change default values for GCS batch/directory operations properties:
  fs.gs.copy.with.rewrite.enable (default: false -> true)
  fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
  fs.gs.copy.batch.threads (default: 0 -> 50)
  fs.gs.max.requests.per.batch (default: 30 -> 25)
  fs.gs.batch.threads (default: 0 -> 25)
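As an illustrative sketch only (the helper class name is made up; the property names, defaults, and value semantics come from the notes above), these settings could be applied through a standard Hadoop Configuration before the file system is created:

    import org.apache.hadoop.conf.Configuration;

    public class DirectoryCacheAndCopyBatchSettings {
      public static Configuration apply(Configuration conf) {
        // Prefetch metadata for at most 1000 objects per directory in
        // PerformanceCachingGoogleCloudStorage (0 disables prefetching, -1 prefetches all).
        conf.setInt("fs.gs.performance.cache.dir.metadata.prefetch.limit", 1000);
        // Batch copy operations separately from other batched operations.
        conf.setInt("fs.gs.copy.max.requests.per.batch", 30);
        // 0 = copy batches are executed sequentially on the caller thread.
        conf.setInt("fs.gs.copy.batch.threads", 0);
        return conf;
      }
    }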
BigQuery connector:
- POM updates for GCS connector 1.9.5.
- Improve build configuration (pom.xmls) compatibility with the Maven release plugin. Changes the version string from 0.13.5-hadoop2 to hadoop2-0.13.5.
- Update Maven plugin versions.
2018-08-10 (GCS 1.6.8, BQ 0.10.9)
Changelog
Cloud Storage connector:
- Support parallel execution of GCS batch requests. The number of threads used to execute batch requests is configurable via the property (see the configuration sketch after this list):
  fs.gs.batch.threads (default: 0)
  If fs.gs.batch.threads is set to 0, batch requests are executed sequentially by the caller thread.
- Do not send a batch request when performing operations (rename, delete, copy) on a single object.
- Add configuration properties to control batching of copy operations separately from other operations:
  fs.gs.copy.max.requests.per.batch (default: 30)
  fs.gs.copy.batch.threads (default: 0)
- Fix RejectedExecutionException during parallel execution of GCS batch requests.
- Change default values for GCS batch/directory operations properties:
  fs.gs.copy.with.rewrite.enable (default: false -> true)
  fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
  fs.gs.copy.batch.threads (default: 0 -> 50)
  fs.gs.max.requests.per.batch (default: 30 -> 25)
  fs.gs.batch.threads (default: 0 -> 25)
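A minimal sketch, assuming the same Hadoop Configuration mechanism and a hypothetical helper class, of turning on parallel batch execution (a value of 0 keeps the old sequential behavior):

    import org.apache.hadoop.conf.Configuration;

    public class BatchThreadsSettings {
      public static Configuration apply(Configuration conf) {
        // Execute GCS batch requests on 25 worker threads; 0 means the
        // caller thread issues batch requests sequentially.
        conf.setInt("fs.gs.batch.threads", 25);
        return conf;
      }
    }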
BigQuery connector:
- POM updates for GCS connector 1.6.8.
2018-08-07 (GCS 1.9.4, BQ 0.13.4)
Changelog
Cloud Storage connector:
- Add the fs.gs.generation.read.consistency (default: LATEST) property to determine read consistency across different generations of a GCS object. Three modes are supported (see the configuration sketch after this list):
  LATEST: this is the default behavior. The connector ignores the generation ID of GCS objects and always tries to read the live version.
  BEST_EFFORT: the connector tries to read the generation determined when the GoogleCloudStorageReadChannel is first established. However, if that generation can no longer be found, the connector falls back to reading the live version. This mode improves performance by requesting the same object generation from GCS; with it, the connector can read changing objects from GCS buckets that have object versioning disabled without failure.
  STRICT: the connector always tries to read the generation determined when the GoogleCloudStorageReadChannel is first established, and reports an error (FileNotFound) when that generation can no longer be found.
  Note that this property applies only to new streams opened after the generation is determined. It does not affect reads from streams that are already open, the pre-fetched footer, or the metadata of the object.
- Support parallel execution of GCS batch requests. The number of threads used to execute batch requests is configurable via the property:
  fs.gs.batch.threads (default: 0)
  If fs.gs.batch.threads is set to 0, batch requests are executed sequentially by the caller thread.
- Do not fail fast when creating a GoogleCloudStorageReadChannel instance for a non-existent object, to avoid a GCS metadata request.
- Add a property to fail fast with FileNotFoundException when calling the GoogleCloudStorageImpl#open method (costs an additional GCS metadata request):
  fs.gs.inputstream.fast.fail.on.not.found.enable (default: true)
- Lazily initialize GoogleCloudStorageReadChannel metadata after the first read operation.
- Lazily pre-fetch the footer in AUTO and RANDOM fadvise modes when reading the end of a file using GoogleCloudStorageReadChannel.
- Delete the fs.gs.inputstream.footer.prefetch.size property and use the fs.gs.inputstream.min.range.request.size property to determine the lazy footer prefetch size. Because GoogleCloudStorageReadChannel makes the first read without knowing the object size, it uses a heuristic to lazily prefetch at most fs.gs.inputstream.min.range.request.size / 2 bytes before the read channel position in case this is a footer read. This logic simplifies performance tuning and renders the fs.gs.inputstream.footer.prefetch.size property obsolete.
- Delete the unused fs.gs.inputstream.support.content.encoding.enable property.
- Update all dependencies to latest versions.
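To illustrate the generation read consistency modes described above, here is a minimal, hedged configuration sketch (the class name is invented; the property name and mode values are the ones documented in this release):

    import org.apache.hadoop.conf.Configuration;

    public class GenerationConsistencySettings {
      public static Configuration apply(Configuration conf) {
        // LATEST (default): always read the live object version.
        // BEST_EFFORT: pin reads to the generation seen when the read channel was
        //              opened, falling back to the live version if it disappears.
        // STRICT: pin reads to that generation and fail with FileNotFound otherwise.
        conf.set("fs.gs.generation.read.consistency", "BEST_EFFORT");
        return conf;
      }
    }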
BigQuery connector:
- POM updates for GCS connector 1.9.4.
- Update all dependencies to latest versions.
2018-07-25 (GCS 1.9.3, BQ 0.13.3)
Changelog
Cloud Storage connector:
- Ignore the fs.gs.io.buffer property when determining the HTTP range request size in fadvise RANDOM mode; previously this property was used to limit the minimum size of HTTP range requests.
- Reuse the prefetched footer when reading the end of a file.
- Always skip in place for gzip-encoded files.
- Fix Ivy compatibility: resolve artifact versions in released pom.xml files.
BigQuery connector:
- POM updates for GCS connector 1.9.3.
- Fix Ivy compatibility: resolve artifact versions in released pom.xml files.
2018-07-18 (GCS 1.9.2, BQ 0.13.2)
Changelog
Cloud Storage connector:
- Report the UGI user in FileStatus instead of the process owner.
- Implement automatic fadvise (adaptive range reads). In this mode the connector starts to send bounded range requests instead of streaming range requests when reading non-gzip-encoded files, once the first backward read or a forward read of more than fs.gs.inputstream.inplace.seek.limit bytes is detected. To activate this behavior, set the property (see the configuration sketch after this list):
  fs.gs.inputstream.fadvise=AUTO (default: SEQUENTIAL)
- Add an option to prefetch the footer when creating a GoogleCloudStorageReadChannel in AUTO and RANDOM fadvise modes. The prefetch size is configured via the property:
  fs.gs.inputstream.footer.prefetch.size (default: 0)
  This optimization is helpful when reading objects in formats that store metadata in a footer at the end of the file, such as Parquet and ORC.
  Note: for this optimization to work, the specified footer prefetch size should be greater than or equal to the actual metadata size stored in the file footer.
  To disable footer pre-fetching, set this property to 0.
- Cache object metadata in PerformanceCachingGoogleCloudStorage using GCS List Objects requests.
- Change default values of properties:
  fs.gs.inputstream.min.range.request.size (default: 1048576 -> 524288)
  fs.gs.performance.cache.max.entry.age.ms (default: 3000 -> 5000)
  fs.gs.performance.cache.list.caching.enable (default: true -> false)
- Change the default OAuth 2.0 token server URL to https://oauth2.googleapis.com/token. The default OAuth 2.0 token server URL can be changed via the environment variable:
  GOOGLE_OAUTH_TOKEN_SERVER_URL
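The adaptive fadvise mode and footer prefetch above are plain configuration properties; as a hedged sketch (hypothetical class name, illustrative prefetch size), they might be enabled like this:

    import org.apache.hadoop.conf.Configuration;

    public class AdaptiveFadviseSettings {
      public static Configuration apply(Configuration conf) {
        // Switch from streaming reads to bounded range requests automatically
        // once a backward seek or a long forward seek is detected.
        conf.set("fs.gs.inputstream.fadvise", "AUTO"); // default: SEQUENTIAL
        // Prefetch 1 MiB of footer when the read channel is created; the value
        // should be at least the size of the file's footer metadata (Parquet/ORC).
        conf.setLong("fs.gs.inputstream.footer.prefetch.size", 1024 * 1024);
        return conf;
      }
    }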
BigQuery connector:
- POM updates for GCS connector 1.9.2.
2018-07-11 (GCS 1.9.1, BQ 0.13.1)
Changelog
Cloud Storage connector:
- Fix PerformanceCachingGoogleCloudStorage.
- Send only one GCS metadata request per GoogleCloudStorageReadChannel object lifecycle to reduce the number of GCS requests when reading objects.
- Always fail fast when creating a GoogleCloudStorageReadChannel instance for a non-existent GCS object. Remove the property that disabled this behavior:
  fs.gs.inputstream.fast.fail.on.not.found.enable
- For gzip-encoded objects, always return Long.MAX_VALUE from the GoogleCloudStorageReadChannel.size() method until the object has been fully read. This fixes a bug where clients that rely on the size method could stop reading the object prematurely.
- Implement the fadvise feature, which allows reading objects in random mode in addition to sequential mode (the current behavior). In random mode the connector sends bounded range requests (HTTP Range header) to GCS, which are more efficient in some cases (e.g. reading objects in row-columnar file formats such as ORC and Parquet). The range request size is limited by whichever is greater: fs.gs.io.buffer or the read buffer size passed by the client. To avoid sending very small range requests (a couple of bytes), which could happen if fs.gs.io.buffer is 0 and the client passes a very small read buffer, the minimum range request size is limited to 1 MiB by default. To override this limit and set a different minimum range request size, use the property:
  fs.gs.inputstream.min.range.request.size (default: 1048576)
  To enable fadvise random mode, set the property (see the configuration sketch after this list):
  fs.gs.inputstream.fadvise=RANDOM (default: SEQUENTIAL)
- Do not close the GCS read channel when calling the GoogleCloudStorageReadChannel.position(long) method.
- Remove the property that disabled use of the includeTrailingDelimiter GCS parameter, after it was verified in production for a while:
  fs.gs.list.directory.objects.enable
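A rough sketch of enabling fadvise random mode together with a custom minimum range request size (the bucket path and class name are hypothetical; the property names are from the notes above):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RandomFadviseExample {
      public static FileSystem openGcs(Configuration conf) throws Exception {
        // Bounded HTTP Range requests suit row-columnar formats (ORC, Parquet).
        conf.set("fs.gs.inputstream.fadvise", "RANDOM"); // default: SEQUENTIAL
        // Keep range requests at least 1 MiB, even for tiny client read buffers.
        conf.setLong("fs.gs.inputstream.min.range.request.size", 1048576);
        // Assumes the GCS connector is on the classpath and credentials are configured.
        return FileSystem.get(URI.create("gs://example-bucket/"), conf);
      }
    }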
BigQuery connector:
- POM updates for GCS connector 1.9.1.
2018-06-15 (GCS 1.9.0, BQ 0.13.0)
Changelog
Cloud Storage connector:
- Update all dependencies to latest versions.
- Delete the metadata cache functionality, because Cloud Storage already has strong native list operation consistency. Deleted properties:
  fs.gs.metadata.cache.enable
  fs.gs.metadata.cache.type
  fs.gs.metadata.cache.directory
  fs.gs.metadata.cache.max.age.info.ms
  fs.gs.metadata.cache.max.age.entry.ms
- Decrease the default value for max requests per batch from 1,000 to 30.
- Make the max requests per batch value configurable with the property (see the configuration sketch after this list):
  fs.gs.max.requests.per.batch (default: 30)
- Support Hadoop 3.
- Change the Maven project structure to be better compatible with IDEs.
- Delete the deprecated GoogleHadoopGlobalRootedFileSystem.
- Fix thread leaks that were occurring when YARN log aggregation uploaded logs to GCS.
- Add an interface through which the user can directly provide the access token.
- Add more retries and error handling in GoogleCloudStorageReadChannel to make it more resilient to network errors; also add a property that lets users specify the number of retries on low-level GCS HTTP requests in case of server errors and I/O errors.
- Add properties that let users specify the connect timeout and read timeout on low-level GCS HTTP requests.
- Include prefix/directory object metadata in storage.objects.list request responses to improve performance (i.e. set the includeTrailingDelimiter parameter for storage.objects.list GCS requests to true).
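For the batching change, a minimal hedged sketch (hypothetical class name; the property name and default are from the notes above):

    import org.apache.hadoop.conf.Configuration;

    public class BatchSizeSettings {
      public static Configuration apply(Configuration conf) {
        // Cap the number of requests packed into a single GCS batch call;
        // this release lowers the default from 1,000 to 30.
        conf.setInt("fs.gs.max.requests.per.batch", 30);
        return conf;
      }
    }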
BigQuery connector:
- POM updates for GCS connector 1.9.0.
- Update all dependencies to latest versions.
- Change Maven project structure to be better compatible with IDEs.
- Support Hadoop 3.
- Default BigQueryInputFormats to use unsharded exports and deprecate sharded exports.
- Deprecate BigQueryOutputFormat in favor of IndirectBigQueryOutputFormat.
- Add an interface through which the user can directly provide the access token.
- Support Cloud KMS key name in the output table spec.
2018-06-12 (GCS 1.6.7, BQ 0.10.8)
Changelog
Cloud Storage connector:
- Remove Hadoop 3 support.
- Add an interface through which the user can directly provide the access token.
- Update Hadoop dependencies to version 2.8.4.
- Add more retries and error handling in GoogleCloudStorageReadChannel to make it more resilient to network errors; also add a property that lets users specify the number of retries on low-level GCS HTTP requests in case of server errors and I/O errors.
- Add properties that let users specify the connect timeout and read timeout on low-level GCS HTTP requests.
- Include prefix/directory object metadata in storage.objects.list request responses to improve performance (i.e. set the includeTrailingDelimiter parameter for storage.objects.list GCS requests to true).
BigQuery connector:
- POM updates for GCS connector 1.6.7.
- Remove Hadoop 3 support.
- Deprecate BigQueryOutputFormat in favor of IndirectBigQueryOutputFormat.
- Add an interface through which the user can directly provide the access token.
2018-05-08 (GCS 1.6.6, BQ 0.10.7)
Changelog
Cloud Storage connector:
- Support Hadoop 3.
- Change Maven project structure to be better compatible with IDEs.
- Fix thread leaks that were occurring when YARN log aggregation uploaded logs to GCS.
BigQuery connector:
- POM updates for GCS connector 1.6.6.
- Change Maven project structure to be better compatible with IDEs.
- Support Hadoop 3.
2018-04-12 (GCS 1.6.5, BQ 0.10.6)
Changelog
Cloud Storage connector:
- Add support for using Cloud Storage Rewrite requests for the copy operation (see the configuration sketch after this list):
  fs.gs.copy.with.rewrite.enable (default: false)
  This allows copying files between different locations and storage classes.
- Update all dependencies to latest versions.
- Decrease the default value for max requests per batch from 1,000 to 30.
- Make the max requests per batch value configurable with the property:
  fs.gs.max.requests.per.batch (default: 30)
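A minimal sketch, assuming the usual Hadoop Configuration mechanism (the class name is hypothetical), of opting into rewrite-based copies:

    import org.apache.hadoop.conf.Configuration;

    public class RewriteCopySettings {
      public static Configuration apply(Configuration conf) {
        // Use the Cloud Storage Rewrite API for copies, which allows copying
        // between different locations and storage classes (off by default here).
        conf.setBoolean("fs.gs.copy.with.rewrite.enable", true);
        return conf;
      }
    }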
BigQuery connector:
- Wire location through load, extract, and query jobs.
- Always require at least 2 partitions for sharded exports.
- Update all dependencies to latest versions.
- POM updates for GCS connector 1.6.5.