Releases: GoogleCloudDataproc/hadoop-connectors

2018-08-09 (GCS 1.9.5, BQ 0.13.5)

10 Aug 05:19

Changelog

Cloud Storage connector:

  1. Improve build configuration (pom.xmls) compatibility with Maven release plugin.

    Changes version string from 1.9.5-hadoop2 to hadoop2-1.9.5.

  2. Update Maven plugins versions.

  3. Do not send batch request when performing operations (rename, delete, copy) on 1 object.

  4. Add fs.gs.performance.cache.dir.metadata.prefetch.limit (default: 1000) configuration property to control the number of metadata objects prefetched from the same directory by PerformanceCachingGoogleCloudStorage.

    To disable metadata prefetching set property value to 0.

    To prefetch metadata for all objects in a directory, set the property value to -1.

  5. Add configuration properties to control batching of copy operations separately from other operations:

    fs.gs.copy.max.requests.per.batch (default: 30)
    fs.gs.copy.batch.threads (default: 0)
    
  6. Fix RejectedExecutionException during parallel execution of GCS batch requests.

  7. Change default values for GCS batch/directory operations properties:

    fs.gs.copy.with.rewrite.enable (default: false -> true)
    fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
    fs.gs.copy.batch.threads (default: 0 -> 50)
    fs.gs.max.requests.per.batch (default: 30 -> 25)
    fs.gs.batch.threads (default: 0 -> 25)
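
The prefetch limit values described in item 4 can be illustrated with a small sketch. This is not the connector's code (the connector is written in Java); the function name `prefetch_count` and its arguments are hypothetical, and rejecting values below -1 is an assumption made for this example only.

```python
DEFAULT_PREFETCH_LIMIT = 1000  # default stated in the changelog


def prefetch_count(limit: int, objects_in_directory: int) -> int:
    """Return how many objects' metadata would be prefetched for one directory.

    `limit` mirrors fs.gs.performance.cache.dir.metadata.prefetch.limit:
       0 -> prefetching is disabled
      -1 -> prefetch metadata for every object in the directory
       N -> prefetch metadata for at most N objects
    """
    if limit < -1:
        # Assumption for this sketch: other negative values are invalid.
        raise ValueError("limit must be -1, 0, or a positive integer")
    if limit == 0:
        return 0
    if limit == -1:
        return objects_in_directory
    return min(limit, objects_in_directory)


if __name__ == "__main__":
    for limit in (DEFAULT_PREFETCH_LIMIT, 0, -1):
        print(limit, prefetch_count(limit, 5000))
```

With the default limit, a directory holding 5000 objects would have metadata prefetched for only the first 1000 of them in this model.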
    

BigQuery connector:

  1. POM updates for GCS connector 1.9.5.

  2. Improve build configuration (pom.xmls) compatibility with Maven release plugin.

    Changes version string from 0.13.5-hadoop2 to hadoop2-0.13.5.

  3. Update Maven plugins versions.

2018-08-10 (GCS 1.6.8, BQ 0.10.9)

10 Aug 18:11

Changelog

Cloud Storage connector:

  1. Support parallel execution of GCS batch requests.

    Number of threads to execute batch requests configurable via property:

    fs.gs.batch.threads (default: 0)
    

    If fs.gs.batch.threads value is set to 0 then batch requests will be executed sequentially by caller thread.

  2. Do not send batch request when performing operations (rename, delete, copy) on 1 object.

  3. Add configuration properties to control batching of copy operations separately from other operations:

    fs.gs.copy.max.requests.per.batch (default: 30)
    fs.gs.copy.batch.threads (default: 0)
    
  4. Fix RejectedExecutionException during parallel execution of GCS batch requests.

  5. Change default values for GCS batch/directory operations properties:

    fs.gs.copy.with.rewrite.enable (default: false -> true)
    fs.gs.copy.max.requests.per.batch (default: 30 -> 1)
    fs.gs.copy.batch.threads (default: 0 -> 50)
    fs.gs.max.requests.per.batch (default: 30 -> 25)
    fs.gs.batch.threads (default: 0 -> 25)
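
The batching behavior in items 1 and 2 can be sketched as follows. This is an illustrative model, not the connector's implementation: `execute`, `split_into_batches`, `send_single`, and `send_batch` are hypothetical names, and the defaults used are the new values for fs.gs.max.requests.per.batch and fs.gs.batch.threads listed in item 5.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_batches(requests: Sequence, max_per_batch: int) -> List[list]:
    """Split requests into consecutive batches of at most max_per_batch items."""
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be at least 1")
    return [
        list(requests[i:i + max_per_batch])
        for i in range(0, len(requests), max_per_batch)
    ]


def execute(
    requests: Sequence,
    send_single: Callable,
    send_batch: Callable,
    max_per_batch: int = 25,
    threads: int = 25,
) -> list:
    """Run requests, batching only when there is more than one of them."""
    if not requests:
        return []
    if len(requests) == 1:
        # A single object does not need a batch request.
        return [send_single(requests[0])]
    batches = split_into_batches(requests, max_per_batch)
    if threads == 0:
        # 0 threads: the caller thread executes batches sequentially.
        batch_results = [send_batch(batch) for batch in batches]
    else:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # map() returns results in the order the batches were submitted.
            batch_results = list(pool.map(send_batch, batches))
    return [result for batch in batch_results for result in batch]
```

Here `send_batch` is expected to return one result per request in the batch, so the flattened output lines up with the input order.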
    

BigQuery connector:

  1. POM updates for GCS connector 1.6.8.

2018-08-07 (GCS 1.9.4, BQ 0.13.4)

08 Aug 00:10

Changelog

Cloud Storage connector:

  1. Add fs.gs.generation.read.consistency (default: LATEST) property to determine read consistency across different generations of a GCS object.

    Three modes are supported:

    • LATEST: this is the default behavior. The connector will ignore generation ID of the GCS objects and always try to read the live version.
    • BEST_EFFORT: The connector will try to read the generation determined when the GoogleCloudStorageReadChannel is first established. However, if that generation can no longer be found, the connector will fall back to reading the live version. This mode can improve performance by requesting the same object generation from GCS. In this mode the connector can read changing objects from GCS buckets with object versioning disabled without failing.
    • STRICT: The connector will always try to read the generation determined when the GoogleCloudStorageReadChannel is first established, and report error (FileNotFound) when that generation cannot be found anymore.

    Note that this property applies only to new streams opened after the generation is determined. It does not affect reads from streams that are already open, the pre-fetched footer, or the object metadata.

  2. Support parallel execution of GCS batch requests.

    Number of threads to execute batch requests configurable via property:

    fs.gs.batch.threads (default: 0)
    

    If fs.gs.batch.threads value is set to 0 then batch requests will be executed sequentially by caller thread.

  3. Do not fail-fast when creating GoogleCloudStorageReadChannel instance for non-existing object to avoid GCS metadata request.

  4. Add property to fail fast with FileNotFoundException when calling GoogleCloudStorageImpl#open method (costs additional GCS metadata request):

    fs.gs.inputstream.fast.fail.on.not.found.enable (default: true)
    
  5. Lazily initialize GoogleCloudStorageReadChannel metadata after first read operation.

  6. Lazily pre-fetch footer in AUTO and RANDOM fadvise modes when reading end of the file using GoogleCloudStorageReadChannel.

  7. Delete fs.gs.inputstream.footer.prefetch.size property and use fs.gs.inputstream.min.range.request.size property for determining lazy footer prefetch size.

    Because GoogleCloudStorageReadChannel makes the first read without knowing the object size, it uses a heuristic to lazily prefetch at most fs.gs.inputstream.min.range.request.size / 2 bytes before the read channel position in case this is a footer read. This logic simplifies performance tuning and renders the fs.gs.inputstream.footer.prefetch.size property obsolete.

  8. Delete unused fs.gs.inputstream.support.content.encoding.enable property.

  9. Update all dependencies to latest versions.
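
The three fs.gs.generation.read.consistency modes from item 1 can be modeled with a short sketch. It is a simplified illustration, not the connector's code: `generation_to_read`, `pinned_generation`, and `available_generations` are hypothetical names, and returning `None` to mean "read the live version" is a convention chosen for this example.

```python
from enum import Enum
from typing import Optional, Set


class GenerationReadConsistency(Enum):
    LATEST = "LATEST"
    BEST_EFFORT = "BEST_EFFORT"
    STRICT = "STRICT"


def generation_to_read(
    mode: GenerationReadConsistency,
    pinned_generation: int,
    available_generations: Set[int],
) -> Optional[int]:
    """Pick the generation to request; None means the live version.

    pinned_generation stands for the generation determined when the
    read channel was first established.
    """
    if mode is GenerationReadConsistency.LATEST:
        # Generation ID is ignored; always read the live version.
        return None
    if pinned_generation in available_generations:
        return pinned_generation
    if mode is GenerationReadConsistency.BEST_EFFORT:
        # Pinned generation is gone: fall back to the live version.
        return None
    # STRICT: a missing pinned generation is reported as an error.
    raise FileNotFoundError(
        f"generation {pinned_generation} can no longer be found"
    )
```

The only difference between BEST_EFFORT and STRICT in this model is what happens once the pinned generation disappears: a fallback in the first case, an error in the second.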

BigQuery connector:

  1. POM updates for GCS connector 1.9.4.
  2. Update all dependencies to latest versions.

2018-07-25 (GCS 1.9.3, BQ 0.13.3)

26 Jul 00:47

Changelog

Cloud Storage connector:

  1. In fadvise RANDOM mode, ignore the fs.gs.io.buffer property, which was used to limit the minimum size of HTTP range requests, when determining HTTP range request size.
  2. Reuse prefetched footer when reading end of the file.
  3. Always skip in place for gzip-encoded files.
  4. Fix Ivy compatibility - resolve artifact versions in released pom.xml files.

BigQuery connector:

  1. POM updates for GCS connector 1.9.3.
  2. Fix Ivy compatibility - resolve artifact versions in released pom.xml files.

2018-07-18 (GCS 1.9.2, BQ 0.13.2)

18 Jul 17:37

Changelog

Cloud Storage connector:

  1. Report the UGI user in FileStatus instead of process owner.

  2. Implement automatic fadvise (adaptive range reads). In this mode, when reading non-gzip-encoded files, the connector switches from streaming range requests to bounded range requests after it detects the first backward read or a forward read of more than fs.gs.inputstream.inplace.seek.limit bytes.

    To activate this behavior set the property:

    fs.gs.inputstream.fadvise=AUTO (default: SEQUENTIAL)
    
  3. Add an option to prefetch footer when creating GoogleCloudStorageReadChannel in AUTO and RANDOM fadvise mode. Prefetch size is configured via property:

    fs.gs.inputstream.footer.prefetch.size (default: 0)
    

    This optimization is helpful when reading objects in a format that stores metadata in a footer at the end of the file, like Parquet and ORC.

    Note: for this optimization to work, the specified footer prefetch size should be greater than or equal to the actual size of the metadata stored in the file footer.

    To disable footer pre-fetching set this property to 0.

  4. Cache objects metadata in PerformanceCachingGoogleCloudStorage using GCS ListObjects requests.

  5. Change default values of properties:

    fs.gs.inputstream.min.range.request.size (default: 1048576 -> 524288)
    fs.gs.performance.cache.max.entry.age.ms (default: 3000 -> 5000)
    fs.gs.performance.cache.list.caching.enable (default: true -> false)
    
  6. Change default OAuth 2.0 token server URL to https://oauth2.googleapis.com/token.

    Default OAuth 2.0 token server URL could be changed via environment variable:

    GOOGLE_OAUTH_TOKEN_SERVER_URL
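
The automatic fadvise behavior from item 2 can be sketched as a small state machine. This is a simplified model, not the connector's implementation: the class and method names are hypothetical, the seek limit passed in stands for fs.gs.inputstream.inplace.seek.limit, and the model tracks only the requested positions, ignoring how many bytes each read consumes.

```python
class AdaptiveFadvise:
    """Model of switching from streaming to bounded range requests."""

    def __init__(self, inplace_seek_limit: int, gzip_encoded: bool = False):
        self.inplace_seek_limit = inplace_seek_limit
        self.gzip_encoded = gzip_encoded
        self.random_mode = False  # starts out reading sequentially
        self.position = 0

    def read_at(self, new_position: int) -> str:
        """Record a read at new_position and return the request kind used."""
        distance = new_position - self.position
        if not self.gzip_encoded and not self.random_mode:
            backward_read = distance < 0
            long_forward_read = distance > self.inplace_seek_limit
            if backward_read or long_forward_read:
                # The switch is one-way in this model.
                self.random_mode = True
        self.position = new_position
        return "bounded" if self.random_mode else "streaming"
```

A reader that only moves forward in small steps stays on streaming requests, while one backward read (typical for footer-first formats) flips it to bounded requests for the rest of the channel's life.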
    

BigQuery connector:

  1. POM updates for GCS connector 1.9.2.

2018-07-11 (GCS 1.9.1, BQ 0.13.1)

12 Jul 04:46

Changelog

Cloud Storage connector:

  1. Fix PerformanceCachingGoogleCloudStorage.

  2. Send only 1 GCS metadata request per GoogleCloudStorageReadChannel object lifecycle to reduce number of GCS requests when reading objects.

  3. Always fail-fast when creating GoogleCloudStorageReadChannel instance for non-existing GCS object. Remove property that disables this:

    fs.gs.inputstream.fast.fail.on.not.found.enable
    
  4. For gzip-encoded objects, always return Long.MAX_VALUE from the GoogleCloudStorageReadChannel.size() method until the object is fully read. This fixes a bug where clients that rely on the size method could stop reading an object prematurely.

  5. Implement fadvise feature that allows reading objects in random mode in addition to sequential mode (current behavior).

    In random mode connector will send bounded range requests (HTTP Range header) to GCS which are more efficient in some cases (e.g. reading objects in row-columnar file formats like ORC, Parquet, etc).

    Range request size is limited by whichever is greater: fs.gs.io.buffer or the read buffer size passed by the client.

    To avoid sending range requests that are too small (a couple of bytes), which could happen if fs.gs.io.buffer is 0 and the client passes a very small read buffer, the minimum range request size is limited to 1 MiB by default. To override this limit and set the minimum range request size to a different value, use the property:

    fs.gs.inputstream.min.range.request.size (default: 1048576)
    

    To enable fadvise random mode set property:

    fs.gs.inputstream.fadvise=RANDOM (default: SEQUENTIAL)
    
  6. Do not close GCS read channel when calling GoogleCloudStorageReadChannel.position(long) method.

  7. Remove property that disables use of includeTrailingDelimiter GCS parameter after it was verified in production for a while:

    fs.gs.list.directory.objects.enable
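
The range request sizing described in item 5 for this release can be written out as a short calculation. This is a sketch under stated assumptions, not the connector's code: the function names are hypothetical, and the sketch does not clamp the range to the object size.

```python
DEFAULT_MIN_RANGE_REQUEST_SIZE = 1048576  # 1 MiB, default in this release


def range_request_size(
    io_buffer: int,
    client_read_buffer: int,
    min_range_request_size: int = DEFAULT_MIN_RANGE_REQUEST_SIZE,
) -> int:
    """Size of a bounded range request in fadvise RANDOM mode.

    The size is the greater of fs.gs.io.buffer and the client's read
    buffer, but never less than fs.gs.inputstream.min.range.request.size.
    """
    return max(io_buffer, client_read_buffer, min_range_request_size)


def range_header(position: int, size: int) -> str:
    """Build an HTTP Range header value; the end offset is inclusive."""
    return f"bytes={position}-{position + size - 1}"


if __name__ == "__main__":
    size = range_request_size(io_buffer=0, client_read_buffer=16)
    print(range_header(0, size))
```

With fs.gs.io.buffer set to 0 and a 16-byte client buffer, the request is still raised to the 1 MiB minimum rather than fetching a handful of bytes.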
    

BigQuery connector:

  1. POM updates for GCS connector 1.9.1.

2018-06-15 (GCS 1.9.0, BQ 0.13.0)

15 Jun 17:11

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.

  2. Delete metadata cache functionality because Cloud Storage has strong native list operation consistency already. Deleted properties:

    fs.gs.metadata.cache.enable
    fs.gs.metadata.cache.type
    fs.gs.metadata.cache.directory
    fs.gs.metadata.cache.max.age.info.ms
    fs.gs.metadata.cache.max.age.entry.ms
    
  3. Decrease default value for max requests per batch from 1,000 to 30.

  4. Make max requests per batch value configurable with property:

    fs.gs.max.requests.per.batch (default: 30)
    
  5. Support Hadoop 3.

  6. Change Maven project structure to improve compatibility with IDEs.

  7. Delete deprecated GoogleHadoopGlobalRootedFileSystem.

  8. Fix thread leaks that were occurring when YARN log aggregation uploaded logs to GCS.

  9. Add interface through which user can directly provide the access token.

  10. Add more retries and error handling in GoogleCloudStorageReadChannel, to make it more resilient to network errors; also add a property to allow users to specify number of retries on low level GCS HTTP requests in case of server errors and I/O errors.

  11. Add properties to allow users to specify connect timeout and read timeout on low level GCS HTTP requests.

  12. Include prefix/directory objects metadata into storage.objects.list requests response to improve performance (i.e. set includeTrailingDelimiter parameter for storage.objects.list GCS requests to true).

BigQuery connector:

  1. POM updates for GCS connector 1.9.0.
  2. Update all dependencies to latest versions.
  3. Change Maven project structure to improve compatibility with IDEs.
  4. Support Hadoop 3.
  5. Default BigQueryInputFormats to use unsharded exports and deprecate sharded exports.
  6. Deprecate BigQueryOutputFormat in favor of IndirectBigQueryOutputFormat.
  7. Add interface through which user can directly provide the access token.
  8. Support Cloud KMS key name in the output table spec.

2018-06-12 (GCS 1.6.7, BQ 0.10.8)

12 Jun 22:35

Changelog

Cloud Storage connector:

  1. Remove Hadoop3 support.
  2. Add interface through which user can directly provide the access token.
  3. Update Hadoop dependencies to 2.8.4 version.
  4. Add more retries and error handling in GoogleCloudStorageReadChannel, to make it more resilient to network errors; also add a property to allow users to specify number of retries on low level GCS HTTP requests in case of server errors and I/O errors.
  5. Add properties to allow users to specify connect timeout and read timeout on low level GCS HTTP requests.
  6. Include prefix/directory objects metadata into storage.objects.list requests response to improve performance (i.e. set includeTrailingDelimiter parameter for storage.objects.list GCS requests to true).

BigQuery connector:

  1. POM updates for GCS connector 1.6.7.
  2. Remove Hadoop3 support.
  3. Deprecate BigQueryOutputFormat in favor of IndirectBigQueryOutputFormat.
  4. Add interface through which user can directly provide the access token.

2018-05-08 (GCS 1.6.6, BQ 0.10.7)

09 May 01:28

Changelog

Cloud Storage connector:

  1. Support Hadoop 3.
  2. Change Maven project structure to improve compatibility with IDEs.
  3. Fix thread leaks that were occurring when YARN log aggregation uploaded logs to GCS.

BigQuery connector:

  1. POM updates for GCS connector 1.6.6.
  2. Change Maven project structure to improve compatibility with IDEs.
  3. Support Hadoop 3.

2018-04-12 (GCS 1.6.5, BQ 0.10.6)

13 Apr 00:07

Changelog

Cloud Storage connector:

  1. Add support for using Cloud Storage Rewrite requests for copy operation:

    fs.gs.copy.with.rewrite.enable (default: false)
    

    This allows copying files between different locations and storage classes.

  2. Update all dependencies to latest versions.

  3. Decrease default value for max requests per batch from 1,000 to 30.

  4. Make max requests per batch value configurable with property:

    fs.gs.max.requests.per.batch (default: 30)
    

BigQuery connector:

  1. Wire location through load, extract, and query jobs.
  2. Always require at least 2 partitions for sharded exports.
  3. Update all dependencies to latest versions.
  4. POM updates for GCS connector 1.6.5.