Downloader Users Guide

kenisteward edited this page Aug 14, 2017 · 13 revisions

Description

The herd downloader application is a command line program that copies data (files/directories) registered with the herd Registry from an S3 bucket to the local file system. Along with the data, the downloader creates a "manifest.json" side-car file.

The JAR is built as part of the herd application suite in the dm-tools project.

The downloader uses the Amazon S3 SDK which downloads files into the system temporary directory (e.g. /tmp). You should ensure there is adequate space in the temporary directory if large numbers of files are downloaded.
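A script that wraps the downloader can check the temporary directory before starting a large download. The following is a minimal Python sketch using only the standard library. The function name `has_enough_temp_space` and the 10% safety margin are illustrative assumptions, and it assumes the JVM's temporary directory is the same one Python reports through `tempfile.gettempdir()`.

```python
import shutil
import tempfile


def has_enough_temp_space(required_bytes, temp_dir=None, margin=1.1):
    """Return True if temp_dir has at least required_bytes * margin free."""
    # Fall back to the platform temporary directory (e.g. /tmp on Linux).
    # Assumption: the JVM running the downloader uses the same directory.
    temp_dir = temp_dir or tempfile.gettempdir()
    free_bytes = shutil.disk_usage(temp_dir).free
    return free_bytes >= required_bytes * margin
```

If the JVM is started with a different temporary directory, pass that directory as `temp_dir` instead.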

Command Line Summary

java -jar dm-downloader.jar
  [-a <S3AccessKey>]
  [-p <S3SecretKey>]
  [-e <S3Endpoint>]
  -l <LocalDirPath>
  -m <ManifestFilePath>
  -H <RegServerHost>
  -P <RegServerPort>
  [-s true]
  [-u <username>]
  [-w <password>]
  [-n <HttpProxyHost>]
  [-o <HttpProxyPort>]
  [-t <MaxThreads>]
  [-c <socketTimeout>]

Options

-a <arg>, --s3AccessKey <arg>

  • Required: No
  • Type: String

The AWS access key ID used to identify the user making S3 service requests. When specified, make sure the s3SecretKey is also specified. If the s3AccessKey and s3SecretKey parameters aren't both specified, the AWS Java default credential provider chain is used to find credentials. If no credentials are found, an error results. See the following link for more details: http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html.

-p <arg>, --s3SecretKey <arg>

  • Required: No
  • Type: String

The AWS secret access key to be used to authenticate the user making S3 service requests. When specified, make sure the s3AccessKey is also specified.

-e <arg>, --s3EndPoint <arg>

  • Required: No
  • Type: String

The optional Amazon S3 endpoint to use when making S3 service calls.

-l <arg>, --localPath <arg>

  • Required: Yes
  • Type: String

The path to a local directory, relative to which the downloaded files will be created. The local path and the S3 path, which was prepended to the data when uploaded, are used together to build the target local directory where the downloaded files will be created. Note that the target local directory must be empty, but the local path does not have to be.
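The empty-directory requirement can be checked before launching the downloader. The sketch below is illustrative only. It assumes that you already know the S3 key prefix, that the target directory is simply the local path joined with that prefix, and that a target directory which does not exist yet is acceptable. The function name `target_dir_is_usable` is hypothetical.

```python
from pathlib import Path


def target_dir_is_usable(local_path, s3_key_prefix):
    """Return True if the target directory is missing or exists and is empty."""
    # Assumption: target directory = local path joined with the S3 key prefix.
    target = Path(local_path) / s3_key_prefix.strip("/")
    if not target.exists():
        return True
    # An existing target must be a directory with no entries in it.
    return target.is_dir() and not any(target.iterdir())
```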

-m <arg>, --manifestPath <arg>

  • Required: Yes
  • Type: String

Local path to the manifest file.

-H <arg>, --regServerHost <arg>

  • Required: Yes
  • Type: String

Registration Server hostname.

-P <arg>, --regServerPort <arg>

  • Required: Yes
  • Type: Integer

Registration Server port.

-Y <arg>, --dmRegServerHost <arg>

  • Required: Yes
  • Type: String

DEPRECATED. Use regServerHost parameter.

-Z <arg>, --dmRegServerPort <arg>

  • Required: Yes
  • Type: Integer

DEPRECATED. Use regServerPort parameter.

-s, --ssl

  • Required: No
  • Type: Boolean
  • Default: false

If set to true, enables SSL (HTTPS) to communicate with the herd Registration Service. Otherwise, uses HTTP.

-u <arg>, --username <arg>

  • Required: No
  • Type: String

The username used for HTTPS client authentication with the herd Registration Service.

Note: To avoid complications with parsing a username that contains spaces, enclose the username in "" (double quotes).

-w <arg>, --password <arg>

  • Required: No
  • Type: String

The password used for HTTPS client authentication with the herd Registration Service.

Note: To avoid complications with parsing the password, enclose the password in "" (double quotes).

-h, --help

  • Required: No

Display usage information and exit.

-v, --version

  • Required: No

Display version information and exit.

-n <arg>, --httpProxyHost <arg>

  • Required: No
  • Type: String

The hostname of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyPort is also specified when using this option.

-o <arg>, --httpProxyPort <arg>

  • Required: No
  • Type: Integer

The port number of an HTTP proxy that will be used when connecting to the S3 service. This is needed when a direct HTTP connection isn't allowed. Make sure the httpProxyHost is also specified when using this option.

-t <arg>, --maxThreads <arg>

  • Required: No
  • Type: Integer
  • Default: 10

The maximum number of threads to use during file transfers. If this argument isn't specified, a suitable default will be used. Amazon does a good job of determining how many threads to use so it is not recommended to use this option unless there is a specific need. Please note that we are only expecting to get ~55Mbps of throughput per thread, so please run the tool on the appropriate box given required performance.
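The per-thread throughput figure above allows a rough estimate of transfer time. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes that "Mbps" means 10^6 bits per second and that throughput scales linearly with the number of threads, which the network or disk of a given box may not sustain. The function name is illustrative.

```python
def estimate_download_seconds(total_bytes, threads=10, mbps_per_thread=55.0):
    """Rough transfer time, assuming throughput scales linearly with threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    bits = total_bytes * 8
    # Aggregate rate in bits per second across all threads.
    aggregate_bits_per_second = threads * mbps_per_thread * 1_000_000
    return bits / aggregate_bits_per_second
```

For example, `estimate_download_seconds(100 * 10**9)` (100 GB with the default 10 threads) comes to roughly 1,455 seconds, or about 24 minutes.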

-c <arg>, --socketTimeout <arg>

  • Required: No
  • Type: Integer
  • Default: 50000
  • Release: 0.18.0

The socket timeout in milliseconds. 0 indicates no timeout.
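When the downloader is launched from a script, the options above can be assembled into an argument list. The following Python sketch is one possible wrapper, not part of herd. It uses only flags documented in this guide and the JAR name from the Command Line Summary. Rejecting a half-specified key pair or proxy pair is a design choice of this sketch, and the function name `build_downloader_command` is hypothetical.

```python
def build_downloader_command(local_path, manifest_path, host, port, *,
                             access_key=None, secret_key=None,
                             username=None, password=None,
                             proxy_host=None, proxy_port=None,
                             ssl=False, jar="dm-downloader.jar"):
    """Build the downloader argument list, rejecting half-specified pairs."""
    pairs = {
        "s3AccessKey/s3SecretKey": (access_key, secret_key),
        "httpProxyHost/httpProxyPort": (proxy_host, proxy_port),
    }
    for name, (first, second) in pairs.items():
        # Both values of a pair must be given, or both omitted.
        if (first is None) != (second is None):
            raise ValueError(f"{name} must be specified together")

    cmd = ["java", "-jar", jar,
           "-l", str(local_path), "-m", str(manifest_path),
           "-H", host, "-P", str(port)]
    if access_key is not None:
        cmd += ["-a", access_key, "-p", secret_key]
    if ssl:
        cmd += ["-s", "true"]
    if username is not None:
        cmd += ["-u", username]
    if password is not None:
        cmd += ["-w", password]
    if proxy_host is not None:
        cmd += ["-n", proxy_host, "-o", str(proxy_port)]
    return cmd
```

Passing the result as a list to `subprocess.run` hands each value to the JVM as a single argument, so a username or password containing spaces needs no extra shell quoting.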

Return Codes

The command line program returns zero when execution succeeds and non-zero when execution fails.
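A calling script can rely on this convention to stop a pipeline when a download fails. The sketch below is a minimal example using Python's standard `subprocess` module. The function name `run_and_check` is illustrative, and the command list is whatever `java -jar ...` invocation you would otherwise type on the command line.

```python
import subprocess


def run_and_check(cmd):
    """Run a command and raise if it exits with a non-zero code."""
    # Output from the child process goes straight to the console.
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise RuntimeError(f"command failed with exit code {result.returncode}")
    return result.returncode
```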

Logging

The downloader displays its output, including errors, on the console. Informational messages, such as key program parameters and the total number of files/bytes copied, are also logged.

NOTE: You might see the socket and HTTP exceptions below in the downloader output. These exceptions are safe to ignore, since they are typically handled seamlessly by the AWS Java SDK. Still, if you observe them, try reducing the number of threads used by the affected downloader instance and/or limiting the number of downloader instances (parallel download jobs) that you run on the same box.

  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Timeout waiting for connection
    org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection
        
  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Broken pipe
    java.net.SocketException: Broken pipe

  • ... INFO  com.amazonaws.http.AmazonHttpClient.executeHelper - Unable to execute HTTP request: Socket is closed
    java.net.SocketException: Socket is closed
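To decide whether these messages are frequent enough to justify lowering the thread count, captured console output can be scanned for them. The sketch below is illustrative. The marker strings are taken from the messages listed above, the function name is hypothetical, and any threshold you apply to the count is your own choice.

```python
KNOWN_TRANSIENT_MARKERS = (
    "ConnectionPoolTimeoutException",
    "Timeout waiting for connection",
    "Broken pipe",
    "Socket is closed",
)


def count_transient_error_lines(log_text):
    """Count log lines that contain one of the transient markers."""
    return sum(
        1
        for line in log_text.splitlines()
        if any(marker in line for marker in KNOWN_TRANSIENT_MARKERS)
    )
```

The function counts lines rather than exceptions, so a single exception that logs both a message line and an exception line is counted twice.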
        

Input Manifest "Side-car" File

The information provided in the manifest file, or "side-car" file, is used by the downloader to retrieve information on the business object data registered with the Data Registry. This subsection describes the specification for the manifest file, which includes required and optional fields.

The characteristics of the file should be:

  • Name: <manifest_file_name>.json
  • Type: Text
  • Encoding: UTF8
  • Format: JSON

namespace

  • Required: Yes
  • Type: String
  • Case sensitive: No

The namespace to which the business object definition belongs.

businessObjectDefinitionName

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object definition name (e.g. NEW_ORDERS).

businessObjectFormatUsage

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object format usage (e.g. PRC).

businessObjectFormatFileType

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object format file type (e.g. ORC).

businessObjectFormatVersion

  • Required: No
  • Type: Quoted integer

The business object format version (e.g. 0). When the format version is not specified, the business object data with the latest business object format version available for this partition value is returned.

partitionKey

  • Required: Yes
  • Type: String
  • Case sensitive: No

The business object format partition key (e.g. TDATE).

partitionValue

  • Required: Yes
  • Type: String
  • Case sensitive: Yes

The business object data partition value (e.g. 2014-07-21).

subPartitionValues

  • Required: No
  • Type: List of String
  • Case sensitive: Yes

The business object data sub-partition values.

businessObjectDataVersion

  • Required: No
  • Type: Quoted integer

The business object data version (e.g. 0). When the data version is not specified, the latest version of the business object data is returned.

storageName

  • Required: No
  • Type: String
  • Case sensitive: No

The name of the storage to download from. Defaults to S3_MANAGED.
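The field rules above can be checked before a manifest is handed to the downloader. The following Python sketch is one way to do that, not the downloader's own validation. It treats "quoted integer" as a JSON string made up of ASCII digits, which is an interpretation of the type shown above, and the function name `validate_input_manifest` is hypothetical.

```python
REQUIRED_FIELDS = (
    "namespace",
    "businessObjectDefinitionName",
    "businessObjectFormatUsage",
    "businessObjectFormatFileType",
    "partitionKey",
    "partitionValue",
)
QUOTED_INTEGER_FIELDS = ("businessObjectFormatVersion", "businessObjectDataVersion")


def validate_input_manifest(manifest):
    """Return a list of problems found in an input manifest dict (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = manifest.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    for field in QUOTED_INTEGER_FIELDS:
        if field in manifest:
            value = manifest[field]
            # Assumption: a quoted integer is a string of ASCII digits, e.g. "2".
            if not (isinstance(value, str) and value.isascii() and value.isdigit()):
                problems.append(f"{field} must be a quoted integer")
    sub_values = manifest.get("subPartitionValues", [])
    if not (isinstance(sub_values, list)
            and all(isinstance(v, str) for v in sub_values)):
        problems.append("subPartitionValues must be a list of strings")
    return problems
```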

Input Manifest File Format

{
    "namespace": STRING,
    "businessObjectDefinitionName": STRING,
    "businessObjectFormatUsage": STRING,
    "businessObjectFormatFileType": STRING,
    "businessObjectFormatVersion": STRING,
    "partitionKey": STRING,
    "partitionValue": STRING,
    "subPartitionValues" : [STRING,STRING,STRING,STRING],
    "businessObjectDataVersion": STRING,
    "storageName": STRING
}

Input Manifest File Example

Below is an example of a manifest file (e.g. manifest.json) that retrieves the latest data version of the NEW_ORDERS processed data for 2014-04-01.

{
    "namespace": "APPLICATION_A",
    "businessObjectDefinitionName": "NEW_ORDERS",
    "businessObjectFormatUsage": "PRC",
    "businessObjectFormatFileType": "TXT",
    "businessObjectFormatVersion": "2",
    "partitionKey": "PROCESS_DATE",
    "partitionValue": "2014-04-01"
}
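A manifest like the one above can also be generated by a script, for example once per partition value. The sketch below is a minimal illustration using Python's standard `json` module. The function name and its parameters are hypothetical, and it writes only the fields shown in the example.

```python
import json


def write_input_manifest(path, namespace, definition_name, usage, file_type,
                         partition_key, partition_value, format_version=None):
    """Write an input manifest as UTF-8 JSON and return the dict that was written."""
    manifest = {
        "namespace": namespace,
        "businessObjectDefinitionName": definition_name,
        "businessObjectFormatUsage": usage,
        "businessObjectFormatFileType": file_type,
        "partitionKey": partition_key,
        "partitionValue": partition_value,
    }
    if format_version is not None:
        # The format version is a quoted integer, so store it as a string.
        manifest["businessObjectFormatVersion"] = str(format_version)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=4)
    return manifest
```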

Output Manifest "Side-car" File

Along with the downloaded data, the downloader creates the "manifest.json" side-car file. This subsection describes the specification for the manifest file, which includes required and optional fields.

The characteristics of the file should be:

  • Name: manifest.json
  • Type: Text
  • Encoding: UTF8
  • Format: JSON

The file contains the following fields:

  • namespace: The business object definition namespace.
  • businessObjectDefinitionName: The business object definition name.
  • businessObjectFormatUsage: The business object format usage.
  • businessObjectFormatFileType: The business object format file type.
  • businessObjectFormatVersion: The business object format version.
  • partitionKey: The business object format partition key.
  • partitionValue: The business object data partition value.
  • subPartitionValues: The business object data sub-partition values.
  • businessObjectDataVersion: The business object data version.
  • storageName: The name of the storage.
  • manifestFiles: The list of file information.
      • fileName: The file name of a manifest file.
      • fileSizeBytes: The size in bytes of the contents of the manifest file.
      • rowCount: The row count of a manifest file.
  • attributes: The list of name/value pairs associated with the data.
  • businessObjectDataParents: The list of business object data parents (i.e. predecessors) that were used in the creation of this data.
      • businessObjectDefinitionName: The name of the business object definition for a specific business object data parent.
      • businessObjectFormatUsage: The business object format usage for a specific business object data parent.
      • businessObjectFormatFileType: The business object format file type for a specific business object data parent.
      • businessObjectFormatVersion: The business object format version for a specific business object data parent.
      • partitionValue: The partition value for a specific business object data parent.
      • subPartitionValues: The business object data sub-partition values.
      • businessObjectDataVersion: The business object data version for a specific business object data parent.
  • businessObjectDataChildren: The list of business object data children (i.e. successors) that are dependent on this business object data.

Output Manifest File Format

{
    "namespace": STRING,
    "businessObjectDefinitionName": STRING,
    "businessObjectFormatUsage": STRING,
    "businessObjectFormatFileType": STRING,
    "businessObjectFormatVersion": STRING,
    "partitionKey": STRING,
    "partitionValue": STRING,
    "subPartitionValues" : [STRING,STRING,STRING,STRING],
    "businessObjectDataVersion": STRING,
    "storageName": STRING,
    "manifestFiles" : [ {
        "fileName" : STRING,
        "fileSizeBytes" : NUMBER,
        "rowCount" : NUMBER
    },
    ...
    ],
    "attributes": { STRING: STRING, STRING: STRING, ... },
    "businessObjectDataParents" : [ {
        "businessObjectDefinitionName" : STRING,
        "businessObjectFormatUsage" : STRING,
        "businessObjectFormatFileType" : STRING,
        "businessObjectFormatVersion" : NUMBER,
        "partitionValue" : STRING,
        "subPartitionValues" : [STRING,STRING,STRING,STRING],
        "businessObjectDataVersion" : NUMBER
    },
    ...
    ],
    "businessObjectDataChildren" : [ {
        "businessObjectDefinitionName" : STRING,
        "businessObjectFormatUsage" : STRING,
        "businessObjectFormatFileType" : STRING,
        "businessObjectFormatVersion" : NUMBER,
        "partitionValue" : STRING,
        "subPartitionValues" : [STRING,STRING,STRING,STRING],
        "businessObjectDataVersion" : NUMBER
    },
    ...
    ]
}

Output Manifest File Example

Below is an example of an output manifest file for the NEW_ORDERS processed data for 2014-04-01.

{
    "namespace": "APPLICATION_A",
    "businessObjectDefinitionName": "NEW_ORDERS",
    "businessObjectFormatUsage": "PRC",
    "businessObjectFormatFileType": "TXT",
    "businessObjectFormatVersion": "2",
    "partitionKey": "PROCESS_DATE",
    "partitionValue": "2014-04-01",
    "storageName": "S3_MANAGED",
    "manifestFiles" : [ {
        "fileName" : "testFile1.gz",
        "fileSizeBytes" : 10000,
        "rowCount" : 1000
    }, {
        "fileName" : "testFile2.gz",
        "fileSizeBytes" : 20000,
        "rowCount" : 2000
    } ],
    "attributes": {"name1": "value1", "name2": "value2"},
    "businessObjectDataParents" : [ {
        "businessObjectDefinitionName" : "NEW_ORDERS",
        "businessObjectFormatUsage" : "SRC",
        "businessObjectFormatFileType" : "TXT",
        "businessObjectFormatVersion" : 1,
        "partitionValue" : "2014-04-01",
        "businessObjectDataVersion" : 0
    } ],
    "businessObjectDataChildren" : [ {
        "businessObjectDefinitionName" : "NEW_ORDERS",
        "businessObjectFormatUsage" : "PRC2",
        "businessObjectFormatFileType" : "TXT",
        "businessObjectFormatVersion" : 1,
        "partitionValue" : "2014-04-01",
        "businessObjectDataVersion" : 0
    } ]
}
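The manifestFiles entries make it possible to sanity-check a finished download. The following Python sketch compares on-disk file sizes with fileSizeBytes. It is a sketch under two assumptions that this guide does not confirm: that manifest.json sits in the same directory as the downloaded files, and that each fileName is relative to that directory. The function name is hypothetical.

```python
import json
from pathlib import Path


def verify_downloaded_files(target_dir, manifest_name="manifest.json"):
    """Return (fileName, problem) tuples for files that are missing or mis-sized."""
    target = Path(target_dir)
    manifest = json.loads((target / manifest_name).read_text(encoding="utf-8"))
    problems = []
    for entry in manifest.get("manifestFiles", []):
        # Assumption: fileName is relative to the directory holding the manifest.
        file_path = target / entry["fileName"]
        if not file_path.is_file():
            problems.append((entry["fileName"], "missing"))
        elif file_path.stat().st_size != entry["fileSizeBytes"]:
            problems.append((entry["fileName"], "size mismatch"))
    return problems
```

An empty list means every listed file exists with the expected size. The check does not validate rowCount or file contents.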

Usage Example

The command below downloads the NEW_ORDERS 2014-04-01 data registered with herd from the S3_MANAGED DEV bucket to the local file system.

java -jar dm-downloader.jar \
  -a <accessKey> \
  -p <secretKey> \
  -e s3-external-1.amazonaws.com \
  -l /nfs/site/mrkt/exchange_ingest/ECXH_PD/20140401/NEW_ORDERS_DU/EXCH_V2_FMT/ \
  -m /export/home/application_a_dev/dm-downloader-manifest-files/new-orders-pd-v2-2014-04-01.json \
  -H myHostname.us-east-1.elb.amazonaws.com \
  -P 80 \
  -s true \
  -u <username> \
  -w <password> \
  -n 10.0.0.100 \
  -o 3128

Environment/Security Access Details

  • Please make sure that the server where you run the downloader can talk to the herd application server. That might require a new firewall rule to be set up.
  • Depending on your environment, in order for the downloader to communicate with AWS S3, you might need to provide values for the HTTP proxy parameters (i.e. the -n and -o parameters).