HDFS CLI Enhancements #6

Open: wants to merge 36 commits into base: master

36 commits
8600edd  Patches for HDP (version)  (dstreev, Aug 17, 2015)
b79292e  Pull from forked Stemshell.  (dstreev, Aug 17, 2015)
66900ef  Major Revision to use native HDFS Shell Commands.  (dstreev, Aug 19, 2015)
8a5d6bd  Dealing with Directives  (dstreev, Aug 19, 2015)
b66831d  Doc Update  (dstreev, Aug 19, 2015)
7809187  Fixed Auto-Completion  (dstreev, Aug 19, 2015)
e34ef33  Docs and Binary  (dstreev, Aug 19, 2015)
c8f216a  Link fix.  (dstreev, Aug 19, 2015)
4f62898  Link updated  (dstreev, Aug 19, 2015)
99d8af3  Updated current known issues.  (dstreev, Aug 19, 2015)
381ef74  Fixed put directory placement issue.  (dstreev, Aug 19, 2015)
a4a0520  Added snapshot capabilities.  (dstreev, Aug 26, 2015)
6cb3cc1  Release 2.1.0  (dstreev, Aug 26, 2015)
54d43f9  path in readme  (dstreev, Aug 26, 2015)
b06153d  README Cleanup and helper setup script to install.  (dstreev, Aug 26, 2015)
d51f1b6  Readme  (dstreev, Aug 26, 2015)
a9cc5c6  Added Support for an initialization startup script.  (dstreev, Aug 28, 2015)
af80e3f  Doc clarification.  (dstreev, Aug 28, 2015)
f32ac96  Added support for UserGroupInformation.  (dstreev, Sep 2, 2015)
ed7edee  CLI Options for Kerberos. Support pre/post Init hooks.  (dstreev, Sep 4, 2015)
4dde6dd  Building Kerberos Support  (dstreev, Sep 4, 2015)
d970334  Enhanced Kerberos Options  (dstreev, Sep 5, 2015)
5743e16  Config and Auto Option Support.  (dstreev, Oct 28, 2015)
ff05401  Doc Update  (dstreev, Oct 28, 2015)
1b469dc  Doc Updates  (dstreev, Oct 29, 2015)
0e9ead3  Added missing symlink  (dstreev, Oct 29, 2015)
3b4a5ce  Added 'lsp' function.  (dstreev, Feb 17, 2016)
b9b1d2f  Added 'nnstat'  (dstreev, Mar 22, 2016)
8fa2d3a  Ready Update for latest version.  (dstreev, Mar 22, 2016)
807f3e0  link update  (dstreev, Mar 22, 2016)
893bd99  Adjust order to allow `-i` scripts with auto connect.  (dstreev, Mar 22, 2016)
0928e20  Rework startup and fix additional unnecessary calls to NN for nnstat  (dstreev, Mar 26, 2016)
e46afb7  Updates to README to reflect changes in behavior.  (dstreev, Mar 28, 2016)
3bd5d11  Doc Updates  (dstreev, Apr 3, 2016)
c4dd53b  docs  (dstreev, Apr 3, 2016)
5d78566  Moving...  (dstreev, May 20, 2016)
Binary file added .gitignore
Binary file not shown.
233 changes: 222 additions & 11 deletions README.md
@@ -1,15 +1,107 @@
# We're moving...

We've been replaced! Well, actually we're growing to more than just HDFS. A new project has been created to reflect this and is now the active replacement. Please see `hadoopcli` for all the same capabilities, plus more!

[Hadoop-Cli](https://github.com/dstreev/hadoop-cli)

## HDFS-CLI

HDFS-CLI is an interactive command-line shell that makes interacting with the Hadoop Distributed Filesystem (HDFS)
simpler and more intuitive than the standard command-line tools that come with Hadoop. If you're familiar with OS X, Linux, or even Windows terminal/console-based applications, then you are likely familiar with features such as tab completion, command history, and ANSI formatting.

### Binary Package

[Pre-Built Distribution](https://github.com/dstreev/hdfs-cli/releases)

Download the release files to a temp location. As the root user, `chmod +x` the three shell script files, then run `setup.sh`. This will install the `hdfscli` application and place it on your path.
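
A minimal install session might look like this (a sketch: the download directory is a placeholder, and the three scripts named are the ones `setup.sh` installs):

```
# Placeholder download location; adjust to wherever the release files landed.
cd /tmp/hdfs-cli-release
chmod +x setup.sh hdfscli JCECheck
./setup.sh    # as root: installs to /usr/local/hdfs-cli and symlinks into /usr/local/bin
```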

Try it out on a host with default configs:

hdfscli

To use an alternate HADOOP_CONF_DIR:

hdfscli --config /var/hadoop/dev-cfg

### Release Notes

#### 2.3.2-SNAPSHOT (in-progress)

##### Behavioural Changes
- Due to the complexities and requirements of connecting to an environment, I've removed the ability to connect manually to an environment by simply using 'connect'. This option was there from the beginning, but as more and more features were added, I found myself hacking away at recreating the settings and controls enabled through the configurations available in `hdfs-site.xml` and `core-site.xml`. Therefore, the options `-k` (kerberos) and `-a` (auto connect) are no longer available. Unless specified via the `--config` option, the `hdfs-site.xml` and `core-site.xml` files in the default location of `/etc/hadoop/conf` will be used to establish the environment needed to connect. You can keep multiple directories with different `hdfs-site.xml`/`core-site.xml` files and use the `--config` option to connect to alternate hadoop environments.
##### Removed Options

- `-a` Auto Connect. Use `--config` for alternate site files, or nothing for the default `/etc/hadoop/conf`.
- `-k` Kerberos Option.

##### Enhancements
I noticed some pauses coming from inquiries into the Namenode JMX for `nnstat`. Instead of requesting the
entire Namenode JMX stack, we now target only the JMX beans we're interested in. This reduces the
observed pauses and relieves the Namenode of some unnecessary work.
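
To illustrate the targeted approach (a sketch only; the host, port, and bean name are assumptions for a stock Hadoop 2.x NameNode over HTTP), the NameNode's JMX servlet accepts a `qry` filter that returns a single bean instead of the full dump:

```
# Full JMX dump -- everything the Namenode exposes:
curl 'http://namenode.example.com:50070/jmx'

# Targeted query for a single bean -- far less work for the Namenode:
curl 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
```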

#### 2.3.1-SNAPSHOT (in-progress)
- Added new 'nnstat' function to collect Namenode JMX stats for long-term analysis.

See [NN Stat Feature](https://youtu.be/CZxx_BxCX4Y)

Also check out how to auto-fetch the stats via a script; use this technique to run cron jobs that gather the stats. Here `-i stats` names the initialization file. See [Auto nnstat](https://youtu.be/gy43_Hg2RXk).

```
hdfscli -i stats
```
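
A concrete cron setup might look like this (a sketch: the init-file contents and output path are assumptions; `nnstat` and its `-o` option are described in the NN Stats section below):

```
# Hypothetical $HOME/.hdfs-cli/stats init file -- one CLI command per line:
#   nnstat -o /user/dstreev/nn_stats

# Assumed crontab entry: collect Namenode stats at the top of every hour.
0 * * * * /usr/local/bin/hdfscli -i stats
```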

#### 2.3.0-SNAPSHOT
- Added new 'lsp' function. Consider it an 'ls' PLUS.

#### 2.2.1-SNAPSHOT

- External Config Support (See Below)
- Supports NN HA (through the hdfs and core site files)
- Auto Config support using a default config directory.

#### 2.2.0-SNAPSHOT

- Setup Script to help deploy (bin/setup.sh)
- hdfscli shell script to launch (bin/hdfscli.sh)
- Support for initialization Script (-i <file>)
- Kerberos Support via default and --config option for hadoop site files.

#### 2.1.0

- Added support for create/delete/rename Snapshot

#### 2.0.0

- Initial forked release, based on P. Taylor Goetz's original project.
- Update to 2.6.0 Hadoop Libraries
- Re-wrote Command Implementation to use FSShell as the basis for issuing commands.
- Provide Context Feedback in the command window to show local and remote context.
- Added several `hdfs dfs` commands that were previously missing.

### Building

This project requires the artifacts from https://github.com/dstreev/stemshell, a fork enhanced to support processing command-line parameters and to handle quoted variables.
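
A plausible build sequence under those constraints (a sketch assuming standard Maven builds; the exact goals are not confirmed by this README, though `setup.sh` expects `target/hdfs-cli-full-bin.jar`):

```
# Build and locally install the forked stemshell dependency first.
git clone https://github.com/dstreev/stemshell
cd stemshell && mvn install && cd ..

# Then build hdfs-cli itself.
git clone https://github.com/dstreev/hdfs-cli
cd hdfs-cli && mvn package
```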

### Basic Usage
HDFS-CLI works much like a command-line ftp client: You first establish a connection to a remote HDFS filesystem,
then manage local/remote files and transfers.

To start HDFS-CLI, run the following command:

java -jar hdfs-cli-full-bin.jar

### Command Documentation

Help for any command can be obtained by executing the `help` command:

help pwd

Note that currently, documentation may be limited.

#### Local vs. Remote Commands
When working within an HDFS-CLI session, you manage both local (on your computer) and remote (HDFS) files. By convention, commands that apply to both local and remote filesystems are differentiated by prepending an `l` to the local variant; an example session follows below.
@@ -23,24 +115,137 @@

Every HDFS-CLI session keeps track of both the local and remote current working directories.
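
To illustrate the `l` convention, here is a hypothetical session pairing remote and local variants (`lpwd` is an assumed name that simply follows the documented prefix rule; `cd`, `pwd`, and `lcd` appear in the command lists below):

```
pwd                  # print the remote (HDFS) working directory
lpwd                 # print the local working directory (assumed name)
cd /user/dstreev     # change the remote working directory
lcd /tmp             # change the local working directory
```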

### Support for External Configurations (core-site.xml,hdfs-site.xml)

By default, hdfs-cli searches `/etc/hadoop/conf` for `core-site.xml` and `hdfs-site.xml`. To use an alternate location, pass the `--config` option when starting hdfs-cli.

The `--config` option takes one parameter, a local directory. This directory should contain `hdfs-site.xml` and `core-site.xml` files. When used, you'll automatically be connected to HDFS and placed in your HDFS home directory.

Example connection parameters:

# Use the hadoop files in the input directory to configure and connect to HDFS.
hdfscli --config ../mydir

This can be used in conjunction with the 'Startup' Init option below to run a set of commands automatically after the connection is made. The 'connect' option should NOT be used in the initialization script.
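
For instance (a sketch combining the two features; the directory and script name are placeholders):

```
# Connect using alternate site files, then run the commands in
# $HOME/.hdfs-cli/test once the connection is established.
hdfscli --config /var/hadoop/dev-cfg -i test
```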

### Startup Initialization Option

When launched with the option `-i <filename>`, the CLI will run all the commands in the file.

The file needs to be located in the `$HOME/.hdfs-cli` directory. For example:

# If you're using the helper shell script
hdfscli -i test

# If you're using the java command
java -jar hdfs-cli-full-bin.jar -i test


This will initialize the session with the command(s) in `$HOME/.hdfs-cli/test`, one command per line.

The contents could be any set of valid commands that you would use in the cli. For example:

cd user/dstreev
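
A slightly fuller init script might look like this (hypothetical contents of `$HOME/.hdfs-cli/test`; each line is an ordinary CLI command from the lists below, and whether comment lines are allowed is not documented):

```
cd user/dstreev
pwd
ls
```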

### NN Stats

Collect Namenode stats from the available Namenode JMX URLs.

Three types of stats are currently collected and written to HDFS (with the `-o` option) or to the screen (when no option is specified).

The default delimiter for all records is '\u0001' (Ctrl-A).

>> Namenode Information: (optionally written to the directory 'nn_info')
Fields: Timestamp, HostAndPort, State, Version, Used, Free, Safemode, TotalBlocks, TotalFiles, NumberOfMissingBlocks, NumberOfMissingBlocksWithReplicationFactorOne

>> Filesystem State: (optionally written to the directory 'fs_state')
Fields: Timestamp, HostAndPort, State, CapacityUsed, CapacityRemaining, BlocksTotal, PendingReplicationBlocks, UnderReplicatedBlocks, ScheduledReplicationBlocks, PendingDeletionBlocks, FSState, NumLiveDataNodes, NumDeadDataNodes, NumDecomLiveDataNodes, NumDecomDeadDataNodes, VolumeFailuresTotal

>> Top User Operations: (optionally written to the directory 'top_user_ops')
Fields: Timestamp, HostAndPort, State, WindowLenMs, Operation, User, Count

[Hive Table DDL for NN Stats](./src/main/hive/nn_stats.ddl)
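
Once records have been collected with `-o`, a quick sanity check from the shell might be (the output path is an assumption; `tr` just makes the Ctrl-A delimiter visible):

```
# Peek at collected Namenode Information records, converting the
# \u0001 (Ctrl-A) delimiter to tabs for readability.
hdfs dfs -cat /user/dstreev/nn_stats/nn_info/* | tr '\001' '\t' | head
```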

### Enhanced Directory Listing (lsp)

Like `ls`, `lsp` fetches many details about a file. In addition, it can include information about the file such as:
- Block Size
- Access Time
- Ratio of File Size to Block
- Datanode information for the file's blocks (Host and Block Id)

Use help to get the options:

help lsp

```
usage: stats [OPTION ...] [ARGS ...]
Options:
 -d,--maxDepth <maxDepth>      Depth of Recursion (default 5),
                               use '-1' for unlimited
 -f,--format <output-format>   Comma separated list of one or more:
                               permissions_long,replication,user,group,size,
                               block_size,ratio,mod,access,path,datanode_info
                               (default all of the above)
 -o,--output <output>          Output File (HDFS) (default System.out)
```

When no argument is specified, it uses the current directory.

Examples:

# Using the default format, output a listing of the files in `/user/dstreev/perf` to `/tmp/test.out`
lsp -o /tmp/test.out /user/dstreev/perf

Output with the default format of:

permissions_long,replication,user,group,size,block_size,ratio,mod,access,path,datanode_info

```
rw-------,3,dstreev,hdfs,429496700,134217728,3.200,2015-10-24 12:26:39.689,2015-10-24 12:23:27.406,/user/dstreev/perf/teragen_27/part-m-00004,10.0.0.166,d2.hdp.local,blk_1073747900
rw-------,3,dstreev,hdfs,429496700,134217728,3.200,2015-10-24 12:26:39.689,2015-10-24 12:23:27.406,/user/dstreev/perf/teragen_27/part-m-00004,10.0.0.167,d3.hdp.local,blk_1073747900
rw-------,3,dstreev,hdfs,33,134217728,2.459E-7,2015-10-24 12:27:09.134,2015-10-24 12:27:06.560,/user/dstreev/perf/terasort_27/_partition.lst,10.0.0.166,d2.hdp.local,blk_1073747909
rw-------,3,dstreev,hdfs,33,134217728,2.459E-7,2015-10-24 12:27:09.134,2015-10-24 12:27:06.560,/user/dstreev/perf/terasort_27/_partition.lst,10.0.0.167,d3.hdp.local,blk_1073747909
rw-------,1,dstreev,hdfs,543201700,134217728,4.047,2015-10-24 12:29:28.706,2015-10-24 12:29:20.882,/user/dstreev/perf/terasort_27/part-r-00002,10.0.0.167,d3.hdp.local,blk_1073747920
rw-------,1,dstreev,hdfs,543201700,134217728,4.047,2015-10-24 12:29:28.706,2015-10-24 12:29:20.882,/user/dstreev/perf/terasort_27/part-r-00002,10.0.0.167,d3.hdp.local,blk_1073747921
```

With the file in HDFS, you can build a [hive table](./src/main/hive/lsp.ddl) on top of it to do some analysis. One of the reasons I created this was to be able to review a directory used by some process and get a bearing on the file construction and distribution across the cluster.

#### Use Cases
- The ratio can be used to identify files that are below the block size (small files).
- With the Datanode information, you can determine if a dataset is hot-spotted on a cluster. All you need is a full list of hosts to join the results with; see the sketch just below.
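
A rough shell-only version of that check (assuming the default comma-delimited format shown above, where the datanode hostname is the second-to-last field):

```
# Tally block replicas per datanode host from an lsp output file;
# a heavily skewed count suggests hot-spotting.
hdfs dfs -cat /tmp/test.out | awk -F',' '{ print $(NF-1) }' | sort | uniq -c | sort -rn
```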

### Available Commands

#### Common Commands
connect connect to a remote HDFS instance
help display help information
put upload local files to the remote HDFS

get (todo) retrieve remote files from HDFS to Local Filesystem

#### Remote (HDFS) Commands
cd change current working directory
ls list directory contents
rm delete files/directories
pwd print working directory path
cat print file contents
chown change ownership
chmod change permissions
chgrp change group
head print first few lines of a file
mkdir create directories
count Count the number of directories, files and bytes under the paths that match the specified file pattern.
stat Print statistics about the file/directory at <path> in the specified format.
tail Displays last kilobyte of the file to stdout.
text Takes a source file and outputs the file in text format.
touchz Create a file of zero length.
usage Return the help for an individual command.

createSnapshot Create Snapshot
deleteSnapshot Delete Snapshot
renameSnapshot Rename Snapshot
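
For example (a hypothetical session; the argument shapes are assumptions modeled on the equivalent `hdfs dfs` snapshot commands):

```
# Assumed usage: snapshot a snapshottable directory, rename it, then remove it.
createSnapshot /user/dstreev backup1
renameSnapshot /user/dstreev backup1 backup2
deleteSnapshot /user/dstreev backup2
```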

#### Local (Local File System) Commands
lcd change current working directory
@@ -50,19 +255,25 @@
lcat print file contents
lhead print first few lines of a file
lmkdir create directories

### Command Documentation
Help for any command can be obtained by executing the `help` command:

help pwd

Note that currently, documentation may be limited.

#### Tools and Utilities
lsp ls plus. Includes Block information and locations.
nnstat Namenode Statistics

### Known Bugs/Limitations

* No support for paths containing spaces
* No support for Windows XP
* Path Completion for chown, chmod, chgrp, rm is broken

### Road Map

- Support input variables
- Expand to support Extended ACLs (get/set)
- Add Support for setrep
- HA Commands
- NN and RM




28 changes: 28 additions & 0 deletions bin/setup.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

# Should be run as root.

# Work from the directory containing this script (quoted to survive
# paths with spaces).
cd "$(dirname "$0")"

# Create the install layout.
mkdir -p /usr/local/hdfs-cli/bin
mkdir -p /usr/local/hdfs-cli/lib

# Install the launcher scripts.
cp -f hdfscli /usr/local/hdfs-cli/bin
cp -f JCECheck /usr/local/hdfs-cli/bin

# Install the jar from a local build, if present...
if [ -f ../target/hdfs-cli-full-bin.jar ]; then
    cp -f ../target/hdfs-cli-full-bin.jar /usr/local/hdfs-cli/lib
fi

# ...or from the release bundle alongside this script.
if [ -f hdfs-cli-full-bin.jar ]; then
    cp -f hdfs-cli-full-bin.jar /usr/local/hdfs-cli/lib
fi

# Make everything readable and the launchers executable.
chmod -R +r /usr/local/hdfs-cli
chmod +x /usr/local/hdfs-cli/bin/hdfscli
chmod +x /usr/local/hdfs-cli/bin/JCECheck

# Expose the launchers on the PATH.
ln -sf /usr/local/hdfs-cli/bin/JCECheck /usr/local/bin/JCECheck
ln -sf /usr/local/hdfs-cli/bin/hdfscli /usr/local/bin/hdfscli

