# DocRipper
[![Gem Version](https://badge.fury.io/rb/doc_ripper.svg)](http://badge.fury.io/rb/doc_ripper)

Grab the text from common document formats with one command. DocRipper is an extremely lightweight Ruby wrapper that parses text content from common file formats (currently .doc, .docx, .pdf, .txt and .sketch) without requiring heavy dependencies such as an OCR library or OpenOffice/LibreOffice.

For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.

Need OCR support or in-image text parsing? Take a look at [Docsplit](https://github.com/documentcloud/docsplit).

### Supported File Formats
````
.doc
.docx
.pdf
.txt
.sketch
````

File format | Supported? | Dependencies
------------|------------|-------------
.doc        |     x      |   Antiword
.docx       |     x      |
.pdf        |     x      |   Poppler-utils
.txt        |     x      |
.sketch     |     x      |   Sqlite3

## Quickstart

```
  gem install doc_ripper
```
### Pass the path to a file

```
  require 'doc_ripper'

  DocRipper::rip('/path/to/file')
```

#### If the file cannot be read, nil will be returned.

```
  DocRipper::rip('/path/to/missing/file')
  => nil
```
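
Because `rip` returns nil rather than raising, callers usually need a nil check. The following is a minimal sketch of one way to do that, not part of the gem: `text_or_default` and its `ripper:` keyword are hypothetical names introduced here for illustration, and the keyword argument assumes Ruby 2.0 or later.

```ruby
begin
  require 'doc_ripper'
rescue LoadError
  # Gem not installed here; the helper below still works with any object
  # that responds to rip, passed in through the ripper: keyword.
end

# Hypothetical helper (not part of DocRipper): returns the ripped text,
# or `default` when rip returns nil because the file could not be read.
def text_or_default(path, default = '', ripper: DocRipper)
  text = ripper.rip(path)
  text.nil? ? default : text
end
```

With the gem installed, `text_or_default('/path/to/missing/file')` would give back the empty string instead of nil.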

#### Want to raise an exception? Use #rip!
\#rip! raises an exception if \#rip would return nil or if the file type isn't supported.

```
  # invalid file type
  DocRipper::rip!('/path/to/invalid/file.type')
  => DocRipper::UnsupportedFileType

  # missing file
  DocRipper::rip!('/path/to/missing/file.doc')
  => DocRipper::FileNotFound
```
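
The two exceptions shown above can be rescued separately when the caller wants to react differently to each. This is a rough sketch under stated assumptions: `try_rip` and its `ripper:` keyword are hypothetical names, not part of DocRipper, and the sketch assumes the two exception classes are unrelated to each other so that the order of the rescue clauses does not matter.

```ruby
begin
  require 'doc_ripper'
rescue LoadError
  # Gem not installed here; the helper below still accepts any object
  # that responds to rip! through the ripper: keyword.
end

# Hypothetical helper (not part of DocRipper): wraps rip! and converts the
# two documented exceptions into a [status, value] pair instead of raising.
def try_rip(path, ripper: DocRipper)
  [:ok, ripper.rip!(path)]
rescue DocRipper::UnsupportedFileType
  [:unsupported, path]
rescue DocRipper::FileNotFound
  [:missing, path]
end
```

A caller can then branch on the first element, for example logging `:missing` paths while silently skipping `:unsupported` ones.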



## Dependencies
 - Ruby version >= 1.9.2
 - [Poppler-utils (pdftotext)](http://poppler.freedesktop.org/) (.pdf)
 - [Antiword](http://www.winfield.demon.nl/) (.doc), more info: http://linux.die.net/man/1/antiword
 - Sketch support requires sqlite3 and the [sqlite3 gem](https://rubygems.org/gems/sqlite3)
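
Since the .doc and .pdf formats depend on external command-line tools, it can help to check that they are installed before ripping. The sketch below is one possible approach and not something DocRipper provides: `executable_on_path?`, `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the check assumes the tools are invoked as `antiword` and `pdftotext` from a directory on `PATH`.

```ruby
# Returns true when an executable file named `cmd` exists in a PATH directory.
def executable_on_path?(cmd)
  ENV.fetch('PATH', '').split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, cmd)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Assumed mapping of file extension to the command-line tool it needs,
# based on the dependency list above.
REQUIRED_TOOLS = { '.doc' => 'antiword', '.pdf' => 'pdftotext' }.freeze

# Returns the subset of `tools` whose command was not found on PATH.
def missing_tools(tools = REQUIRED_TOOLS)
  tools.reject { |_ext, cmd| executable_on_path?(cmd) }
end
```

Calling `missing_tools` with no arguments returns a hash of the extensions whose tool was not found, which is empty when both are installed.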