Simple Tika Server
==========================
https://github.com/gselva
Apache Tika as an HTTP service that parses documents and returns their metadata and textual content as JSON, as in the example below.
Sample Invocation:
------------------
curl -T tika.pdf http://localhost:8080/tika/fulldata
Sample Output:
-------------
{
"metadata": { "Author":"Apache",
"Content-Length":243166,
"Content-Type":"application/pdf",
"Creation-Date":"2010-05-11T13:39:32Z",
"Keywords":"",
"Last-Modified":"2010-05-11T13:39:32Z",
"created":"Wed May 11 14:39:32 BST 2010",
"creator":"Google Chrome",
"producer":"Mac OS X 10.6.7 Quartz PDFContext",
"resourceName":"tika.pdf",
"subject":"",
"title":"Apache Tika - Apache Tika",
"xmpTPg:NPages":4 },
"text":"Apache Tika\nApache Tika - a content analysis toolkit"
}
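
As a rough illustration of consuming this response, the following Python sketch parses a shortened copy of the sample output above with the standard `json` module. The `summarize` helper and the choice of fields are hypothetical and not part of this project.

```python
import json

# Shortened copy of the sample output shown above.
SAMPLE_RESPONSE = """
{
  "metadata": {
    "Author": "Apache",
    "Content-Type": "application/pdf",
    "title": "Apache Tika - Apache Tika",
    "xmpTPg:NPages": 4
  },
  "text": "Apache Tika\\nApache Tika - a content analysis toolkit"
}
"""


def summarize(response_body):
    """Hypothetical helper: pull a few fields out of a fulldata response."""
    data = json.loads(response_body)
    metadata = data.get("metadata", {})
    return {
        "title": metadata.get("title"),
        "content_type": metadata.get("Content-Type"),
        "pages": metadata.get("xmpTPg:NPages"),
        # The text field may span several lines; keep only the first one.
        "first_line": data.get("text", "").split("\n")[0],
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RESPONSE))
```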
Based on https://github.com/maxcom/tikaserver-ex (thanks Max!)
More info
---------
Tika as a server: https://issues.apache.org/jira/browse/TIKA-593
JSON output: https://issues.apache.org/jira/browse/TIKA-213
Build and run
-------------
Install Maven if you haven't already, then run the following from the project root folder:
mvn jetty:run
Deploy
------
mvn war:war
Then drop the WAR into the webapps folder of a servlet container such as Jetty or Apache Tomcat.
If you use Simple Tika Server to parse local files with HTTP GET, you'll need to set up JNDI entries (see Usage below).
Usage
-----
HTTP PUT a document to parse with Tika
PUT document to get metadata
curl -T pom.xml http://localhost:8080/tika/metadata
returns metadata as JSON
PUT document to get text
curl -T pom.xml http://localhost:8080/tika/text
returns textual content, if any, as JSON
PUT document to get metadata and text
curl -T pom.xml http://localhost:8080/tika/fulldata
returns metadata and textual content as JSON
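
The same PUT calls can be made from a script. The sketch below is one possible Python equivalent of the curl commands above, using only the standard library. The helper names are hypothetical, and the host and port are simply copied from the examples. The explicit `Content-Type` header is an assumption: `urllib` would otherwise send a form-encoding default, and this README does not say whether the server looks at that header.

```python
import urllib.request

# Host and port are taken from the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:8080/tika"
OPTIONS = ("metadata", "text", "fulldata")


def build_put_request(document_bytes, option, base_url=BASE_URL):
    """Hypothetical helper: build a PUT request similar to `curl -T`."""
    if option not in OPTIONS:
        raise ValueError("option must be one of: " + ", ".join(OPTIONS))
    return urllib.request.Request(
        "%s/%s" % (base_url, option),
        data=document_bytes,
        # Assumption: a generic type, so urllib does not add its form default.
        headers={"Content-Type": "application/octet-stream"},
        method="PUT",
    )


def put_document(path, option):
    """Send the file at `path` and return the raw response body (bytes)."""
    with open(path, "rb") as handle:
        request = build_put_request(handle.read(), option)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, `put_document("pom.xml", "metadata")` would correspond to the first curl command above.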
HTTP GET to parse a locally available document with Tika
GET Metadata for files local to server ('metadata' option)
http://localhost:8080/tika/metadata/localfiles/tika.pdf
returns metadata as JSON
'localfiles' is a JNDI entry that resolves to a URL (see jetty-env.xml for an example)
tika.pdf must be available at that location before you call the service
http://localhost:8080/tika/metadata/wikipedia/wiki/Douglas_Adams
returns metadata as JSON
'wikipedia' is a JNDI entry that resolves to a URL (see jetty-env.xml for an example)
"wiki/Douglas_Adams" is the resource you want to parse
http://localhost:8080/tika/metadata/wikipedia/w/index.php?title=Talk:Douglas_Adams&action=history
The document part can include query parameters as well
"w/index.php?title=Talk:Douglas_Adams&action=history" is treated as the source document here
GET Text for files local to server ('text' option)
http://localhost:8080/tika/text/localfiles/tika.pdf
returns textual content, if any, as JSON
'localfiles' is a JNDI entry that resolves to a URL (see jetty-env.xml for an example)
tika.pdf must be available at that location before you call the service
GET Metadata and Text for files local to server ('fulldata' option)
http://localhost:8080/tika/fulldata/localfiles/tika.pdf
returns metadata and textual content as JSON
'localfiles' is a JNDI entry that resolves to a URL (see jetty-env.xml for an example)
tika.pdf must be available at that location before you call the service
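
The GET URLs above all follow the pattern option / JNDI name / resource. A minimal Python sketch of assembling them is shown here; `build_get_url` and `get_document` are hypothetical helpers, and the JNDI names are only the ones used in the examples above.

```python
import urllib.request

# Host and port are taken from the examples above; adjust for your deployment.
BASE_URL = "http://localhost:8080/tika"
OPTIONS = ("metadata", "text", "fulldata")


def build_get_url(option, jndi_name, resource, base_url=BASE_URL):
    """Hypothetical helper: join option, JNDI name and resource into a URL.

    The resource is appended unchanged, so it may carry its own query string.
    """
    if option not in OPTIONS:
        raise ValueError("option must be one of: " + ", ".join(OPTIONS))
    return "%s/%s/%s/%s" % (base_url, option, jndi_name, resource.lstrip("/"))


def get_document(option, jndi_name, resource):
    """Fetch the parse result for a resource and return the raw body (bytes)."""
    url = build_get_url(option, jndi_name, resource)
    with urllib.request.urlopen(url) as response:
        return response.read()
```

For example, `build_get_url("text", "localfiles", "tika.pdf")` yields the 'text' URL listed above.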
HTTP Codes returned
-------------------
200 - Ok
404 - Document not found
415 - Unknown file type
422 - Unparsable document of known type (for example, password-protected documents and unsupported versions such as BIFF5 Excel)
500 - Internal error
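
A client may want to turn these codes into readable errors. The following Python sketch shows one way to do that with the standard library. The mapping mirrors the list above, while `describe_status` and `fetch_or_explain` are hypothetical names, not part of this project.

```python
import urllib.error
import urllib.request

# Meanings copied from the list of HTTP codes above.
STATUS_MEANINGS = {
    200: "Ok",
    404: "Document not found",
    415: "Unknown file type",
    422: "Unparsable document of known type",
    500: "Internal error",
}


def describe_status(code):
    """Return the documented meaning of a status code, or a fallback text."""
    return STATUS_MEANINGS.get(code, "Unexpected status %d" % code)


def fetch_or_explain(url):
    """Return the response body, or raise RuntimeError with a readable message."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        raise RuntimeError(describe_status(error.code)) from error
```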