tika

Here are 147 public repositories matching this topic...

apache / tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

java metadata content tika extraction

Updated Dec 10, 2024
Java

dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)

java elasticsearch crawler tika

Updated Dec 9, 2024
Java

yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

nlp rust pdf machine-learning natural-language-processing ocr etl tika extraction docx data-pipelines pdf-parser unstructured unstructured-data rag etl-pipelines llm

Updated Dec 10, 2024
Rust

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

search search-engine distributed-systems information-retrieval big-data spark solr web-crawler nutch tika

Updated Mar 30, 2023
Java

ICIJ / extract

A cross-platform command line tool for parallelised content extraction and analysis.

etl solr tika index ediscovery

Updated Nov 26, 2024
Java

KevM / tikaondotnet

Use the Java Tika text extraction library on the .NET platform

tika extract-text

Updated Apr 13, 2024
Rich Text Format

shebinleo / pdf2html

pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate thumbnail images for PDF files using Apache PDFBox.

nodejs tika pdf-converter pdfbox thumbnail pdftohtml

Updated Dec 10, 2024
JavaScript

apache / tika-docker

Convenience Docker images for Apache Tika Server

docker image tika

Updated Oct 22, 2024
Shell

chrismattmann / MLwithTensorFlow2ed

Code for Machine Learning with TensorFlow, 2nd Edition, published by Manning Publications

python docker machine-learning deep-learning clustering tensorflow machine-learning-algorithms tika ml regression classification autoencoder tensorflow-tutorials python2 manning-publications ml-with-tensorflow

Updated Nov 22, 2022
Jupyter Notebook

nasa-jpl-memex / memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data

crawler dashboard anaconda nutch tika apache miniconda domain-discovery memex-explorer ache

Updated Jan 19, 2016
Python

vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

ocr php-library tika apache text-extraction text-recognition

Updated May 28, 2024
PHP

chrismattmann / tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on metadata features.

python machine-learning information-retrieval clustering tika cosine-similarity jaccard-similarity cosine-distance similarity-score tika-similarity metadata-features tika-python

Updated Mar 26, 2024
Python

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika, and Apache OODT to ingest tens of millions of files in place (images, but it could be extended to other file types) and to extract metadata and OCR information from those files using Tika and Tesseract OCR.

solr tika apache memex oodt oodt-radix