xyr
is a very lightweight, simple, and powerful data ETL platform that helps you to query available data sources usingSQL
.
here we define a new table called
users
which will load all json files in that directory (recursive) with any of the following json formats: (object/object[] per-file, newline delimited json objects/object[], or event no delimiter json objects/object[] like what kinesis firehose json output format).
Let's image we have a directory of json files called
/tmp/data/users
and here is an example of a json file there:
{"id":10,"email":"u10@example.com"}{"id":20,"email":"u20@example.com"}{"id": 3,"email":"u3@example.com"}{"id": 4,"email":"u4@example.com"}
Then we can define its schema as following
# where xyr should store its internal database
data_dir = "./tmp/db/"
# this file is `./config.xyr.hcl`
table "users" {
// the driver we want
driver = "jsondir"
// the data source directory
source = "/tmp/data/users"
// xyr will try to create a table into its internal storage, so it needs
// to know at least what are the required columns names of your data.
// i.e: {"id": 1, "email": "user@example.com", "age": 20}
// but we only need "id" and "email", so we defined both in the below columns array
// and not that the ordering is the same as our example.
columns = ["id", "email"]
// what do you want to load
// in case of jsondir, we can specify a regex pattern to filter the files
// using the filename
// but if we're using an SQL driver we can provide an sql statement that reads the data
// from the source SQL based database.
// i.e: "SELECT * FROM SOME_TABLE"
filter = ".*"
}
Now its the time to load it
$ xyr table:import users
Now let's query it
$ xyr exec "SELECT * FROM users"
All tables you define could be joined in the same query easily, let's imagine that we have the following defination
# debug mode "affects the log level"
debug = true
# how many workers to use to write into our sqlite db
# 0 means current cpu cores count
workers_count = 0
# where xyr should store its internal database
data_dir = "./tmp/db/"
table "users" {
driver = "s3jsondir"
source = "s3://ACCESS_KEY:SECRET_KEY@/BUCKET_NAME?region=&ssl=false&path=true&perpage=1000"
# which prefix we want to select
filter = "xyr/users/"
columns = ["id", "email"]
}
table "user_vists" {
driver = "postgres"
source = "postgresql://username:password@server:port/dbname?option1=value1"
columns = ["user_id", "vists"]
filter = "SELECT user_id, count(vists) FROM USERS GROUP BY user_id"
}
Now let's join them
$ xyr exec "SELECT * FROM users LEFT JOIN user_vists ON users_vists.user_id = users.id"
use this docker package
Driver | Source Connection String |
---|---|
jsondir |
/PATH/TO/JSON/DATA/DIR |
s3jsondir |
s3://[access_key_url_encoded]:[secret_key_url_encoded]@[endpoint_url]/bucket_name?region=&ssl=false&path=true&perpage=1000&downloaders_count=8&downloader_concurrency=8 |
mysql |
usrname:password@tcp(server:port)/dbname?option1=value1&... |
postgres |
postgresql://username:password@server:port/dbname?option1=value1 |
sqlite3 |
/path/to/db.sqlite?option1=value1 |
sqlserver |
sqlserver://username:password@host/instance?param1=value¶m2=value |
sqlserver://username:password@host:port?param1=value¶m2=value |
|
sqlserver://sa@localhost/SQLExpress?database=master&connection+timeout=30 |
|
hana |
hdb://user:password@host:port |
clickhouse |
tcp://host1:9000?username=user&password=qwerty&database=clicks&read_timeout=10&write_timeout=20&alt_hosts=host2:9000,host3:9000 |
oracle |
oracle://user:pass@server1/service?server=server2&server=server3 |
- Simple Presto Alternative.
- Simple AWS Athena Alternative.
- Convert your JSON documents into a SQL DB.
internaly
xyr
utilizesSQLite
as an embeded sql datastore (it may be changed in future and we can add multiple data stores), when you define a table inXYRCONFIG
file then run$ xyr table:import
you will be able to import all defined tables as well querying them via$ xyr exec "SELECT * FROM TABLE_NAME_HERE"
which outputs json result by default.
- Building the initial core.
- Add the basic
import
command for importing the tables intoxyr
. - Add the
exec
command to execute SQL query. - Add well known SQL drivers
- mysql
- postgres
- sqlite3
- clickhouse
- oracle
- hana
- sqlserver
- Add an S3 driver
- Adding/Improving documentations
- Expose another API beside the
CLI
to enable external Apps to queryxyr
.- JSON Endpoint?
- Mysql Protocol?
- Redis Protocol?
- Improving the code base (iteration 1).
- Add another backend instead of sqlite3 as internal datastore?