Example solution that demonstrates the use of the EDK npm package to create various remote file type data sources.
Run npm install
to install the package dependencies.
The solution can be built using the following command edk build
.
The project will involve creating some file based datasources, detecting their outputs, then visualising the outputs. Prior to usign the example the test files in files/
must be accesible from a known ftp server host, which could be generated locally with the docker: docker run -d -v __location_to_dir__/files:/home/vsftpd -p 20:20 -p 21:21 -p 47400-47470:47400-47470 -e FTP_USER=__user__ -e FTP_PASS=__password__ -e PASV_ADDRESS=127.0.0.1 --name ftp --restart=always bogem/ftp
.
Each data source was added with the following EDK commands:
- csv:
edk add datasource csv --name "Csv" --uri ftp://__user__:__password__@localhost/test.csv --def_dir src/datasource
- json:
edk add datasource json --name "Json Test" --uri ftp://__user__:__password__@localhost/test.jsonl --def_dir src/datasource
- xlsx:
edk add datasource xlsx --name "Xlsx" --uri ftp://__user__:__password__@localhost/test.xlsx --def_dir src/datasource
- csv two:
edk add datasource csv --name "Csv Two" --uri ftp://__user__:__password__@localhost/test_two.csv --def_dir src/datasource
- xlsx two:
edk add datasource xlsx --name "Xlsx Two" --uri ftp://__user__:__password__@localhost/test_two.xlsx --def_dir src/datasource
This will create an empty datasource for each file, for example for the json datasource:
import * as ELARA from "@elaraai/edk/lib"
export default ELARA.JsonSourceSchema({
name: "Json",
uri: 'ftp://__user__:__password__@localhost/test.jsonl'
})
Note that the second version of each file (and datasource) are intentionally more complicated, containing duplicate rows (i.e. no unique key) and a much larger number of rows. Also, for test.csv
dataset we know that rather than empty cells for variables, the file contains a "?"
string.
The output expressions were detected for each data with the following commands.
- csv:
edk-io detect csv --asset csv.source --defaults --empty ?
- json:
edk-io detect json --asset json.source --defaults
- xlsx:
edk-io detect xlsx --asset xlsx.source --defaults
- csv two:
edk-io detect csv --asset csv_two.source --defaults --empty ?
- xlsx two:
edk-io detect xlsx --asset xlsx_two.source --defaults
This will generate the types and expressions for the datasources, for example for json.source
:
import * as ELARA from "@elaraai/edk/lib"
const json_struct_type = ELARA.StructType({
string: 'string',
date: 'datetime',
'float': 'float',
integer: 'integer',
'boolean': 'boolean',
struct: ELARA.StructType({
string: 'string',
date: 'datetime',
'float': 'float',
integer: 'integer',
'boolean': 'boolean',
}),
array: 'set',
});
export default ELARA.JsonSourceSchema({
name: "Json",
uri: 'ftp://__user__:__password__@localhost/test.jsonl',
primary_key: ELARA.Variable("string", 'string'),
selections: {
string: ELARA.Parse(ELARA.Variable("string", 'string')),
date: ELARA.Parse(ELARA.Variable("date", 'datetime')),
'float': ELARA.Parse(ELARA.Variable("float", 'float')),
integer: ELARA.Parse(ELARA.Variable("integer", 'integer')),
'boolean': ELARA.Parse(ELARA.Variable("boolean", 'boolean')),
array: ELARA.Parse(ELARA.Variable("array", 'set')),
struct: ELARA.Parse(ELARA.Variable("struct", json_struct_type)),
},
})
Or alternatively for the second set of datasources, the detection will sample from the full set of rwos for efficiency, and also inject an index variable given the lack of a unique field. For example with the csv_two.source
below.
// East type declarations
import * as ELARA from "@elaraai/edk/lib"
export default ELARA.CsvSourceSchema({
name: "Csv Two",
uri: 'ftp://__user__:__password__@localhost/test_two.csv',
primary_key: ELARA.Print(ELARA.Variable("index", 'integer')),
index_variable: ELARA.Variable("index", 'integer'),
selections: {
string: ELARA.Parse(ELARA.IfElse(
ELARA.Equal(ELARA.Variable("string", 'string'), ELARA.Const("?")),
ELARA.Null('string'),
ELARA.Variable("string", 'string')
)),
date: ELARA.Parse(ELARA.Variable("date", 'datetime')),
number: ELARA.Parse(ELARA.Variable("number", 'float')),
integer: ELARA.Parse(ELARA.Variable("integer", 'integer')),
'boolean': ELARA.Parse(ELARA.Variable("boolean", 'boolean')),
'Another String': ELARA.Parse(ELARA.IfElse(
ELARA.Equal(ELARA.Variable("Another String", 'string'), ELARA.Const("?")),
ELARA.Null('string'),
ELARA.Variable("Another String", 'string')
)),
},
})
The application was added for the project with the following command: edk add plugin --name Application --def_dir src/plugin
. The application contents was added to display the datasource outputs with the DataSourcePlugin
for a default SuperUser
:
import * as ELARA from "@elaraai/edk/lib"
import { ApplicationPlugin, Const, DataSourcePlugin, SuperUser } from "@elaraai/edk/lib"
import csv from "../../gen/csv.source"
import json from "../../gen/json.source"
import xlsx from "../../gen/xlsx.source"
export default ELARA.Schema(
ApplicationPlugin({
name: "File Datasources",
schemas: {
"Datasources": DataSourcePlugin({
datasources: [csv, json, xlsx]
})
},
users: [
SuperUser({
email: 'admin@domain.com',
name: 'Admin',
password: Const('admin'),
})
]
})
)
General reference documentation for EDK usage is available in the following links: