The Ballerina Data module offers a variety of functions designed to facilitate conversion between different data formats (e.g., JSON, XML, CSV, etc.). This module is equipped to handle both string data formats (e.g., JSON, XML, CSV, etc.) and Ballerina anydata
values.
TODO: Complete the introduction.
Below are the supported string data formats, along with the corresponding base type, and the types they can be converted from and to.
Format | Base Type | Convert From | Convert To |
---|---|---|---|
JSON | record {|T...;|} or json |
string , stream<byte[], error> , byte[] , xml , ... |
xml , json , records, ... |
XML | record {|T...;|} or xml |
string , stream<byte[], error> , byte[] , json , ... |
json , records ... |
CSV | record {|T...;|}[] |
string , stream<byte[], error> , byte[] , json , ... |
json , xml , ... |
YAML | record {|T...;|} |
TBD | TBD |
TOML | record {|T...;|} |
TBD | TBD |
Properties (Map) | record {|T...;|} |
TBD | TBD |
Highlights:
- Notably, all string data types, including XML, can be represented using a record type. The constraint
T
is based on the specific string data format. - Conversion are opinionated, but can be customized using additional parameters.
Any Ballerina anydata
values including formats such as json
, xml
, records, arrays, maps, and more.
Note:
Initially, we will not support all anydata
values, only a select few basic types.
The core idea of the design is to streamline the conversion process between different string data formats.
By using a canonical representation based on the Record type,
we can represent most data types and use this representation to facilitate conversions between various data formats,
including different anydata
basic types.
Another crucial aspect of our design is the use of Projection to convert what the user wishes to convert. This level of specificity cannot be achieved with structural typing and default cloning.
graph LR
A["Source `byte[]`|`string`|`stream &lt;byte[], error>`"] -- "fromXYWithType" --> B["Record Type (R1)"]
B -- "Data Mapper" -.-> C["Another Record Type (Rx)"]
C -- "toXY" --> D["Target `byte[]`|`string`|`stream &lt;byte[], error>`"]
ST["Source Type `xml`, `json`, list, mapping, etc."] -- "fromYWithType" --> B
C -- "toY" --> TT["Target Type `xml`, `json`, list, mapping, etc.`"]
ST -- "fromYWithType" --> TT
subgraph "Source Data Format (JSON, XML, CSV, etc.)"
A
end
subgraph "Source Basic Type"
ST
end
subgraph "Target Basic Type"
TT
end
subgraph "Target Data Format (JSON, XML, CSV, etc.)"
D
end
subgraph "Ballerina Record Based Canonical Representation"
B
C
end
For this design, we are introducing a new set of functions and an algorithm (called Projection) for data format conversion.
In this context, the source data format type is S
, and the target type is T
. To facilitate Projection, T
must be a subtype of P|P[]|table<P>
, where P
is record {|R...;|}
.
Key Points:
P
can represent both open and closed records, which remain subtypes ofrecord {|R...;|}
. The openness or closeness ofP
will determine the projection of the source data.R
may be constrained based on the source string data format type.
To convert T
to another string data format, we will introduce a new set of functions specific to each string data format.
This function serves as the inverse of the toString()
function and constructs a ()|boolean|string|int|decimal|float
from a string representation.
isolated function fromStringWithType(string, typedesc<T> t = <>) returns T|error = external;
- This should be a langlib function, which is part of the
ballerina/lang.value
module. - A numeric-string value represents a value that belongs to one of the basic types: int, float, or decimal. As with literals, we need to apply a similar algorithm to determine the type of the value.
- Most basic types provide the
fromString
langlib function. The semantics of this function are the same as those.
Converts a string data format value to the canonical form with projection. These functions follows following pattern.
from <StringDataFormat> <Source> WithType
- ballerina/data.json, Here
T
is a subtype ofrecord {|anydata...;|}
fromJsonStringWithType (string src, Options options = {}, typedesc<T> t = <>) returns T|error
fromJsonBytesWithType (byte[] src, Options options = {}, typedesc<T> t = <>) returns T|error
fromJsonStreamWithType (stream<byte[] src, error>, Options options = {}, typedesc<T> t = <>) returns T|error
- ballerina/data.xml, Here
T
is defined in the xml module.fromXmlStringWithType (string src, Options options = {}, typedesc<T> t = <>) returns T|error
fromXmlBytesWithType (byte[] src, Options options = {}, typedesc<T> t = <>) returns T|error
fromXmlStreamWithType (stream<byte[] src, error>, Options options = {}, typedesc<T> t = <>) returns T|error
- ballerina/data.csv, Here
T
is defined in the csv module.fromCsvStringWithType (string src, Options options = {}, typedesc<T> t = <>) returns T|error
fromCsvBytesWithType (byte[] src, Options options = {}, typedesc<T> t = <>) returns T|error
fromCsvStreamWithType (stream<byte[], error> src, Options options = {}, typedesc<T> t = <>) returns T|error
Converts a Ballerina value to the canonical form with projection or to a different Basic Type. These functions follows following pattern.
from <Source> WithType
- ballerina/data.json
fromXmlWithType(xml src, Options options = {}, typedesc<T> t = <>) returns T|error
fromJsonWithType(json src, Options options = {}, typedesc<T> t = <>) returns T|error
fromRecordWithType(record {|anydata...;|} src, Options options = {}, typedesc<T> t = <>) returns T|error
fromTableWithType(table<anydata> src, Options options = {}, typedesc<T> t = <>) returns T|error
- ballerina/data.xml
fromJsonWithType(json src, Options options = {}, typedesc<T> t = <>) returns T|error
fromRecordWithType(record {|anydata...;|} src, Options options = {}, typedesc<T> t = <>) returns T|error
- ballerina/data.csv
fromJsonWithType(json src, Options options = {}, typedesc<T> t = <>) returns T|error
fromRecordWithType(record {|anydata...;|} src, Options options = {}, typedesc<T> t = <>) returns T|error
Note: As we are dealing with two different basic types, compatibility issues can be expected. Each function will strive to optimally convert the value to the target type. Otherwise, an error will be returned.
Converts a canonical value to a string data format value for persistence.
to <StringDataFormat> <Target>
- ballerina/data.json
toJsonString (T src, Options options = {}) returns string|error
toJsonBytes (T src, Options options = {}) returns byte[]|error
- Will be added latertoJsonStream (T src, Options options = {}) returns stream<byte[], error>|error
- Will be added later
- ballerina/data.xml
toXmlString (T src, Options options = {}) returns string|error
toXmlBytes (T src, Options options = {}) returns byte[]|error
- Will be added latertoXmlStream (T src, Options options = {}) returns stream<byte[], error>|error
- Will be added later
- ballerina/data.csv
toCsvString (T src, Options options = {}) returns string|error
toCsvBytes (T src, Options options = {}) returns byte[]|error
- Will be added latertoCsvStream (T src, Options options = {}) returns stream<byte[], error>|error
- Will be added later
Note:
Each StringDataFormat module will take additional Options to control the output.
For example XML module will take xml:Schema
to control the output.
Converts a canonical value to a different ballerina value.
to <Target> [WithType]
- ballerina/data.json
toXml (T src, Options options = {}) returns xml|error
- Will take Optional
xml:Schema
value to control the output.
- Will take Optional
toRecord (T src, Options options = {}, typedesc<record{anydata...;}> t = <>) returns T|error
toTable (T src, Options options = {}) returns table<anydata>|error
- Only works for JSON Arrays. Otherwise, it will return an error.
- ballerina/data.xml
toJson (T src, Options options = {}) returns json|error
toRecord (T src, Options options = {}, typedesc<record{anydata...;}> t = <>) returns T|error
- ballerina/data.csv
toJson (T src, Options options = {}) returns json|error
toRecord (T src, Options options = {}, typedesc<record{anydata...;}> t = <>) returns T|error
While the majority of necessary functions are readily available in the langlib, such as toString()
,
we will still require a set of unique functions to facilitate conversions between different string data formats.
Details about these additional functions will be introduced later in this document.
Projection can be executed at two stages:
- On functions that construct a new Ballerina value from
string
,byte[]
,stream<byte[], error>
values, which represent a string data format. - On functions that convert an existing Ballerina value to a different Ballerina value, such as
xml
,json
, records, arrays, etc.
The projection rules are more or less the same for both cases unless otherwise specified.
When P
is Open:
-
If the source is a String Data type:
- If the position of the source data matches the name of the
P
field:- The string value at that position is converted using
fromStringWithType
. - This might result in a
ConversionError
if the value is invalid (e.g., due to number format errors).
- The string value at that position is converted using
- If there's no matching field in
P
:- The string value is converted to the most suitable basic type using
fromStringWithType
. - The suitable basic type is inferred from
R
, which will generally succeed unlessR
does not includestring
.
- The string value is converted to the most suitable basic type using
- These rules are further adjusted based on the specific String Data Type. For instance:
- XML
- If the position is an XML element, the element name is matched against the
P
field name.- If the element is a text element, the value is converted using
fromStringWithType
. - If the element is a complex element, it will be converted to a
record {|R...;|}
following the same rules.
- If the element is a text element, the value is converted using
- If the position is an XML attribute, the attribute name is matched against the
P
field name, and the value is converted usingfromStringWithType
. - If both the element and attribute don't match, the value is converted to the most suitable basic type using
fromStringWithType
. - If both the element and attribute match, priority is given to the element.
- If the position is an XML comment, it will be ignored.
- If the position is an XML processing instruction (PI), it will be ignored.
- If the position is an XML namespace declaration, it will be ignored by default, but this behavior can be overridden with additional parameters.
- Unspecified fields will be converted to the most suitable basic type using
fromStringWithType
.
- If the position is an XML element, the element name is matched against the
- JSON
- The rules remain the same as normal.
- Note: JSON Objects and JSON Arrays are treated as
record {|R...;|}
andrecord {|R...;|}[]
respectively.
- CSV
- Multiple CSV headers are supported; only the last one will be considered for the conversion.
- If a CSV header is present (this has to be explicitly specified by the user, the default is false)
- The header name is first matched against the
P
field name. - Based on this match, the conversion is performed. If there is a type mismatch, a
ConversionError
will occur. - Unspecified fields will be converted to the most suitable basic type using
fromStringWithType
.
- The header name is first matched against the
- If a CSV header is not present
- The position of the value is matched against the
P
field name. In this case, field names must be numeric strings, i.e., "0", "1", "2", etc. (TODO: Check this) The conversion is then performed based on this match. If there is a type mismatch, aConversionError
will occur.
- The position of the value is matched against the
- Unspecified fields will be converted to the most suitable basic type using
fromStringWithType
. - Note: These rules apply only when
P
is used. For convenience, we support a list of list representation for CSV, where projection doesn't apply.
- XML
- If the position of the source data matches the name of the
-
If the source is an existing Ballerina value:
- If it is a mapping value:
- The source field name should match the
P
field's name.- Both the source and target field basic types should be the same.
- For numeric basic types, the
NumericConvert
operation will be performed. - Otherwise, it will return a
ConversionError
.
- If there's no matching field in
P
:- The source field is copied to the target field if the source field's basic type is compatible with
R
. - Otherwise, it will return a
ConversionError
.
- The source field is copied to the target field if the source field's basic type is compatible with
- The source field name should match the
- For an
xml
value:- The rules are the same as for an XML String Data Type.
- For a
json
value:- The rules are the same as for a JSON String Data Type.
- For a list value:
P[]
will be used instead ofP
.
- For a table value:
- To be determined (TBD).
- Other values are not supported and will result in a
ConversionError
.
- If it is a mapping value:
When P
is Closed:
- If the source is a String Data type:
- If the source data position matches the
P
field's name, the rules are the same as above. - If there's no matching field in
P
, the value is ignored. - For different String Data Formats:
- XML:
- The rules are the same as when
P
is Open, except that unspecified fields will be ignored.
- The rules are the same as when
- JSON:
- The rules are the same as when
P
is Open, except that unspecified fields will be ignored.
- The rules are the same as when
- CSV:
- The rules are the same as when
P
is Open, except that unspecified fields will be ignored.
- The rules are the same as when
- XML:
- If the source data position matches the
- If the source is an existing Ballerina value:
- Same as when
P
is Open, except that unspecified fields will be ignored.
- Same as when
Design Choices:
- Having separate functions for each string data module.
- Scalability: This approach is more scalable as new string data formats can be added without affecting existing modules.
- Function call syntax would be:
Person p = 'json:fromJsonStringWithType("{name: "John", age: 30}");
- This introduces redundancy with the prefix, but it is required for consistency with other langlib functions.
- Another option is to omit the
<StringDataFormat>
part, which would make itfromStringWithType
.
- This approach, however, leads to a dependency problem:
- When providing conversion between two different data types (i.e., XML <-> JSON, JSON <-> CSV, TOML <-> YAML, etc.), some base type information needs to be shared.
- This results in a circular dependency problem. For example, the
ballerina/data.json
module would have to depend on theballerina/data.xml
module and vice versa.
- Another option is to have a single module containing all the functions, i.e.,
ballerina/data
.- Function call syntax would be:
Person p = 'data:fromJsonStringWithType("{name: "John", age: 30}");
- Function call syntax would be:
Each approach has its advantages and disadvantages. It's essential to consider the project's scope and future scalability when making this decision.
When converting back to XML, We will need additional information such as xmlns, attributes, etc.
Since XML schema it self is an XML document, We can use read it and generate the Ballerina canonical XML representation.
toXML
and toXmlXXX
functions take an XML Schema as an additional parameter, which will be used to generate the XML document.
ballerina/data.xml
module defines a Schema
record type, which can be used to represent an XML Schema.
Also, this can be used for direct json
-> xml
conversion as well.
One disadvantage of this approach is that XML Schema is not available for all XML documents. So opinionated conversion has to be done for the default case. Also XML Schema can be complex, supporting all the features might be difficult.
Example:
type Schema record {
Element[]? element;
ComplexType[]? complexType;
AttributeGroup[]? attributeGroup;
SimpleType[]? simpleType;
};
type Element record {
string name;
string? 'type;
ComplexType? complexType;
SimpleType? simpleType;
};
type ComplexType record {
string? name;
Sequence? sequence;
Attribute[]? attribute;
AttributeGroup? attributeGroup;
};
type Sequence record {
Element[]? element;
};
type Attribute record {
string name;
string 'type;
string? use;
};
type AttributeGroup record {
string name;
Attribute[] attribute;
};
type SimpleType record {
string name;
Restriction? restriction;
};
type Restriction record {
string base;
MaxLength? maxLength;
MinLength? minLength;
};
type MaxLength record {
int value;
};
type MinLength record {
int value;
};
This approach is similar to the first approach, but instead of using an XML Schema, we can annotate the record type with XML information.
Disadvantages:
- Only Works with Cannibalized XML records.
- Requires additional information to be provided by the user as annotation, which is cumbersome, especially for large XML documents.
Here are some of the functions I found in the Ballerina language specification that start with "from" and "to" and perform some type of conversion:
Category | Module | Function Signature | Description |
---|---|---|---|
XX | ballerina/lang.array | fromBase64(string) returns byte[]|error |
Converts a Base64 string to a byte array |
XX | ballerina/lang.array | fromBase16(string) returns byte[]|error |
Converts a Base16 string to a byte array |
String |
ballerina/lang.boolean | fromString(string) returns T|error |
Construct value from a string representation. |
String |
ballerina/lang.decimal | fromString(string) returns T|error |
Construct value from a string representation. |
String |
ballerina/lang.float | fromString(string) returns T|error |
Construct value from a string representation. |
String |
ballerina/lang.int | fromString(string) returns T|error |
Construct value from a string representation. |
String |
ballerina/lang.regex | fromString(string) returns T|error |
Construct value from a string representation. |
String |
ballerina/lang.xml | fromString(string) returns T|error |
Construct value from a string representation. |
XXString |
ballerina/lang.float | fromHexString(string) returns float|error |
Converts a hexadecimal string representation to a float value |
XXString |
ballerina/lang.int | fromHexString(string) returns int|error |
Converts a hexadecimal string representation to a decimal value |
XXString |
ballerina/lang.value | fromBalString(string) returns anydata|error |
- |
XXString |
ballerina/lang.value | fromJsonString(string) returns json|error |
- |
XXString |
ballerina/lang.value | fromJsonFloatString(string) returns JsonFloat|error |
- |
XXString |
ballerina/lang.value | fromJsonDecimalString(string) returns JsonDecimal|error |
- |
String Data WithType |
ballerina/lang.value | fromJsonStringWithType(string, T) returns T|error |
- |
DATA WithType |
ballerina/lang.value | fromJsonWithType(json, T) returns T|error |
- |
DATA | ballerina/lang.float | fromBitsInt(int) returns float |
- |
DATA | ballerina/lang.string | fromBytes(byte[]) returns string|error |
- |
DATA | ballerina/lang.string | fromCodePointInts(int[]) returns string|error |
- |
DATA | ballerina/lang.string | fromCodePointInt(int) returns Char|error |
- |
Category | Module | Function Signature | Description |
---|---|---|---|
Type | ballerina/lang.array | toStream(T[] arr) returns stream<T,()> |
Converts an array to a stream |
Type | ballerina/lang.float | toBitsInt(float x) returns int |
IEEE 64-bit binary floating point format representation of x as an int |
Type | ballerina/lang.map | toArray(map<Type> m) returns Type[] |
Returns a list of all the members of a map |
Type | ballerina/lang.string | toBytes(string str) returns byte[] |
Represents str as an array of bytes using UTF-8 |
Type | ballerina/lang.string | toCodePointInts(string str) returns int[] |
Converts a string to an array of code points |
Type | ballerina/lang.string | toCodePointInt(Char ch) returns int |
Converts a single character string to a code point |
Type | ballerina/lang.table | toArray(table<MapType> t) returns MapType[] |
Returns a list of all the members of a table |
Type | ballerina/lang.value | toJson(anydata v) returns json |
Converts a value of type anydata to json |
StringDataFormat | ballerina/lang.array | toBase64(byte[] arr) returns string |
Converts a byte array to a Base64 string |
StringDataFormat | ballerina/lang.array | toBase16(byte[] arr) returns string |
Converts a byte array to a Base16 string |
StringDataFormat | ballerina/lang.float | toHexString(float x) returns string |
Converts a float value to a hexadecimal string representation |
StringDataFormat | ballerina/lang.float | toFixedString(float x, int? fractionDigits) returns string |
Returns a string that represents x using fixed-point notation |
StringDataFormat | ballerina/lang.float | toExpString(float x, int? fractionDigits) returns string |
Returns a string that represents x using scientific notation |
StringDataFormat | ballerina/lang.int | toHexString(int n) returns string |
Returns representation of n as hexadecimal string |
StringDataFormat | ballerina/int.string | toLowerAscii(string str) returns string |
Converts occurrences of A-Z to a-z |
StringDataFormat | ballerina/int.string | toUpperAscii(string str) returns string |
Converts occurrences of a-z to A-Z |
StringDataFormat | ballerina/lang.value | toJsonString(anydata v) returns string |
Returns the string that represents v in JSON format |
String | ballerina/lang.value | toString(any v) returns string |
- |
String | ballerina/lang.error | toString(error v) returns string |
- |
BalString | ballerina/lang.value | toBalString(any v) returns string |
- |
BalString | ballerina/lang.error | toBalString(error v) returns string |
- |