KTNB-693 Send the full dataframe schema as metadata #706

Merged
merged 3 commits into from
May 30, 2024

Contributor

@cmelchior cmelchior commented May 27, 2024

This part adds the infrastructure needed for https://youtrack.jetbrains.com/issue/KTNB-693/Enable-AI-Actions-for-DataFrames-in-Kotlin-Notebooks. We currently cannot reliably detect column types, which is needed when creating prompts for the AI Assistant.

It adds a new "types" property to the top-level "metadata" and, recursively, to the metadata of nested columns in each row, so that column types are easy to identify.

A "columns" property has also been added to the ColumnGroup and FrameColumn metadata. It contains the nested column names, similar to the top-level "columns" property.

Example:

val col1 by columnOf("a", "b", "c")
val col2 by columnOf(1, 2, 3)
val col3 by columnOf("Foo", "Bar", null)
val df2 = dataFrameOf(Pair("header", listOf("A", "B", "C")))
val col4 by columnOf(df2, df2, df2)
var df = dataFrameOf(col1, col2, col3, col4)
df.group(col1, col2).into("group")
{
   ...
             {
              "$version": "2.1.0",
              "metadata": {
                "columns": ["group", "col3", "col4"],
                "types": [{
                  "kind": "ColumnGroup"
                }, {
                  "kind": "ValueColumn",
                  "type": "kotlin.String?"
                }, {
                  "kind": "FrameColumn"
                }],
                "nrow": 3,
                "ncol": 3
              },
              "kotlin_dataframe": [{
                "group": {
                  "data": {
                    "col1": "a",
                    "col2": 1
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Foo",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "b",
                    "col2": 2
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Bar",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "c",
                    "col2": 3
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": null,
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }]
            }
}
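
A consumer of this payload could pair each entry of "columns" with the entry of "types" at the same index. The sketch below is only an illustration under stated assumptions, not the plugin's code: `TypeInfo`, `FrameMetadata`, and `describe` are hypothetical names, and the metadata is modeled with plain Kotlin classes instead of being parsed from JSON.

```kotlin
// Hypothetical model of one entry in a "types" array.
// In the example above, "type" is only present for ValueColumn entries.
data class TypeInfo(val kind: String, val type: String? = null)

// Hypothetical model of a "metadata" object: parallel lists of names and types.
data class FrameMetadata(val columns: List<String>, val types: List<TypeInfo>)

// Pairs each column name with its type entry by position.
fun describe(metadata: FrameMetadata): List<String> {
    require(metadata.columns.size == metadata.types.size) {
        "columns and types must have the same length"
    }
    return metadata.columns.zip(metadata.types) { name, info ->
        if (info.type != null) "$name: ${info.kind}<${info.type}>" else "$name: ${info.kind}"
    }
}

fun main() {
    // Top-level metadata copied from the example payload.
    val top = FrameMetadata(
        columns = listOf("group", "col3", "col4"),
        types = listOf(
            TypeInfo("ColumnGroup"),
            TypeInfo("ValueColumn", "kotlin.String?"),
            TypeInfo("FrameColumn"),
        ),
    )
    describe(top).forEach(::println)
}
```

The same function could be applied to the nested "metadata" objects of ColumnGroup and FrameColumn values, since they carry the same "columns" and "types" pair.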

schemaData["name"] = name
schemaData["kind"] = columnSchema.kind.toString()
when (columnSchema) {
    is ColumnSchema.Value -> schemaData["type"] = columnSchema.type.toString()
Collaborator

We also have a function that turns a KType into a String. It's used in HTML rendering and in DataFrameSchema.toString:

internal fun renderType(type: KType?): String {

What do you think? Would it suit AI actions?

Contributor Author

Looking into that method, it seems to hide some of the type information in some cases, so I do not think it is suitable for AI Actions, at least if we want to be as specific as possible with the context information. Also, if others want to use the type information, we probably need the fully qualified type as well.

Contributor Author

Looking into this method, it seems to remove type information in some cases, which might be a problem when we want to use the metadata to improve the context of AI actions. So for now I would prefer to keep the fully qualified names that also include nullability, e.g. kotlin.String?
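
To illustrate the concern, the sketch below uses a hypothetical `shortName` helper that strips package prefixes from a type string. It is not `renderType` and makes no claim about what that function does; it only shows how a shortened rendering can map two distinct types to the same string, while the fully qualified name keeps them apart.

```kotlin
// Hypothetical simplifier: drops lowercase package prefixes such as "kotlin."
// or "java.util." from a fully qualified type string.
// This is NOT the library's renderType, only an illustration.
fun shortName(qualified: String): String =
    qualified.replace(Regex("""\b(?:[a-z_][A-Za-z0-9_]*\.)+"""), "")

fun main() {
    val utilDate = "java.util.Date"
    val sqlDate = "java.sql.Date"

    // Fully qualified names distinguish the two types.
    println(utilDate == sqlDate)

    // After shortening, both collapse to the same string.
    println(shortName(utilDate) == shortName(sqlDate))

    // Nullability survives in this helper, but the package does not.
    println(shortName("kotlin.collections.List<kotlin.String?>"))
}
```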

@ermolenkodev
Contributor

There is a problem when a FrameColumn contains frames with different schemas. I recommend attaching types to the metadata of each nested frame. This may lead to duplication if the schema of each nested frame is the same, but it will make it easier to work with on the Kotlin Notebook plugin side. We already have a lot of duplication because we pass column names for each value in rows, so this additional overhead will be minimal.
Here is a short reproducer of the problem:
dataFrameOf("a", "b")(1, dataFrameOf("c", "d")(1, 2), 2, dataFrameOf("e", "f")(1, 2))
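
The reproducer above can be mirrored without the DataFrame library to show why a single schema per FrameColumn is not enough. In this sketch a nested frame is modeled as a list of rows (`List<Map<String, Any?>>`), which is an assumption made purely for illustration, and `columnNames` is a hypothetical helper.

```kotlin
// A nested frame modeled as a list of rows; each row maps column name to value.
// This is an illustrative stand-in, not the DataFrame library's representation.
typealias Frame = List<Map<String, Any?>>

// Collects the column names of a frame in first-seen order.
fun columnNames(frame: Frame): List<String> =
    frame.flatMap { it.keys }.distinct()

fun main() {
    // Column "b" of the reproducer holds one nested frame per row.
    val nestedInRow0: Frame = listOf(mapOf("c" to 1, "d" to 2))
    val nestedInRow1: Frame = listOf(mapOf("e" to 1, "f" to 2))

    // Each row's nested frame has its own column names, so one shared
    // "columns"/"types" entry for the whole FrameColumn cannot describe both.
    println(columnNames(nestedInRow0))
    println(columnNames(nestedInRow1))
    println(columnNames(nestedInRow0) == columnNames(nestedInRow1))
}
```

Attaching the metadata to each nested frame, as suggested, duplicates it when the schemas happen to match but stays correct when they differ.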

Contributor

@ermolenkodev ermolenkodev left a comment

@cmelchior
Contributor Author

@ermolenkodev I see your point. I forgot that each row could hold a data frame with a different schema. So you are right, it is probably better to have the schema as part of the metadata inside the data frame content.

I'll refactor it.

@cmelchior cmelchior requested a review from ermolenkodev May 29, 2024 14:09
@cmelchior
Contributor Author

After some discussion with @ermolenkodev, we decided to rework the metadata a little. I have updated the PR and its description, so it should be ready for a second round of review.

@cmelchior cmelchior merged commit 75d8e78 into master May 30, 2024
4 checks passed