Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df1Row[df2Column] == df1Row[df2Column.name] #445

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Jolanrensen
Copy link
Collaborator

Implemented fix for #442.

It used to be that df1Row[df2Column] == df2Column[df1Row.index], but IMO that's confusing. I can see it was done to make df1Row[df2Column] and df2Column[df1Row] return the same thing but across the library, this is only actually used once (in toList and was fixed by swapping row/cols).

It's also very common for the library to call df1Row[newColumn] for reasons I don't fully understand. So my fix checks if the name of newColumn already exists in the row, if so, return the value at that name. If not, do the magic you did before and let the column handle the resolving of the value.

This seems to work with all tests and fixes the specific example of the issue.

@Jolanrensen Jolanrensen added bug Something isn't working invalid This issue/PR doesn't seem right labels Aug 22, 2023
@Jolanrensen Jolanrensen requested a review from koperagen August 22, 2023 14:04
@Jolanrensen Jolanrensen self-assigned this Aug 22, 2023
@koperagen
Copy link
Collaborator

koperagen commented Sep 12, 2023

This behavior is defined by org.jetbrains.kotlinx.dataframe.DataColumn#resolveSingle. Logic here is that when you use DataColumn as a ColumnReference and it's resolved, you get values from the column itself, not the dataframe column with such name. I do not immediately recall where it works like that. Hm.. I think we need some more evidence to make a decision and change behavior in org.jetbrains.kotlinx.dataframe.DataColumn#resolveSingle

@Jolanrensen
Copy link
Collaborator Author

The only place I found it breaking in the library was here: core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/toList.kt. There, the column was taken from the row while the other way around was intended.

I also looked into forbidding such behavior, as taking a column from another dataframe from a row seems odd, but that's still used a lot across the implementation.

But purely from a user-standpoint, what would you expect df1Row[df2Column] to return?

@koperagen
Copy link
Collaborator

Ah, ok, i think i finally understood. Somehow this part was hard for me to understand, that's why i was suspicious of this fix

It used to be that df1Row[df2Column] == df2Column[df1Row.index], but IMO that's confusing. I can see it was done to make df1Row[df2Column] and df2Column[df1Row] return the same thing but across the library, this is only actually used once (in toList and was fixed by swapping row/cols).

Now i think we talk about the same thing, but differently. Yeah, indeed df1Row[df2Column] gives you values from df2Column if df2Column is pure DataColumn, unlike columns created by column accessors API:

val name by columnOf(1, 2, 3) // ResolvingValueColumn
df.select { name }.print()

val col = df.name.lastName.map { it.length } named "abc" // pure DataColumn
df.select { col }.print() 

image

So yeah, to me the reason why it works like this is that we need this behavior for data schema api:

 df.select { 
    name.lastName // pure DataColumn
}

But it backfires in getValue(row: AnyRow) because it doesn't make sense. Hm, still not sure if i see the full picture, but i hope my thoughts made it clearer for you as well :D What do you think?
I can say that fix looks reasonable. I suggest to change if (column.name() in df.columnNames()) to something faster. Maybe check that ColumnReference is our problematic DataColumn, or something else

@Jolanrensen
Copy link
Collaborator Author

@koperagen Thanks! you just gave me another test case that this change indeed broke. I'll try to find another solution which is a) more efficient and b) doesn't break df.select { dataColumnWithNameAlsoExistingInDf }

@Jolanrensen
Copy link
Collaborator Author

That's interesting...

val aColumn = columnOf(3, 4, 5) named "a"
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }

This results in "1, 2, 3". So it takes the name from aColumn, since it exists in the dataframe already, instead of returning the new "3, 4, 5" column. But the funny thing is that this does not use my fix. This behavior was already there.

@koperagen
Copy link
Collaborator

koperagen commented Sep 20, 2023

That's interesting...

val aColumn = columnOf(3, 4, 5) named "a"
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }

This results in "1, 2, 3". So it takes the name from aColumn, since it exists in the dataframe already, instead of returning the new "3, 4, 5" column. But the funny thing is that this does not use my fix. This behavior was already there.

Right. Columns created by columnOf are "force resolved" and behave differently from columns created by let's say df["a"]

public inline fun <reified T> columnOf(vararg values: T): DataColumn<T> =
    createColumn(values.asIterable(), typeOf<T>(), true).forceResolve()

@Jolanrensen
Copy link
Collaborator Author

@koperagen hmm, you're right

val aColumn = dataFrameOf("a")(4, 5, 6)["a"]
dataFrameOf("a")(1, 2, 3)
    .select { aColumn }

results in "4, 5, 6" indeed. I don't know entirely how to rectify this anymore... but I do think they should all behave the same

@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented Sep 22, 2023

But it backfires in getValue(row: AnyRow) because it doesn't make sense. Hm, still not sure if i see the full picture, but i hope my thoughts made it clearer for you as well :D What do you think? I can say that fix looks reasonable. I suggest to change if (column.name() in df.columnNames()) to something faster. Maybe check that ColumnReference is our problematic DataColumn, or something else

Okay, new implementation doesn't check all names anymore; it attempts to resolve both manners (from column by row index and from dataframe by name) and gives by name priority if it was able to resolve it and it was different from the other manner:

val fromColumnByRow = column.getValue(this)

val fromDfByName = df
    .let {
        try {
            it.getColumnOrNull(column.name())
        } catch (e: IllegalStateException) {
            return fromColumnByRow
        }
    }
    .let { it ?: return fromColumnByRow }
    .let {
        try {
            it[index]
        } catch (e: IndexOutOfBoundsException) {
            return fromColumnByRow
        }
    }
    .let {
        try {
            it as R
        } catch (e: ClassCastException) {
            return fromColumnByRow
        }
    }

return when {

    // Issue #442: df1Row[df2Column] should be df1Row[df2Column.name], not df2Column[df1Row(.index)]
    // so, give fromDfByName priority if it's not the same as fromColumnByRow
    fromDfByName != fromColumnByRow -> fromDfByName

    else -> fromColumnByRow
}

@Jolanrensen Jolanrensen removed the invalid This issue/PR doesn't seem right label Oct 1, 2024
@Jolanrensen Jolanrensen marked this pull request as draft October 1, 2024 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants