Skip to content

Conversation

@CarloMariaProietti
Copy link
Contributor

@CarloMariaProietti CarloMariaProietti commented Nov 10, 2025

FIXES #658
It makes possible to compare DataFrame by exploiting Myers difference algotithm whose cost is O((M+N)*D) .
M is length of dfA, N is length of dfB, D is length of shortest edit script to get B from A.

Returns a DataFrame< ComparisonDescription >,
ComparisonDescription is a schema created specifically for this use case.

It comes with a proper test case.

About Myers difference algotithm:
https://neil.fraser.name/writing/diff/myers.pdf

@CarloMariaProietti CarloMariaProietti marked this pull request as ready for review November 16, 2025 19:05
// row at index 'x-1' of dfA was removed
xPrev + 1 == x && yPrev + 1 != y -> {
comparisonDf = comparisonDf.concat(
dataFrameOf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange formatting, can you lint this file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also probably best to add argument names to ComparisonDescription, "magic" argumentless constants are often a source of bugs :)

internal class ComparisonDescription(
val rowAtIndex: Int,
val of: String,
val wasRemoved: Boolean?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this results in true or null... what's wrong with false? Same below

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... rows can either be removed or inserted, right? nothing else. Let's turn these two booleans in an enum as well then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose for null instead of false to remark that if wasInserted is true, wasRemoved column does not even make sense (and vive-versa).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; well, in that case an enum makes more sense I think

val of: String,
val wasRemoved: Boolean?,
val wasInserted: Boolean?,
val afterRow: Int?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this insertedAfterRow or something more expressive. That explains why it's null if wasRemoved == true

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but honestly, this seems a thing AI would do wrong. If you're using AI, that's okay; however, you remain the one responsible for the code.

At first you had:

val wasRemoved: Boolean?
val wasInserted: Boolean?
val afterRow: Int?

I remarked here: #1556 (comment) both values had two options: either null or true. This is an odd thing to do with booleans which, as you know, have 2 values already: true or false, however, you explained why, so that makes slightly more sense now :)

What I don't understand is why you now have two nullable RowOfComparisons:

val wasRemoved: RowOfComparison?
val wasInserted: RowOfComparison?

What would these even contain? A row is either 'removed' (WAS_REMOVED) or 'inserted' (WAS_INSERTED_AFTER_ROW), right? So why not just make one column called modification: RowOfComparison?

Next, I meant you to rename afterRow to insertedAfterRow, not wasInserted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, no AI at all was used in this code. I misunderstood #1556 (comment),
I thought that I had to create an enum whose constants had an exact correspondance with boolean values true and false (so that the schema could remain the same) .
However, now I understood what You meant :), I proceed to correct.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I take back what I said :) Thanks for the explanation!

val wasRemoved: Boolean?,
val wasInserted: Boolean?,
val afterRow: Int?,
) : DataRowSchema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could include the modified DataRow<*> in the resulting DataFrame as well. That could make it a bit easier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Myers difference algorithm exploits the idea of comparison in a boolean sense,
a member (row) of the data structure (df) is equal or not-equal to another member.
In this logic a modified row is represented by two DataRow<ComparisonDescription> ,
One row is like: row n of dfA was removed and the other: row m of dfB was inserted after row n-1 (of dfA).

Copy link
Contributor Author

@CarloMariaProietti CarloMariaProietti Nov 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo representing explicitly modified rows means customing Myers Alg logic by introducing a 3-possible-output non boolean logic of comparison: Equal, Non Equal, Similar. Similar means that compared row differ for a limited number of elements (that may be proportional to row's length).
In a lower abstraction level, 'similar' looks like a 'flagged' diagonal move in the edit script graph
-> list of Pair would be no more enough to represent the result, we would need a list of 3-ple.
Anyway, imo, this slution has the following weakness: the result is less neutral because the previous definition of 'Similarity' can't determine with certainty wheter a row was actually modified.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not (yet) asking for in-row differences. Simply for adding the original row to a new column when that row was "Not Equal", so either "removed" or "inserted". A ComparisonDescription like row n of dfA was removed doesn't really help me if I can't see what "row n of dfA" actually contains. Similar to row m of dfB.

Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would make the comparison output more independent. I can try to implement it.

Copy link
Contributor Author

@CarloMariaProietti CarloMariaProietti Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would make the comparison output more independent. I can try to implement it.

Done, i added a DataRow<T> column to ComparisonDescription<T> representing the content of the row.

@Jolanrensen Jolanrensen added this to the 1.0.0-Beta5 milestone Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comparing two data frame

2 participants