Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

octopus-merge (part 2: blob-merge) #1585

Merged
merged 10 commits into from
Sep 30, 2024
Merged

octopus-merge (part 2: blob-merge) #1585

merged 10 commits into from
Sep 30, 2024

Conversation

Byron
Copy link
Member

@Byron Byron commented Sep 7, 2024

Implement an octopus merge based on trees, and (mostly) equivalent to merge-ORT in Git. The foundation, tree-editing, was implemented in #1566.

Related to gitbutlerapp/gitbutler#4793.

Tasks

This PR was re-focussed on blob-based conflict detection and the generation of blob merge-results with conflict markers.

  • sketch API and document all knowledge from Git docs through it
    • flatten merge into platform API, similar to how it's done in diff Bring merge::State to being able to execute the merge.
  • Pipeline implementation and tests
    • renormalization also works
    • refactor to only lookup attributes once, on platform, as only a single path counts.
  • Platform implementation and tests
    • generalize merge-command invocation - lots of context must be passed
      • See if shell-quoting of the %P argument for merge-drivers is needed for it to get through to the driver program
      • Allow to pass additional environment variables so GIT_REFLOG_ACTION, GIT_EXEC_PATH and GIT_PREFIX can be passed
    • Can 'zealous' conflict resolution (or information) be part of that, whether it's used or not? Is it decided here?
      • Yes, we could provide access to our internal hunk information, but it seems like something that isn't desirable just yet.
    • binary merge
      • with large file and binary file support (i.e. it doesn't fail due to these conditions)
    • text merge
      • resolve conflicts with union, ours and theirs
      • merge style
      • diff3 style
      • zdiff3 style
      • properly deduce line endings
      • integrate tests for diff3-conflict-markers.sh (some special cases)
        • figure out common case, where single-line changes are merged with the common middle, probably unrelated to the actual diff as zdiff3 and diff3 work. - skipped in interest of time
        • Note that 'union' is also affected by this, which seems detremental - skipped in interest of time
      • tests for newline handling (in baseline)
      • integrate tests for merge-file (less simplistic input)
      • How are newlines written in the <<< lines that the text merge is adding? Probably configurable?
        • It's based on looking at the hunks before and after
        • a test or two specifically for newline handling - we are deviating quite a bit - would need to check Git code to do this right, skipping in the interest of time as this is probably minor.
      • What about deletions when zealously handling hunks? Have a test for that!
    • assure tests run into every condition - only one place is missing that the code doesn't run into, and for now I don't feel bold enough to make it an assertion.
  • validate that a difference in line-ending style will clash (it should, tokens-with-line-endings are used) (also for gix-diff.

Questions

  • How to deal with MERGE_HEAD and additional files the merge might create?
    • For merging commits only
  • Does Git provide binary data to external merge programs (and diff drivers, for that matter?)
    • Yes, and it also does that for files that are too large (larger than 1023MB). It has no problem reading them from disk to memory though, which is inefficient, and we should do better.

Next PR / Outscoped

  • textconv with context, see this gist for details.
    • There seem to be different 'tiers' of tools, some don't get GIT_DIR set, others do.
    • It also seems that diff-programs get too much context right now, but that depends on how much is passed to them by the caller as gix-command::Context.
  • How to model virtual-merge-bases? Can be none or many, user should have control over how this is done.
    • there is tests for that in diff3-conflict-markers
  • Understand how conflicts are currently handled in GitButler.
  • Actual tree-based merging
    • See also, all the merge options
    • Submodule merges are also possible! Maybe outscope it though! libgit2 also doesn't try it.
    • diff3-conflict-markers.sh - be sure to capture the 'empty tree' label , but also other special cases
  • Make blob-merge based conflict handling possible in the tree merge from gix at least.

Research

Everything is about MergeORT.

  • it uses an empty tree if there is no merge-base - we must allow the same.
  • it allows for multiple merge-bases, creating a virtual one by merging all merge-bases together using the same algorithm, recursively.
  • merges can have conflicts without a individual files being involved, for instance when directory renames clash
  • Must make sure that possible types of conflicts are properly communicated, to not degenerate information
  • It puts conflict-markers in the result tree, with annotations to provide additional context
  • Need resolution configuration, see git2::MergeOptions.
  • data stored by path, and is interned in the map to allow pointer-based comparisons
    • merge-info with everything one needs to know, also related to renames
    • or conflict information
    • it uses a memory-pool/arena to get memory for many paths all at once (and also release it like that)
  • paths start out as conflicted, and then can later be changed to non-conflicting if a content-based merged succeeded.
    • If it remains conflicts, the meta-data is used to produce an 'as merged as possible' version with conflict markers that can be checked out to the working tree.
  • hunks can partially overlap, but can also be resolved line-by line to some extend.

Crates

Mostly for performance optimization, also interesting for handling the index, and the tree::editor types.

Building Blocks for String Interning

String Interining

Handle Special Cases

  • A file was renamed differently
  • deal with "merge.directoryRenames"

Questions

Is git2::merge_trees() a trivial merge? Does it handle all the cases of MergeORT?

Maybe not, but it definitely handles rename tracking. See these git2 flags for more information.

How does rename-tracking affect a tree-merge?

The first round is done without renames, then there is a second round to find renames, and perform the merge of renamed items.

How is an octopus merge implemented, particularly with Merge ORT?

  • recursive mere-base merge to get virtual merge base
  • three-way tree iterator merge
  • rename tracking

References

Removed Utility

Prints a patch with context 3, using bytes only, based on imara-diff.

/// An adapted version of the unified diff writer to get a first idea.
        // TODO: remove this
        #[allow(dead_code)]
        mod unified_test {
            use bstr::ByteVec;
            use imara_diff::intern::{InternedInput, Interner, Token};
            use imara_diff::Sink;
            use std::ops::Range;

            /// A [`Sink`] that creates a textual diff
            /// in the format typically output by git or gnu-diff if the `-u` option is used
            pub struct UnifiedDiffBuilder<'a, W>
            where
                W: std::io::Write,
            {
                before: &'a [Token],
                after: &'a [Token],
                interner: &'a Interner<&'a [u8]>,

                pos: u32,
                before_hunk_start: u32,
                after_hunk_start: u32,
                before_hunk_len: u32,
                after_hunk_len: u32,

                buffer: Vec<u8>,
                dst: W,
            }

            impl<'a, W> UnifiedDiffBuilder<'a, W>
            where
                W: std::io::Write,
            {
                /// Create a new `UnifiedDiffBuilder` for the given `input`,
                /// that will writes it output to the provided implementation of [`Write`].
                pub fn with_writer(input: &'a InternedInput<&'a [u8]>, writer: W) -> Self {
                    Self {
                        before_hunk_start: 0,
                        after_hunk_start: 0,
                        before_hunk_len: 0,
                        after_hunk_len: 0,
                        buffer: Vec::with_capacity(8),
                        dst: writer,
                        interner: &input.interner,
                        before: &input.before,
                        after: &input.after,
                        pos: 0,
                    }
                }

                fn print_tokens(&mut self, tokens: &[Token], prefix: char) {
                    for &token in tokens {
                        self.buffer.push_char(prefix);
                        self.buffer.extend_from_slice(self.interner[token]);
                    }
                }

                fn flush(&mut self) {
                    if self.before_hunk_len == 0 && self.after_hunk_len == 0 {
                        return;
                    }

                    let end = (self.pos + 3).min(self.before.len() as u32);
                    self.update_pos(end, end);

                    writeln!(
                        &mut self.dst,
                        "@@ -{},{} +{},{} @@",
                        self.before_hunk_start + 1,
                        self.before_hunk_len,
                        self.after_hunk_start + 1,
                        self.after_hunk_len,
                    )
                    .unwrap();
                    self.dst.write_all(&self.buffer).unwrap();
                    self.buffer.clear();
                    self.before_hunk_len = 0;
                    self.after_hunk_len = 0
                }

                fn update_pos(&mut self, print_to: u32, move_to: u32) {
                    self.print_tokens(&self.before[self.pos as usize..print_to as usize], ' ');
                    let len = print_to - self.pos;
                    self.pos = move_to;
                    self.before_hunk_len += len;
                    self.after_hunk_len += len;
                }
            }

            impl<W> Sink for UnifiedDiffBuilder<'_, W>
            where
                W: std::io::Write,
            {
                type Out = W;

                fn process_change(&mut self, before: Range<u32>, after: Range<u32>) {
                    if before.start - self.pos > 6 {
                        self.flush();
                        self.pos = before.start - 3;
                        self.before_hunk_start = self.pos;
                        self.after_hunk_start = after.start - 3;
                    }
                    self.update_pos(before.start, before.end);
                    self.before_hunk_len += before.end - before.start;
                    self.after_hunk_len += after.end - after.start;
                    self.print_tokens(&self.before[before.start as usize..before.end as usize], '-');
                    self.print_tokens(&self.after[after.start as usize..after.end as usize], '+');
                }

                fn finish(mut self) -> Self::Out {
                    self.flush();
                    self.dst
                }
            }
        }
    }

@Byron Byron force-pushed the merge branch 2 times, most recently from ae27a5e to 1d3d258 Compare September 11, 2024 09:02
@Byron Byron changed the title octopus-merge (part 2) octopus-merge (part 2: blob-diff) Sep 11, 2024
@Byron Byron changed the title octopus-merge (part 2: blob-diff) octopus-merge (part 2: blob-merge) Sep 11, 2024
@Byron
Copy link
Member Author

Byron commented Sep 12, 2024

@EliahKagan Just as a note in case you'd like to submit a first patch to Git that will definitely be able to land, maybe to warm-up with the mailing-list workflow, then I have something for you.

The attribute-documentation about three-way merges contains a reference to merge.default which is indeed used in code as well.

However, the git-config documentation doesn't mention it at all.

@Byron Byron force-pushed the merge branch 3 times, most recently from 0ae3791 to ff0975c Compare September 13, 2024 12:13
gix-diff/src/blob/pipeline.rs Outdated Show resolved Hide resolved
@Byron Byron force-pushed the merge branch 14 times, most recently from 2f9c91b to a6678f9 Compare September 18, 2024 14:27
@Byron Byron force-pushed the merge branch 6 times, most recently from ca465ab to 46a86e9 Compare September 25, 2024 18:20
@Byron Byron force-pushed the merge branch 2 times, most recently from 022589f to 4783693 Compare September 26, 2024 14:41
This makes it more usable as the value lives longer than the ref itself.
@Byron Byron force-pushed the merge branch 5 times, most recently from b599a89 to e864aaa Compare September 28, 2024 18:23
@Byron Byron force-pushed the merge branch 4 times, most recently from efcab99 to 182f9ab Compare September 30, 2024 08:59
That way, the platform can be used to perform actual merges.
This will also be a good chance to try the API.
@Byron Byron marked this pull request as ready for review September 30, 2024 12:42
@Byron Byron merged commit 2261de4 into main Sep 30, 2024
16 checks passed
@Byron Byron deleted the merge branch September 30, 2024 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants