
Conversation

@amogh-jahagirdar
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Jan 9, 2026

While writers are generally expected to merge DVs for a given data file before attempting to commit, we probably want a safeguard in the commit path in case this assumption is violated. This has been observed when AQE is enabled in Spark and a data file is split across multiple tasks (it really just depends on how files and deletes are split); multiple DVs are then produced for the same data file and committed. Currently, after such a commit, reads fail because the DeleteFileIndex detects the duplicates.

Arguably, there should be a safeguard on the commit path which detects duplicates and fixes them up to prevent any invalid table states. Doing this behind the API covers any engine integration using the library.

This change updates MergingSnapshotProducer to track duplicate DVs for a data file, merge them, and produce one Puffin file per merged DV. Since we generally expect duplicates to be rare, we don't expect many small Puffin files to be produced, so we don't add additional logic to coalesce them into larger files; they can also be compacted later. In the case of large-scale duplicates, engines should arguably fix those up before handing off to the commit path.
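For illustration, a minimal sketch of the merge step described above, using the Deletes.readDV and PositionDeleteIndex.merge calls visible in the diff snippets below; the mergeDuplicates name and the writeDV(...) plumbing are placeholders, not the PR's exact code.

```java
// Sketch only: union the positions of all duplicate DVs for one data file and
// write a single Puffin-backed DV from the merged index.
private DeleteFile mergeDuplicates(String referencedDataFile, List<DeleteFile> dvs) {
  // read the first DV's position index, then merge the rest into it
  PositionDeleteIndex merged = Deletes.readDV(dvs.get(0), ops().io(), ops().encryption());
  for (int i = 1; i < dvs.size(); i++) {
    merged.merge(Deletes.readDV(dvs.get(i), ops().io(), ops().encryption()));
  }

  // writeDV(...) stands in for the Puffin writer plumbing in MergingSnapshotProducer
  return writeDV(referencedDataFile, dvs.get(0), merged);
}
```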

@github-actions github-actions bot added the core label Jan 9, 2026
@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Jan 9, 2026

Still cleaning some stuff up, so leaving this in draft, but feel free to comment. Basically, there are some cases in Spark where a file can be split across multiple tasks, and if deletes happen to touch more than one of those parts, we'd incorrectly produce multiple DVs for a given data file (discovered this recently with a user who had Spark AQE enabled, but I think file splitting can happen in more cases).

We currently throw on read in such cases, but ideally we can try and prevent this on write by detecting and merging pre-commit.

The reason this is done behind the API is largely to be defensive from a library perspective: if an engine or integration happens to produce multiple DVs, we can at least fix them up pre-commit.

If there are too many to reasonably rewrite on a single node, engines could do a distributed fixup before handing the files off to the API, but from a library perspective it arguably seems reasonable to pay this overhead to prevent bad commits across any integration.

@github-actions github-actions bot added the spark label Jan 9, 2026
addedFilesSummary.addedFile(spec, file);
hasNewDeleteFiles = true;
if (ContentFileUtil.isDV(file)) {
newDVRefs.add(file.referencedDataFile());
Contributor Author

Probably keep a boolean in case we detect a duplicate. That way we don't have to pay the price of grouping by referenced file every time to detect possible duplicates; only if we detect one at add time do we do the dedupe/merge.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 10, 2026

We also could just keep a mapping specific to duplicates. That shrinks down how much work we need to do because, instead of grouping by every referenced data file in case of duplicates, we just go through the duplicates set. It uses a little more memory, but considering that we expect duplicates to be rare, it feels like the better solution.

Comment on lines 1180 to 1150
Preconditions.checkArgument(
Objects.equals(dv.dataSequenceNumber(), firstDV.dataSequenceNumber()),
"Cannot merge duplicate added DVs when data sequence numbers are different,"
+ "expected all to be added with sequence %s, but got %s",
Contributor

Is this during a single write? Are we worried about compaction cases, or is it just a safety net?

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 10, 2026

Primarily a safety net, because this check is only for the case where duplicate DVs are added in a single commit.

The way I think about it: if someone is adding DVs in a single commit that are duplicates AND span multiple sequence numbers, then something is probably wrong (it already is from the duplicate perspective; it's doubly wrong if the sequence numbers differ for the duplicates), and from a merging perspective we'd have no basis for inferring which sequence number to use. So I think it's reasonable to fail the commit in this case.

Member

I think the spec and tuple checks are also very cautious :) Theoretically, I imagine this only happens if I have duplicate file paths in the table; not sure how else this could happen.

Preconditions.checkArgument(
Objects.equals(dv.partition(), firstDV.partition()),
"Cannot merge duplicate added DVs when partition tuples are different");
positionDeleteIndex.merge(Deletes.readDV(dvs.get(i), ops().io(), ops().encryption()));
Contributor

I wonder: let's say DV1 and DV2 both set the same position in the vector, is this going to be an illegal state, given that merging will simply union them?

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 10, 2026

Hm, I think in the context of merging within the same commit, it's not illegal.
If DV1 has positions 1, 2, and 3 set,
and DV2 for some reason has positions 3 and 4 set,

It may indicate that a writer is doing duplicate work within a single commit, but logically, from a within-commit perspective, it doesn't seem wrong to merge.

Since this is all within the same commit, it feels reasonable to me to just produce 1, 2, 3, and 4.

I think where this would make a difference is if we wanted to do merging on conflict resolution; in that case we do need to care that a concurrently added DV doesn't delete the same position to ensure serializable guarantees.

But I would argue that's different than what we're talking about here which is merging duplicates within the same commit.
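A toy illustration of the union semantics being discussed, using plain java.util sets in place of the position delete index:

```java
import java.util.Set;
import java.util.TreeSet;

// Toy example: merging duplicate DVs for the same data file is a plain union of positions.
class DvUnionExample {
  public static void main(String[] args) {
    Set<Long> dv1 = Set.of(1L, 2L, 3L); // positions deleted by DV1
    Set<Long> dv2 = Set.of(3L, 4L);     // positions deleted by DV2

    Set<Long> merged = new TreeSet<>(dv1);
    merged.addAll(dv2);

    System.out.println(merged); // [1, 2, 3, 4] -- the overlap at position 3 is harmless
  }
}
```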

@amogh-jahagirdar amogh-jahagirdar force-pushed the always-merge-duplicates-on-driver branch from 374b567 to c04d0e0 Compare January 11, 2026 20:02
firstDV.dataSequenceNumber());
}

private DeleteFile writeDV(
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 11, 2026

I would group all the merged DVs and then write out the Puffin; right now, in the case of duplicates, we'd eagerly produce a tiny Puffin per data file that originally had duplicates.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 11, 2026

I take back what I said; I'll leave this as is for now unless there are strong opinions here. It's a classic tradeoff between parallelism and file sizes.

If we work from the assumption that there won't be many data files with duplicate DVs, then there won't be many small Puffins created. From a metadata perspective it doesn't really matter, because it's the same number of entries in the delete manifests after merging; this is mainly a storage I/O cost concern. But again, if there aren't many data files with duplicates, that cost is probably negligible.

Trying to coalesce all the DVs into a single Puffin file just complicates the logic even more, and personally I'd prefer to keep the logic simpler until we know it's really a problem. In the end we could run rewrite position deletes to better colocate the blobs, and considering this fixup is not expected to run in the average case, it doesn't seem worth the added complexity.

Contributor Author

The worst case from a storage I/O perspective is an unclustered delete against many data files that are split, producing DVs for some random set of positions in every data file. But in that case there's almost certainly a distributed engine at play, and we could do an optimization in the Spark integration before handing off DVs to the commit path (basically, a distributed fixup would be required).

DeleteFileSet deleteFiles =
newDeleteFilesBySpec.computeIfAbsent(spec.specId(), ignored -> DeleteFileSet.create());
if (deleteFiles.add(file)) {
addedFilesSummary.addedFile(spec, file);
Contributor Author

Because we may be merging duplicates, we don't update the summary for delete files until after we dedupe, just before writing the new manifests.

private final Map<Integer, DataFileSet> newDataFilesBySpec = Maps.newHashMap();
private Long newDataFilesDataSequenceNumber;
private final Map<Integer, DeleteFileSet> newDeleteFilesBySpec = Maps.newHashMap();
private final Map<String, DeleteFileSet> duplicateDVsForDataFile = Maps.newHashMap();
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 12, 2026

This approach keeps track of duplicate DVs as they are added for commit. The logic was already tracking newDVRefs below, so as soon as a delete file is added whose referenced file is already in newDVRefs, this map can be populated.

Another option is to just track a boolean indicating whether there are any duplicates; then, at the end, we could loop over all the DVs and group by referenced file to figure out the duplicates. That option doesn't seem great when we expect the average data file to not have duplicates, which is why I went with the targeted map that gets updated every time a duplicate DV is added.
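Roughly, the add-time tracking described here could look like the following sketch (based on the fields in the snippet above; this approach was later replaced by the multimap discussed below, and the merge-time lookup of the first DV is assumed rather than shown):

```java
// Sketch: only populate duplicateDVsForDataFile when a second DV for the same
// data file is added, so the common no-duplicates path stays a cheap set add.
if (ContentFileUtil.isDV(file)) {
  String referenced = file.referencedDataFile();
  if (!newDVRefs.add(referenced)) {
    // a DV for this data file was already added in this commit; remember the
    // duplicate so it can be merged before manifests are written (the first DV
    // is still tracked in newDeleteFilesBySpec and is looked up at merge time)
    duplicateDVsForDataFile
        .computeIfAbsent(referenced, ignored -> DeleteFileSet.create())
        .add(file);
  }
}
```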

Member

Are we saving that much by keeping this in an optionally populated map? I feel like "newDVRefs" could just be a map keyed by data file name.

That would increase our memory usage by the name of every data file, but I feel like that can't be that much since we are already storing all the DataFile objects completely... Just feels like things are easier that way.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 13, 2026

It certainly is simpler that way, but I think it's more than just the data file name: if newDVRefs is a map tracking all DVs for a data file, then we'd need to track a DeleteFileSet per data file (which we expect to hold a single DV in most cases), and we also already track the deletes as part of the regular tracking. So the average-case memory for delete files goes from O(delete files) to O(2 * delete files). Unless we further split the delete file tracking, i.e. separate the tracking for DVs from the tracking for all other delete files? But that complicates things in another way.

Since delete files are already pretty small, I think you're right that it's not so bad, but I'm a bit hesitant to incur that cost for the average case where we don't really expect duplicates to begin with.

Member

Why would it be 2 * delete files? We would just use the value set as the list of all delete files?

Member

I guess my thought here is that if we are going to spend the time to clean it up, we may as well just do a String-to-DeleteFiles map. If we were just going to throw an error, I would keep newDVRefs as a set.

Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 13, 2026

Why would it be 2 * Deletefiles? We would just use the value set as the list of all delete files?

Yeah I just meant that it'd be 2 * delete files if we had both deleteFilesBySpec and dvRefs as a map. We could split the tracking so that deleteFilesBySpec only tracks eq. deletes and v2 position deletes, and dvRefs is only V3 DVs for a given data file. Then it's just O(deletes), but then separating the tracking also means some changes in other places.

I guess my thought here, is if we are going to spend the time to clean it up we may as well just do String, DeleteFiles map. If we were just going to throw an error I would keep newDVRefs as a set.

I'm good either way, with a slight preference for tracking just the duplicates!

I do think we should do a fixup rather than error because, even if the fixup is expensive, it incentivizes engines to avoid the problem during the distributed write, or to do their own fixup before handing off to the commit path. Then we sort of get the best of both protecting the table state and performance, without having to cause failures.

Member

@RussellSpitzer RussellSpitzer Jan 13, 2026

I meant we would change

private final Set newDVRefs = Sets.newHashSet();
to
private final Map<String, DeleteFiles>

It shouldn't duplicate the objects if we add to both of these simultaneously, so just 2 times the references, one by spec and one by targetFileName

I think I am convinced we should fix this rather than throw an error

Contributor

I agree with Russell here. I think it would be better to keep track of the DVs in a different way.

Now that there may be duplicate DVs that get rewritten, we can no longer trust newDeleteFilesBySpec because it could contain DVs that will be replaced with compacted ones. I think what I would do is a bit simpler: I would keep a multimap of DeleteFile by referenced data file (location). Then in newDeleteFilesAsManifests, you would loop through each referenced data file and process the DeleteFile entries. If there is more than one then you'd create a new DV and write its metadata to a new file and then write the new DV metadata to a manifest writer (by spec ID). If it's important to dedup before compacting, then you could keep a single DeleteFileSet and clear it before processing each referenced file's deletes.

I think that's slightly more reliable in a couple ways. For instance, it would catch DVs with different (but possibly equivalent) specs and it would also catch cases where you have a v2 delete file and a DV, which should also be compacted. It's also simpler to keep state in just one place.
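A rough sketch of the suggested shape (the dvsByReferencedFile field name matches the later diff; writeMergedDV and the manifest-writer lookup are placeholders, not the PR's code):

```java
// Sketch: group added DVs by referenced data file and merge only the groups
// that contain more than one DV, then hand each result to the manifest writers.
for (Map.Entry<String, List<DeleteFile>> entry : dvsByReferencedFile.entrySet()) {
  List<DeleteFile> dvs = entry.getValue();
  DeleteFile dv = dvs.size() == 1 ? dvs.get(0) : writeMergedDV(entry.getKey(), dvs);
  deleteManifestWriter(dv.specId()).add(dv); // writer-by-spec lookup is assumed
}
```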

Contributor Author

Fair points, I can just change newDVRefs to a multimap. Are we sure we want to just leave deleteFilesBySpec as is? I guess it's not much of a memory concern since, as @RussellSpitzer said, it's just references, but are we worried about the state of that map having potential duplicates?

Pair<List<PositionDelete<?>>, DeleteFile> deletesA =
deleteFile(tab, dataFileA, new Object[] {"aa"}, new Object[] {"a"});
Pair<List<PositionDelete<?>>, DeleteFile> deletesB =
deleteFile(tab, dataFileA, new Object[] {"bb"}, new Object[] {"b"});
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 12, 2026

This fix surfaced an issue in some of the TestPositionDeletesTable tests where we were setting the wrong data file for a delete file; we'd just add a DV for the same data file, and then it'd get merged with the new logic and break some of the later assertions.

// Add Data Files with EQ and POS deletes
DeleteFile fileADeletes = fileADeletes();
DeleteFile fileA2Deletes = fileA2Deletes();
DeleteFile fileBDeletes = fileBDeletes();
Contributor Author

This test had to be fixed after the recent changes because the file paths for data file B and B2 were previously set to the same path, so the DVs for both referenced the same file (which probably wasn't the intention of these tests) and counted as a duplicate. After this change we'd merge the DVs in the commit, and then it'd actually get treated as a dangling delete and fail some of the assertions.

Since these tests just cover the eq. delete case, we can simplify by removing the usage of the fileB deletes; it's a more minimal test that checks the same thing.

Also note: generally I'd take this in a separate PR, but I think there's a good argument that this change should go into a 1.10.2 patch release to prevent invalid table states; in that case we'd need to keep these changes together.

@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review January 12, 2026 17:15
@amogh-jahagirdar amogh-jahagirdar changed the title Core: Merge DVs referencing the same data files as a safeguard Core: Track duplicate DVs for data file and merge them before committing Jan 12, 2026
Contributor

@nastra nastra left a comment

This LGTM overall, thanks @amogh-jahagirdar, just left a few smaller comments

@github-actions github-actions bot added the data label Jan 16, 2026
dv.referencedDataFile());
}

private static byte[] readBytes(InputFile inputFile, long offset, int length) {
Contributor Author

This is moved to an IOUtil API, which we now use in the DV merging logic.

if (newSnapshotIds.contains(entry.snapshotId()) && ContentFileUtil.isDV(file)) {
ValidationException.check(
!newDVRefs.contains(file.referencedDataFile()),
!dvsByReferencedFile.containsKey(file.referencedDataFile()),
Contributor

I think this change is correct, but I want to note that in the future we could avoid failing by merging DVs as long as that is allowed by the operation being committed.

Contributor Author

Yeah, I had an old PR out for this: https://github.com/apache/iceberg/pull/11693/files#diff-410ff1b47d9a44a2fd5dbd103cad9463d82c8f4f51aa1be63b8b403123ab6e0e (probably a bad PR title, since by definition, if the positions are disjoint for the operation, it's not conflicting).

newDeleteFilesBySpec.forEach(
(specId, deleteFiles) -> {
PartitionSpec spec = ops().current().spec(specId);
deleteFiles.forEach(file -> addedFilesSummary.addedFile(spec, file));
Contributor

I don't like that the DeleteFile instances are still kept in two places, and that both places will deduplicate using sets but different logic once you account for the map keys. I know that because we are not deduplicating v2 deletes, we need a place for them to be tracked, but that doesn't mean we need to store the DeleteFile instances for DVs twice.

The reason why I don't like the double storage is that it doesn't handle some strange cases. For instance, what if a DeleteFile is added for the same DV but with two different (possibly equivalent, unpartitioned) specs? Then the same file would be handled twice here, but would be deduplicated by the DeleteFileSet for its referenced data file causing a metrics bug. While the merge code checks that duplicates have the same spec ID, the DeleteFileSet does not.

I think it would be cleaner to keep a list of v2 deletes and the multimap of DVs and maintain them separately. This method should produce a new list of merged DVs, then both lists (v2 deletes and merged DVs) should be written to delete manifests by spec. It's easy enough to produce a filtered iterator, so I don't think we are buying much by grouping by spec ID as files are added.
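A fragment sketching the caller-side structure being suggested, assuming a hypothetical mergeDVs() helper that returns the post-merge DVs and the java.util.stream imports (names are illustrative, not the PR's code):

```java
// Sketch: combine untouched v2 deletes with the post-merge DVs, group by spec,
// and write one delete manifest per spec without mutating any tracking state.
List<DeleteFile> mergedDVs = mergeDVs(dvsByReferencedFile);
Map<Integer, List<DeleteFile>> deleteFilesBySpec =
    Stream.concat(positionAndEqualityDeletes.stream(), mergedDVs.stream())
        .collect(Collectors.groupingBy(DeleteFile::specId));
```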


// Update newDeleteFilesBySpec to remove all the duplicates
mergedDVs.forEach(
mergedDV -> newDeleteFilesBySpec.get(mergedDV.specId).removeAll(mergedDV.duplicateDVs));
Contributor

I recommend avoiding methods like this that modify the instance state of snapshot producer. That should only happen in cases where we need to cache results, and cached results should be held in separate instance state. For example, this is how newDataFilesBySpec and cachedNewDataManifests work. The new data files are kept organized by spec, but that state is not rewritten. Instead, cachedNewDataManifests is used in case of retry.

Modifying the instance state like this hides how the class was configured by the caller after a merge happens, and relies on side-effects that are not obvious to people working with the code later.

I think this should be refactored so that as many of these new methods as possible are static and in a DVUtil class rather than inline in this already huge class. Merging DVs is independently useful and shouldn't live here. This method may also be a good candidate for being a utility if you refactor it to pass a threadpool to Tasks, but it may make more sense to leave it as an instance method if it needs to update a merged DV cache.

Contributor

I'm also not a fan of updating instance state in multiple places. This removes DVs here, but adds them in writeMergedDVs


// Merge duplicates, internally takes care of updating newDeleteFilesBySpec to remove
// duplicates and add the newly merged DV
private void mergeDVsAndWrite() {
Contributor

I think this method should return a List<DeleteFile> that represents all of the DVs after merging has been done. It should probably also use a cache to ensure that multiple calls produce the same list and reuse as much work as possible, although this could be done in a follow up and I could be convinced to cache at a different level (like delete manifests).

Then the caller should take the list produced here, combine it with the v2 DeleteFile list, group by spec, and write delete manifests. That would require no side-effects other than caching.

dv.specId());

Preconditions.checkArgument(
Objects.equals(dv.partition(), firstDV.partition()),
Contributor

I don't think there is a guarantee about the underlying type of partition (it should be StructLike, I think). If that's the case, then Objects.equals is not going to work all of the time. If you want to compare partitions then you'd need a struct comparator for the struct produced by the partition spec. Honestly, I think that's a bit overkill if you know that the referenced data file matched though.
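If the check were kept, a hand-rolled field-by-field comparison (rather than relying on the StructLike implementation's equals) might look like this sketch; samePartition is a hypothetical helper, not part of the PR:

```java
// Sketch: compare two partition tuples field by field via StructLike.get, so the
// result does not depend on whether the StructLike implementation overrides equals().
private static boolean samePartition(PartitionSpec spec, DeleteFile a, DeleteFile b) {
  int fields = spec.partitionType().fields().size();
  for (int pos = 0; pos < fields; pos++) {
    Object left = a.partition().get(pos, Object.class);
    Object right = b.partition().get(pos, Object.class);
    if (!java.util.Objects.equals(left, right)) {
      return false;
    }
  }
  return true;
}
```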


// Data class for referenced file, the duplicate DVs, the merged position delete index,
// partition spec and tuple
private static class MergedDVContent {
Contributor

This seems a bit heavy to me. I'd combine this with whatever cache mechanism you end up using.

Contributor Author

I was able to remove this, but I think I'll save caching for a follow-up if that's OK? I mainly want to make sure this change is correct.

@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.11.0 milestone Jan 25, 2026
@amogh-jahagirdar amogh-jahagirdar force-pushed the always-merge-duplicates-on-driver branch from cbfc878 to e75bb03 Compare January 26, 2026 00:02
@amogh-jahagirdar amogh-jahagirdar force-pushed the always-merge-duplicates-on-driver branch from 78b772c to 6bccc52 Compare January 26, 2026 00:21
if (ContentFileUtil.isDV(file)) {
newDVRefs.add(file.referencedDataFile());
}
hasNewDeleteFiles = true;
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 27, 2026

Since we're not tracking by DeleteFileSet at the time of adding, we treat every addition as a new delete, even potential duplicates (unless we want to do a look back in the list on every addDeleteFile, but I'm very against that since it's an O(deletes-added^2) operation effectively at that point for a commit).

If we look at how hasNewDeleteFiles is actually used, I don't think this is really consequential. When hasNewDeleteFiles is true and there's cached state, we use the flag as an indication that the cache is stale and should be cleared out and the files cleaned up. Even if there are duplicates, there's at least one file which is new.

We end up merging/deduping the DVs (and the v2 position deletes and equality deletes) anyway just before producing new manifests. See my comment below.


Comment on lines +92 to +93
private final List<DeleteFile> positionAndEqualityDeletes = Lists.newArrayList();
private final Map<String, List<DeleteFile>> dvsByReferencedFile = Maps.newLinkedHashMap();
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 27, 2026

@rdblue These are 2 disjoint fields, one for a list of v2 deletes and a multimap for DVs.
The map is a LinkedHashMap because we have a bunch of tests with expectations on the exact order of entries in a manifest. The previous change didn't require anything special because we worked with deleteFilesBySpec, which inherently preserved the order.

I personally think our tests should probably get away from expecting a certain order in manifests and just assert the contents (or at least have validate methods that express whether or not to be strict about ordering). As we get into V4, maybe we'll make implementation choices that order entries in a certain way, but in the current state of things it was kind of a hindrance to making changes here.

I didn't make the test change since it's fairly large and could be distracting from this change, and I figured the LinkedHashMap has negligible overhead, so we can just preserve the existing behavior.

Map<Integer, List<DeleteFile>> newDeleteFilesBySpec =
Streams.stream(
Iterables.concat(
mergedDVs, validDVs, DeleteFileSet.of(positionAndEqualityDeletes)))
Contributor Author

@rdblue let me know how you feel about the DeleteFileSet.of(positionAndEqualityDeletes).
I know we were kind of against de-duping, but I think the fact that the two fields are disjoint now avoids the partition spec case you mentioned. I'm a bit worried that not deduping before producing the manifests is a regression compared to the previous behavior. And there's a good argument that if we can do it correctly and relatively cheaply, it's better to do it to avoid any bad metadata (similar to why we do it for data files).

The summary stats are anyway produced from this "final" deleteFilesBySpec, which should all be correct, so I think we're covered in general.

Comment on lines +97 to +99
Preconditions.checkArgument(
Objects.equals(dv.partition(), firstDV.partition()),
"Cannot merge duplicate added DVs when partition tuples are different");
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 27, 2026

@rdblue let me know if you feel strongly about this check. While it is StructLike and doesn't guarantee an equals implementation, the way I look at it is the following:

  1. Generally it'll be PartitionData, which does do a field-by-field type and value comparison.
  2. Even if the implementation changes out from under us and ends up being another StructLike that doesn't override equals, then it falls back to reference equality, which in the worst case is a false mismatch that just fails the commit.

Another rationale behind these checks is that if a writer produces duplicate DVs, there's also a chance of some kind of metadata record reuse issue from the writer and this felt like an easy sanity check.

Alternatively, we could just simplify this and remove these validations by assuming that the duplicate DVs are OK by every other dimension.
