diff --git a/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio
new file mode 100644
index 0000000..4025407
--- /dev/null
+++ b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio
@@ -0,0 +1 @@
+7Zxdj6M2FIZ/DZeNMDaGXCaZ6famVaWp1MsVA05CS3AKziSzv74mGAIcz0d2AdOkGWkEBzDwvIeDXxxi4dXu9CUL9ttfecQSy7Gjk4UfLMeZEyr/F4HXMkA9pwxssjgqQ+gSeIq/MRW0VfQQRyxvrSg4T0S8bwdDnqYsFK1YkGX82F5tzZP2XvfBhoHAUxgkMPpnHIltGfVd+xL/hcWbbbVnZKslu6BaWQXybRDxYyOEHy28yjgX5dTutGJJwa7iUm738xtL6wPLWCo+s8EfX7/6L/HL7vnoL/L1bxz/9c/jT8Qtm3kJkoM6Y3W04rVCkPFDGrGiFWTh5XEbC/a0D8Ji6VFqLmNbsUvU4ijIt/W6qm2WCXZ686hRzULmEOM7JrJXuYragFSkVf44vpo/XtRwqIptG0rU3AOVAZu67QskOaE4XcEMAWToPWj2x9DWcZKseMKz87Z47RZ/Mp6LjP/NGkvo+VNswVPRiJeffog7uE0c25C4DjgeircDeN8S7hrvVHDjwdN7zWgY6nhH3vzZtgdKY9cwVzJ0Go+CFaSraazwBtZ7uq6dN7jSZ+rSYdKVmC4DFHDF/0Gs3XQ1jtUbOl0Zilzm6bjOqYeDodLVdBnwB07XcbCCdDWNdQ6wfpEU99aja8mO+HIFGMtzF22QQRJvUjkdSgZMMlsWhGLpsBZqwS6OomLzZcby+FvwfG6qUGjP41Scz8hdWu5D0dZB8Lz0iAhokfKUdYSrQrL79qSOEfUgk+OilkyyrACZvDFlqtxyQ6fgTqWpHG0lDfWBNHhUaaBdfL5PabA9NWmgs7QcmhRKSInoppiYzWZVTO6jDveqoO5O1L5X9XFh0M6tZW741oKgIbJ7xdrThdG6CtS82qanuz7y2tJgGwNpHI00wz3h0piq/6Up7S4xLA30ZdA/3KU0xKaGpYHeDnqQ+5TG+FUD7aHyMZ61JNbCKQyNdDNzd4qCDd81Q91eczXfEMwfVTBoPIEyV41LTeqZP6FeG7emL4aIhnclS/9jLNBA5iIQ7B3on3iG0gcqr4vKn3kwOXWwHG82GC5o6oyTctEkSUGPtQjDw+6QBIJnU+zw9yAFRbglBbYR0GE+Zjl14KBev+V0lLGnbtnExJ7BBB+5cEIXO83Cicl85pgvB1d938RM4ZwIKWj0GoVzinZ8iMLpIs0VPm7phK6u59I5xjgoKJ1zx3zphKZsmqWT2GgKBeEqS2SmdE6DFIZmplE6IbWbLJ3ExqZLJ4Yu6Tyccov4XbdTYDVPI3WDW8PBh8brZuF7Tvt5Y03VGHzot24Wfjfzia197jAuf+jK7oY/dql5/n2/VmDk21jdTjPRPBUft8uMoSmcaJfZxRN4+Iiv8m6GuszTIAW9WKPLPMVhzB6kAF1mF/bZxu0wQ4t3N7ctQrXWcdTbFnnXOJ4Hi5eWTyYoyAhjxG5VPN8ZI67ffhxHrk+MwuXbYF9MrhN2WhTvg0oyLI3U5EOYBHkeh7oqzyLwZuiHqBooXA2JKpYxmVDxS7t5HR61h9+LnLgogf223an7KVUTOT9kIVNbXSCDhgj5oCERZBsmQENnterT/gEBoUe9SwExpXXt+1ENdW0NLeMnRhfvQUaCSLeD9t1XImxqaBGhZb5LETFy+hJR09R3iyhnLz8VUK5++b0F/Pgv
\ No newline at end of file
diff --git a/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio.png b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio.png
new file mode 100644
index 0000000..9d060e2
Binary files /dev/null and b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-hashtable.drawio.png differ
diff --git a/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio
new file mode 100644
index 0000000..50db4ea
--- /dev/null
+++ b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio
@@ -0,0 +1 @@
+7Zpdb5swFIZ/jS87gY35uAwJbW8qbcqkSbsj4CRoBGfGadL++tnBJAGTZVtDXK2kUmUO+ADvY+xzDgA0Xu0eWLxePtGU5ABa6Q6gCYAwcFzxXxpeKoPrwcqwYFlameyjYZq9EmW0lHWTpaRsHMgpzXm2bhoTWhQk4Q1bzBjdNg+b07x51nW8IJphmsS5bv2WpXxZWX1sHe2PJFss6zPbltqziuuDlaFcxindnphQBNCYUcqr1mo3JrnUrtal6nd/Zu/hwhgp+J90GK3Y09cn195aj1/Y/fT7/O6nf6e8PMf5Rt2wulj+UivA6KZIiXRiARRulxkn03WcyL1bgVzYlnyViy1bNOdZno9pTtm+L5pj+SfsJWf0BznZ4+5/sgct+Im9+gm7ujDCONmdvWP7oKMYf4SuCGcv4hDVASIlvRp7qGazPZK064G2PKFY94vV4FkcXB/1FQ0l8V/IjXqWe07cJOmSO/WCmWX1JCs2LSvuW9Y5PCOrO3Ox24+sjvHR6vUrK7FTTLwuWQPXQ3FfshofrYEua+SDMAIBAhEGPgYjDCIPBCPgOyByQOgD3wKRC/wAhIFsBBiElWUCRpFGRajFm9LHebYoRDsRqhGhcig1zcQyN1I7Vlmayu4hI2X2Gs/2riTTNc0KvpcAhwBPpK8Np2W1UNsavYIWpIW6NomJfqqu0b4KWLcBFmJXB2t1gIV9gbX16V1/Yt4BmwYIta36XOmhs732yos0NvCmaJwBzRk02DGMRl+9rQFNFQHoM9pt0bgaGntAU0URpp8aPTh7EMHYWgYOoQNGUIYS4VhGCu8QWP/Rge3gi9GBf0tgtePfRdPlMl7L5jwnu5EsXghhSJGq5iTJ47LMkia7yglJtTLGRaVOlMAdStQ2RvKYZ89N913yqDN8lkPiCAL5rSfHdj7hppOSblhCVL/TCkbLleNcdMVjtiBcc7Undrj1N0CElyG+JSUSCZGfOl0pkQ9nyO0pJcLGM02oR87DQqPYGF5oYEdxJQqA78ncU2agIm+tFhuRrt6/Q2hXgIK8oAEFeTqUrim0Pyh6YHbVeeg2FS/cWqPN12fhf1Hy0nQ1XvKC/v+wbrZ1NV+hhR21xKvOA0bf32h6Gx/HqCOIH2q3/0AW2w2y5mu3SH8T+kGrUI7fXpUN126RnnR90ORAQ2O6CoWGvO0MGuO1WzS88TiHxvhTo6fUQ+32FJjbrIEYr92i3tPtW3y30w6nzZf9kJ5uD3OUYtPbHCU2jx8dVoXx45ebKPoF
\ No newline at end of file
diff --git a/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio.png b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio.png
new file mode 100644
index 0000000..90a91e0
Binary files /dev/null and b/content/blog/2024-08-23-datafusion-grouped-aggregations/datafusion-aggregation-reorder-accumulator-input.drawio.png differ
diff --git a/content/blog/2024-08-23-datafusion-grouped-aggregations/index.md b/content/blog/2024-08-23-datafusion-grouped-aggregations/index.md
index c7fd6a4..85e0a11 100644
--- a/content/blog/2024-08-23-datafusion-grouped-aggregations/index.md
+++ b/content/blog/2024-08-23-datafusion-grouped-aggregations/index.md
@@ -1,7 +1,6 @@
+++
title = "Two-Phase Parallel Hash Grouped Aggregation in DataFusion"
date = 2024-08-23
-draft = true
+++
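Grouped aggregation is a core feature of any analytics engine: it produces understandable summaries over massive amounts of data. The DataFusion engine uses an advanced two-phase parallel hash grouped aggregation technique that executes in a highly parallel and vectorized manner.
@@ -56,7 +55,17 @@ DataFusion supports several aggregation strategies and picks the best one for each situation.
5. If memory is sufficient, finally compute the final aggregation results from the entire in-memory hash table and output them to the next operator
### The In-Memory Hash Table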
-TODO
+![](./datafusion-aggregation-hashtable.drawio.png)
+
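+Logically, the structure above forms a single hash table, but physically the mapping from group to aggregate state is not stored directly in a hash table. The hash table actually maintains a mapping from group values to group indexes and is responsible for assigning those indexes. A separate data structure, the `Accumulator group`, maintains the aggregate state of every group, and each group index corresponds to one Accumulator in it.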
+
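+When a batch of data arrives, the hash table first computes the group index of every row in the batch (either an existing index or a newly assigned one). The batch and its group indexes are then sent to the `Accumulator group`, which updates the aggregate state.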
+
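The split between the hash table and the accumulators can be sketched in plain Rust. This is a minimal illustration using only the standard library, not DataFusion's actual code: `GroupTable`, `SumAccumulators`, `intern`, and `update_batch` are hypothetical names, group keys are assumed to be strings, and the only aggregate is an integer sum.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the hash table: it maps a group value to a
/// dense group index and never stores aggregate state itself.
struct GroupTable {
    index_of: HashMap<String, usize>,
}

impl GroupTable {
    fn new() -> Self {
        Self { index_of: HashMap::new() }
    }

    /// Returns one group index per input row, assigning a fresh index
    /// the first time a group value is seen.
    fn intern(&mut self, keys: &[&str]) -> Vec<usize> {
        keys.iter()
            .map(|k| {
                let next = self.index_of.len();
                *self.index_of.entry(k.to_string()).or_insert(next)
            })
            .collect()
    }

    /// Number of distinct groups seen so far.
    fn len(&self) -> usize {
        self.index_of.len()
    }
}

/// Hypothetical stand-in for the accumulator group: the state of every
/// group lives in one contiguous Vec, addressed by group index.
struct SumAccumulators {
    sums: Vec<i64>,
}

impl SumAccumulators {
    fn new() -> Self {
        Self { sums: Vec::new() }
    }

    /// Updates the per-group sums from one batch of values and the
    /// group index of each row.
    fn update_batch(&mut self, values: &[i64], group_indexes: &[usize], total_groups: usize) {
        // Grow the state so that newly assigned groups start at zero.
        if self.sums.len() < total_groups {
            self.sums.resize(total_groups, 0);
        }
        for (&v, &g) in values.iter().zip(group_indexes) {
            self.sums[g] += v;
        }
    }
}
```

For each batch, a caller would first call `intern` on the key column and then pass the value column, the returned indexes, and `len()` to `update_batch`, so the hash table is touched once per row while all state updates go through the dense `Vec`.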
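+Before updating, the `Accumulator group` uses Arrow compute kernels to reorder the data efficiently, so that the loop that updates the aggregate state can be vectorized well by the compiler (SIMD acceleration).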
+
+![](./datafusion-aggregation-reorder-accumulator-input.drawio.png)
+
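+Storing state in contiguous memory such as a `Vec`, minimizing memory allocations, and specializing on types wherever possible maximizes the efficiency of the aggregate computation.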
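One way to read the reordering step is sketched below, again with the standard library only. The counting sort plus gather stands in for the Arrow compute kernel mentioned above, and `reorder_by_group` and `update_sums` are hypothetical names. The point is that once the rows of a group are contiguous, the per-group update becomes a tight loop over a slice, which is the shape compilers are usually able to auto-vectorize (whether they do is not guaranteed).

```rust
/// Reorders `values` so that rows belonging to the same group are contiguous.
/// Returns the gathered values and an offsets vector in which
/// `offsets[g]..offsets[g + 1]` is the range occupied by group `g`.
fn reorder_by_group(
    values: &[i64],
    group_indexes: &[usize],
    total_groups: usize,
) -> (Vec<i64>, Vec<usize>) {
    // Counting sort, step 1: count the rows of each group.
    let mut offsets = vec![0usize; total_groups + 1];
    for &g in group_indexes {
        offsets[g + 1] += 1;
    }
    // Step 2: prefix sums turn the counts into start offsets.
    for g in 0..total_groups {
        offsets[g + 1] += offsets[g];
    }
    // Step 3: gather each value into the next free slot of its group.
    let mut cursor = offsets.clone();
    let mut reordered = vec![0i64; values.len()];
    for (&v, &g) in values.iter().zip(group_indexes) {
        reordered[cursor[g]] = v;
        cursor[g] += 1;
    }
    (reordered, offsets)
}

/// Updates per-group sums by summing one contiguous slice per group.
fn update_sums(sums: &mut Vec<i64>, values: &[i64], group_indexes: &[usize], total_groups: usize) {
    if sums.len() < total_groups {
        sums.resize(total_groups, 0);
    }
    let (reordered, offsets) = reorder_by_group(values, group_indexes, total_groups);
    for g in 0..total_groups {
        // Contiguous slice of a primitive type: a simple reduction loop.
        let slice = &reordered[offsets[g]..offsets[g + 1]];
        sums[g] += slice.iter().sum::<i64>();
    }
}
```

The trade-off in this sketch is an extra pass and a temporary buffer per batch in exchange for sequential memory access during the state update.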
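### Exploiting the Sort Order of the Input
DataFusion exploits any (partial or full) ordering of the aggregation operator's input on the group keys to speed up aggregation.
@@ -88,11 +97,5 @@ Spill is a time-consuming operation that involves disk IO and sorting, and for high
The reason is that when the input is sorted, the ordering can be used to emit groups whose aggregation is already complete ahead of time, so a high-cardinality aggregation no longer requires keeping a huge hash table in memory.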
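-Open questions:
-1. How is vectorized execution achieved?
-2. The internal hash table structure
-4. With full group ordering, if a spill to disk happens before a new group value appears, how does the subsequent group_ordering.emit_to() guarantee correctness?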
-
-
References
1. [Aggregating Millions of Groups Fast in Apache Arrow DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)
\ No newline at end of file