Update blog
lewiszlw committed Sep 7, 2024
1 parent 0900ab5 commit 34c5bbf
Showing 5 changed files with 13 additions and 8 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<mxfile host="Electron" modified="2024-09-07T17:02:36.843Z" agent="5.0 (Macintosh; Intel Mac OS X 13_0_0) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/14.6.13 Chrome/89.0.4389.128 Electron/12.0.7 Safari/537.36" etag="ufZ3MhtE_dM-D5_-2XNb" version="14.6.13" type="device"><diagram id="_QMhFm_OUff-wKAJVXwj" name="Page-1">7Zxdj6M2FIZ/DZeNMDaGXCaZ6famVaWp1MsVA05CS3AKziSzv74mGAIcz0d2AdOkGWkEBzDwvIeDXxxi4dXu9CUL9ttfecQSy7Gjk4UfLMeZEyr/F4HXMkA9pwxssjgqQ+gSeIq/MRW0VfQQRyxvrSg4T0S8bwdDnqYsFK1YkGX82F5tzZP2XvfBhoHAUxgkMPpnHIltGfVd+xL/hcWbbbVnZKslu6BaWQXybRDxYyOEHy28yjgX5dTutGJJwa7iUm738xtL6wPLWCo+s8EfX7/6L/HL7vnoL/L1bxz/9c/jT8Qtm3kJkoM6Y3W04rVCkPFDGrGiFWTh5XEbC/a0D8Ji6VFqLmNbsUvU4ijIt/W6qm2WCXZ686hRzULmEOM7JrJXuYragFSkVf44vpo/XtRwqIptG0rU3AOVAZu67QskOaE4XcEMAWToPWj2x9DWcZKseMKz87Z47RZ/Mp6LjP/NGkvo+VNswVPRiJeffog7uE0c25C4DjgeircDeN8S7hrvVHDjwdN7zWgY6nhH3vzZtgdKY9cwVzJ0Go+CFaSraazwBtZ7uq6dN7jSZ+rSYdKVmC4DFHDF/0Gs3XQ1jtUbOl0Zilzm6bjOqYeDodLVdBnwB07XcbCCdDWNdQ6wfpEU99aja8mO+HIFGMtzF22QQRJvUjkdSgZMMlsWhGLpsBZqwS6OomLzZcby+FvwfG6qUGjP41Scz8hdWu5D0dZB8Lz0iAhokfKUdYSrQrL79qSOEfUgk+OilkyyrACZvDFlqtxyQ6fgTqWpHG0lDfWBNHhUaaBdfL5PabA9NWmgs7QcmhRKSInoppiYzWZVTO6jDveqoO5O1L5X9XFh0M6tZW741oKgIbJ7xdrThdG6CtS82qanuz7y2tJgGwNpHI00wz3h0piq/6Up7S4xLA30ZdA/3KU0xKaGpYHeDnqQ+5TG+FUD7aHyMZ61JNbCKQyNdDNzd4qCDd81Q91eczXfEMwfVTBoPIEyV41LTeqZP6FeG7emL4aIhnclS/9jLNBA5iIQ7B3on3iG0gcqr4vKn3kwOXWwHG82GC5o6oyTctEkSUGPtQjDw+6QBIJnU+zw9yAFRbglBbYR0GE+Zjl14KBev+V0lLGnbtnExJ7BBB+5cEIXO83Cicl85pgvB1d938RM4ZwIKWj0GoVzinZ8iMLpIs0VPm7phK6u59I5xjgoKJ1zx3zphKZsmqWT2GgKBeEqS2SmdE6DFIZmplE6IbWbLJ3ExqZLJ4Yu6Tyccov4XbdTYDVPI3WDW8PBh8brZuF7Tvt5Y03VGHzot24Wfjfzia197jAuf+jK7oY/dql5/n2/VmDk21jdTjPRPBUft8uMoSmcaJfZxRN4+Iiv8m6GuszTIAW9WKPLPMVhzB6kAF1mF/bZxu0wQ4t3N7ctQrXWcdTbFnnXOJ4Hi5eWTyYoyAhjxG5VPN8ZI67ffhxHrk+MwuXbYF9MrhN2WhTvg0oyLI3U5EOYBHkeh7oqzyLwZuiHqBooXA2JKpYxmVDxS7t5HR61h9+LnLgogf223an7KVUTOT9kIVNbXSCDhgj5oCERZBsmQENnterT/gEBoUe9SwExpXXt+1ENdW0NLeMnRhfvQUaCSLeD9t1XImxqaBGhZb5LETFy+hJR09R3iyhnLz8VUK5++b0F/Pgv</diagram>
</mxfile>
@@ -0,0 +1 @@
<mxfile host="Electron" modified="2024-09-07T17:49:33.407Z" agent="5.0 (Macintosh; Intel Mac OS X 13_0_0) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/14.6.13 Chrome/89.0.4389.128 Electron/12.0.7 Safari/537.36" etag="r4KHB74-zJjl5KIARP0-" version="14.6.13" type="device"><diagram id="GERB1R0UqWKGMOunpS6h" name="Page-1">7Zpdb5swFIZ/jS87gY35uAwJbW8qbcqkSbsj4CRoBGfGadL++tnBJAGTZVtDXK2kUmUO+ADvY+xzDgA0Xu0eWLxePtGU5ABa6Q6gCYAwcFzxXxpeKoPrwcqwYFlameyjYZq9EmW0lHWTpaRsHMgpzXm2bhoTWhQk4Q1bzBjdNg+b07x51nW8IJphmsS5bv2WpXxZWX1sHe2PJFss6zPbltqziuuDlaFcxindnphQBNCYUcqr1mo3JrnUrtal6nd/Zu/hwhgp+J90GK3Y09cn195aj1/Y/fT7/O6nf6e8PMf5Rt2wulj+UivA6KZIiXRiARRulxkn03WcyL1bgVzYlnyViy1bNOdZno9pTtm+L5pj+SfsJWf0BznZ4+5/sgct+Im9+gm7ujDCONmdvWP7oKMYf4SuCGcv4hDVASIlvRp7qGazPZK064G2PKFY94vV4FkcXB/1FQ0l8V/IjXqWe07cJOmSO/WCmWX1JCs2LSvuW9Y5PCOrO3Ox24+sjvHR6vUrK7FTTLwuWQPXQ3FfshofrYEua+SDMAIBAhEGPgYjDCIPBCPgOyByQOgD3wKRC/wAhIFsBBiElWUCRpFGRajFm9LHebYoRDsRqhGhcig1zcQyN1I7Vlmayu4hI2X2Gs/2riTTNc0KvpcAhwBPpK8Np2W1UNsavYIWpIW6NomJfqqu0b4KWLcBFmJXB2t1gIV9gbX16V1/Yt4BmwYIta36XOmhs732yos0NvCmaJwBzRk02DGMRl+9rQFNFQHoM9pt0bgaGntAU0URpp8aPTh7EMHYWgYOoQNGUIYS4VhGCu8QWP/Rge3gi9GBf0tgtePfRdPlMl7L5jwnu5EsXghhSJGq5iTJ47LMkia7yglJtTLGRaVOlMAdStQ2RvKYZ89N913yqDN8lkPiCAL5rSfHdj7hppOSblhCVL/TCkbLleNcdMVjtiBcc7Undrj1N0CElyG+JSUSCZGfOl0pkQ9nyO0pJcLGM02oR87DQqPYGF5oYEdxJQqA78ncU2agIm+tFhuRrt6/Q2hXgIK8oAEFeTqUrim0Pyh6YHbVeeg2FS/cWqPN12fhf1Hy0nQ1XvKC/v+wbrZ1NV+hhR21xKvOA0bf32h6Gx/HqCOIH2q3/0AW2w2y5mu3SH8T+kGrUI7fXpUN126RnnR90ORAQ2O6CoWGvO0MGuO1WzS88TiHxvhTo6fUQ+32FJjbrIEYr92i3tPtW3y30w6nzZf9kJ5uD3OUYtPbHCU2jx8dVoXx45ebKPoF</diagram></mxfile>
19 changes: 11 additions & 8 deletions content/blog/2024-08-23-datafusion-grouped-aggregations/index.md
@@ -1,7 +1,6 @@
+++
title = "DataFusion Two-Phase Parallel Hash Grouped Aggregation"
date = 2024-08-23
draft = true
+++

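Grouped aggregation is a core capability of any analytical engine: it turns massive amounts of data into summaries that people can understand. The DataFusion analytical engine uses an advanced two-phase parallel hash grouped aggregation technique that is highly parallel and executes in a vectorized fashion.
@@ -56,7 +55,17 @@ DataFusion supports several aggregation strategies and picks the best one for each situation.
5. If memory is sufficient, the final aggregate results are computed from the entire in-memory hash table at the end and output to the next operator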

### The in-memory hash table
TODO
![](./datafusion-aggregation-hashtable.drawio.png)

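Logically, the above forms a hash table, but physically the mapping from group to aggregate state is not stored directly in a hash table. The hash table actually maintains a mapping from group values to group indexes and is responsible for assigning group indexes, while a separate data structure, the `Accumulator group`, maintains the aggregate state of every group. Each group index corresponds to one Accumulator in it.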

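When a batch of data arrives, the hash table first computes the group index for every row in the batch (either an existing index or a newly assigned one). The batch and its group indexes are then sent to the `Accumulator group`, which updates the aggregate state.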
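The split between the hash table and the accumulators can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not DataFusion's actual code: `GroupIndexTable`, `SumAccumulators`, and their methods are hypothetical names, group keys are simplified to strings, and the only aggregate shown is SUM over `i64`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the hash table: it only maps a group value
/// to a dense group index and never stores aggregate state itself.
struct GroupIndexTable {
    index_of: HashMap<String, usize>,
}

impl GroupIndexTable {
    fn new() -> Self {
        Self { index_of: HashMap::new() }
    }

    /// Number of distinct groups seen so far.
    fn num_groups(&self) -> usize {
        self.index_of.len()
    }

    /// Returns one group index per input row, assigning a fresh index
    /// the first time a group value is seen.
    fn intern(&mut self, keys: &[&str]) -> Vec<usize> {
        let mut indices = Vec::with_capacity(keys.len());
        for &key in keys {
            let next = self.index_of.len();
            let idx = *self.index_of.entry(key.to_string()).or_insert(next);
            indices.push(idx);
        }
        indices
    }
}

/// Hypothetical "Accumulator group" for SUM: slot `i` holds the state of group `i`.
struct SumAccumulators {
    sums: Vec<i64>,
}

impl SumAccumulators {
    fn new() -> Self {
        Self { sums: Vec::new() }
    }

    /// Updates the state with one batch of values and their group indexes.
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], num_groups: usize) {
        // Grow the state so that newly assigned group indexes have a slot.
        if num_groups > self.sums.len() {
            self.sums.resize(num_groups, 0);
        }
        for (&value, &group) in values.iter().zip(group_indices) {
            self.sums[group] += value;
        }
    }
}
```

For each batch, the caller would first call `intern` on the key column and then pass the value column, the returned indexes, and `num_groups()` to `update_batch`. The hash table never sees the aggregate state.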

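Before updating, the `Accumulator group` uses Arrow compute kernels to reorder the data efficiently, so that the compiler can vectorize the aggregate state update well (SIMD acceleration).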

![](./datafusion-aggregation-reorder-accumulator-input.drawio.png)
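The reordering idea can be sketched without Arrow. The code below is a simplified stand-in, assuming a counting-sort style gather in place of an Arrow kernel such as `take`. The function names `reorder_by_group` and `sum_per_group` are hypothetical. After reordering, every group's rows form one contiguous slice, and summing a slice is a tight loop that a compiler can usually vectorize.

```rust
use std::ops::Range;

/// Hypothetical helper: returns the values reordered so that rows of the same
/// group are adjacent, plus the range each group occupies in the reordered data.
fn reorder_by_group(
    values: &[i64],
    group_indices: &[usize],
    num_groups: usize,
) -> (Vec<i64>, Vec<Range<usize>>) {
    assert_eq!(values.len(), group_indices.len());

    // Count the rows that belong to each group.
    let mut counts = vec![0usize; num_groups];
    for &group in group_indices {
        counts[group] += 1;
    }

    // Prefix sums give the start offset of each group in the output.
    let mut starts = Vec::with_capacity(num_groups);
    let mut offset = 0;
    for &count in &counts {
        starts.push(offset);
        offset += count;
    }

    // Scatter every value into the next free position of its group.
    let mut cursor = starts.clone();
    let mut reordered = vec![0i64; values.len()];
    for (&value, &group) in values.iter().zip(group_indices) {
        reordered[cursor[group]] = value;
        cursor[group] += 1;
    }

    let ranges: Vec<Range<usize>> = starts
        .iter()
        .zip(&counts)
        .map(|(&start, &count)| start..start + count)
        .collect();
    (reordered, ranges)
}

/// Sums each group's contiguous slice after reordering.
fn sum_per_group(values: &[i64], group_indices: &[usize], num_groups: usize) -> Vec<i64> {
    let (reordered, ranges) = reorder_by_group(values, group_indices, num_groups);
    ranges
        .into_iter()
        .map(|range| reordered[range].iter().sum::<i64>())
        .collect()
}
```

The reorder pays one extra pass over the batch and in return replaces scattered writes into the state with sequential reads over per-group slices.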

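Storing state in contiguous memory similar to `Vec`, minimizing memory allocations, and specializing on types wherever possible maximizes the efficiency of aggregate computation.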
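As a rough illustration of these three points, the sketch below keeps the state of an AVG aggregate in two flat vectors rather than one heap-allocated object per group. `AvgState` and its methods are hypothetical names chosen for this example. The generic `update` method is compiled separately for each input type, which is one simple form of type specialization.

```rust
/// Hypothetical AVG state: two flat vectors indexed by group index,
/// instead of one boxed accumulator object per group.
struct AvgState {
    sums: Vec<f64>,
    counts: Vec<u64>,
}

impl AvgState {
    /// Reserves space up front to reduce reallocations while groups are added.
    fn with_capacity(expected_groups: usize) -> Self {
        Self {
            sums: Vec::with_capacity(expected_groups),
            counts: Vec::with_capacity(expected_groups),
        }
    }

    /// Monomorphized per input type `T`, so the inner loop has no dynamic dispatch.
    fn update<T: Copy + Into<f64>>(
        &mut self,
        values: &[T],
        group_indices: &[usize],
        num_groups: usize,
    ) {
        if num_groups > self.sums.len() {
            self.sums.resize(num_groups, 0.0);
            self.counts.resize(num_groups, 0);
        }
        for (&value, &group) in values.iter().zip(group_indices) {
            let as_f64: f64 = value.into();
            self.sums[group] += as_f64;
            self.counts[group] += 1;
        }
    }

    /// Produces one average per group; assumes every group received at least one row.
    fn finish(&self) -> Vec<f64> {
        self.sums
            .iter()
            .zip(&self.counts)
            .map(|(&sum, &count)| sum / count as f64)
            .collect()
    }
}
```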

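### Exploiting the sort order of the input
DataFusion exploits the (partial or full) ordering of the aggregation operator's input on the group keys to speed up aggregation.
@@ -88,11 +97,5 @@ Spilling is a time-consuming operation that involves disk IO and sorting, and for high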

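The reason is that when the input is sorted, the ordering can be used to emit groups as soon as they are fully aggregated, so a high-cardinality aggregation no longer needs to keep a huge hash table in memory.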
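A minimal sketch of this early emission, assuming the input is fully sorted on a single string group key and the aggregate is SUM. `SortedSum` is a hypothetical type for illustration and does not model DataFusion's operator. Only the group currently being read is held in memory, and it is emitted as soon as a different key appears.

```rust
/// Hypothetical streaming SUM over input that is fully sorted on the group key.
struct SortedSum {
    /// The only group still open: its key and running sum.
    current: Option<(String, i64)>,
}

impl SortedSum {
    fn new() -> Self {
        Self { current: None }
    }

    /// Consumes one batch and returns the groups that are now complete.
    fn push_batch(&mut self, keys: &[&str], values: &[i64]) -> Vec<(String, i64)> {
        let mut finished = Vec::new();
        for (&key, &value) in keys.iter().zip(values) {
            let same_group = matches!(&self.current, Some((k, _)) if k.as_str() == key);
            if same_group {
                if let Some((_, sum)) = self.current.as_mut() {
                    *sum += value;
                }
            } else if let Some(done) = self.current.replace((key.to_string(), value)) {
                // A new key on sorted input means the previous group can never
                // receive more rows, so it is safe to emit it now.
                finished.push(done);
            }
        }
        finished
    }

    /// Emits the last open group once the input is exhausted.
    fn finish(&mut self) -> Option<(String, i64)> {
        self.current.take()
    }
}
```

Memory use here is bounded by one group regardless of how many distinct keys the input contains, which is why this case does not need to spill.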

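Questions:
1. How vectorized execution works
2. Internal hash table structure
4. With full group ordering, if a spill to disk happens before a new group value appears, how does the subsequent group_ordering.emit_to() guarantee correctness?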


References
1. [Aggregating Millions of Groups Fast in Apache Arrow DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)
