Merge branch 'tencentmusic:master' into master

tencentmusic · Mar 12, 2024 · 1e35daf
2 parents a24a373 + c906eb0
commit 1e35daf

Showing 242 changed files with 2,843 additions and 2,045 deletions.
diff --git a/README.md b/README.md
@@ -4,29 +4,36 @@
 
 # SuperSonic (超音数)
 
-**SuperSonic is the next-generation LLM-powered data analytics platform that integrates ChatBI and HeadlessBI**. SuperSonic provides a chat interface that empowers users to query data using natural language and visualize the results with suitable charts. To enable such experience, the only thing necessary is to build logical semantic models (definition of entities/metrics/dimensions/tags, along with their meaning, context and relationships) with semantic layer, and **no data modification or copying** is required. Meanwhile, SuperSonic is designed to be **highly extensible**, allowing custom functionalities to be added and configured with Java SPI.
+SuperSonic is the next-generation BI platform that integrates **Chat BI** (powered by LLM) and **Headless BI** (powered by semantic layer). This integration ensures that Chat BI has access to the same curated and governed semantic data models as traditional BI. Furthermore, the implementation of both paradigms benefits from the integration: 
+
+- Chat BI's Text2SQL capability gets enhanced with semantic data models.
+- Headless BI's query interface gets augmented with natural language support.
+
+<img src="./docs/images/supersonic_ideas.png" height="75%" width="75%" align="center"/>
+
+SuperSonic provides a chat interface that empowers users to query data using natural language and visualize the results with suitable charts. To enable such an experience, the only requirement is to build logical semantic models (definitions of metrics, dimensions, entities, and tags, along with their meanings and relationships) with the semantic layer; **no data modification or copying** is required. Meanwhile, SuperSonic is designed to be **highly extensible**, allowing custom functionality to be added and configured with Java SPI.
 
 <img src="./docs/images/supersonic_demo.gif" height="100%" width="100%" align="center"/>
 
 ## Motivation
 
-The emergence of Large Language Model (LLM) like ChatGPT is reshaping the way information is retrieved. In the field of data analytics, both academia and industry are primarily focused on leveraging LLM to convert natural language into SQL (so called Text2SQL or NL2SQL). While some approaches exhibit promising results, their **reliability** and **efficiency** are insufficient for real-world applications.  
+The emergence of Large Language Models (LLMs) like ChatGPT is reshaping the way information is retrieved, leading to a new paradigm in data analytics known as Chat BI. To implement Chat BI, both academia and industry focus primarily on harnessing LLMs to convert natural language into SQL, commonly referred to as Text2SQL or NL2SQL. While some approaches show promising results, their **reliability** falls short for large-scale real-world applications.
+
+Meanwhile, another emerging paradigm called Headless BI, which focuses on constructing unified semantic data models, has garnered significant attention. Headless BI is implemented through a universal semantic layer that exposes consistent data semantics via an open API.
+
+From our perspective, the integration of Chat BI and Headless BI has the potential to enhance the Text2SQL capability in two dimensions:
 
-From our perspective, the key to filling the real-world gap lies in three aspects: 
-1. Integrate ChatBI with HeadlessBI encapsulating underlying data context (joins, keys, formulas, etc) to **reduce complexity**. 
-   <img src="./docs/images/supersonic_ideas.png" height="65%" width="65%" align="center"/>
-2. Augment the LLM with schema mappers(as a kind of preprocessor) and semantic correctors(as a kind of postprocessor) to **mitigate hallucination**.
-3. Utilize rule-based schema parsers when necessary to **improve efficiency**(in terms of latency and cost).
+1. Incorporate data semantics (such as business terms and column values) into the prompt, enabling the LLM to better understand the semantics and **reduce hallucination**.
+2. Offload the generation of advanced SQL syntax (such as joins and formulas) from the LLM to the semantic layer to **reduce complexity**.
 
-With these ideas in mind, we develop SuperSonic as a practical reference implementation and use it to power our real-world products. Additionally, to facilitate further development of ChatBI, we decide to open source SuperSonic as an extensible framework.
+With these ideas in mind, we developed SuperSonic as a practical reference implementation and use it to power our real-world products. Additionally, to facilitate further development, we decided to open source SuperSonic as an extensible framework.
 
 ## Out-of-the-box Features
 
-- Built-in ChatBI interface for *business users* to enter natural language queries
-- Built-in HeadlessBI interface for *analytics engineers* to build semantic models
-- Built-in GUI for *system administrators* to manage chat agents and third-party plugins
+- Built-in Chat BI interface for *business users* to enter natural language queries
+- Built-in Headless BI interface for *analytics engineers* to build semantic data models
+- Built-in rule-based semantic parser to improve efficiency in certain scenarios
 - Support input auto-completion as well as query recommendation
-- Support multi-turn conversation and history context management 
 - Support four-level permission control: domain-level, model-level, column-level and row-level
 
 ## Extensible Components

diff --git a/README_CN.md b/README_CN.md
@@ -1,29 +1,35 @@
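 # SuperSonic (超音数)
 
-**SuperSonic integrates ChatBI and HeadlessBI to build a next-generation data analytics platform.** Through SuperSonic's chat interface, users can query data in natural language, and the system selects suitable charts to visualize the results. SuperSonic requires no data modification or copying; building logical semantic models (definitions of metrics/dimensions/entities, along with their business meaning and relationships) on top of the physical data models is enough to start the data Q&A experience. Meanwhile, SuperSonic is designed as a pluggable framework that uses the Java SPI mechanism to extend custom functionality.
+**SuperSonic integrates Chat BI (powered by LLM) and Headless BI (powered by a semantic layer) to build a next-generation BI platform.** This integration ensures that Chat BI can access the same uniformly governed semantic data models as traditional BI. In addition, both new BI paradigms benefit from it:
+
+- Chat BI's Text2SQL capability is enhanced by semantic data models.
+- Headless BI's query interface is extended with natural language support.
+
+<img src="./docs/images/supersonic_ideas.png" height="75%" width="75%" align="center"/>
+
+Through SuperSonic's chat interface, users can query data in natural language, and the system selects suitable charts to visualize the results. SuperSonic requires no data modification or copying; building logical semantic models (definitions of metrics/dimensions/entities/tags, along with their business meaning and relationships) on top of the physical data models is enough to start the data Q&A experience. Meanwhile, SuperSonic is designed as a pluggable framework that uses the Java SPI mechanism to extend custom functionality.
 
 <img src="./docs/images/supersonic_demo.gif" height="100%" width="100%" align="center"/>
 
 ## Motivation
 
-Large language models (LLMs) such as ChatGPT are reshaping the way information is retrieved. In data analytics, academia and industry focus mainly on using deep learning models to convert natural language queries into SQL queries. Although some work shows promising results, its reliability does not yet meet production requirements.
+Large language models (LLMs) such as ChatGPT are reshaping the way information is retrieved, leading to a new paradigm in data analytics known as Chat BI. To implement Chat BI, academia and industry focus mainly on using LLMs to convert natural language into SQL, commonly called Text2SQL or NL2SQL. Although some approaches show promising results, their reliability is still insufficient for large-scale real-world applications.
+
+Meanwhile, another emerging paradigm, known as Headless BI, focuses on building unified semantic data models and has attracted wide attention. Headless BI is implemented through a universal semantic layer that exposes consistent data semantics via an open API.
 
-In our view, three points are key to delivering value in real-world scenarios:
-1. Integrate HeadlessBI and encapsulate underlying data details (joins, keys, formulas, etc.) in a unified semantic layer to reduce the **complexity** of SQL generation.
+From our perspective, the integration of Chat BI and Headless BI has the potential to enhance the Text2SQL capability in two respects:
 
-   <img src="./docs/images/supersonic_ideas.png" height="65%" width="65%" align="center"/>
-2. Place a schema mapper before the LLM and a semantic corrector after it to mitigate the LLM's common **hallucination** problem.
-3. Design heuristic rules to improve the **efficiency** of semantic parsing in certain scenarios.
+1. Incorporate data semantics (such as business terms and column values) into the prompt so that the LLM understands the semantics better, to **reduce hallucination**.
+2. Offload the generation of advanced SQL syntax (such as joins and formulas) from the LLM to the semantic layer, to **reduce complexity**.
 
 To validate these ideas, we developed the SuperSonic project and applied it in real internal products. Meanwhile, we open-sourced SuperSonic as an extensible framework, hoping to promote further development in the field of conversational data Q&A.
 
 ## Out-of-the-box Features
 
-- Built-in ChatBI interface for *business users* to enter data queries.
-- Built-in HeadlessBI interface for *analytics engineers* to build semantic models.
-- Built-in GUI for *system administrators* to manage third-party plugins and chat assistants.
+- Built-in Chat BI interface for *business users* to enter data queries.
+- Built-in Headless BI interface for *analytics engineers* to build semantic models.
+- Built-in rule-based semantic parser that can improve runtime efficiency in certain scenarios.
 - Supports input auto-completion and query recommendation.
-- Supports multi-turn conversation, automatically switching context based on the conversation.
 - Supports four-level permission control: domain level, model level, column level, and row level.
 
 ## Extensible Components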

diff --git a/.../supersonic/chat/api/pojo/ViewSchema.java → ...personic/chat/api/pojo/DataSetSchema.java b/.../supersonic/chat/api/pojo/ViewSchema.java → ...personic/chat/api/pojo/DataSetSchema.java
@@ -12,13 +12,14 @@
 import java.util.Set;
 
 @Data
-public class ViewSchema {
+public class DataSetSchema {
 
-    private SchemaElement view;
+    private SchemaElement dataSet;
     private Set<SchemaElement> metrics = new HashSet<>();
     private Set<SchemaElement> dimensions = new HashSet<>();
     private Set<SchemaElement> dimensionValues = new HashSet<>();
     private Set<SchemaElement> tags = new HashSet<>();
+    private Set<SchemaElement> tagValues = new HashSet<>();
     private SchemaElement entity = new SchemaElement();
     private QueryConfig queryConfig;
 
@@ -29,8 +30,8 @@ public SchemaElement getElement(SchemaElementType elementType, long elementID) {
             case ENTITY:
                 element = Optional.ofNullable(entity);
                 break;
-            case VIEW:
-                element = Optional.of(view);
+            case DATASET:
+                element = Optional.of(dataSet);
                 break;
             case METRIC:
                 element = metrics.stream().filter(e -> e.getId() == elementID).findFirst();
@@ -44,34 +45,8 @@ public SchemaElement getElement(SchemaElementType elementType, long elementID) {
             case TAG:
                 element = tags.stream().filter(e -> e.getId() == elementID).findFirst();
                 break;
-            default:
-        }
-
-        if (element.isPresent()) {
-            return element.get();
-        } else {
-            return null;
-        }
-    }
-
-    public SchemaElement getElement(SchemaElementType elementType, String name) {
-        Optional<SchemaElement> element = Optional.empty();
-
-        switch (elementType) {
-            case ENTITY:
-                element = Optional.ofNullable(entity);
-                break;
-            case VIEW:
-                element = Optional.of(view);
-                break;
-            case METRIC:
-                element = metrics.stream().filter(e -> name.equals(e.getName())).findFirst();
-                break;
-            case DIMENSION:
-                element = dimensions.stream().filter(e -> name.equals(e.getName())).findFirst();
-                break;
-            case VALUE:
-                element = dimensionValues.stream().filter(e -> name.equals(e.getName())).findFirst();
+            case TAG_VALUE:
+                element = tagValues.stream().filter(e -> e.getId() == elementID).findFirst();
                 break;
             default:
         }
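
The id-based lookup that this hunk extends with `TAG_VALUE` can be sketched in isolation as follows. This is a minimal sketch, not the project's class: `DataSetSchemaSketch`, `ElementType`, `Element`, and `findById` are hypothetical stand-ins for `DataSetSchema`, `SchemaElementType`, and `SchemaElement`, and only three element types are modeled. It also deviates from the diff in one place, using `Optional.ofNullable` for the `DATASET` case, because `Optional.of` throws a `NullPointerException` when the field is unset.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical, trimmed-down stand-in for DataSetSchema; not the project's class.
class DataSetSchemaSketch {

    // Assumed subset of SchemaElementType, limited to the cases used here.
    enum ElementType { DATASET, TAG, TAG_VALUE }

    // Assumed minimal element: just an id and a name.
    static final class Element {
        final long id;
        final String name;

        Element(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    Element dataSet;
    final Set<Element> tags = new HashSet<>();
    final Set<Element> tagValues = new HashSet<>();

    // Returns the matching element, or null when nothing matches.
    Element getElement(ElementType type, long elementId) {
        Optional<Element> element = Optional.empty();
        switch (type) {
            case DATASET:
                // Deviation from the diff: ofNullable tolerates an unset dataSet,
                // whereas Optional.of would throw a NullPointerException.
                element = Optional.ofNullable(dataSet);
                break;
            case TAG:
                element = findById(tags, elementId);
                break;
            case TAG_VALUE:
                element = findById(tagValues, elementId);
                break;
            default:
                break;
        }
        return element.orElse(null);
    }

    // Shared helper (hypothetical) so each case does not repeat the stream pipeline.
    private static Optional<Element> findById(Set<Element> elements, long id) {
        return elements.stream().filter(e -> e.id == id).findFirst();
    }

    public static void main(String[] args) {
        DataSetSchemaSketch schema = new DataSetSchemaSketch();
        schema.tagValues.add(new Element(7L, "rock"));

        System.out.println(schema.getElement(ElementType.TAG_VALUE, 7L).name); // rock
        System.out.println(schema.getElement(ElementType.TAG, 7L));            // null
        System.out.println(schema.getElement(ElementType.DATASET, 0L));        // null
    }
}
```

Because the method returns `null` on a miss, callers in this sketch need a null check before dereferencing the result.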

diff --git a/chat/api/src/main/java/com/tencent/supersonic/chat/api/pojo/SchemaMapInfo.java b/chat/api/src/main/java/com/tencent/supersonic/chat/api/pojo/SchemaMapInfo.java
@@ -9,25 +9,25 @@
 
 public class SchemaMapInfo {
 
-    private Map<Long, List<SchemaElementMatch>> viewElementMatches = new HashMap<>();
+    private Map<Long, List<SchemaElementMatch>> dataSetElementMatches = new HashMap<>();
 
-    public Set<Long> getMatchedViewInfos() {
-        return viewElementMatches.keySet();
+    public Set<Long> getMatchedDataSetInfos() {
+        return dataSetElementMatches.keySet();
     }
 
-    public List<SchemaElementMatch> getMatchedElements(Long view) {
-        return viewElementMatches.getOrDefault(view, Lists.newArrayList());
+    public List<SchemaElementMatch> getMatchedElements(Long dataSet) {
+        return dataSetElementMatches.getOrDefault(dataSet, Lists.newArrayList());
     }
 
-    public Map<Long, List<SchemaElementMatch>> getViewElementMatches() {
-        return viewElementMatches;
+    public Map<Long, List<SchemaElementMatch>> getDataSetElementMatches() {
+        return dataSetElementMatches;
     }
 
-    public void setViewElementMatches(Map<Long, List<SchemaElementMatch>> viewElementMatches) {
-        this.viewElementMatches = viewElementMatches;
+    public void setDataSetElementMatches(Map<Long, List<SchemaElementMatch>> dataSetElementMatches) {
+        this.dataSetElementMatches = dataSetElementMatches;
     }
 
-    public void setMatchedElements(Long view, List<SchemaElementMatch> elementMatches) {
-        viewElementMatches.put(view, elementMatches);
+    public void setMatchedElements(Long dataSet, List<SchemaElementMatch> elementMatches) {
+        dataSetElementMatches.put(dataSet, elementMatches);
     }
 }
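
One behavior of the renamed accessors may be worth illustrating: `getMatchedElements` relies on `getOrDefault`, which returns the fallback list without storing it in the map. The sketch below assumes a simplified class, `SchemaMapInfoSketch`, that uses `String` in place of `SchemaElementMatch` and `new ArrayList<>()` in place of Guava's `Lists.newArrayList()`; both substitutions are for self-containment only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for SchemaMapInfo; String replaces SchemaElementMatch.
class SchemaMapInfoSketch {

    private final Map<Long, List<String>> dataSetElementMatches = new HashMap<>();

    Set<Long> getMatchedDataSetInfos() {
        return dataSetElementMatches.keySet();
    }

    List<String> getMatchedElements(Long dataSet) {
        // The default list is returned to the caller but never put into the map.
        return dataSetElementMatches.getOrDefault(dataSet, new ArrayList<>());
    }

    void setMatchedElements(Long dataSet, List<String> elementMatches) {
        dataSetElementMatches.put(dataSet, elementMatches);
    }

    public static void main(String[] args) {
        SchemaMapInfoSketch info = new SchemaMapInfoSketch();

        // Adding to the default list alone does not register the data set.
        info.getMatchedElements(1L).add("revenue");
        System.out.println(info.getMatchedDataSetInfos()); // []

        // Get, modify, then store the list back explicitly.
        List<String> matches = info.getMatchedElements(1L);
        matches.add("revenue");
        info.setMatchedElements(1L, matches);
        System.out.println(info.getMatchedElements(1L)); // [revenue]
    }
}
```

Under these assumptions, code that appends matches for a data set with no existing entry has to call `setMatchedElements` afterwards for the change to persist.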
diff --git a/chat/api/src/main/java/com/tencent/supersonic/chat/api/pojo/SemanticParseInfo.java b/chat/api/src/main/java/com/tencent/supersonic/chat/api/pojo/SemanticParseInfo.java
@@ -26,7 +26,7 @@ public class SemanticParseInfo {
 
     private Integer id;
     private String queryMode;
-    private SchemaElement view;
+    private SchemaElement dataSet;
     private Set<SchemaElement> metrics = new TreeSet<>(new SchemaNameLengthComparator());
     private Set<SchemaElement> dimensions = new LinkedHashSet();
     private SchemaElement entity;
@@ -72,15 +72,11 @@ public Set<SchemaElement> getMetrics() {
         return metrics;
     }
 
-    public Long getViewId() {
-        if (view == null) {
+    public Long getDataSetId() {
+        if (dataSet == null) {
             return null;
         }
-        return view.getView();
-    }
-
-    public SchemaElement getModel() {
-        return view;
+        return dataSet.getDataSet();
     }
 
 }
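
Since `getDataSetId()` returns a boxed `Long` that is `null` when no data set has been resolved, callers have to guard against unboxing a `null`. The following is a minimal sketch of that contract. `SemanticParseInfoSketch` and its nested `Element` are hypothetical stand-ins for `SemanticParseInfo` and `SchemaElement`, and `findDataSetId` is an illustrative wrapper that does not appear in the diff.

```java
import java.util.Optional;

// Hypothetical stand-in for SemanticParseInfo; only the dataSet field is modeled.
class SemanticParseInfoSketch {

    // Assumed minimal SchemaElement: holds the id of the data set it belongs to.
    static final class Element {
        private final Long dataSet;

        Element(Long dataSet) {
            this.dataSet = dataSet;
        }

        Long getDataSet() {
            return dataSet;
        }
    }

    Element dataSet;

    // Mirrors the accessor in the diff: null when no data set is attached.
    Long getDataSetId() {
        if (dataSet == null) {
            return null;
        }
        return dataSet.getDataSet();
    }

    // Illustrative wrapper (not in the diff) that makes the absent case explicit.
    Optional<Long> findDataSetId() {
        return Optional.ofNullable(getDataSetId());
    }

    public static void main(String[] args) {
        SemanticParseInfoSketch info = new SemanticParseInfoSketch();
        System.out.println(info.getDataSetId());               // null
        System.out.println(info.findDataSetId().isPresent());  // false

        info.dataSet = new Element(42L);
        System.out.println(info.findDataSetId().orElse(-1L));  // 42
    }
}
```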