Merge master into release-2.2 branch (#1078)
* 2.2.0 -> 2.3.0 (#947)

* Add tests for primary key (#948)

* add changelog (#955)

* add multi-column tests (#954)

* fix range partition throwing UnsupportedSyntaxException error (#960)

* fix view parsing problem (#953)

* enable TiSpark to read from a hash partition table (#966)

* increase ci worker number (#965)

* update readme for tispark-2.1.2 release (#968)

* update document for pyspark (#975)

* fix one jar bug (#972)

* adding common port number used by spark cluster (#973)

* fix cost model in table scan (#977)

* create an UninitializedType for TypeDecimal (#979)

* update sparkr doc (#976)

* use spark-2.4.3 to run ut (#978)

* use spark-2.4.3 to run ut

* fix ci

* a better design for getting the auto table ID (#980)

* fix bug: ci SpecialTiDBTypeTestSuite failed with tidb-3.0.1 (#984)

* improve TiConfiguration getPdAddrsString function (#963)

* bump grpc to 1.17 (#982)

* Add multiple-column PK tests (#970)

* add retry for batchGet (#986)

* use TiSpark's self-made m2 cache file (#990)

* add spark sql document for batch write (#991)

* add auto mode for test.data.load (#994)

* fix typo (#996)

* fix index scan bug (#995)

* refine doc (#1003)

* add tidb-3.0 compatibility document (#998)

* add tidb-3.0 compatibility document

* address code review

* address code review

* add log4j config document (#1008)

* refactor batch write region pre-split (#999)

* add ci simple mode (#1012)

* clean up redundant code (#997)

* prohibit agg or groupby pushdown on double read (#1004)

* remove split region code (#1015)

* add supported scala version (#1013)

* Fix scala compiler version (#1010)

* fix reflection bug for hdp release (#1017) (#1018)

(cherry picked from commit 118b12e)

* check by grammarly (#1022)

* add benchmark result for batch write (#1025)

* release tispark 2.1.3 (#1026) (#1035)

(cherry picked from commit 107eb2b)

* support setting random seed in daily regression test (#1032)

* Remove create in TiSession (#1021)

* set tikv region size from 96M to 1M (#1031)

* adding unique indices test for batch write (#1014)

* use one unique seed (#1043)

* remove unused code (#1030)

* adding batch write pk insertion test (#1044)

* fix table not found bug in TiSession because of synchronization (#1041)

* fix test failure (#1051)

* fix reflection bug: pass in different arguments for different versions of the same function (#1037) (#1052)

(cherry picked from commit a5462c2)

* Adding pk and unique index test for batch write (#1049)

* fix distinct without alias bug: disable pushdown aggregate with alias (#1054)

* improve the doc (#1053)

* Refactor RegionStoreClient logic (#989)

* use stream rather than removeIf (#1057)

* Remove redundant pre-write/commit logic in LockResolverTest (#1062)

* adding recreate flag when creating TiSession (#1064)

* fix issue 1047 (#1066)

* cleanup code in TiBatchWrite (#1067)

* release tispark-2.1.4 (#1068) (#1069)

(cherry picked from commit fd8068a)

* update document for tispark-2.1.4 release (#1070)
marsishandsome authored Aug 29, 2019
1 parent b4007e4 commit 5aeb1b8
Showing 200 changed files with 6,258 additions and 2,457 deletions.
3 changes: 1 addition & 2 deletions .ci/build.groovy
@@ -7,13 +7,12 @@ def call(ghprbActualCommit, ghprbPullId, ghprbPullTitle, ghprbPullLink, ghprbPul
 
 catchError {
 node ('build') {
-def ws = pwd()
 deleteDir()
 container("java") {
 stage('Checkout') {
 dir("/home/jenkins/git/tispark") {
 sh """
-archive_url=http://172.16.30.25/download/builds/pingcap/tiflash/cache/tiflash-m2-cache_latest.tar.gz
+archive_url=http://fileserver.pingcap.net/download/builds/pingcap/tispark/cache/tispark-m2-cache-latest.tar.gz
 if [ ! "\$(ls -A /maven/.m2/repository)" ]; then curl -sL \$archive_url | tar -zx -C /maven || true; fi
 """
 if (sh(returnStatus: true, script: '[ -d .git ] && [ -f Makefile ] && git rev-parse --git-dir > /dev/null 2>&1') != 0) {
82 changes: 51 additions & 31 deletions .ci/integration_test.groovy
@@ -6,45 +6,48 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 def TIDB_BRANCH = "master"
 def TIKV_BRANCH = "master"
 def PD_BRANCH = "master"
-def MVN_PROFILE = ""
-def PARALLEL_NUMBER = 9
+def MVN_PROFILE = "-Pjenkins"
+def TEST_MODE = "simple"
+def PARALLEL_NUMBER = 18
 
 // parse tidb branch
 def m1 = ghprbCommentBody =~ /tidb\s*=\s*([^\s\\]+)(\s|\\|$)/
 if (m1) {
 TIDB_BRANCH = "${m1[0][1]}"
 }
-m1 = null
+println "TIDB_BRANCH=${TIDB_BRANCH}"
 
 // parse pd branch
 def m2 = ghprbCommentBody =~ /pd\s*=\s*([^\s\\]+)(\s|\\|$)/
 if (m2) {
 PD_BRANCH = "${m2[0][1]}"
 }
-m2 = null
+println "PD_BRANCH=${PD_BRANCH}"
 
 // parse tikv branch
 def m3 = ghprbCommentBody =~ /tikv\s*=\s*([^\s\\]+)(\s|\\|$)/
 if (m3) {
 TIKV_BRANCH = "${m3[0][1]}"
 }
-m3 = null
+println "TIKV_BRANCH=${TIKV_BRANCH}"
 
 // parse mvn profile
 def m4 = ghprbCommentBody =~ /profile\s*=\s*([^\s\\]+)(\s|\\|$)/
 if (m4) {
-MVN_PROFILE = "-P${m4[0][1]}"
+MVN_PROFILE = MVN_PROFILE + " -P${m4[0][1]}"
 }
 
+// parse test mode
+def m5 = ghprbCommentBody =~ /mode\s*=\s*([^\s\\]+)(\s|\\|$)/
+if (m5) {
+TEST_MODE = "${m5[0][1]}"
+}
+
 def readfile = { filename ->
 def file = readFile filename
 return file.split("\n") as List
 }
 
-def remove_last_str = { str ->
-return str.substring(0, str.length() - 1)
-}
-
 def get_mvn_str = { total_chunks ->
 def mvnStr = " -DwildcardSuites="
 for (int i = 0 ; i < total_chunks.size() - 1; i++) {
@@ -65,8 +68,7 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 println "${NODE_NAME}"
 container("golang") {
 deleteDir()
-def ws = pwd()
-
+
 // tidb
 def tidb_sha1 = sh(returnStdout: true, script: "curl ${FILE_SERVER_URL}/download/refs/pingcap/tidb/${TIDB_BRANCH}/sha1").trim()
 sh "curl ${FILE_SERVER_URL}/download/builds/pingcap/tidb/${tidb_sha1}/centos7/tidb-server.tar.gz | tar xz"
@@ -90,23 +92,38 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 sh """
 cp -R /home/jenkins/git/tispark/. ./
 git checkout -f ${ghprbActualCommit}
-find core/src -name '*Suite*' > test
+find core/src -name '*Suite*' | grep -v 'MultiColumnPKDataTypeSuite' > test
+shuf test -o test2
+mv test2 test
+"""
+
+if(TEST_MODE != "simple") {
+sh """
+find core/src -name '*MultiColumnPKDataTypeSuite*' >> test
+"""
+}
+
+sh """
 sed -i 's/core\\/src\\/test\\/scala\\///g' test
 sed -i 's/\\//\\./g' test
 sed -i 's/\\.scala//g' test
-shuf test -o test2
-mv test2 test
-split test -n r/$PARALLEL_NUMBER test_unit_ -a 1 --numeric-suffixes=1
+split test -n r/$PARALLEL_NUMBER test_unit_ -a 2 --numeric-suffixes=1
 """
 
 for (int i = 1; i <= PARALLEL_NUMBER; i++) {
-sh """cat test_unit_$i"""
+if(i < 10) {
+sh """cat test_unit_0$i"""
+} else {
+sh """cat test_unit_$i"""
+}
 }
 
 sh """
+cd tikv-client
+./scripts/proto.sh
+cd ..
 cp .ci/log4j-ci.properties core/src/test/resources/log4j.properties
 bash core/scripts/version.sh
 bash core/scripts/fetch-test-data.sh
 mv core/src/test core-test/src/
-bash tikv-client/scripts/proto.sh
 """
 }
@@ -120,31 +137,35 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 
 def run_tispark_test = { chunk_suffix ->
 dir("go/src/github.com/pingcap/tispark") {
-run_chunks = readfile("test_unit_${chunk_suffix}")
+if(chunk_suffix < 10) {
+run_chunks = readfile("test_unit_0${chunk_suffix}")
+} else {
+run_chunks = readfile("test_unit_${chunk_suffix}")
+}
+
 print run_chunks
 def mvnStr = get_mvn_str(run_chunks)
 sh """
-archive_url=http://172.16.30.25/download/builds/pingcap/tiflash/cache/tiflash-m2-cache_latest.tar.gz
+archive_url=http://fileserver.pingcap.net/download/builds/pingcap/tispark/cache/tispark-m2-cache-latest.tar.gz
 if [ ! "\$(ls -A /maven/.m2/repository)" ]; then curl -sL \$archive_url | tar -zx -C /maven || true; fi
 """
 sh """
+cp .ci/log4j-ci.properties core/src/test/resources/log4j.properties
 export MAVEN_OPTS="-Xmx6G -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=51M"
-mvn compile ${MVN_PROFILE} -DskipCloneProtoFiles=true
-mvn test ${MVN_PROFILE} -Dtest=moo ${mvnStr} -DskipCloneProtoFiles=true
+mvn compile ${MVN_PROFILE}
+mvn test ${MVN_PROFILE} -Dtest=moo ${mvnStr}
 """
 }
 }
 
 def run_tikvclient_test = { chunk_suffix ->
 dir("go/src/github.com/pingcap/tispark") {
 sh """
-archive_url=http://172.16.30.25/download/builds/pingcap/tiflash/cache/tiflash-m2-cache_latest.tar.gz
+archive_url=http://fileserver.pingcap.net/download/builds/pingcap/tispark/cache/tispark-m2-cache-latest.tar.gz
 if [ ! "\$(ls -A /maven/.m2/repository)" ]; then curl -sL \$archive_url | tar -zx -C /maven || true; fi
 """
 sh """
 export MAVEN_OPTS="-Xmx6G -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512M"
-mvn test ${MVN_PROFILE} -am -pl tikv-client -DskipCloneProtoFiles=true
+mvn test ${MVN_PROFILE} -am -pl tikv-client
 """
 unstash "CODECOV_TOKEN"
 sh 'curl -s https://codecov.io/bash | bash -s - -t @CODECOV_TOKEN'
@@ -155,7 +176,6 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 node("test_java") {
 println "${NODE_NAME}"
 container("java") {
-def ws = pwd()
 deleteDir()
 unstash 'binaries'
 unstash 'tispark'
@@ -167,17 +187,17 @@ def call(ghprbActualCommit, ghprbCommentBody, ghprbPullId, ghprbPullTitle, ghprb
 killall -9 tikv-server || true
 killall -9 pd-server || true
 sleep 10
-bin/pd-server --name=pd --data-dir=pd &>pd.log &
+bin/pd-server --name=pd --data-dir=pd --config=go/src/github.com/pingcap/tispark/config/pd.toml &>pd.log &
 sleep 10
-bin/tikv-server --pd=127.0.0.1:2379 -s tikv --addr=0.0.0.0:20160 --advertise-addr=127.0.0.1:20160 &>tikv.log &
+bin/tikv-server --pd=127.0.0.1:2379 -s tikv --addr=0.0.0.0:20160 --advertise-addr=127.0.0.1:20160 --config=go/src/github.com/pingcap/tispark/config/tikv.toml &>tikv.log &
 sleep 10
 ps aux | grep '-server' || true
 curl -s 127.0.0.1:2379/pd/api/v1/status || true
 bin/tidb-server --store=tikv --path="127.0.0.1:2379" --config=go/src/github.com/pingcap/tispark/config/tidb.toml &>tidb.log &
 sleep 60
 """
 
-timeout(60) {
+timeout(120) {
 run_test(chunk_suffix)
 }
 } catch (err) {
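Taken together, the changes above make the integration test configurable from the trigger comment: `tidb=`, `tikv=`, and `pd=` pin component branches, `profile=` appends a Maven profile, and `mode=` selects the simple or full test set. A standalone sketch of that parsing idea in Scala (the real script does this in Groovy; the comment string and object name here are hypothetical):

```scala
// Sketch: pull `key=value` overrides out of a CI trigger comment,
// mirroring the Groovy regexes in .ci/integration_test.groovy.
object CommentOverrides {
  private val keys = Seq("tidb", "tikv", "pd", "profile", "mode")

  def parse(comment: String): Map[String, String] =
    keys.flatMap { key =>
      val pattern = (key + """\s*=\s*([^\s\\]+)""").r
      pattern.findFirstMatchIn(comment).map(m => key -> m.group(1))
    }.toMap
}

// Example (hypothetical trigger comment):
// CommentOverrides.parse("/run-all-tests tidb=release-3.0 mode=full")
// returns Map(tidb -> release-3.0, mode -> full)
```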
2 changes: 2 additions & 0 deletions .ci/log4j-ci.properties
@@ -24,3 +24,5 @@ log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
 
 # tispark
 log4j.logger.com.pingcap=ERROR
+log4j.logger.com.pingcap.tispark.utils.ReflectionUtil=DEBUG
+log4j.logger.org.apache.spark.sql.test.SharedSQLContext=DEBUG
2 changes: 2 additions & 0 deletions .ci/tidb_config-for-daily-test.properties
@@ -0,0 +1,2 @@
+# The seed used to generate test data (0 means random).
+test.data.generate.seed=0
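The new property makes daily regression data reproducible: a fixed seed replays the same generated data, while 0 picks a fresh seed each run. A minimal sketch of how such a seed is typically consumed (illustrative only, not TiSpark's actual loader; the helper name is an assumption):

```scala
import java.util.Properties
import scala.util.Random

// Sketch: honor test.data.generate.seed, where 0 means "choose randomly",
// and log the effective seed so a failing run can be replayed.
object TestDataSeed {
  def rngFrom(props: Properties): Random = {
    val configured = props.getProperty("test.data.generate.seed", "0").toLong
    val effective  = if (configured == 0L) Random.nextLong() else configured
    println(s"test data seed = $effective") // record for replay
    new Random(effective)
  }
}
```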
147 changes: 147 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,147 @@
# TiSpark Changelog
All notable changes to this project will be documented in this file.

## [TiSpark 2.1.4] 2019-08-27
### Fixes
- Fix distinct without alias bug: disable pushdown aggregate with alias [#1055](https://github.com/pingcap/tispark/pull/1055)
- Fix reflection bug: pass in different arguments for different versions of the same function [#1037](https://github.com/pingcap/tispark/pull/1037)

## [TiSpark 2.1.3] 2019-08-15
### Fixes
- Fix cost model in table scan [#1023](https://github.com/pingcap/tispark/pull/1023)
- Fix index scan bug [#1024](https://github.com/pingcap/tispark/pull/1024)
- Prohibit aggregate or group by pushdown on double read [#1027](https://github.com/pingcap/tispark/pull/1027)
- Fix reflection bug for HDP release [#1017](https://github.com/pingcap/tispark/pull/1017)
- Fix scala compiler version [#1019](https://github.com/pingcap/tispark/pull/1019)

## [TiSpark 2.2.0]
### New Features
* Natively support writing data to TiKV using Spark Data Source API
* Support select from partition table [#916](https://github.com/pingcap/tispark/pull/916)
* Release a single TiSpark jar (supporting both Spark-2.3.x and Spark-2.4.x) instead of two [#933](https://github.com/pingcap/tispark/pull/933)
* Add Spark version to TiSpark UDF `ti_version` [#943](https://github.com/pingcap/tispark/pull/943)
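For orientation, a write through the new Data Source API looks roughly like the following (a sketch based on the batch-write documentation added in this series; option names such as `tidb.addr`, `database`, and `table` should be checked against the shipped docs):

```scala
// Sketch: append a DataFrame to an existing TiDB table with TiSpark batch write.
val df = spark.sql("SELECT * FROM source_db.source_table")
df.write
  .format("tidb")                    // TiSpark's data source
  .option("tidb.addr", "127.0.0.1")  // TiDB endpoint used for metadata
  .option("tidb.port", "4000")
  .option("tidb.user", "root")
  .option("tidb.password", "")
  .option("database", "test")
  .option("table", "target_table")
  .mode("append")                    // batch write appends to an existing table
  .save()
```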

## [TiSpark 2.1.2] 2019-07-29
### Fixes
* Fix improper response with region error [#922](https://github.com/pingcap/tispark/pull/922)
* Fix view parsing problem [#953](https://github.com/pingcap/tispark/pull/953)

## [TiSpark 1.2.1]
### Fixes
* Fix count error: if advanceNextResponse is empty, read the next region (#899)
* Use fixed version of proto (#898)

## [TiSpark 2.1.1]
### Fixes
* Add TiDB/TiKV/PD version and Spark version supported for each latest major release (#804) (#887)
* Fix incorrect timestamp of tidbMapDatabase (#862) (#885)
* Fix column size estimation (#858) (#884)
* Fix count error: if advanceNextResponse is empty, read the next region (#878) (#882)
* Use fixed version of proto instead of master branch (#843) (#850)

## [TiSpark 2.1]
### Features
* Support range partition pruning (Beta) (#599)
* Support show columns command (#614)

### Fixes
* Fix build key ranges with xor expression (#576)
* Fix failure to initialize PD when using an IPv6 address (#587)
* Fix default value bug (#596)
* Fix possible IndexOutOfBoundException in KeyUtils (#597)
* Fix incorrect outputOffset when building DAGRequest (#615)
* Fix incorrect implementation of Key.next() (#648)
* Fix partition parser failing to parse numerical value 0 (#651)
* Fix prefix length possibly larger than the value used (#668)
* Fix retry logic when scan meets lock (#666)
* Fix inconsistent timestamp (#676)
* Fix tempView possibly unresolved when applying timestamp to plan (#690)
* Fix concurrent DAGRequest issue (#714)
* Fix downgrade scan logic (#725)
* Fix integer type default value to be parsed as long (#741)
* Fix index scan on partition table (#735)
* Fix KeyNotInRegion error that may occur when retrieving rows by handle (#755)
* Fix encoding of long max value (#761)
* Fix MatchErrorException that may occur when unsigned BigInt appears in group by columns (#780)
* Fix IndexOutOfBoundException when trying to get PD member (#788)

## [TiSpark 2.0]
### Features
* Work with Spark 2.3
* Support use `$database` statement
* Support show databases statement
* Support show tables statement
* No need to use `TiContext.mapTiDBDatabase`, use `$database.$table` to identify a table instead
* Support data type SET and ENUM
* Support data type YEAR
* Support data type TIME
* Support isolation level settings
* Support describe table command
* Support cache tables and uncache tables
* Support read from a TiDB partition table
* Support use TiDB as metastore
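Together these mean TiDB tables are addressed directly from Spark SQL, for example (database and table names are placeholders):

```scala
// Sketch: TiSpark 2.0 statement support in Spark SQL.
spark.sql("use tpch_test")                          // `use $database`
spark.sql("show tables").show()                     // show tables statement
spark.sql("select count(*) from lineitem").show()
spark.sql("select * from tpch_test.nation").show()  // `$database.$table`, no TiContext.mapTiDBDatabase needed
```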

### Fixes
* Fix JSON parsing (#491)
* Fix count on empty table (#498)
* Fix ScanIterator unable to read from adjacent empty regions (#519)
* Fix possible NullPointerException when setting show_row_id true (#522)

### Improved
* Make `ti_version` usable without selecting a database (#545)

## [TiSpark 1.2]
### Fixes
* Fix compatibility with PD server (#480)

## [TiSpark 1.1]
### Fixes
* Fix daylight saving time (DST) (#347)
* Fix count(1) always returning 0 if subquery contains limit (#346)
* Fix incorrect totalRowCount calculation (#353)
* Fix request failure with Key not in region after retrying NotLeaderError (#354)
* Fix ScanIterator logic where index may be out of bounds (#357)
* Fix tispark-sql dbName (#379)
* Fix StoreNotMatch (#396)
* Fix utf8 prefix index (#400)
* Fix decimal decoding (#401)
* Refactor not-leader retry logic (#412)
* Fix global temp view not visible in thriftserver (#437)

### Added
* Allow TiSpark to retrieve row ID (#367)
* Decode JSON to string (#417)

### Improvements
* Improve error log for PD connection issues (#388)
* Add DB prefix option for TiDB tables (#416)

## [TiSpark 1.0.1]
* Fix unsigned index
* Compatible with TiDB versions both before and after commit 48a42f

## [TiSpark 1.0 GA]
### New Features
TiSpark provides distributed computing of TiDB data using Apache Spark.

* Provide a gRPC communication framework to read data from TiKV
* Provide encoding and decoding of TiKV component data and communication protocol
* Provide calculation pushdown, which includes:
- Aggregate pushdown
- Predicate pushdown
- TopN pushdown
- Limit pushdown
* Provide index related support
- Transform predicate into Region key range or secondary index
- Optimize Index Only queries
- Adaptively downgrade index scan to table scan per region
* Provide cost-based optimization
- Support statistics
- Select index
- Estimate broadcast table cost
* Provide support for multiple Spark interfaces
- Support Spark Shell
- Support ThriftServer/JDBC
- Support Spark-SQL interaction
- Support PySpark Shell
- Support SparkR
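As a quick illustration of the pushdown support listed above: in a query like the following, TiSpark evaluates the filter and partial aggregation inside TiKV's coprocessor, and Spark only merges the partial results (table names are placeholders; exact plan output varies by version):

```scala
// Sketch: predicate + aggregate pushdown.
val agg = spark.sql(
  """SELECT l_returnflag, COUNT(*) AS cnt
    |FROM tpch_test.lineitem
    |WHERE l_shipdate < '1996-01-01'
    |GROUP BY l_returnflag""".stripMargin)
agg.explain() // a CoprocessorRDD in the plan indicates work pushed to TiKV
agg.show()
```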
10 changes: 0 additions & 10 deletions R/.gitignore

This file was deleted.

11 changes: 0 additions & 11 deletions R/DESCRIPTION

This file was deleted.

1 change: 0 additions & 1 deletion R/NAMESPACE

This file was deleted.

