The VMware Tanzu Greenplum Connector for Apache Spark provides high-speed, parallel data transfer between Greenplum Database and an Apache Spark cluster using Spark's Scala API for programmatic access (including the spark-shell REPL).
Refer to the VMware Tanzu Greenplum documentation for detailed information about Greenplum Database.
See the Apache Spark documentation for information about Apache Spark version 2.4.
Supported Platforms
The following table identifies the supported component versions for the VMware Tanzu Greenplum Connector for Apache Spark 2.0:
| Connector Version | Greenplum Version | Spark Version | Scala Version | PostgreSQL JDBC Driver Version |
|---|---|---|---|---|
| 2.0.0 | 5.x, 6.x | 2.3.x, 2.4.x | 2.11 | 42.2.14 |
| 2.0.0 | 5.x, 6.x | 2.4.x, 3.0.x | 2.12 | 42.2.14 |
The Connector is certified against the Greenplum, Spark, and Scala versions listed above. The Connector is bundled with, and certified against, the listed PostgreSQL JDBC driver version.
VMware Tanzu Greenplum Connector for Apache Spark 2.0.0
Released: October 2, 2020
VMware Tanzu Greenplum Connector for Apache Spark 2.0.0 includes new and changed features and bug fixes.
New and Changed Features
VMware Tanzu Greenplum Connector for Apache Spark 2.0.0 includes these new and changed features:
- The Connector is certified against the Scala, Spark, and JDBC driver versions identified in Supported Platforms above.
- The Connector is now bundled with the PostgreSQL JDBC driver version 42.2.14.
- The Connector package that you download from Broadcom Support Portal is now a `.tar.gz` file that includes the product open source license and the Connector JAR file. The naming format of the file is `greenplum-connector-apache-spark-scala_<scala-version>-<gsc-version>.tar.gz`. For example:
  - `greenplum-connector-apache-spark-scala_2.11-2.0.0.tar.gz`
  - `greenplum-connector-apache-spark-scala_2.12-2.0.0.tar.gz`
- The default `gpfdist` server connection activity timeout changes from 30 seconds to 5 minutes.
- A new `server.timeout` option is provided that a developer can use to specify the `gpfdist` server connection activity timeout.
- The Connector improves read performance from Greenplum Database by using the internal Greenplum table column named `gp_segment_id` as the default `partitionColumn` when the developer does not specify this option.
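The `server.timeout` option described above might be set as in the following sketch. Treat it as illustrative only: the option names other than `server.timeout` and `partitionColumn` (`url`, `user`, `password`, `dbschema`, `dbtable`), all connection values, and the millisecond unit for the timeout are assumptions that these release notes do not confirm, so check the Connector options reference before relying on them.

```scala
object GscReadOptions {
  // Hypothetical connection settings; every value here is a placeholder.
  val base: Map[String, String] = Map(
    "url"      -> "jdbc:postgresql://gpmaster.example.com:5432/testdb",
    "user"     -> "gpadmin",
    "password" -> "changeme",
    "dbschema" -> "public",
    "dbtable"  -> "orders"
  )

  // Adds a gpfdist connection activity timeout to a set of options.
  // Assumption: the value is expressed in milliseconds.
  def withServerTimeout(opts: Map[String, String], timeoutMillis: Long): Map[String, String] =
    opts + ("server.timeout" -> timeoutMillis.toString)

  // Ten minutes instead of the 5-minute default. No partitionColumn is set,
  // so the Connector would fall back to its default, gp_segment_id.
  val readOptions: Map[String, String] = withServerTimeout(base, 10L * 60L * 1000L)
}
```

In a Spark application, a map like `GscReadOptions.readOptions` would typically be handed to `spark.read.format("greenplum").options(...).load()`; the `greenplum` format name is likewise an assumption here.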
Resolved Issues
The following issues were resolved in VMware Tanzu Greenplum Connector for Apache Spark version 2.0.0:
| Bug Id | Summary |
|---|---|
| 30731 | Resolved an issue where the Connector timed out with a serialization exception when writing aggregated results to Greenplum Database. The Connector now exposes the server.timeout option to specify the gpfdist "no activity" timeout, and sets the default timeout to 5 minutes. |
| 174495848 | Resolved an issue where predicate pushdown was not working correctly because the Connector did not use parentheses to join the predicates together when it constructed the filter string. |
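The predicate pushdown issue (174495848) comes down to SQL operator precedence: `AND` binds more tightly than `OR`, so joining predicates without parentheses can change their meaning. The following sketch illustrates the idea only; `joinUnsafe` and `joinSafe` are hypothetical helpers, not the Connector's actual implementation, and the column names are invented.

```scala
object PredicateJoin {
  // Joins predicates with AND without grouping them. A predicate that
  // contains OR is then no longer evaluated as a single unit.
  def joinUnsafe(predicates: Seq[String]): String =
    predicates.mkString(" AND ")

  // Wraps each predicate in parentheses before joining, so that each one
  // is evaluated as a unit regardless of the operators it contains.
  def joinSafe(predicates: Seq[String]): String =
    predicates.map(p => s"($p)").mkString(" AND ")

  // Hypothetical filters pushed down from a Spark query.
  val predicates: Seq[String] = Seq("region = 'EU' OR region = 'US'", "amount > 100")
}
```

Here `joinUnsafe` yields `region = 'EU' OR region = 'US' AND amount > 100`, which SQL evaluates as `region = 'EU' OR (region = 'US' AND amount > 100)`, whereas `joinSafe` yields `(region = 'EU' OR region = 'US') AND (amount > 100)`.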
Removed Features
The Connector version 2.x removes:
- Support for Greenplum Database 4.x.
- The `connector.port` option (deprecated in 1.6).
- The `partitionsPerSegment` option (deprecated in 1.5).
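When upgrading an application from a 1.x release, it can help to scan option maps for the removed options before passing them to the Connector. A minimal sketch, assuming options are held in a `Map[String, String]`; `findRemovedOptions` is a hypothetical helper, not part of the Connector, and the sample values are placeholders.

```scala
object RemovedOptionCheck {
  // Options removed in Connector 2.x, per the list above.
  val removedOptions: Set[String] = Set("connector.port", "partitionsPerSegment")

  // Returns the removed option names present in the given options, sorted
  // alphabetically so the result is stable.
  def findRemovedOptions(opts: Map[String, String]): Seq[String] =
    opts.keySet.intersect(removedOptions).toSeq.sorted

  // Hypothetical option map carried over from a 1.x application.
  val legacy: Map[String, String] = Map(
    "dbtable"              -> "orders",
    "connector.port"       -> "12900",
    "partitionsPerSegment" -> "2"
  )
}
```

An application could call `RemovedOptionCheck.findRemovedOptions` at startup and fail fast, or log a warning, if the result is non-empty.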
Known Issues and Limitations
Known issues and limitations related to the 2.x release of the VMware Tanzu Greenplum Connector for Apache Spark include the following:
- The Connector cannot use `gp_segment_id` as the `partitionColumn` (the default) when reading data from Greenplum Database and mirroring is enabled in the Greenplum cluster.
- The Connector does not support reading from or writing to Greenplum Database when your Spark cluster is deployed on Kubernetes.
- The Connector supports basic data types such as Float, Integer, String, and Date/Time. The Connector does not yet support more complex types. See Greenplum Database ↔ Spark Data Type Mapping for additional information.
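For clusters with mirroring enabled, the first limitation above implies that an explicit `partitionColumn` should be supplied rather than relying on the default. A minimal sketch, assuming options are held in a `Map[String, String]` and that the table has a column named `id` suitable for partitioning; the helper, the table name, and the column name are all hypothetical.

```scala
object MirroredClusterOptions {
  // Returns options with an explicit partitionColumn. An already specified
  // partitionColumn is left untouched.
  def withPartitionColumn(opts: Map[String, String], column: String): Map[String, String] =
    if (opts.contains("partitionColumn")) opts
    else opts + ("partitionColumn" -> column)

  // Hypothetical read options for a table in a mirrored Greenplum cluster.
  val readOptions: Map[String, String] =
    withPartitionColumn(Map("dbtable" -> "orders"), "id")
}
```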