Spark

This release provides Spark 1.6.3 with no additional Apache patches. In addition, this release provides Spark 2.2.0 and the following Apache patches:

SPARK-207: Remove hardcoded FS scheme from Spark archive.
SPARK-12717: Adding thread-safe broadcast pickle registry.
SPARK-13656: Delete spark.sql.parquet.cacheMetadata from SQLConf and docs.
SPARK-14387: Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc.
SPARK-15648: Add TeradataDialect.
SPARK-16060: Vectorized Orc Reader.
SPARK-16628: OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files.
SPARK-18061: Support spnego for ThriftServer thrift/http auth.
SPARK-18136: Fix SPARK_JARS_DIR for Python pip install on Windows.
SPARK-18355: Spark SQL fails to read data from an ORC Hive table that has a new column added to it.
SPARK-18464: Support old tables that don't store schema in table properties.
SPARK-18608: Fix double caching.
SPARK-18608: Fix double caching for PySpark OneVsRest.
SPARK-19809: NullPointerException on empty ORC file.
SPARK-20079: Fix client AM not allocating executors after restart.
SPARK-20098: dataType's typeName fix.
SPARK-20256: SessionState should be created more lazily.
SPARK-20342: Update task accumulators before sending task end event.
SPARK-20466: HadoopRDD#addLocalConfiguration throws NPE.
SPARK-20682: Support a new faster ORC data source based on Apache ORC.
SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core.
SPARK-20904: Don't report task failures to driver during shutdown.
SPARK-20981: Add new configuration spark.jars.repositories as equivalence of --repositories.
SPARK-21069: Add rate source to programming guide.
SPARK-21083: Store zero size and row count when analyzing empty table.
SPARK-21128: Remove both "spark-warehouse" and "metastore_db" before listing files in R tests.
SPARK-21170: Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted.
SPARK-21219: Task retry occurs on same executor due to race co….
SPARK-21228: InSet incorrect handling of structs.
SPARK-21243: Limit no. of map outputs in a shuffle fetch.
SPARK-21247: Type comparison should respect case-sensitive SQL conf.
SPARK-21254: History UI performance fixes.
SPARK-21267: Follow up to avoid referencing programming-guide redirector.
SPARK-21267: Update Structured Streaming Documentation.
SPARK-21272: SortMergeJoin LeftAnti does not update numOutputRows.
SPARK-21300: ExternalMapToCatalyst should null-check map key prior to converting to internal value.
SPARK-21306: OneVsRest should support setWeightCol.
SPARK-21312: correct offsetInBytes in UnsafeRow.writeToStream.
SPARK-21321: Spark very verbose on shutdown.
SPARK-21330: Bad partitioning does not allow to read a JDBC table with extreme values on the partition column. Fix Java SimpleApp spark application.
SPARK-21332: Incorrect result type inferred for some decimal expressions.
SPARK-21333: Removed invalid joinTypes from javadoc of Dataset#joinWith.
SPARK-21339: spark-shell --packages option does not add jars to classpath on windows.
SPARK-21342: Fix DownloadCallback to work well with RetryingBlockFetcher.
SPARK-21343: Refine the document for spark.reducer.maxReqSizeShuffleToMem.
SPARK-21344: BinaryType comparison does signed byte array comparison.
SPARK-21345: SparkSessionBuilderSuite should clean up stopped sessions.
SPARK-21366: Add sql test for window functions.
SPARK-21369: Don't use Scala Tuple2 in common/network-*.
SPARK-21374: Fix reading globbed paths from S3 into DF with disabled FS cache.
SPARK-21376: Fix yarn client token expire issue when cleaning the staging files in long running scenario.
SPARK-21377: Make jars specified with --jars/--packages loadable in AM's credential renewer.
SPARK-21383: Fix YarnAllocator allocating more resources.
SPARK-21384: Spark + YARN fails with LocalFileSystem as default FS.
SPARK-21414: Refine SlidingWindowFunctionFrame to avoid OOM.
SPARK-21418: NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true.
SPARK-21422: Depend on Apache ORC 1.4.0.
SPARK-21434: Add pyspark pip documentation.
SPARK-21441: Incorrect Codegen in SortMergeJoinExec results in failures in some cases.
SPARK-21445: Make IntWrapper and LongWrapper in UTF8String Serializable.
SPARK-21446: Fix setAutoCommit never executed.
SPARK-21447: Spark history server fails to render compressed.
SPARK-21457: ExternalCatalog.listPartitions should correctly handle partition values with dot.
SPARK-21464: Minimize deprecation warnings caused by ProcessingTime class.
SPARK-21475: Use NIO's Files API to replace FileInputStream/FileOutputStream in some critical paths.
SPARK-21477: Mark LocalTableScanExec's input data transient.
SPARK-21494: Use correct app id when authenticating to external service.
SPARK-21503: Spark UI shows incorrect task status for a killed Executor Process.
SPARK-21508: Fix example code provided in Spark Streaming Documentation.
SPARK-21522: Fix flakiness in LauncherServerSuite.
SPARK-21523: update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search.
SPARK-21538: Attribute resolution inconsistency in the Dataset API.
SPARK-21546: dropDuplicates should ignore watermark when it's not a key.
SPARK-21549: Respect OutputFormats with no output directory provided.
SPARK-21549: Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs.
SPARK-21551: Increase timeout for PythonRDD.serveIterator.
SPARK-21555: RuntimeReplaceable should be compared semantically by its canonicalized child.
SPARK-21563: Fix race condition when serializing TaskDescriptions and adding jars.
SPARK-21565: Propagate metadata in attribute replacement.
SPARK-21580: Integers in aggregation expressions are wrongly taken as group-by ordinal.
SPARK-21588: SQLContext.getConf(key, null) should return null.
SPARK-21593: Fix 2 rendering errors on configuration page.
SPARK-21595: Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray.
SPARK-21596: Ensure places calling HDFSMetadataLog.get check the return value.
SPARK-21597: Fix a potential overflow issue in EventTimeStats.
SPARK-21617: Store correct table metadata when altering schema in Hive metastore.
SPARK-21621: Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called.
SPARK-21647: Fix SortMergeJoin when using CROSS.
SPARK-21648: Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided.
SPARK-21656: spark dynamic allocation should not idle timeout executors when tasks still to run.
SPARK-21663: test("remote fetch below max RPC message size") should call masterTracker.stop() in MapOutputTrackerSuite.
SPARK-21681: Fix bug where MLOR does not work correctly when featureStd contains zero (backport PR for 2.2).
SPARK-21696: Fix a potential issue that may generate partial snapshot files.
SPARK-21699: Remove unused getTableOption in ExternalCatalog.
SPARK-21714: Avoiding re-uploading remote resources in yarn client mode.
SPARK-21721: Clear FileSystem deleteOnExit cache when paths are successfully removed.
SPARK-21723: Fix writing LibSVM (key not found: numFeatures).
SPARK-21738: Thriftserver doesn't cancel jobs when session is closed.
SPARK-21739: Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType.
SPARK-21787: Support for pushing down filters for date types in ORC.
SPARK-21791: ORC should support column names with dot.
SPARK-21798: No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server.
SPARK-21805: Disable R vignettes code on Windows.
SPARK-21807: Override ++ operation in ExpressionSet to reduce clone time.
SPARK-21818: Fix bug where MultivariateOnlineSummarizer.variance generates a negative result.
SPARK-21826: outer broadcast hash join should not throw NPE.
SPARK-21831: Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite.
SPARK-21834: Incorrect executor request in case of dynamic allocation.
SPARK-21890: Credentials not being passed to add the tokens.
SPARK-21901: Define toString for StateOperatorProgress.
SPARK-21907: NullPointerException in UnsafeExternalSorter.spill().
SPARK-21912: ORC/Parquet table should not create invalid column names.
SPARK-21915: Model 1 and Model 2 ParamMaps Missing.
SPARK-21924: Update structured streaming programming guide doc.
SPARK-21925: Update trigger interval documentation in docs with behavior change in Spark 2.2.
SPARK-21928: Set classloader on SerializerManager's private kryo.
SPARK-21929: Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source.
SPARK-21936: backward compatibility test framework for HiveExternalCatalog.
SPARK-21946: fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite.
SPARK-21950: pyspark.sql.tests.SQLTests2 should stop SparkContext.
SPARK-21953: Show both memory and disk bytes spilled if either is present.
SPARK-21954: JacksonUtils should verify MapType's value type instead of key type.
SPARK-21976: Fix wrong documentation for Mean Absolute Error.
SPARK-21980: References in grouping functions should be indexed with semanticEquals.
SPARK-21985: PairDeserializer is broken for double-zipped RDDs.
SPARK-22043: Improves error message for show_profiles and dump_profiles.
SPARK-22047: HiveExternalCatalogVersionsSuite.
SPARK-22047: ignore HiveExternalCatalogVersionsSuite.
SPARK-22052: Incorrect Metric assigned in MetricsReporter.scala.
SPARK-22072: Improve release build scripts.
SPARK-22076: Expand.projections should not be a Stream.
SPARK-22083: Release locks in MemoryStore.evictBlocksToFreeSpace.
SPARK-22092: Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data.
SPARK-22094: processAllAvailable should check the query state.
SPARK-22107: Change as to alias in python quickstart.
SPARK-22109: Resolves type conflicts between strings and timestamps in partition column.
SPARK-22120: TestHiveSparkSession.reset() should clean out Hive warehouse directory.
SPARK-22129: Release script improvements.
SPARK-22135: metrics in spark-dispatcher not being registered properly.
SPARK-22140: Add TPCDSQuerySuite.
SPARK-22141: Propagate empty relation before checking Cartesian products.
SPARK-22143: Fix memory leak in OffHeapColumnVector.
SPARK-22146: FileNotFoundException while reading ORC files containing special characters.
SPARK-22158: convertMetastore should not ignore table property.
SPARK-22161: Add Impala-modified TPC-DS queries.
SPARK-22167: sparkr packaging issue allow zinc.
SPARK-22178: Refresh Persistent Views by REFRESH TABLE Command.
SPARK-22188: Adding security headers for preventing XSS, MitM and MIME sniffing.
SPARK-22206: gapply in R can't work on empty grouping columns.
SPARK-22218: spark shuffle services fails to update secret on app re-attempts.
SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle.
SPARK-22252: FileFormatWriter should respect the input query schema.
SPARK-22271: mean overflows and returns null for some decimal variables.
SPARK-22273: Fix key/value schema field names in HashMapGenerators.
SPARK-22300: Update ORC to 1.4.1.

HDP 2.6.2 provided Spark 1.6.3 and Spark 2.1.1 with no additional Apache patches.

HDP 2.6.1 provided Spark 1.6.3 with no additional Apache patches.

HDP 2.6.1 also provided Spark 2.1.1 and the following Apache patches:

SPARK-4105: retry the fetch or stage if shuffle block is corrupt.
SPARK-12717: Adding thread-safe broadcast pickle registry.
SPARK-13931: Resolve stage hanging up problem in a particular case.
SPARK-14658: When an executor is lost, DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished.
SPARK-16251: Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message.
SPARK-16929: Speculation-related synchronization bottleneck in checkSpeculatableTasks.
SPARK-17424: Fix unsound substitution bug in ScalaReflection.
SPARK-17663: SchedulableBuilder should handle invalid data access via scheduler.allocation.file.
SPARK-17685: Make sure SortMergeJoinExec's currentVars is null when calling createJoinKey.
SPARK-18099: Spark distributed cache should throw exception if the same file is specified in both --files and --archives.
SPARK-18113: Use ask to replace askWithRetry in canCommit and make receiver idempotent.
SPARK-18251: DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class.
SPARK-18406: Race between end-of-task and completion iterator read lock release.
SPARK-18535: Redact sensitive information.
SPARK-18579: Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing.
SPARK-18629: Fix numPartition of JDBCSuite Testcase.
SPARK-18967: Locality preferences should be used when scheduling even when delay scheduling is turned off.
SPARK-18986: ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator.
SPARK-19059: Unable to retrieve data from parquet table whose name starts with an underscore.
SPARK-19104: Lambda variables in ExternalMapToCatalyst should be global.
SPARK-19218: Fix SET command to show a result correctly and in a sorted order.
SPARK-19219: Fix Parquet log output defaults.
SPARK-19220: SSL redirect handler only redirects the server's root.
SPARK-19263: DAGScheduler should avoid sending conflicting task set.
SPARK-19263: Fix race in SchedulerIntegrationSuite.
SPARK-19276: FetchFailures can be hidden by user (or sql) exception handling.
SPARK-19539: Block duplicate temp table during creation.
SPARK-19556: Broadcast data is not encrypted when I/O encryption is on.
SPARK-19570: Allow to disable hive in pyspark shell.
SPARK-19631: OutputCommitCoordinator should not allow commits for already failed tasks.
SPARK-19688: Not to read `spark.yarn.credentials.file` from checkpoint.
SPARK-19727: Fix for round function that modifies original column.
SPARK-19775: Remove an obsolete `partitionBy().insertInto()` test case.
SPARK-19796: taskScheduler fails serializing long statements received by thrift server.
SPARK-19812: YARN shuffle service fails to relocate recovery DB acro….
SPARK-19868: Conflicting TaskSetManagers lead to Spark being stopped.
SPARK-20211: Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0.
SPARK-20217: Executor should not fail stage if killed task throws non-interrupted exception.
SPARK-20250: Improper OOM error when a task has been killed while spilling data.
SPARK-20275: Do not display "Completed" column for in-progress applications.
SPARK-20341: Support BigInt's value that does not fit in long value range.
SPARK-20342: Update task accumulators before sending task end event.
SPARK-20358: Executors failing stage on interrupted exception thrown by cancelled tasks.
SPARK-20393: Strengthen Spark to prevent XSS vulnerabilities.
SPARK-20405: Dataset.withNewExecutionId should be private.
SPARK-20412: Throw ParseException from visitNonOptionalPartitionSpec instead of returning null values.
SPARK-20426: OneForOneStreamManager occupies too much memory.
SPARK-20439: Fix Catalog API listTables and getTable when failed to fetch table metadata.
SPARK-20459: JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException.
SPARK-20496: Bug in KafkaWriter Looks at Unanalyzed Plans.
SPARK-20517: Fix broken history UI download link.
SPARK-20540: Fix unstable executor requests.
SPARK-20546: spark-class gets syntax error in posix mode.
SPARK-20555: Fix mapping of Oracle DECIMAL types to Spark types in read path.
SPARK-20558: clear InheritableThreadLocal variables in SparkContext when stopping it.
SPARK-20566: ColumnVector should support `appendFloats` for array.
SPARK-20603: Set default number of topic partitions to 1 to reduce the load.
SPARK-20613: Remove excess quotes in Windows executable.
SPARK-20615: SparseVector.argmax throws IndexOutOfBoundsException.
SPARK-20616: RuleExecutor logDebug of batch results should show diff to start of batch.
SPARK-20627: Drop the hadoop distribution name from the Python version.
SPARK-20631: LogisticRegression._checkThresholdConsistency should use values not Params.
SPARK-20665: "Bround" and "Round" functions return NULL.
SPARK-20685: Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg.
SPARK-20686: PropagateEmptyRelation incorrectly handles aggregate without grouping.
SPARK-20687: mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix.
SPARK-20688: correctly check analysis for scalar sub-queries.
SPARK-20705: The sort function cannot be used in the master page when you use Firefox or Google Chrome.
SPARK-20735: Enable cross join in TPCDSQueryBenchmark.
SPARK-20756: yarn-shuffle jar references unshaded guava.
SPARK-20759: SCALA_VERSION in _config.yml should be consistent with pom.xml.
SPARK-20763: The `month` and `day` functions return unexpected values.
SPARK-20769: Incorrect documentation for using Jupyter notebook.
SPARK-20781: the location of Dockerfile in docker.properties.template is wrong.
SPARK-20796: the location of start-master.sh in spark-standalone.md is wrong.
SPARK-20798: GenerateUnsafeProjection should check if a value is null before calling the getter.
SPARK-20843: Add a config to set driver terminate timeout.
SPARK-20848: Shutdown the pool after reading parquet files.
SPARK-20862: Avoid passing float to ndarray.reshape in LogisticRegressionModel.
SPARK-20868: UnsafeShuffleWriter should verify the position after FileChannel.transferTo.
SPARK-20874: Add Structured Streaming Kafka Source to examples project.
SPARK-20914: Javadoc contains code that is invalid.
SPARK-20920: ForkJoinPool pools are leaked when writing hive tables with many partitions.
SPARK-20922: Add whitelist of classes that can be deserialized by the launcher.
SPARK-20922: Don't use Java 8 lambdas in older branches.
SPARK-20940: Replace IllegalAccessError with IllegalStateException.
SPARK-20974: we should run REPL tests if SQL module has code changes.
SPARK-21041: SparkSession.range should be consistent with SparkContext.range.
SPARK-21064: Fix the default value bug in NettyBlockTransferServiceSuite.
SPARK-21072: TreeNode.mapChildren should only apply to the children node.
SPARK-21083: Store zero size and row count when analyzing empty table.
SPARK-21114: Fix test failure in Spark 2.1/2.0 due to name mismatch.
SPARK-21123: Options for file stream source are in a wrong table - version to fix 2.1.
SPARK-21138: Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different.
SPARK-21159: Don't try to connect to launcher in standalone cluster mode.
SPARK-21167: Decode the path generated by File sink to handle special characters.
SPARK-21176: Limit number of selector threads for admin ui proxy servlets to 8.
SPARK-21181: Release byteBuffers to suppress netty error messages.
SPARK-21203: Fix wrong results of insertion of Array of Struct.
SPARK-21306: For branch 2.1, OneVsRest should support setWeightCol.
SPARK-21312: correct offsetInBytes in UnsafeRow.writeToStream.
SPARK-21330: Bad partitioning does not allow to read a JDBC table with extreme values on the partition column.
SPARK-21332: Incorrect result type inferred for some decimal expressions.
SPARK-21345: SparkSessionBuilderSuite should clean up stopped sessions.
SPARK-21376: Token is not renewed in yarn client process in cluster mode.
SPARK-21441: Incorrect Codegen in SortMergeJoinExec results in failures in some cases.
SPARK-21446: Fix setAutoCommit never executed.
SPARK-21522: Fix flakiness in LauncherServerSuite.
SPARK-21555: RuntimeReplaceable should be compared semantically by its canonicalized child.
SPARK-21588: SQLContext.getConf(key, null) should return null.

HDP 2.6.0 provided Spark 1.6.3 and the following Apache patches:

SPARK-6717: Clear shuffle files after checkpointing in ALS.
SPARK-6735: Add window based executor failure tracking mechanism for long running service.
SPARK-6847: Stack overflow on updateStateByKey which followed by a stream with checkpoint set.
SPARK-7481: Add spark-cloud module to pull in aws+azure object store FS accessors; test integration.
SPARK-7889: Jobs progress of apps on complete page of HistoryServer shows uncompleted.
SPARK-10582: When using dynamic executor allocation, if the AM fails, a new AM is started, but the new AM does not allocate executors to the driver.
SPARK-11137: Make StreamingContext.stop() exception-safe.
SPARK-11314: Add service API and test service for Yarn Cluster schedulers.
SPARK-11315: Add YARN extension service to publish Spark events to YARN timeline service (part of SPARK-1537).
SPARK-11323: Add History Service Provider to service application histories from YARN timeline server (part of SPARK-1537).
SPARK-11627: Spark Streaming backpressure mechanism has no initial rate limit, so receivers receive data at the maximum speed, which might cause an OOM exception.
SPARK-12001: StreamingContext cannot be completely stopped if the stop() is interrupted.
SPARK-12009: Avoid re-allocating YARN containers while the driver wants to stop all executors.
SPARK-12142: Can't request executor when container allocator is not ready.
SPARK-12241: Improve failure reporting in Yarn client obtainTokenForHBase().
SPARK-12353: Wrong output for countByValue and countByValueAndWindow.
SPARK-12513: SocketReceiver hang in Netcat example.
SPARK-12523: Support long-running of the Spark on HBase and hive metastore.
SPARK-12920: Fix high CPU usage in Spark thrift server with concurrent users.
SPARK-12948: OrcRelation uses HadoopRDD which can broadcast conf objects frequently.
SPARK-12967: NettyRPC races with SparkContext.stop() and throws exception.
SPARK-12998: Enable OrcRelation even when connecting via spark thrift server.
SPARK-13021: Fail fast when custom RDDs violate RDD.partition's API contract.
SPARK-13117: WebUI should use the local ip not 0.0.0.0.
SPARK-13278: Launcher fails to start with JDK 9 EA.
SPARK-13308: ManagedBuffers passed to OneToOneStreamManager need to be freed in non error cases.
SPARK-13360: pyspark related environment variable is not propagated to driver in yarn-cluster mode.
SPARK-13468: Fix a corner case where the page UI should show DAG but it doesn't show.
SPARK-13478: Use real user when fetching delegation tokens.
SPARK-13885: Fix attempt id regression for Spark running on Yarn.
SPARK-13902: Make DAGScheduler not to create duplicate stage.
SPARK-14062: Fix log4j and upload metrics.properties automatically with distributed cache.
SPARK-14091: Consider improving performance of SparkContext.getCallSite().
SPARK-15067: YARN executors are launched with fixed perm gen size.
SPARK-1537: Add integration with Yarn's Application Timeline Server.
SPARK-15705: Change the default value of spark.sql.hive.convertMetastoreOrc to false.
SPARK-15844: HistoryServer doesn't come up if spark.authenticate = true.
SPARK-15990: Add rolling log aggregation support for Spark on yarn.
SPARK-16110: Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set.
SPARK-19033: HistoryServer still uses old ACLs even if ACLs are updated.
SPARK-19306: Fix inconsistent state in DiskBlockObjectWriter when exception occurred.
SPARK-19970: Table owner should be USER instead of PRINCIPAL in kerberized clusters.
