Spark
This release provides Spark 2.3.2 and no additional Apache patches.
In addition, this release provides Spark 2.3.0 and the following Apache patches:
- SPARK-13587: Support virtualenv in pyspark.
- SPARK-19964: Avoid reading from remote repos in SparkSubmitSuite.
- SPARK-22882: ML test for structured streaming: ml.classification.
- SPARK-22915: Streaming tests for spark.ml.feature, from N to Z.
- SPARK-23020: Fix another race in the in-process launcher test.
- SPARK-23040: Returns interruptible iterator for shuffle reader.
- SPARK-23173: Avoid creating corrupt parquet files when loading data from JSON.
- SPARK-23264: Fix scala.MatchError in literals.sql.out.
- SPARK-23288: Fix output metrics with parquet sink.
- SPARK-23329: Fix documentation of trigonometric functions.
- SPARK-23406: Enable stream-stream self-joins for branch-2.3.
- SPARK-23434: Spark should not warn `metadata directory` for a HDFS file path.
- SPARK-23436: Infer partition as Date only if it can be casted to Date.
- SPARK-23457: Register task completion listeners first in ParquetFileFormat.
- SPARK-23462: improve missing field error message in `StructType`.
- SPARK-23490: Check storage.locationUri with existing table in CreateTable.
- SPARK-23524: Big local shuffle blocks should not be checked for corruption.
- SPARK-23525: Support ALTER TABLE CHANGE COLUMN COMMENT for external hive table.
- SPARK-23553: Tests should not assume the default value of `spark.sql.sources.default`.
- SPARK-23569: Allow pandas_udf to work with python3 style type-annotated functions (see the sketch after this list).
- SPARK-23570: Add Spark 2.3.0 in HiveExternalCatalogVersionsSuite.
- SPARK-23598: Make methods in BufferedRowIterator public to avoid runtime error for a large query.
- SPARK-23599: Add a UUID generator from Pseudo-Random Numbers.
- SPARK-23599: Use RandomUUIDGenerator in Uuid expression.
- SPARK-23601: Remove .md5 files from release.
- SPARK-23608: Add synchronization in SHS between attachSparkUI and detachSparkUI functions to avoid concurrent modification issue to Jetty Handlers.
- SPARK-23614: Fix incorrect reuse exchange when caching is used.
- SPARK-23623: Avoid concurrent use of cached consumers in CachedKafkaConsumer (branch-2.3).
- SPARK-23624: Revise doc of method pushFilters in Datasource V2.
- SPARK-23628: calculateParamLength should not return 1 + num of expressions.
- SPARK-23630: Allow user's hadoop conf customizations to take effect.
- SPARK-23635: Spark executor env variable is overwritten by same name AM env variable.
- SPARK-23637: Yarn might allocate more resource if a same executor is killed multiple times.
- SPARK-23639: Obtain token before init metastore client in SparkSQL CLI.
- SPARK-23642: AccumulatorV2 subclass isZero scaladoc fix.
- SPARK-23644: Use absolute path for REST call in SHS.
- SPARK-23645: Add docs RE `pandas_udf` with keyword args.
- SPARK-23649: Skipping chars disallowed in UTF-8.
- SPARK-23658: InProcessAppHandle uses the wrong class in getLogger.
- SPARK-23660: Fix exception in yarn cluster mode when application ended fast.
- SPARK-23670: Fix memory leak on SparkPlanGraphWrapper.
- SPARK-23671: Fix condition to enable the SHS thread pool.
- SPARK-23691: Use sql_conf util in PySpark tests where possible.
- SPARK-23695: Fix the error message for Kinesis streaming tests.
- SPARK-23706: spark.conf.get(value, default=None) should produce None in PySpark.
- SPARK-23728: Fix ML tests with expected exceptions running streaming tests.
- SPARK-23729: Respect URI fragment when resolving globs.
- SPARK-23759: Unable to bind Spark UI to specific host name / IP.
- SPARK-23760: CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly.
- SPARK-23769: Remove comments that unnecessarily disable Scalastyle check.
- SPARK-23788: Fix race in StreamingQuerySuite.
- SPARK-23802: PropagateEmptyRelation can leave query plan in unresolved state.
- SPARK-23806: Broadcast.unpersist can cause fatal exception when used with dynamic allocation.
- SPARK-23808: Set default Spark session in test-only spark sessions.
- SPARK-23809: Active SparkSession should be set by getOrCreate.
- SPARK-23816: Killed tasks should ignore FetchFailures.
- SPARK-23822: Improve error message for Parquet schema mismatches.
- SPARK-23823: Keep origin in transformExpression.
- SPARK-23827: StreamingJoinExec should ensure that input data is partitioned into specific number of partitions.
- SPARK-23838: Running SQL query is displayed as "completed" in SQL tab.
- SPARK-23881: Fix flaky test JobCancellationSuite."interruptible iterator of shuffle reader".
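As an illustration of the pandas_udf behavior restored by SPARK-23569 above, here is a minimal, hypothetical sketch of a python3 type-annotated function wrapped with `pandas_udf`. The function, column, and application names are invented, and the snippet assumes PySpark 2.3 with PyArrow installed.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("pandas-udf-annotations").getOrCreate()

# A python3-style type-annotated function; before SPARK-23569, wrapping such a
# function with pandas_udf could fail during argument inspection.
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

plus_one_udf = pandas_udf(plus_one, returnType=LongType())

df = spark.range(3)
df.select(plus_one_udf(df["id"]).alias("id_plus_one")).show()
```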
HDP 2.6.4 provided Spark 1.6.3 with the following Apache patches:
- SPARK-14234: Executor crashes for TaskRunner thread interruption.
In addition, HDP 2.6.4 provided Spark 2.2.0 and the following Apache patches:
- SPARK-17321: Avoid writing shuffle metadata to disk if NM recovery is disabled.
- SPARK-17902: Revive stringsAsFactors option for collect() in SparkR.
- SPARK-17920: Support writing to Hive table which uses Avro schema url 'avro.schema.url'.
- SPARK-19580: Support for avro.schema.url while writing to hive table.
- SPARK-19606: fix mesos break.
- SPARK-19606: Support constraints in spark-dispatcher.
- SPARK-19611: set dataSchema correctly in HiveMetastoreCatalog.convertToLogicalRelation.
- SPARK-19644: Clean up Scala reflection garbage after creating Encoder (branch-2.2).
- SPARK-19727: Fix for round function that modifies original column.
- SPARK-19878: Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala.
- SPARK-21667: ConsoleSink should not fail streaming query with checkpointLocation option.
- SPARK-21694: Reduce max iterations in Linear SVM test in R to speed up AppVeyor build.
- SPARK-21720: Fix 64KB JVM bytecode limit problem with AND or OR.
- SPARK-21807: Override ++ operation in ExpressionSet to reduce clone time.
- SPARK-21936: backward compatibility test framework for HiveExternalCatalog.
- SPARK-21991: Fix java lint.
- SPARK-21991: Fix race condition in LauncherServer#acceptConnections.
- SPARK-22211: Remove incorrect FOJ limit pushdown.
- SPARK-22227: DiskBlockManager.getAllBlocks now tolerates temp files.
- SPARK-22243: spark.yarn.jars should reload from config when checkpoint recovery.
- SPARK-22249: isin with empty list throws exception on cached DataFrame (see the sketch after this list).
- SPARK-22281: Handle R method breaking signature changes.
- SPARK-22284: Fix 64KB JVM bytecode limit problem in calculating hash for nested structs.
- SPARK-22287: SPARK_DAEMON_MEMORY not honored by MesosClusterD….
- SPARK-22291: Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL.
- SPARK-22294: Reset spark.driver.bindAddress when starting a Checkpoint.
- SPARK-22306: alter table schema should not erase the bucketing metadata at hive side.
- SPARK-22315: Warn if SparkR package version doesn't match SparkContext.
- SPARK-22319: call loginUserFromKeytab before accessing hdfs.
- SPARK-22327: check for version warning.
- SPARK-22328: ClosureCleaner should not miss referenced superclass fields.
- SPARK-22332: Fix NaiveBayes unit test occasionally failing (caused by test dataset not being deterministic).
- SPARK-22333: timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference.
- SPARK-22344: clean up install dir if running test as source package.
- SPARK-22344: Set java.io.tmpdir for SparkR tests.
- SPARK-22355: Dataset.collect is not threadsafe.
- SPARK-22356: data source table should support overlapped columns between data and partition schema.
- SPARK-22377: Use /usr/sbin/lsof if lsof does not exist in release-build.sh.
- SPARK-22403: Add optional checkpointLocation argument to StructuredKafkaWordCount example.
- SPARK-22417: Fix for createDataFrame from pandas.DataFrame with timestamp.
- SPARK-22417: Fix for createDataFrame from pandas.DataFrame with timestamp.
- SPARK-22429: Streaming checkpointing code does not retry after failure.
- SPARK-22442: ScalaReflection should produce correct field names for special characters.
- SPARK-22442: ScalaReflection should produce correct field names for special characters.
- SPARK-22464: No pushdown for Hive metastore partition predicates containing null-safe equality.
- SPARK-22469: Accuracy problem in comparison with string and numeric.
- SPARK-22471: SQLListener consumes much memory causing OutOfMemoryError.
- SPARK-22472: add null check for top-level primitive values.
- SPARK-22479: Exclude credentials from SaveIntoDataSourceCommand.simpleString.
- SPARK-22488: Fix the view resolution issue in the SparkSession internal table() API.
- SPARK-22490: Add PySpark doc for SparkSession.builder.
- SPARK-22494: Fix 64KB limit exception with Coalesce and AtleastNNonNulls.
- SPARK-22498: Fix 64KB JVM bytecode limit problem with concat.
- SPARK-22499: Fix 64KB JVM bytecode limit problem with least and greatest.
- SPARK-22500: Fix 64KB JVM bytecode limit problem with cast.
- SPARK-22501: Fix 64KB JVM bytecode limit problem with in.
- SPARK-22508: Fix 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create().
- SPARK-22511: Update maven central repo address.
- SPARK-22535: Sleep before killing the python worker in PythonRunner.MonitorThread (branch-2.2).
- SPARK-22538: SQLTransformer should not unpersist possibly cached input dataset.
- SPARK-22540: Ensure HighlyCompressedMapStatus calculates correct avgSize.
- SPARK-22544: FileStreamSource should use its own hadoop conf to call globPathIfNecessary.
- SPARK-22548: Incorrect nested AND expression pushed down to JDBC data source.
- SPARK-22549: Fix 64KB JVM bytecode limit problem with concat_ws.
- SPARK-22550: Fix 64KB JVM bytecode limit problem with elt.
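As a hedged, repro-style sketch of the issue class addressed by SPARK-22249 above (not code taken from the patch), the snippet below filters a cached DataFrame with an empty isin() list; with the fix applied this returns an empty result instead of throwing. All names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("isin-empty-list").getOrCreate()

df = spark.createDataFrame([(1,), (2,), (3,)], ["id"]).cache()
df.count()  # materialize the cached plan

# With SPARK-22249 applied, an empty isin() list yields no matches instead of
# raising an exception against the cached (in-memory) plan.
df.where(df["id"].isin([])).show()
```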
HDP 2.6.3 provided Spark 1.6.3 with no additional Apache patches. In addition, HDP 2.6.3 provided Spark 2.2.0 and the following Apache patches:
- SPARK-207: Remove hardcode FS scheme from Spark archive.
- SPARK-12717: Adding thread-safe broadcast pickle registry.
- SPARK-13656: Delete spark.sql.parquet.cacheMetadata from SQLConf and docs.
- SPARK-14387: Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc.
- SPARK-15648: Add TeradataDialect.
- SPARK-16060: Vectorized Orc Reader.
- SPARK-16628: OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files.
- SPARK-18061: Support spnego for ThriftServer thrift/http auth.
- SPARK-18136: Fix SPARK_JARS_DIR for Python pip install on Windows.
- SPARK-18355: Spark SQL fails to read data from a ORC hive table that has a new column added to it.
- SPARK-18464: support old table which doesn't store schema in table properties.
- SPARK-18608: Fix double caching.
- SPARK-18608: Fix double caching for PySpark OneVsRest.
- SPARK-19809: NullPointerException on empty ORC file.
- SPARK-20079: Fix client AM not allocating executors after restart.
- SPARK-20098: dataType's typeName fix.
- SPARK-20256: SessionState should be created more lazily.
- SPARK-20342: Update task accumulators before sending task end event.
- SPARK-20466: HadoopRDD#addLocalConfiguration throws NPE.
- SPARK-20682: Support a new faster ORC data source based on Apache ORC.
- SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core.
- SPARK-20904: Don't report task failures to driver during shutdown.
- SPARK-20981: Add new configuration spark.jars.repositories as equivalence of --repositories.
- SPARK-21069: Add rate source to programming guide.
- SPARK-21083: Store zero size and row count when analyzing empty table.
- SPARK-21128: Remove both "spark-warehouse" and "metastore_db" before listing files in R tests.
- SPARK-21170: Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted.
- SPARK-21219: Task retry occurs on same executor due to race co….
- SPARK-21228: InSet incorrect handling of structs.
- SPARK-21243: Limit no. of map outputs in a shuffle fetch.
- SPARK-21247: Type comparison should respect case-sensitive SQL conf.
- SPARK-21254: History UI performance fixes.
- SPARK-21267: Follow up to avoid referencing programming-guide redirector.
- SPARK-21267: Update Structured Streaming Documentation.
- SPARK-21272: SortMergeJoin LeftAnti does not update numOutputRows.
- SPARK-21300: ExternalMapToCatalyst should null-check map key prior to converting to internal value.
- SPARK-21306: OneVsRest should support setWeightCol.
- SPARK-21312: correct offsetInBytes in UnsafeRow.writeToStream.
- SPARK-21321: Spark very verbose on shutdown.
- SPARK-21330: Bad partitioning does not allow to read a JDBC table with extreme values on the partition column. Fix Java SimpleApp spark application.
- SPARK-21332: Incorrect result type inferred for some decimal expressions.
- SPARK-21333: Removed invalid joinTypes from javadoc of Dataset#joinWith.
- SPARK-21339: spark-shell --packages option does not add jars to classpath on windows.
- SPARK-21342: Fix DownloadCallback to work well with RetryingBlockFetcher.
- SPARK-21343: Refine the document for spark.reducer.maxReqSizeShuffleToMem.
- SPARK-21344: BinaryType comparison does signed byte array comparison.
- SPARK-21345: SparkSessionBuilderSuite should clean up stopped sessions.
- SPARK-21366: Add sql test for window functions.
- SPARK-21369: Don't use Scala Tuple2 in common/network-*.
- SPARK-21374: Fix reading globbed paths from S3 into DF with disabled FS cache.
- SPARK-21376: Fix yarn client token expire issue when cleaning the staging files in long running scenario.
- SPARK-21376: Fix yarn client token expire issue when cleaning the staging files in long running scenario.
- SPARK-21377: Make jars specified with --jars/--packages loadable in AM's credential renewer.
- SPARK-21383: Fix the YarnAllocator allocates more Resource.
- SPARK-21384: Spark + YARN fails with LocalFileSystem as default FS.
- SPARK-21414: Refine SlidingWindowFunctionFrame to avoid OOM.
- SPARK-21418: NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true.
- SPARK-21422: Depend on Apache ORC 1.4.0.
- SPARK-21434: Add pyspark pip documentation.
- SPARK-21441: Incorrect Codegen in SortMergeJoinExec results failures in some cases.
- SPARK-21445: Make IntWrapper and LongWrapper in UTF8String Serializable.
- SPARK-21446: Fix setAutoCommit never executed.
- SPARK-21447: Spark history server fails to render compressed.
- SPARK-21457: ExternalCatalog.listPartitions should correctly handle partition values with dot.
- SPARK-21464: Minimize deprecation warnings caused by ProcessingTime class.
- SPARK-21475: Use NIO's Files API to replace FileInputStream/FileOutputStream in some critical paths.
- SPARK-21477: Mark LocalTableScanExec's input data transient.
- SPARK-21494: Use correct app id when authenticating to external service.
- SPARK-21503: Spark UI shows incorrect task status for a killed Executor Process.
- SPARK-21508: Fix example code provided in Spark Streaming Documentation.
- SPARK-21522: Fix flakiness in LauncherServerSuite.
- SPARK-21523: update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search.
- SPARK-21538: Attribute resolution inconsistency in the Dataset API.
- SPARK-21546: dropDuplicates should ignore watermark when it's not a key.
- SPARK-21549: Respect OutputFormats with no output directory provided.
- SPARK-21549: Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs.
- SPARK-21551: Increase timeout for PythonRDD.serveIterator.
- SPARK-21555: RuntimeReplaceable should be compared semantically by its canonicalized child.
- SPARK-21563: Fix race condition when serializing TaskDescriptions and adding jars.
- SPARK-21565: Propagate metadata in attribute replacement.
- SPARK-21580: Integers in aggregation expressions are wrongly taken as group-by ordinal.
- SPARK-21588: SQLContext.getConf(key, null) should return null.
- SPARK-21593: Fix 2 rendering errors on configuration page.
- SPARK-21595: Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray.
- SPARK-21596: Ensure places calling HDFSMetadataLog.get check the return value.
- SPARK-21597: Fix a potential overflow issue in EventTimeStats.
- SPARK-21617: Store correct table metadata when altering schema in Hive metastore.
- SPARK-21621: Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called.
- SPARK-21647: Fix SortMergeJoin when using CROSS.
- SPARK-21648: Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided.
- SPARK-21656: spark dynamic allocation should not idle timeout executors when tasks still to run.
- SPARK-21663: test("remote fetch below max RPC message size") should call masterTracker.stop() in MapOutputTrackerSuite.
- SPARK-21681: fix bug of MLOR do not work correctly when featureStd contains zero (backport PR for 2.2).
- SPARK-21696: Fix a potential issue that may generate partial snapshot files.
- SPARK-21699: Remove unused getTableOption in ExternalCatalog.
- SPARK-21714: Avoiding re-uploading remote resources in yarn client mode.
- SPARK-21721: Clear FileSystem deleteOnExit cache when paths are successfully removed.
- SPARK-21721: Clear FileSystem deleteOnExit cache when paths are successfully removed.
- SPARK-21723: Fix writing LibSVM (key not found: numFeatures).
- SPARK-21738: Thriftserver doesn't cancel jobs when session is closed.
- SPARK-21739: Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType.
- SPARK-21787: Support for pushing down filters for date types in ORC.
- SPARK-21791: ORC should support column names with dot.
- SPARK-21798: No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server.
- SPARK-21805: Disable R vignettes code on Windows.
- SPARK-21807: Override ++ operation in ExpressionSet to reduce clone time.
- SPARK-21818: Fix bug of MultivariateOnlineSummarizer.variance generate negative result.
- SPARK-21826: outer broadcast hash join should not throw NPE.
- SPARK-21831: Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite.
- SPARK-21834: Incorrect executor request in case of dynamic allocation.
- SPARK-21890: Credentials not being passed to add the tokens.
- SPARK-21901: Define toString for StateOperatorProgress.
- SPARK-21907: NullPointerException in UnsafeExternalSorter.spill().
- SPARK-21912: ORC/Parquet table should not create invalid column names.
- SPARK-21915: Model 1 and Model 2 ParamMaps Missing.
- SPARK-21924: Update structured streaming programming guide doc.
- SPARK-21925: Update trigger interval documentation in docs with behavior change in Spark 2.2.
- SPARK-21928: Set classloader on SerializerManager's private kryo.
- SPARK-21929: Support `ALTER TABLE table_name ADD COLUMNS(..)` for ORC data source (see the sketch after this list).
- SPARK-21936: backward compatibility test framework for HiveExternalCatalog.
- SPARK-21946: fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite.
- SPARK-21950: pyspark.sql.tests.SQLTests2 should stop SparkContext.
- SPARK-21953: Show both memory and disk bytes spilled if either is present.
- SPARK-21954: JacksonUtils should verify MapType's value type instead of key type.
- SPARK-21976: Fix wrong documentation for Mean Absolute Error.
- SPARK-21980: References in grouping functions should be indexed with semanticEquals.
- SPARK-21985: PairDeserializer is broken for double-zipped RDDs.
- SPARK-22043: Improves error message for show_profiles and dump_profiles.
- SPARK-22047: HiveExternalCatalogVersionsSuite.
- SPARK-22047: ignore HiveExternalCatalogVersionsSuite.
- SPARK-22052: Incorrect Metric assigned in MetricsReporter.scala.
- SPARK-22072: Improve release build scripts.
- SPARK-22076: Expand.projections should not be a Stream.
- SPARK-22083: Release locks in MemoryStore.evictBlocksToFreeSpace.
- SPARK-22092: Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data.
- SPARK-22094: processAllAvailable should check the query state.
- SPARK-22107: Change as to alias in python quickstart.
- SPARK-22109: Resolves type conflicts between strings and timestamps in partition column.
- SPARK-22120: TestHiveSparkSession.reset() should clean out Hive warehouse directory.
- SPARK-22129: Release script improvements.
- SPARK-22135: metrics in spark-dispatcher not being registered properly.
- SPARK-22140: Add TPCDSQuerySuite.
- SPARK-22141: Propagate empty relation before checking Cartesian products.
- SPARK-22143: Fix memory leak in OffHeapColumnVector.
- SPARK-22146: FileNotFoundException while reading ORC files containing special characters.
- SPARK-22158: convertMetastore should not ignore table property.
- SPARK-22161: Add Impala-modified TPC-DS queries.
- SPARK-22167: sparkr packaging issue allow zinc.
- SPARK-22178: Refresh Persistent Views by REFRESH TABLE Command.
- SPARK-22188: Adding security headers for preventing XSS, MitM and MIME sniffing.
- SPARK-22206: gapply in R can't work on empty grouping columns.
- SPARK-22218: spark shuffle services fails to update secret on app re-attempts.
- SPARK-22223: ObjectHashAggregate should not introduce unnecessary shuffle.
- SPARK-22252: FileFormatWriter should respect the input query schema.
- SPARK-22271: mean overflows and returns null for some decimal variables.
- SPARK-22273: Fix key/value schema field names in HashMapGenerators.
- SPARK-22300: Update ORC to 1.4.1.
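As an illustrative sketch of the DDL shape that SPARK-21929 above enables for ORC data source tables (the table and column names are invented; the snippet assumes a metastore-backed SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alter-table-add-columns").getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING) USING ORC")
# SPARK-21929 allows ADD COLUMNS against an ORC data source table.
spark.sql("ALTER TABLE events ADD COLUMNS (event_time TIMESTAMP)")
spark.sql("DESCRIBE TABLE events").show()
```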
HDP 2.6.2 provided Spark 1.6.3 and Spark 2.1.1 with no additional Apache patches.
HDP 2.6.1 provided Spark 1.6.3 with no additional Apache patches.
HDP 2.6.1 also provided Spark 2.1.1 and the following Apache patches:
- SPARK-4105: retry the fetch or stage if shuffle block is corrupt.
- SPARK-12717: Adding thread-safe broadcast pickle registry.
- SPARK-13931: Resolve stage hanging up problem in a particular case.
- SPARK-14658: when executor lost DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished.
- SPARK-16251: Flaky test: org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message.
- SPARK-16929: Speculation-related synchronization bottleneck in checkSpeculatableTasks.
- SPARK-17424: Fix unsound substitution bug in ScalaReflection.
- SPARK-17663: SchedulableBuilder should handle invalid data access via scheduler.allocation.file.
- SPARK-17685: Make SortMergeJoinExec's currentVars is null when calling createJoinKey.
- SPARK-18099: Spark distributed cache should throw exception if same file is specified to dropped in --files --archives.
- SPARK-18113: Use ask to replace askWithRetry in canCommit and make receiver idempotent.
- SPARK-18251: DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class.
- SPARK-18406: Race between end-of-task and completion iterator read lock release.
- SPARK-18535: Redact sensitive information.
- SPARK-18579: Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing.
- SPARK-18629: Fix numPartition of JDBCSuite Testcase.
- SPARK-18967: Locality preferences should be used when scheduling even when delay scheduling is turned off.
- SPARK-18986: ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator.
- SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore.
- SPARK-19104: Lambda variables in ExternalMapToCatalyst should be global.
- SPARK-19218: Fix SET command to show a result correctly and in a sorted order.
- SPARK-19219: Fix Parquet log output defaults.
- SPARK-19220: SSL redirect handler only redirects the server's root.
- SPARK-19263: DAGScheduler should avoid sending conflicting task set.
- SPARK-19263: Fix race in SchedulerIntegrationSuite.
- SPARK-19276: FetchFailures can be hidden by user (or sql) exception handling.
- SPARK-19539: Block duplicate temp table during creation.
- SPARK-19556: Broadcast data is not encrypted when I/O encryption is on.
- SPARK-19570: Allow to disable hive in pyspark shell.
- SPARK-19631: OutputCommitCoordinator should not allow commits for already failed tasks.
- SPARK-19688: Not to read `spark.yarn.credentials.file` from checkpoint.
- SPARK-19727: Fix for round function that modifies original column.
- SPARK-19775: Remove an obsolete `partitionBy().insertInto()` test case.
- SPARK-19796: taskScheduler fails serializing long statements received by thrift server.
- SPARK-19812: YARN shuffle service fails to relocate recovery DB acro….
- SPARK-19868: conflict TasksetManager lead to spark stopped.
- SPARK-20211: Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0.
- SPARK-20217: Executor should not fail stage if killed task throws non-interrupted exception.
- SPARK-20250: Improper OOM error when a task been killed while spilling data.
- SPARK-20250: Improper OOM error when a task been killed while spilling data.
- SPARK-20275: Do not display "Completed" column for in-progress applications.
- SPARK-20341: Support BigInt's value that does not fit in long value range.
- SPARK-20342: Update task accumulators before sending task end event.
- SPARK-20358: Executors failing stage on interrupted exception thrown by cancelled tasks.
- SPARK-20393: Strengthen Spark to prevent XSS vulnerabilities.
- SPARK-20405: Dataset.withNewExecutionId should be private.
- SPARK-20412: Throw ParseException from visitNonOptionalPartitionSpec instead of returning null values.
- SPARK-20426: OneForOneStreamManager occupies too much memory.
- SPARK-20439: Fix Catalog API listTables and getTable when failed to fetch table metadata.
- SPARK-20459: JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException.
- SPARK-20496: Bug in KafkaWriter Looks at Unanalyzed Plans.
- SPARK-20517: Fix broken history UI download link.
- SPARK-20540: Fix unstable executor requests.
- SPARK-20546: spark-class gets syntax error in posix mode.
- SPARK-20555: Fix mapping of Oracle DECIMAL types to Spark types in read path.
- SPARK-20558: clear InheritableThreadLocal variables in SparkContext when stopping it.
- SPARK-20566: ColumnVector should support `appendFloats` for array.
- SPARK-20603: Set default number of topic partitions to 1 to reduce the load.
- SPARK-20613: Remove excess quotes in Windows executable.
- SPARK-20615: SparseVector.argmax throws IndexOutOfBoundsException.
- SPARK-20616: RuleExecutor logDebug of batch results should show diff to start of batch.
- SPARK-20627: Drop the hadoop distribution name from the Python version.
- SPARK-20631: LogisticRegression._checkThresholdConsistency should use values not Params.
- SPARK-20665: "Bround" and "Round" function return NULL.
- SPARK-20685: Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg.
- SPARK-20686: PropagateEmptyRelation incorrectly handles aggregate without grouping.
- SPARK-20687: mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix.
- SPARK-20688: correctly check analysis for scalar sub-queries.
- SPARK-20705: The sort function can not be used in the master page when you use Firefox or Google Chrome.
- SPARK-20735: Enable cross join in TPCDSQueryBenchmark.
- SPARK-20756: yarn-shuffle jar references unshaded guava.
- SPARK-20759: SCALA_VERSION in _config.yml should be consistent with pom.xml.
- SPARK-20763: The function of `month` and `day` return the value which is not we expected.
- SPARK-20769: Incorrect documentation for using Jupyter notebook.
- SPARK-20781: the location of Dockerfile in docker.properties.template is wrong.
- SPARK-20796: the location of start-master.sh in spark-standalone.md is wrong.
- SPARK-20798: GenerateUnsafeProjection should check if a value is null before calling the getter.
- SPARK-20843: Add a config to set driver terminate timeout.
- SPARK-20848: Shutdown the pool after reading parquet files.
- SPARK-20848: Shutdown the pool after reading parquet files.
- SPARK-20862: Avoid passing float to ndarray.reshape in LogisticRegressionModel.
- SPARK-20868: UnsafeShuffleWriter should verify the position after FileChannel.transferTo.
- SPARK-20874: Add Structured Streaming Kafka Source to examples project.
- SPARK-20914: Javadoc contains code that is invalid.
- SPARK-20920: ForkJoinPool pools are leaked when writing hive tables with many partitions.
- SPARK-20922: Add whitelist of classes that can be deserialized by the launcher.
- SPARK-20922: Don't use Java 8 lambdas in older branches.
- SPARK-20940: Replace IllegalAccessError with IllegalStateException.
- SPARK-20974: we should run REPL tests if SQL module has code changes.
- SPARK-21041: SparkSession.range should be consistent with SparkContext.range.
- SPARK-21064: Fix the default value bug in NettyBlockTransferServiceSuite.
- SPARK-21072: TreeNode.mapChildren should only apply to the children node.
- SPARK-21083: Store zero size and row count when analyzing empty table.
- SPARK-21114: Fix test failure in Spark 2.1/2.0 due to name mismatch.
- SPARK-21123: Options for file stream source are in a wrong table - version to fix 2.1.
- SPARK-21138: Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different.
- SPARK-21159: Don't try to connect to launcher in standalone cluster mode.
- SPARK-21167: Decode the path generated by File sink to handle special characters.
- SPARK-21176: Limit number of selector threads for admin ui proxy servlets to 8.
- SPARK-21181: Release byteBuffers to suppress netty error messages.
- SPARK-21203: Fix wrong results of insertion of Array of Struct.
- SPARK-21306: For branch 2.1, OneVsRest should support setWeightCol.
- SPARK-21312: correct offsetInBytes in UnsafeRow.writeToStream.
- SPARK-21330: Bad partitioning does not allow to read a JDBC table with extreme values on the partition column (see the sketch after this list).
- SPARK-21332: Incorrect result type inferred for some decimal expressions.
- SPARK-21345: SparkSessionBuilderSuite should clean up stopped sessions.
- SPARK-21376: Token is not renewed in yarn client process in cluster mode.
- SPARK-21441: Incorrect Codegen in SortMergeJoinExec results failures in some cases.
- SPARK-21446: Fix setAutoCommit never executed.
- SPARK-21522: Fix flakiness in LauncherServerSuite.
- SPARK-21555: RuntimeReplaceable should be compared semantically by its canonicalized child.
- SPARK-21588: SQLContext.getConf(key, null) should return null.
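As a hedged sketch of the partitioned JDBC read pattern exercised by SPARK-21330 above (the JDBC URL, table, and bounds are invented and no real database is implied; assumes a SparkSession `spark` and an available JDBC driver):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-extreme-bounds").getOrCreate()

# Partitioning a read over a Long column with extreme lower/upper bounds is the
# scenario that could be mis-partitioned before the SPARK-21330 fix.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/sales")
      .option("dbtable", "orders")
      .option("partitionColumn", "order_id")
      .option("lowerBound", str(-(2 ** 63)))    # Long.MinValue
      .option("upperBound", str(2 ** 63 - 1))   # Long.MaxValue
      .option("numPartitions", "4")
      .load())

df.printSchema()
```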
HDP 2.6.0 provided Spark 1.6.3 and the following Apache patches:
- SPARK-6717: Clear shuffle files after checkpointing in ALS.
- SPARK-6735: Add window based executor failure tracking mechanism for long running service.
- SPARK-6847: Stack overflow on updateStateByKey which followed by a stream with checkpoint set.
- SPARK-7481: Add spark-cloud module to pull in aws+azure object store FS accessors; test integration.
- SPARK-7889: Jobs progress of apps on complete page of HistoryServer shows uncompleted.
- SPARK-10582: using dynamic-executor-allocation, if AM failed, the new AM will be started. But the new AM does not allocate executors to driver.
- SPARK-11137: Make StreamingContext.stop() exception-safe.
- SPARK-11314: Add service API and test service for Yarn Cluster schedulers.
- SPARK-11315: Add YARN extension service to publish Spark events to YARN timeline service (part of SPARK-1537).
- SPARK-11323: Add History Service Provider to service application histories from YARN timeline server (part of SPARK-1537).
- SPARK-11627: Spark Streaming backpressure mechanism has no initial rate limit; receivers receive data at the maximum speed, which might cause OOM exceptions (see the sketch after this list).
- SPARK-12001: StreamingContext cannot be completely stopped if the stop() is interrupted.
- SPARK-12009: Avoid re-allocate yarn container while driver want to stop all executors.
- SPARK-12142: Can't request executor when container allocator is not ready.
- SPARK-12241: Improve failure reporting in Yarn client obtainTokenForHBase().
- SPARK-12353: wrong output for countByValue and countByValueAndWindow.
- SPARK-12513: SocketReceiver hang in Netcat example.
- SPARK-12523: Support long-running of the Spark on HBase and hive metastore.
- SPARK-12920: Fix high CPU usage in Spark thrift server with concurrent users.
- SPARK-12948: OrcRelation uses HadoopRDD which can broadcast conf objects frequently.
- SPARK-12967: NettyRPC races with SparkContext.stop() and throws exception.
- SPARK-12998: Enable OrcRelation even when connecting via spark thrift server.
- SPARK-13021: Fail fast when custom RDD's violate RDD.partition's API contract.
- SPARK-13117: WebUI should use the local ip not 0.0.0.0.
- SPARK-13278: Launcher fails to start with JDK 9 EA.
- SPARK-13308: ManagedBuffers passed to OneToOneStreamManager need to be freed in non error cases.
- SPARK-13360: pyspark related environment variable is not propagated to driver in yarn-cluster mode.
- SPARK-13468: Fix a corner case where the page UI should show DAG but it doesn't show.
- SPARK-13478: Use real user when fetching delegation tokens.
- SPARK-13885: Fix attempt id regression for Spark running on Yarn.
- SPARK-13902: Make DAGScheduler not to create duplicate stage.
- SPARK-14062: Fix log4j and upload metrics.properties automatically with distributed cache.
- SPARK-14091: Consider improving performance of SparkContext.getCallSite().
- SPARK-15067: YARN executors are launched with fixed perm gen size.
- SPARK-1537: Add integration with Yarn's Application Timeline Server.
- SPARK-15705: Change the default value of spark.sql.hive.convertMetastoreOrc to false.
- SPARK-15844: HistoryServer doesn't come up if spark.authenticate = true.
- SPARK-15990: Add rolling log aggregation support for Spark on yarn.
- SPARK-16110: Can't set Python via spark-submit for YARN cluster mode when PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON are set.
- SPARK-19033: HistoryServer still uses old ACLs even if ACLs are updated.
- SPARK-19306: Fix inconsistent state in DiskBlockObjectWriter when exception occurred.
- SPARK-19970: Table owner should be USER instead of PRINCIPAL in kerberized clusters.
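As a hedged configuration sketch related to SPARK-11627 above: spark.streaming.backpressure.initialRate caps the receiver rate used before the backpressure estimator produces its first estimate. The application name and rate value below are illustrative, not recommendations.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enable backpressure and cap the rate used before the first rate estimate,
# which is the gap SPARK-11627 describes for receiver-based streams.
conf = (SparkConf()
        .setAppName("backpressure-initial-rate")
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "1000"))  # records/sec per receiver

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, batchDuration=5)
```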