Spark integration best practices
It is best to avoid multiple Kudu clients per cluster.
A common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should access a KuduClient through the KuduContext, using KuduContext#syncClient.
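As a minimal sketch of the recommended pattern (the master address kudu-master:7051 and the table name my_table are placeholders, not values from this document):

```scala
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-example").getOrCreate()

// The KuduContext owns the one KuduClient for this cluster.
val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

// Correct: reuse the client the KuduContext already owns.
val client = kuduContext.syncClient
val table  = client.openTable("my_table")

// Anti-pattern: do not build a second client for the same cluster, e.g.
//   new KuduClient.KuduClientBuilder("kudu-master:7051").build()
```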
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task results in periodic waves of master requests from new clients.
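In a Spark Streaming job, the fix is to create the KuduContext once on the driver and let every batch reuse it, rather than constructing a client inside the per-batch or per-task closure. A sketch, again with placeholder master address, host/port, and table name:

```scala
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setAppName("kudu-streaming")
val ssc   = new StreamingContext(conf, Seconds(10))
val spark = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._

// Created once on the driver; its client is reused across all batches.
val kuduContext = new KuduContext("kudu-master:7051", ssc.sparkContext)

ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
  // Anti-pattern: building a new KuduClient here would produce a fresh
  // wave of GetTableLocations/GetTabletLocations requests every interval.
  val df = rdd.toDF("value")
  kuduContext.upsertRows(df, "my_table") // goes through the single client
}

ssc.start()
ssc.awaitTermination()
```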