Spark integration best practices
It is best to avoid multiple Kudu clients per cluster.
A common Kudu-Spark coding error is instantiating extra KuduClient objects. In
kudu-spark, a KuduClient is owned by the KuduContext. Spark
application code should not create another KuduClient connecting to the same
cluster. Instead, application code should use the KuduContext to access a
KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in
the logs of the master being overloaded by many GetTableLocations or
GetTabletLocations requests coming from different clients, usually around the
same time. This symptom is especially likely in Spark Streaming code, where creating a
KuduClient per task will result in periodic waves of master requests from new
clients.
