Determining the Cause of Slow and Failed Queries

This topic describes how to determine the cause of slow query run times and of queries that fail to complete.

The steps include examples from a Spark engine that show how to investigate and troubleshoot the cause of a slow or failed query.

  1. In a supported browser, log in to the web UI by doing the following:
    1. In the web browser URL field, enter the URL that you were given by your system administrator and press Enter.
    2. When the Log in page opens, enter your user name and password.
    3. Click Log in.
  2. In the Clusters page, do one of the following:
    • In the Search field, enter the name of the cluster whose workloads you want to analyze.
    • From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
  3. From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
  4. From the Trend widget, select the tab of the engine whose jobs you want to analyze and then click its Total Jobs value.
    The engine's Jobs page opens.
  5. From the Health Check list in the Jobs page, select Task Wait Time. This filters the list to display jobs whose tasks waited longer than average before they were executed.


  6. Display more details by selecting a job's name from the Job column and then clicking the Health Checks tab.
    The Baseline Health checks are displayed.
  7. From the Health Checks panel on the left, click the Task Wait Time health check. This opens a panel that describes the health check, provides information about possible causes, and lists recommended solutions.
    In the following example, the long wait time occurred in Stage-5 of the job because of insufficient resources. The Recommendation section lists actions that may resolve the problem, along with the specific outlier tasks that produced the unusual results (see the first configuration sketch after this procedure):


  8. To display more details about why this job is experiencing longer-than-average wait times, click one of the tasks listed under Outlier Tasks.
    In the following example, the Task Metrics section shows measurements that are higher than average, and the Task Details reveal an ExecutorLostFailure error. This indicates a probable memory issue, where the container is exceeding its memory limits. In this case, more details may be found by clicking Full error log and reviewing the log (see the second configuration sketch after this procedure):
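
When a Task Wait Time health check points to insufficient resources (step 7), a common remedy is to give the job more executor capacity. The following PySpark sketch is illustrative only and is not produced by the web UI; the application name and every property value are assumptions that you would tune to your own cluster.

    # Hypothetical PySpark configuration sketch for long task wait times
    # caused by insufficient resources. The values are examples only; choose
    # them based on the memory and cores available in the cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("task-wait-time-tuning")  # assumed application name
        # Allow Spark to add executors when tasks queue up waiting for slots.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        # More cores per executor means more task slots per executor.
        .config("spark.executor.cores", "4")
        .getOrCreate()
    )

The same properties can also be passed to spark-submit with --conf. The point is only that task wait time usually improves when more task slots are available; the exact values shown here are not a recommendation for any specific job.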
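
An ExecutorLostFailure caused by a container exceeding its memory limits (step 8) is usually addressed by raising the executor memory or memory overhead, or by lowering the number of concurrent tasks per executor. The sketch below shows one way this could be expressed in PySpark; the application name and property values are assumptions, not settings taken from the example job.

    # Hypothetical PySpark sketch for an ExecutorLostFailure caused by a
    # container exceeding its memory limits. The values are examples only;
    # compare them with the container sizes reported in the full error log.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-memory-tuning")  # assumed application name
        # Heap memory available to each executor JVM.
        .config("spark.executor.memory", "8g")
        # Non-heap memory added to the container request; a value that is too
        # small is a frequent cause of containers being killed by the resource
        # manager for exceeding memory limits.
        .config("spark.executor.memoryOverhead", "2g")
        # Fewer concurrent tasks per executor reduces peak memory pressure.
        .config("spark.executor.cores", "2")
        .getOrCreate()
    )

Note that executor memory settings take effect only when the application starts, so they are normally supplied when the job is submitted rather than changed in an already running session.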