Determining the Cause of Slow and Failed Queries
Identifying the cause of slow query run times and queries that fail to complete.
Steps with examples from a Spark engine are included that explain how to further investigate and troubleshoot the cause of a slow and failed query.
In a supported browser, log in to the web UI by doing
- In the web browser URL field, enter the URL that you were given by your system administrator and press Enter.
- When the Log in page opens, enter your user name and password access credentials.
- Click Log in.
In the Clusters page do one of the following:
- In the Search field, enter the name of the cluster whose workloads you want to analyze.
- From the Cluster Name column, locate and click on the name of the cluster whose workloads you want to analyze.
- From the time-range list in the Cluster Summary page, select a time period that meets your requirements.
From the Trend widget, select the tab of an engine whose
jobs you wish to analyze and then click its Total Jobs
The engine's Jobs page opens.
From the Health Check list in the Jobs page, select
Task Wait Time, which filters and displays a list of
jobs with longer than average wait times before the process was executed.
Display more details by selecting a job's name from the
Job column and then clicking the Health
The Baseline Health checks are displayed.
From the Health Checks panel on the left, click the
Task Wait Time health check, which opens a panel that
describes the health check, displays information about the possible causes, and
lists recommended solutions.
In the following example, the long wait time occurred in Stage-5 of the job process due to insufficient resources. The Recommendation section lists items for you to complete that may resolve the problem and the specific outlier tasks that produced the unusual results:
To display more details about why this job is experiencing longer than average
wait times, click one of the tasks listed under Outlier
In the following example, the Task Metrics section shows higher than average criteria measurement results and the Task Details reveal an ExecutorLostFailure error. This indicates a probable memory issue, where the container is exceeding the memory limits. In this case, more details maybe found by clicking Full error log and reviewing the log: