Troubleshooting with Azure Application Insights Using KQL
Users at least get access to Azure Application Insights even within minimum access level if you are requesting access to DXP management portals at https://world.optimizely.com/service-request-forms/access-to-dxp-portal/. Azure Application Insights is a powerful application performance monitoring (APM) service that helps you track the health, performance, and usage of your applications in real time. It offers end‑to‑end observability through telemetry such as requests, dependencies, exceptions, traces, and custom events, enabling developers to quickly detect anomalies, diagnose issues, and understand user behaviour. With built‑in dashboards, analytics, and seamless integration with Azure Monitor, it provides deep insights that improve reliability, optimize performance, and enhance the overall user experience. There will be moments when provided dash board won't be just enough, Developer will need to do queries to telemetries, Kusto Query Language (KQL) is the engine that unlocks that data. KQL gives you the flexibility to slice and analyze telemetry at scale.
KQL Basics
Application Insights stores telemetry in tables such as:
- requests -Incoming HTTP requests - name, url, duration, resultCode, success
- dependencies - External calls (DB, APIs) - type, target, name, duration, success
- exceptions - Application exceptions - type, message, outerMessage, problemId
- traces - Custom log messages - message, severityLevel
- customMetrics - Custom performance - metrics name, value
- pageViews - Client-side page views - name, url, duration
- traces
- availabilityResults
- customEvents
Common Operators
- where - Filter rows
- project - Select specific columns
- summarize - Aggregations (count, avg, percentiles)
- extend - Create calculated fields
- join - Combine tables
- order by - Sort results
- take - Limit rows
Limitations
- Query Timeout: 3 minutes maximum execution time, Large datasets may require sampling or time filtering.
- Result Set Size: Maximum 500,000 rows returned
- Data Retention: 90 days for detailed data (configurable up to 730 days), Long-term retention requires exporting to Log Analytics or Storage.
- Concurrent Queries: Throttling applied under heavy loads
- Sampling: Application Insights may apply adaptive sampling. This can affect counts unless you use itemCount.
- No Data Modification: KQL is read-only. You cannot update or delete telemetry.
- Join Limitations: Joining large tables can be slow or truncated.
- Query Complexity: Some advanced operations (e.g., recursive logic) are not supported.
Best Practices for Monitoring & Troubleshooting
Use Time Filters Early
requests
| where timestamp > ago(1h)
This improves performance dramatically.
Project Only What You Need
| project name, duration
Reduces memory and speeds up queries.
Use Summaries for High-Volume Data
Instead of returning thousands of rows:
| summarize count() by name
Use itemCount When Sampling Is Enabled
| summarize total = sum(itemCount)
Correlate Telemetry Using Operation Id
requests
| join traces on operation_Id
KQL References for future refernce
// Basic query pattern
TableName
| where Timestamp > ago(1d) // Filtering
| where Operation_Name == "HomeController.Index"
| project Column1, Column2 // Selecting columns
| summarize Count = count() by Column1 // Aggregation
| order by Count desc // Sorting
| limit 10 // Limiting results
Filtering with where:
requests
| where timestamp > ago(24h)
| where success == false
| where duration > 1000 // Duration in milliseconds
Selecting columns with project:
requests
| project timestamp, name, url, duration, resultCode
| take 100
Aggregation with summarize:
requests
| summarize
AvgDuration = avg(duration),
RequestCount = count(),
FailedCount = countif(success == false)
by name, bin(timestamp, 1h)
Top 5 Slowest Requests:
requests
| where timestamp > ago(7d)
| top 5 by duration desc
| project timestamp, name, duration, url, resultCode
Failure Rate Trend:
requests
| where timestamp > ago(30d)
| summarize
TotalRequests = count(),
FailedRequests = countif(success == false)
by bin(timestamp, 1d)
| extend FailureRate = FailedRequests * 100.0 / TotalRequests
| order by timestamp asc
| render timechart
Exception Analysis:
exceptions
| where timestamp > ago(1d)
| summarize Count = count() by type, innermostMessage
| order by Count desc
| take 20
Alert on Patterns, Not Single Events
Use:
avg(duration)
percentile(duration, 95)
count()
Explore Requests
requests
| summarize count(), avg(duration) by name
Find Failing Dependencies
dependencies
| where success == false
| summarize count() by target, type
Trace-Level Debugging
traces
| where severityLevel >= 3
| order by timestamp desc
Exception Analysis
exceptions
| summarize count() by type, innermostMessage
User Behavior
customEvents
| summarize count() by name
End-to-End Transaction Diagnostics
requests
| where operation_Id == "<operation-id>"
Using parse to Extract Data
traces
| parse message with "User:" userId ", Action:" action
Using bin() for Time Bucketing
requests
| summarize count() by bin(timestamp, 5m)
Using join for Correlation
requests
| join kind=leftouter dependencies on operation_Id
Using make-series for Time-Series Analysis
requests
| make-series count() on timestamp in range(ago(1d), now(), 1h)
Comments