Troubleshooting with Azure Application Insights Using KQL

Users at least get access to Azure Application Insights even within minimum access level if you are requesting access to DXP management portals at https://world.optimizely.com/service-request-forms/access-to-dxp-portal/. Azure Application Insights is a powerful application performance monitoring (APM) service that helps you track the health, performance, and usage of your applications in real time. It offers end‑to‑end observability through telemetry such as requests, dependencies, exceptions, traces, and custom events, enabling developers to quickly detect anomalies, diagnose issues, and understand user behaviour. With built‑in dashboards, analytics, and seamless integration with Azure Monitor, it provides deep insights that improve reliability, optimize performance, and enhance the overall user experience. There will be moments when provided dash board won't be just enough, Developer will need to do queries to telemetries, Kusto Query Language (KQL) is the engine that unlocks that data. KQL gives you the flexibility to slice and analyze telemetry at scale.

KQL Basics

Application Insights stores telemetry in tables such as:

requests -Incoming HTTP requests - name, url, duration, resultCode, success
dependencies - External calls (DB, APIs) - type, target, name, duration, success
exceptions - Application exceptions - type, message, outerMessage, problemId
traces - Custom log messages - message, severityLevel
customMetrics - Custom performance - metrics name, value
pageViews - Client-side page views - name, url, duration
traces
availabilityResults
customEvents

Common Operators

where - Filter rows
project - Select specific columns
summarize - Aggregations (count, avg, percentiles)
extend - Create calculated fields
join - Combine tables
order by - Sort results
take - Limit rows

Limitations

Query Timeout: 3 minutes maximum execution time, Large datasets may require sampling or time filtering.
Result Set Size: Maximum 500,000 rows returned
Data Retention: 90 days for detailed data (configurable up to 730 days), Long-term retention requires exporting to Log Analytics or Storage.
Concurrent Queries: Throttling applied under heavy loads
Sampling: Application Insights may apply adaptive sampling. This can affect counts unless you use itemCount.
No Data Modification: KQL is read-only. You cannot update or delete telemetry.
Join Limitations: Joining large tables can be slow or truncated.
Query Complexity: Some advanced operations (e.g., recursive logic) are not supported.

Best Practices for Monitoring & Troubleshooting

Use Time Filters Early
requests
| where timestamp > ago(1h)
This improves performance dramatically.

Project Only What You Need
| project name, duration
Reduces memory and speeds up queries.

Use Summaries for High-Volume Data
Instead of returning thousands of rows:
| summarize count() by name

Use itemCount When Sampling Is Enabled
| summarize total = sum(itemCount)

Correlate Telemetry Using Operation Id
requests
| join traces on operation_Id

KQL References for future refernce

// Basic query pattern
TableName
| where Timestamp > ago(1d)  // Filtering
| where Operation_Name == "HomeController.Index"
| project Column1, Column2    // Selecting columns
| summarize Count = count() by Column1  // Aggregation
| order by Count desc        // Sorting
| limit 10                   // Limiting results

Filtering with where:
requests
| where timestamp > ago(24h)
| where success == false
| where duration > 1000  // Duration in milliseconds

Selecting columns with project:
requests
| project timestamp, name, url, duration, resultCode
| take 100

Aggregation with summarize:
requests
| summarize 
    AvgDuration = avg(duration),
    RequestCount = count(),
    FailedCount = countif(success == false)
    by name, bin(timestamp, 1h)

Top 5 Slowest Requests:
requests
| where timestamp > ago(7d)
| top 5 by duration desc
| project timestamp, name, duration, url, resultCode

Failure Rate Trend:
requests
| where timestamp > ago(30d)
| summarize 
    TotalRequests = count(),
    FailedRequests = countif(success == false)
    by bin(timestamp, 1d)
| extend FailureRate = FailedRequests * 100.0 / TotalRequests
| order by timestamp asc
| render timechart

Exception Analysis:
exceptions
| where timestamp > ago(1d)
| summarize Count = count() by type, innermostMessage
| order by Count desc
| take 20

Alert on Patterns, Not Single Events
Use:
avg(duration)
percentile(duration, 95)
count()

Explore Requests
requests
| summarize count(), avg(duration) by name

Find Failing Dependencies
dependencies
| where success == false
| summarize count() by target, type

Trace-Level Debugging
traces
| where severityLevel >= 3
| order by timestamp desc

Exception Analysis
exceptions
| summarize count() by type, innermostMessage

User Behavior
customEvents
| summarize count() by name

End-to-End Transaction Diagnostics
requests
| where operation_Id == "<operation-id>"

Using parse to Extract Data
traces
| parse message with "User:" userId ", Action:" action

Using bin() for Time Bucketing
requests
| summarize count() by bin(timestamp, 5m)

Using join for Correlation
requests
| join kind=leftouter dependencies on operation_Id

Using make-series for Time-Series Analysis
requests
| make-series count() on timestamp in range(ago(1d), now(), 1h)

Dec 21, 2025

Comments

Please login to comment.