Troubleshooting Web Application Performance
Every application is different, and so is every server. Only a thorough understanding of the application lets us decide how to improve its performance. No fixed set of rules guarantees that an application will perform well on a given server. The best we can do is work from a framework, or checklist, of things to review.
So let’s start by defining key performance pillars:
End user – performance of the site as the end user sees it. For example, do customers notice how long a page takes to render, and is it fast enough for them?
Throughput – per-second values (requests, bytes, and connections) measured at several levels: farm, server, website, application pool, and even the individual URL users are browsing.
Capacity – how much we can support in terms of users, connections, throughput, content, and so on.
Scaling – a way to fix performance problems.
After listing the key performance pillars, we need to measure them.
Measuring End User
Use (or test) the site the way end users would. Find out their connection speeds, whether client or proxy caching is in play, and which browsers (and versions) they use. Keep in mind that the application is developed and tested on a high-speed LAN, and conditions will not be the same once it is deployed on the internet, where most users are still on modems and other low-speed connections. The customer's experience will differ from yours.
The challenge is that we cannot observe every customer, so the best we can do is bucket them into groups: what percentage are modem users, how many are on high-speed connections, how many have direct T1 lines, and so on. How do we find those buckets, and which tools capture and report such data? One of them is Log Parser.
It is an extensive tool that helps us analyze IIS performance by parsing IIS logs. We will also use it to see which browsers are in use and how long requests take. It runs from the command line and expects a query, in a syntax very close to SQL, to run against a given log.
For example, here is a script that returns the browsers used to visit the website.
%logparserinstalldir%> logparser.exe GetBrowsers.sql
GetBrowsers.sql
SELECT TOP 10 to_int(mul(100.0, propcount(*))) AS Percent, count(*) AS TotalHits, cs(User-Agent) AS Browser FROM %logfile% GROUP BY Browser ORDER BY TotalHits DESC
Percent | Total Hits | Browser |
15 | 771 | MSIE 6.0 |
50 | 565 | MSIE 7.0 |
10 | 109 | Some other browser … |
Why does the browser matter? Each browser has its own caching behavior, rendering engine, and so on, so we may want to adjust the HTML we emit, along with its headers, for better performance.
Next, we can identify the important pages users are hitting and, for each, the average, minimum, and maximum time taken and the hit count.
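As a rough cross-check of the Log Parser query above, the same browser breakdown can be sketched in Python. This is only an illustration: the log excerpt is made up, and the helper names (`parse_w3c`, `browser_share`) are hypothetical, not part of Log Parser or IIS.

```python
from collections import Counter

# Hypothetical W3C extended log excerpt. Real IIS logs declare their columns
# in a "#Fields:" directive, which is what the parser below relies on.
SAMPLE_LOG = """\
#Fields: date time c-ip cs-uri-stem cs(User-Agent) sc-bytes time-taken
2007-01-01 00:00:01 10.0.0.1 /default.asp Mozilla/4.0+(compatible;+MSIE+6.0) 5120 230
2007-01-01 00:00:02 10.0.0.2 /default.asp Mozilla/4.0+(compatible;+MSIE+7.0) 5120 180
2007-01-01 00:00:03 10.0.0.3 /news.asp Mozilla/4.0+(compatible;+MSIE+6.0) 2048 95
2007-01-01 00:00:04 10.0.0.1 /news.asp Mozilla/4.0+(compatible;+MSIE+6.0) 2048 110
"""


def parse_w3c(text):
    """Yield one dict per request, keyed by the names in the #Fields directive."""
    fields = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            yield dict(zip(fields, line.split()))


def browser_share(text, top=10):
    """Return (percent, hits, user_agent) rows, most frequent first."""
    counts = Counter(row["cs(User-Agent)"] for row in parse_w3c(text))
    total = sum(counts.values())
    return [(int(100.0 * hits / total), hits, agent)
            for agent, hits in counts.most_common(top)]


for percent, hits, agent in browser_share(SAMPLE_LOG):
    print(percent, hits, agent)
```

Reading the column names from the `#Fields:` line rather than hard-coding positions keeps the sketch working when logging is configured with a different set of fields.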
SELECT to_lowercase(cs-uri-stem) AS URL, AVG(time-taken) AS AvgTime, MIN(time-taken) AS MinTime, MAX(time-taken) AS MaxTime, count(*) AS HitCount FROM %logfile% WHERE URL = '/default.asp' GROUP BY URL
The script below lists the clients that made requests, so that we can identify slow connections.
SELECT c-ip AS Client, AVG(sc-bytes), AVG(time-taken), to_int(mul(div(to_real(AVG(sc-bytes)), CASE AVG(time-taken) WHEN 0 THEN 1 ELSE AVG(time-taken) END), 1000)) AS BytesPerSecond, count(*) AS Hits FROM %logfile% WHERE sc-bytes > 1 AND time-taken > 1000 GROUP BY Client, cs(User-Agent) HAVING Hits > 1
The WHERE condition keeps only connections that were not dropped. time-taken is measured in milliseconds, so 1000 means 1 second.
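The aggregation in the page-timing query can also be sketched in Python, which may help when the numbers need further processing. The request list is invented for illustration and `url_timings` is a hypothetical helper; the pairs are assumed to have already been pulled from the URL and time-taken columns of an IIS log.

```python
from collections import defaultdict

# Hypothetical (url, time_taken_ms) pairs extracted from an IIS log.
REQUESTS = [
    ("/Default.asp", 120),
    ("/default.asp", 480),
    ("/default.asp", 90),
    ("/news.asp", 300),
]


def url_timings(requests):
    """Group by lowercased URL and return avg/min/max time and hit count."""
    by_url = defaultdict(list)
    for url, time_taken in requests:
        by_url[url.lower()].append(time_taken)
    return {
        url: {
            "avg": sum(times) / len(times),
            "min": min(times),
            "max": max(times),
            "hits": len(times),
        }
        for url, times in by_url.items()
    }


stats = url_timings(REQUESTS)
print(stats["/default.asp"])
```

Lowercasing the URL before grouping mirrors the `to_lowercase` call in the query, so that `/Default.asp` and `/default.asp` count as the same page.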
Running the query above shows us the slowest clients, and one of them could be among our most important customers. A low figure does not mean the server is performing badly or lacks a fast internet connection. More likely the client is on a slow connection, perhaps a modem. This gives us a few points to discuss. For example, can we reduce the payload (IIS compression is one way) so that the app works for slow clients as well?
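To turn per-client throughput into the customer buckets discussed earlier, something like the following Python sketch could be used. The sample data, the bucket names, and especially the threshold values are assumptions chosen for illustration; they do not come from Log Parser or from any survey.

```python
from collections import defaultdict

# Hypothetical (client_ip, sc_bytes, time_taken_ms) samples.
SAMPLES = [
    ("10.0.0.1", 40000, 8000),
    ("10.0.0.1", 60000, 12000),
    ("10.0.0.2", 500000, 2000),
    ("10.0.0.2", 300000, 1200),
    ("10.0.0.3", 0, 5000),  # no payload sent: likely a dropped connection
]

# Illustrative thresholds in bytes per second. These are assumptions, not
# values from the article; tune them to the audience being measured.
BUCKETS = [(7000, "modem"), (200000, "broadband"), (float("inf"), "high speed")]


def bytes_per_second(samples, min_bytes=1, min_ms=1000):
    """Average throughput per client, skipping empty or very short requests."""
    totals = defaultdict(lambda: [0, 0])  # client -> [bytes, milliseconds]
    for client, sent, elapsed_ms in samples:
        if sent > min_bytes and elapsed_ms > min_ms:
            totals[client][0] += sent
            totals[client][1] += elapsed_ms
    return {c: b * 1000 // ms for c, (b, ms) in totals.items()}


def bucket(rate):
    """Map a bytes-per-second figure to the first bucket it fits under."""
    for limit, name in BUCKETS:
        if rate < limit:
            return name


rates = bytes_per_second(SAMPLES)
for client, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(client, rate, bucket(rate))
```

The filter on bytes and elapsed time plays the same role as the WHERE condition in the query: very short or empty responses say little about the client's line speed.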
Measuring Throughput
Performance Monitor is the main way to learn what the throughput is, but it reports only at the server and site level. For other levels we can use Log Parser: for a URL, for example, it can tell us requests per second, minute, or hour and the number of bytes transferred to different clients, as in the examples above.
ETW (Event Tracing for Windows) is another excellent tool for understanding server performance, throughput, and other issues through extensive logging. It traces every call from the moment a request reaches the server (HTTP.SYS on Windows Server 2003) until the response is sent back to the client, showing every operation involved in that request. Here is an example where ETW tracing proved useful in diagnosing a performance issue. After a website was published to another server, the very first request to any aspx/asp page took a very long time. An ETW trace on that server showed that an ISAPI filter installed on IIS was taking a long time to load and was blocking all other operations.
Now that we can define performance and know some ways to measure it, we can move on to improving the performance of a given web application. Again, there is no set procedure. It involves making a guess and seeing whether it works. If it does, celebrate and make another guess. If it doesn't, undo it and make another guess.
Improving End User Performance
In the internet scenario, the key issues are:
Download time
According to a survey, more than 50% of internet users still have narrowband connections such as modems. If testing and verification are done only on a high-speed LAN, we cannot foresee the problems of customers on narrowband connections.
Specific items we can look at to address these issues are:
- Download size is the key performance driver on low-bandwidth connections. How do we reduce it? One of the best ways is to enable IIS compression.
- Split up the helper content (style sheets, client-side scripts). If the page being requested does not need a particular JS function or style sheet, do not download it. Download it only when it is needed.
- Do not duplicate content (script code, style sheets, etc.) that is downloaded across the website. Consider this example:
Bad CSS (Replication of data)
.article-ByLine { font-family: Tahoma,sans-serif; font-size: 9.5pt; font-style: normal; line-height: normal; font-weight: normal; color: #000000 }
.article-Caption { font-family: Tahoma,sans-serif; font-size: 8pt; font-style: normal; line-height: normal; font-weight: normal; color: #000000 }
.article-Headline { font-family: Tahoma,sans-serif; font-size: 14pt; font-style: normal; line-height: normal; font-weight: bold; color: #000000 }
In the example above, every property except font-size (and font-weight for the headline) is repeated in each definition, which increases the payload. The same CSS can be written as:
BODY { font-family: Tahoma,sans-serif; font-style: normal; line-height: normal; font-weight: normal; color: #000000 }
.article-ByLine { font-size: 9.5pt }
.article-Caption { font-size: 8pt }
.article-Headline { font-size: 14pt; font-weight: bold }
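To see what the duplication costs on the wire, the two style sheets can simply be measured. The Python sketch below does that. The CSS strings are the two versions from this example written compactly, and `payload_bytes` is a hypothetical helper name.

```python
# The two style sheets from the example, written without optional whitespace.
DUPLICATED = (
    ".article-ByLine{font-family:Tahoma,sans-serif;font-size:9.5pt;"
    "font-style:normal;line-height:normal;font-weight:normal;color:#000000}"
    ".article-Caption{font-family:Tahoma,sans-serif;font-size:8pt;"
    "font-style:normal;line-height:normal;font-weight:normal;color:#000000}"
    ".article-Headline{font-family:Tahoma,sans-serif;font-size:14pt;"
    "font-style:normal;line-height:normal;font-weight:bold;color:#000000}"
)
SHARED = (
    "BODY{font-family:Tahoma,sans-serif;font-style:normal;"
    "line-height:normal;font-weight:normal;color:#000000}"
    ".article-ByLine{font-size:9.5pt}"
    ".article-Caption{font-size:8pt}"
    ".article-Headline{font-size:14pt;font-weight:bold}"
)


def payload_bytes(css):
    """Size of the style sheet as it would travel uncompressed."""
    return len(css.encode("utf-8"))


saved = payload_bytes(DUPLICATED) - payload_bytes(SHARED)
print("duplicated:", payload_bytes(DUPLICATED), "bytes")
print("shared:    ", payload_bytes(SHARED), "bytes")
print("saved:     ", saved, "bytes")
```

The saving grows with every additional class that would otherwise repeat the shared properties.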
- Set an HTTP Expires header for all images and for HTML so that proxy servers and browsers make fewer calls to the web server. For more information, see Content Expiration.
- Use SSL only when needed and only for content that requires security. Because SSL encryption consumes considerable processor resources, it takes much longer to retrieve and send data from SSL-enabled directories. Also keep these pages free of other resource-consuming elements, such as images.
- Verify that HTTP keep-alive (the Connection: Keep-Alive header) is enabled. If it is turned off, every file requires a new TCP connection, which is costly on a slow connection.
- Set expiration dates on files – When a customer returns to a web page, IE already has most of the page's files in its cache, but it does not use them if their expiration dates are in the past. Instead, IE sends a conditional GET request to the server, indicating the date of the cached file. If the file has not changed, the server replies with 304 Not Modified. This GET/Not Modified sequence costs the client a round trip.
- Identify slow-loading files, since they provide clues about what needs to be improved. Causes of very long load times include server capacity issues and network congestion. This data can be collected by running Log Parser on the IIS logs or by using ETW tracing.
- Files often contain tabs, spaces, newlines, and comments that account for some percentage of page size. Try removing them.
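Two of the items above, compression and whitespace removal, are easy to estimate before changing any server setting. The Python sketch below uses the standard `gzip` module as a stand-in for what HTTP compression does to a response body. The sample page and the naive `strip_whitespace` helper are hypothetical, and actual savings depend on the real content.

```python
import gzip
import re

# Hypothetical page: repetitive markup padded with indentation and comments.
PAGE = ("<!-- row template -->\n"
        "<tr>\n\t<td class=\"article-Caption\">   item   </td>\n</tr>\n") * 200


def strip_whitespace(html):
    """Naive minifier: drop HTML comments and collapse runs of whitespace.

    Only safe for markup without <pre>, <textarea>, or inline scripts that
    depend on line breaks.
    """
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"\s+", " ", html).strip()


raw = PAGE.encode("utf-8")
minified = strip_whitespace(PAGE).encode("utf-8")
compressed = gzip.compress(raw)

print("original bytes:", len(raw))
print("minified bytes:", len(minified))
print("gzip bytes:    ", len(compressed))
```

Running both measurements on a few representative pages gives a quick idea of which change is worth making first for slow clients.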
Hardware Resources
If CPU is the bottleneck, consider caching so that we compute less often. Also check whether HTTP compression is the cause.
If memory is the bottleneck, check whether we are caching too much, how many copies of the same data we cache, and so on. There is a tradeoff: determine which is hurting more, CPU or memory, and decide accordingly. Memory can be monitored with the existing Performance Monitor counters. Here are a few of them:
- Monitoring Available System Memory – Memory\Available Bytes
- Monitoring Paging – Memory\Page Faults/sec, Memory\Pages Input/sec, Memory\Page Reads/sec, Memory\Transition Faults/sec (If these numbers are low, the server is responding to requests quickly. If they are high, investigate whether we have dedicated too much memory to the caches, leaving too little for the rest of the system. If reducing cache sizes does not improve system performance, we might need to add RAM to the server.)
- To learn more about IIS performance counters, visit http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/0a6c9f07-a70c-4c3d-b93d-5dfef593c744.mspx?mfr=true.
If the disk is slow, determine what is causing the file access. The best tool for this purpose is Filemon. For example, website logging may be turned on, causing a lot of file I/O and degrading performance.
So far we have discussed the key performance pillars, how to measure them, and ways to fix them.
Performance issues can be difficult to troubleshoot, and finding and resolving the root cause often takes a long time. The process is less painful if we collect good data and use it to solve the issue. How do we collect good data? By asking as many questions as we can. For example:
- When did the problem start happening? Was there any change or update on the server before the problem appeared?
- What are the symptoms? Are one or more error messages shown?
- Is CPU usage high on the server at the time of the problem?
- Is the worker process consuming a lot of memory (approximately 500-600 MB or more) at the time of the problem?
- What are the technologies (including third-party) involved?
- Are there any COM/COM+ components being used? If so, are they STA or MTA?
- Does the problem happen at a specific interval?
- Do ASPX/ASP/HTML pages in the same or different applications on the same server work fine?
- Is the issue specific to some pages in the application, or does it affect all pages?
- Does a simple "hello world" aspx page work fine?
- What is the architecture of the application?
- Is Data Access involved? If so, does the issue happen with any page connecting to a particular database?
- If Data Access is involved, does any page that does not do data access work fine?
- What is the application workflow with respect to the current problem?
- What are the steps to reproduce the problem?
- Is the problem reproducible in a test environment?
Once we have gathered this data, we should have a good idea of where to focus. Based on the symptoms and data, we need to decide whether the problem is on the client side, the server side, the database or another tier, or a combination of these. Several tools can then help us find the root cause.
- Network monitoring tools (Netmon) can be used to examine network-related problems, page rendering delays, and so on.
- File monitoring (Filemon) and registry monitoring (Regmon) tools help identify file system and registry access issues.
- ETW (Event Tracing for Windows) can trace internal IIS and kernel (HTTP.SYS on Windows Server 2003) events, pinpoint web bottlenecks on the server, and often show where to tune the server for better performance.
- WinDbg/CorDbg – used for advanced debugging by analyzing memory dumps from the production server, even without going through the source code.
I hope this article gets you started troubleshooting and analyzing performance issues for a website hosted on IIS.
Until next time,
–Parag