Service level agreements, commonly known by the acronym SLA, define the minimum acceptable levels of key deliverable metrics. For example, an SLA might specify a minimum network availability of 99% uptime per month. This gives the provider just over seven hours per month in which the network can be unavailable before penalties apply.
This type of agreement provides the basis for third-party vendor relationships, especially in telecommunications, where ISPs and carriers are almost always better positioned to provide wide area network (WAN) carriage services.
It is common for SLAs to define availability, response, and restoration thresholds. Better-developed SLAs also define technology-specific metrics such as latency, which is commonly measured in networks carrying Voice over IP (VoIP).
With all the aforementioned metrics, the recording and graphing mechanisms play a key role in the reported result of SLA compliance. Let me give you an example. Assume that an SLA stated 99% uptime per month and the monitoring station was configured to send one probe every 30 seconds, with three consecutive lost probes required to record an outage. Under this configuration, any carrier outage shorter than one minute can never be recorded, and an outage of up to a minute and a half may go unrecorded depending on when it starts. This is just one of many examples that demonstrates how the configuration of the monitoring station itself plays a key role in network statistics.
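The arithmetic behind that example can be made explicit. The following is a minimal sketch (the function name and structure are my own, not from any monitoring product) that derives the detection bounds from a poller's probe interval and consecutive-loss threshold:

```python
def detection_bounds(probe_interval_s: float, consecutive_losses: int):
    """Return (never_detected_below, always_detected_at) in seconds.

    To lose N consecutive probes, an outage must span at least N probe
    instants, i.e. (N - 1) full intervals, so anything shorter can never
    be recorded. Only an outage longer than N full intervals is
    guaranteed to cover N probes no matter when it starts.
    """
    never_below = (consecutive_losses - 1) * probe_interval_s
    always_at = consecutive_losses * probe_interval_s
    return never_below, always_at

# The SLA example above: one probe every 30 s, 3 consecutive losses
never_below, always_at = detection_bounds(30, 3)
print(f"Outages under {never_below:.0f}s are never recorded")   # 60s
print(f"Outages over {always_at:.0f}s are always recorded")     # 90s
```

Between the two bounds (60 to 90 seconds here), detection depends purely on where the outage falls relative to the probe schedule.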
Several key strategies can be implemented to avoid this type of problem and produce an accurate reporting process:
1. Validate data across a number of systems. For example, if an outage is recorded by monitoring stations, one would expect to see corresponding trouble tickets from the help desk. Similarly, help desk tickets reporting lost connectivity without a corresponding recorded outage should be investigated.
2. Use event-driven logging to complement scheduled probing. Event-driven logs can be populated by technologies such as SNMP traps, which can be generated and sent instantly when a pre-determined event, such as a link going down, takes place.
3. Analyse historical data. Historical data tends to form usage patterns, and these patterns can be used to identify change. More accurate historical data produces a higher-quality analysis. Many graphing systems roll over long-term data in order to control the size of the database, but overly aggressive summarizing can flatten peaks and troughs and therefore lose information.
4. Utilize alternate technologies. Quality of Service (QoS) is one example of a technology that serves one purpose but can be used as a reference for another. QoS statistics are a fantastic source of information because they are accurate to the packet level and can reveal far more detail than graphs.
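Strategy 1 above amounts to a reconciliation pass between two data sources. Here is an illustrative sketch; the data shapes and the 15-minute tolerance are assumptions for the example, not a real ticketing or monitoring API:

```python
from datetime import datetime, timedelta

def unmatched_tickets(outages, tickets, tolerance=timedelta(minutes=15)):
    """Return connectivity tickets with no recorded outage nearby.

    `outages` is a list of (start, end) datetime pairs from monitoring;
    `tickets` is a list of ticket-opened datetimes from the help desk.
    """
    suspicious = []
    for opened_at in tickets:
        if not any(start - tolerance <= opened_at <= end + tolerance
                   for start, end in outages):
            suspicious.append(opened_at)  # connectivity lost, nothing recorded
    return suspicious

outages = [(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 20))]
tickets = [datetime(2024, 1, 5, 9, 5),    # matches the recorded outage
           datetime(2024, 1, 5, 14, 30)]  # no outage on record
print(unmatched_tickets(outages, tickets))
```

Any ticket returned here is exactly the case the strategy warns about: users reported lost connectivity, yet the monitoring station saw nothing, so either the outage was shorter than the detection bounds or the poller itself has a blind spot.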
In summary, the effectiveness of an SLA relies primarily on two factors: the relevance of the measured metrics and the quality of the reporting process. So I pose the question to you: would you know if your network was experiencing short, intermittent faults?
We have more information about monitoring network performance and monitoring server performance here.
IT-Pathways.com is committed to bringing you the highest quality Information Technology discussions with articles from current industry professionals.