Managed Network Health

Network health affects business services and can also affect DX NetOps Spectrum, which monitors network health for business services. With an increased focus on the health of business services, we sometimes overlook chronic network health issues that do not appear to impact the business. To monitor business services and resolve faults down to the network layer, any infrastructure management strategy must monitor and react to constant change. Chronic network health issues can increase the amount of change, and the resulting work for DX NetOps Spectrum, by an order of magnitude.
Consider the case where a single link on a core router is going down and coming back up several times each minute, a condition known as "flapping". In this case, DX NetOps Spectrum receives a link-down trap and a link-up trap, and also polls the interface for status, each time the link flaps. In addition, this flapping link can result in a temporary loss of contact with many of the devices in the network behind it. The result can be more polling and fault-isolation overhead as DX NetOps Spectrum works to determine the state of the devices in this network. However, this single case is not an issue as far as DX NetOps Spectrum capacity is concerned.
But imagine a second scenario in which several core routers and switches each have multiple "flapping" interfaces. The effect of thousands of link-down/link-up traps, subsequent polling, and fault-isolation overhead in this example could result in a high and continuous DX NetOps Spectrum workload. The increased workload can include tens of thousands of alarms being generated and cleared continually.
In general, and in the two cases described earlier, DX NetOps Spectrum provides the data, in terms of events and alarms, to locate and resolve network health issues. DX NetOps Spectrum operators must pay attention to network health issues and must take steps to resolve them, or tune DX NetOps Spectrum to mitigate their impact. In the earlier examples, resolving the flapping interface problem is the solution. When the connectivity between DX NetOps Spectrum and the managed devices is generally unreliable, or if devices are slow to respond, verify your polling timeout thresholds and retry thresholds. Failure to do so can result in large numbers of "false" alerts due to failed polls, which increase fault-isolation overhead.
Finally, connectivity among DX NetOps Spectrum components (SpectroSERVERs and OneClick servers) is an important consideration. Everything from basic server-to-server communications to cross-server searches relies on network connectivity. Therefore, reliable communications among servers are critical for DX NetOps Spectrum performance.
 
Spectrum Report Manager (SRM)
Most of the advice that we have provided thus far has focused on the major real-time aspects of a DX NetOps Spectrum deployment. Many customers also rely on Spectrum Report Manager for historical data collection, analysis, and reporting. Report Manager includes a separate database that archives data from all connected SpectroSERVERs. Therefore, pay particular attention to the disk capacity and disk I/O performance of the system that hosts this database. Tuning can be required, depending on the amount of data being stored, the opportunities for filtering unnecessary data, and the size of the reports.
A best-practice recommendation is to determine the total database size that is required to store event history, and then allocate twice the space on that disk partition to accommodate transient space requirements. The topic titled Spectrum Report Manager Sizing Guidance provides advice and formulas to help you calculate disk space requirements.
The following considerations are also important for Spectrum Report Manager performance and capacity:
  • Consider both the volume of data and the available system resources when you evaluate Report Manager performance. Running reports against a smaller volume of data reduces the risk of report generation failures, especially for event and alarm reports, and decreases the response time of the database query.
  • When the result set is large, or when a large amount of data is sorted or grouped, the database writes the results to disk. This activity slows Report Manager performance.
  • If your environment generates a high volume of events without generating event reports, consider purging the event table periodically. Purging this table saves space on the reporting DB system.
  • If you generate event reports on a specific set of events, consider purging the event types that you do not require. Or, if selected event types are not required to produce alarm, asset, availability, or other reports, consider filtering these events before they reach the reporting database. For more information, see Install Report Manager.
  • If your environment generates a large volume of events on models that you do not include in reports, consider filtering these events from the reporting database. Install Report Manager contains more information.
  • Some of the filtering mechanisms on the reports themselves can cause performance issues. We are still researching this possibility, but we have seen anecdotal evidence that alarm and event filters, when used on reports, degrade the performance. Where possible, try to limit their use.
  • CA Support maintains some best-practice recommendations for the CA Business Intelligence (CABI) component that Spectrum Report Manager uses for reporting capabilities. These considerations also apply to any CA product that uses CABI. Contact CA Support for more information.
 
Spectrum Report Manager Sizing Guidance
The following formula can help you estimate the amount of disk space that is likely to be required to support the Reporting database for a user-specified amount of time.
The total required disk space, in GB, equals:
((# of devices) * (avg # of events per device per day) * (# of days of storage desired) * (avg size of event in KB)) / 1048576
  • # of devices
    Environment-specific value. Consider future growth when specifying this value.
  • avg # of events per device per day
    Represents the total number of events that are generated daily for a single device model. This total includes all events that result from the related application, port, and interface models. The easiest way to approximate this number is to take the total number of events that were generated on one SpectroSERVER in a day and divide that total by the number of devices that are modeled on that SpectroSERVER.
  • # of days of storage desired
    Environment-specific value.
  • avg size of event in KB
    An estimate of the amount of disk space that a single event consumes in the Reporting database, measured in KB.
  • 1048576
    The number of KB in one GB (1024 * 1024). Dividing the product of the preceding terms, which is in KB, by this number gives a measurement in GB.
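The formula above can be expressed as a small helper. The following Python sketch is illustrative only: the function name, parameter names, and the sample inputs are assumptions made for this example and are not part of the product.

```python
KB_PER_GB = 1024 * 1024  # 1,048,576 KB in one GB, the divisor in the formula


def estimate_reporting_db_gb(devices, events_per_device_per_day,
                             retention_days, avg_event_kb=1.0):
    """Return the estimated Reporting database size in GB.

    Implements: (devices * events per device per day * days * KB per event) / 1048576
    """
    inputs = (devices, events_per_device_per_day, retention_days, avg_event_kb)
    if min(inputs) < 0:
        raise ValueError("all inputs must be non-negative")
    total_kb = devices * events_per_device_per_day * retention_days * avg_event_kb
    return total_kb / KB_PER_GB


# Hypothetical inputs: 500 devices, 300 events per device per day,
# one year of retention, 1 KB per event.
size_gb = estimate_reporting_db_gb(500, 300, 365)
print(f"Estimated Reporting database size: {size_gb:.1f} GB")
```

Keeping the average event size as a parameter makes it easy to rerun the estimate if your events turn out to be larger than the 1 KB starting assumption.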
You probably have an idea of the number of devices and the number of days of storage that you want. Only two variables are then required in the calculation:
  • Average number of events, per device, per day
    Environment-specific value. You can query the DDMDB to see the average number of events that are generated on a given day.
    If you are a new DX NetOps Spectrum user, or if you are unsure how to determine the average number of events, use a reasonable default value. A default value of 300 is a good starting point. For example, 300 events per day, per device, for 500 devices equates to 150,000 events per day.
    To estimate the average daily number of events that are generated per device, first find out how many events are generated daily. The following query returns the daily event count for each of the ten most recent days:
    SELECT date(from_unixtime(utime)) as x, count(*) as cnt FROM event GROUP BY x ORDER BY x DESC LIMIT 10;
    The following query returns the days and event volume for the busiest ten days:
    SELECT date(from_unixtime(utime)) as x, count(*) as cnt FROM event GROUP BY x ORDER BY cnt DESC LIMIT 10;
    Use the results of these queries to devise a reasonable event count. Once you know the event count, divide that number by the total number of devices that are modeled on that server. The result is the average event count per device, per day.
  • Average size of events in the Reporting database (KB)
    We recommend 1 KB as an appropriate amount of space to store your average event and the corresponding records. This number can rise if most of your events carry large amounts of data. The types of events also affect data size. Alarm events turn into multiple Reporting table records, whereas NCM events affect only a single table (event). For the purpose of a general estimate, however, 1 KB is an appropriate measure.
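The step from daily event totals to a per-device average can be sketched as follows. This Python example is a minimal illustration: the function name, the device count, and the sample rows are hypothetical stand-ins for the (x, cnt) results that the queries above return in your environment.

```python
def avg_events_per_device_per_day(daily_event_counts, device_count):
    """Average the daily totals, then divide by the number of modeled devices."""
    if not daily_event_counts:
        raise ValueError("need at least one daily event count")
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    daily_average = sum(daily_event_counts) / len(daily_event_counts)
    return daily_average / device_count


# Hypothetical (day, cnt) rows, shaped like the output of the queries above.
rows = [
    ("2024-05-10", 150000), ("2024-05-09", 148000), ("2024-05-08", 155000),
    ("2024-05-07", 160000), ("2024-05-06", 142000), ("2024-05-05", 151000),
    ("2024-05-04", 149000), ("2024-05-03", 158000), ("2024-05-02", 147000),
    ("2024-05-01", 140000),
]
DEVICE_COUNT = 500  # hypothetical number of devices modeled on this server

counts = [cnt for _day, cnt in rows]
typical = avg_events_per_device_per_day(counts, DEVICE_COUNT)
peak = max(counts) / DEVICE_COUNT  # per-device rate on the busiest day

print(f"typical: {typical:.0f} events per device per day")
print(f"busiest day: {peak:.0f} events per device per day")
```

Comparing the typical value with the busiest-day value shows how much headroom to build into the event count that you feed into the sizing formula.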
 
Sizing Guidance Examples
 
Here are a couple of examples that illustrate useful calculations of the required storage capacity:
Example A. Your environment contains 600 devices, and you want to retain data for 4 years (1460 days).
You do not know how many events are generated per device, so use the default value of 300.
The total data in GB that must be stored equals:
(600 * 300 * 1460 * 1) / 1,048,576 = 262,800,000 / 1,048,576 = approximately 250.6 GB
Example B. You have 1900 devices across three servers, and you want to retain data for 2 years (730 days). Your deployment seems to be averaging 400 events per device, per day.
In this example, the number of servers is ignored; only the total device count is used.
The total data in GB that must be stored equals:
(1900 * 400 * 730 * 1) / 1,048,576 = 554,800,000 / 1,048,576 = approximately 529.1 GB
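The two examples give the size of the stored data only. The best-practice recommendation stated earlier is to allocate twice that space on the disk partition to accommodate transient space requirements. The following Python sketch reworks both examples and applies that doubling; the variable and function names are assumptions made for this illustration, and results are printed to one decimal place.

```python
KB_PER_GB = 1_048_576   # divisor from the sizing formula
AVG_EVENT_KB = 1        # recommended average event size
PARTITION_FACTOR = 2    # best practice: allocate twice the database size

# Inputs taken from Example A and Example B.
examples = {
    "A": {"devices": 600, "events_per_device_per_day": 300, "retention_days": 1460},
    "B": {"devices": 1900, "events_per_device_per_day": 400, "retention_days": 730},
}


def data_size_gb(devices, events_per_device_per_day, retention_days):
    """Apply the sizing formula with the recommended 1 KB average event size."""
    total_kb = devices * events_per_device_per_day * retention_days * AVG_EVENT_KB
    return total_kb / KB_PER_GB


results = {}
for name, params in examples.items():
    data_gb = data_size_gb(**params)
    partition_gb = PARTITION_FACTOR * data_gb
    results[name] = (data_gb, partition_gb)
    print(f"Example {name}: data {data_gb:.1f} GB, "
          f"partition to allocate {partition_gb:.1f} GB")
```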