IT Focus Area: infrastructure operations
February 23, 2015
Improving Availability, Data Protection and System Performance
Editor's Note: Sirius and Forsythe are now one company. Sirius acquired Forsythe in October 2017 and we are pleased to share their exceptional thought leadership with you.
Downtime of critical business services, associated data loss and the need to maintain required performance levels are top concerns for all enterprise information technology (IT) teams. As a result, companies are shifting toward an operational resiliency model to maximize performance and achieve 24/7 availability of business operations and IT environments.
The cost and risks associated with downtime and lost data are high enough for enterprises to justify significant investments in high availability and disaster recovery to help ensure that business-critical applications remain available.
Despite the plethora of high availability and disaster recovery technologies that can be found in the typical enterprise data center—clustering, load-balancing, replication, as well as new technologies such as grid computing, parallel clusters, and virtualization-based high availability—downtime, data loss and degraded performance are still quite common.
Three key challenges affect service availability, data integrity, and system performance:
Configuration drift caused by frequent changes
The need for cross-domain collaboration and cross-vendor integration
The proliferation of management tools
Configuration Drift
Changes to an IT environment occur frequently as part of normal operations.
These changes include:
Operating system, patch, and software installs or updates
Storage allocation changes
Kernel, system, and networking parameter adjustments
Hardware configuration (server, network, storage area network) updates
And many more
These changes can introduce discrepancies that are extremely difficult to notice, especially when multiple teams (such as storage, server, and database administration) must all take part.
Consider, for example, a cluster standby node that is missing storage area network (SAN) paths to a shared storage volume, or one that lacks a correct startup parameter. Such a gap may go undetected unless the failover process is actively tested. Failover testing is infrequent, however, so hidden vulnerabilities may linger for weeks or months. These undetected risks can lead to resiliency or recovery failures.
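The standby-node example can be illustrated with a minimal sketch that compares two configuration snapshots and reports where they differ. The setting names and values here are hypothetical, and how the snapshots are collected from real systems is outside the sketch.

```python
from typing import Dict, List

# Hypothetical snapshots of an active node and its standby (illustrative values only).
active = {
    "san_paths": 4,
    "startup_param": "cluster_mode=on",
    "kernel.shmmax": "68719476736",
}
standby = {
    "san_paths": 2,
    "kernel.shmmax": "68719476736",
}

def find_drift(reference: Dict[str, object], candidate: Dict[str, object]) -> List[str]:
    """Return human-readable differences between two configuration snapshots."""
    findings = []
    for key in sorted(reference.keys() | candidate.keys()):
        if key not in candidate:
            findings.append(f"{key}: missing on candidate (reference has {reference[key]!r})")
        elif key not in reference:
            findings.append(f"{key}: only on candidate ({candidate[key]!r})")
        elif reference[key] != candidate[key]:
            findings.append(f"{key}: reference={reference[key]!r} candidate={candidate[key]!r}")
    return findings

drift = find_drift(active, standby)
for line in drift:
    print(line)
```

Run on a schedule, a comparison like this could surface the missing SAN paths and the absent startup parameter without waiting for the next failover test.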
Need for Cross-Domain Collaboration and Cross-Vendor Integration
A high-availability environment typically spans a range of components such as networks, servers, and storage. Responsibility for configuring and managing these components typically falls to separate organizational teams.
Often, more than one subject matter expert is required to correctly configure the relevant layers, and miscommunication may result in hidden discrepancies. Adding to the complexity is the need to use hardware and software from multiple vendors across storage, servers, operating systems, cluster software, and multi-pathing. Vendors usually publish specific guidelines and best practices that describe how to configure components and which minimum settings are required. In general, it is a good idea to follow these vendor-specified best practices when deploying their products. Failure to do so can result in sub-optimal configurations and increased risk to continuity, data, and performance.
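As a rough illustration of checking a host against vendor-published minimum settings, the sketch below flags any setting that is absent or below its required value. The setting names and thresholds are invented for the example; real values would come from each vendor's own guidance.

```python
# Hypothetical vendor minimums (assumed names and values, not taken from any real vendor).
VENDOR_MINIMUMS = {
    "multipath_paths": 2,
    "hba_queue_depth": 32,
    "cluster_heartbeat_links": 2,
}

def check_minimums(settings, minimums):
    """Return (name, actual, required) for each setting that is absent or below its minimum."""
    violations = []
    for name, required in minimums.items():
        actual = settings.get(name)
        if actual is None or actual < required:
            violations.append((name, actual, required))
    return violations

# Hypothetical settings gathered from one host.
host_settings = {"multipath_paths": 1, "hba_queue_depth": 64}

violations = check_minimums(host_settings, VENDOR_MINIMUMS)
for name, actual, required in violations:
    print(f"{name}: found {actual}, vendor minimum is {required}")
```

Keeping the minimums in one shared table gives the storage, server, and cluster teams a single reference to review together.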
Proliferation of Management Tools
Given the diversity of vendors, there is no standard tool kit for managing high availability configurations in a consistent manner to help avoid configuration drift. Instead, IT administrators must use multiple point solution tools such as storage resource management tools, cluster management consoles, network management tools, server provisioning tools, and other newer virtualization consoles to manage their environments.
Six Steps to Better Availability, Protection and Performance
Adopting six best practices spanning monitoring and collaboration can help put companies on the path to higher service availability, better protection of data, and improved system performance. Using an automated, environment-spanning tool to monitor the environment is a key enabler of these best practices. It can help make the difference between an environment that may not withstand component failures and natural disasters, and one that is well-managed and likely to support operational resiliency, recovery, and optimized performance.
The six best practices below work together to help you achieve higher service availability, protect data, and deliver high performance.
Step 1: Detect
The first step in achieving higher availability is to detect the risks that can have adverse impacts. While disaster recovery testing is one component of such detection, disaster recovery tests are infrequent, effort-intensive, and limited in their capacity to bring risks to light. Beyond point solutions that monitor specific layers of the infrastructure, organizations should use an automated, non-intrusive tool that provides cross-domain visibility into risks across the infrastructure.
Step 2: Anticipate
Responding to failures when various risks come to pass is necessary, but not optimal. The team must act urgently to address the situation, but availability, data, performance or even business reputation may have already been damaged. Understanding how different risks can cause harm and the relative significance of the harm caused can guide teams to take a more proactive approach, identifying risks early before the damage is done. Regular, automated scanning of the IT infrastructure that draws on this understanding can help to identify and prioritize the risks before they manifest actual damage.
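One simple way to prioritize risks is to rank them by estimated likelihood multiplied by impact. The sketch below assumes that scoring model, and the risks and numbers in it are made up for illustration; an organization would substitute its own scales and estimates.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    likelihood: float  # assumed scale: 0.0 (unlikely) to 1.0 (near certain)
    impact: int        # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> float:
        # Simple expected-harm style score: likelihood times impact.
        return self.likelihood * self.impact

# Hypothetical findings from an automated scan.
risks = [
    Risk("Standby node missing SAN path", 0.3, 5),
    Risk("Replication lag above target", 0.6, 3),
    Risk("Outdated patch on test server", 0.8, 1),
]

# Highest score first, so the most harmful risks are handled before they cause damage.
prioritized = sorted(risks, key=lambda r: r.score, reverse=True)
for r in prioritized:
    print(f"{r.score:.2f}  {r.name}")
```

Note that the most likely risk ranks last here because its impact is low, which is the kind of trade-off a likelihood-only or impact-only view would miss.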
Step 3: Alert
IT teams should adopt tools that monitor and identify risks in the infrastructure and prioritize them based on likely impact. The tools should alert relevant stakeholders so that they can take proactive action to mitigate risks. The alert system should provide drill-downs into symptoms, explain root causes, forecast potential business impacts, and suggest solutions. Such alerting capability should be integrated with the organization’s existing IT management trouble-ticketing system to facilitate a seamless and standardized workflow.
Step 4: Collaborate
Given the high degree of interconnectedness and interdependencies in today’s IT environment, it is imperative that IT teams adopt mechanisms to support strong cross-domain and cross-team collaboration. Prime among these is establishing a cross-group team to oversee IT risks. The team should be supported by a tool that takes a holistic, infrastructure-wide perspective in its identification and assessment of risks.
Step 5: Validate
Because some IT risks can have a severe negative impact on the business, it is not enough to identify, prioritize, and communicate them. The risks must actually be addressed. This calls for a closed-loop system in which responsibility for resolution and closure can be tracked and managed.
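A closed loop can be pictured as a small state machine in which a risk cannot be closed until its fix has been verified. The states, transitions, and class below are assumptions made for this sketch, not a description of any particular ticketing product.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed lifecycle: a resolved risk must be verified, or it goes back to its owner.
ALLOWED = {
    "open": {"assigned"},
    "assigned": {"resolved"},
    "resolved": {"verified", "assigned"},
    "verified": set(),
}

@dataclass
class RiskTicket:
    risk: str
    owner: Optional[str] = None
    state: str = "open"
    history: List[str] = field(default_factory=lambda: ["open"])

    def move(self, new_state: str, owner: Optional[str] = None) -> None:
        """Advance the ticket, rejecting any transition that skips a step."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        if owner is not None:
            self.owner = owner
        if new_state == "assigned" and self.owner is None:
            raise ValueError("an assigned ticket needs an owner")
        self.state = new_state
        self.history.append(new_state)

ticket = RiskTicket("Standby node missing SAN path")
ticket.move("assigned", owner="storage-team")  # hypothetical team name
ticket.move("resolved")
ticket.move("verified")  # for example, after a re-scan confirms the fix
print(ticket.owner, ticket.history)
```

Because "open" cannot jump straight to "verified", every closed risk carries a record of who owned it and that the fix was checked.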
Step 6: Measure
Identifying and measuring key performance indicators (KPIs) for IT risk management allows the team to focus on the areas, vendors, and systems that require more attention. KPIs can also show which systems are most frequently under threat so that appropriate resources can be devoted to addressing them. Similarly, tracking KPIs can help identify which best practices are least complied with and provide guidance for remedial training.
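As a minimal sketch of such KPIs, the example below counts risk findings per system and per violated best practice. The findings list, system names, and practice names are all hypothetical.

```python
from collections import Counter

# Hypothetical risk findings collected over a reporting period.
findings = [
    {"system": "db-cluster-1", "practice": "multipath redundancy"},
    {"system": "db-cluster-1", "practice": "startup parameters"},
    {"system": "db-cluster-1", "practice": "multipath redundancy"},
    {"system": "app-farm-2", "practice": "multipath redundancy"},
    {"system": "storage-array-3", "practice": "replication settings"},
]

# KPI 1: which systems are most frequently under threat.
by_system = Counter(f["system"] for f in findings)
# KPI 2: which best practices are violated most often (least complied with).
by_practice = Counter(f["practice"] for f in findings)

most_at_risk = by_system.most_common(1)[0]
least_complied = by_practice.most_common(1)[0]
print("Most at-risk system:", most_at_risk)
print("Least complied-with practice:", least_complied)
```

Tracking these counts from one period to the next would show whether added attention or remedial training is reducing the findings.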
A Shift Toward Operational Resiliency to Ensure Business-Critical Apps Are Always Up
The business continuity and disaster recovery practice is undergoing a paradigm shift. It is moving to focus more on operational resiliency, with enterprises looking to achieve optimum levels of performance and architecting for continuous availability of business operations and IT environments. Enterprises concerned with business continuity make significant investments in high availability and disaster recovery to help ensure that business-critical applications remain available. Organizations should consider automated, non-intrusive tools as part of their arsenal for improving availability, data protection and system performance.