Sunday, February 26, 2012

High Availability, Business Continuity Planning, and Disaster Recovery

High Availability, Business Continuity Planning, and Disaster Recovery Planning have become hot keywords in the Information Technology industry as information systems have come to affect nearly all aspects of modern business. High availability is fairly isolated to Information Technology, but Business Continuity Planning and Disaster Recovery Planning extend well beyond the Information Technology function in modern businesses, governments, and non-profit organizations. This post gives a high level overview of these approaches, and subsequent posts in the series will cover them in more depth with examples, ideas, and scenarios that are not strictly limited to the IT function.

Business Continuity Planning (BCP)

Many professionals, managers, and executives don't understand Business Continuity Planning and often mistake the term for disaster recovery. Business Continuity Planning deals with events that adversely affect an organization, but not to the extent that a disaster is declared and the organizational focus shifts to rebuilding. These events can be specific to the IT function, such as the outage of a corporate phone system or e-mail system, or they can involve other parts of the organization, such as the sudden loss of a top performing manager or an exceptionally skilled employee. The business response may or may not involve multiple business units, and the recovery effort is not typically an organization-wide initiative. The focus is typically on working around the issue and providing the same services/functionality with different resources (e.g., relying more on web, e-mail, and social networking communication methods during a PBX outage).

From an IT standpoint, the main focuses of BCP are continuing critical functions, providing interim solutions for a wide range of small-scale failure scenarios, and providing a recovery plan for critical IT services in the event of a failure. Since the expected IT impact is limited, the emphasis is on identifying service dependencies and making systems resilient to failures using traditional fault tolerance methods or by implementing highly available systems.

Disaster Recovery Planning (DRP)

Disaster Recovery addresses events that are far more serious in scope than those addressed by BCP: events serious enough to require an organization-wide response because the organization cannot continue normal operation without taking corrective action. These can be related to the IT function, such as a major datacenter fire, or unrelated to it, such as a flu pandemic that affects more than 50% of the organization's workforce. Other events like hurricanes, earthquakes, and floods can also trigger an organization to implement its disaster recovery plan.

From an IT standpoint, a DR event is any event where a system cannot be returned to a normal operational state without rebuilding the system and recovering data from backup. DR events typically lead to a loss of data and a total loss of an IT service (vs. a degradation during a BCP event). Consider a datacenter fire where a traditional water-based fire suppression system is used instead of a dry release system. In this scenario, the fire triggers the water release, which shorts out the electronic circuitry and renders most of the data on the equipment in the datacenter unrecoverable. In the PBX failure scenario mentioned above, alternative communication methods still existed; in this case it is likely that the PBX, Internet access, and all other forms of internal communication are gone. The organization has to switch to cell phones or physical runners to communicate, and the focus shifts to restoring the entire datacenter's services or failing over to a secondary site.

Since DR events involve loss of data, the IT focus is on limiting the loss of information and minimizing the time required to restore critical services. At a minimum, the creation and verification of off-site backups (to tape or to disk) using an enterprise backup software solution are required. Other activities, such as establishing a DR site and implementing highly available services, are more or less optional depending on the organization's dependence on IT services.

High Availability

High Availability is the gold standard for proactively managing the impact of a routine operational outage, a major system failure, or the total destruction of an IT infrastructure and its related services. It is a different concept from Business Continuity Planning and Disaster Recovery Planning: an IT strategy that helps to simplify, reduce, and sometimes eliminate the business impact of a BCP or DR event as far as IT services are concerned. When systems are highly available, service interruptions are rare and services are managed more from a performance management or capacity management viewpoint than from an availability management standpoint.

HA systems are not always easy or cheap to design, implement, and maintain. Because they typically have a higher level of complexity, they require better hardware, software, facilities, and IT staff. In return, organizational stakeholders receive a higher level of availability, and the IT department gains flexibility in replacing failed components and performing routine maintenance activities such as patching and upgrading.

The Difference Between Fault Tolerance and High Availability

Fault Tolerance and High Availability are often confused because they both indicate a state of service resiliency. High Availability is a superset of fault tolerance, meaning that high availability implies fault tolerance, but fault tolerance does not necessarily provide high availability. High availability requires that a service be resilient to most failures (at the component, system, and site levels) and be load balanced in a way that the services are the same regardless of the user's location.  Fault tolerance often indicates a state where a single component (disk, server, storage controller, NIC, etc.) can fail without impacting the service or the organization's stakeholders.
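
A quick back-of-the-envelope example (with made-up numbers) shows why the distinction matters. A single server that is 99% available is down roughly 3.65 days per year, and a fault tolerant design that survives a disk or NIC failure does nothing for a failure of the server itself. Add a second, load balanced server that can carry the service on its own and, assuming independent failures and working failover, the service is only down when both servers are down at once: 0.01 x 0.01 = 0.0001, or 99.99% availability (a little under an hour of downtime per year). Chaining dependencies has the opposite effect: two 99% components in series are only about 0.99 x 0.99 ≈ 98% available as a whole, which is one reason identifying service dependencies features so heavily in BCP and HA work.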

Some Lessons For Organizations

Implementing High Availability IT services is no longer optional.

The major stakeholders (customers, the organization's staff, and the individuals managing IT services) are all demanding higher service uptime and better quality systems. Customers want to access services 24/7/365 and will typically pick a competitor that lets them do so. Customers aren't forgiving of outages or of issues involving a loss of data or a loss of security. Don't believe me? Consider your response to receiving a letter from your bank saying that it lost control of your private information (SSN, account numbers, etc.) or had an outage that lost your paycheck deposit. For me, that typically leads to a loss of business for the organization.

IT staff don't want to be woken up before/after hours, and the best potential employees typically survey the current state of an organization's IT systems and its on-call structure before agreeing to a new position. High-caliber candidates often accept or reject an organization based on whether they think the IT function is given the appropriate resources. As a result, if your organization isn't trying to get it right, it will lose its best IT staff and will never be able to recruit high quality replacements. There are a few stories of this that I will detail from my personal experience in future posts.

Organizational staff are seeking a borderless office and the ability to do their work from any system, at any location, at any time of day. This is especially true for global organizations, where meetings can involve participants at multiple points around the globe. The typical 9-5 workday and "business hours" are 20th century concepts. In the 21st century, successful organizations are much more flexible and work towards a mutually beneficial relationship with employees.

The most effective organizations will be able to leverage highly available services both on premises and in the cloud to improve service uptime, improve the organizational climate for IT staff, and better serve the organization's stakeholders.


Thursday, February 16, 2012

Addressing Low AdSense Revenue

Google AdSense is a tricky tool to make money with: there are potentially many ways to optimize content, and it can be extremely difficult to attract both the right audience and the highest cost per click (CPC) ads. Often it feels like trying to make money with a blog is a waste of time, but there are individuals who manage to make many thousands of dollars per month in passive income from blogging. The question is often "how do they do it?" The answer is very easy to explain but extremely difficult to implement, because many of the factors relating to the popularity of your site and/or your blog are outside of your control. In a lot of ways it is like running a presidential campaign without having control over any advertising channels.

Search Engine Optimization (SEO)

There are many books and websites devoted to Search Engine Optimization (SEO), and I have read a number of them myself. They all essentially say the same thing in slightly different ways. A few of the key principles they communicate are:
  1. Optimize your keywords to your content and your content to your keywords
  2. Generate incoming links to your content
  3. Generate more traffic to your site
  4. Create good content
These are written in order of decreasing importance. Don't believe me? Dig around for about 5 minutes with different keywords and you will find many sites with useless content that manage to beat legitimate, useful sites on search engines (one example is "windows drivers": the top results are full of useless driver "update utilities" and very few PC manufacturers' sites).

Search engines and crawlers don't have a good heuristic for determining whether a site actually has good content; instead they pick up on measurable metrics involving keywords, incoming links, and content freshness. Not to be unduly hard on search engines: most people can't really come up with a definition for 'good' content either (the definition often involves synonyms of 'good' and 'content', which makes it circular). Unfortunately, since people can't do it well, we can't really expect a machine (or a large network of machines) to do any better. This gets even more complex with some of the black hat SEO strategies that are employed, such as serving different content to users and to crawlers.

The real key to search engine optimization and generating traffic goes back to some of the fundamental ideas in marketing. Since I did an undergraduate degree with three minors (economics, applied math, and Russian studies) and an MBA in business and accounting, I had to take a couple of classes in marketing (though they weren't my best grades). At the time I didn't think they were particularly useful and I had a rather strong distaste for marketers and advertising, but looking back, the concepts are fundamental to building a successful ad supported blog or website. Website optimization maps onto the 4 Ps (I touch on them now, but I'll explore them more in a later post):
  1. Product - This is the content that you are providing.
  2. Price - This is the price that advertisers are willing to pay to target the keywords in your site.
  3. Place - Your website.
  4. Promotion - Your effectiveness with generating incoming traffic through incoming links and good search engine optimization.
Let's take the example below of a post of mine that was unsuccessful from an AdSense standpoint: adding HyperTerminal from a Windows XP/Server 2003 system to a more modern version of Windows (Vista/Server 2008 R2). It is a fairly successful post in traffic terms, bringing in over 100 hits per day, but even though the click through rate (CTR) was high, the revenue generated was initially unacceptably low (1-2 cents per click, or around $14 per year). Google Analytics reported how the page was doing on a typical day:



So I took a little bit of my accounting knowledge and decided to follow the money. Advertisers pay Google a certain amount based on ad clicks and on cost per 1000 impressions (CPM). Advertisers bid for specific keywords; the highest bidders get the best placement and the largest number of impressions, while lower bidders get worse placement and fewer impressions. Advertising through publishers, through its search engine, and through the rest of its online portfolio (Google Voice, Google Docs, GMail, etc.) is the primary way that Google makes its money. Advertisers pay a higher cost for more targeted and more competitive keywords. Google in turn pays publishers (bloggers, website owners, etc.) a percentage of the per-click amount for clicks and of the per-CPM amount for hosting the ads. The end amount that publishers receive for clicks and impressions is therefore directly proportional to the amount that advertisers pay. Publishers currently have no direct visibility into what content will generate the highest revenue, but it is possible to take a reasonably good guess with the AdWords keyword tool, which provides AdWords advertisers approximate per-click costs for specific keywords and closely related keywords (known in the Google world as keyword ideas).
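
To see how those pieces translate into dollars, a rough back-of-the-envelope calculation helps (the CTR figures below are assumptions for illustration only, not my actual AdSense statistics):

100 visits/day x 2% CTR = 2 clicks/day
2 clicks/day x $0.02/click = $0.04/day, or roughly $15/year

which lines up with the ~$14/year the HyperTerminal post was originally earning. Traffic and CTR only scale the result linearly, so the per-click amount is the lever with the most room to move: even if better keyword targeting cuts the CTR in half, 1 click/day at $0.50/click works out to roughly $180/year.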

Going back to my poorly performing HyperTerminal post, I decided to look at the highest CPC keywords around the HyperTerminal keyword and retrieved the list below.



I rewrote the post, keeping some of the original content and incorporating content around the top 7-10 CPC keywords reported by the AdWords keyword tool. Within a couple of weeks, each click was generating a little under 50 cents (around 25 times more than before the revision). The CTR went down, but I'd estimate that this post will make a little over $200 per year for as long as people are still interested in HyperTerminal. This is a roughly 1400% increase from the previous projected amount. Looking back at Google Analytics, the increase is reflected in the AdSense Pages report.



There are a lot of factors that affect AdSense revenue for a publisher, and many of them I am still exploring myself. Improving the keyword targeting of your site/blog/article can have a profound effect on the CPC for AdSense ads served on your site and can increase your revenue significantly.


Wednesday, February 8, 2012

Deploying a Kerberos KDC in Ubuntu 11.10 or Fedora 15

What is Kerberos?


Kerberos is an authentication and authorization protocol that allows authenticated (and sometimes encrypted) communication between two systems. Authentication is performed by a third system known as a key distribution center (KDC), which stores passwords for all of the principals (typically users and systems) known within its realm. A unique feature of Kerberos is that the password is never sent over any type of connection (encrypted or plain text) during authentication or authorization. Instead, the KDC generates a token (called a ticket granting ticket) that is encrypted using the password as the key and sends it to the requesting client. If the client has the correct password and can decrypt the ticket granting ticket, it can then use this token to request access to different services (anything from logon capability, file transfer, or shell access to authenticating web applications, services, and clients across domains).

Kerberos is a protocol that can be implemented by anyone, but the two main implementations known in the IT industry are MIT Kerberos V (currently in a 1.10 release) and Microsoft Active Directory (stable with Windows Server 2008 R2, unstable in the Windows 8 Server Developer Preview). MIT Kerberos is often used in IT environments dominated by UNIX and Linux systems while Active Directory is predominantly used in Microsoft environments. Mechanisms exist to allow MIT Kerberos and Active Directory to communicate successfully in certain mixed environments. A key thing to note is that in addition to a Kerberos infrastructure, Active Directory also provides an integrated LDAP V3 compliant directory. MIT Kerberos V would need to be integrated with another LDAP implementation (such as OpenLDAP) to provide similar functionality. Kerberos is simply a piece that fits into a larger identity management (IdM) or authentication, authorization, and accounting (AAA) strategy.

To demonstrate the functionality of the protocol and the mechanics of setting up a realm that can be used as an authentication source I will show how to set up a KDC in a RedHat based distribution (RH) and a Debian based distribution (D). Although I am demonstrating with Fedora Core 15 (RH) and Ubuntu 11.10 (D), a similar application of these steps could easily be carried out in RedHat Enterprise Linux (RH), CentOS (RH), Mandriva (RH), Yellow Dog Linux (RH), Knoppix (D), etc.

Security in a Kerberos Environment


Security is a paramount concern when deploying a KDC. The passwords for every user and system in the realm are stored on the KDC, so an intruder who compromises it can reasonably decrypt passwords and gain dozens to thousands of potential points of entry to the network. Security for a KDC is a large topic, and I will cover some of the practices that I have seen/considered in a separate post.

For simplicity, I demonstrate a setup that might be reasonable for a non-production development/testing Kerberos environment (a less strict setup that allows a build environment on the KDC as well as remote logon via SSH). For a production environment, the potential attack footprint should be minimized to a level that is suitable for the applicable legal, regulatory, and business requirements and policies.

Options For Setting Up The Servers


It is possible to obtain pre-built packages for Kerberos 5, but depending on the needs of the organization or security requirements, it is probably more desirable to build from source.

For the die-hard package users, the command to install a KDC in Fedora is:

yum install krb5-server

For Ubuntu users,

apt-get install krb5-kdc

Since I want to have direct control over patching and the options that are compiled in, I will compile krb5-1.10 from source.

Prerequisite Setup


The initial installation process for Fedora 15 and Ubuntu Server is well documented all over the Internet (so I'm not going to cover it here). Since I am using Hyper-V on Windows Server 2008 R2 as a virtualization host, I had to make a couple of minor changes to get networking to function properly. This involved using Legacy Network Adapters for the VMs and uninstalling the irqbalance package (yum erase irqbalance in  Fedora and dpkg -r irqbalance in Ubuntu). Fedora Core ran fine after the change, but networking was still shaky with Ubuntu.

Next I ensured that the GNU C compiler (gcc) was installed and working. 
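
On a minimal install the toolchain may not be present; something along these lines should take care of it (these are the standard package names, but adjust for your release):

# Ubuntu (as root)
apt-get install build-essential
# Fedora (as root)
yum install gcc make
# verify the compiler runs
gcc --version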

Configuring and Building


Configuring and building is relatively straightforward; I used a prefix of /usr/local/krb5 and did not use OpenLDAP for the backend database storage. First, download the latest release:

burrm@ubuntu-kdc:~$ wget http://web.mit.edu/kerberos/dist/krb5/1.10/krb5-1.10-signed.tar
--2012-02-06 23:10:30--  http://web.mit.edu/kerberos/dist/krb5/1.10/krb5-1.10-signed.tar
Resolving web.mit.edu... 18.9.22.69
Connecting to web.mit.edu|18.9.22.69|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10649600 (10M) [application/x-tar]
Saving to: `krb5-1.10-signed.tar'

100%[======================================>] 10,649,600  21.7K/s   in 8m 27s 

2012-02-06 23:19:05 (20.5 KB/s) - `krb5-1.10-signed.tar' saved [10649600/10649600]


Then extract the  source code:

burrm@ubuntu-kdc:~$ tar -xvf krb5-1.10-signed.tar
krb5-1.10.tar.gz
krb5-1.10.tar.gz.asc
burrm@ubuntu-kdc:~$ tar -xzf krb5-1.10.tar.gz
burrm@ubuntu-kdc:~$ cd krb5-1.10/


You can use multiple build directories (if you are building on different platforms); this is what I do:

burrm@ubuntu-kdc:~/krb5-1.10$ mkdir Ubuntu-11.10-Server
burrm@ubuntu-kdc:~/krb5-1.10$ cd Ubuntu-11.10-Server/
burrm@ubuntu-kdc:~/krb5-1.10/Ubuntu-11.10-Server$ ../src/configure --help
burrm@ubuntu-kdc:~/krb5-1.10/Ubuntu-11.10-Server$ ../src/configure --prefix=/usr/local/krb5
burrm@ubuntu-kdc:~/krb5-1.10/Ubuntu-11.10-Server$ make


After everything configures and builds correctly, it needs to be installed:

burrm@ubuntu-kdc:~/krb5-1.10/Ubuntu-11.10-Server$ su -
Password:
root@ubuntu-kdc:~# cd /home/burrm/krb5-1.10/Ubuntu-11.10-Server/
root@ubuntu-kdc:/home/burrm/krb5-1.10/Ubuntu-11.10-Server# make install


Next, I do a step that allows versions to be switched quickly if needed:

root@ubuntu-kdc:/home/burrm/krb5-1.10/Ubuntu-11.10-Server# cd /usr/local/
root@ubuntu-kdc:/usr/local# mv krb5 krb5-1.10
root@ubuntu-kdc:/usr/local# ln -s krb5-1.10 /usr/local/krb5

Now, we are ready to configure the KDC with a basic configuration.

Configuring the KDC

These steps are largely platform independent because they involve configuring the application itself; the main difference between platforms/builds/packages is the default location of the kdc.conf and krb5.conf files. These locations can be determined by finding the man pages for your Kerberos installation and reading the krb5.conf and kdc.conf man pages. Samples may even be available depending on the build (for me this path is /usr/local/krb5/share/examples/krb5/).
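
With a source build under /usr/local/krb5, something like the following will turn up the samples and the man pages (I'm assuming the man pages landed under the install prefix; paths will differ for distribution packages):

ls /usr/local/krb5/share/examples/krb5/
man -M /usr/local/krb5/share/man kdc.conf
man -M /usr/local/krb5/share/man krb5.conf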

First the kdc.conf and krb5.conf files need to be built. For my build, the correct paths are /usr/local/krb5/var/krb5kdc/kdc.conf and /etc/krb5.conf.

My kdc.conf:

[kdcdefaults]
kdc_ports = 750,88

[realms]
MIKESBLOG.LAN = {
        database_name = /usr/local/krb5/var/krb5kdc/principal
        admin_keytab = FILE:/usr/local/krb5/var/krb5kdc/kadm5.keytab
        acl_file = /usr/local/krb5/var/krb5kdc/kadm5.acl
        key_stash_file = /usr/local/krb5/var/krb5kdc/.k5.MIKESBLOG.LAN
        kdc_ports = 750,88
        max_life = 10h 0m 0s
        max_renewable_life = 7d 0h 0m 0s
        }
 
 
My krb5.conf:

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = MIKESBLOG.LAN
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 MIKESBLOG.LAN = {
  kdc = ubuntu-kdc.mikesblog.lan
  admin_server = ubuntu-kdc.mikesblog.lan
 }

[domain_realm]
 .mikesblog.lan = MIKESBLOG.LAN
 mikesblog.lan = MIKESBLOG.LAN
 
Then the database needs to be initialized, the ACL file needs to be built, and principals can be created. For my ACL file, I allow anyone defined in the database with a /admin instance to perform all Kerberos-related actions on the KDC. This file is moderately customizable (see MIT's installation guide) based on the permissions that need to be granted to other people or service accounts; a sketch of more granular entries follows the transcript below.

root@ubuntu-kdc:/usr/local# /usr/local/krb5/sbin/kdb5_util create -r MIKESBLOG.LAN -s
Loading random data
Initializing database '/usr/local/krb5/var/krb5kdc/principal' for realm 'MIKESBLOG.LAN',
master key name 'K/M@MIKESBLOG.LAN'
You will be prompted for the database Master Password.
It is important that you NOT FORGET this password.
Enter KDC database master key:
Re-enter KDC database master key to verify:

root@ubuntu-kdc:/usr/local# vi /usr/local/krb5/var/krb5kdc/kadm5.acl
root@ubuntu-kdc:/usr/local# cat /usr/local/krb5/var/krb5kdc/kadm5.acl
*/admin@MIKESBLOG.LAN x
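
The single wildcard entry above gives any principal with a /admin instance full rights on the KDC. The kadm5.acl format also supports per-privilege letters for more granular grants; as a rough sketch (the helpdesk and monitor principal names are hypothetical, and the permission letters should be checked against the kadm5.acl man page for your version), entries like these would limit a helpdesk principal to change-password and inquire rights and a monitoring account to listing principals:

helpdesk/admin@MIKESBLOG.LAN   ci
monitor@MIKESBLOG.LAN          l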


From there, a couple of principals can be created with kadmin.local:

root@ubuntu-kdc:/usr/local# /usr/local/krb5/sbin/kadmin.local
Authenticating as principal root/admin@MIKESBLOG.LAN with password.
kadmin.local:  addprinc burrm/admin
WARNING: no policy specified for burrm/admin@MIKESBLOG.LAN; defaulting to no policy
Enter password for principal "burrm/admin@MIKESBLOG.LAN":
Re-enter password for principal "burrm/admin@MIKESBLOG.LAN":
Principal "burrm/admin@MIKESBLOG.LAN" created.
kadmin.local:  q


Finally the KDC and admin server can be started:

root@ubuntu-kdc:/usr/local# /usr/local/krb5/sbin/krb5kdc
root@ubuntu-kdc:/usr/local# /usr/local/krb5/sbin/kadmind


It may be desirable to create init scripts to start these on boot, but another option is to have an administrator manually start/stop them when needed. Additionally the firewall (typically iptables) needs to be adjusted to allow ports 750 and 88 (or others defined in the kdc.conf file).
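
As a rough sketch of the iptables side (rule ordering and persistence vary by distribution; also note that kadmind typically listens on its own port, 749/tcp by default, if you plan to administer the realm remotely):

iptables -A INPUT -p udp --dport 88 -j ACCEPT
iptables -A INPUT -p tcp --dport 88 -j ACCEPT
iptables -A INPUT -p udp --dport 750 -j ACCEPT
iptables -A INPUT -p tcp --dport 750 -j ACCEPT

A quick sanity check from the KDC itself is to request and list a ticket for the admin principal created earlier:

/usr/local/krb5/bin/kinit burrm/admin
/usr/local/krb5/bin/klist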

See Also,
Kerberos Password Policies Made Easy


Sunday, February 5, 2012

Identifying System Cooling Issues

Even the best of us have computer problems from time to time. In my case, I had a problem that resulted from the combination of a lazy system setup (I put too few fans in my case) and a dead fan. Since I write a lot of posts on a number of different topics, I build new virtual machines on a weekly basis to demonstrate different features and application configurations in Windows and Linux. In this case, I was working on building a couple of MIT Kerberos servers to demonstrate how to easily apply password policies using MIT Kerberos 5 and how to build an older and a newer version of Kerberos 5 on a newer Fedora build (Fedora 15) and on Ubuntu 11.10.

I had a peculiar problem while I was working on this: I would start two Linux virtual machines that I had built on Hyper-V, do some work, then go to bed. When I woke up in the morning, my custom-built server would be shut down. The first few times I wrote it off as an issue potentially caused by the weather (various wind storms and snow storms here in Boulder) or by the low quality of our power infrastructure (since we have Xcel Energy, our power is about as reliable as my Comcast Internet connection [in the industry we call this 1.5 nines of uptime]). After the first couple of times, though, I began to doubt that explanation because none of our other household appliances (microwave, stove, etc.) were showing signs of a power failure. Since it was happening more frequently than normal, I started to wonder if I had a problem with my server... so I started troubleshooting...

Nothing really obvious jumped out from the logs; there was simply the Kernel-Power message with Event ID 41 (in this case with no BugCheck information and no dump files, so probably not a blue screen). This really only indicates that the system turned off in an unsupported way (possibly due to a failing power supply, an overheating system, or another power fluctuation/issue).

Log Name:      System
Source:        Microsoft-Windows-Kernel-Power
Date:          2/1/2012 6:44:49 AM
Event ID:      41
Task Category: (63)
Level:         Critical
Keywords:      (2)
User:          SYSTEM
Computer:      WIN-BB9Q000LTK1
Description:
The system has rebooted without cleanly shutting down first. 
This error could be caused if the system stopped responding, crashed, or 
lost power unexpectedly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" 
        Guid="{331C3B3A-2005-44C2-AC5E-77220C37D6B4}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2012-02-01T13:44:49.171875000Z" />
    <EventRecordID>77389</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>WIN-BB9Q000LTK1</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>  
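
If you want to see how often a machine has been hitting this, a quick way to pull recent Kernel-Power 41 events from an elevated command prompt is something like the following (a sketch; the XPath filter can be tightened or loosened as needed):

wevtutil qe System /q:"*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]" /f:text /c:10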

I remembered back to a time when I had a really crappy laptop built by a company called IBuyPower (I wrote this company a BBB complaint and an FTC complaint, and I nearly filed a lawsuit over that system). The laptop would constantly overheat and shut itself off, and it literally spent more time in transit and in RMA than in my hands... but that's in the past now...

I had never had a thermal issue with the custom-built server before, so I thought it was a long shot. I downloaded Open Hardware Monitor and was shocked at some of the numbers that came back while the virtualization load was running. One of the processors was hot enough to boil water (100 degrees Celsius):



I immediately killed the virtualization load and shut the server down until I could investigate the cause of the issue, since I was close to damaging the system. That night I identified that one of the CPU fans had died and needed to be replaced. Since I hadn't done a good job with cooling the case before, I decided to replace all of the fans in the case (and add a few more in the spots that had spaces for fans but none installed). Since I was more concerned about cooling than noise, I went for the fans on NewEgg that had the highest air displacement (3x 80 mm, 1x 120 mm, and 2x 92 mm). I also bought a fan controller, since most of the reviews for the fans that I bought stated that they were too loud for a household environment.

After waiting for the next day shipping, I tackled an adventure of splicing 3 pin fan connectors to the wires on the fans that connected to a 4 pin molex connector (since I couldn't find the right adapter on the Internet and the fan controller only had 3 pin connections). Under moderate virtualization load, I was able to reduce the maximum temperature by 40 degrees into a far more acceptable range (and no further failures yet).



The moral of the story is that thermal issues can easily sneak up on you and are often overlooked as a potential cause of unexpected shutdowns and blue screens of death. When troubleshooting these issues, be sure to rule out thermal problems before blindly replacing components, and examine the manufacturer's documentation for recommended temperature ranges.

Ensure that the changes that you make have a measurable impact on the temperature of the system (as they did in my case... no pun intended).

See Also,
Windows Crash Dump Analysis