
The Three-Month Fix: How AbbVie Kept Their VDI Users Up and Running

Published November 9, 2022

“However much your VDI environment is worth per day, multiply it by 90. That’s the ROI we were able to point to right away.”

The complexity of today’s workplace technology means that every environment is unique. Two organizations may use the same platforms and applications, but the tactics we use to implement those tools are shaped by our own goals and business needs. Still, all of us who work in IT and engineering can agree: our companies’ success hinges on our ability to keep our environments running smoothly.

I’m a senior engineer at the pharmaceutical company AbbVie. In my primary role as an infrastructure desktop engineer, I’m the de facto application owner, and my core responsibility is the Nexthink platform. I’m the funnel point for every solution that comes into and out of our environment.

Specifically, we have a very expansive VDI environment at AbbVie. Many of our offshore workers, particularly our offshore IT support staff, rely on VDI to do their jobs every single day. We also have what we call “production VDIs”: VDIs that pull and send data to various business organizations; these VDIs are incredibly important to our production and overall business processes.

All in all, we can have as many as 8,000 to 10,000 VDIs running at any one time. As you can imagine, such an expansive environment requires us to pay very close attention to how our VDIs are performing on a day-to-day basis. Minor delays could result in major productivity losses – and a big incident, if we didn’t intervene immediately, could be a true disaster.

The Incident

One day, we received a call around 10 a.m. – the call that everyone who works in IT dreads, notifying us of a major slowdown in our environment.

That morning, we learned that nearly all of our VDI users were experiencing extreme delays. They couldn’t access their platforms, applications were timing out – VDI was, for these users, completely nonfunctional.

We spoke with our backend support staff, who informed us that every one of our rails was pegged at 100% CPU usage. The problem was: they couldn’t figure out what had caused usage rates to skyrocket.

Considering the number of VDI users we provide service to, this widescale slowdown would have major consequences if we didn’t fix it quickly. We had to identify the change in our environment that sparked this incident – and fast.

The Investigation

In this case, a change had taken place prior to our VDI users experiencing issues.

An Office update had been pushed out to all of our VDIs, and part of this update involved a service called SDX Helper. A breakdown in this update process saturated our rails, pegging all of our VDIs at nearly 100% CPU while the faulty process kept running, never completing, and rendering our entire environment effectively inoperable.

Fortunately, the Nexthink platform gave us the visibility we needed to scope this problem swiftly. We were able to identify which devices were impacted by this failed update, write a remote action to stop the service, and deploy that remote action to all affected users.
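To give a sense of what that remediation logic looked like conceptually, here is a minimal sketch in Python. The real remote action was built and deployed through Nexthink, so the psutil library and the process name used here (sdxhelper.exe, which I’m assuming stands in for the SDX Helper component) are illustrative assumptions rather than the script we actually ran.

```python
# Illustrative sketch only -- the real remediation was a Nexthink remote action.
# Assumes the runaway Office update component runs as "sdxhelper.exe".
import psutil

TARGET_PROCESS = "sdxhelper.exe"  # assumed binary name for the SDX Helper service

def stop_sdx_helper() -> bool:
    """Terminate the SDX Helper process on this VDI if it is running."""
    stopped = False
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == TARGET_PROCESS:
            try:
                proc.terminate()       # ask the process to exit
                proc.wait(timeout=10)  # give it a moment to shut down
                stopped = True
            except psutil.Error:
                pass                   # process already gone or access denied
    return stopped

if __name__ == "__main__":
    print("stopped" if stop_sdx_helper() else "nothing to stop")
```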

Additionally, we ran an investigation in which we proactively checked the CPU usage of every single PC every 10 minutes. If a PC had been running above 99% CPU with that service active for more than 10 minutes, we intervened.
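Conceptually, that check behaved like the loop sketched below. In reality it ran as a Nexthink investigation across every device rather than as a local script, so the polling loop, the psutil calls, and the process name are again assumptions made for illustration; only the 10-minute interval and the 99% threshold come from the incident itself.

```python
# Conceptual sketch of the watchdog check; the real version ran as a Nexthink
# investigation across all VDIs, not as a local Python loop.
import time
import psutil

TARGET_PROCESS = "sdxhelper.exe"  # assumed binary name, as in the sketch above
CHECK_INTERVAL = 10 * 60          # re-check every 10 minutes
CPU_THRESHOLD = 99.0              # intervene when the device stays above 99% CPU

while True:
    cpu = psutil.cpu_percent(interval=5)  # device-wide CPU over a short sample
    if cpu > CPU_THRESHOLD:
        # Only intervene when the runaway helper process is actually present
        for proc in psutil.process_iter(["name"]):
            if (proc.info["name"] or "").lower() == TARGET_PROCESS:
                proc.terminate()
    time.sleep(CHECK_INTERVAL)
```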

The entire process took about 45 minutes to develop and push out to our VDI environment. Almost immediately, the rails were no longer saturated and our entire environment was restored.

The Aftermath

While we may have restored our environment through our targeted remote actions, that didn’t mean the problem was entirely solved. We had still deployed a service that caused a major issue across our environment. If we didn’t take further action, we’d only be putting a temporary band-aid on the problem rather than ensuring it never impacted our users again.

We submitted a ticket with Microsoft, sending them logs that illustrated what went wrong so that they could perform analysis. But anyone who has experienced similar incidents knows that these support tickets with major providers don’t get resolved overnight.

So we had a decision to make. We could remove Office, or roll it back to the previous version and hope that would effectively solve the problem.

We ultimately took a different approach, one that protected our environment without rolling back the changes we wanted to implement. We leveraged Nexthink to serve as a watchdog while we waited for a permanent fix from Microsoft.

We kept the platform running so that it would shut down the process if it threatened to repeat the same problem for any VDI users. After all, it’s not the end of the world if an Office client doesn’t get upgraded – and this process allowed us to keep our users up and running while we awaited feedback from the provider.

Microsoft eventually came back to us, informing us that they had a fix for the issue which would be deployed in the next patch. All in all, roughly three months had passed between the time we had the issue and the day the permanent fix was successfully deployed. Having access to proactive IT technology enabled us to keep our VDI environment functioning efficiently during these three months.

When your digital workplace is heavily reliant on VDI, as ours is, this kind of solution is invaluable. Remediating a problem the day it occurs is one thing, but in an environment like ours, it’s even more important to have that proactive watchdog ensuring the problem doesn’t sneak back in and put our users and support staff through the same nightmare.

Joel Wendt is a Senior Engineer at AbbVie, responsible for managing Nexthink across the pharmaceutical company's expansive digital enterprise.
