The job of a VMware TAM is, among many things, to help guide our customers through various challenges such as monitoring and managing their environments. To some, monitoring may sound like a mundane and uncomplicated slice of your IT infrastructure, but it is not the same Boolean up or down gadget that we got by with 10 to 15 years ago. That method worked well enough when data centers were stocked with dozens of servers, not the thousands we see following the virtualization boom.
Let’s face it, managing and monitoring today’s data centers is a challenge that’s not for the faint of heart. Hardware is no longer dedicated to a single application or, thanks to hyperconvergence, a single purpose. The data center of the last decade could quite feasibly be consolidated down to a single ESXi host. That is why it is more important than ever to continuously and meticulously comb over your environment. If you lose 20 critical virtual machines because you did not know the datastore they were living on was filling up, how much would that cost your company? Moreover, would it send you on the hunt for a new job?
vRealize Operations Manager is a great tool to turn to when you are looking for trouble… in your environment, that is! I will take you through a few of my favorite dashboards that I commonly use with our customers to help improve efficiency and reduce risk in their environments.
One of the first dashboards you will see when you log into vROps is the recommendations dashboard. Think of this dashboard as your daily task list. We have divided this into three sections: health, risk, and efficiency.
The health alerts show you issues that need to be addressed immediately: guests that are nearly out of disk space, physical networks that are down, or anything else that is going to cause you to have a bad day.
The risk column points out potential issues that could degrade your environment. Risk could be guest contention, vDS configuration issues, or hardening guide violations (if you have these enabled in your monitoring policy).
Efficiency will point out “optimization opportunities” such as oversized VMs, large snapshots, and idle VMs.
If you only look at one dashboard in vROps (which I do not recommend), then this is the one. There’s a reason it is the first to appear in vROps after all. If you commit yourself to fixing just one or two alerts a day or a week, depending on the scope of the alert, you will be well on your way to having a robust and optimized environment. After a few weeks, challenge yourself to fix even more! That way you will ace your next TAM Best Practice review!
Virtualization helps us squeeze every ounce of horsepower out of our hardware investments. However, there’s a fine line between maximizing your hardware investments and packing them tighter than a Beijing subway. One of my favorite dashboards that can quickly point out areas of opportunity is the Workload Utilization dashboard.
The Workload Utilization dashboard is divided into three sections: Underutilized, Optimal, and Overutilized. Your hosts and clusters will appear on the map based on their workload score, which is a single number that represents the percentage of that object’s most consumed resource. In this case, I have hovered over the mgmt-mgmt cluster, which has a workload score of 74% based on its memory usage.
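If it helps to see the idea as arithmetic, here is a minimal Python sketch of a "most consumed resource" score. The function name and the CPU and disk figures are my own assumptions for illustration, not vROps internals or its API; only the 74% memory figure comes from the mgmt-mgmt example.

```python
# Illustrative sketch only: workload_score is a hypothetical helper,
# not a vROps API call.

def workload_score(utilization_pct):
    """Return (resource, percent) for the most consumed resource.

    utilization_pct maps a resource name to its utilization in percent.
    """
    resource = max(utilization_pct, key=utilization_pct.get)
    return resource, utilization_pct[resource]


# The CPU and disk I/O values are made up; memory matches the 74% example.
mgmt_cluster = {"cpu": 41.0, "memory": 74.0, "disk_io": 12.0}
print(workload_score(mgmt_cluster))  # ('memory', 74.0)
```

The point of taking the maximum is that one exhausted resource is enough to constrain the whole object, no matter how idle the others are.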
Clicking the details hyperlink for this object will take you to its analysis dashboards.
These are some of the most commonly used dashboards when you have questions about your environment.
The workload dashboard breaks down resources and helps us to understand how they are consumed. This view is a lot more granular than the performance overview in vCenter and can help you quickly identify problems. For example, there are two metrics to pay attention to when looking for contention: demand and usage. Demand is how much of a resource your virtual machines are asking for, and usage is the amount of that resource that the host is providing. If your VMs are demanding more resources than your host can provide, then you have contention.
I also like this view because it shows us how the resource is being consumed by breaking the bar graph up into the individual VMs consuming it. This makes it easier to identify the heavy hitters.
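To make the demand versus usage idea concrete, here is a rough Python sketch. It assumes you already have per-VM memory numbers in hand; the function names, VM names, and figures are all hypothetical and are not pulled from vROps.

```python
# Illustrative sketch only: names and numbers are assumptions, not vROps data.

def contention_gb(vm_demand_gb, host_capacity_gb):
    """Demand the host cannot satisfy, in GB; 0.0 means no contention."""
    return max(0.0, sum(vm_demand_gb.values()) - host_capacity_gb)


def heavy_hitters(vm_usage_gb, top=3):
    """Return the VMs consuming the most of a resource, largest first."""
    ranked = sorted(vm_usage_gb.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


demand = {"sql01": 64.0, "web01": 16.0, "app01": 32.0}
print(contention_gb(demand, host_capacity_gb=96.0))  # 16.0
print(heavy_hitters(demand, top=2))  # [('sql01', 64.0), ('app01', 32.0)]
```

In this made-up case the VMs are asking for 16 GB more than the host can provide, and sql01 is the first place I would look.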
Now, let’s focus our attention on capacity. Let’s say your manager comes to you and says he needs 12 VMs deployed in our mgmt-mgmt cluster and hands you the specs. There are two types of administrators in this scenario: those who will deploy the VMs regardless of whether or not there is capacity, and those who carefully consider the requirements and available capacity. Which one are you?
If you are the type who just wings it and you have vROps, then I am sorry to say you are out of excuses. The capacity remaining dashboard takes the guesswork out of deciding whether or not you have the capacity to deploy more VMs. The capacity remaining badge represents the remaining capacity of your most consumed resource. In this case, it would not be a good idea to deploy any more VMs in this cluster because its memory has been consumed.
The capacity remaining badge is great, but vROps makes this even easier by specifically calling out how many VMs you can deploy (or in this case, how many VMs you are over by). vROps includes several VM configuration profiles out of the box, and creating your own is just a matter of clicking the plus sign.
Creating your own VM configuration profile is really handy if you have a standard configuration for, say, your SQL VMs. You can simply create a new profile based on your standard configuration, and vROps will tell you how many of those VMs you can deploy. If you have a specific request for multiple VMs with unique CPU, memory, and storage requirements, then you could leverage vROps projects to perform what-if scenarios.
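For a feel of what a "how many of these VMs fit" number means, here is a simplified Python sketch. It is an assumption on my part, not the vROps calculation: it ignores policy buffers, HA reservations, and overcommit, and the profile and capacity figures are invented.

```python
import math

# Illustrative sketch only: a naive fit calculation, not the vROps algorithm.

def vms_that_fit(remaining, profile):
    """Whole VMs of a given profile that fit in the remaining capacity.

    The scarcest resource decides the answer. A negative result means the
    cluster is already over by that many VMs of this profile.
    """
    return min(
        math.floor(remaining[resource] / need)
        for resource, need in profile.items()
        if need > 0
    )


# Hypothetical "standard SQL VM" profile.
sql_profile = {"cpu_ghz": 4.0, "memory_gb": 16.0, "disk_gb": 200.0}

healthy = {"cpu_ghz": 40.0, "memory_gb": 96.0, "disk_gb": 2000.0}
print(vms_that_fit(healthy, sql_profile))  # 6, limited by memory

overcommitted = {"cpu_ghz": 40.0, "memory_gb": -24.0, "disk_gb": 2000.0}
print(vms_that_fit(overcommitted, sql_profile))  # -2, already over
```

Flooring the negative case is a deliberate choice here: being 24 GB short with a 16 GB profile means you are over by two VMs' worth, not one.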
Knowing how much capacity you have left is incredibly useful, but how do you know when it is time to expand and purchase more hardware? The time remaining dashboard will look at your consumption trend over the last 30 days (configurable by policy) and calculate how much time you have before you run out of resources.
What’s great is that it includes a customizable provisioning time buffer. Let’s say it takes you 60 days to request new hardware, get a quote, submit the RFP, issue a PO, receive the hardware, and install it. You would set your provisioning time buffer to 60 days (or more), and when your remaining time dips below this threshold, your badge score will drop to 0, indicating it is time to expand. You can also set up alerts based on this.
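Here is a rough Python sketch of the trend-plus-buffer idea. I am assuming a plain straight-line (least squares) fit over daily usage samples, which is a stand-in and not necessarily how vROps models the trend; the function names and sample numbers are hypothetical.

```python
# Illustrative sketch only: a straight-line projection, assumed for the example.

def days_remaining(daily_usage, capacity):
    """Project days until usage reaches capacity from a linear trend.

    daily_usage holds one sample per day, oldest first (at least two samples).
    Returns float('inf') if usage is flat or shrinking.
    """
    n = len(daily_usage)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den  # growth per day
    if slope <= 0:
        return float("inf")
    return max(0.0, (capacity - daily_usage[-1]) / slope)


def time_to_expand(daily_usage, capacity, provisioning_buffer_days):
    """True once the projected time left dips below the provisioning buffer."""
    return days_remaining(daily_usage, capacity) < provisioning_buffer_days


# 30 days of made-up memory usage growing 2 GB per day, ending at 158 GB.
usage_gb = [100 + 2 * day for day in range(30)]
print(days_remaining(usage_gb, capacity=258))
print(time_to_expand(usage_gb, capacity=258, provisioning_buffer_days=60))
```

With these numbers the cluster has roughly 50 days left, which is less than the 60 days it takes to get hardware on the floor, so the purchase request should already be in flight.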
Throwing money at a problem might be fun, but it is not always the most responsible approach. The last stop on our tour de dashboards is reclaimable capacity.
This dashboard gives us a great high-level overview of what resources can be reclaimed. That is to say, what resources have been allocated to our VMs that they do not need. This dashboard, like the others, includes a badge score. This score is based on the resource with the highest percentage of reclaimable capacity. In this case, we can reclaim almost 92 GB of memory, or 41.7%. The badge score is rounded up to 42%. This is pretty exciting, considering the capacity remaining dashboard tells us we are out of memory! When it is time to take action, click on the “virtual machine reclaimable capacity” hyperlink highlighted above. This link will take you to an actionable list of what resources can be reclaimed for each of your virtual machines.
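The badge arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the provisioned totals plus the CPU and disk figures are assumptions I picked so that memory lands near the 41.7% from the example; they are not values from the dashboard.

```python
# Illustrative sketch only: totals are assumed so memory lands near 41.7%.

def reclaimable_badge(provisioned, reclaimable):
    """Return (resource, score) for the highest reclaimable percentage.

    The score is the percentage rounded to the nearest whole number.
    """
    pct = {r: 100.0 * reclaimable[r] / provisioned[r] for r in provisioned}
    top = max(pct, key=pct.get)
    return top, round(pct[top])


provisioned = {"cpu_vcpus": 120, "memory_gb": 220.0, "disk_gb": 4000.0}
reclaimable = {"cpu_vcpus": 30, "memory_gb": 91.7, "disk_gb": 600.0}
print(reclaimable_badge(provisioned, reclaimable))  # ('memory_gb', 42)
```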
You just saved your company a boatload of money! Go ahead and take the rest of the day off. You have earned it! Tell your boss that some guy on the internet said it’s okay!
This handful of dashboards will be sure to keep you busy, but they will also lead you down the path of having a super-efficient, well-managed vSphere environment. If you are new to vROps and don’t want to get your hands dirty in your production environment, then check out VMware’s Hands-on Labs today. http://labs.hol.vmware.com/HOL/catalogs/catalog/