Automated Diagnostics and Remediation

As a pilot project to speed up infrastructure troubleshooting, Diagnostics and Remediation empowers System Engineers to directly access diagnostic data and automate the remediation of on-prem and cloud resources. I led the design from refining the requirements through high-fidelity design delivery.

Role

Lead Product Designer

Progress

Design delivered, to be shipped

Date

Apr 2022 - Jul 2022

Overview

LogicMonitor is an infrastructure observability platform for System Engineers.

System Engineers use LogicMonitor to monitor IT infrastructure metrics and receive alerts when the data becomes abnormal. But receiving an alert is only the starting point of troubleshooting. After that, most teams still lose valuable time manually gathering logs and system data to locate the root cause, and then spend more time and attempts fixing the issue. Diagnostics and Remediation is the tool that streamlines this process and expedites troubleshooting.

The project started from a customer request from the field and a product requirements document from Ed (PM). The document highlighted three main use cases the product should support: administrative security control with role-based access, letting users deploy diagnostics and remediation scripts, and automated diagnostics and remediation at the time of alerts. In the following five months, I worked with Ed to reframe the design requirements in line with user habits and delivered the designs phase by phase. By the time it was shipped, users were able to install diagnostics and remediation scripts easily, run them manually and view the results, and view a complete record of the output captured while they were away from the tool. It was well received by the System Engineers who use it and achieved business success.

50%+

Increased account adoption in one month

Top 5

Largest customers subscribed to the product

30%

Reduction of the Mean Time to Repair (MTTR)

4.8 / 5.0

Degree of helpfulness (5 being most helpful)

Value proposition

Diagnostics and Remediation streamlines the path from problem to resolution with less manual effort and clearer alert context.

Business goals & metrics

Increase ARR (Annual Recurring Revenue)

Increase account adoption

Reduce time spent on troubleshooting

Through a quick conversation with Sales Engineers, a key proxy for customer pain points, I understood why Diagnostics and Remediation adds value to users' troubleshooting. Instead of forcing users to manually SSH into devices, write scripts, or rely on tribal knowledge to find specific code, it lets users access infrastructure data when they need it without writing any code. We transformed a fragmented, communication-heavy process into a streamlined, one-click experience.

More than that, the real power lies in automation: it captures transient data during flapping alerts that would otherwise be lost and ensures immediate remediation even when a user is offline. Remediation empowers the system to handle routine, repetitive tasks through automated workflows, while keeping the user firmly in control. By offloading these banal operations, we enable engineers to focus on high-impact strategic challenges rather than being sidelined by transient alerts.

Challenges

Before diving into high-fidelity execution, I identified several strategic hurdles that required resolution to ensure a meaningful design outcome.

Unclear user problem and needs

The initial scope in the PRD focused narrowly on reactive troubleshooting (triggered by alerts), but we had no idea how it would fit into users' troubleshooting journey or accommodate their habits. Details such as user friction points and the value proposition were missing too.

Ambiguity before design mockups

A significant disconnect existed between the high-level product documentation and concrete design solutions. Because the feature could require changes across multiple products, the execution team was not aligned on where those changes would land.

Technical constraints and limited resources

As a pilot project, it had limited front-end resources and a condensed timeline to prove product value. It was imperative to define a lean MVP and phase the design plans.

Design Process

Bridging the gap between PRD and high-fi mockups: Navigating and exploring design opportunities

To address these challenges, I ran contextual interviews with 5 users and held 2 meetings with subject matter experts to understand users' troubleshooting experience. The goal was to clarify how the product would speed up users' troubleshooting journey and whether they found the feature useful enough to make the investment worthwhile.

Following that, I led a user story mapping workshop with Ed (PM) and Kevin (Principal Designer on Edwin, LM's AI agent product) to clarify the end-to-end user flows and the specific changes needed in each product. To meet the tight timeline and demonstrate value for this pilot project, I worked with Ed to split the whole user flow into design phases and aligned with engineers on a weekly cadence. This kept development genuinely agile and ensured the product was delivered on time.

After the high-level flows were clarified and aligned with engineers, I started quick prototyping of the flows and layout, followed by high-fidelity mockup delivery and usability testing.

Design process from request to beta testing

AI tools speed up my design process.

When working on scalable tools that apply to many scenarios, where users bring some of their own creativity to how they use them, I was often asked: can you give me a more specific example of how this helps in troubleshooting?

Coming from a design background, I would normally turn to my engineering teammates with these questions. But given the time zone difference and the tight timeline, I used LLM tools instead to run quick research on online communities and compile a detailed use case. I also found them very helpful for generating diagrams with mock data to show that the data visualization and interaction help users spot an anomaly and dive into troubleshooting. In the real world, a detailed diagram with mock data is far more persuasive than a grey box reading "diagram".

I prompted ChatGPT with the framework of infrastructure troubleshooting stages from our research and asked it to draw on community discussions on Reddit and the LM community. After a few attempts and some editing, I had a detailed user story to share with stakeholders to show that diagnostics is helpful. It helped a lot when I presented my designs to the whole product team, and I received great feedback from the team.

Salvio, a Systems Engineer responsible for keeping infrastructure up and remediating issues, rolled out Diagnostics and Remediation Services across our environment. One of Salvio's main tasks is ensuring applications perform smoothly and that cpuBusyPercent stays below 95%. To speed up troubleshooting, Salvio set up automated diagnostics so that whenever a high cpuBusyPercent alert triggers, LM automatically collects the top CPU-consuming processes. From there, if a spike is caused by a runaway application, the system can trigger an automated remediation workflow to restart that process.
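To make Salvio's story concrete, here is a minimal Python sketch of the logic it describes: when cpuBusyPercent reaches the 95% threshold, collect the top CPU-consuming processes, and flag a runaway process as a restart candidate. It is an illustration only, not LogicMonitor's implementation. The `ProcessSample` type, the function names, the 80% runaway cutoff, and the mock process list are all assumptions made for this example.

```python
from dataclasses import dataclass

CPU_BUSY_THRESHOLD = 95.0  # percent, taken from the user story


@dataclass
class ProcessSample:
    # Hypothetical shape of one row of diagnostic data.
    pid: int
    name: str
    cpu_percent: float


def top_cpu_processes(samples, limit=3):
    """Return the `limit` processes using the most CPU, highest first."""
    return sorted(samples, key=lambda p: p.cpu_percent, reverse=True)[:limit]


def handle_cpu_alert(cpu_busy_percent, samples, runaway_cutoff=80.0):
    """Hypothetical alert hook: run the diagnostic, then suggest a remediation."""
    if cpu_busy_percent < CPU_BUSY_THRESHOLD:
        return {"alert": False, "diagnostic": [], "restart": None}
    top = top_cpu_processes(samples)
    # Assumption: a single process above the cutoff counts as a runaway application.
    runaway = top[0].name if top and top[0].cpu_percent >= runaway_cutoff else None
    return {"alert": True, "diagnostic": top, "restart": runaway}


# Mock data standing in for what a diagnostic script might collect.
samples = [
    ProcessSample(101, "nginx", 4.0),
    ProcessSample(202, "report-worker", 88.5),
    ProcessSample(303, "postgres", 3.2),
    ProcessSample(404, "sshd", 0.1),
]
result = handle_cpu_alert(97.0, samples)
print([p.name for p in result["diagnostic"]], result["restart"])
```

The sketch keeps the diagnostic (what is consuming CPU) separate from the remediation decision (what to restart), mirroring the separation between diagnostics and remediation in the product.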

Discovery

Here are the research findings and how they translate into product decisions:

Research: Users took both proactive and reactive troubleshooting paths.
Decision: Provide access to Diagnostics and Remediation on both Resources and Alerts.

Research: Users choose different diagnostics and remediation scripts depending on the actual scenario.
Decision: The product needs to be flexible enough to cover users' different scenarios.

Research: Users questioned how much effort the tool would actually save. They were concerned that they might end up on the server regardless.
Decision: Users make multiple attempts to find a root cause, so they should be able to run multiple diagnostics and remediations.

Research: What brings value to users is a history record of the data at the time of the alert.
Decision: Viewing the history record is important for troubleshooting and should be included in the initial product release.

Research: Sentiment score for the value proposition. Helpfulness: 4.5 / 5 (5 being most helpful).
Decision: The tool is helpful to users and offers potential business development opportunities.

With these research insights, the design goals became clear too: the product needed to be flexible, scalable, and intuitive. To measure the success of the design, I defined the usability testing metrics: task success rate, ease of completion, and degree of helpfulness.

User Story Mapping

Further closing the gap between research and design solutions

I led a workshop session with Ed (PM) and Kevin (Principal Designer on Edwin AI) to align on detailed use cases and the end-to-end user flow. We started with three top-level user needs and brainstormed the end-to-end user steps and tasks. The resulting map clearly showed which products needed changes. Along the way, the team gradually became clear about the interaction details that needed attention at each step, and the challenges became clearer and more specific.

Security control

As an LM Administrator who cares about keeping my infrastructure secure and performant, I need the ability to manage the availability of and access to automated Diagnostics & Remediation at a very granular level to limit unintentional or destructive commands running on my environment. I need control over which resources are available for automated scripts.

Configure automated diagnostics and remediation

As a System Engineer, I need to be able to select out-of-the-box diagnostic and remediation scripts that can be run against my resources to pull diagnostic data, and associate those scripts to specified resources, instances, and datapoints, so when those datapoint alert thresholds are met, my scripts will automatically run.

Run and view diagnostics and remediation

As a System Engineer, I need to be able to see automated diagnostic data as part of my troubleshooting workflow. This workflow includes LM alert details, LM Logs, and the enriched alert details in tickets we create in third-party ITSM platforms. I will need to see a history of diagnostic data associated with alerts that have triggered in the past.

The design requirements for security control and configuration were straightforward and clear after the workshop. In the exploration and delivery sections that follow, I focus mostly on the design decisions made for running and viewing diagnostics and remediation, and for automated diagnostics and remediation.
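The security control and configuration stories above can be read as two small checks: which scripts are associated with a resource and datapoint, and which kinds of script a role is allowed to run. The Python sketch below illustrates that reading under stated assumptions. The `Rule` fields, the role names, the `diskUsedPercent` datapoint, and the script names are hypothetical and do not reflect LogicMonitor's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    resource: str   # resource the script may run against
    datapoint: str  # datapoint whose alert triggers the script
    script: str     # name of the diagnostic or remediation script
    kind: str       # "diagnostic" or "remediation"


# Hypothetical role permissions: diagnostics and remediation are granted separately.
ROLE_PERMISSIONS = {
    "admin": {"diagnostic", "remediation"},
    "engineer": {"diagnostic"},
}

# Hypothetical associations configured by a System Engineer.
RULES = [
    Rule("web-01", "cpuBusyPercent", "top_cpu_processes", "diagnostic"),
    Rule("web-01", "cpuBusyPercent", "restart_process", "remediation"),
    Rule("db-01", "diskUsedPercent", "largest_directories", "diagnostic"),
]


def scripts_for_alert(resource, datapoint, rules=RULES):
    """Scripts that would run when this datapoint alerts on this resource."""
    return [r for r in rules if r.resource == resource and r.datapoint == datapoint]


def can_run(role, rule):
    """A role may run a script only if it is granted that kind of script."""
    return rule.kind in ROLE_PERMISSIONS.get(role, set())


matched = scripts_for_alert("web-01", "cpuBusyPercent")
print([r.script for r in matched])
print([r.script for r in matched if can_run("engineer", r)])
```

Keeping `kind` on each rule is what allows diagnostics and remediation to be permissioned independently, which matches the request for granular control in the security story.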

Ideation

Lo-fidelity mockups map out the high-level flows and help clarify technical constraints with PMs and engineers.

I aimed to address three major questions with rapid lo-fi mockups:

How might we enable users to select, run and view outputs of DiagnosticSources and RemediationSources?

At the device level, users can view a list of DiagnosticSources and RemediationSources. The list lets users view and execute scripts with one click and shows the latest execution results.

After aligning with the team, we concluded that we cannot predict which scripts users will run or what the results will look like. Instead of a table or other data visualization, I used a window that displays the raw text output.

Users expressed security control concerns during the interviews. To make sure users understand the DiagnosticSource that is about to be executed, they can view the script first.

Given that Resources has a complex interface and that users expect to compare history records, I used an expandable table to display past outputs.

I also explored some other options.

The tabs for diagnostics and remediation are separate, based on SME feedback that they would like separate access control for diagnostics and remediation.

I also explored options where the data is better visualized so that users could track changes over time or spot an anomaly immediately. But after aligning with engineers and the PM, we concluded that we cannot predict which scripts users will install, and the output data might not be numerical.

How might we enable users to run and view DiagnosticSources outputs at the time of alerts?

For the reactive troubleshooting flow, I compared two design options for displaying the diagnostics outputs on Alerts. We moved forward with Design Option 2.

Option 1 - Less dev effort
This option displays a record of DiagnosticSource outputs for the Resource, similar to the history tab on Resources. It requires less effort to build but makes users spend extra time filtering for the diagnostic outputs captured at the time of the alert. It is also difficult for users to manually run a DiagnosticSource without leaving the current interface.

✅ Option 2 - Display of outputs at the time of alerts
When users dive deep into an alert, they want to view the diagnostic outputs directly relevant to the alert context. Option 2 focuses on showing those outputs so that users can access that information immediately. On top of that, they can manually run diagnostics with a click to see the current results.

How might we help users understand whether the RemediationSources work or not?

Rather than copying the display of DiagnosticSource outputs, I argued that the outputs of RemediationSources matter less than those of diagnostics. What users care about is whether the remediation fixes the issue. From my conversations with engineers, there was no convenient way to validate whether a remediation worked. Instead, I proposed visualizing remediation actions on the alert chart. By observing each attempt against the data, users can understand whether the actions worked.

The ideation was developed along with the design and development phases of this project. After aligning with stakeholders, I delivered the high-fidelity designs.

Delivery

Manually run a DiagnosticSource

For the proactive troubleshooting flow, where users routinely run diagnostics as a health check, users run diagnostics at the device level on Resources. Because what matters most to users is the diagnostic output, I dedicated the main viewport to it. In addition, users can check the script by clicking Script preview before running the diagnostics.

Landing page of Synthetics step- a stacked area chart with a detailed table

The design files also include the different states: error, success, in progress, and empty.

❌ I tried to include a "brake" function in the design. But after aligning with engineers, it turned out not to be feasible to implement.

✅ As an alternative, I added "friction" to the user's action. After they click to run a script, a pop-up window asks them to confirm the action.

View previous diagnostics outputs

Users are able to view the history of a single DiagnosticSource. History helps users understand the baseline and spot any anomaly. Because it sits inside a complex product structure, I used collapsed rows to help users navigate the records while still letting them view the outputs. Once users locate the record they are interested in, they can use Fullscreen to view the output in detail.

User could click on a test executed to view detailed charts

Automated Diagnostics at the time of alerts

Users can view automated diagnostic outputs after they configure their Diagnostic and Remediation rules. Like Resources, Alerts has a complex information architecture. To give users what they want to see immediately, namely the diagnostic outputs at the time of the alert, I changed the design to tiles of widgets that display the information that matters most, instead of a list of DiagnosticSources that requires users to select a source, open its history, and view outputs, among other steps, to find the data.

The interface when users zoom into a certain period of time

Validate the effects of Remediation

While the outputs of diagnostics matter most to users, the use case for remediation is different: what users care about most is whether the remediation actions are effective on their alerts. To visualize the effects of Remediation, I added Remediation timestamps on Alert charts.
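The reasoning a user applies when reading a Remediation timestamp on the alert chart can be sketched as a before-and-after comparison around that marker. The Python example below is an illustration of that reading only. The function name, the three-reading window, and the mock cpuBusyPercent values are assumptions, and the shipped design leaves this judgment to the user looking at the chart rather than computing it.

```python
from statistics import mean


def remediation_effect(series, remediated_at, threshold, window=3):
    """Compare a datapoint before and after a remediation timestamp.

    series: list of (timestamp, value) pairs sorted by timestamp.
    Returns the mean of the last `window` readings before the remediation,
    the mean of the first `window` readings from the remediation onward,
    and whether the after-mean is back under the alert threshold.
    """
    before = [v for t, v in series if t < remediated_at][-window:]
    after = [v for t, v in series if t >= remediated_at][:window]
    if not before or not after:
        return None  # not enough data on one side of the marker
    before_avg, after_avg = mean(before), mean(after)
    return {
        "before": before_avg,
        "after": after_avg,
        "recovered": after_avg < threshold,
    }


# Mock cpuBusyPercent readings, one per minute; the remediation ran at minute 4.
readings = [(0, 96), (1, 97), (2, 98), (3, 99), (4, 60), (5, 42), (6, 39)]
effect = remediation_effect(readings, remediated_at=4, threshold=95)
print(effect)
```

Placing the timestamp on the same chart as the datapoint gives users this comparison visually, without the product having to decide on its own whether a remediation "worked".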

Reflections

Strong add-on to LogicMonitor's AI products

The design could be bolder. We delivered MVP designs of this feature at the time, but we should prioritize its integration with Edwin, LM's AI agent. The data could help Edwin find out more about the root cause, and with Edwin, the feature could adopt truly AI-powered automation. Users' security concerns and the benefits of AI automation could be balanced by presenting users with the options and collecting their consent before moving forward.
Meanwhile, we could make the diagnostics outputs more accessible to users. Restricted by the timeline and development resources, we could not structure the diagnostics outputs beyond raw text. But with the development of LLMs, we could give users quick highlights of the outputs more easily.

Improvements on the access to this feature

We received a lot of constructive feedback from Beta testing: a Dashboard widget for Remediation outputs and results, tracking Diagnostics & Remediations at both the Resource and Group level, visual indicators for Diagnostics and Remediation in the alert system, and tracking diagnostic activities and building timelines for incident reviews.
When working on the MVP features, I focused heavily on enabling users to run diagnostics and remediation. Looking at the whole product ecosystem, we could fit D&R better into the product and the whole troubleshooting journey, for example by integrating it into Dashboard and providing access at the group level so users can more easily find the resources that have Diagnostics and Remediation features.

Keep it agile and involve cross-team communication early

Success on this project was driven by high-frequency communication and ruthless prioritization. By integrating the AI and engineering teams into the early design phase, we synchronized our workflows and avoided the common pitfalls of siloed development. This collaborative approach transformed a complex set of requirements into a manageable, phased release that met our deadline without sacrificing design integrity.