As a pilot project for extending infrastructure troubleshooting capabilities, Diagnostics and Remediation aims to let System Engineers directly access the data they need and automate remediation on their on-prem or cloud infrastructure. I spearheaded the design from scratch through high-fidelity mockup delivery. The product reduced users' mean time to repair by over 30% and increased user adoption by over 50% per month.
Lead Product Designer
Design delivered, to be shipped
Apr 2022 - Jul 2022
As an observability platform, LogicMonitor (LM) continuously collects health metrics from monitored infrastructure, whether on-prem or cloud resources, at a regular cadence: CPU usage, memory, storage, network traffic, and more. These data give our main user persona, System Engineers, insight into whether devices are functioning normally, and trigger alerts when monitored datapoints cross configured thresholds.
However, receiving an alert is just a starting point, not the moment System Engineers call the job done. Through user interviews and input from our Sales team, we learned that System Engineers dive deeper into infrastructure metrics to find root causes. Point-in-time metrics and data are vital for root-cause analysis, but they are too expensive and noisy to include among the datapoints monitored on a regular cadence. In addition, knowing how to tap into more data and run fixes for frequently encountered issues is shared as tribal knowledge among our users. From there, the product idea of Diagnostics and Remediation grew out of a challenge: How might we streamline gathering more data and running solutions at the moment certain infrastructure issues occur?
Diagnostics and Remediation brings value to our System Engineers, both power users and novice users.
For novice users:
It skips the code.
Users don't need to ask colleagues for scripts or specialized knowledge to address common issues. It reduces their effort from a time-consuming process of analysis and communication to a quick, one-click way to run the remediation.
For power users:
Today, users cannot view the data captured at the time of an alert, especially once the alert has cleared because the triggering metric dropped back to normal. That deprives them of the opportunity to observe infrastructure health and pinpoint root causes.
Remediation saves users from repetitive, trivial fixes in cases where they already know a certain solution works. With Remediation, the LM system takes care of these routine tasks with automated solutions that stay under users' control. Users can then focus on the "real" challenges and avoid being woken up by alerts that automation could have addressed.
Boost revenue by including this new feature in LM's Advanced and Signature offerings
Improve user adoption of this feature
Reduce the time spent on troubleshooting
Intuitive, scalable, and flexible
Usability testing: success rate, ease of completion, and helpfulness
Before diving into high-fidelity execution, I identified several strategic hurdles that required resolution to ensure a meaningful design outcome:
Defining the Troubleshooting Lifecycle:
The initial scope focused narrowly on reactive troubleshooting (triggered by alerts). However, stakeholder feedback revealed a critical need for proactive monitoring. I realized the solution needed to span multiple LogicMonitor products and accommodate distinct permission sets for administrative and read-only roles. To learn how the product would fit into users' troubleshooting journeys and accommodate their habits, we conducted user research through contextual interviews.
Validating Value Propositions:
While the project originated from sales feedback, the core user and business value remained unverified. To move forward with confidence, I advocated for deeper business intelligence to transform a "feature request" into a validated product strategy.
Bridging the Ambiguity Gap: A significant disconnect existed between the high-level product documentation and concrete use cases. To uncover design opportunities that truly resonate with users, I needed to define the end-to-end user flow and address potential user friction points that hadn't yet been explored.
Balancing Ambition with Technical Constraints: With limited front-end resources and a condensed timeline, it was imperative to define a lean MVP. My focus shifted to identifying the highest-impact features that would provide a viable feedback loop for our beta customers.
To resolve these unknowns, I led a two-week discovery sprint. This included conducting contextual interviews to ground our decisions in user reality, followed by a competitive analysis and a user story mapping workshop to align the cross-functional team on a unified vision.
Collaborating with Carmelo Ayala (Principal UX Researcher), I conducted contextual interviews along with concept testing with 5 customers, to understand users' troubleshooting journeys and explore whether they found this feature helpful.
Key insights:
Users took proactive and reactive troubleshooting paths.
Users want to avoid alert fatigue and focus on the actual issue.
What brings value to users is a historical record of the data captured at the time of an alert.
Users have concerns about whether the tool will actually save them effort; they worry they may end up logging into the server regardless.
Helpfulness score: 4.5 out of 5 (5 being most helpful)
These insights translated into major product decisions:
We will provide access to Diagnostics and Remediation on both Resources and Alerts.
Users make multiple attempts to find root causes, so they should be able to run multiple Diagnostics and Remediations.
Being able to view historical records is also important for troubleshooting and should be included in the initial product release.
Collaborating with Ed (PM) and Kevin (Principal Designer on LM's AI product), I led a workshop session to break down the use cases across users' troubleshooting journey. I used the three major use cases from the PM as the starting point for user activities and prepared the framework for this user story mapping session.
As a System Engineer, I need to be able to select out-of-the-box diagnostic and remediation scripts that can be run against my resources to pull diagnostic data, and associate those scripts to specified resources, instances, and datapoints, so when those datapoint alert thresholds are met, my scripts will automatically run.
As a Monitoring Engineer / IT Operator, I need to be able to see automated diagnostic data as part of my troubleshooting workflow. This workflow includes LM alert details, LM Logs, and enriched alert details in the tickets we create in third-party ITSM platforms. I will need to see a history of diagnostic data associated with alerts that have triggered in the past.
As an LM Administrator who cares about keeping my infrastructure secure and performant, I need the ability to manage the availability of and access to automated Diagnostics & Remediation at a very granular level, to limit unintentional or destructive commands running in my environment. I need control over which resources are available for automated scripts.
From there we listed out the end-to-end user steps and tasks. As we worked through this, the interaction details to pay attention to at each step, and the design goals to achieve, gradually became clear. The requirements were now much sharper.

When working on flexible, scalable tools that can help in many scenarios, where users supply part of the creativity in how they are used, I was repeatedly asked: can you give me a more specific example of how this helps in troubleshooting?
Coming from a design background, I would normally turn to my engineering teammates for these questions. But given the time zone difference and a tight timeline, I used LLM tools instead to run quick research across online communities and compile a very detailed use case. I also found them extremely helpful for generating charts with mock data to show that the data visualization and interaction help users spot an anomaly and dive into troubleshooting. In the real world, a detailed chart with mock data is far more persuasive than a grey box reading "diagram".
I prompted ChatGPT with the framework of infrastructure troubleshooting stages from our research and asked it to draw on community discussions on Reddit and in the LM Community. After a few attempts and some editing, I had a very detailed user story to share with stakeholders showing how diagnostics help. It made a real difference when presenting my designs to the whole product team, and I received great feedback. So here is the story:
Salvio, a Systems Engineer responsible for keeping infrastructure up and remediating issues, rolled out Diagnostics and Remediation across his environment. One of his main tasks is ensuring applications perform smoothly and that CPUBusyPercent stays below 95%. To speed up troubleshooting, Salvio set up automated diagnostics so that whenever a high CPUBusyPercent alert triggers, LM automatically collects the top CPU-consuming processes. From there, if the spike is caused by a runaway application, the system can trigger an automated remediation workflow to restart that process.
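To make Salvio's workflow concrete, here is a minimal sketch of what such a paired diagnostic and remediation script could look like on the monitored host. It assumes a Linux host with Python and psutil available; the 95% threshold mirrors the story above, while the process name my_app and the systemctl restart command are hypothetical illustrations, not LM's actual DiagnosticSource or RemediationSource content.

```python
# Sketch only: a diagnostic step (capture top CPU consumers) paired with a
# remediation step (restart a known runaway service). Assumes Linux + psutil;
# "my_app" and the systemctl command are illustrative, not LM-provided scripts.
import subprocess
import psutil

CPU_ALERT_THRESHOLD = 95.0   # mirrors the CPUBusyPercent alert threshold in the story
RUNAWAY_PROCESS = "my_app"   # hypothetical application known to misbehave

def collect_top_cpu_processes(limit=5):
    """Diagnostic: snapshot the top CPU-consuming processes at the time of the alert."""
    snapshot = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            snapshot.append((proc.cpu_percent(interval=0.1), proc.info["pid"], proc.info["name"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return sorted(snapshot, reverse=True)[:limit]

def remediate_if_runaway(top_processes):
    """Remediation: restart the known runaway application if it tops the list."""
    _, _, name = top_processes[0]
    if name == RUNAWAY_PROCESS:
        subprocess.run(["systemctl", "restart", RUNAWAY_PROCESS], check=False)

if __name__ == "__main__":
    if psutil.cpu_percent(interval=1) >= CPU_ALERT_THRESHOLD:
        top = collect_top_cpu_processes()
        print("Top CPU consumers (cpu%, pid, name):", top)
        remediate_if_runaway(top)
```

In the product, the equivalent of the diagnostic step would run automatically when the alert fires, and the restart step would only run as a remediation under the user's control.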
Low-fidelity mockups map out the high-level flows and help secure initial alignment with PMs and engineers. There are four major questions I aimed to address with lo-fi rapid prototypes:
1. HMW enable users to select DiagnosticSources and view outputs?
2. HMW enable users to view DiagnosticSources outputs at the time of alerts?
3. HMW present DiagnosticSources and RemediationSources to users?
4. HMW help users understand whether the RemediationSources work or not?









For the proactive troubleshooting flow, where users routinely run diagnostics as a health check, diagnostics run at the device level on Resources. Since what matters most to users is the diagnostic output, I dedicated the main viewport to showing it. In addition, users can review the script by clicking Script preview before running the diagnostics.

Users can view the history of a single DiagnosticSource. History helps users understand the baseline and spot anomalies. Because it sits within a complex product structure, I used collapsed rows to help users navigate the records while still being able to view the outputs. Once users locate a record they are interested in, they can use Fullscreen to view its output in detail.

Users can view automated diagnostic outputs once they have configured their Diagnostics and Remediation rules. Like Resources, Alerts has a complex information architecture. To give users what they want to see immediately, i.e. the diagnostic outputs at the time of the alert, the design shifted to tiles of widgets displaying the information that matters most, instead of a list of DiagnosticSources that would require selecting, opening history, and viewing outputs before users could find the data.

While diagnostic outputs matter most to users, the use case for remediation is different: what users care about most is whether the remediation actions actually resolved their alerts. To visualize the effect of a Remediation, I added Remediation timestamps to the Alert charts.

The design could be bolder. We delivered MVP designs of this feature at the time, but we should prioritize its integration with LM's AI agent, Edwin. The data could not only help Edwin dig further into root causes; with Edwin, the feature could also adopt truly AI-powered automation. Users' security concerns and the benefits of AI automation could be balanced by presenting the options and collecting users' consent before moving forward.
Meanwhile, we could make the diagnostic outputs more accessible. Constrained by the timeline and development resources, we could only present the diagnostic outputs as raw text. But with the development of LLMs, we could give users quick highlights of the outputs much more easily.
We received a lot of constructive feedback from Beta testing: a Dashboard widget for Remediation outputs and results, tracking Diagnostics & Remediation at both the Resource and Group level, visual indicators for Diagnostics and Remediation in the alert system, and tracking diagnostic activities to build timelines for incident reviews.
When working on the MVP features, I focused heavily on enabling users to run diagnostics and remediation. Looking at the broader product ecosystem, we could fit D&R better into the product and the full troubleshooting journey, for example by integrating it into Dashboards and providing access at the group level so users can more easily find the resources with Diagnostics and Remediation enabled.
Success on this project was driven by high-frequency communication and ruthless prioritization. By integrating the AI and engineering teams into the early design phase, we synchronized our workflows and avoided the common pitfalls of siloed development. This collaborative approach transformed a complex set of requirements into a manageable, phased release that met our deadline without sacrificing design integrity.


Get in touch for more details 📭