Simple Science

Cutting edge science explained simply


ARCHER2 Supercomputer: Monitoring Success

ARCHER2's monitoring system ensures seamless operation for researchers in Edinburgh.

― 6 min read


Examining the success of ARCHER2's monitoring system.

ARCHER2 is a powerful supercomputer located in Edinburgh, designed to assist researchers with their calculations and simulations. It has an impressive 750,080 CPU cores spread across 5,860 nodes, which allows it to run very large computations quickly. The machine entered full service in December 2021 after a lengthy setup process complicated by the COVID-19 pandemic.

A critical part of putting ARCHER2 into service was the monitoring system. This system helps ensure everything runs smoothly by regularly checking the computer's health and performance. Given that ARCHER2 was one of the first supercomputers to use HPE Cray EX technology, setting up the monitoring took careful planning and collaboration with HPE.

Deployment Challenges

The deployment of ARCHER2 faced several challenges. Originally, the plan was to shut down the previous ARCHER system in February 2020 and start using ARCHER2 in May of the same year. However, issues with technology development and the pandemic led to delays. Instead of launching the complete system all at once, a smaller, 4-cabinet version was deployed first in July 2020. This version allowed users to begin testing while the full system was being prepared.

Eventually, in February 2021, the full 23-cabinet ARCHER2 was delivered, and by November it was available to all users. Automated monitoring was integrated into the deployment from the start of this period so that problems could be spotted and addressed quickly.

Monitoring Overview

The monitoring system used for ARCHER2 is based on Checkmk. This tool allows the team at Edinburgh to see the health of all aspects of the supercomputer from one central location. Before Checkmk, monitoring required checking multiple systems manually, which was time-consuming and complicated.

With Checkmk, various checks can be set up to monitor the system's status, performance metrics, and any critical errors. This means that if something goes wrong, the team can be alerted immediately. Over time, the system has been fine-tuned to meet specific needs, including checks for particular hardware and software components.

Key Components of the Monitoring System

Checkmk and Graphite

Checkmk is a monitoring tool that allows teams to determine how well systems are operating. It tracks vital statistics about power usage, memory, and system load among other things. Graphite is used alongside Checkmk to create visual representations of the data, making it easier to understand trends and anomalies.

The data collected is continuously fed into a database where it can be analyzed, graphed, and displayed on dashboards. This ensures that all stakeholders have access to the information they need in real-time.
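To give a flavour of how metrics reach a system like Graphite (the paper does not show its own code), the sketch below uses Graphite's standard plaintext protocol, which accepts `metric.path value timestamp` lines on TCP port 2003. The host name and metric path are placeholders, not ARCHER2's actual configuration.

```python
import socket
import time
from typing import Optional

# Hypothetical Graphite host; the plaintext listener defaults to TCP port 2003.
GRAPHITE_HOST = "graphite.example.org"
GRAPHITE_PORT = 2003

def send_metric(path: str, value: float, timestamp: Optional[int] = None) -> None:
    """Send one sample to Graphite using the plaintext protocol:
    '<metric.path> <value> <unix timestamp>\n'."""
    timestamp = timestamp or int(time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example: record the power draw reported for one (hypothetical) cabinet.
send_metric("archer2.power.cabinet01.watts", 52500.0)
```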

Special Checks

One strength of Checkmk is how easily it allows the team to create new checks for monitoring. For instance, custom checks have been developed to track health statuses of specific servers, monitor job statuses, and even check for issues with the network that carries data.

These special checks have proven useful for maintaining the performance of ARCHER2, helping to identify problems early. When an issue arises, the monitoring team can quickly access the relevant data to diagnose and fix the problem.
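The summary does not reproduce any of these checks, but a minimal Checkmk "local check" gives the general flavour: a script on the monitored host prints one line per service in the form `<status> <service_name> <metrics or '-'> <summary>`, and Checkmk picks it up on its next poll. The service name and thresholds below are purely illustrative.

```python
#!/usr/bin/env python3
"""Illustrative Checkmk local check: report root filesystem usage.

Local checks print one line per service:
    <status> <service_name> <metrics or '-'> <summary text>
where status is 0 (OK), 1 (WARN), 2 (CRIT) or 3 (UNKNOWN).
"""
import shutil

usage = shutil.disk_usage("/")
percent_used = usage.used / usage.total * 100

# Illustrative thresholds; a production check would take these from configuration.
if percent_used >= 95:
    status = 2
elif percent_used >= 85:
    status = 1
else:
    status = 0

print(f"{status} Filesystem_root used_percent={percent_used:.1f} "
      f"{percent_used:.1f}% of / in use")
```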

Implementation of Monitoring During ARCHER2 Setup

Power Monitoring

One critical area of monitoring is the power consumption of ARCHER2. The system uses a significant amount of power, so it’s vital to keep track of its usage to ensure everything operates within design limits. Data is collected from rectifiers that supply power, providing readings every five seconds.

This information is displayed in real-time graphs, allowing the team to see how much power each cabinet is using and to monitor the overall power draw. Such detailed tracking helps manage the system's energy demands effectively.
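As an illustration of the kind of aggregation involved (the actual pipeline on ARCHER2 differs in detail), per-rectifier readings arriving every few seconds can be summed into per-cabinet totals before being graphed. The rectifier naming scheme and sample values here are invented for the example.

```python
from collections import defaultdict
from typing import Iterable, Tuple

def cabinet_power(readings: Iterable[Tuple[str, float]]) -> dict:
    """Sum rectifier power readings (watts) into per-cabinet totals.

    Each reading is (rectifier_id, watts); for this sketch the cabinet is
    assumed to be encoded in the rectifier id, e.g. 'c01-r3' belongs to 'c01'.
    """
    totals = defaultdict(float)
    for rectifier_id, watts in readings:
        cabinet = rectifier_id.split("-")[0]
        totals[cabinet] += watts
    return dict(totals)

# Hypothetical five-second sample from four rectifiers in two cabinets.
sample = [("c01-r1", 26100.0), ("c01-r2", 25900.0),
          ("c02-r1", 24800.0), ("c02-r2", 25300.0)]
per_cabinet = cabinet_power(sample)
total_draw = sum(per_cabinet.values())
print(per_cabinet, f"total={total_draw:.0f} W")
```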

Node State Monitoring

Tracking the state of nodes, the individual servers that make up the machine, is another essential aspect of the monitoring system. This means keeping an eye on which nodes are functioning well and which ones may be experiencing issues. By using the Slurm scheduler, a popular tool for managing resources in supercomputers, the monitoring system can report on the status of all compute nodes.

This information is collected automatically and helps the team maintain high availability for users by quickly identifying nodes that are "down" and addressing the issues.
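A hedged sketch of how such a check might query Slurm: `sinfo` can summarise node counts by state, and that output can be turned into a Checkmk-style status line. The real ARCHER2 check is not shown in the summary, so the threshold and service name here are illustrative.

```python
#!/usr/bin/env python3
"""Illustrative node-state check: ask Slurm for node counts per state via sinfo."""
import subprocess

# '-h' suppresses the header; '%D %T' prints node count and state per line.
out = subprocess.run(
    ["sinfo", "-h", "-o", "%D %T"],
    capture_output=True, text=True, check=True,
).stdout

counts = {}
for line in out.splitlines():
    number, state = line.split()
    counts[state] = counts.get(state, 0) + int(number)

down = sum(n for state, n in counts.items()
           if state.startswith(("down", "drain", "fail")))
total = sum(counts.values())

# Emit a Checkmk local-check line: WARN if any node is unavailable (illustrative rule).
status = 1 if down else 0
print(f"{status} Slurm_nodes down={down};1 {down} of {total} nodes unavailable")
```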

Login Availability Monitoring

Ensuring users can access ARCHER2 is key to its operation. A specific check was created to monitor login availability by testing access at regular intervals. This involved setting up a test user account that could only be accessed from the monitoring server. The system checks the ability to log in and reports any failures immediately.
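The summary describes this check only at a high level; a minimal version of the idea is to attempt a non-interactive SSH login with the dedicated test account and report success or failure. The host name and account below are placeholders, not the real service details.

```python
#!/usr/bin/env python3
"""Illustrative login-availability check: try a non-interactive SSH login."""
import subprocess

LOGIN_HOST = "login.archer2.example"   # placeholder, not the real address
TEST_USER = "monitoring-test"          # placeholder test account

result = subprocess.run(
    ["ssh",
     "-o", "BatchMode=yes",        # never prompt for a password
     "-o", "ConnectTimeout=15",
     f"{TEST_USER}@{LOGIN_HOST}",
     "true"],                      # run a trivial command and exit
    capture_output=True, text=True,
)

status = 0 if result.returncode == 0 else 2
detail = ("login succeeded" if status == 0
          else f"login failed: {result.stderr.strip()[:80]}")
print(f"{status} Login_availability - {detail}")
```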

Impact of Monitoring on ARCHER2 Deployment

The initial setup and testing phases of ARCHER2 were significantly aided by the monitoring systems in place. For example, the team encountered various problems with internal and external domain name systems (DNS). With monitoring in place, they were quickly alerted to these issues, allowing them to investigate and fix them promptly.

Monitoring also proved beneficial when testing the high-performance Linpack (HPL) benchmarks. During these tests, issues related to power cycling (where power use unexpectedly dropped) were spotted quickly, allowing the team to identify and address faulty nodes.

In its successful runs, ARCHER2 achieved impressive benchmark results, ultimately ranking 22nd on the Top500 list of supercomputers with a measured performance of 19.5 PFlop/s.

Automated Monitoring for Contractual Obligations

To meet contractual obligations with research funding bodies, a system was developed to automate the monitoring of essential metrics like node availability and overall service performance. Data collected by the monitoring tools is compiled and made available for reporting. This allows project managers to generate comprehensive reports on the system's availability for audits and evaluations.
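The paper describes compiling availability figures from the monitoring data; a toy calculation of that kind, assuming simple node-hour bookkeeping rather than the service's actual accounting rules, might look like this. The downtime figure in the example is invented.

```python
def availability_percent(total_nodes: int, period_hours: float,
                         downtime_node_hours: float) -> float:
    """Toy availability figure: fraction of scheduled node-hours actually delivered.

    This ignores planned maintenance windows and any contractual weighting,
    which a real report would have to account for.
    """
    scheduled = total_nodes * period_hours
    delivered = scheduled - downtime_node_hours
    return 100.0 * delivered / scheduled

# Example: 5,860 nodes over a 30-day month with 1,200 node-hours lost to failures.
print(f"{availability_percent(5860, 30 * 24, 1200):.3f}%")
```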

Real-time graphs showing node availability and service performance are accessible to relevant stakeholders, providing transparency and assurance that the system is functioning as intended.

Future Developments in Monitoring

As ARCHER2 moves forward, plans are in place to enhance monitoring capabilities. This includes the introduction of new tools for log analysis, deeper insights into error reporting, and per-job statistics. These developments aim to increase the usability and functionality of the monitoring system.

Additionally, making monitoring data more accessible to users will help encourage a collaborative approach to system management and troubleshooting.

Conclusion

In summary, the deployment of ARCHER2 and its monitoring system showcases a well-planned strategy that combines technology and teamwork. By using tools like Checkmk and Graphite, the team at Edinburgh has created a robust environment that supports high-level research activities.

The continuous monitoring of system health and performance not only improves service reliability but also ensures that all users can access and utilize the supercomputer effectively. As the system matures, ongoing enhancements and adaptations to the monitoring strategy will play an integral role in its success.

Original Source

Title: Automated service monitoring in the deployment of ARCHER2

Abstract: The ARCHER2 service, a CPU based HPE Cray EX system with 750,080 cores (5,860 nodes), has been deployed throughout 2020 and 2021, going into full service in December of 2021. A key part of the work during this deployment was the integration of ARCHER2 into our local monitoring systems. As ARCHER2 was one of the very first large-scale EX deployments, this involved close collaboration and development work with the HPE team through a global pandemic situation where collaboration and co-working was significantly more challenging than usual. The deployment included the creation of automated checks and visual representations of system status which needed to be made available to external parties for diagnosis and interpretation. We will describe how these checks have been deployed and how data gathered played a key role in the deployment of ARCHER2, the commissioning of the plant infrastructure, the conduct of HPL runs for submission to the Top500 and contractual monitoring of the availability of the ARCHER2 service during its commissioning and early life.

Authors: Kieran Leach, Philip Cass, Steven Robson, Eimantas Kazakevicius, Martin Lafferty, Andrew Turner, Alan Simpson

Last Update: 2023-03-21 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2303.11731

Source PDF: https://arxiv.org/pdf/2303.11731

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
