Simple Science

Cutting edge science explained simply


ARCHER2 Supercomputer: Monitoring Success

ARCHER2's monitoring system ensures seamless operation for researchers in Edinburgh.

― 6 min read


Examining the success of ARCHER2's monitoring system.

ARCHER2 is a powerful supercomputer located in Edinburgh, designed to assist researchers with their calculations and simulations. It has an impressive 750,080 CPU cores spread across 5,860 nodes, which allows it to run very large computations quickly. The machine entered full service in December 2021 after a lengthy setup process complicated by the COVID-19 pandemic.

A critical part of putting ARCHER2 into service was the monitoring system. This system helps ensure everything runs smoothly by regularly checking the computer's health and performance. Given that ARCHER2 was one of the first supercomputers to use HPE Cray EX technology, setting up the monitoring took careful planning and collaboration with HPE.

Deployment Challenges

The deployment of ARCHER2 faced several challenges. Originally, the plan was to shut down the previous ARCHER system in February 2020 and start using ARCHER2 in May of the same year. However, issues with technology development and the pandemic led to delays. Instead of launching the complete system all at once, a smaller, 4-cabinet version was deployed first in July 2020. This version allowed users to begin testing while the full system was being prepared.

Eventually, in February 2021, the full 23-cabinet ARCHER2 was delivered, and by November it was available to all users. Automated monitoring was integrated into the deployment from the start of this period so that problems could be spotted and addressed quickly.

Monitoring Overview

The monitoring system used for ARCHER2 is based on Checkmk. This tool allows the team at Edinburgh to see the health of all aspects of the supercomputer from one central location. Before Checkmk, monitoring required checking multiple systems manually, which was time-consuming and complicated.

With Checkmk, various checks can be set up to monitor the system's status, performance metrics, and any critical errors. This means that if something goes wrong, the team can be alerted immediately. Over time, the system has been fine-tuned to meet specific needs, including checks for particular hardware and software components.

Key Components of the Monitoring System

Checkmk and Graphite

Checkmk is a monitoring tool that allows teams to determine how well systems are operating. It tracks vital statistics about power usage, memory, and system load among other things. Graphite is used alongside Checkmk to create visual representations of the data, making it easier to understand trends and anomalies.

The data collected is continuously fed into a database where it can be analyzed, graphed, and displayed on dashboards. This ensures that all stakeholders have access to the information they need in real-time.
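To give a flavour of how metrics reach a system like Graphite (the paper does not show its own code), the sketch below uses Graphite's standard plaintext protocol, which accepts `metric.path value timestamp` lines on TCP port 2003. The host name and metric path are placeholders, not ARCHER2's actual configuration.

```python
import socket
import time
from typing import Optional

# Hypothetical Graphite host; the plaintext listener defaults to TCP port 2003.
GRAPHITE_HOST = "graphite.example.org"
GRAPHITE_PORT = 2003

def send_metric(path: str, value: float, timestamp: Optional[int] = None) -> None:
    """Send one sample to Graphite using the plaintext protocol:
    '<metric.path> <value> <unix timestamp>\n'."""
    timestamp = timestamp or int(time.time())
    line = f"{path} {value} {timestamp}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Example: record the power draw reported for one (hypothetical) cabinet.
send_metric("archer2.power.cabinet01.watts", 52500.0)
```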

Special Checks

One strength of Checkmk is how easily it allows the team to create new checks for monitoring. For instance, custom checks have been developed to track health statuses of specific servers, monitor job statuses, and even check for issues with the network that carries data.

These special checks have proven useful for maintaining the performance of ARCHER2, helping to identify problems early. When an issue arises, the monitoring team can quickly access the relevant data to diagnose and fix the problem.
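The summary does not reproduce any of these checks, but a minimal Checkmk "local check" gives the general flavour: a script on the monitored host prints one line per service in the form `<status> <service_name> <metrics or '-'> <summary>`, and Checkmk picks it up on its next poll. The service name and thresholds below are purely illustrative.

```python
#!/usr/bin/env python3
"""Illustrative Checkmk local check: report root filesystem usage.

Local checks print one line per service:
    <status> <service_name> <metrics or '-'> <summary text>
where status is 0 (OK), 1 (WARN), 2 (CRIT) or 3 (UNKNOWN).
"""
import shutil

usage = shutil.disk_usage("/")
percent_used = usage.used / usage.total * 100

# Illustrative thresholds; a production check would take these from configuration.
if percent_used >= 95:
    status = 2
elif percent_used >= 85:
    status = 1
else:
    status = 0

print(f"{status} Filesystem_root used_percent={percent_used:.1f} "
      f"{percent_used:.1f}% of / in use")
```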

Implementation of Monitoring During ARCHER2 Setup

Power Monitoring

One critical area of monitoring is the power consumption of ARCHER2. The system uses a significant amount of power, so it’s vital to keep track of its usage to ensure everything operates within design limits. Data is collected from rectifiers that supply power, providing readings every five seconds.

This information is displayed in real-time graphs, allowing the team to see how much power each cabinet is using and to monitor the overall power draw. Such detailed tracking helps manage the system's energy demands effectively.
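As an illustration of the kind of aggregation involved (the actual pipeline on ARCHER2 differs in detail), per-rectifier readings arriving every few seconds can be summed into per-cabinet totals before being graphed. The rectifier naming scheme and sample values here are invented for the example.

```python
from collections import defaultdict
from typing import Iterable, Tuple

def cabinet_power(readings: Iterable[Tuple[str, float]]) -> dict:
    """Sum rectifier power readings (watts) into per-cabinet totals.

    Each reading is (rectifier_id, watts); for this sketch the cabinet is
    assumed to be encoded in the rectifier id, e.g. 'c01-r3' belongs to 'c01'.
    """
    totals = defaultdict(float)
    for rectifier_id, watts in readings:
        cabinet = rectifier_id.split("-")[0]
        totals[cabinet] += watts
    return dict(totals)

# Hypothetical five-second sample from four rectifiers in two cabinets.
sample = [("c01-r1", 26100.0), ("c01-r2", 25900.0),
          ("c02-r1", 24800.0), ("c02-r2", 25300.0)]
per_cabinet = cabinet_power(sample)
total_draw = sum(per_cabinet.values())
print(per_cabinet, f"total={total_draw:.0f} W")
```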

Node State Monitoring

Tracking the state of nodes, the individual servers that make up the machine, is another essential aspect of the monitoring system. This means keeping an eye on which nodes are functioning well and which ones may be experiencing issues. By using the Slurm scheduler, a popular tool for managing resources in supercomputers, the monitoring system can report on the status of all compute nodes.

This information is collected automatically and helps the team maintain high availability for users by quickly identifying nodes that are "down" and addressing the issues.
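A hedged sketch of how such a check might query Slurm: `sinfo` can summarise node counts by state, and that output can be turned into a Checkmk-style status line. The real ARCHER2 check is not shown in the summary, so the threshold and service name here are illustrative.

```python
#!/usr/bin/env python3
"""Illustrative node-state check: ask Slurm for node counts per state via sinfo."""
import subprocess

# '-h' suppresses the header; '%D %T' prints node count and state per line.
out = subprocess.run(
    ["sinfo", "-h", "-o", "%D %T"],
    capture_output=True, text=True, check=True,
).stdout

counts = {}
for line in out.splitlines():
    number, state = line.split()
    counts[state] = counts.get(state, 0) + int(number)

down = sum(n for state, n in counts.items()
           if state.startswith(("down", "drain", "fail")))
total = sum(counts.values())

# Emit a Checkmk local-check line: WARN if any node is unavailable (illustrative rule).
status = 1 if down else 0
print(f"{status} Slurm_nodes down={down};1 {down} of {total} nodes unavailable")
```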

Login Availability Monitoring

Ensuring users can access ARCHER2 is key to its operation. A specific check was created to monitor login availability by testing access at regular intervals. This involved setting up a test user account that could only be accessed from the monitoring server. The system checks the ability to log in and reports any failures immediately.
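The summary describes this check only at a high level; a minimal version of the idea is to attempt a non-interactive SSH login with the dedicated test account and report success or failure. The host name and account below are placeholders, not the real service details.

```python
#!/usr/bin/env python3
"""Illustrative login-availability check: try a non-interactive SSH login."""
import subprocess

LOGIN_HOST = "login.archer2.example"   # placeholder, not the real address
TEST_USER = "monitoring-test"          # placeholder test account

result = subprocess.run(
    ["ssh",
     "-o", "BatchMode=yes",        # never prompt for a password
     "-o", "ConnectTimeout=15",
     f"{TEST_USER}@{LOGIN_HOST}",
     "true"],                      # run a trivial command and exit
    capture_output=True, text=True,
)

status = 0 if result.returncode == 0 else 2
detail = ("login succeeded" if status == 0
          else f"login failed: {result.stderr.strip()[:80]}")
print(f"{status} Login_availability - {detail}")
```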

Impact of Monitoring on ARCHER2 Deployment

The initial setup and testing phases of ARCHER2 were significantly aided by the monitoring systems in place. For example, the team encountered various problems with internal and external domain name systems (DNS). With monitoring in place, they were quickly alerted to these issues, allowing them to investigate and fix them promptly.

Monitoring also proved beneficial when testing the high-performance Linpack (HPL) benchmarks. During these tests, issues related to power cycling (where power use unexpectedly dropped) were spotted quickly, allowing the team to identify and address faulty nodes.

In its successful runs, ARCHER2 achieved impressive benchmark results, ultimately ranking 22nd on the Top500 list of supercomputers with a measured performance of 19.5 PFlop/s.

Automated Monitoring for Contractual Obligations

To meet contractual obligations with research funding bodies, a system was developed to automate the monitoring of essential metrics like node availability and overall service performance. Data collected by the monitoring tools is compiled and made available for reporting. This allows project managers to generate comprehensive reports on the system's availability for audits and evaluations.
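The paper describes compiling availability figures from the monitoring data; a toy calculation of that kind, assuming simple node-hour bookkeeping rather than the service's actual accounting rules, might look like this. The downtime figure in the example is invented.

```python
def availability_percent(total_nodes: int, period_hours: float,
                         downtime_node_hours: float) -> float:
    """Toy availability figure: fraction of scheduled node-hours actually delivered.

    This ignores planned maintenance windows and any contractual weighting,
    which a real report would have to account for.
    """
    scheduled = total_nodes * period_hours
    delivered = scheduled - downtime_node_hours
    return 100.0 * delivered / scheduled

# Example: 5,860 nodes over a 30-day month with 1,200 node-hours lost to failures.
print(f"{availability_percent(5860, 30 * 24, 1200):.3f}%")
```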

Real-time graphs showing node availability and service performance are accessible to relevant stakeholders, providing transparency and assurance that the system is functioning as intended.

Future Developments in Monitoring

As ARCHER2 moves forward, plans are in place to enhance monitoring capabilities. This includes the introduction of new tools for log analysis, deeper insights into error reporting, and per-job statistics. These developments aim to increase the usability and functionality of the monitoring system.

Additionally, making monitoring data more accessible to users will help encourage a collaborative approach to system management and troubleshooting.

Conclusion

In summary, the deployment of ARCHER2 and its monitoring system showcases a well-planned strategy that combines technology and teamwork. By using tools like Checkmk and Graphite, the team at Edinburgh has created a robust environment that supports high-level research activities.

The continuous monitoring of system health and performance not only improves service reliability but also ensures that all users can access and utilize the supercomputer effectively. As the system matures, ongoing enhancements and adaptations to the monitoring strategy will play an integral role in its success.

Original Source

Title: Automated service monitoring in the deployment of ARCHER2

Abstract: The ARCHER2 service, a CPU based HPE Cray EX system with 750,080 cores (5,860 nodes), has been deployed throughout 2020 and 2021, going into full service in December of 2021. A key part of the work during this deployment was the integration of ARCHER2 into our local monitoring systems. As ARCHER2 was one of the very first large-scale EX deployments, this involved close collaboration and development work with the HPE team through a global pandemic situation where collaboration and co-working was significantly more challenging than usual. The deployment included the creation of automated checks and visual representations of system status which needed to be made available to external parties for diagnosis and interpretation. We will describe how these checks have been deployed and how data gathered played a key role in the deployment of ARCHER2, the commissioning of the plant infrastructure, the conduct of HPL runs for submission to the Top500 and contractual monitoring of the availability of the ARCHER2 service during its commissioning and early life.

Authors: Kieran Leach, Philip Cass, Steven Robson, Eimantas Kazakevicius, Martin Lafferty, Andrew Turner, Alan Simpson

Last Update: 2023-03-21 00:00:00

Language: English

Source URL: https://arxiv.org/abs/2303.11731

Source PDF: https://arxiv.org/pdf/2303.11731

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
