This post was co-written by Dirk Michel, SVP SaaS and Digital Technology at MYCOM OSI, Josh Hart, Sr. Solutions Architect at AWS and Chris Williams, Solutions Architect at AWS.
“Data on Kubernetes” is a fast-evolving innovative area that’s essential to building cloud-native, microservices-based software solutions that optimize performance, resilience, reliability, and total cost of ownership (TCO). Many Kubernetes applications require access to persistent storage and data services, including block storage, shared file systems, and object storage.
This post discusses how MYCOM OSI has improved the cost-performance dimension by adopting Amazon FSx for NetApp ONTAP. It explores how the storage option was evaluated to identify a solution that optimizes the handling of large and complex telco assurance datasets.
MYCOM OSI is an AWS Specialization Partner that offers assurance, automation, and analytics software-as-a-service (SaaS) applications for communications service providers (CSPs) in the digital era.
MYCOM OSI’s Assurance Cloud Service (ACS) provides critical network performance, fault, and service quality management. The services support artificial intelligence (AI) and machine learning (ML)-driven closed loop assurance for hybrid, physical, and virtualized networks, across all domains within a SaaS and a bring-your-own-cloud (BYOC) model.
Data Management Challenges
Telco assurance datasets are large, often in the tens or hundreds of terabytes (TB). The equipment and network functions that constitute today’s telco networks emit large volumes of telemetry data in a variety of protocols and formats. Assurance applications ingest this data stream, which typically consists of high-cardinality time-series numerical data, event data, log-formatted data, and configuration data.
Telco operators use the near real-time nature of the incoming assurance data streams to inform time-sensitive triage and incident response activities within surveillance and operations centers. Telcos also retain and accumulate the incoming data over time to build up valuable historical data repositories essential for network analytics and ML workloads.
Importantly, many time-sensitive assurance use cases rely on low-latency access to mid-range historical data. The longer-range historical data also drives batch processes for decision-making in other domains, such as network capacity planning and enterprise resource planning (ERP).
Petabyte (PB)-scale datasets of potentially high fidelity and high granularity are a particular challenge to work with due to the significant costs involved. One of the primary cost factors is storage, especially for telcos with large access networks and long data retention requirements. As dataset sizes increase, the cost to acquire and maintain the necessary storage capacity and meet low-latency performance requirements grows substantially.
In addition to direct storage resource costs, transferring and accessing large datasets between systems and AWS Availability Zones (AZs) can be expensive. Moving such large amounts of data over networks often involves high bandwidth requirements, which can lead to substantial data transfer costs.
MYCOM OSI’s cloud-based SaaS solutions deploy as microservice applications onto Kubernetes and heavily use network file system (NFS) storage for their numerical data store. Amazon Elastic Kubernetes Service (Amazon EKS) is the core compute platform, which provides several storage options specifically designed to work seamlessly with Kubernetes clusters on AWS.
One of the long-standing and popular NFS storage options for EKS is Amazon Elastic File System (Amazon EFS), a fully managed, elastic, highly available, and scalable NFS file system service. Due to its elasticity and serverless implementation, Amazon EFS removes the need for capacity planning as it automatically grows and shrinks the file system.
The EFS intelligent tiering feature is another important aspect, which dynamically moves files between the EFS Standard and EFS Infrequent Access storage classes based on access patterns.
The Amazon EKS team provides the EFS CSI Driver, a Kubernetes Container Storage Interface (CSI) add-on, which enables EKS to integrate with and manage the lifecycle of Amazon EFS file systems efficiently and securely. Kubernetes pods that are distributed across worker nodes and AZs can simultaneously use ReadWriteMany (RWX) persistent volumes (PVs) that are backed by EFS.
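As a rough illustration of the shared-volume pattern described above, the following sketch builds a Kubernetes PersistentVolumeClaim that requests the `ReadWriteMany` access mode and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The claim name `assurance-data`, the storage class name `efs-sc`, and the requested size are hypothetical placeholders, not values taken from MYCOM OSI's deployment.

```python
import json


def rwx_claim(name: str, storage_class: str, size: str = "5Gi") -> dict:
    """Build a PersistentVolumeClaim manifest that asks for a shared (RWX) volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            # Kubernetes requires a storage request; the value here is a placeholder.
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Both names below are hypothetical and only serve the illustration.
    print(json.dumps(rwx_claim("assurance-data", "efs-sc"), indent=2))
```

Pods scheduled in different AZs can then reference the same claim name in their volume definitions.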
MYCOM OSI’s cloud-based SaaS solutions use EFS extensively, as illustrated in the diagram below.
Figure 1 – Architecture with Amazon EFS-backed Kubernetes persistent volumes.
As telco customers seek to build more extensive assurance and telemetry data repositories and retain historical data for longer, the need to counteract the growing NFS storage footprint became evident. This resulted in an initiative to evaluate the Amazon FSx portfolio, with the intention to avoid application-side implementation of data efficiency features.
The initiative also took the opportunity to explore different availability configurations. As such, the team began to explore alternative managed NFS storage solutions.
Proposed Solution
One of the early decisions was to focus on the Amazon FSx family in search of a managed NFS file system. Amazon FSx has expanded over time with the launch of new managed file systems, such as FSx for ONTAP. The MYCOM OSI cloud development team had prior experience with FSx for ONTAP and began to evaluate its implementation across various dimensions, with a specific focus on performance and efficiency.
Some of the key features evaluated include:
- Flexible performance configuration options: By independently provisioning throughput capacity and IOPS, MYCOM OSI achieved up to 260% faster data ingestion speeds.
- Data compression and de-duplication features: These allow MYCOM OSI to reduce storage requirements by 80% on average.
- Two storage tiers, the “primary storage” and “capacity pool storage” tiers: This improves cost efficiency by moving data into lower-cost storage media based on data access patterns.
- Data protection features, such as snapshot-based backups and restores: Taking backups and triggering restores from NFS file systems with AWS Backup can be cost and time-intensive operations. Side-stepping this challenge with snapshots can make a significant difference and allow for fast and efficient recovery in case of data loss or application failures.
The following diagram illustrates the high-level setup:
Figure 2 – Architecture with Amazon FSx-backed Kubernetes persistent volumes.
The team wanted to support multiple storage options and provide implementation choices based on individual customer needs when required.
Telco customers come in all shapes and sizes, and there’s rarely a one-size-fits-all solution. By supporting flexible storage solution options as part of the SaaS solution, MYCOM OSI could deliver the best possible value for customers with large historical datasets.
Results
As part of the early validation phases, the team reviewed the Amazon FSx for ONTAP-provided Kubernetes CSI Driver, which manages the lifecycle of the file system volumes. FSx for ONTAP enables seamless integration of persistent volumes (PVs) and persistent volume claims (PVCs) within Kubernetes clusters.
This is where the immediate challenge arose: FSx for ONTAP recommends the use of the Astra Trident CSI driver for Kubernetes. This driver does not include NFS libraries and implicitly assumes that the relevant NFS libraries are available on the Kubernetes worker nodes. This is not always the case with modern container-optimized Linux operating systems such as Bottlerocket, which are purpose-built, lightweight, and security hardened, and do not contain NFS libraries either.
The MYCOM OSI team and AWS Solutions Architects worked together to identify an alternative CSI driver and validated the compatibility between the standard nfs-csi-driver and FSx for ONTAP. This allowed the team to proceed with Bottlerocket, retaining its security benefits, and with FSx for ONTAP as the storage solution.
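For readers who want to picture what such a setup might look like, here is a minimal sketch of a StorageClass for the upstream NFS CSI driver (provisioner `nfs.csi.k8s.io`) pointed at an NFS export. The post does not show MYCOM OSI's actual configuration, so the class name, the server DNS name, the share path, and the NFS version below are all assumptions chosen for illustration only.

```python
import json


def nfs_storage_class(name: str, server: str, share: str, nfs_version: str = "4.1") -> dict:
    """Build a StorageClass manifest for the upstream NFS CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "nfs.csi.k8s.io",
        # 'server' is the NFS endpoint and 'share' the exported path to provision under.
        "parameters": {"server": server, "share": share},
        "reclaimPolicy": "Retain",
        "volumeBindingMode": "Immediate",
        "mountOptions": [f"nfsvers={nfs_version}"],
    }


if __name__ == "__main__":
    manifest = nfs_storage_class(
        name="fsx-ontap-nfs",              # hypothetical class name
        server="svm-nfs.example.internal",  # hypothetical NFS endpoint of the file system
        share="/vol1",                      # hypothetical export (junction) path
    )
    print(json.dumps(manifest, indent=2))
```

Persistent volume claims would then reference the class by name, in the same way as with any other dynamically provisioned storage class.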
Performance
The file system performance validation campaign was executed on a test harness in the MYCOM OSI Labs and used a SaaS tenant deployment backed by an FSx for ONTAP file system with efficiency features enabled.
The chosen benchmark drives two particular areas: data ingestion performance, which creates a write-dominant load on the file system, and data analytics performance, which is read-intensive. The results show an improvement across both benchmark dimensions as shown below.
Benchmark Type | Benchmark Variants | Benchmark Result |
Data Ingestion | 1B record transforms and writes | 110% – 260% Faster |
Data Analytics | 1B record reads and compute | 10% – 23% Faster |
This result also validates that sufficiently low latency is achieved with the FSx for ONTAP deployment pattern.
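A figure such as "260% faster" can be read in more than one way. The sketch below assumes it means a 260% increase in throughput over the baseline, which is an interpretation and not something the benchmark description states, and converts that into the fraction of the baseline run time needed for the same workload.

```python
def relative_elapsed_time(percent_faster: float) -> float:
    """Fraction of the baseline run time for a fixed workload.

    Assumes 'X% faster' means throughput increased by X% over the baseline.
    """
    if percent_faster <= -100:
        raise ValueError("percent_faster must be greater than -100")
    return 1 / (1 + percent_faster / 100)


if __name__ == "__main__":
    # Range boundaries taken from the benchmark results reported in this post.
    for label, pct in [("ingestion, low", 110), ("ingestion, high", 260),
                       ("analytics, low", 10), ("analytics, high", 23)]:
        share = relative_elapsed_time(pct)
        print(f"{label}: {pct}% faster is about {share:.0%} of the baseline run time")
```

Under this reading, the ingestion results correspond to finishing the same job in roughly half to a little over a quarter of the original time.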
Efficiency
To validate the cost-effectiveness of the solution, the data compression ratio had to be proven. Typically, AWS has observed that customers achieve up to 65% reduction in storage capacity when using FSx for ONTAP’s compression, compaction, and deduplication functionality.
Across multiple test runs, an average of 80% reduction in storage capacity was found, compared to an NFS file system without system-side compression. An uncompressed file system size of about 10 TB against a compressed size of about 2 TB yields a compression ratio of 5:1. The achieved compression ratio varies, and depends on multiple factors such as the composition and sparsity of the dataset.
Dataset | Dataset Variant | Storage Capacity Reduction |
Dataset Type 1 | Synthetic telco dataset | 85% |
Dataset Type 2 | Synthetic telco dataset | 75% |
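The relationship between a percentage reduction and an N:1 compression ratio is easy to mix up, so here is a small worked sketch. The helper names are illustrative; the input figures are the ones quoted in this section.

```python
def reduction_percent(uncompressed_tb: float, compressed_tb: float) -> float:
    """Percentage of capacity saved, given sizes before and after efficiency features."""
    if uncompressed_tb <= 0 or not 0 <= compressed_tb <= uncompressed_tb:
        raise ValueError("expected 0 <= compressed_tb <= uncompressed_tb and uncompressed_tb > 0")
    return 100 * (uncompressed_tb - compressed_tb) / uncompressed_tb


def compression_ratio(reduction_pct: float) -> float:
    """Convert a percentage reduction into the N of an N:1 compression ratio."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 100 / (100 - reduction_pct)


if __name__ == "__main__":
    print(f"10 TB -> 2 TB is a {reduction_percent(10, 2):.0f}% reduction")
    for pct in (65, 75, 80, 85):
        print(f"{pct}% reduction is about {compression_ratio(pct):.2f}:1")
```

Note how the ratio grows non-linearly: moving from a 75% to an 85% reduction raises the ratio from 4:1 to roughly 6.7:1.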
Availability
Another feature of FSx for ONTAP is inter-AZ connectivity: the active storage system can be accessed by distributed applications from any Availability Zone. This is not the case for EFS One Zone, as an EFS One Zone deployment is only mountable by the EFS CSI Driver from within the same AZ it resides in.
To maintain a multi-AZ EKS topology, the EFS file system must be deployed with EFS Standard, which is distributed across multiple AZs. This increases resiliency and availability, but with a trade-off of increased cost.
With FSx for ONTAP, the single-AZ deployment type file system can be mounted via NFS from any Availability Zone. This means the EKS cluster maintains its availability and standard deployment template, whilst the NFS storage is offered at a reduced availability. This can be important for cloud migration projects; for example, when comparing on-premises deployments to the cloud. For more details, see reduce storage costs with single availability zone FSx for ONTAP datastores.
The conversation around cloud migration often leads with cost, which makes it key to compare like-for-like. If there’s no multi-site replication on-premises, then a One Zone file system is the applicable equivalent. Another important consideration is the set of requirements of the specific application: the trade-off here is between availability, cost, and sustainability.
A reduction in the storage footprint with a One Zone deployment pattern reduces the underlying infrastructure, and therefore the carbon footprint of the solution.
Conclusion
MYCOM OSI was able to evolve its storage solutions in a data-intensive telco assurance context and adopt the best storage solution for unique customer needs. The evolved storage architecture simultaneously improves performance and reduces cost, especially for telco customers with large historical datasets. Supporting both Amazon EFS and Amazon FSx for ONTAP provides the flexibility to choose the right tool for the right job.
Innovating on behalf of customers can be a key differentiator, and MYCOM OSI increased its ability to offer flexible tenant options that meet specific requirements whilst still providing a standard set of patterns to maintain a consistent SaaS architecture. By evaluating the historical data size and availability requirements across telco customers, MYCOM OSI was able to provide the best possible cost-performance and return on investment (ROI) for its customer base.
Telco providers can improve their overall total cost of ownership (TCO) for assurance applications by adopting cloud-native SaaS solutions such as MYCOM OSI’s Assurance Cloud Service (ACS) platform, which are well architected and AWS-optimized solution stacks that evolve over time.
This blog was first posted on aws.amazon.com and co-written by Dirk Michel, SVP SaaS and Digital Technology at MYCOM OSI, in partnership with Josh Hart, Sr. Solutions Architect at AWS and Chris Williams, Solutions Architect at AWS: https://aws.amazon.com/blogs/apn/how-mycom-osi-optimized-saas-storage-with-amazon-fsx-for-netapp-ontap/