
Real-life workloads allow more efficient data granularity and enable very large SSD capacities

Luca Bert | September 2023

Large capacity SSDs (i.e., 30TB+) bring a new set of challenges. The two most relevant are:

  1. Large capacity SSDs are enabled by high density NAND, like QLC (quad-level cell NAND that stores 4 bits of data per cell), which brings more challenges compared to TLC NAND (triple-level cell, storing 3 bits per cell).
  2. Growth in SSD capacity requires matching growth in local DRAM, traditionally at a 1:1000 ratio (DRAM to storage capacity).

Currently, we are at a point where the ratio of 1:1000 is no longer sustainable. But do we really need it? Why not a ratio of 1:4000? Or 1:8000? They would reduce DRAM demand by a factor of 4 or 8 respectively. What prevents us from doing this?

This blog explores the thought process behind this approach, and tries to map a way forward for large capacity SSDs.

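First, why does DRAM need to be in a 1:1000 ratio with NAND capacity? The SSD needs to map the logical block addresses (LBAs) coming from the system to NAND pages, and it needs to keep a live copy of all of them so it knows where data can be written or read back. The LBA size is 4KB and the map address is generally 32 bits (4 bytes), so we need one 4-byte entry for every 4KB LBA; hence the 1:1000 ratio. Note that very large capacities would need a bit more than this but, for simplicity, we will stick to this ratio, as it makes the reasoning simpler and does not materially change the outcome.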
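The arithmetic behind that ratio can be sketched in a few lines. This is only an illustration under the assumptions stated in the text (one 4-byte entry per mapped unit); the function name `map_dram_bytes` and the 30TB example capacity are chosen here for illustration, and real drives carry additional DRAM overhead that is ignored.

```python
# Rough map-table sizing: one fixed-size entry per mapped unit.
# Assumption: 4-byte entries, as in the text; real firmware may differ.

TB = 10**12  # decimal terabyte, as used for drive capacities


def map_dram_bytes(capacity_bytes, unit_bytes=4096, entry_bytes=4):
    """DRAM needed to hold one map entry per `unit_bytes` of capacity."""
    entries = capacity_bytes // unit_bytes
    return entries * entry_bytes


if __name__ == "__main__":
    capacity = 30 * TB
    for unit_kib in (4, 16, 32, 64):
        dram = map_dram_bytes(capacity, unit_kib * 1024)
        ratio = capacity / dram
        print(f"{unit_kib:>2} KiB per entry: {dram / 1e9:6.2f} GB of DRAM (about 1:{ratio:,.0f})")
```

With 4KB units the ratio comes out at 1:1024, which is the "1:1000" rule of thumb; each fourfold increase in the mapped unit cuts the table size by roughly four.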

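Having one map entry per LBA is the most effective granularity, as it allows the system to write (i.e., create a map entry) at the lowest possible granularity. This is often benchmarked with 4KB random writes, which are commonly used to measure and compare SSD write performance and endurance.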

However, this may not be tenable in the long term. Instead, what if we had one map entry every 4 LBAs? Or every 8, 16, 32+ LBAs? If we use one map entry every 4 LBAs (i.e., one entry every 16KB), we save DRAM size, but what happens when the system wants to write 4KB? Given that the entry is every 16KB, the SSD will need to read the 16KB page, modify the 4KB that are going to be written, and write back the entire 16KB page. This would impact performance (“read 16KB, modify 4KB, write back 16KB”, rather than just “write 4KB”) but, most importantly, it would impact endurance (the system writes 4KB but the SSD ends up writing 16KB to NAND), thus cutting SSD life by a factor of 4. This is concerning when it happens on QLC technology, which has a more challenging endurance profile. For QLC, if there is one thing that cannot be wasted it is endurance!

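So, the common reasoning is that the map granularity (or, more formally, the Indirection Unit, “IU”) cannot be changed, otherwise SSD life (endurance) would degrade severely.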

While all the above is true, do systems really write data at 4KB granularity? And how often? One can certainly buy a system just to run FIO with a 4KB RW (random write) profile but, realistically, people don’t use systems this way. They buy them to run applications, databases, file systems, object stores, etc. Do any of them use 4KB writes?

We decided to measure it. We picked a set of various application benchmarks, from TPC-H (data analytics) to YCSB (cloud operations), running on various databases (Microsoft® SQL Server®, RocksDB, Apache Cassandra®), various file systems (EXT4, XFS) and, in some cases, entire software-defined storage solutions like Red Hat® Ceph® Storage, and measured how many 4KB writes are issued and how much they contribute to write amplification, i.e., extra writes that dent device life.

Before delving into the details of the analysis, we need to discuss why write size matters when endurance is at stake.

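A 4KB write creates a “write 16K to modify 4K” and hence a write amplification factor (WAF) of 4x. But what if we have an 8K write? Assuming it falls within the same IU, it will be “write 16K to modify 8K”, so WAF=2. A bit better. And if we write 16K? It may not contribute to WAF at all, as one “writes 16K to modify 16K”. So, only small writes contribute to WAF.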

There is also a subtle case where writes may not be aligned to the IU, so misalignment always adds some WAF, but this contribution also shrinks quickly as the write size increases.

The chart below shows this trend:

Figure 1: 16K IU-induced WAF, showing that larger IOs have less impact

Large writes have minimal WAF impact. For example, a 256KB write may have no impact (WAF=1x) if aligned, or a minimal one (WAF=1.06x) if misaligned. Way better than the dreaded 4x coming from 4KB writes!
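The numbers quoted above follow from a simple counting argument: every IU that a write touches is rewritten in full. Below is a minimal sketch of that model, not the actual model behind the chart. The helper names are hypothetical, writes are assumed to start on 4KiB boundaries, and other WAF sources such as garbage collection are ignored.

```python
# IU-induced WAF for a single write: every IU touched is rewritten in full.
# Assumption: host writes start on 4 KiB boundaries; garbage collection is ignored.

KIB = 1024


def ius_touched(offset, size, iu):
    """Number of IUs overlapped by a write of `size` bytes at byte `offset`."""
    first = offset // iu
    last = (offset + size - 1) // iu
    return last - first + 1


def write_waf(offset, size, iu=16 * KIB):
    """NAND bytes written divided by host bytes written for one write."""
    return ius_touched(offset, size, iu) * iu / size


def best_and_worst_waf(size, iu=16 * KIB, sector=4 * KIB):
    """Best and worst WAF over all sector-aligned start offsets within one IU."""
    wafs = [write_waf(off, size, iu) for off in range(0, iu, sector)]
    return min(wafs), max(wafs)


if __name__ == "__main__":
    for size_kib in (4, 8, 16, 64, 256):
        best, worst = best_and_worst_waf(size_kib * KIB)
        print(f"{size_kib:>3} KiB write: best {best:.2f}x, worst {worst:.2f}x")
```

A misaligned 256KiB write touches 17 IUs instead of 16, which is where the 1.06x figure comes from.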

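We then need to analyze all writes to the SSD and look at their alignment within an IU to compute each one’s contribution to WAF. The larger the write, the better. To do this, we instrumented the system to trace IOs for several benchmarks. We took 20-minute samples (typically between 100 and 300 million samples per benchmark) and then post-processed them to look at size and IU alignment, and to add every single IO’s contribution to WAF.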
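As an illustration of this kind of post-processing, the sketch below computes an aggregate WAF and a size histogram from a list of writes. The trace format (a list of `(offset, size)` tuples in bytes), the function names, and the tiny sample trace are all assumptions made for this example. Each write is treated independently, so any coalescing a real drive might do is not modeled.

```python
# Post-processing a write trace: size histogram plus aggregate IU-induced WAF.
# Assumption: the trace is a list of (byte_offset, byte_size) tuples.
from collections import Counter

KIB = 1024


def nand_bytes(offset, size, iu):
    """Bytes rewritten for one host write: every IU touched is rewritten in full."""
    first_iu = offset // iu
    last_iu = (offset + size - 1) // iu
    return (last_iu - first_iu + 1) * iu


def aggregate_waf(trace, iu=16 * KIB):
    """Total NAND bytes written divided by total host bytes written."""
    host = sum(size for _, size in trace)
    nand = sum(nand_bytes(off, size, iu) for off, size in trace)
    return nand / host


def size_histogram(trace):
    """Count of writes per IO size, in bytes."""
    return Counter(size for _, size in trace)


if __name__ == "__main__":
    # Hypothetical trace, for illustration only.
    trace = [
        (0, 4 * KIB),
        (20 * KIB, 8 * KIB),
        (64 * KIB, 256 * KIB),
        (1024 * KIB, 256 * KIB),
    ]
    print(dict(size_histogram(trace)))
    print(f"aggregate WAF at 16 KiB IU: {aggregate_waf(trace):.3f}x")
```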

The below table shows how many IOs fit in each size bucket:

Figure 2: Real life data of WAF IU from benchmarks (by IO count)

As shown, most writes fall either into the small 4-8KB buckets (bad) or into the 256KB+ bucket (good).

If we apply the above WAF chart assuming all such IOs are misaligned, we get what is reported in the “Worst case” column: most WAF is in the 1.x range, some in the 2.x and, very exceptionally, in the 3.x range. Way better than the expected 4x, but not good enough to make it viable.

However, not all IOs are misaligned. Why would they be? Why would modern file systems create structures that are misaligned to such small granularities? Answer: they don’t.

We traced more than 100 million IOs for each benchmark and post-processed them to determine how they align with a 16KB IU. The result is in the last column, “Measured” WAF. It is generally less than 5%, i.e., WAF <=1.05x, which means that one can grow the IU size 4x (from 4KB to 16KB) and make large SSDs using QLC NAND and existing, smaller DRAM technologies at a life cost that is <5% and not 400% as postulated! These are amazing results.

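One may argue: “There are a lot of small 4KB and 8KB writes, and they do have a 400% or 200% individual WAF contribution. Shouldn’t the aggregate WAF be much higher, given such small but numerous IO contributions?” True, there are many of them, but they are small, so they carry a small payload and their impact, in terms of volume, is minimal. In the table above, a 4KB write and a 256KB write each count as a single write, but the latter carries 64x the data of the former.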

If we adjust the above table to account for IO volume (i.e., accounting for each IO size and the data moved) rather than IO count, we come to the following representation:

Figure 3: Real life data of WAF IU from benchmarks (by Volume)

As we can see, the color grading for more intense IOs is now skewed to the right, which means that large IOs are moving most of the data and hence the WAF contribution is small.
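The difference between counting IOs and weighting them by volume can be shown with a small sketch. The function name `size_shares` and the synthetic mix of writes are assumptions for illustration and are not taken from the measured data.

```python
# Share of writes per IO size, by count and by data volume.
# Assumption: `sizes` is a flat list of write sizes in bytes.
from collections import Counter

KIB = 1024


def size_shares(sizes):
    """Map each IO size to (share of IO count, share of bytes moved)."""
    counts = Counter(sizes)
    total_count = len(sizes)
    total_bytes = sum(sizes)
    return {
        size: (n / total_count, n * size / total_bytes)
        for size, n in sorted(counts.items())
    }


if __name__ == "__main__":
    # Synthetic mix: many small writes, few large ones.
    sizes = [4 * KIB] * 90 + [256 * KIB] * 10
    for size, (by_count, by_volume) in size_shares(sizes).items():
        print(f"{size // KIB:>3} KiB: {by_count:6.1%} of IOs, {by_volume:6.1%} of bytes")
```

In this made-up mix the 4KiB writes are 90% of the IOs but only about 12% of the bytes, which is why their high individual WAF has a limited effect on the aggregate.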

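One last thing to note is that not all SSD workloads are suitable for this approach. The last row, for example, represents the metadata portion of a Ceph Storage node, which does very small IO and causes a high WAF of 2.35x. Large IU drives are not suitable for metadata alone. However, if we mix data and metadata in Ceph (a common approach with NVMe SSDs), the size and amount of data outweigh the size and amount of metadata, so the combined WAF is minimally affected.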

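Our testing shows that in actual apps and most common benchmarks, moving to 16K IU is a valid approach. The next step is convincing the industry to stop benchmarking SSDs with FIO at 4K RW, which has never been realistic and, at this point, is detrimental to evolution.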

Impact of different IU sizes

One of the most obvious follow-up questions is: why a 16KB IU size? Why not 32KB or 64KB? Does it even matter?

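This is a very fair question that requires specific investigation, and it should be turned into a more specific one: what is the impact of different IU sizes on any given benchmark?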

Since we already have traces that are unaffected by the IU size, we just have to run them through the appropriate model and see the impact.
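Because the traces record only host-side offsets and sizes, the same trace can be replayed against any IU size. The sketch below does this with the simple "every touched IU is rewritten" model. It is an assumption-laden illustration rather than the model used for the figure, and the trace shown is made up.

```python
# Replaying one host-side trace against several IU sizes.
# Assumption: the trace is a list of (byte_offset, byte_size) tuples and each
# write rewrites every IU it touches; garbage collection is ignored.

KIB = 1024


def trace_waf(trace, iu):
    """Aggregate IU-induced WAF of a trace for a given IU size in bytes."""
    host = sum(size for _, size in trace)
    nand = sum(
        ((off + size - 1) // iu - off // iu + 1) * iu
        for off, size in trace
    )
    return nand / host


if __name__ == "__main__":
    # Hypothetical trace: a few small writes among large aligned ones.
    trace = [
        (0, 4 * KIB),
        (40 * KIB, 8 * KIB),
        (128 * KIB, 128 * KIB),
        (512 * KIB, 256 * KIB),
        (1024 * KIB, 256 * KIB),
    ]
    for iu_kib in (4, 16, 32, 64):
        print(f"IU {iu_kib:>2} KiB -> WAF {trace_waf(trace, iu_kib * KIB):.2f}x")
```

A stream made only of 4KB writes would hit the theoretical ceiling of IU size divided by 4KB (16x at a 64KB IU); a mixed trace sits far below that because the large writes dominate the byte count.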

Figure 4 shows the impact of IU size on WAF:

Figure 4: Impact of IU sizes to WAF

There are a few conclusions that can be drawn from the chart:

  • IU size matters, and WAF degrades as IU size grows. There is no good or bad solution; everybody has to weigh the different tradeoffs based on their needs and targets.
  • WAF degradation is not as bad as feared in many of the cases we saw above. Even in the worst case of a 64KB IU and the most aggressive benchmark, it is less than 2x, as opposed to a feared 16x.
  • Metadata, as noted before, is always a bad pick for large IU, and the larger the IU, the worse it gets.
  • JESD 219, an industry standard profile to benchmark WAF, is not great but acceptable at 4KB IU, where the extra 3% WAF is generally tolerable, but it becomes unsuitable at larger IUs, reaching almost 9x at 64K IU.

DMTS - Systems Architecture

Luca Bert

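Luca is a Distinguished Member of SSD System Architecture with over 30 years of enterprise storage experience. He focuses mainly on innovative features and their use in systems to further increase SSD value. He holds a Master’s degree in Solid State Physics from the University of Torino (Italy).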