Erasure Coded Data Durability

Introduction

Whether they are SSDs or HDDs, physical storage devices fail. One cannot afford to lose data because of such failures, hence the need for redundancy. The simplest and most well-known form of redundancy is a backup, which is a clone of the original data. However, this is highly inefficient since the number of devices is doubled. That is why RAID (Redundant Array of Inexpensive Disks) was invented.

A RAID 5 system computes one piece of parity out of the data and stores it on an additional physical drive. The original data together with the parity creates a redundant system which is protected against a single drive failure, i.e., in the case of such a failure, the original data can be reconstructed from the healthy drives. The system typically has a spare drive, or the user is required to swap the failed drive with a new one, so that the data recovery can start immediately. Once the recovery is done, the system is back to its 1-redundant state. As drive capacities reach 20TB, the recovery process takes very long. Consider a failed \(20\text{TB}\) drive and a typical recovery rate of \(50\text{MB/s}\): it will take about \(4.6\, \text{days}\) to complete the recovery. What if another drive fails in that time frame? You lose data!

Enter RAID 6, which creates two pieces of parity to protect against two simultaneous failures. This certainly increases the durability of the data; however, certain applications require durability levels that cannot be met by RAID 6. In such applications, more redundancy is introduced, which brings us to erasure coding.

The math of creating redundancy involves Galois fields, and it is a fascinating branch of math! However, I will save that for another post and concentrate on the durability calculations for a given erasure coding scheme. I want to dive into an aspect which seems to be ignored in most calculators/posts: the unrecoverable read error (URE). In this post you will see that UREs effectively kill the last layer of redundancy. In the particular case of RAID 6, we will see that its protection is not much stronger than RAID 5 once the corrupted sectors are included.

I will also show that there exist good models to calculate data durability along with bad and ugly ones. I hope that people will stop using models that are so wrong that they get the answer off by a factor of \(10^8\)!

History of Erasure Coding

Figure 1 shows the evolution of erasure coding methods from replication to more advanced technologies. The image is taken from a Seagate presentation and has certain Seagate-specific content.

Figure 1: A brief timeline of erasure coding. Replication has very poor capacity efficiency. RAID5 (RAID6) provides protection against one (two) failure(s). Seagate enclosures can be used for declustered parity as well as erasure coding across the network. Durability and availability can be further improved by implementing multi-layer erasure coding. Credit: John Bent of Seagate.

  • The simplest way of creating redundancy is replication, but this has a very poor capacity efficiency.
  • RAID 5 and RAID 6 introduce parities to protect data against device failures.
  • More advanced erasure coding schemes include declustered RAID6, and multi-layers.

Creating Redundancy

  • Simplest example(RAID5): compute and store the parity data.
    • Given two bytes of data: \(B_1=01001001\) and \(B_2=11011010\)
    • The parity data: \(P=B_1\mathbin{\oplus}B_2=10010011\)
  • Assume the drive that stored \(B_1\) fails:
    • Compute \(P\mathbin{\oplus}B_2\), which is equal to \(B_1\).
  • The data on the failed drive can be reconstructed (a short code sketch of this XOR round trip follows this list).
    • Fault tolerant to one drive failure.
  • \(c\) pieces of redundancy data:
    • Parities computed using Reed-Solomon algorithm.
    • Fault tolerant to \(c\) simultaneous drive failures.
  • Main Question: What is the probability of having \(c+1\) simultaneous failures?
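As a minimal illustration of the RAID5-style parity above, here is a short Python sketch (illustrative only; real implementations operate on sectors/stripes and use Reed-Solomon codes for \(c>1\)):

# Minimal XOR-parity sketch (RAID5-style, single redundancy).
b1 = 0b01001001
b2 = 0b11011010

parity = b1 ^ b2                      # stored on the parity drive
assert parity == 0b10010011

# The drive holding b1 fails; rebuild it from the parity and the surviving data:
b1_rebuilt = parity ^ b2
assert b1_rebuilt == b1
print(f"{b1_rebuilt:08b}")            # 01001001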

RAID with URE

Figure 2 shows a symbolic representation of RAID5/6 states and transitions. You can change the sliders to adjust the numbers.



Figure 2: A symbolic representation of the RAID including URE failures. Use the input sliders to change the number of drives or the redundancy. We will later derive the formulas for data durability for RAID5 in Eq. (17) and RAID6 in Eq. (20).

  • The red arrows represent drive failures. The rate scales with the total number of drives:
    • \(\lambda\) is the failure rate per drive, and \(n\lambda\) is the total failure rate for \(n\) drives.
    • The arrow denoted with \(h\) represents data loss due to UREs, to be discussed in more detail later.
  • The green arrows represent repairs. Failed drives are replaced, and data is rebuilt.
    • The repair time, \(1/\mu\), depends on capacity and the repair rate. It may be as long as several days.
  • Data is lost when the system moves to the right-most states.
    • Key metric: Mean Time to Data Loss (MTTDL): it is a function of \(n\), \(c\), \(\lambda\), \(\mu\), drive capacity and \(\text{ UER}\).

Durability is usually a number which is very close to one. It is typically reported as the number of nines, defined as the number of leading 9’s in the durability figure. For example, 0.998 has 2 nines and 0.9991 has 3 nines. Mathematically it is defined as follows:

\[\begin{equation} \text{number of nines}=\text{Floor}\left(-\log_{10}(1-\text{Durability})\right). \tag{1} \end{equation}\]
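For reference, here is Eq. (1) as a trivial Python helper (assuming the durability is given as a probability):

import math

def number_of_nines(durability: float) -> int:
    # Eq. (1): count the leading 9's in the durability figure.
    return math.floor(-math.log10(1.0 - durability))

print(number_of_nines(0.998), number_of_nines(0.9991))   # 2 3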

Let’s take a look at the internet and see how people compute the number of nines.

A survey of models

Erasure coding was invented decades ago, and the associated data durability calculations are well established in the literature. However, this doesn’t stop people from inventing their own methods to compute data durability, not that there is anything wrong with that! The problem is that, as I will show below, some of these methods are so wrong that they end up being off by several orders of magnitude.
Let’s get started.

The Wasabi Model

This model shows up in a white paper from the cloud storage provider Wasabi[1]:

When an object is written to Wasabi, the Wasabi erasure coding algorithm converts the object into a series of data and parity fragments, and stores each fragment on 20 different drives. When you read an object from storage, Wasabi reassembles the object using the fragments. In the event one or more drives fail (up to 4 drives), an object can be fully reconstructed using any 16 of the 20 fragments. This means that Wasabi can withstand the failure of up to any 4 drives within a storage slice without losing data.
For AFR, Wasabi uses industry-accepted guidelines and assumes a conservative drive AFR of \(5\%\). In practice, our observed AFRs are much lower. For MTTR, Wasabi estimates \(3.4\) days per drive, based on the calculation below:
MTTR = 14 TB drive capacity * 50 MB/s drive write speed = 3.4 days to fully write a replaced 14 TB drive
As stated earlier in the erasure coding discussion, in order for an object to be lost, more than 4 drives in a Wasabi storage slice would have to fail. To understand the probability of this, the first step is to understanding the probability a single drive failing using the calculation below:
Probability of a 1 drive failing = AFR (5% year) * MTTR (3.4 days) * (1/365 year/days) = 4.66 x 10^-4
The next step would be to understand the probability of four drives failing while another drive in the storage slice was rebuilding. This would be a potential data loss scenario because five drives in a storage slice would not be available (one in rebuild mode, plus four new failures). To calculate this probability, this formula applies:
Probability of 4 drives failing = Probability of 1 drive failing (4.66 x 10^-4) to the 4th power = 4.7 x 10^-14
The final step in calculating data durability is to factor in the probability of 4 drives failing using the following formula:
Data Durability = 1 - (probability of 4 drives failing) = 1 - 4.7 x 10^-14 = .99999999999995
As seen in the above calculations, Wasabi storage architecture provides greater than 11 x 9s of durability. The calculated number is actually 13 x 9s but for the sake of taking a conservative approach to the calculations and to align with how most of the hyperscalers position their data durability, Wasabi uses 11 x 9s as the published data durability metric.

This model is very problematic! Consider the following generic setup, with \(n\) drives of capacity \(C\) (in TB), redundancy \(c\), recovery speed \(S\) (in MB/s), and a given AFR, and follow the Wasabi recipe:


  • At this recovery rate, the recovery time from a drive failure is \(T=C/S\) days.
  • The probability of losing a single drive in \(T\) days is \(p_f=\text{AFR}\times T/365\).
  • The model then says the system loses data when there are \(c+1\) failures in \(T\) days, which it assigns the probability \((p_f)^{c+1}\).
  • Data durability \(= 1-\) probability of \(c+1\) failures in \(T\) days \(= 1-(p_f)^{c+1}\).

Why is it wrong? Because the metric calculated is not the data durability for a system of \(n\) drives with \(c\)-redundant erasure coding! It is the durability of a \(1+c\) mirror set [original + \(c\) mirror(s)]. Note that the total number of drives, \(n\), did not even enter the equations! There are \(\binom{n}{c+1}\) combinations to choose \(c+1\) failures out of \(n\) drives. The reported number is not even the durability over a year; it is the durability over \(T\) days, and there are \(365/T\) such frames in a year. With the binomial coefficients and the number of frames per year included, this model over-reports the durability by at least \(5\) nines. Furthermore, if the UREs were included, that would reduce the durability by about \(3\) more nines (more on that later). Therefore, this model is off by about \(8\) orders of magnitude overall.
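To put numbers on the gap, here is a short Python sketch using, as an assumption, the same inputs as the sample simulation run later in the post (18+2, 1% AFR, 20TB drives, 50 MB/s). It compares the Wasabi-style number with a per-frame binomial estimate that accounts for all \(\binom{n}{c+1}\) failure combinations and all frames in a year; this is still a rough model (it ignores UREs and frame-boundary effects), but it shows the size of the discrepancy:

import math

# Illustrative inputs (assumptions): 18+2 erasure coding, 1% AFR, 20TB drives, 50 MB/s recovery
n, c = 20, 2
afr = 0.01
C_TB, S_MBps = 20, 50

T = C_TB * 1e6 / S_MBps / 86400           # rebuild window in days (~4.6)
p_f = afr * T / 365                        # P(a given drive fails within one window)

nines = lambda p_loss: -math.log10(p_loss)

# Wasabi-style number: (p_f)^(c+1), one window, combinatorics and frame count ignored
print("Wasabi-style nines:", round(nines(p_f ** (c + 1)), 1))

# Corrected estimate: c+1 or more failures among n drives in a window, 365/T windows per year
p_loss_frame = sum(math.comb(n, k) * p_f**k * (1 - p_f)**(n - k) for k in range(c + 1, n + 1))
p_loss_year = 1 - (1 - p_loss_frame) ** (365 / T)
print("Per-frame nines   :", round(nines(p_loss_year), 1))   # roughly 5 nines lower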

The BackBlaze Model

This is a model introduced by BackBlaze[2], [3]. Let’s look at their Poisson-based model:

For our Poisson calculation, we use this formula:

\[\begin{eqnarray} P(k\, \text{events in interval})=e^{-\lambda}\frac{\lambda^k}{k!} \tag{2} \end{eqnarray}\] The values for the variables are:
  • Annual average failure rate = 0.0041 per drive per year on average
  • Interval or “period” = 156 hours (6.5 days)
  • Lambda = ((0.0041 * 20)/((365*24)/156)) =0.00146027397 for every “interval or period”
  • e = 2.7182818284
  • k = 4 (we want to know the probability of 4 “events” during this 156 hour interval)

Here’s what it looks like:

\[\begin{eqnarray} P(k\, \text{events in interval})&=&e^{-\lambda}\frac{\lambda^k}{k!}\nonumber\\ P(k\, \text{events in interval})&=&2.71828^{-0.00146027397}\frac{0.00146027397^k}{4*3*2*1} \tag{3} \end{eqnarray}\]

If you’re following along at home, type this into an infinite precision calculator:[3]

(2.7182818284^(-0.00146027397)) * (((0.00146027397)^4)/(4*3*2*1))


The sub result for 4 simultaneous drive failures in 156 hours = 1.89187284e-13. That means the probability of it NOT happening in 156 hours is (1 – 1.89187284e-13) which equals 0.999999999999810812715 (12 nines).

But there’s a “gotcha.” You actually should calculate the probability of it not happening by considering that there are 56 “156 hour intervals” in a given year. That calculation is:

= (1 – 1.89187284e-13)^56
= (0.999999999999810812715)^56
= 0.99999999999 (11 “nines”)

[3] The complexity will break most standard calculators.

First of all, note that BackBlaze indeed incorporates the number of drives into the model through the effective \(\lambda\). That is great. The other good thing is that they recognize that the first calculated number is the durability for a single time frame, i.e., a single repair-time window. They extend it to the full year, as illustrated in Fig. 3.


Figure 3: An illustration of the time frames in the BackBlaze model.

The binomial model they construct is similar, and is not fundamentally different from the Poisson model. Since the chosen distribution is not critical for the main point I will argue below, I will re-write their binomial model in a slightly different form which, I believe, will make it easier to follow. With \(n\) drives of capacity \(C\) (in TB), redundancy \(c\), recovery speed \(S\) (in MB/s), and a given AFR, the BackBlaze binomial model goes as follows:


  • The recovery time from a drive failure is \(T\equiv\frac{C}{S}\) (in days).
  • The AFR value can be converted to a failure rate: \(\lambda=-\frac{\ln(1-\text{AFR})}{\text{year}}\simeq\frac{\text{AFR}}{\text{year}}\).
  • The probability of a given drive failing in \(T\) days is \(p_f\simeq \lambda T\).
  • The system will not lose data when there are at most \(c\) failures in \(T\) days.
    • The probability of not losing data in \(T\) days is \(p_\text{no-loss}=\sum_{i=0}^c \binom{n}{i} (p_f)^i(1-p_f)^{n-i}\).
  • There are \(N_F=\frac{365}{T}\) such frames in a year.
    • The probability of not losing data in a year is \((p_\text{no-loss})^{N_F}\).
The BackBlaze model is still simple, but it has a few issues:
  • The segmentation of a year into chunks of \(T\) days implies that data is lost only when \(c+1\) failures happen within a single frame. Cases where the failures occur within \(T\) days of each other but span two subsequent frames are missed.
  • It is implicitly assumed that every frame starts with a fully \(c\)-redundant system. In reality, there may be up to \(c\) ongoing repairs carried over from the previous frame, and a single drive failure early in the next frame can then cause data loss. The mistake is to assume that \(p_\text{no-loss}\) is the same in every window, and it clearly is not: \(p_\text{no-loss}\) at an arbitrary frame depends on all the previous frames, that is, it is time dependent, and if anything the per-frame probability of loss grows as degraded states are carried over. Therefore one cannot simply raise \(p_\text{no-loss}\) to the power of \(N_F\).

Ignoring UREs, the BackBlaze model works reasonably well in the low-AFR limit (though still off by at least a factor of \(\sim2\)). If we include UREs, which we should, the BackBlaze model overestimates the durability at these inputs by a few nines. The sketch below restates the per-frame binomial calculation in code.
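This is a minimal sketch of the per-frame binomial model as restated above (not BackBlaze's own published code [3]); the default inputs are illustrative assumptions and can be changed:

import math

def backblaze_binomial_nines(n=20, c=2, afr=0.01, cap_tb=20, speed_mbps=50):
    # Per-frame binomial durability as restated above (ignores UREs and frame-boundary effects).
    T = cap_tb * 1e6 / speed_mbps / 86400        # repair window, days
    lam = -math.log(1 - afr) / 365.0             # failure rate per drive, 1/days
    p_f = lam * T                                # P(a given drive fails within one window)
    p_no_loss = sum(math.comb(n, i) * p_f**i * (1 - p_f)**(n - i) for i in range(c + 1))
    n_frames = 365.0 / T
    durability = p_no_loss ** n_frames
    return -math.log10(1 - durability)

print(round(backblaze_binomial_nines(), 1))      # number of nines for the default inputs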

The most intuitive model

Figure 4 shows the time scales in a repairable system.
An illustration of the time scales in a repairable system.

Figure 4: An illustration of the time scales in a repairable system.

  • A system of \(n\) drives has the Mean Time To Failure: \(\text{MTTF}=1/(n \lambda).\)
  • A failed drive is replaced, and rebuilt. Mean Time To Repair: \(\text{MTTR}=1/\mu.\)
  • Fraction of the time spent in repair: \(\frac{\text{ MTTR}}{\text{ MTTF}+\text{ MTTR} }\simeq \frac{\text{ MTTR}}{\text{ MTTF}}=\frac{n\lambda}{\mu}\), since MTTF \(\gg\) MTTR.
  • There are \(n-1\) drives left running. The rate of failure: \((n-1)\lambda\)
  • Multiply this rate with the fraction of time in recovery: \(\,\,\frac{n\lambda}{\mu}(n-1)\lambda=\frac{n (n-1)\lambda^2}{\mu}\)
    • This is the rate of data loss. Inverting the expression: \(\,\,\, \text{ MTTDL}_1=\frac{\mu}{n(n-1)\lambda^2}\)
  • When there are two parity drives, we can iterate to get: \(\,\,\,\,\text{ MTTDL}_2=\frac{\mu^2}{n(n-1)(n-2)\lambda^3}\)
  • For \(c\) parity drives, we can recursively iterate to get: \(\text{ MTTDL}_c=\left[\frac{\mu}{\lambda}\right]^c\frac{(n-c-1)!}{\lambda n!}\), which is the standard expression for durability[4].
  • \(c=1\) is a RAID5 setup, and \(c=2\) is a RAID6 setup. The \(c\geq3\) cases are referred to as erasure coding in general.
  • This model is very intuitive, and works great, but…
    • It is not clear how to include UREs,
    • It is not extendable to more complicated problems.

The Markov chain model

This section outlines the methodology to calculate data durability. We will start with a system with no redundancy to introduce the concept of Markov chains, and later extend the analysis to repairable systems with redundancy. This methodology will enable us to address the UREs.

\(0\) redundancy

Just to warm up, let’s consider a system with \(n\) drives and no redundancy. This means that user data will be lost with any drive failure. What is the reliability of such a system given the reliability characteristics of individual drives? To keep the math simple, let’s assume that drives themselves have exponentially distributed failure times, i.e., they fail at a constant rate \(\lambda\), or equivalently with a certain annual failure rate AFR. Now we are putting \(n\) of these drives in a box. How can we describe failures of such systems? The answer is rather obvious: it will still be an exponential distribution with rate \(n\lambda\). So, the reliability of such a system is \[\begin{equation} \text{ R}(t)=e^{-n \lambda t}. \tag{4} \end{equation}\]

The probability density and the cumulative probability functions for an exponentially distributed random variable \(T_i\) are \[\begin{equation} f(t)=\lambda e^{-\lambda t},\tag{5} \end{equation}\] \[\begin{equation} F(t)= \int_{0}^{t} d\tau f(\tau)=1-e^{-\lambda t},\tag{6} \end{equation}\] where \(\lambda\) is the failure rate. A system of \(n\) storage devices with no redundancy will lose data as soon as a device fails. Formally, we have \(n\) independent and identically distributed failure times \(T_i\), \(i\in [1,n]\), with density \(f\) and the corresponding cumulative distribution \(F\) as defined in Eqs. (5) and (6) , respectively. We want to calculate the distribution of earliest failure time, i.e., \(T={\rm min}(T_1,T_2,\cdots,T_n)\). We can follow the formal definitions of cumulative probability distribution[5] to construct \(F_T(t)\) as follows: \[\begin{eqnarray} F_T(t)&=&P(T\leq t)=1-P(T > t)\nonumber\\ &=&1-P\{{\rm min}(T_1,T_2,\cdots,T_n)>t\}\nonumber\\ &=&1-[1-F(t)]^n=1-e^{-n\lambda t},\tag{7} \end{eqnarray}\] where we make use of the fact that if \({\rm min}(T_1,T_2,\cdots,T_n)>t\) then each \(T_i > t\). Since there is no redundancy, Eq. (7) is also the cumulative probability density. The reliability can be calculated as \(R=1-F_T\), which results in Eq. (4).

Just to get a sense of the result, let’s plug in some typical numbers: \(n=20\), AFR\(=0.5\%\). Note that for AFR \(\ll 1\), \(\lambda \sim \frac{\text{AFR}}{8766}\) (1/hours), where \(8766\) is the number of hours in a year. The durability in this case is \(0.904\) over one year, which is unacceptably poor. This is why we need to add redundancy to the system. Before we move on to systems with redundancy, let’s solve this problem one more time in the context of Markov chains. A Markov chain is a stochastic model that describes a sequence of possible events, such as drive failures and drive restorations, where the probability of each event depends only on the previous state of the system [6]. Let’s draw the Markov chain for the system that we considered earlier, in Fig. (5).


Figure 5: Markov chain for a system with no redundancy.

The red arrow in Fig. (5) shows the transition from a healthy \(n-\)drive state to the failure state. The rate of this event is \(n \lambda\). Let’s call the probability of the system being in the \(n-\)drive state \(p_{n}\), and in the other state \(p_{n-1}\). The differential equations governing the time evolution of the system can be written as \[\begin{eqnarray} \dot{p}_{n}(t)&=&-n \lambda p_{n}(t)\nonumber\\ \dot{p}_{n-1}(t)&=&+n \lambda p_{n}(t), \tag{8} \end{eqnarray}\] which simply show that there is an out-flux from \(p_{n}\) to \(p_{n-1}\). This will be total overkill, but let’s Laplace transform these equations with the initial conditions \(p_{n}(0)=1\) and \(p_{n-1}(0)=0\), i.e., we start with \(n\) functioning drives. We get:

\[\begin{eqnarray} s P_{n}(s)-1&=&-n \lambda P_{n}(s)\nonumber\\ s P_{n-1}(s)&=&+n \lambda P_{n}(s), \tag{9} \end{eqnarray}\] where \(P\)’s denote the Laplace transformed functions. Converting this into a matrix equation \[\begin{equation} \begin{bmatrix} s +n \lambda & 0 \\ -n\lambda & s \end{bmatrix} \begin{bmatrix} P_{n}(s) \\ P_{n-1}(s) \end{bmatrix}= \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \tag{10} \end{equation}\] which is solved by inverting the matrix \[\begin{eqnarray} \begin{bmatrix} P_{n}(s) \\ P_{n-1}(s) \end{bmatrix}&=& \begin{bmatrix} s +n \lambda & 0 \\ -n\lambda & s \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} \frac{1}{s+n\lambda} \\ \frac{1}{s} -\frac{1}{s+n\lambda} \end{bmatrix}. \tag{11} \end{eqnarray}\] Finally we inverse Laplace transform to get \[\begin{equation} \begin{bmatrix} p_{n}(t) \\ p_{n-1}(t) \end{bmatrix}= \begin{bmatrix} e^{-n\lambda t} \\ 1 -e^{-n\lambda t} \end{bmatrix}. \tag{12} \end{equation}\] So the reliability of the system is \[\begin{eqnarray} \text{ R}(t)&=&1-p_{n-1}(t) =e^{-t/\text{ MTTDL}_0}\,\text{with}\, \text{ MTTDL}_0=\frac{1}{n\lambda}. \tag{13} \end{eqnarray}\] We went through excessive math, but this is basically the procedure for analyzing a generic Markov chain: plot the chain with incoming and outgoing arrows, construct the transition differential equations, Laplace transform and convert to matrix form, invert the matrix, and transform back to the time domain.

\(1\) redundancy

Let’s apply this to an \(n-\)drive system with one redundancy, which will have the Markov chain in Fig. (6).

Figure 6: Markov chain for a system with one redundancy.

The red arrows in Fig. (6) represent drive failures, the rate of which scales with the number of healthy drives, and the green arrow shows the rebuild of drives at a rate that we will call \(\mu\). The transition equations are: \[\begin{eqnarray} \dot{p}_{n}+n \lambda p_{n} -\mu p_{n-1}&=&0 \nonumber\\ \dot{p}_{n-1}+(n-1) \lambda p_{n-1} +\mu p_{n-1}-n \lambda p_{n} &=&0 \nonumber\\ \dot{p}_{n-2}-(n-1) \lambda p_{n-1} &=&0 . \tag{14} \end{eqnarray}\] After Laplace transforming and converting to a matrix equation, we get: \[\begin{eqnarray} \hspace{-0.5em} \begin{bmatrix} P_{n}(s) \\ P_{n-1}(s)\\ P_{n-2}(s) \end{bmatrix}\hspace{-0.5em}=\hspace{-0.5em} \begin{bmatrix} s +n \lambda & -\mu &0 \\ -n\lambda & s +(n-1)\lambda +\mu &0\\ 0 &-(n-1)\lambda & s \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 0\\ 0 \end{bmatrix}\hspace{-0.4em} . \tag{15} \end{eqnarray}\] We are interested in \(p_{n-2}\), which is the probability that the system has \(2\) simultaneous failures, hence data is lost. Inverse Laplace transforming, we can calculate the reliability as \[\begin{eqnarray} \text{ R}(t)&=&1-p_{n-2}(t)\nonumber\\ &=&\frac{\mu + (2 n-1) \lambda + \sqrt{(\lambda - \mu)^2 + 4 \lambda \mu n}}{2 \sqrt{(\lambda - \mu)^2 + 4 \lambda \mu n}} e^{-\frac{1}{2}\left(\mu + \lambda (2n-1) -\sqrt{(\lambda - \mu)^2 + 4 \lambda \mu n}\right) t}\nonumber\\ &-&\frac{\mu + (2 n-1) \lambda - \sqrt{(\lambda - \mu)^2 + 4 \lambda \mu n}}{2 \sqrt{(\lambda - \mu)^2 +4\lambda \mu n}} e^{-\frac{1}{2}\left(\mu + \lambda (2n-1)+\sqrt{(\lambda - \mu)^2 + 4 \lambda \mu n}\right) t}\nonumber\\ &\sim& \left[1+n(n-1)\frac{\lambda^2}{\mu^2}\right]e^{-\frac{t}{\mu/[\lambda^2n(n-1)]}} -n(n-1)\frac{\lambda^2}{\mu^2}e^{-t[\mu+\lambda(2n-1)]}, \tag{16} \end{eqnarray}\] where we expanded in powers of \(\lambda/\mu\) and kept the leading terms in the coefficients and exponents. We can further simplify the equation since \(\frac{\lambda^2}{\mu^2}\ll 1\): for a typical storage device, \(\text{ AFR}<1\%\) and the restore time is on the order of days, which gives \(\frac{\lambda}{\mu}\sim 10^{-4}\). Therefore, we can safely drop the \(\frac{\lambda^2}{\mu^2}\) terms to get \[\begin{eqnarray} \text{ R}(t)&\sim& e^{-t/\text{ MTTDL}_1}\,\text{with}\, \text{ MTTDL}_1=\frac{1}{n(n-1) \lambda}\frac{\mu}{\lambda}. \tag{17} \end{eqnarray}\]
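As a sanity check on Eq. (17), here is a small Python sketch that compares the exact \(R(t)\), obtained by numerically exponentiating the generator of the chain in Fig. (6), against the approximation \(e^{-t/\text{MTTDL}_1}\). The inputs (20 drives, 1% AFR, 4.6-day rebuild) are illustrative assumptions:

import numpy as np
from scipy.linalg import expm

n = 20
lam = -np.log(1 - 0.01) / 8766.0        # per-drive failure rate, 1/hours
mu = 1.0 / (4.6 * 24.0)                 # repair rate, 1/hours

# Generator of the chain in Fig. (6): d/dt [p_n, p_{n-1}, p_{n-2}] = A @ [p_n, p_{n-1}, p_{n-2}]
A = np.array([[-n * lam,            mu,                0.0],
              [ n * lam, -(n - 1) * lam - mu,          0.0],
              [      0.0,  (n - 1) * lam,              0.0]])

t = 8766.0                              # one year, in hours
p = expm(A * t) @ np.array([1.0, 0.0, 0.0])   # start with all drives healthy
R_exact = 1.0 - p[2]                    # reliability = 1 - P(data-loss state)

mttdl_1 = mu / (n * (n - 1) * lam**2)   # Eq. (17)
print(R_exact, np.exp(-t / mttdl_1))    # the two agree very closely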

\(2\) redundancies

It will get more complicated, but we will get better at extracting the relevant term. Let’s apply this to an \(n-\)drive system with two redundancies, which has the Markov chain in Fig. (7).


Figure 7: Markov chain for a system with two redundancies.

The state equation in the \(s\)-domain is \[\begin{eqnarray} \hspace{-1em} \begin{bmatrix} P_{n}(s) \\ P_{n-1}(s)\\ P_{n-2}(s)\\ P_{n-3}(s) \end{bmatrix}&=& \mathcal{S} \begin{bmatrix} 1 \\ 0\\ 0\\ 0 \end{bmatrix}, \end{eqnarray}\] where we defined \(\mathcal{S}\) as: \[\begin{eqnarray} \hspace{-1em} \hspace{-0.5em}\begin{bmatrix} s +n \lambda & -\mu &0 &0 \\ -n\lambda & s +(n-1)\lambda +\mu &-\mu&0\\ 0 &-(n-1)\lambda & s +(n-2)\lambda +\mu &0\\ 0&0&-(n-2)\lambda&s \end{bmatrix}^{-1}. \tag{18} \end{eqnarray}\]

We are interested in \(P_{n-3}\). One can study the roots of the inverse \(s\)-matrix to find that \(P_{n-3}\) has a pole at \(s=0\) with residue \(1\) and another one with residue \(-1\) at \(s=-S_1\), where \[\begin{equation} S_1\simeq \lambda n (n-1)(n-2)\left(\frac{\lambda}{\mu}\right)^2. \tag{19} \end{equation}\] Inverse Laplace transforming, we can calculate the reliability as \[\begin{eqnarray} \text{ R}(t)&=&1-p_{n-3}(t) \simeq1-(1-e^{-t/\text{ MTTDL}_2})=e^{-t/\text{ MTTDL}_2},\nonumber\\ \text{ MTTDL}_2&=&\frac{1}{n(n-1)(n-2) \lambda}\left(\frac{\mu}{\lambda}\right)^2. \tag{20} \end{eqnarray}\]

\(c\) redundancies

Recognizing the pattern in Eqs. (13), (17) and (20), we can write the reliability formula for a \(c\)-redundant system: \[\begin{eqnarray} \text{ R}(t)&\sim& e^{-t/\text{MTTDL}_c},\nonumber\\ \text{ MTTDL}_c &=&\left(\frac{\mu}{\lambda}\right)^c \frac{(n-c-1)!}{\lambda n!}. \tag{21} \end{eqnarray}\] Eq. (21) is what we will use to calculate the reliability in the following sections. The most important observation here is that the system is described by a reliability function which is that of an exponential distribution! Therefore we can use the properties of the exponential distribution to calculate other metrics, such as availability, which will be done in the following sections.

Parallel repairs

We assumed that the repair rate is constant no matter how many failures there are. In reality, restorations may proceed in parallel: if you have two failures to restore and you work on them in parallel, the repair rate is effectively doubled, as illustrated in Fig. (8).


Figure 8: Markov chain for a system with two redundancies and parallel repairs.

In this case the \(\mu\) entries in the \(s\) matrices need to be scaled appropriately. Following the same steps one gets

\[\begin{eqnarray} \text{ R}(t)&\sim& e^{-t/\text{ MTTDL}_{c(p)}},\nonumber\\ \text{ MTTDL}_{c(p)} &=&\left(\frac{\mu}{\lambda}\right)^c \frac{c! (n-c-1)!}{\lambda n!}, \tag{22} \end{eqnarray}\] which differs from Eq. (21) only by the \(c!\) factor, which makes a difference only when \(c>1\), as expected. A small calculator sketch based on Eqs. (21) and (22) follows.
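As a quick way to get numbers out of Eqs. (21) and (22), here is a small calculator sketch. The inputs are illustrative assumptions, and \(\mu\) is derived from the drive capacity and recovery speed as elsewhere in the post:

import math

def mttdl_years(n, c, afr, cap_tb, speed_mbps, parallel_repair=False):
    # MTTDL from Eq. (21); multiplied by c! for parallel repairs, Eq. (22). Units: years.
    lam = -math.log(1 - afr)                          # per-drive failure rate, 1/years
    mu = 365.0 * speed_mbps * 86400 / (cap_tb * 1e6)  # repair rate, 1/years
    mttdl = (mu / lam) ** c * math.factorial(n - c - 1) / (lam * math.factorial(n))
    return mttdl * math.factorial(c) if parallel_repair else mttdl

for c in (1, 2, 3):
    m = mttdl_years(n=20, c=c, afr=0.01, cap_tb=20, speed_mbps=50, parallel_repair=True)
    print(f"c={c}: MTTDL ~ {m:.3g} years, ~{-math.log10(1/m):.1f} nines over one year (no UREs)")

For \(c=2\) this reproduces the \(\sim6.25\) nines that the theoretical prediction reports in the sample simulation run further below.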

Hard Errors

Now we want to include a failure mode that we have ignored so far: unrecoverable read errors (UREs), i.e., hard errors. The tricky part about hard errors is that you will not know a sector is corrupted until you attempt to read that very sector. Some systems do scrubbing of drives: when the system is idle, it reads the entire disk to locate and fix UREs. Obviously this costs performance, and one can only afford it at a limited frequency[7]. If a sector is corrupted, it can be recovered from its stored parities. However, one can easily see that this causes immediate data loss if the system is in the critical mode. For example, a system with \(c\) redundancy is in critical mode when it has \(c\) failures, that is, it is trying to recover \(c\) shards of data by reading from all of the remaining shards. If a URE is encountered at this point, there is no parity left to reconstruct the data.

Let’s start with the simplest case: a system with one redundancy, as in Fig. (9).

Markov chain for a system with one redundancy. Hard errors cause failures in the critical mode.

Figure 9: Markov chain for a system with one redundancy. Hard errors cause failures in the critical mode.

\(h\) is the probability of getting a URE during the rebuild of a failure, and it enters the link with a coefficient of \(\mu\) since \(\mu\) sets the data reading rate, i.e., \(\mu h\) is the rate of observing UREs when \(n-1\) drives are read. If a URE is observed during the rebuild, it results in a system failure. Therefore, the recovery rate reduces to \((1-h)\mu\), whereas the transition rate from the \(n-1\) to the \(n-2\) state increases to \((n-1)\lambda+ h\mu\). It is important to explicitly define what \(h\) is. To reconstruct a failed drive of capacity \(C_\text{bits}\), the system needs to read all the data on all \(n-1\) healthy drives and push it through the Reed-Solomon inversion calculations to reconstruct the data on the failed drive. So the total amount of data to be read is \((n-1) C_\text{bits}\), and \(h\) is the probability of getting one or more UREs over the entire read, which can be calculated as follows: \[\begin{equation} h=1-(1-\text{ UER})^{(n-1)C_\text{bits}}\simeq 1-e^{-\text{UER} (n-1)C_\text{bits}}. \tag{23} \end{equation}\]

Did you notice the difficulty in the expression \((1-\text{ UER})^{(n-1)C_\text{bits}}\) in Eq. (23)? UER is a tiny number, of order \(10^{-15}\), while \((n-1)C_\text{bits}\) is a huge number, of order \(10^{15}\). If the expression is computed numerically as written, it may exceed the precision of the computation. Let’s do ourselves a favor and use a trick taught in first-year calculus. Consider the following limit: \[\begin{equation} \lim_{N\rightarrow\infty}\left(1-\frac{x}{N}\right)^N=\lim_{N\rightarrow\infty}e^{N \ln\left(1-\frac{x}{N}\right)}=\lim_{N\rightarrow\infty}e^{N(-\frac{x}{N}+\mathcal{O} (1/N^2))}=e^{-x}, \tag{24} \end{equation}\] which is what we used in Eq. (23). Note that now we have the product of a very small number and a very large one, \(\text{UER}\,(n-1)C_\text{bits}\), which results in a numerical value that does not require high precision in the calculations.
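In code, the same trick can be applied with the expm1/log1p helpers. Here is a small sketch with illustrative numbers (UER per bit, capacity in TB):

import math

def h_ure(n, cap_tb, uer=1e-15):
    # P(at least one URE while reading the n-1 surviving drives), Eq. (23).
    bits = (n - 1) * cap_tb * 8e12          # (n-1) * C_bits
    # The naive form (1-uer)**bits loses precision; use expm1/log1p instead:
    return -math.expm1(bits * math.log1p(-uer))

print(round(h_ure(n=20, cap_tb=10), 2))     # ~0.78, as in Eq. (30)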

Equation (23) shows that \(h\) is a function of the number of drives to be read, i.e., \(n-1\), and we will use \(h_{n-1}\) to emphasize this fact.

The \(s\)-matrix for the system becomes \[\begin{equation} \begin{bmatrix} s +n \lambda & -\mu (1-h_{n-1}) &0 \\ -n\lambda & s +(n-1)\lambda +\mu &0\\ 0 &-(n-1)\lambda -h_{n-1} \mu& s \end{bmatrix}, \tag{25} \end{equation}\] the determinant of which has \(3\) roots. We are interested in the smallest nonzero root:

\[\begin{eqnarray} s_3&=& \frac{\mu+ (2n-1)\lambda }{2} -\frac{1}{2}\sqrt{\lambda^2 - 2 \lambda \mu + \mu^2 + 4 \lambda \mu n - 4 h_{n-1}\lambda \mu n}\nonumber\\ &\simeq&\frac{n\lambda[(n-1)\lambda+\mu h_{n-1}]}{\mu+(2n-1)\lambda} \simeq\frac{n\lambda[(n-1)\lambda+\mu h_{n-1}]}{\mu}, \tag{26} \end{eqnarray}\] where we assumed \((2n-1)\lambda\ll\mu\). \(s_3\) is what will go into the exponent once we inverse Laplace transform. The corresponding MTTF becomes \[\begin{eqnarray} \text{ MTTF}&=& \frac{1}{s_3}\simeq\frac{\mu }{n\lambda[(n-1)\lambda+\mu h_{n-1}]}. \tag{27} \end{eqnarray}\] And as expected, this reduces to Eq. (17) when \(h_{n-1}=0\). It is illustrative to break the MTTF into two modes: the drive failure mode (DF) and the mode related to UREs. This decomposition allows us to write the total MTTF as follows: \[\begin{eqnarray} \frac{1}{\text{ MTTF}}&=&\frac{1}{\text{ MTTDL}_{\text{ DF}}}+\frac{1}{1/(n\lambda)}h_{n-1},\nonumber\\ \text{ MTTDL}_{\text{ DF}}&=& \frac{\mu }{n\lambda[(n-1)\lambda]}, \tag{28} \end{eqnarray}\] and we recognize the factor \(1/(n\lambda)\) as the MTTDL of the system with no redundancy. In fact this is a general pattern: \[\begin{eqnarray} \frac{1}{\text{ MTTF}_c}&=&\frac{1}{\text{ MTTDL}_{c,\text{ DF}}}+\frac{h_{n-1}}{\text{ MTTDL}_{c-1,\text{ DF}}}, \tag{29} \end{eqnarray}\] where \(\text{ MTTDL}_c\) is given in Eq. (21). To get a feel for \(h_{n-1}\), let’s plug in typical numbers: \(\text{ UER}=10^{-15}\), \(n=20\) and \(C=10\)TB. \[\begin{equation} h_{n-1}=1-(1-10^{-15})^{19\times 8\times10^{13}}=1-e^{-19\times 8\times 10^{13}\times 10^{-15}}=0.78, \tag{30} \end{equation}\] which is again the probability of having at least one bad sector when all \(n-1\) drives are read to reconstruct the failed one(s). This tells us that, with significant probability, we are going to lose data if the system ever comes down to the critical mode.

If \(h_{n-1}\) is not small, \(\text{ MTTF}_c\) will be dominated by the second term in Eq. (29): \[\begin{eqnarray} \text{ MTTF}_c \simeq \frac{\text{ MTTDL}_{c-1,\text{ DF}}}{h_{n-1}}, \tag{31} \end{eqnarray}\] which highlights the fact that the redundancy is effectively reduced by one unit due to UREs. The short sketch below puts numbers on this effect.
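This is a minimal, self-contained sketch of Eq. (29) with illustrative inputs (18+2, 1% AFR, 20TB drives, 50 MB/s recovery, UER of \(10^{-15}\)); parallel repairs are ignored for simplicity:

import math

# Illustrative assumptions: 18+2 erasure coding, 1% AFR, 20TB drives, 50MB/s rebuild, UER=1e-15/bit
n, c, afr, cap_tb, speed_mbps, uer = 20, 2, 0.01, 20, 50, 1e-15

lam = -math.log(1 - afr)                          # per-drive failure rate, 1/years
mu = 365.0 * speed_mbps * 86400 / (cap_tb * 1e6)  # repair rate, 1/years
# Eq. (23) with n-1 drives read (the simulation code below reads n-c data shards, hence its 0.94):
h = -math.expm1((n - 1) * cap_tb * 8e12 * math.log1p(-uer))

def mttdl_df(redundancy):
    # Eq. (21): MTTDL in years from drive failures only (no parallel-repair c! factor)
    return (mu / lam) ** redundancy * math.factorial(n - redundancy - 1) / (lam * math.factorial(n))

mttf_with_ure = 1.0 / (1.0 / mttdl_df(c) + h / mttdl_df(c - 1))   # Eq. (29)

nines = lambda m: -math.log10(-math.expm1(-1.0 / m))   # one-year durability expressed in nines
print(f"no UREs  : ~{nines(mttdl_df(c)):.1f} nines")
print(f"with UREs: ~{nines(mttf_with_ure):.1f} nines")  # close to the c-1 level, cf. Eq. (31)

The with-URE number lines up with the \(\sim3.3\) nines found by the full simulation later in the post; the no-URE number differs from the \(6.25\) reported there only by the \(c!\) parallel-repair factor of Eq. (22).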

Simulation

I have written a couple of Monte Carlo codes to simulate the durability. You can find them in my repository or copy them below. If you want to play with the code without having to fork it or copy it to run on your local computer, I have also pulled the Python code into Google Colab as a Jupyter notebook; you can modify/run it in your browser here. The codes below are essentially identical other than being written in different languages (the first in Python, the second in JMP’s scripting language, JSL). The following two codes run the Monte Carlo simulation without UREs.

#-------------------------------------------------------------------------------
# Name:        durability_simulation
# Purpose:      Monte Carlo simulation to estimate the probability of data loss in erasure coded systems
# Author:      TQ, quarktetra@gmail.com
# Created:     25/08/2021
# Requires: Python3
#-------------------------------------------------------------------------------
# run this from the cmd window as: durability_simulation.py 20 2 1 20 50 1
# in IDE, use command line parameters: 20 2 1 20 50 1
# see https://tetraquark.netlify.app/post/raid_durability/   for details
import argparse
import math
import numpy as np

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)


def sifter(systems,p_shards,repair_time,failure_rate):   # feed this with the surviving systems. It will return the instances of data losses, and the systems that might still lose data
    instances=0
    systems_to_keep=[]
    for sysid in range(0,len(systems)):
        failuretimes=systems[sysid]
        failuretimes=failuretimes[failuretimes<365.25]  # we are interested in the first year
        sys_lost_data=0
        if failuretimes.size>p_shards: # for data loss you need to have more than p_shards failures, otherwise the system will not lose any data
            failuretimes=np.sort (failuretimes)   # time order the failures for each system. We will carefully count how many will accumulate over time
            toberecovered=np.array([failuretimes[0]])    # the first failure enters into the toberecovered bucket.
            for findex in range(1,failuretimes.size):
                this_time= failuretimes[findex]
                toberecovered=toberecovered[toberecovered>this_time-repair_time]   # keep the ones that have not been repaired yet. (this_time< t_failure+repair_time )
                toberecovered=np.append(toberecovered,[this_time])  # add the current failure to the toberecovered list.

                if len(toberecovered)>p_shards: # at any given point, if the list contains more items than the parity count, data is lost
                    sys_lost_data=1
            if sys_lost_data==0:
                failuretimes[0]=failuretimes[0]+np.random.exponential(failure_rate, 1) #the first failed drive will be replaced. Create a new failure time for it. (add it on top of the current time)
                systems_to_keep.append(failuretimes)  # keep track of these systems, they can still lose data if a replacement fails soon enough
            instances=instances+  sys_lost_data
    return   systems_to_keep,instances

def simulate(t_shards, p_shards, afr, d_cap, r_speed,simulation_size_scale):
    d_shards =t_shards -p_shards
    repair_time=10**6*d_cap/r_speed/(60*60*24)
    failure_rate= -365.25/math.log(1-afr/100)     # this is in 1/days
    MTTDL_th=failure_rate*(failure_rate/repair_time)**p_shards /((d_shards)*factorial(t_shards)/(factorial(d_shards)*factorial(p_shards)));
    nines_th= -math.log10(365/MTTDL_th)
    sim_size=simulation_size_scale*max(10000,min(50*10**math.ceil(nines_th) ,20000000) )
    if p_shards>1:
        ptext=" parities"
    else:
        ptext=" parity"
    print("Please Wait: Running "+str(sim_size)+" simulations: "+str(t_shards) +" total shards with "+
        str(p_shards)+ptext+", i.e.,"+ str(d_shards) +"+"+str(p_shards)+ ",  "+str(afr)+"% AFR, "+str(d_cap)+"TB drive capacity, and "+
        str(r_speed)+"MB/s recovery speed "+ "("+str(round(repair_time,1)) +" days). Theoretical prediction is "+
        str(round(nines_th,2))+" nines. " )

    #initiate the full simulation data
    systems=[]
    for sysid in range(0,sim_size):
        systems.append(np.random.exponential(failure_rate, t_shards))

    totalinstances=0
    while len(systems)>0:
        returned=sifter(systems,p_shards,repair_time,failure_rate)
        systems=returned[0]
        totalinstances=totalinstances+returned[1]

   # print("Simulated "+str(sim_size)+ " systems with " +str(totalinstances ))
    if totalinstances>0:
        print("Simulation Results:"+str(totalinstances )+"  data loss instances: that is "+str(round(-math.log10(totalinstances/sim_size),2))+" nines, with one sigma="
        +str(round(1/(totalinstances**0.5),2)))
    else:
        print("No failures detected. Try to increase the simulation size!")

class Args:
  number_of_total_shards = 20
  number_of_parity_shards = 2
  annual_failure_rate_in_pct = 2
  drive_capacity_in_TB = 20
  recovery_speed_in_MBps=50
  simulation_size_scale=1

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('number_of_total_shards', type=int),
    parser.add_argument('number_of_parity_shards', type=int),
    parser.add_argument('annual_failure_rate_in_pct', type=float),
    parser.add_argument('drive_capacity_in_TB', type=float),
    parser.add_argument('recovery_speed_in_MBps', type=float),
    parser.add_argument('simulation_size_scale', type=int),
    args = parser.parse_args() # disable this if you do not want to use command line arguments. Enable the next line to use the inputs in Args class
    #args=Args()
    simulate(args.number_of_total_shards, args.number_of_parity_shards, args.annual_failure_rate_in_pct, args.drive_capacity_in_TB, args.recovery_speed_in_MBps,args.simulation_size_scale)


if __name__ == '__main__':
    main()
scr_path=Get Default Directory();scr_path=substr(scr_path,2,length(scr_path));scr_name = Current Window() << get window title;

//August 2021, A Monte Carlo sim to compute data durability, contact quarktetra@gmail.com for questions
//see https://tetraquark.netlify.app/post/raid_durability/   for details

start_time=today();
num_simulations=200000;
num_drives_per_system=12;
redundancy=1;
max_year=1;
AFR_val=1; //This is in %
lambda_val=-365.25/(log(1-AFR_val/100));
print("Lambda in years:"|| char(lambda_val/365.25)||",  MTBF:"||char(round(24*lambda_val/1000000,2))||" Million hours, "||char(round(lambda_val/365,1))||" years"  );

repair_time= 10; //in days

MTDL_th_val0_generic=lambda_val*(lambda_val/repair_time)^redundancy/((num_drives_per_system-redundancy)*Factorial(num_drives_per_system)/(Factorial(num_drives_per_system-redundancy)*Factorial(redundancy)));
theory_dppm=round(1000000*(1-exp(-max_year*365.25/MTDL_th_val0_generic))); //in %
//print("theory appr dppm @ redundancy="||char(redundancy)||":   "||char(round(theory_dppm)));
print("Expected number of failures with "||char(num_simulations)||" simulations:"||char(round(num_simulations*(1-exp(-max_year*365.25/MTDL_th_val0_generic))))||". Theoretical  reli non: "
||char(round(-log10(1-exp(-max_year*365.25/MTDL_th_val0_generic)),1) ));
//print("theory  reli: "||char(exp(-max_year*365.25/MTDL_th_val0_generic)));


stepsize=12; //in months
try(dtsum<<close window());try(dtt<<close window());
newname="Reli simulation "||substitute(substitute(MDYHMS(today()),"/","_"),":","_");
dtt=new table(newname,invisible(1));
dtt<< add rows(num_simulations*num_drives_per_system);
eval(parse( "dtt<<new column(\!"sys_id\!",formula( 1+ Floor( (Row() - 1) / "||char(num_drives_per_system)||" )))"  ));
eval(parse( "dtt<<new column(\!"drv_id\!",formula( Row() - Floor( (Row() - 1) / "||char(num_drives_per_system)||" ) * "||char(num_drives_per_system)||"))"  ));
 dtt<<delete column(1); wait(1);:drv_id<<delete formula();:sys_id<<delete formula();

colname="drv_failure_time"; //_lambda_val="||char(lambda_valv)||"_beta="||char(betav);
eval(parse( "dtt<<new column(\!""||colname||"\!",formula("||char(lambda_val)||"*Random Exp()))"  ));
column(colname)<<delete formula();

return_nf=function({},
        dtt<<select where(:drv_failure_time>max_year*365);
        dtt<<delete rows();  // these failed after the time frame we are looking at
        dtmin=dtt<<Summary( Group( :sys_id ));
        dtmin<<select where(:N Rows<=redundancy);   dtmin<<close window();dtt<<delete rows();
        if(nrows(dtt)>0,
            dtt << Sort( By(:sys_id,column(colname)), Order(Ascending,Ascending), Replace Table);
            dtt<<new column("num_failures",numeric);dtt<<new column("num_max_failures",numeric);

            :num_failures[1]=1;:num_max_failures[1]=1;
            maxseen=1;toberecovered={};
            for(rr=2,rr<=nrows(dtt),rr++,
                rr_time=:drv_failure_time[rr];
                rr_m1_time=:drv_failure_time[rr-1];
                if(:sys_id[rr-1]!=:sys_id[rr],:num_failures[rr]=1;  toberecovered={};maxseen=1;:num_max_failures[rr]=1);

                if(:sys_id[rr-1]==:sys_id[rr],
                    Insert Into( toberecovered,:drv_failure_time[rr-1]);
                    tosubtracted=0;
                    for(rc=nitems(toberecovered),rc>=1,rc--,
                      if(:drv_failure_time[rr]-toberecovered[rc]>repair_time, tosubtracted++;remove from(toberecovered,rc))
                      );
                        
                        :num_failures[rr]=:num_failures[rr-1]-tosubtracted+1;
                        if(:num_failures[rr]>maxseen,maxseen=:num_failures[rr]);
                        :num_max_failures[rr]   =maxseen
                );

                
            );
            
            try(dtt<<delete column("failed"));
            dtt<<new column("failed",numeric);
            Eval(Substitute(Expr(:failed<<set formula(if(num_max_failures>_red,1,0))), Expr(_red), redundancy));

            dts=dtt<<summary(group(:sys_id,:failed));
            dts<<select where(:failed==0);dts<<delete rows();
            n_failed=nrows(dts);
            
            if(n_failed==0,dts<<close window();dtt<<delete column("failed");dtt<<delete column("num_failures");dtt<<delete column("num_max_failures");,
                
                dts<<new column("failed_system",formula(1));:failed_system<<delete formula();dts<<delete column("N Rows");dts<<delete column("failed");

                dtt <<Update(With( dts),Match Columns( :sys_id = :sys_id ));dts<<close window();

                dtt<<select where(:failed_system==1);dtt<<delete rows(); //take out the failed systems
                if(nrows(dtt)>0,
                            dtt<<delete column("failed_system");
                            dtt<<delete column("failed");

                            dtmin=dtt<<Summary( Group( :sys_id ),Min( :drv_failure_time ));
                            column("Min(drv_failure_time)")<<set name("drv_failure_time_1");
                            dtmin<<select where(:N Rows==1);dtmin<<delete column("N Rows"); dtt<<delete rows();
                            dtt <<Update(With( dtmin),Match Columns( :sys_id = :sys_id )); dtmin<<close window();

                            dtt<<delete column("num_failures");dtt<<delete column("num_max_failures");

                            colnameR="drv_failure_timeR"; //_lambda_val="||char(lambda_valv)||"_beta="||char(betav);
                            eval(parse( "dtt<<new column(\!""||colnameR||"\!",formula("||char(lambda_val)||"*Random Exp()))"  ));
                            column(colnameR)<<delete formula();
                                    
                            :drv_failure_time<<set name("drv_failure_time0");

                            eval(parse("dtt<<new column(\!"drv_failure_time\!",formula(if(drv_failure_time_1==drv_failure_time0,drv_failure_timeR+"||char(repair_time)||",drv_failure_time0)))"));

                            :drv_failure_time<<delete formula();
                            dtt<<delete column("drv_failure_time_1");dtt<<delete column("drv_failure_time0");dtt<<delete column("drv_failure_timeR");
                        );  
                    );  
        );
        n_failed;
);
rounds={};cumul=0;iimax=0;
for(ii=1,ii<=2,ii++,
if(nrows(dtt)>0,iimax=ii;rounds[ii]=return_nf();
cumul=cumul+rounds[ii]));
//print("calculated "||char(iimax)||" iterations");try(dtt<<close window());

print("MC  number of sysfailures @ redundancy="||char(redundancy)||"= "||char(cumul)||" out of "||char(num_simulations) );


print("theory  reli non: "||char(round(-log10(1-exp(-max_year*365.25/MTDL_th_val0_generic)),1))|| " vs MC "||char(round(-log10(cumul/num_simulations),1)) );
Figure 10 shows the simulation results for various input parameters (see the caption) and at various points in the lifetime of the systems.

Figure 10: The comparison of theory vs. Monte Carlo results with various input parameters. The results show that the Markov chain model is very successful. The number of nines is defined in Eq. (1), with the flooring omitted. The parameters swept are: redundancy (\(c\)) \(\in\) {0, 1}, AFR(%) \(\in\) {1}, total number of drives \(\in\) {8}, repair time (days) \(\in\) {1.4}, and various times in the lifetime of the systems.

Let’s now turn to the case of UREs. Below is a code to simulate data durability including UREs. The essential change from the earlier code is that, as a system is recovering from a critical failure, URE failures are simulated by drawing a uniform random number in \([0,1]\) and comparing it against \(h_{n-1}\) of Eq. (23). The flow chart of the simulation is illustrated in Fig. 11. You can copy the code below, or get it from my repository. You can even modify/run it in your browser on Google Colab (you are welcome).
#-------------------------------------------------------------------------------
# Name:        durability_simulation with unrecoverable errors
# Purpose:      Monte Carlo simulation to estimate the probability of data loss in erasure coded systems
# Author:      TQ, quarktetra@gmail.com
# Created:     25/08/2021
# Requires:    Python3
#-------------------------------------------------------------------------------
# run this from the cmd window as: durability_simulation.py 20 2 1 1 20 50 1  1
# in IDE, use command line parameters: 20 2 1 1 20 50 1 1
# the last integer enables/disables the simulation.
# see https://tetraquark.netlify.app/post/raid_durability/   for details
import argparse
import math
import numpy as np
import random

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)


def sifter(systems,p_shards,repair_time,failure_rate,hval):   # feed this with the surviving systems. It will return the instances of data losses, and the systems that might still lose data
    instances=0
    instances_wuer=0
    systems_to_keep=[]
    for sysid in range(0,len(systems)):
        failuretimes=systems[sysid]
        failuretimes=failuretimes[failuretimes<365.25]  # we are interested in the first year
        sys_lost_data=0
        sys_lost_data_wuer=0
        if failuretimes.size>p_shards-1: # for data loss you need to have more than p_shards failures, otherwise the system will not lose any data
            failuretimes=np.sort (failuretimes)   # time order the failures for each system. We will carefully count how many will accumulate over time
            toberecovered=np.array([failuretimes[0]])    # the first failure enters into the toberecovered bucket.
            for findex in range(1,failuretimes.size):
                this_time= failuretimes[findex]
                toberecovered=toberecovered[toberecovered>this_time-repair_time]   # keep the ones that have not been repaired yet. (this_time< t_failure+repair_time )
                toberecovered=np.append(toberecovered,[this_time])  # add the current failure to the toberecovered list.

                if len(toberecovered)>p_shards: # at any given point, if the list contains more items than the parity count, data is lost
                    sys_lost_data=1
                    sys_lost_data_wuer=1
                if len(toberecovered)>p_shards-1 and random.random()<=hval:   # pull a random number. If it is smaller than hval, then it is URE failures
                    sys_lost_data_wuer=1

            if sys_lost_data_wuer==0:
                failuretimes[0]=failuretimes[0]+np.random.exponential(failure_rate, 1) #the first failed drive will be replaced. Create a new failure time for it. (add it on top of the current time)
                systems_to_keep.append(failuretimes)  # keep track of these systems, they can still lose data if a replacement fails soon enough
            instances=instances+  sys_lost_data
            instances_wuer=instances_wuer+  sys_lost_data_wuer
    return   systems_to_keep,instances,instances_wuer

def MTTDL_calc(t_shards, p_shards, failure_rate,repair_time):
       d_shards=t_shards-p_shards
       cval=failure_rate*(failure_rate/repair_time)**p_shards /((d_shards)*factorial(t_shards)/(factorial(d_shards)*factorial(p_shards)));
      # print("MTTDL for "+str(t_shards) +" total shards with "+
       # str(p_shards)+" shards, i.e.,"+ str(d_shards) +"+"+str(p_shards)+ ",  "+str(failure_rate)+"% AFR,  recovery speed "+ "("+str(round(repair_time,1)) +" days). -->"
       # +str(cval)+" ~"+str(-math.log10(365/cval)) )
       return   cval


def MTTDL_calc_uer(t_shards, p_shards, failure_rate,repair_time,hval):
       MTTDL_th_cred=MTTDL_calc(t_shards, p_shards-1, failure_rate,repair_time)
       MTTDL_th_c=MTTDL_calc(t_shards, p_shards, failure_rate,repair_time)
       invMTTL= 1/MTTDL_th_c + hval/MTTDL_th_cred
       return 1/invMTTL;
def h_calc(d_shards,uer,d_cap):

       return 1-math.exp(-uer*d_shards*d_cap*8/1000)


def simulate(t_shards, p_shards, afr,uer, d_cap, r_speed,simulation_size_scale,simulation_enabled):
    d_shards =t_shards -p_shards
    repair_time=10**6*d_cap/r_speed/(60*60*24)
    failure_rate= -365.25/math.log(1-afr/100)     # this is in 1/days
    hval= h_calc(d_shards,uer,d_cap)
    MTTDL_th=MTTDL_calc(t_shards,p_shards, failure_rate,repair_time)
    MTTDL_th_uer=MTTDL_calc_uer(t_shards,p_shards, failure_rate,repair_time,hval)

    nines_th= -math.log10(365/MTTDL_th)
    nines_th_uer= -math.log10(365/MTTDL_th_uer)
    sim_size=simulation_size_scale*max(10000,min(50*10**math.ceil(nines_th) ,40000000) )
    if p_shards>1:
        ptext=" parities"
    else:
        ptext=" parity"
    if simulation_enabled:
        print("Please Wait: Running "+str(sim_size)+" simulations: "+str(t_shards) +" total shards with "+
            str(p_shards)+ptext+", i.e.,"+ str(d_shards) +"+"+str(p_shards)+ ",  "+str(afr)+"% AFR, uer="+ str(uer)+" * 10^-{15}, "+str(d_cap)+"TB drive capacity, and "+
            str(r_speed)+"MB/s recovery speed "+ "("+str(round(repair_time,1)) +" days)." )
    print(  "prob of UER="+str(round(hval,2)   )    +     ". Theoretical predictions:  NoN= "+    str(round(nines_th,2))+" nines, NoN wUER= " +    str(round(nines_th_uer,2))+" nines. " )

    #initiate the full simulation data
    if simulation_enabled:
        systems=[]
        for sysid in range(0,sim_size):
            systems.append(np.random.exponential(failure_rate, t_shards))

        totalinstances=0
        totalinstances_wuer=0
        while len(systems)>0:
            returned=sifter(systems,p_shards,repair_time,failure_rate,hval)
            systems=returned[0]
            totalinstances=totalinstances+returned[1]
            totalinstances_wuer=totalinstances_wuer+returned[2]

       # print("Simulated "+str(sim_size)+ " systems with " +str(totalinstances ))
        if totalinstances>0:
            print("Simulation Results:"+str(totalinstances )+"  data loss instances: NoN= "
            +str(round(-math.log10(totalinstances/sim_size),2))+" nines, NoN wUER= "+ str(round(-math.log10(totalinstances_wuer/sim_size),2))+ " nines" )
        else:
            print("No failures detected. Try to increase the simulation size!")

     #with one sigma="+str(round(1/(totalinstances**0.5),2))

class Args:
  number_of_total_shards = 20
  number_of_parity_shards = 2
  annual_failure_rate_in_pct = 2
  uer=1
  drive_capacity_in_TB = 20
  recovery_speed_in_MBps=50
  simulation_size_scale=1
  simulation_enabled=1 #make this 0 to operate the code in the calculator mode

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('number_of_total_shards', type=int),
    parser.add_argument('number_of_parity_shards', type=int),
    parser.add_argument('annual_failure_rate_in_pct', type=float),
    parser.add_argument('uer', type=float),   # in the units of 10^{-15}
    parser.add_argument('drive_capacity_in_TB', type=float),
    parser.add_argument('recovery_speed_in_MBps', type=float),
    parser.add_argument('simulation_size_scale', type=int),
    parser.add_argument('simulation_enabled', type=int),
    args = parser.parse_args() # disable this if you do not want to use command line arguments. Enable the next line to use the inputs in Args class
    #args=Args()
    simulate(args.number_of_total_shards, args.number_of_parity_shards, args.annual_failure_rate_in_pct,args.uer, args.drive_capacity_in_TB, args.recovery_speed_in_MBps,args.simulation_size_scale,args.simulation_enabled)


if __name__ == '__main__':
    main()

Figure 11: The flow chart of the simulation.

Battle of models

We have 3 contenders: Wasabi, BackBlaze, and Markov. Which one is the most accurate, and how do we even judge? I will use the simulation results as the gold standard. And since data durability means the probability of not losing any data, I will use the durability with UREs as the deciding metric. The simulation code above reports durability with and without UREs, so we can still compare the durability numbers without UREs; however, that is not the most appropriate metric. One would still be upset to lose a sector of data, since it might corrupt a much larger data set.

Furthermore, I will run the simulation with fewer parities and somewhat higher failure rates so that the simulation captures enough data loss events to estimate the probabilities. For higher redundancies, the simulation would take too long to complete. (In order to make the script work at higher redundancies, I added an argument to disable the simulation and report the numbers based on the Markov chain model; that is, it can be used as a calculator by skipping the simulation.)

Below is the output for a sample run:

Running 40000000 simulations: 20 total shards with 2 parities, i.e.,18+2, 1.0% AFR, uer=1.0 * 10^-{15}, 20.0TB drive capacity, and 50.0MB/s recovery speed (4.6 days).
prob of UER=0.94. Theoretical predictions: NoN= 6.25 nines, NoN wUER= 3.34 nines.
Simulation Results:21 data loss instances: NoN= 6.28 nines, NoN wUER= 3.34 nines

This shows a remarkable agreement between the simulation and the Markov chain model. The true durability number is the one with UREs, that is, \(3.3\) nines. With UREs ignored, you get \(6.3\) nines.

Compare this against the BackBlaze number, \(7\) nines, which you can compute yourself by going back to the BackBlaze model and adjusting the inputs. This shows that the BackBlaze model over-estimates the durability without UREs at these inputs by \(0.7\) nines, which is a factor of \(5\). The bigger issue is that the real durability must include UREs, and that is \(3.3\) nines.

And compare this against the Wasabi number, \(12\) nines, which you can compute yourself by going back to the Wasabi model and adjusting the inputs. This shows that the Wasabi model is not so good!

Conclusions

Calculating data durability is a nontrivial business and requires great care in the math. It is always a good idea to verify models with simulations, and that is typically what people fail to do. The detailed mathematical analysis and simulations show that Markov chain based model predicts the data durability very accurately, whereas some other models are way off.

It is also important to note that any model is only as good as its assumptions, a.k.a. garbage in, garbage out. The most important assumption we made was that failures can be described by an exponential distribution. That may not be valid in certain cases. For example, if drives fail at higher rates as the population ages, a Weibull distribution might be a better fit. The Markov chain analysis can be extended to address such cases.

[1]
Wasabi Cloud Storage, “Wasabi tech brief 2019,” 2019 [Online]. Available: https://wasabi.com/wp-content/uploads/2019/10/Durability-Tech-Brief-2019.pdf
[2]
B. Wilson, “Cloud storage durability,” 2018 [Online]. Available: https://www.backblaze.com/blog/cloud-storage-durability/
[3]
B. Beach, “Python code to calculate the durability of data stored with erasure coding,” 2018 [Online]. Available: https://github.com/Backblaze/erasure-coding-durability/
[4]
W. A. Burkhard and J. Menon, “MDS disk array reliability,” Technical Report CS92-269, Nov. 1993 [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.2063&rep=rep1&type=pdf
[5]
A. Papoulis and U. Pillai, Probability, random variables and stochastic processes, 4th ed. McGraw-Hill, 2001.
[6]
Wikipedia, “Markov chain — Wikipedia, the free encyclopedia,” 2019 [Online]. Available: http://en.wikipedia.org/w/index.php?title=Markov%20chain&oldid=901189015
[7]
I. Iliadis, R. Haas, X.-Y. Hu, and E. Eleftheriou, “Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems,” vol. 36, no. 1, pp. 241–252, Jun. 2008, doi: 10.1145/1384529.1375485. [Online]. Available: https://doi.org/10.1145/1384529.1375485
