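Nobody likes the idea that the data their computer produces "might be wrong."
But that is just wordplay. Empirically, the vast majority of real-world applications today, commercial transactions included,
no longer need to run on systems with ECC RAM.
This also has to do with how RAM reliability has changed over time.
Besides, RAM is not the only place where data errors can occur;
hard disks, for example, have an error probability too.
The real question is: how likely is it to cause a problem on your system? Where is the budget better spent?
Finding the best use of the budget is a scientific question,
not a matter of the feeling "I don't like that my computer might make a mistake, even if the probability is extremely small."
Neither choice is automatically right; it depends on your needs.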
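Also, detecting and correcting errors is not limited to using ECC on the RAM.
As someone pointed out above, when you need high-correctness data (reproducible and verifiable),
at most you run the same thing once more, on the same machine or on a different one.
i.e.: even if you still feel uneasy and insist on ECC, if it pushes the budget to double or more,
then in most cases you might as well get two machines to run it, or run it a second time on the same machine when necessary.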
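A minimal sketch of the "run it a second time and compare" idea above. The function names here are made up for illustration, and it assumes the computation is deterministic and that its result pickles to the same bytes every time. It only catches errors that differ between the two runs; a fault that corrupts both runs the same way would slip through, which is one reason to do the second run on a different machine.

```python
import hashlib
import pickle


def result_digest(compute, *args):
    """Run compute(*args) and return (result, SHA-256 of its pickled bytes)."""
    result = compute(*args)
    digest = hashlib.sha256(pickle.dumps(result)).hexdigest()
    return result, digest


def run_twice_and_compare(compute, *args):
    """Run the same computation twice; raise if the two results differ."""
    first, digest_1 = result_digest(compute, *args)
    _second, digest_2 = result_digest(compute, *args)
    if digest_1 != digest_2:
        raise RuntimeError("results differ between runs; rerun or check the hardware")
    return first


def sum_of_squares(n):
    # Stand-in for the real workload (hypothetical example).
    return sum(i * i for i in range(n))


checked = run_twice_and_compare(sum_of_squares, 1000)
print(checked)
```

To compare across two machines, each one could write out its digest and you diff the two strings, rather than shipping the whole result around.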
--
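Also, when research goes wrong, the cause is usually not the reliability of the machine doing the computation, but your method and system design.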
DRAM Errors in the Wild: A Large-Scale Field Study (pdf)
"The consequence of a memory error is system dependent.
In systems using memory without support for error correction and detection, a memory error can lead to a machine crash or applications using corrupted data. Most memory systems in server machines employ error correcting codes (ECC) [5], which allow the detection and correction of one or multiple bit errors. If an error is uncorrectable, i.e. the number of affected bits exceed the limit of what the ECC can correct, typically a machine shutdown is forced."
BTW: this paper is a bit old. The study period was 2006~2008, the DDR1 ~ DDR2 era.
2. BACKGROUND AND METHODOLOGY
(Omitted. Very important, but too long, so read it yourself. It covers the population studied and the definitions used.)
note: Google is known to be using alternative (cheaper) methods to build their servers
-
"Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. Our per-DIMM rates of correctable errors translate to an average of 25,000–75,000 FIT (failures in time per billion hours of operation) per Mbit and a median FIT range of 778–25,000 per Mbit (median for DIMMs with errors), while previous studies report 200-5,000 FIT per Mbit. The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others..."
"Conclusion 2: Memory errors are strongly correlated.
We observe strong correlations among correctable errors within the same DIMM."
"Conclusion 3: The incidence of CEs increases with age, while the incidence of UEs decreases with age (due to replacements)"
"Conclusion 4: ...the DIMMs used in the three most recent platforms exhibit lower CE rates, than the two older platforms, despite generally higher DIMM capacities. This indicates that improvements in technology are able to keep up with adversarial trends in DIMM scaling."
etc.
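To get a feel for the FIT numbers in Conclusion 1, here is a rough conversion sketch. It takes FIT as the quote defines it (failures per billion hours of operation), here per Mbit. The 1 GiB DIMM size is my own assumption for illustration, not a figure from the paper. Keep in mind this is the fleet average: per the same quote, only a bit over 8% of DIMMs saw any correctable error in a year, so a small number of bad DIMMs account for most of it.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_errors_per_year(fit_per_mbit, dimm_gib):
    """Convert a FIT-per-Mbit rate into expected errors per year for one DIMM.

    FIT = failures per 1e9 hours of operation (here: per Mbit of capacity).
    dimm_gib is the DIMM capacity in GiB (an assumed value, see above).
    """
    mbit = dimm_gib * 1024 * 8  # 1 GiB = 1024 MiB = 8192 Mbit
    return fit_per_mbit * mbit * HOURS_PER_YEAR / 1e9


# Average range quoted in Conclusion 1, applied to an assumed 1 GiB DIMM.
for fit in (25_000, 75_000):
    print(fit, "FIT/Mbit ->", expected_errors_per_year(fit, dimm_gib=1), "errors/year")
```

Plug in your own DIMM size and whichever FIT figure you trust to see how far the estimate moves.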

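Since you are a professor and graduate students, go find more studies or papers like this on your own.
Then judge how credible the data from all these sources is taken together, and whether it is more likely an overestimate or an underestimate.
Don't let your way of finding answers end up the same as everyone else's: a single step, "go ask someone."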