A few years ago I designed a way to detect bit-flips in Firefox crash reports and last year we deployed an actual memory tester that runs on user machines after the browser crashes. Today I was looking at the data that comes out of these tests and now I'm 100% positive that the heuristic is sound and a lot of the crashes we see are from users with bad memory or similarly flaky hardware. Here's a few numbers to give you an idea of how large the problem is. 🧵 1/5
Let’s spend a ton of extra money minimizing edge case crashing in a browser!!!
🙄
I always love it when folks who don’t actually know what they’re talking about, comment like they do…
It’s not just the browser. This example is the browser, but it’s your entire system stability that is affected by random bit flips.
Removed by mod
Say there’s a jump instruction to an address read from RAM and a bit flip occurs: a condition like “if friend then greet, else kill” suddenly behaves as “if friend then rape, else kill”. Absolutely anything can happen, and none of it was determined by the program’s design flaws or errors. A digital computer is a deterministic system (barring intentional non-deterministic elements like analog-based RNGs); bit flips are non-deterministic, random changes of its state.
In concrete terms: things break for no reason. A perfect program with no bugs, if such a thing exists, will still do random wrong things when bit flips occur. Clear enough?
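The effect of a single flipped bit on control flow can be sketched in a few lines (a hypothetical Python toy, names invented for illustration):

```python
def greet(name):
    return f"hello, {name}"

def attack(name):
    return f"attacking {name}"

# Dispatch table: the "jump address" is picked by an index read from memory.
handlers = [attack, greet]

is_friend = 1              # the flag as stored in RAM
corrupted = is_friend ^ 1  # a single flip of the lowest bit: 1 -> 0

print(handlers[is_friend]("alice"))  # hello, alice
print(handlers[corrupted]("alice"))  # attacking alice -- wrong branch, yet no bug in the code
```

The code itself is correct in both runs; only the state it read from memory differed by one bit.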
In practice perfect programs do exist; they just have to be small enough for formal verification
Blah blah blah, made up use cases.
I don’t want to use the M-word or the T-word, but those “made up use cases” constitute every computer program in existence.
Sorry let me correct. Use cases that normal people give two fucks about and based on reality.
Each and every one of them, moron. Everything you do on a computer every moment.
Guy, you can’t even separate your metal tracks on what people care about vs what is possible.
Point to me where you live in the spectrum.
I don’t know about you, but I use my RAM for a lot more than a browser.
Removed by mod
Simple stuff like a calculator can be just as broken by a bitflip as more complex things. You wouldn’t want your calculator to say 1 + 1 = 2049.
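How “1 + 1 = 2049” happens: one flipped bit (here bit 11) in a stored operand adds 2048 to it, so the flawless addition then returns the wrong sum. A toy sketch:

```python
def flip_bit(value: int, bit: int) -> int:
    """Simulate a single cosmic-ray-style bit flip in a stored integer."""
    return value ^ (1 << bit)

a, b = 1, 1
print(a + b)         # 2, as designed

a = flip_bit(a, 11)  # bit 11 flips in RAM: the stored 1 becomes 2049
print(a + b)         # 2050 -- the calculator code itself is flawless
```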
If you want to rely on your computer, ECC RAM is required.
Exactly, one of the ‘nerd edge cases’ (as the now removed comment mentioned) is that I use ZFS on my NAS.
There’s lots of checksumming and encryption. Errors in that process are not acceptable and could potentially cause data loss. Since one of the points of using ZFS is the enhanced data integrity, not using ECC means losing out on that guarantee.
Nobody fucking cares, my man. Not important. Nobody in the regular world has ever been affected by not having ECC. You’re inventing edge cases that almost nobody cares about. Linus suffers from not understanding normal people.
Based on the article, it looks like at least 10% of crashes are caused by not having ECC.
Well, you are demonstrating that you’re an expert people person so I’ll just have to take your word.
👌
You can’t speak about not having frequent file corruption when you’re not using tools that detect it. I can guarantee you already have plenty of corrupt stuff on your hard drives. RAM bit flips do contribute to that.
You have bugs (broken documents, things failing, freezes, crashes) in the applications you use, and some of them are not due to developer error but to uncorrected memory errors.
If you tried using a filesystem like ZFS with checksumming and regular scrubs, you’d see detected errors very often. Probably not corrected ones, because you wouldn’t use mirroring, to save space, dummy.
And if you were using ECC, you’d see messages about corrected memory errors in dmesg often enough.
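The detection side is just checksums: store a hash next to each block on write, recompute it on read or during a scrub, and flag any mismatch. A minimal sketch of the idea (not real ZFS code, which keeps the checksum in the parent block pointer):

```python
import hashlib

def write_block(data: bytes):
    # On write, keep a checksum alongside the data.
    return data, hashlib.sha256(data).digest()

def scrub_ok(data: bytes, checksum: bytes) -> bool:
    # On read or scrub, recompute and compare; a mismatch means silent corruption.
    return hashlib.sha256(data).digest() == checksum

block, cksum = write_block(b"family photos, tax records, etc.")
print(scrub_ok(block, cksum))                  # True: block is intact

rotted = bytes([block[0] ^ 0x08]) + block[1:]  # one flipped bit somewhere
print(scrub_ok(rotted, cksum))                 # False: the scrub reports corruption
```

Detection alone only tells you a file rotted; actually repairing it needs a second copy (mirror) or parity, which is the point about mirroring above.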
At what % does this affect the average consumer? And in what critical way? Can you cite literally one case where the presence of ECC would have been critical beyond an occasional annoyance?
The exact numbers for when it corrupts something but keeps running are unknown and highly unpredictable.
According to the post above, about 10% of Firefox crashes (more numbers in the post) are caused by this stuff. It’s not unreasonable to say those crashes could instead have had the bit flip land on content, changing maybe a character on the page or something.
Note that it’s not 10% of users, as that’s really hard to figure out. Someone with bad RAM will likely crash more often.
So no. You’re optimizing around an edge case and something users don’t give a fuck about. Got it. 👍
Bit rot is real, I’ve seen it first hand in plenty of cases. While I tend to blame the storage device, for infrequently accessed files that have been copied multiple times across different drives, I can’t rule out RAM or some other source of the corruption.
Improved overall system stability and data accuracy? With error correction, you can also push performance farther, since you can tolerate a certain amount of errors, instead of needing to aim for 0% error rate.
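That trade-off is exactly what ECC codes buy. A minimal illustration with the classic Hamming(7,4) code, which spends 3 parity bits per 4 data bits and can correct any single flipped bit (a toy sketch; real ECC DIMMs use SECDED codes over 64-bit words):

```python
def encode(d):
    """Hamming(7,4): 4 data bits -> 7-bit codeword, parity at positions 1, 2, 4."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def decode(c):
    """Recover the 4 data bits, correcting a single bit flip if one occurred."""
    c = list(c)
    syndrome = ((c[0] ^ c[2] ^ c[4] ^ c[6])          # parity group 1
                + (c[1] ^ c[2] ^ c[5] ^ c[6]) * 2    # parity group 2
                + (c[3] ^ c[4] ^ c[5] ^ c[6]) * 4)   # parity group 4
    if syndrome:                # non-zero syndrome = 1-based position of the bad bit
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                    # flip any single bit in "RAM"
print(decode(word) == data)     # True: the flip was corrected transparently
```

The memory controller does this on every access, which is why corrected errors just show up as log lines instead of crashes.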
DDR5 pretty much has ECC built in.
Linus would disagree with you there. It’s got on-die ECC, which only protects bits inside the DRAM chip itself, not data on the way to the CPU; that isn’t the same as server RAM’s side-band ECC.
When ECC no longer costs a mortgage, I will look into upgrading.
I have lga 1356 xeon 2470v2 with 64gb ddr3 ecc ram, cheap and good setup
You enthusiasts server people, the dozens of you, are not the average consumer.
Who is talking about average consumers? We’re not trying to market something here.
Linus was. The answer is Linus. 🤦♂️ Jesus Christ guys. Two of you.
Yes and? Us here are talking over a federated social platform. None of us are the average consumer.
Removed by mod
Removed by mod
Removed by mod
Removed by mod
Yeah I can’t remember the last time my browser crashed. No way I’m upgrading all that hardware to avoid something that happens that seldom.
Probably not the use case you’d want to buy ECC for. I considered it for my home build because I figured I might process a lot of data at once and would appreciate the peace of mind… but I still decided no, because I could get more RAM for the same price if it weren’t ECC.