Committing files with carefully crafted bad data is extremely common for testing how your code handles invalid input, especially in regression tests. And lzma absolutely needs to test itself against bad, possibly malicious data.
Yes, but the carefully crafted bad data should be explainable.
Instead of committing blobs, why not commit documented code which generates those blobs? For example, have a script compress some well-known data, then have it deliberately corrupt the bytes belonging to the file-size field in the archive header.
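A minimal sketch of that idea in Python, using only the stdlib `lzma` module. The exact byte offset corrupted below is an illustrative assumption about the .xz layout, not a spec-exact map; the point is that the corruption is scripted and reviewable rather than an opaque blob:

```python
import lzma

# Compress well-known data so the "valid" blob is fully reproducible.
good = lzma.compress(b"hello world\n" * 100)

# Deliberately corrupt one byte just past the 12-byte .xz stream
# header (illustrative offset; any documented, targeted damage works).
bad = bytearray(good)
bad[12] ^= 0xFF

# The decoder should reject the corrupted blob.
try:
    lzma.decompress(bytes(bad))
    print("unexpectedly decompressed")
except lzma.LZMAError:
    print("rejected as intended")
```

Checking the generator script into the repo instead of the blob means a reviewer can read exactly which bytes were damaged and why.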
Maybe, but that is generally much more work than including a minimized test case which may already exist. I would also argue that binary blobs for testing are not the problem; the problem is code that allows content from a binary blob to be executed during the build and/or later at run time.
I agree, but perhaps OP is suggesting that the hand-crafted data could be generated in a more transparent way, for example via a script/tool that can itself be reviewed.
Should such data never appear in any commit? That rule would be basically necessary to fully prevent this case, but it almost certainly isn't workable. Binary test data is normal, and requiring all data to be generated just means extremely complicated generators for precise trigger conditions, where you can still hide malicious data; you just have to obfuscate it further. That does raise the difficulty, which is a good thing, but it does not make an attack impossible.
I completely agree that it's a good/best practice, but hard-requiring it everywhere has significant costs for all the (overwhelmingly more common) legitimate cases.
It would be reasonable, though, to require that error-case data be thoroughly explained, and it must be explainable: otherwise, what are you testing, and why does the test exist?
The xz exploit depended on that explanation being absent while everyone accepted that the data was necessary for unstated reasons.
Whereas it's entirely reasonable to have a test that says something like "simulate an error where the header is corrupted with early nulls for the decoding logic", i.e. an explanation, plus a generator which flips the targeted bits to those values.
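A hedged sketch of what such an explained generator could look like in Python (`make_early_null_sample` is a hypothetical name, and the offsets assume the 6-byte .xz magic is immediately followed by the stream-flags field, used here purely for illustration):

```python
import lzma

def make_early_null_sample(data: bytes, n_nulls: int = 4) -> bytes:
    """Compress `data`, then overwrite the bytes immediately after
    the 6-byte .xz magic with nulls, simulating a header corrupted
    with early nulls for the decoding logic to reject.

    Offsets are illustrative assumptions, not a spec-exact header map.
    """
    blob = bytearray(lzma.compress(data))
    for i in range(6, 6 + n_nulls):
        blob[i] = 0x00  # targeted, documented corruption
    return bytes(blob)

# The decoder should refuse the nulled-out header.
sample = make_early_null_sample(b"known plaintext " * 32)
try:
    lzma.decompress(sample)
    print("unexpectedly decompressed")
except lzma.LZMAError:
    print("header rejected as intended")
```

The docstring carries the explanation the comment above calls for: a reviewer can see which bytes are nulled and what failure the test is meant to exercise.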
Sure: you _could_ still try inserting an exploit, but now changes to the code also have to come with plausible data changes that match the thing they claim to be testing.
I wouldn't even regard that as a lot of work: why would such a test exist, if not because someone has an explanation for the thing they want to test?