It is generally frowned upon to have LLMs precisely regurgitate part of their training set, but it is an interesting question how you could use LLM training to nearly losslessly compress a huge corpus like the entirety of the Internet Archive.
The Hutter Prize targets perfect, lossless compression, but only of one GB. There would be different trade-offs at the PB scale, and it gets much more interesting when it doesn't have to be bit-accurate.
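To make the idea concrete, the usual way to turn a language model into a compressor is to pair it with an arithmetic coder: the compressed size is essentially the model's cumulative negative log-probability of the corpus. A rough sketch of measuring that bound with an off-the-shelf model (GPT-2 here is purely an illustrative stand-in, not a serious choice for a PB-scale archive):

```python
# Sketch: estimate how small a text would be if compressed by an
# arithmetic coder driven by an LLM's next-token probabilities.
# Assumes the `transformers` and `torch` packages are installed.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The Internet Archive preserves snapshots of the public web."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # loss is the mean cross-entropy (in nats) over the predicted tokens
    loss = model(ids, labels=ids).loss.item()

n_predicted = ids.shape[1] - 1            # first token has no prediction
bits = loss * n_predicted / math.log(2)   # total bits an ideal coder would need
raw_bits = len(text.encode("utf-8")) * 8

print(f"model-coded size: {bits:.1f} bits vs raw: {raw_bits} bits")
```

A real compressor would also have to account for shipping the model weights (or a deterministic recipe to retrain them), which is exactly where the trade-offs at PB scale start to look different from the 1 GB Hutter Prize setting.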