
Someone probably already suggested this, but I haven’t seen it yet, so I’ll throw a wild speculation into the mix:

I saw a comment (that I can’t find now) wondering if Sam might have been fired for copyright reasons. Pretty much all the big corpora used in LLM training contain copyrighted material, but that’s not a surprise and I really don’t think they’d kick him out over that. But what if he had a team of people deliberately adding a ton of copyrighted material - books, movies, etc. - to the training data for ChatGPT? It feels like it might fit the shape of the situation.



GPT-3 had "books1" and "books2" among its training material, and "books2" never had its actual source disclosed: https://arxiv.org/pdf/2005.14165.pdf

Speculations about these source materials can be traced back as far as 2020: https://twitter.com/theshawwn/status/1320282152689336320

I don't think this issue would've flown under the radar for so long, especially with the implication that Ilya sided with the rest of the board to vote against Sam and Greg.


books2 = libgen imo


That matches with their extreme hurry to get rid of Sam, but it seems like this would be something the CTO would have had knowledge of and she seems to be trusted.


Also, it isn't uniquely attributable to Sam. They all use copyrighted material for training data. By "all", I mean all LLMs (to my knowledge). They don't do it intentionally, but it gets scooped up with everything else.

Hmmm, just thinking... Adam D'Angelo is one of the board members of OpenAI. He has the entire corpus of Quora content to use as training data, i.e. the rights to it are his. But I doubt that only Quora content was used by OpenAI during the past 8 years or so since it was founded! And the content on Quora isn't that great anyway...


Honestly, it feels like OpenAI doesn't take the copyright trolls* seriously enough for this to be the case. I don't think the US has the luxury of setting this dangerous AI precedent.

* You can disagree but no copyright lawsuit by mega corporations is doing it for the good of the law framework. They just want money.



