
Why does OpenAI collect and retain for 30 days^1 chats that the user wants deleted?

It was doing this prior to being sued by the NYT and many others

OpenAI was collecting chats even when the user asked for deletion, i.e., the user did not want them saved

That's why a lawsuit could require OpenAI to issue a hold order, retain these chats for longer, and produce them to another party in discovery

If OpenAI was not collecting these chats in the ordinary course of its business before being sued by the NYT and many others, then there would be no "deleted chats" for OpenAI to be compelled by court order to retain and produce to the plaintiffs

1. Or whatever period OpenAI decides on; it could change at any time, for any reason. However, OpenAI cannot shorten its retention period after being sued. Google tried this a few years ago: it began destroying chats between employees after it was on notice that it was going to be sued by the US government and state AGs.



I'd trust Sam Altman about as far as I could throw him, and there is absolutely no way OpenAI should be having sensitive private conversations with anybody. Sooner or later all that data will end up with Microsoft, who can then correlate it with a ton of data they already have from other sources (Windows, Office Online, LinkedIn, various communications services including Teams, GitHub, and so on).

This is an intelligence service's wet dream.


> […] there is absolutely no way OpenAI should be having sensitive private conversations with anybody. Sooner or later all that data will end up with Microsoft, who can then […]

I don't think you even need to go as far as Microsoft (who have earned zero points in the Privacy Protection league), just have a look at Altman's "I want to create a biometric database of every human" Orb/Worldcoin eye-scanning project: https://www.ft.com/content/0c5c2b8d-b185-40b6-9221-b80ee130b...


I'm not commenting on the core point of your comment, only the "why retain for 30 days" question.

In an age of automated backups and failovers, deleting can be really hard. Part of the answer could simply be that syncing a delete across all the redundancies (while ensuring those redundancies are reliable when a disaster happens and they need to recover or maintain uptime) may take days to weeks. Also, the 30 days could be the upper limit, as opposed to the average or median time it takes.


The most likely explanation is that whatever storage solution they’re using has built-in “recycle bin” functionality, where deleted data stays around for 30 days before it’s actually deleted. I see this a lot in very large databases; the recycle-bin behavior is built into the data store product.
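The shape of it, as a toy Python sketch (all names here are invented; real products do this inside the storage engine):

    import datetime

    PURGE_AFTER = datetime.timedelta(days=30)  # the "recycle bin" window

    class SoftDeleteStore:
        # Toy key-value store: a user-facing delete only tombstones the row;
        # a periodic purge job hard-deletes rows older than the window.
        def __init__(self):
            self.rows = {}        # key -> value
            self.deleted_at = {}  # key -> tombstone timestamp

        def put(self, key, value):
            self.rows[key] = value
            self.deleted_at.pop(key, None)

        def delete(self, key):
            # "Delete": hide the row and start the 30-day clock.
            self.deleted_at[key] = datetime.datetime.now(datetime.timezone.utc)

        def get(self, key):
            if key in self.deleted_at:
                return None  # invisible to the user, but still on disk
            return self.rows.get(key)

        def purge(self):
            # Batch job: hard-delete anything tombstoned more than 30 days ago.
            cutoff = datetime.datetime.now(datetime.timezone.utc) - PURGE_AFTER
            for key, ts in list(self.deleted_at.items()):
                if ts < cutoff:
                    self.rows.pop(key, None)
                    del self.deleted_at[key]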


I'm doubtful that a data store product used at their scale can't be configured not to keep deleted data for 30 days; for large clients that could be terabytes of deleted data or more. That would be neither cheap nor easy to manage.


Oh, I realize that, but deviating from the defaults they have now would require so much testing, and carry so much risk, that they'll avoid it at all costs.


That sounds very plausible.


The problem when dealing with any company that has proven itself untrustworthy is that by default the innocent "plausible" option is probably no longer the "likely" one.

And I say this knowing that intentionally deleting data is harder than it looks.


That doesn't sound quite right to me.

Something about game theory, art of war, and the difference between stated intentions and actual intentions.

Trustworthiness comes from the alignment of stated intentions, actual intentions, abilities, and actions. Someone can have integrity between stated and actual intentions but fail to follow through. In this case I think we doubt the integrity between OpenAI's stated and actual intentions.

So Sam can say things, and then we find out he wasn't being honest. We can learn about his intentions over time by watching his actions instead of listening to what he says, and then update our assumptions about what his actual intentions seem to be.

Based on what I assume Sam's intentions to be (with some healthy suspicion of the alignment between his stated and actual intentions), I'm still skeptical that the reason for the 30-day thing goes much beyond quality control, the difficulty of balancing deletion against redundancy, and the features of the tech stack they are using.


> I'm not commenting on the core point of your comment, only the "why retain for 30 days" question. In an age of automated backups and failovers, deleting can be really hard.

I doubt it's that. Deletion is hard, but it's not "exactly 30 days" hard.

The most likely explanation is that OpenAI wants the ability to investigate abuse and / or publicly-made claims ("ChatGPT told my underage kid to <x>!" / "ChatGPT praised Hitler!"). If they delete chats right away, they're flying blind and you can claim anything you want.

Now, whether you should have a "delete" button that doesn't really delete stuff is another question.


What is the standard way to handle being forced to restore from backup while ensuring deleted data does not also get restored? Is every delete request stored so that it can be replayed against any restore?


I have only had to manage this in a startup context with relatively low stakes, and it was hard and messy. I don't know what best practice is at the scale OpenAI operates at, but from my limited experience I have an intuition that the challenge is not trivial.

Also, I suspect there is a big gap between best practice and common practice; my guess is that common practice is dysfunctional. I would also suspect there is no standard way, but there are established practices within different technology stacks that vary from performative, to barely compliant, to effective at scale.

In one case I saw, there was a substantial manual effort to load snapshots into instances, run the delete, and then save new snapshots. This was over 10 years ago, though, and it was more of a "we just need to get this done" than a "what's the most elegant way to do this at scale" situation.
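As for the "replay the deletes" idea in the parent question, a rough sketch of the pattern (Python; the snapshot and log formats are invented for illustration):

    import json

    def restore_with_replay(snapshot_path, delete_log_path):
        # Restore a snapshot, then re-apply every delete request recorded
        # since the snapshot, so deleted data does not come back to life.
        with open(snapshot_path) as f:
            data = json.load(f)  # key -> value, as of backup time
        with open(delete_log_path) as f:
            for line in f:
                entry = json.loads(line)  # e.g. {"op": "delete", "key": "user:42"}
                if entry.get("op") == "delete":
                    data.pop(entry["key"], None)
        return data

The catch is that the delete log itself has to durably survive the disaster you're restoring from, which is its own liability.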


> Why does OpenAI collect and retain for 30 days^1 chats that the user wants to be deleted

When working on an e-commerce gig we would get "delete my data" requests from customers, which we were legally obliged to comply with. A script would delete everything we could from the DB immediately. Since we kept 30 days of backups, their data would only be gone from the backups on day 31. I think this was acceptable to the GDPR consultant.
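In rough Python/sqlite terms (the schema and table names here are invented), the flow was something like:

    import datetime
    import sqlite3

    BACKUP_RETENTION = datetime.timedelta(days=30)  # rolling backup window

    def handle_erasure_request(db: sqlite3.Connection, customer_id: int,
                               requested_at: datetime.date) -> datetime.date:
        # Scrub everything we can from the live DB immediately.
        db.execute("DELETE FROM orders WHERE customer_id = ?", (customer_id,))
        db.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
        db.commit()
        # Copies linger in the rolling backups until the last pre-deletion
        # backup ages out, i.e. the data is gone everywhere on "day 31".
        return requested_at + BACKUP_RETENTION + datetime.timedelta(days=1)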

Going into the backups to delete their data there is insane.


> Going into the backups to delete their data there is insane.

If I were legally obliged to delete data, then I'd make sure I deleted it, regardless of the purpose or location of the storage. If you can't handle a delete request, you shouldn't collect the data in the first place.


People expect to see their past orders, save their address, keep a shopping cart, maintain a list of favorites, etc.

If you don't want your data online then don't put it there.


What you want to do is encrypt/anonymize per-user information using a translation layer that also gets backed up. In the case of a GDPR request, you delete this mapping/key and voilà: data cleanup. The backup data becomes unusable (see the sketch below).

But this obviously means building an extensive system to ensure the encoded identifier is the only thing used across your system (or a giant key management system).

In the past I’ve been a part of systems at exabyte scale that had to implement this. Hard but not impossible. I can see how orgs try to ‘legalese’ their way out of doing this though because the only forcing function is judicial.
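A minimal sketch of the idea (Python, using the cryptography package; the in-memory key store is a stand-in for a real key-management service):

    # pip install cryptography
    from cryptography.fernet import Fernet

    user_keys = {}  # user_id -> key; in reality a separate key-management service

    def store_encrypted(user_id, plaintext):
        # Encrypt with a per-user key; the ciphertext can flow into backups freely.
        key = user_keys.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(plaintext)

    def read(user_id, ciphertext):
        return Fernet(user_keys[user_id]).decrypt(ciphertext)

    def gdpr_delete(user_id):
        # Crypto-shredding: drop the key, and every backed-up ciphertext
        # for this user becomes permanently unreadable.
        user_keys.pop(user_id, None)

    token = store_encrypted(42, b"order history")  # plaintext must be bytes
    gdpr_delete(42)  # token is now unrecoverable, even from backups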


Maybe an append-only data store where actual hard deletes only happen as an async batch job? Still, 30 days seems really long for this.
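Something like this, as a toy sketch (Python; the log format is invented):

    def compact(log):
        # Async batch job over an append-only log: keep only the latest live
        # value per key. Tombstoned keys are dropped here, which is the only
        # point where data is actually hard-deleted.
        latest = {}
        for op, key, value in log:  # entries in write order
            if op == "put":
                latest[key] = value
            elif op == "delete":
                latest.pop(key, None)
        return [("put", k, v) for k, v in latest.items()]

    # The delete wins because it appears later in the log.
    log = [("put", "chat:1", "hi"), ("delete", "chat:1", None),
           ("put", "chat:2", "kept")]
    assert compact(log) == [("put", "chat:2", "kept")]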



