Yeah, prompt injection is good point. For now, I try separate instruction and data by using JSON format, and run it in sandbox. Not perfect maybe, but I will try add small explanation in README so people can check it better.
In this case the result/output is plain text. Since it's not code it may be harder to imagine an attack vector. As an attacker, here would be some of my capabilities/possibilities:
- I could change the meaning of the output and the output entirely.
- If I can control one part of a larger set of data that is analyzed , I could influence the whole output.
- I could try to make the process take forever in order to waste resources.
I'd say the first scenario is most interesting, especially if I could then potentially also influence how an LLM trained on the output behaves and do even more damage using this down the line.
Let's say I'm a disgruntled website author. I want my users to see correct information on my website but don't want any LLM to be trained on it. In this case I could
probably successfully use prompt injection to "poison" the model.