Questions for y'all. (Rather, I'm soliciting broad technical and business advice.)
I have built a very fast and efficient CPU-only neural TTS engine in Rust/Torch JIT that is the synthesis of three different models. I've got a bunch of celebrity and cartoon voices I've trained. The selling point is that this runs on cheap, commodity hardware and doesn't require GPUs. I can easily horizontally scale it as a service.
I've currently got it running in a Kubernetes autoscaling group on DigitalOcean. I haven't thrown any real traffic at it beyond load testing, but I think it can survive heavy traffic. What worries me is the bandwidth bill for serving up potentially thousands of hours of generated audio.
Does anyone have experience with other hosts that are cheap for bandwidth-intensive apps? Are there hosts that provide egress bandwidth on the cheap for dynamically generated (non-CDN) content?
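To put rough numbers on the bandwidth question, here's a back-of-envelope sketch. The 128 kbps bitrate and the $0.01/GB price are placeholder assumptions for illustration, not measurements from my service or any host's actual pricing. Swap in real values before drawing conclusions.

```python
def egress_gb(hours: float, bitrate_kbps: float) -> float:
    """Gigabytes (10^9 bytes) sent for `hours` of audio at a constant bitrate."""
    bits = hours * 3600 * bitrate_kbps * 1000
    return bits / 8 / 1e9


def egress_cost_usd(hours: float, bitrate_kbps: float, usd_per_gb: float) -> float:
    """Egress cost for `hours` of audio at a flat per-GB price."""
    return egress_gb(hours, bitrate_kbps) * usd_per_gb


if __name__ == "__main__":
    ASSUMED_BITRATE_KBPS = 128   # assumption: compressed audio stream
    ASSUMED_USD_PER_GB = 0.01    # assumption: hypothetical flat egress price
    for hours in (1_000, 10_000, 100_000):
        gb = egress_gb(hours, ASSUMED_BITRATE_KBPS)
        cost = egress_cost_usd(hours, ASSUMED_BITRATE_KBPS, ASSUMED_USD_PER_GB)
        print(f"{hours:>7} h -> {gb:10.1f} GB, ${cost:,.2f}")
```

Serving uncompressed WAV instead of a compressed format raises the effective bitrate, so the encoding choice matters at least as much as the host's per-GB rate.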
Subsequent to this, I would really like to sell or monetize this app so I can fund the R&D / CapEx intensive startup I really want to undertake.
Who might be the market to buy a TTS system like this?
I was thinking Cartoon Network might want "Rick and Morty" TTS, but despite my engineering to scale this and make it sound really good, I doubt they'd pay me much for the product. I suppose $2M would give me runway to hire a few engineers and buy a lot of the equipment I need, but I have no idea who would pay for this.
Glass for optics is surprisingly expensive, and beyond that I have other extremely high R&D costs.
Alternatively, I also have a "real time" (~800ms delay) neural voice conversion system. I thought about running a Kickstarter campaign and selling it to gamers / the Discord demographic. It's relatively high fidelity with no spectral distortion, and I have a bunch of hypothetical mechanisms to make it an even better fit.
I've also thought about slapping a cute animation system on top of my TTS service to let people animate characters interacting. (Value add?) An earlier non-neural TTS system I built before the last presidential election cycle had something like this, but more primitive: http://trumped.com (The audio quality of this concatenative system is absolute garbage. The new thing I've built is unrelated.)