
> Could this be solved by software instead of expensive hardware?

Yes.

Imagine a human putting a screw in a hole. You don't follow an "optimal" trajectory; you adapt it on the fly, maybe even making several quick attempts.

Humans do it with a combination of vision, touch and planning.

Each of these is currently still a huge problem for AI, nowhere near human level.



But do you need AI? How about this for a screw-in-hole machine:

Assuming the hole is in the Z plane: a camera in the X plane observes the screw against a high-contrast background, and a camera in the Y plane does the same. The motors need not know their exact position, just be of controllable speed. As the screw gets close to alignment the motor speed is stepped down, and it stops when the screw is aligned. When both cameras report that it's in position, a motor on the Z axis pushes the screw toward the hole, stopping when a plunger next to the screw reports the correct depth.

If you also have to be concerned with the Z-axis alignment, you make the X and Y backgrounds striped; the screw's alignment is measured against those stripes and it's rotated accordingly.

This is how a human would handle it--we do not have anything like the motor precision to get the screw in the hole directly, but we can use our eyes to refine the motion without *needing* that precision. Reliably identifying the screw against an arbitrary background is hard, but this approach doesn't require *identifying* anything. You're just tracking the bounding box of an object that's a very different color from its background.

If you have a large movement field and a high precision requirement you might need two cameras, the second with a much narrower field of view.
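Here's a toy sketch of that loop in Python (all numbers made up; the simulated position update stands in for the real cameras and motors):

    # A runnable toy of the idea above: no precise encoders, just an observed
    # pixel error per axis and a motor speed that steps down as the error shrinks.
    # Real camera/motor calls would replace the simulated position update.

    def align_axis(screw_pos, hole_pos, tolerance=0.5, max_speed=10.0, dt=0.01):
        """Drive screw_pos toward hole_pos using only the observed error."""
        while abs(hole_pos - screw_pos) > tolerance:
            error = hole_pos - screw_pos                          # what the camera reports
            speed = max(-max_speed, min(max_speed, 2.0 * error))  # slower as it closes in
            screw_pos += speed * dt                               # "motor runs at this speed"
        return screw_pos

    x = align_axis(screw_pos=0.0, hole_pos=37.2)    # X-plane camera guides the X motor
    y = align_axis(screw_pos=5.0, hole_pos=-12.8)   # Y-plane camera guides the Y motor
    print(x, y)  # aligned; now push along Z until the plunger reports the right depth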


I'd consider it a more fundamental problem: the lack of a way to introduce new data to a model without repeating training runs and waiting for the error to converge. Humans seem to understand things after a one-shot learning run (perhaps because of our vast experience with the world and our ability to run simulations in our heads, but a subset of that should be possible for ML quite easily).

If you solve this problem, teaching a robot arm to be accurate should be pretty easy. You would just have stereoscopic cameras that map to a 3D world, "program" in a trajectory for the object, and the model would use that trajectory to figure out where to move and how to compensate based on visual feedback.
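For the "stereoscopic cameras that map to a 3D world" part, the basic math for rectified cameras is just triangulation from disparity; a minimal sketch, with made-up calibration numbers:

    import numpy as np

    # Depth from a rectified stereo pair: Z = f * B / d, where f is the focal
    # length in pixels, B the baseline between the cameras, and d the horizontal
    # disparity of the matched point.  Calibration values here are illustrative.

    def triangulate(u_left, u_right, v, f=700.0, baseline=0.12, cx=320.0, cy=240.0):
        d = u_left - u_right                  # disparity in pixels
        z = f * baseline / d                  # depth along the optical axis
        x = (u_left - cx) * z / f             # back-project the pixel to 3D
        y = (v - cy) * z / f
        return np.array([x, y, z])

    print(triangulate(u_left=350.0, u_right=310.0, v=260.0))  # one matched point, in metres

The hard part, as the reply below points out, is getting reliable matches and calibration in the real world, not this arithmetic.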


No, I'm telling you, you're assuming way too much. The problems are lower-level.

> stereoscopic cameras that map to a 3d world

The current state of the art for this is completely atrocious.

Take a look at this very recent research: https://makezur.github.io/SuperPrimitive/

The idea that robots can "understand" the 3D world from vision is, right now, completely illusory.


I basically agree; I don't think you understood my comment.

If you look at the transformers in LLMs, you have an input matrix, some math in the middle (all linear), and an output matrix. If you take a single value of the output matrix and write out the algebraic expression for it, you get something that looks like a linear-layer transformation of the input.
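For concreteness, here's a tiny numpy sketch of one attention head, writing a single output row out as a weighted sum of linearly projected input rows (with the caveat that the softmax weights themselves depend on the input, so the whole map isn't literally linear):

    import numpy as np

    # Toy single-head self-attention; all shapes and weights are made up.
    rng = np.random.default_rng(0)
    T, d = 4, 8                              # sequence length, model width
    X = rng.normal(size=(T, d))              # input matrix
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax rows
    out = A @ V

    # One output row written out "algebraically": a weighted sum of linearly
    # projected input rows.  Note the weights A[i, j] come from a softmax over
    # the input, which is where the nonlinearity hides.
    i = 2
    manual = sum(A[i, j] * (X[j] @ Wv) for j in range(T))
    assert np.allclose(manual, out[i])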

So a transformer is simply a more efficient simplification of n connected layers, and thus is faster to train. But it's not applicable to everything.

For the following examples, let's say you hypothetically had cheap power with good infrastructure to deliver it, A100s that cost a dollar each, and the same budget as OpenAI.

First, you could train GPT models as just a shitload of fully connected, massively wide deep layers.

Second, you could also do 3D mapping quite easily with fully connected deep layers.

First you would train an Observer model to take images from two cameras and reconstruct a virtual 3D scene with an encoder/decoder, probably trained on photorealistic images generated with raytracing.
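As a rough sketch of what that Observer might look like (PyTorch, with made-up sizes, and a coarse voxel grid standing in for "a virtual 3D scene"):

    import torch
    import torch.nn as nn

    # Hypothetical "Observer": two camera images in, a coarse 3D occupancy grid out.
    # Architecture and sizes are illustrative only; training targets could come
    # from a raytraced simulator where the true geometry is known.

    class Observer(nn.Module):
        def __init__(self, voxels=32):
            super().__init__()
            self.voxels = voxels
            self.encoder = nn.Sequential(                   # stacked stereo pair: 6 channels
                nn.Conv2d(6, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(64 * 4 * 4, 512),
            )
            self.decoder = nn.Linear(512, voxels ** 3)      # latent -> occupancy logits

        def forward(self, left, right):
            z = self.encoder(torch.cat([left, right], dim=1))
            occ = self.decoder(z).view(-1, self.voxels, self.voxels, self.voxels)
            return z, occ                                   # latent scene code + 3D grid

    obs = Observer()
    latent, grid = obs(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))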

Then you would train a Predictor model to predict the physics in that 3D scene given a set of historical frames. Since compute is so cheap, you just randomly initialize the initial conditions with velocities and accelerations, and run training until the huge model converges.
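Continuing the sketch above, the Predictor could be as simple as a recurrent model over the Observer's latent scene codes (again, purely illustrative):

    # Hypothetical "Predictor": given a short history of latent scene codes from
    # the Observer, predict the next one.  A GRU is just one illustrative choice.

    class Predictor(nn.Module):
        def __init__(self, latent_dim=512, hidden=512):
            super().__init__()
            self.rnn = nn.GRU(latent_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, latent_dim)

        def forward(self, history):            # history: (batch, frames, latent_dim)
            _, h = self.rnn(history)
            return self.head(h[-1])            # predicted next latent scene code

    pred = Predictor()
    next_scene = pred(torch.stack([latent] * 8, dim=1))   # 8 past frames, made up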

Then you would train a Controller model to move a robotic arm, with the start and final orientations as input and the motion as output.
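And a Controller along the same lines, mapping the current and target scene codes to joint commands (the sizes and the 6-joint assumption are made up):

    # Hypothetical "Controller": given the current and desired latent scene codes
    # (e.g. current vs. target object pose), output joint velocity commands.

    class Controller(nn.Module):
        def __init__(self, latent_dim=512, joints=6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2 * latent_dim, 256), nn.ReLU(),
                nn.Linear(256, joints),
            )

        def forward(self, current, target):
            return self.net(torch.cat([current, target], dim=-1))  # joint velocities

    ctrl = Controller()
    command = ctrl(latent, torch.zeros_like(latent))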

Then hook them all together. On every cycle of the robot controller, the Controller sends commands to move along a path, the robot moves, the Observer reconstructs the 3D scene, the history of scenes is fed to the Predictor, which generates the future position; that prediction has some error against the target, and the Controller adjusts accordingly.
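Wired together, one cycle might look roughly like this (read_cameras and send_to_robot are placeholders for the real camera and robot interfaces):

    # Sketch of the closed loop described above, reusing the three models.

    def control_cycle(obs, pred, ctrl, target_code, history):
        left, right = read_cameras()                     # placeholder camera interface
        latent, _ = obs(left, right)                     # Observer: current scene code
        history = torch.cat([history[:, 1:], latent.unsqueeze(1)], dim=1)
        predicted = pred(history)                        # Predictor: scene one step ahead
        error = target_code - predicted                  # predicted miss vs. desired state
        command = ctrl(latent, latent + error)           # Controller adjusts accordingly
        send_to_robot(command)                           # placeholder robot interface
        return history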

My point is, until we reach that point with power and hardware, there have to be simplification discoveries like the transformer made along the way. One of them is how to one-shot or few-shot adjust parameters for a set of new data. If we can do that, we can basically fine-tune shitty models on specific data quite fast and make them behave well on a very limited data set.
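The closest thing today is probably the crude version: freeze the pretrained model and fit a small head on the handful of new examples. It's not the one-shot parameter update I'm wishing for, but as a purely illustrative sketch:

    # Stopgap for "adjust to new data fast": freeze the pretrained backbone and
    # fit only a small head on a few new examples.  Everything here is made up.

    backbone = Observer()                      # pretend this was already trained
    for p in backbone.parameters():
        p.requires_grad = False

    head = nn.Linear(512, 10)                  # new task with, say, 10 labels
    optim = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    new_examples = [(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128),
                     torch.tensor([3])) for _ in range(5)]    # five made-up samples

    for step in range(50):                     # seconds of training, not a full run
        for left, right, label in new_examples:
            latent, _ = backbone(left, right)
            loss = loss_fn(head(latent), label)
            optim.zero_grad()
            loss.backward()
            optim.step()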




