Even though there are likely earlier examples, the first one I know of where someone used automatic differentiation in a gradient method is Arthur Bryson. He used the adjoint method, where you compute the gradient with Lagrange multipliers; it's equivalent to backprop (a rough sketch of the equivalence follows the reference below). Bryson called it the "Steepest-Ascent Method in Calculus of Variations." The earliest reference I found of him using this is the 1962 paper:
Bryson, A. E., and W. F. Denham. “A Steepest-Ascent Method for Solving Optimum Programming Problems.” Journal of Applied Mechanics 29, no. 2 (June 1, 1962): 247–57.
https://asmedigitalcollection.asme.org/appliedmechanics/article-abstract/29/2/247/386190/A-Steepest-Ascent-Method-for-Solving-Optimum?redirectedFrom=fulltext
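To make the equivalence concrete, here is a minimal numerical sketch. This is my own toy example, not Bryson's formulation: the dynamics tanh(A_k x_k + B_k u_k), the linear terminal cost, and all the variable names are made up for illustration. The point it shows is that for a staged system x_{k+1} = f_k(x_k, u_k) with terminal cost phi(x_N), the Lagrange multiplier lam_k computed by the backward (adjoint) recursion is exactly the backpropagated gradient d phi / d x_k.

```python
# Toy sketch of the adjoint method vs. backprop (illustrative assumptions:
# tanh dynamics, linear terminal cost, random matrices).
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 4, 3, 2                       # stages, state dim, control dim
A = [rng.standard_normal((n, n)) for _ in range(N)]
B = [rng.standard_normal((n, m)) for _ in range(N)]
u = [rng.standard_normal(m) for _ in range(N)]
c = rng.standard_normal(n)              # terminal cost phi(x_N) = c . x_N

def step(x, k):
    # dynamics x_{k+1} = f_k(x_k, u_k) = tanh(A_k x_k + B_k u_k)
    return np.tanh(A[k] @ x + B[k] @ u[k])

# forward pass: roll the state through all stages
x = [np.zeros(n)]
for k in range(N):
    x.append(step(x[k], k))
phi = c @ x[N]

# backward (adjoint) pass: lam_N = d phi / d x_N = c, then recurse
lam = c.copy()
grad_u = [None] * N
for k in reversed(range(N)):
    pre = A[k] @ x[k] + B[k] @ u[k]
    d = (1 - np.tanh(pre) ** 2) * lam   # gradient through the tanh nonlinearity
    grad_u[k] = B[k].T @ d              # d phi / d u_k
    lam = A[k].T @ d                    # multiplier lam_k = d phi / d x_k

# sanity check with a finite difference on one control entry
eps, k0, i0 = 1e-6, 1, 0
u[k0][i0] += eps
xp = np.zeros(n)
for k in range(N):
    xp = step(xp, k)
u[k0][i0] -= eps
print(grad_u[k0][i0], (c @ xp - phi) / eps)   # should agree to ~5 decimals
```

The backward loop is the whole story: the multipliers propagate the terminal-cost gradient backwards through the stages, which is exactly what backprop does through the layers of a network.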
Why no one used this for pattern recognition is not clear to me. But one potential explanation is that in the 1960s, people were finding other algorithmic paths toward nonlinear perceptrons, like the potential functions work of Aizerman.
(yay, a first substack comment!)
Thanks for that reference! I didn’t know about this one at all, and I’ll have a look.
More on the topic of pattern recognition, there’s also an Amari paper (of course there is) from 1967 in which he briefly discusses the idea of using gradient learning for non-linear pattern classifiers (https://ieeexplore.ieee.org/document/4039068). However, there are no examples of whether or how this works, and no details on how one might compute the gradients. Overall, I definitely share the feeling that “why no one used this for pattern recognition is not clear”.
Have you read the first edition of "Pattern Recognition" by Duda and Hart? It gives a good sense of what practice was like by 1969.
This paper of theirs is also amazing, just to see how sophisticated the techniques were in the 60s. Only the data and compute were lacking.
https://ieeexplore.ieee.org/document/1687355
I haven't really read it, just had a brief look (pretty sure it was after I heard you talking about it in the "historical thoughts on modern prediction" lecture). I should probably read some of it.
I can't stop being amazed by how advanced the ideas were so early on. I guess this is the main inspiration for this whole attempt at a blog :)
I learned about Highleyman's story from the previous incarnation of the argmin blog; it's fascinating. Did you ever get a copy of that data eventually?