The Future of Deep Learning: From Texts to Images and Hierarchies 

Yann LeCun, Chief AI Scientist at Meta AI Research, Silver Professor at the Courant Institute of Mathematical Sciences at New York University, and one of the foremost minds and pioneers in the field of Deep Learning, gave a fascinating insight into the future of this technology last Friday at the Bavarian Academy of Sciences in Munich.

His inspiring talk summarized the evolution from Deep Learning to today's large language models (LLMs), such as ChatGPT, while highlighting the challenges and potential that await us in the coming years.

LeCun distills the problems of today's large language models such as ChatGPT into two main challenges. First, there is not enough text data to train the ever-growing LLMs, and text conveys only limited "world knowledge" in the sense of physics. Second, LLMs are based on word-by-word prediction of the next token, without any planning component. LeCun therefore sees today's LLMs as an impressive breakthrough, but one that is structurally limited.
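To illustrate what word-by-word generation without a planning component means, here is a minimal toy sketch: each word is chosen solely from the prefix seen so far, with no notion of where the sentence is heading. The next-word function below is a hypothetical stand-in, not a real LLM.

```python
# Minimal sketch of word-by-word generation: each word is picked greedily
# from the prefix alone, with no explicit plan for the sentence as a whole.
import random

def toy_next_word(prefix: list[str]) -> str:
    """Hypothetical stand-in for an LLM's next-word prediction."""
    vocabulary = ["the", "cat", "sat", "on", "mat", "."]
    random.seed(len(prefix))           # deterministic toy behaviour
    return random.choice(vocabulary)

def generate(prompt: list[str], max_words: int = 10) -> list[str]:
    words = list(prompt)
    for _ in range(max_words):
        next_word = toy_next_word(words)   # only the past is visible
        words.append(next_word)
        if next_word == ".":               # stop at end of sentence
            break
    return words

print(" ".join(generate(["the", "cat"])))
```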

Extension of LLMs to image data

LeCun proposes to train large networks on image data, analogous to LLMs, by masking out image regions that the network must then reconstruct.
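To make the masking idea concrete, here is a minimal PyTorch sketch of the general principle: random patches are hidden and the network is trained to reconstruct them. The tiny model, mask ratio, and training details are illustrative assumptions, not Meta's actual architecture.

```python
# Minimal sketch of masked-image training: hide random patches and train a
# network to reconstruct them. All hyperparameters are illustrative.
import torch
import torch.nn as nn

patch = 16            # side length of a square patch in pixels
mask_ratio = 0.75     # fraction of patches hidden from the network

class TinyReconstructor(nn.Module):
    """Toy network that maps a masked image back to a full image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def random_patch_mask(images: torch.Tensor) -> torch.Tensor:
    """Return a 0/1 mask that hides `mask_ratio` of the patches."""
    b, _, h, w = images.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > mask_ratio).float()
    return keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)

model = TinyReconstructor()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

images = torch.rand(8, 3, 224, 224)      # stand-in for a real image batch
mask = random_patch_mask(images)
reconstruction = model(images * mask)    # network only sees the visible patches
loss = ((reconstruction - images) ** 2 * (1 - mask)).mean()  # error on hidden patches
loss.backward()
optimizer.step()
print(f"reconstruction loss: {loss.item():.4f}")
```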

This approach addresses two problems at once. First, it solves the shortage of training text, since far more image data exists than text. Second, images can easily be generated by observing the real world, whereas texts have to be written by humans. As with LLMs, Yann LeCun expects a "foundation model" to emerge, i.e., a model that builds up emergent world knowledge about images and can easily be fine-tuned for specific applications.

Among other things, Meta AI has trained such a model, DINOv2, and successfully fine-tuned it with comparatively little data to estimate tree heights from satellite imagery for environmental monitoring, demonstrating the idea of a foundation model for image data.
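As a rough illustration of how such fine-tuning typically looks, here is a hedged PyTorch sketch: the pre-trained DINOv2 backbone is frozen and only a small regression head is trained. The torch.hub entry point comes from Meta's public DINOv2 repository; the head, data, and training details are assumptions for illustration only, not the setup used for the tree-height project.

```python
# Hedged sketch: fine-tune a small head on top of a frozen DINOv2 backbone.
import torch
import torch.nn as nn

# Frozen pre-trained backbone: the "world knowledge" comes from here.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Small task-specific head: this is the part trained with little data.
head = nn.Linear(384, 1)               # 384 = ViT-S/14 feature dimension
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Stand-ins for real satellite tiles and ground-truth tree heights (metres).
tiles = torch.rand(4, 3, 224, 224)
heights = torch.rand(4, 1) * 30

with torch.no_grad():
    features = backbone(tiles)         # one embedding per tile
loss = nn.functional.mse_loss(head(features), heights)
loss.backward()
optimizer.step()
print(f"regression loss: {loss.item():.2f}")
```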

Hierarchical networks for planning

LeCun proposes a new hierarchical architecture, H-JEPA (Hierarchical Joint Embedding Predictive Architecture), which starts with a rough plan and then progressively refines it. This is intended to overcome the limitation of LLMs, which can only think from word to word. The idea is still in its early stages, but it promises to be an exciting new direction for AI development.
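To convey the intuition of coarse-to-fine planning, here is a purely conceptual toy in PyTorch: a high-level module proposes abstract sub-goals and a low-level module fills in finer steps towards each one. This is only an illustration of the hierarchy idea, not LeCun's H-JEPA architecture; all modules and dimensions are assumptions.

```python
# Conceptual toy of coarse-to-fine planning, NOT H-JEPA itself.
import torch
import torch.nn as nn

dim = 32
high_level = nn.GRUCell(dim, dim)      # proposes the next abstract sub-goal
low_level = nn.GRUCell(2 * dim, dim)   # takes fine-grained steps towards it

state = torch.zeros(1, dim)
plan = []
for _ in range(3):                               # 3 coarse sub-goals
    subgoal = high_level(state, state)           # rough planning step
    for _ in range(4):                           # 4 fine steps per sub-goal
        state = low_level(torch.cat([state, subgoal], dim=1), state)
        plan.append(state)

print(f"planned {len(plan)} fine-grained steps from 3 coarse sub-goals")
```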

Our conclusion

The history of neural networks is a history of structural innovations, from CNNs and RNNs to LSTMs and Transformers. LeCun's proposals could be the next evolutionary step - but there are other promising developments as well. While we eagerly wait to see how his ideas evolve, one thing is certain: the integration of LLM-style training and image data is an important milestone.
Image-based foundation models could revolutionize the automatic understanding and processing of image data in business processes. However, the hardware requirements for foundation models are generally significantly higher than for specialized models.

With our new NVIDIA H100 GPUs, we at CIB are ideally positioned to track and evaluate such developments in our research department and make the most of them in our products.

Dr. Tobias Abthoff

Your CIB AI expert