Why Fine Tuning is Dead w/Emmanuel Ameisen

34,808 views
Published 2024-07-01
Arguments for why fine-tuning has become less useful over time, as well as some opinions as to where the field is going with Emmanuel Ameisen.

Slides, notes, and additional resources are here: parlance-labs.com/education/fine_tuning/emmanuel.h…

00:00: Introduction and Background
01:23: Disclaimers and Opinions
01:53: Main Themes: Trends, Performance, and Difficulty
02:53: Trends in Machine Learning
03:16: Evolution of Machine Learning Practices
06:03: The Rise of Large Language Models (LLMs)
08:18: Embedding Models and Fine-Tuning
11:17: Benchmarking Prompts vs. Fine-Tuning
12:23: Fine-Tuning vs. RAG: A Comparative Analysis
25:03: Adding Knowledge to Models
33:14: Moving Targets: The Challenge of Fine-Tuning
38:10: Essential ML Practices: Data and Engineering
44:43: Trends in Model Prices and Context Sizes
47:22: Future Prospects of Fine-Tuning

All comments (21)
  • @elvissaravia
    Very interesting and thought-provoking talk. I understand Emmanuel's take on why fine-tuning might be dead. However, in my opinion, maybe we still need it, just not as much as we used to, as LLMs get more powerful at analyzing, extracting, summarizing, and all the other capabilities that a wide range of tasks rely on. I prefer to think more deeply about the relationship between fine-tuning, RAG, and prompt engineering, and how to leverage all of them to build highly performant and reliable systems. Great talk! Keep it up!
  • @davidwright6839
    The conceptual analogy that I like to use comes from cartography. The LLM is a map of regions called "concepts" that are projected into the multidimensional tensor space of tokens. Fine-tuning is a conformal map projection of this tensor space to create a "view" appropriate to a user's domain. Prompts are tokens that adjust the zoom level of the conformal map to view greater detail and narrow the possible output responses from the tensor space. RAG is like "street-view" images or satellite data that adjust the temporal window of the map beyond its training cutoff date. Prompts can be optimized for either the LLM or the fine-tuned map. If the prompt tokens are optimized for the LLM, fine-tuning is superfluous. If the prompt tokens are domain-specific for a conformal "view," the fine-tuned map should perform somewhat better.
  • Around 8:30 to 10:10 - The RAG picture absolutely turns the problem into a search problem that is at least as important as the prompting problem. This is a far less trivial problem than most people realize. Using RAG requires deep thinking about the retrieval part, and this is notoriously difficult using embeddings only, at least if you want to optimize your token consumption and the overall inference time of your prompt chain. You'd greatly boost your RAG-based workflow by not only using embeddings but also sticking a real search index behind them, configured for the retrieval you care about. That's a kind of LLM workflow optimization that I feel is not being talked about.
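The hybrid setup this comment describes can be sketched as follows. This is a toy illustration only: the bag-of-words "embedding" and the term-overlap "keyword score" are stand-ins for a real embedding model and a real search index (e.g. BM25), and all names are made up.

```python
import math
from collections import Counter

# Toy corpus; in practice these would be your domain documents.
DOCS = [
    "fine tuning adjusts model weights on task specific data",
    "retrieval augmented generation injects documents into the prompt",
    "a search index supports keyword filters and field boosts",
]

def embed(text):
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, doc):
    # Stand-in for a real search index (e.g. BM25): plain term overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_search(query, docs, alpha=0.5):
    # Blend embedding similarity with keyword relevance.
    qv = embed(query)
    scored = [(alpha * cosine(qv, embed(doc))
               + (1 - alpha) * keyword_score(query, doc), doc)
              for doc in docs]
    return [doc for score, doc in sorted(scored, reverse=True)]

results = hybrid_search("documents injected into the prompt", DOCS)
```

The `alpha` knob is where the "configured for the retrieval you care about" part lives: shifting weight between semantic similarity and exact keyword match per use case.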
  • @poketopa1234
    Man’s argument literally falls apart 5 minutes in. That being said, it’s still an interesting discussion and I appreciate the effort put into it.
  • @thegrumpydeveloper
    I like the questions, but I really wish they had been asked at the end of the presentation rather than breaking the flow, with the answer often being a few slides or talking points down the way.
  • Around 24:00 - Another insight here, with regard to the question from the life sciences guy: when we say "RAG", we tend to assume out-of-the-box, embedding-match RAG. But in many special cases RAG is best implemented with dedicated software parts that take the LLM's output and use other domain-specific NLP and business-rules software to actually do the retrieval you care about. In other words, build LLM workflows that use more than just LLMs. Get the LLM to do a task, then use that output to drive the advanced semantic retrieval that you know works and that embeds a lot of your subject-matter expertise, then use that output, which will typically be much more precise than a vanilla embedding match, to build your next LLM prompt. I would have told the life sciences guy that he's very likely not going to get much benefit from fine-tuning. You can't train a knowledge representation into an LLM with fine-tuning. Fine-tuning helps with task-specific input and output simplification, formatting, pruning, compliance, that sort of thing, not with the actual "logical" inference the model does.
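The multi-stage workflow this comment advocates could look roughly like this. Everything here is a hypothetical stub: `call_llm` fakes an extraction step, and `DOMAIN_INDEX` fakes the domain-specific retrieval layer that would hold the subject-matter expertise.

```python
def call_llm(prompt):
    # Placeholder for a real LLM call; here it "extracts" capitalized terms.
    return [w.strip(".,") for w in prompt.split() if w[:1].isupper()]

# Hypothetical domain knowledge base, curated by subject-matter experts.
DOMAIN_INDEX = {
    "BRCA1": "BRCA1 is a tumor suppressor gene linked to breast cancer risk.",
    "EGFR": "EGFR mutations are targeted by tyrosine kinase inhibitors.",
}

def domain_retrieve(entities):
    # Rule-based, domain-specific retrieval instead of a vanilla
    # embedding lookup.
    return [DOMAIN_INDEX[e] for e in entities if e in DOMAIN_INDEX]

def build_next_prompt(question):
    # Step 1: use the LLM for a narrow task (entity extraction).
    entities = call_llm(question)
    # Step 2: precise, expert-encoded retrieval on that output.
    facts = domain_retrieve(entities)
    # Step 3: build the next LLM prompt from the retrieved facts.
    return "Context:\n" + "\n".join(facts) + "\nQuestion: " + question

prompt = build_next_prompt("What therapies target EGFR mutations?")
```

The point is the shape of the pipeline, LLM task, then deterministic domain retrieval, then the next prompt, not any particular component.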
  • @tufcat722
    I think what is misleading here is conflating machine learning with LLMs. The scope of LLMs is not the same as machine learning overall. Fine tuning of foundation models is not dead. Furthermore, aren’t the big LLM companies like Anthropic already doing extensive fine tuning on their own base models before releasing to the public? How does that fit with this idea?
  • @user-qg8qc5qb9r
    00:00:00 - Introduction and Purpose of the Talk
    00:00:38 - Emmanuel's Background and Experience
    00:01:12 - Disclaimer and Scope of the Talk
    00:01:39 - Overview of Fine-Tuning: Trends, Performance, and Difficulty
    00:02:13 - Observed Trends in Machine Learning Over the Years
    00:05:22 - The Shift from Training to Fine-Tuning to Prompting
    00:06:16 - Future of Fine-Tuning in Context of LLMs
    00:06:45 - Extrapolating Trends in Fine-Tuning
    00:07:10 - Questions from the Audience on Trends
    00:08:26 - Comparing Fine-Tuning vs. Retrieval-Augmented Generation (RAG)
    00:11:12 - Importance of Context Injection and RAG
    00:14:21 - Detailed Comparison of Fine-Tuning and RAG
    00:16:33 - Audience Questions on Fine-Tuning vs. RAG
    00:18:26 - Performance Comparisons and Paper References
    00:21:34 - Limits of Fine-Tuning for Knowledge Embedding
    00:23:14 - Audience Example: Fine-Tuning for Precision Oncology
    00:25:04 - Adding Knowledge Through Fine-Tuning: Discussion
    00:26:47 - Challenges with Fine-Tuning for Specific Knowledge
    00:28:46 - Audience Question on Multilingual Fine-Tuning
    00:29:59 - Fine-Tuning for Specific Tasks like Code Models
    00:32:01 - Future of Model Training and Context Handling
    00:33:36 - Pre-Training vs. Fine-Tuning in Domain-Specific Models
    00:35:24 - Cost Considerations in Fine-Tuning
    00:37:13 - Examples of Effective Fine-Tuning
    00:38:09 - Evaluating Practical Utility of Fine-Tuning
    00:38:45 - Key Focus Areas in Machine Learning
    00:41:42 - Importance of Data Work and Infrastructure in ML
    00:43:28 - AI Engineering vs. Traditional ML Approaches
    00:45:58 - Trends in Model Pricing and Context Sizes
    00:48:04 - Dynamic Few-Shot Examples and RAG
    00:49:11 - Practical Uses and Best Practices in Prompt Engineering
    00:49:42 - Conclusion and Final Thoughts
  • @esantirulo721
    With LLMs, fine-tuning just makes the grounding problem more complicated: where does the output come from? The base model, the learned data, or from nothing (hallucination)? That's why embedding-based search is great: you know what data you're generating your output from. In some industries (e.g., medical), being able to justify ("ground") an answer is mandatory. There are still a few use cases for fine-tuning, if the cost of transforming the data into (prompt -> completion) pairs is not too high.
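The grounding advantage of retrieval described here can be sketched minimally: when the answer is built only from retrieved documents, the source ids travel with it. The document store and the naive keyword "retrieval" below are made-up placeholders for a real embedding search.

```python
# Hypothetical document store with stable source identifiers.
SOURCES = {
    "guideline-12": "Aspirin is contraindicated with active ulcers.",
    "guideline-47": "Ibuprofen may interact with anticoagulants.",
}

def retrieve(query):
    # Stand-in for embedding search: naive keyword overlap over sources.
    q = set(query.lower().split())
    return {sid: text for sid, text in SOURCES.items()
            if q & set(text.lower().rstrip(".").split())}

def grounded_answer(query):
    hits = retrieve(query)
    # The answer is assembled only from retrieved text, and the source
    # ids are returned with it, so the output can always be justified.
    return {"answer": " ".join(hits.values()), "sources": sorted(hits)}

result = grounded_answer("aspirin and ulcers")
```

With a fine-tuned model alone there is no equivalent of the `sources` field: the provenance of any given output token is opaque.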
  • @lpls
    Can't fine tuning be used to get smaller models to perform a specific task like a bigger model would, but faster and cheaper?
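What this question describes is usually called distillation: use a large model to label data, then fine-tune a small model on the resulting pairs. A minimal sketch, with both models stubbed out as hypothetical placeholders:

```python
def big_model(text):
    # Placeholder for an expensive, high-quality model's judgment.
    return "positive" if "great" in text else "negative"

def build_distillation_set(texts):
    # The large model's outputs become the training targets that a
    # smaller, faster, cheaper model is then fine-tuned on.
    return [{"prompt": t, "completion": big_model(t)} for t in texts]

dataset = build_distillation_set(["great talk", "boring slides"])
# `dataset` would then be handed to a fine-tuning job for the small model.
```

This is one of the use cases the talk treats as still viable: the task is narrow and the evaluation target (agreement with the big model) is well defined.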
  • Around 46:00 - context size is mostly a vanity metric AFAICT. I'd like to see data about how accuracy varies with the percentage of total nominal context that is actually used by the prompts. In fact, this could be one of the most beneficial uses of fine-tuning, for avoiding filling up the context with very long instructions.
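The metric this comment asks for, how much of the nominal context window a prompt actually fills, is easy to compute. A rough sketch, using whitespace splitting as a crude proxy for a real tokenizer and an assumed 128k-token window:

```python
def context_utilization(prompt, context_window=128_000):
    # Whitespace split is only an approximation of real token counts;
    # a production version would use the model's own tokenizer.
    tokens = len(prompt.split())
    return tokens / context_window

# Example: a prompt of ~6,400 words fills about 5% of a 128k window.
u = context_utilization("word " * 6_400)
```

Plotting accuracy against this ratio across a test set would give exactly the data the commenter wants.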
  • @MagusArtStudios
    Been working on RAG in video games, engineering prompts using different models, and it has been going super smoothly. I trained a retrieval model on data and incorporate static elements into it for the video game environment. Been so much fun.
  • @Bootcody
    I love your work. Totally agree, RAG is the way to go. And I especially agree with the "Average time spent per task" point, where preparing the initial data is the most critical step. 👍
  • @enriquebruzual1702
    The success of a RAG app (all things being equal) comes from the context sent to the model, that is, from having a good vector DB and good search results.
  • @zeryf4780
    great and informative conversation! I wonder if there are more channels like yours!
  • @andrewcbuensalida
    When GPT-3 upgrades to 3.5 or 4, is that upgrade caused by fine-tuning? Or a different mechanism? Or is it completely trained from scratch? Thanks for the talk, by the way.
  • @Steve-lu6ft
    When you say we should be spending days working on prompts, how so? I'm assuming you have a high-level overview of how these prompts should be structured in mind, but can you break it down and simplify it for me?
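One common answer to this question is that a prompt decomposes into separately testable parts: role, instructions, few-shot examples, and output format. A hedged sketch of that structure (the sections and helper below are illustrative, not anything from the talk):

```python
def build_prompt(task, examples, query):
    # Each section can be iterated on and evaluated independently,
    # which is where the "days of prompt work" goes.
    sections = [
        "You are a careful assistant.",                       # role
        f"Task: {task}",                                       # instructions
        *[f"Example input: {i}\nExample output: {o}"           # few-shot
          for i, o in examples],
        f"Input: {query}\nOutput:",                            # the query
    ]
    return "\n\n".join(sections)

p = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("great talk", "positive")],
    "boring slides",
)
```

Swapping the `examples` list per query (dynamic few-shot, mentioned at 48:04 in the chapter list) is one of the practices the talk recommends over fine-tuning.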
  • Nice discussion, thanks for sharing. I am 70% into it and still haven't heard examples or justification for why fine-tuning should be avoided. Lots of evaluation results, but those don't make sense if you are fine-tuning: you are mostly doing that to work on your custom data, so generic evaluation benchmarks may not apply, nor portray the real performance of the fine-tuned model. I fine-tune, for example, to get better detection of service requests into categories and potential solutions.
  • @peterbizik224
    Nice session, thank you. I would love to see a reliable, stable base model that understands the languages. But the domain knowledge is always questionable in my opinion, as most (some?) of the books used for base-model training (technical books, advanced papers) are quite complex, and I am still not truly convinced that the text + pictures + math were captured with very high precision.