How did the Open Source community catch up to OpenAI? [Mixtral-8x7B]

OpenAI got some competition when the open-source community released Mixtral-8x7B after just a year. This model uses a unique approach called "mixture of experts" to outperform larger, more generalized models. The key is an intelligent router that selects the best expert models for each question. The result? A faster, more efficient AI model that’s turning heads. But don’t get too excited: running Mixtral still requires a hefty 86 GB of VRAM. Ready to dive into the world of AI breakthroughs? Don’t miss Nvidia’s GTC 24 conference – you might even win an RTX 4080 Super! Connect with industry experts, learn about generative AI, computer vision, and more. Now, go sign up and stay ahead of the game! 🚀

Mixture of Experts Model 🧩

It only took open source a year to make a model that reaches the level of GPT-3.5. While some may say OpenAI has no moat, a year in AI time is really on a different scale than in real life. Nonetheless, this discussion all came back into the picture thanks to Mistral publishing Mixtral-8x7B. Right off the bat, you will notice that it has a really unique naming scheme instead of a single whole number representing the model’s parameter count, which is what people usually do. People were doing maths on its name (8 × 7B) because it refers to a new architecture paradigm introduced with this model, called mixture of experts, which is completely different from how most LLMs operate.

Unique Naming Scheme πŸ“‹

Model Name    | Total Parameter Count | Active Parameters per Token
Mixtral-8x7B  | ~47B                  | ~13B
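
For reference, here is the rough arithmetic behind the name, using the approximate figures from Mistral’s release (a back-of-the-envelope sketch, not an exact layer-by-layer breakdown):

```python
# Rough arithmetic behind the "8x7B" name.
# Figures are the approximate numbers from Mistral's release, not an
# exact layer-by-layer breakdown.
experts = 8
params_per_expert = 7e9                      # the "7B" in the name

naive_total = experts * params_per_expert    # 56B if nothing were shared
actual_total = 46.7e9                        # attention layers are shared across experts
active_per_token = 12.9e9                    # only 2 of the 8 experts run per token

print(f"naive 8 x 7B : {naive_total / 1e9:.0f}B")
print(f"actual total : ~{actual_total / 1e9:.1f}B")
print(f"active/token : ~{active_per_token / 1e9:.1f}B")
```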

The main idea of mixture of experts is nothing new, but it has not been a prominent method for LLMs, especially at scale. However, the people at Mistral were able to make this method work and perform better than Gemini Pro, Claude 2.1, and GPT-3.5. So what exactly is this mixture of experts approach? Well, rather than using a layer that is, for example, 512 neurons wide, it’s now split into eight smaller networks of 64 neurons each. If something like a router can pick the correct network for each inference, then we technically only have to run 1/8 of the neurons on every forward pass.
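
To make that concrete, here is a minimal toy sketch (my own illustration, not Mistral’s code) of a 512-wide layer split into eight 64-unit experts, with a router that picks a single expert per token so only 1/8 of the units run:

```python
# Minimal toy sketch (my own illustration, not Mistral's code):
# a 512-wide feed-forward block split into 8 "experts" of 64 units each,
# with a router that picks a single expert per token (top-1 routing),
# so only 1/8 of the units actually run.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_expert = 512, 8, 64    # 8 x 64 = 512

# One small (in -> hidden -> out) network per expert.
experts = [
    (rng.standard_normal((d_model, d_expert)) * 0.02,
     rng.standard_normal((d_expert, d_model)) * 0.02)
    for _ in range(n_experts)
]
# The router is just a linear layer that scores each expert.
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """x: (d_model,) activation vector for a single token."""
    scores = x @ router_w                    # one score per expert
    best = int(np.argmax(scores))            # top-1: pick the highest-scoring expert
    w_in, w_out = experts[best]
    hidden = np.maximum(x @ w_in, 0.0)       # only this expert's 64 units run
    return hidden @ w_out, best

out, chosen = moe_forward(rng.standard_normal(d_model))
print(f"routed to expert {chosen}, output shape {out.shape}")
```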

Mixture of Experts Approach πŸ› οΈ

Model Size             | Inference Speed            | Performance
~47B total parameters  | Comparable to a 13B model  | Above 13B-class models

The core idea is that there are eight expert models that specialize in different topics, and instead of combining results from all the models, a router decides which two expert models to trust for a given question or prompt. By only using two models, it reduces the computational cost and increases the speed of generation. This combines the strengths of multiple smaller expert models to solve whatever problem the user throws at it.
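
As a rough sketch of how that top-2 routing could look (a simplified illustration under my own assumptions about the shapes, not Mistral’s actual implementation), the router scores all eight experts, keeps the best two, and mixes their outputs with softmax weights:

```python
# Simplified sketch of top-2 routing (my own assumptions, not Mistral's
# reference implementation): the router scores all 8 experts, keeps the best
# two, and mixes their outputs with softmax weights, so only 2 of the 8
# experts run for each token.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden, n_experts, top_k = 512, 1024, 8, 2

experts = [
    (rng.standard_normal((d_model, d_hidden)) * 0.02,
     rng.standard_normal((d_hidden, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def top2_moe(x):
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]        # indices of the two best-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                       # softmax over just the selected experts
    out = np.zeros(d_model)
    for weight, idx in zip(gate, top):       # weighted sum of the two experts' outputs
        w_in, w_out = experts[idx]
        out += weight * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out, top

out, used = top2_moe(rng.standard_normal(d_model))
print("experts used for this token:", used)
```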

Seamless Model Interactions πŸ”„

However, the researchers do not get to decide which expert specializes in what. The process of gradient descent does, which makes it kind of like a black box. But as long as it works better, that’s fine. The Mixtral paper notes that, surprisingly, they do not observe obvious patterns in the assignment of experts based on topic. Instead, the router does exhibit some structured syntactic behavior, so the experts appear to be aligned more with syntax than with the knowledge domain.

Token Assignment Example 🎨

Token | Expert Model Assigned
self  | Expert Model 1
def   | Expert Model 2

Some people have already built their own variants on top of it, and wow, these open source people do work really fast. And if you do want to run Mixtral yourself, I have some bad news for you. While it is claimed to only use 13 billion parameters when running, around 86 GB of VRAM is still recommended to run it without quantizing it. Good luck collecting that much VRAM in the wild.
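
To see why the footprint is so steep even though only ~13B parameters are active per token, here is a back-of-the-envelope estimate (my own rough numbers, weights only, ignoring the KV cache and activations): every one of the ~47B parameters still has to sit in GPU memory, because any expert might be selected for the very next token.

```python
# Back-of-the-envelope VRAM estimate (my own rough numbers, weights only,
# ignoring KV cache and activations). Even though only ~13B parameters are
# active per token, all ~47B must sit in GPU memory, because any expert
# might be selected for the very next token.
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9

total_params = 46.7e9        # approximate total parameter count of Mixtral-8x7B

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(total_params, bits):.0f} GB for weights alone")
```

At half precision that already lands around 90 GB for the weights alone, the same ballpark as the recommended figure above, which is why quantized builds are the practical option for most people.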

Early heads up for Nvidia’s GTC 24: if you attend a digital session anytime between March 18th and 21st, you may have a chance to win an RTX 4080 Super from me. So sign up if you are interested in Nvidia’s upcoming AI breakthroughs and announcements, or just want to win a brand new GPU.

Conclusion:
This year’s GTC conference has topics such as generative AI, computer vision, and innovative workflows. So don’t miss out on this chance to learn from global experts. Thank you so much for watching!

Key Takeaways:

  • The open-source Mixtral-8x7B model employs a unique mixture of experts approach to reach impressive performance levels.
  • Nvidia’s GTC conference is a must-attend event for those interested in AI innovations and breakthroughs.
