Chinese firms continue to release AI models that rival the capabilities of systems developed by OpenAI and other U.S.-based AI companies.
This week, MiniMax, an Alibaba- and Tencent-backed startup that has raised around $850 million in venture capital and is valued at more than $2.5 billion, debuted three new models: MiniMax-Text-01, MiniMax-VL-01, and T2A-01-HD. MiniMax-Text-01 is a text-only model, while MiniMax-VL-01 can understand both images and text. T2A-01-HD, meanwhile, generates audio — specifically speech.
MiniMax claims that MiniMax-Text-01, which has 456 billion parameters, performs better than models such as Google's recently unveiled Gemini 2.0 Flash on benchmarks like MATH and SimpleQA, which measure the ability of a model to answer math problems and fact-based questions. Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.
As for MiniMax-VL-01, MiniMax says that it rivals Anthropic's Claude 3.5 Sonnet on evaluations that require multimodal understanding, like ChartQA, which tasks models with answering graph- and diagram-related queries (e.g., "What is the peak value of the orange line in this graph?"). Granted, MiniMax-VL-01 doesn't quite best Gemini 2.0 Flash on many of these tests. OpenAI's GPT-4o and an open model called InternVL2.5 beat it on several as well.
Of note, MiniMax-Text-01 has an extremely large context window. A model’s context, or context window, refers to input (e.g., text) that a model considers before generating output (additional text). With a context window of 4 million tokens, MiniMax-Text-01 can analyze around 3 million words in one go — or just over five copies of "War and Peace."
For context (no pun intended), MiniMax-Text-01's context window is roughly 31 times the size of GPT-4o's and Llama 3.1's, both of which top out at 128,000 tokens.
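The figures above can be checked with some quick arithmetic. The sketch below is only a back-of-the-envelope estimate: the 4 million-token window comes from MiniMax, while the 0.75 words-per-token ratio, the roughly 587,000-word length of "War and Peace," and the 128,000-token window commonly cited for GPT-4o and Llama 3.1 are assumptions chosen for illustration. The function and constant names are hypothetical as well.

```python
# Back-of-the-envelope check of the context-window comparisons.
# Only MINIMAX_CONTEXT_TOKENS comes from the article; the other
# constants are illustrative assumptions.

MINIMAX_CONTEXT_TOKENS = 4_000_000   # context window reported by MiniMax
WORDS_PER_TOKEN = 0.75               # assumed rule of thumb for English text
WAR_AND_PEACE_WORDS = 587_000        # assumed approximate word count
REFERENCE_CONTEXT_TOKENS = 128_000   # assumed window for GPT-4o and Llama 3.1


def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Estimate how many English words fit in a given number of tokens."""
    return tokens * words_per_token


def context_summary() -> dict:
    """Derive the comparisons from the constants above."""
    words = tokens_to_words(MINIMAX_CONTEXT_TOKENS)
    return {
        "approx_words": words,
        "war_and_peace_copies": words / WAR_AND_PEACE_WORDS,
        "ratio_to_reference": MINIMAX_CONTEXT_TOKENS / REFERENCE_CONTEXT_TOKENS,
    }


if __name__ == "__main__":
    summary = context_summary()
    print(f"Approximate words: {summary['approx_words']:,.0f}")
    print(f"Copies of 'War and Peace': {summary['war_and_peace_copies']:.1f}")
    print(f"Ratio to a 128,000-token window: {summary['ratio_to_reference']:.2f}x")
```

Under these assumptions, the estimate lands at about 3 million words, a little over five copies of the novel, and a ratio of just over 31 to 1. Actual token counts vary by tokenizer and language, so the word estimate in particular should be read as a rough guide.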
The last of MiniMax's models released this week, T2A-01-HD, is an audio generator optimized for speech. T2A-01-HD can generate a synthetic voice with adjustable cadence, tone, and tenor in around 17 different languages, including English and Chinese, and clone a voice from just 10 seconds of an audio recording.
MiniMax didn't publish benchmark results comparing T2A-01-HD to other audio-generating models. But to this reporter's ear, T2A-01-HD's outputs sound on par with audio models from Meta and startups like PlayAI.
With the exception of T2A-01-HD, which is exclusively available through MiniMax's API and Hailuo AI platform, MiniMax's new models can be downloaded from GitHub and the AI dev platform Hugging Face.