What is the difference between a reranker and an embedding model?

Embedding models encode query and document separately, computing similarity from two vectors. Rerankers run both texts together through transformer layers, allowing mutual attention — more accurate but more expensive per pair. Production systems use retrieve-then-rerank: embeddings find top-K candidates cheaply, rerankers re-order them accurately.

Why release six model sizes instead of one?

Different use cases have different latency and accuracy budgets. A 17M-parameter reranker suits edge deployments or real-time latency constraints; a 1B-parameter model maximizes accuracy for batch pipelines. Offering the full range lets teams choose the tradeoff that fits their infrastructure.

How were these models trained?

Via knowledge distillation: each Ettin reranker was trained to match relevance scores from a larger teacher model (mixedbread-ai's mxbai-rerank-large-v2) using pointwise mean-squared-error loss on a curated dataset.

Hugging Face Releases Six Open-Weights Ettin Reranker Models, From 17M to 1B Parameters

Hugging Face has published six open-weights reranker models spanning 17 million to 1 billion parameters, each trained via knowledge distillation and each reported to reach state-of-the-art performance for its size. According to the Hugging Face Blog, the Ettin Reranker family builds on ModernBERT encoders and ships with the complete training recipe, enabling teams to reproduce or customize the approach on their own datasets.

What Distinguishes Rerankers in Production Retrieval

A reranker (or pointwise cross-encoder) takes a query-document pair and outputs a single relevance score by running both texts through all transformer layers together, so the query and document can attend to each other. This contrasts with embedding models, which encode query and document independently, then compute similarity from the resulting vectors. The joint encoding is substantially more accurate but also more computationally expensive: document embeddings can be computed once and reused across queries, whereas a reranker must run a full forward pass for every query-document pair.

Production systems address this cost asymmetry with a retrieve-then-rerank pattern. According to Hugging Face, a fast embedding model cheaply retrieves the top-K candidates, and a cross-encoder then re-ranks only those K results with high accuracy. Reranking cost per query stays bounded by K, while the final ranking approaches what an exhaustive reranker pass over the entire corpus would produce, provided the relevant documents make it into the top K.
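The two-stage pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not the Hugging Face implementation: `overlap_score` and `bigram_score` are toy stand-ins invented here for an embedding similarity and a cross-encoder respectively, and `CallCounter` is a hypothetical helper that only exists to show that the expensive scorer runs K times rather than once per document in the corpus.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def overlap_score(query: str, doc: str) -> float:
    """Toy first-stage scorer (Jaccard token overlap), standing in for
    similarity between independently computed embeddings."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def bigram_score(query: str, doc: str) -> float:
    """Toy second-stage scorer that looks at both texts together, standing
    in for a cross-encoder: counts query bigrams that also occur in the doc."""
    q, d = query.lower().split(), doc.lower().split()
    return float(len(set(zip(q, q[1:])) & set(zip(d, d[1:]))))


class CallCounter:
    """Hypothetical helper: wraps a scorer and counts its invocations."""

    def __init__(self, scorer: Scorer) -> None:
        self.scorer = scorer
        self.calls = 0

    def __call__(self, query: str, doc: str) -> float:
        self.calls += 1
        return self.scorer(query, doc)


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    k: int,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scorer over the whole corpus, keep the k best.
    # (A real system would typically query a vector index instead.)
    candidates = sorted(
        corpus, key=lambda doc: first_stage(query, doc), reverse=True
    )[:k]
    # Stage 2: expensive scorer over the k candidates only.
    scored = [(doc, reranker(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


corpus = [
    "rerankers score query document pairs jointly",
    "embedding models encode texts independently",
    "document pairs score jointly with attention",
    "bananas are yellow",
]
reranker = CallCounter(bigram_score)
ranked = retrieve_then_rerank(
    "score query document pairs", corpus, overlap_score, reranker, k=2
)
print(ranked[0][0])    # rerankers score query document pairs jointly
print(reranker.calls)  # 2
```

In a real pipeline, the `reranker` argument could wrap a Sentence Transformers `CrossEncoder`, whose `predict` method scores a list of (query, document) pairs; the toy scorers above are only there to keep the sketch self-contained.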

The Ettin Model Lineup and Training Approach

The six released models are cross-encoder/ettin-reranker-17m-v1, cross-encoder/ettin-reranker-32m-v1, cross-encoder/ettin-reranker-68m-v1, cross-encoder/ettin-reranker-150m-v1, cross-encoder/ettin-reranker-400m-v1, and cross-encoder/ettin-reranker-1b-v1. Each was trained via distillation, using pointwise mean-squared-error loss to match relevance scores from mixedbread-ai’s mxbai-rerank-large-v2 teacher model. According to Hugging Face, the training dataset combines a subset of LightOn’s embeddings pre-training corpus with a reranked subset of its fine-tuning data.
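To make the pointwise mean-squared-error objective concrete, here is a minimal sketch in plain Python. It is not the released training script: the "student" is a toy linear model rather than a transformer, and the feature vectors and teacher scores are made-up illustrative values. The only part taken from the article is the idea of fitting a student's score to a teacher's score for each query-document pair by minimizing squared error.

```python
from typing import List, Sequence, Tuple


def pointwise_mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between student and teacher scores, one per pair."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must score the same pairs")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Toy student: a linear score over hand-made pair features."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distill_linear_student(
    features: Sequence[Sequence[float]],
    teacher: Sequence[float],
    lr: float = 0.1,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit the toy student to the teacher scores by gradient descent on MSE."""
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = [predict(w, b, x) - t for x, t in zip(features, teacher)]
        # Gradient of mean((pred - teacher)^2) with respect to w and b.
        grad_w = [
            2.0 / n * sum(e * x[j] for e, x in zip(errors, features))
            for j in range(dim)
        ]
        grad_b = 2.0 / n * sum(errors)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Made-up features for four query-document pairs and made-up teacher scores.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
teacher_scores = [0.9, 0.2, 0.95, 0.05]

loss_before = pointwise_mse([0.0] * len(teacher_scores), teacher_scores)
w, b = distill_linear_student(features, teacher_scores)
loss_after = pointwise_mse([predict(w, b, x) for x in features], teacher_scores)
print(loss_before > loss_after)  # True
```

The loss is called pointwise because each pair contributes its own squared error independently; no comparison between documents for the same query enters the objective.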

The models are packaged as standard Sentence Transformers CrossEncoder objects, requiring only three lines of code to load and run. Hugging Face also released the full training script and hyperparameters, enabling teams to replicate or adapt the distillation recipe for domain-specific datasets.

Performance and Practical Sizing

On the MTEB (English, v2) Retrieval benchmark, with first-stage candidates retrieved by Google’s Embedding Gemma 300M embedder, the six Ettin rerankers achieved state-of-the-art performance at their respective sizes, according to Hugging Face. The release also includes benchmark results for five additional embedder pairings, allowing practitioners to evaluate compatibility with their existing retrieval stacks.

The size range reflects a deliberate design choice: the 17M-parameter model suits latency-critical applications (edge inference, real-time constraints), while the 1B-parameter variant maximizes accuracy for batch-processing pipelines. With four sizes in between, a team is less likely to need custom pruning or distillation to fit a model to its infrastructure.
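As a rough illustration of the sizing decision, the sketch below picks the largest Ettin reranker that fits under a parameter budget. The model IDs come from the lineup above and the parameter counts are the nominal sizes in their names; the function `largest_model_within` is a hypothetical helper, and parameter count is only a crude proxy for latency, which should be measured on the target hardware.

```python
from typing import List, Optional, Tuple

# Model IDs as listed in the article; sizes are the nominal counts in the names.
ETTIN_RERANKERS: List[Tuple[str, int]] = [
    ("cross-encoder/ettin-reranker-17m-v1", 17_000_000),
    ("cross-encoder/ettin-reranker-32m-v1", 32_000_000),
    ("cross-encoder/ettin-reranker-68m-v1", 68_000_000),
    ("cross-encoder/ettin-reranker-150m-v1", 150_000_000),
    ("cross-encoder/ettin-reranker-400m-v1", 400_000_000),
    ("cross-encoder/ettin-reranker-1b-v1", 1_000_000_000),
]


def largest_model_within(max_params: int) -> Optional[str]:
    """Return the largest listed model whose nominal size fits the budget,
    or None if even the smallest model is too large."""
    eligible = [(name, size) for name, size in ETTIN_RERANKERS if size <= max_params]
    if not eligible:
        return None
    return max(eligible, key=lambda item: item[1])[0]


print(largest_model_within(100_000_000))  # cross-encoder/ettin-reranker-68m-v1
```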

Why This Matters

For teams building retrieval-augmented generation (RAG) systems or semantic search products, reranking is a high-impact, low-cost accuracy multiplier, but only if the reranker’s inference cost fits the latency budget for the number of candidates being re-ranked. By releasing six open-weights models at different scales with transparent training recipes, Hugging Face reduces the friction of choosing between proprietary reranking APIs and building a reranker in-house. Teams can now select a model size that fits their latency budget, deploy it without vendor lock-in, and customize the training data if domain performance lags. The published distillation approach also serves as a template for organizations needing to retrain on proprietary corpora or adapt to new ranking signals.