intfloat/e5-small

Published by 微草AIGC录 in 2024

E5-small

Text Embeddings by Weakly-Supervised Contrastive Pre-training.
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
This model has 12 layers and the embedding size is 384.

Usage

Below is an example to encode queries and passages from the MS-MARCO passage ranking dataset.
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = ['query: how much protein should a female eat',
               'query: summit define',
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
               "passage: Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small')
model = AutoModel.from_pretrained('intfloat/e5-small')

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
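To make the pooling and scoring steps in the example easier to follow, here is a minimal sketch that repeats the same arithmetic on plain Python lists, with no model download. The toy vectors, masks, and the helper names `l2_normalize` and `score` are assumptions made for illustration; they are not part of the model or of the `transformers` API.

```python
import math


def average_pool(hidden, mask):
    """Mean of the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(hidden[0])
    kept = [vec for vec, m in zip(hidden, mask) if m]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score(a, b):
    """Dot product of two unit vectors (cosine similarity), scaled by 100."""
    return 100 * sum(x * y for x, y in zip(a, b))


# Toy "last hidden states": 3 token positions, 2 dimensions each.
# The last position is padding, so its values must not affect the result.
query_hidden = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_hidden = [[0.0, 2.0], [4.0, 2.0], [0.0, 0.0]]
passage_mask = [1, 1, 0]

query_vec = l2_normalize(average_pool(query_hidden, query_mask))
passage_vec = l2_normalize(average_pool(passage_hidden, passage_mask))
print(score(query_vec, passage_vec))
```

The padded position in `query_hidden` holds large values on purpose: because its mask entry is 0, the pooled query vector is still `[2.0, 0.0]`. After normalization, the score is the cosine similarity times 100, so it falls between -100 and 100.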

Training Details

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf.

Benchmark Evaluation

Check out unilm/e5 to reproduce evaluation results on the BEIR and MTEB benchmarks.

Citation

If you find our paper or models helpful, please consider citing it as follows:
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}

Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.
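Because anything past 512 tokens is dropped, one common workaround is to split a long document into shorter overlapping pieces and embed each piece separately. The sketch below is an illustration only, not something the model card prescribes: the function name `split_into_passages` is hypothetical, the `max_words` and `overlap` defaults are arbitrary assumptions, and whitespace-separated words are only a rough proxy for the tokenizer's tokens.

```python
def split_into_passages(text, max_words=200, overlap=20):
    """Split text into overlapping word windows, each prefixed with 'passage: '.

    Word counts only approximate token counts, so each chunk should still be
    tokenized with truncation enabled, as in the usage example.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append("passage: " + " ".join(window))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
sample_chunks = split_into_passages(sample_text, max_words=4, overlap=1)
for chunk in sample_chunks:
    print(chunk)
```

Each returned string already carries the "passage: " prefix, so it can be passed to the tokenizer in the same way as the passages in the usage example. How to combine the per-chunk scores (for example, taking the maximum) is left open here.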

