FlagEmbedding: Powerful Low-Dimensional Text Mapping for Retrieval and Classification Tasks
FlagEmbedding is an embedding model that maps any text to a low-dimensional dense vector. This vector can be used for tasks such as retrieval, classification, clustering, and semantic search. FlagEmbedding can also serve as the embedding component of vector databases for Large Language Models (LLMs).
One of the key advantages of FlagEmbedding is that it uses all available GPUs during encoding, allowing for efficient, high-throughput operation. Users can also restrict encoding to specific GPUs when needed.
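As a minimal sketch of GPU selection, the snippet below restricts the visible CUDA devices before loading the model; the FlagModel class, the checkpoint name, and the constructor arguments are illustrative assumptions and may differ from the installed package version.

```python
import os

# Restricting CUDA_VISIBLE_DEVICES before loading the model is a common way to
# designate which GPUs the encoder may use; without it, all GPUs are used.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from FlagEmbedding import FlagModel

# Checkpoint name and constructor arguments are illustrative, not guaranteed.
model = FlagModel("BAAI/bge-large-en", use_fp16=True)

sentences = [
    "FlagEmbedding maps text to dense vectors.",
    "Embeddings can be used for retrieval and clustering.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
```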
Using FlagEmbedding is straightforward: install the sentence-transformers library and load the model. For retrieval tasks, each query should be prefixed with the retrieval instruction listed in the Model List.
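The following sketch shows this usage with sentence-transformers, assuming a checkpoint name and example texts for illustration; the exact instruction string should be taken from the Model List and is left as a placeholder here.

```python
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en")  # illustrative checkpoint name

# For retrieval, prepend the instruction (see the Model List) to each query only.
instruction = "<retrieval instruction from the Model List>"
queries = ["how do dense retrievers work?"]
passages = ["Dense retrievers encode queries and passages into vectors."]

q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the inner product is the cosine similarity.
scores = q_embeddings @ p_embeddings.T
print(scores)
```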
To use the model with the transformers package, pass the tokenized input through the transformer model and take the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.
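A minimal sketch of this [CLS]-pooling approach with transformers is shown below; the checkpoint name and the normalization step are assumptions added for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en")  # illustrative checkpoint name
model = AutoModel.from_pretrained("BAAI/bge-large-en")
model.eval()

sentences = ["FlagEmbedding maps text to dense vectors."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    # Last hidden state of the first token ([CLS]) serves as the sentence embedding.
    embeddings = outputs.last_hidden_state[:, 0]
    # L2-normalize so inner products behave like cosine similarities.
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

print(embeddings.shape)
```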
The FlagEmbedding model undergoes a two-step process: pre-training and fine-tuning.
During pre-training, the model follows the RetroMAE method, which has demonstrated impressive results in retrieval tasks. The pre-training phase employs 24 A100(40G) GPUs with a batch size of 720. The encoder and decoder mask ratios are set to 0.3 and 0.5, respectively. The optimizer is AdamW with a learning rate of 2e-5.
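For reference, the pre-training hyperparameters described above can be summarized as follows; this is only an illustrative summary, not a configuration file shipped with the repository.

```python
# Illustrative summary of the RetroMAE pre-training setup described above.
retromae_pretraining = {
    "gpus": "24 x A100 (40G)",
    "batch_size": 720,
    "encoder_mask_ratio": 0.3,
    "decoder_mask_ratio": 0.5,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
}
```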
Fine-tuning is the subsequent step, employing a contrastive objective. The training data is formatted as triples (typically a query with a positive and a negative passage), and in-batch negatives are used. Cross-device negative sharing is implemented to increase the number of negatives available for training. The model is trained on 48 A100(40G) GPUs with a substantial batch size of 32,768, so each query in a batch has 65,535 negatives. As in pre-training, the AdamW optimizer is used, with a learning rate of 1e-5. The contrastive loss uses a temperature of 0.01.
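A minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives and a temperature of 0.01 is given below; the actual training code in the repository may differ, and cross-device negative sharing is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    """InfoNCE-style loss: each query's positive passage sits at the same index;
    every other passage in the batch serves as an in-batch negative."""
    q_emb = F.normalize(q_emb, p=2, dim=1)
    p_emb = F.normalize(p_emb, p=2, dim=1)
    # Similarity matrix of shape (batch, batch), scaled by the temperature.
    scores = q_emb @ p_emb.T / temperature
    # The i-th query matches the i-th passage.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings.
q = torch.randn(8, 768)
p = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, p))
```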
For retrieval tasks, this version of the model is trained with an instruction prepended to the query; the exact instruction text for English and for Chinese is given in the Model List. During evaluation, the instruction should only be added for sentence-to-passage retrieval tasks and omitted for other tasks.
The flag_embedding repository provides the fine-tuning script, making it easy to customize the model for specific requirements.
Training data is currently being collected and will be made available in the future. The embedding models and training code will be updated regularly, with the aim of fostering the development of the embedding model community.
For any questions, suggestions, or project-related problems, users can open issues and pull requests, or reach out by email to Shitao Xiao (stxiao@baai.ac.cn) and Zheng Liu (liuzheng@baai.ac.cn).
FlagEmbedding is licensed under the MIT License, enabling its use for commercial purposes free of charge.
With FlagEmbedding, users get a ready-to-use text embedding model for retrieval, classification, clustering, and semantic search. Stay tuned for further developments and improvements to this tool.