Learn how to benchmark embedding models on your own data in this beginner-friendly course.
In this course, you will learn:
The limitations of extracting text from PDF files with Python libraries, and how to overcome them with the help of VLMs (Vision Language Models).
How to divide the extracted text into chunks that preserve context.
How to generate questions for each chunk using LLMs (Large Language Models).
How to use embedding models to create vector representations of the chunks and questions (see the sketches after this list).
How to use both open-source and proprietary embedding models.
How to use llama.cpp to run models in the GGUF format locally on your machine.
How to benchmark different embedding models using various metrics and statistical tests with the help of ranx.
How to plot the vector representations to visualize whether clusters form.
How to interpret the p-value that a statistical test provides.
And much more!
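
To give a rough idea of the embedding step, here is a minimal sketch using the sentence-transformers library. The model name, the example chunks, and the questions are placeholders, not necessarily what the course uses.

```python
# Minimal sketch: embed chunks and questions with an open-source model.
# The model name and texts below are placeholders, not the course's actual data.
from sentence_transformers import SentenceTransformer

chunks = [
    "Transformers process tokens in parallel using self-attention.",
    "GGUF is a file format for storing quantized model weights.",
]
questions = [
    "How do transformers process tokens?",
    "What is the GGUF format used for?",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunk_vectors = model.encode(chunks, normalize_embeddings=True)
question_vectors = model.encode(questions, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity,
# so each question can be scored against every chunk.
scores = question_vectors @ chunk_vectors.T
print(scores.shape)  # (num_questions, num_chunks)
```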
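And a minimal sketch of the benchmarking step with ranx. The query and chunk IDs, relevance judgments, scores, and metric choices below are made-up placeholders (a real benchmark would use many more questions), not the course's actual setup.

```python
# Minimal sketch: compare two embedding models with ranx.
# All IDs, relevance judgments, and scores below are made-up placeholders.
from ranx import Qrels, Run, compare

# Ground truth: each question is relevant to the chunk it was generated from.
qrels = Qrels({
    "q_1": {"chunk_1": 1},
    "q_2": {"chunk_2": 1},
})

# Retrieval scores produced by two different embedding models.
run_model_a = Run({
    "q_1": {"chunk_1": 0.92, "chunk_2": 0.31},
    "q_2": {"chunk_1": 0.15, "chunk_2": 0.88},
}, name="model_a")
run_model_b = Run({
    "q_1": {"chunk_1": 0.67, "chunk_2": 0.64},
    "q_2": {"chunk_1": 0.52, "chunk_2": 0.58},
}, name="model_b")

# compare() computes the metrics for each run and applies a statistical
# significance test between runs; the printed report marks significant differences.
report = compare(
    qrels=qrels,
    runs=[run_model_a, run_model_b],
    metrics=["ndcg@10", "mrr", "recall@5"],
)
print(report)
```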
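Finally, a minimal sketch of the cluster visualization, assuming the embeddings are projected to 2D with PCA. The random vectors stand in for real embeddings; in practice you would pass in the chunk and question vectors from your embedding model.

```python
# Minimal sketch: project embeddings to 2D to see whether clusters form.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 384)  # placeholder for real embedding vectors
points = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], s=10)
plt.title("Embeddings projected to 2D with PCA")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```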
You can find the slides, notebook, and scripts in this GitHub repository:
https://github.com/ImadSaddik/Benchma...
The dataset is available here:
https://huggingface.co/datasets/ImadS...
To connect with Imad Saddik, check out his social accounts:
LinkedIn: / imadsaddik
YouTube: / @3codecampers
Website: https://imadsaddik.com/
⭐️ Course Contents ⭐️
(0:00:00) About the course
(0:06:05) Introduction
(0:17:58) Extracting text from PDF documents
(1:01:08) Dividing text into coherent chunks
(1:23:10) Generating question-answer pairs from text chunks
(1:38:48) Embedding text chunks and questions
(2:17:06) Statistical tests and metrics
(3:12:01) Expanding the dataset and adding more languages
(3:45:24) Conclusion