#rag

Using the example of a retrieval-augmented generation (RAG) pipeline developed at the University of Victoria Libraries, “Unlocking Web Archives Using RAG” investigates the potential and challenges of integrating large language models with #RAG to transform access to web archives, with a focus on infrastructure, data quality, and ethical AI integration. Watch the CNI video at: youtu.be/cs3iXyV4kLs

LLMs don’t know your PDF.
They don’t know your company wiki either. Or your research papers.

What they can do with RAG is look through your documents in the background and answer using what they find.

But how does that actually work? Here’s the basic idea behind RAG:
☕ Chunking: The document is split into small, overlapping parts so the LLM can handle them. The overlap keeps context from being lost at the boundaries between parts.
☕ Embeddings & Search: Each part is turned into a vector (a numerical representation of meaning). Your question is also turned into a vector, and the system compares them to find the best matches.
☕ Retriever + LLM: The top matches are sent to the LLM, which uses them to generate an answer based on that context.
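
The three steps above can be sketched in plain Python. This is a toy illustration, not what any particular library does: `embed` is a hypothetical word-count stand-in for a real embedding model, the chunk sizes are arbitrary, and `build_prompt` only assembles the text that would be sent to an LLM (no model is called).

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40, overlap=10):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """ASSUMPTION: toy 'embedding' (sparse word counts), not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be passed to the LLM."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    doc = ("Invoices are due within thirty days. " * 3
           + "The office cat is named Biscuit and sleeps on the printer. " * 3)
    chunks = chunk_text(doc, size=12, overlap=4)
    question = "What is the cat named?"
    print(build_prompt(question, retrieve(question, chunks, k=2)))
```

A real pipeline would swap `embed` for an actual embedding model and keep the vectors in a vector index, but the flow (chunk, embed, compare, pass the best matches to the LLM) stays the same.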

Want to really understand how RAG, vector search & chunking work?

Then stop reading theory and build your own chatbot.

This guide shows you how to create a local PDF chatbot using:

☕ LangChain

☕ FAISS (vector DB)

☕ Mistral via Ollama

☕ Python & Streamlit

Step-by-step, from environment setup to deployment. Ideal for learning how Retrieval-Augmented Generation works in practice.

👉 medium.com/data-science-collec

Comment “WANT” if you need the friend link to the article because you don’t have a paid Medium membership.

Data Science Collective · RAG in Action: Build your Own Local PDF Chatbot as a Beginner · By Sarah Lea
#rag #tech #Technology

Hello World! #introduction

Work in cybersec for 25+ years. Big OSS proponent.

Latest projects:

VectorSmuggle is a comprehensive proof-of-concept demonstrating vector-based data exfiltration techniques in AI/ML environments. This project illustrates potential risks in RAG systems and provides tools and concepts for defensive analysis.
github.com/jaschadub/VectorSmu

SchemaPin is a protocol for cryptographically signing and verifying AI agent tool schemas to prevent supply-chain attacks (aka MCP Rug Pulls).
github.com/ThirdKeyAI/SchemaPin

GitHub - jaschadub/VectorSmuggle: Testing platform for covert data exfiltration techniques where sensitive documents are embedded into vector representations and tunneled out under the guise of legitimate RAG operations, bypassing traditional security controls and evading detection through semantic obfuscation.

We are pleased to announce another of the four funded projects from the second round of our #Forschungsstudienprogramm (research study programme) at the Leibniz-Institut für Europäische Geschichte!

🏆 Rainer Simon (@aboutgeo) and Michela Vignoli for their project “Digital Camerarius RAG: Multimodal Information Retrieval Prototype for CH and DH”.
Digital Camerarius: furman-editions-in-progress.gi

Congratulations! We look forward to the innovative insights this project will produce 🎉

Spent some time and had fun building a #Django documentation #RAG chatbot today. It answers questions by retrieving context from Django docs using embeddings. Currently using OpenAI/pgvector just to get some foundational knowledge, but I'd like to switch to entirely local and open-source embedding models (like sentence-transformers) and sqlite-vss for the vector search.
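
For a sense of what the fully local vector search boils down to, here is a minimal sketch using only Python's standard `sqlite3` module. It does not use the sqlite-vss or pgvector APIs: it stores each embedding as JSON text and scans every row by brute force, which is only reasonable for small collections. The table layout, the helper names, and the three-dimensional vectors are made up for illustration; real embeddings would come from a model such as those in sentence-transformers.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def open_store(path=":memory:"):
    """Open (or create) a SQLite database holding text chunks and their embeddings."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)"
    )
    return db


def add_chunk(db, text, embedding):
    """Store one chunk; the embedding is serialized as a JSON list of floats."""
    db.execute(
        "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )
    db.commit()


def search(db, query_embedding, k=3):
    """Brute-force search: score every stored chunk, return the top k as (score, text)."""
    rows = db.execute("SELECT text, embedding FROM chunks").fetchall()
    scored = [(cosine(query_embedding, json.loads(emb)), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    db = open_store()
    # Hypothetical 3-dimensional embeddings, chosen by hand for the demo.
    add_chunk(db, "QuerySet filtering", [1.0, 0.0, 0.0])
    add_chunk(db, "URL routing", [0.0, 1.0, 0.0])
    add_chunk(db, "Model fields", [0.7, 0.0, 0.7])
    for score, text in search(db, [0.9, 0.1, 0.0], k=2):
        print(f"{score:.3f}  {text}")
```

A dedicated vector extension replaces the full scan in `search` with an index, which is what matters once the documentation set grows.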