news 2026/4/3 2:11:21

Introduction to RAG

张小明

Front-end development engineer

Chunking the extracted text

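def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
        text (str): The text to be chunked.
        n (int): The number of characters in each chunk.
        overlap (int): The number of characters shared by consecutive chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    # A step of 0 makes range() raise, and a negative step yields no chunks at all
    if overlap >= n:
        raise ValueError("overlap must be smaller than n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks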

Both the chunk length and the overlap affect the quality of semantic search.
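To make these two parameters concrete, the sketch below repeats the chunking logic in a self-contained form and reports how many chunks each setting produces. The 1,000-character sample string and the three settings are illustrative assumptions, not values taken from the original data.

```python
def chunk_text(text, n, overlap):
    """Split text into n-character chunks; consecutive chunks share `overlap` characters."""
    step = n - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than n")
    return [text[i:i + n] for i in range(0, len(text), step)]


# Hypothetical sample input: 1,000 characters of filler text.
sample = "x" * 1000

for n, overlap in [(500, 0), (500, 100), (500, 150)]:
    chunks = chunk_text(sample, n, overlap)
    # The step between chunk starts is n - overlap, so more overlap means a smaller step.
    print(f"n={n}, overlap={overlap}: step={n - overlap}, chunks={len(chunks)}")
```

A larger overlap shrinks the step between chunk starts, so the same text tends to yield more chunks and therefore more embeddings to compute and compare.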

Creating embeddings for the text chunks

An embedding converts text into a numeric vector, which makes efficient similarity search possible.

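# `client` is assumed to be an OpenAI-compatible client created earlier (its setup is not shown here)
def create_embeddings(text):
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model="nomic-embed-text",
        input=text
    )

    return response  # Return the response containing the embeddings

# Embedding vectors for the text chunks (text_chunks is the list produced by chunk_text)
response = create_embeddings(text_chunks)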

Semantic search

Implement cosine similarity to find the text chunks most relevant to the user's query.

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
        vec1 (np.ndarray): The first vector.
        vec2 (np.ndarray): The second vector.

    Returns:
        float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
        query (str): The query for the semantic search.
        text_chunks (List[str]): A list of text chunks to search through.
        embeddings (list): Embedding objects for the text chunks, each exposing an `.embedding` vector.
        k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
        List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]
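For intuition about the ranking step, here is a small worked example on toy vectors. It is a sketch that uses only the standard library so it runs without NumPy or an embedding model; the three-dimensional vectors and the helper name `top_k_indices` are made up for illustration.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


def top_k_indices(query_vec, chunk_vecs, k):
    """Rank chunk vectors by similarity to the query and return the best k indices."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scores[:k]]


# Toy "embeddings" (illustrative values only).
query = [1.0, 0.0, 0.0]
chunks = [
    [0.0, 1.0, 0.0],  # orthogonal to the query: similarity 0
    [2.0, 0.0, 0.0],  # same direction, different length: similarity 1
    [1.0, 1.0, 0.0],  # 45 degrees from the query: similarity about 0.707
]

print(top_k_indices(query, chunks, k=2))
```

Note that the second vector ranks first even though it is twice as long as the query: cosine similarity depends only on direction, not on magnitude.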

Running semantic search over the extracted chunks

import json

# Load the validation data from a JSON file
with open('../../data/val.json', encoding="utf-8") as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Perform semantic search to find the top 2 most relevant text chunks for the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

# Print the query
print("Query:", query)

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Generating a response from the retrieved chunks

import os

# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that answers strictly from the given context. If the answer cannot be derived directly from the provided context, reply: 'I do not have enough information to answer this question.'"

def generate_response(system_prompt, user_message):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
        system_prompt (str): The system prompt to guide the AI's behavior.
        user_message (str): The user's message or query.

    Returns:
        str: The text content of the model's reply.
    """
    response = client.chat.completions.create(
        model=os.getenv("LLM_MODEL_ID"),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.1,
        top_p=0.8,
        presence_penalty=1.05,
        max_tokens=4096,
    )
    return response.choices[0].message.content

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response)

At first I built text_chunks with a chunk length of 500 and an overlap of 100. The result:

I then increased the overlap to 150.

This shows that the choice of overlap affects the quality of semantic search.
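One plausible reason overlap matters is that a sentence sitting on a chunk boundary gets cut in two when there is no overlap, while a large enough overlap lets it survive whole in at least one chunk. The sketch below demonstrates this on a made-up 40-character document; it does not reproduce the experiment above, and `appears_whole` is a hypothetical helper.

```python
def chunk_text(text, n, overlap):
    """Split text into n-character chunks; consecutive chunks share `overlap` characters."""
    step = n - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than n")
    return [text[i:i + n] for i in range(0, len(text), step)]


def appears_whole(phrase, chunks):
    """True if at least one chunk contains the entire phrase."""
    return any(phrase in chunk for chunk in chunks)


# Hypothetical document: filler, then a key sentence that crosses position 20.
key = "RAG needs context"                 # 17 characters
document = "a" * 12 + key + "b" * 11     # key occupies positions 12 to 28

for overlap in (0, 5, 10):
    chunks = chunk_text(document, 20, overlap)
    print(f"overlap={overlap}: key sentence intact = {appears_whole(key, chunks)}")
```

In this toy case only the overlap of 10 keeps the sentence intact, which suggests the overlap needs to be comparable to the length of the passages you want to retrieve as a unit.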

Evaluating response quality

# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system that assesses the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or does not match the true response, assign a score of 0. If the response partially matches the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
print(evaluation_response)

1.0

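The AI assistant's answer is highly consistent with the true answer on both the core definition and the importance of the topic. The true answer emphasizes that the goal of XAI is to improve transparency and understandability, and states that its importance lies in building trust, accountability, and fairness. The AI answer elaborates on these points and adds "safety" as part of the importance (which the true answer does not mention explicitly), but the content as a whole is accurate and stays on topic. The answer is therefore very close to the true answer, and the score is 1.0.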
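Since the evaluator replies in free text, scoring many queries means pulling the number out of each reply. The helper below is a hypothetical sketch (`extract_score` is not part of the original code): it assumes the reply contains one of the rubric values 0, 0.5, or 1 as a standalone number and returns the first one it finds.

```python
import re

# The only values the evaluation prompt allows.
ALLOWED_SCORES = {0.0, 0.5, 1.0}

# A number that is not glued to ASCII letters, digits, underscores, or a leading dot.
NUMBER_PATTERN = re.compile(r"(?<![A-Za-z0-9_.])\d+(?:\.\d+)?(?![A-Za-z0-9_])")


def extract_score(evaluation_text):
    """Return the first standalone number that is a valid rubric score, or None."""
    for match in NUMBER_PATTERN.finditer(evaluation_text):
        value = float(match.group())
        if value in ALLOWED_SCORES:
            return value
    return None


print(extract_score("1.0\n\nThe answer is very close to the true answer."))
```

Returning `None` when no valid score is found keeps a malformed reply from being silently counted as 0, so such cases can be retried or inspected by hand.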
