对提取的文本进行分块
def chunk_text(text, n, overlap):
    """
    Chunk the given text into segments of n characters with overlap.

    Args:
        text (str): The text to split.
        n (int): Chunk length in characters.
        overlap (int): Number of characters shared between consecutive chunks.

    Returns:
        List[str]: A list of text chunks.

    Raises:
        ValueError: If overlap >= n — the step size (n - overlap) would be
            zero or negative, which would otherwise loop forever or raise an
            opaque error inside range().
    """
    if overlap >= n:
        raise ValueError("overlap must be smaller than the chunk size n")
    chunks = []  # Collected chunks, in document order.
    # Step forward by (n - overlap) so consecutive chunks share `overlap` chars.
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])
    return chunks
# Chunk length and overlap affect the quality of the semantic search.
为文本块创建嵌入
embedding将文本转换为数值向量,这允许进行高效的相似性搜索
def create_embeddings(text):
    """
    Embed *text* (a string or a list of strings) with the nomic-embed-text model.

    Args:
        text: Input text or list of text chunks to embed.

    Returns:
        The raw embeddings API response; the vectors live in
        ``response.data[i].embedding``.
    """
    return client.embeddings.create(
        model="nomic-embed-text",
        input=text,
    )


# Embedding vectors for all text chunks.
response = create_embeddings(text_chunks)
# --- Semantic search ---
实现余弦相似度来找到与用户查询最相关的文本片段
def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Args:
        vec1 (np.ndarray): The first vector.
        vec2 (np.ndarray): The second vector.

    Returns:
        float: Cosine of the angle between the two vectors.
    """
    # dot(v1, v2) / (|v1| * |v2|)
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return np.dot(vec1, vec2) / denom


def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Return the k text chunks most similar to *query*.

    Args:
        query (str): The query for the semantic search.
        text_chunks (List[str]): Text chunks to search through.
        embeddings: Embedding objects for the chunks (each with an
            ``.embedding`` attribute), aligned with ``text_chunks``.
        k (int): Number of top chunks to return. Default is 5.

    Returns:
        List[str]: The top k most relevant text chunks, best first.
    """
    # Embed the query once and reuse the vector for every comparison.
    query_vec = np.array(create_embeddings(query).data[0].embedding)
    scored = [
        (idx, cosine_similarity(query_vec, np.array(item.embedding)))
        for idx, item in enumerate(embeddings)
    ]
    # Highest similarity first (stable sort preserves index order on ties).
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [text_chunks[idx] for idx, _ in ranked[:k]]
# Run semantic search over the extracted text chunks.
# Load the validation data set.
with open('../../data/val.json', encoding="utf-8") as fp:
    data = json.load(fp)

# Use the first validation question as the search query.
query = data[0]['question']

# Retrieve the two chunks most relevant to the query.
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

print("Query:", query)
for idx, chunk in enumerate(top_chunks):
    print(f"Context {idx + 1}:\n{chunk}\n=====================================")
# --- Generate a response from the retrieved chunks ---
# System prompt: answer strictly from the supplied context, or refuse.
system_prompt = "你是一个AI助手,严格根据给定的上下文进行回答。如果无法直接从提供的上下文中得出答案,请回复:'我没有足够的信息来回答这个问题。'"


def generate_response(system_prompt, user_message):
    """
    Ask the chat model for a completion guided by *system_prompt*.

    Args:
        system_prompt (str): Instructions constraining the model's behavior.
        user_message (str): The user's message or query.

    Returns:
        str: The text content of the model's reply.
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    completion = client.chat.completions.create(
        model=os.getenv("LLM_MODEL_ID"),
        messages=messages,
        temperature=0.1,
        top_p=0.8,
        presence_penalty=1.05,
        max_tokens=4096,
    )
    return completion.choices[0].message.content


# Stitch the retrieved chunks into one prompt, then append the question.
context_blocks = [
    f"上下文内容 {i + 1}:\n{chunk}\n=====================================\n"
    for i, chunk in enumerate(top_chunks)
]
user_prompt = "\n".join(context_blocks)
user_prompt = f"{user_prompt}\n问题: {query}"

# Generate and show the AI response.
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response)
# At first my text_chunks used chunk length 500 and overlap 100; the result:
后面调整重叠度到150
可见重叠度的选择会对semantic search的质量产生影响
评估响应质量
# Define the system prompt for the evaluation system evaluate_system_prompt = "你是一个智能评估系统,负责评估AI助手的回答。如果AI助手的回答与真实答案非常接近,则评分为1。如果回答错误或与真实答案不符,则评分为0。如果回答部分符合真实答案,则评分为0.5。" # Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt evaluation_prompt = f"用户问题: {query}\nAI回答:\n{ai_response}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}" # Generate the evaluation response using the evaluation system prompt and evaluation prompt evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt) print(evaluation_response)1.0
AI助手的回答与真实答案在核心定义和重要性方面高度一致。真实答案强调XAI的目标是提高透明度和可理解性,并指出其重要性在于建立信任、问责和公平性。AI回答详细阐述了这些要点,并补充了"安全性"作为重要性的一部分(虽然真实答案未明确提及),但整体内容准确且未偏离主题。因此,回答与真实答案非常接近,评分为1.0。