Building a RAG Bot, Generation Part: Combining Retrieval with an LLM, RetrievalQA, Prompt Strategies, and Conversation Memory


4OurFuture 2025. 8. 16. 23:12


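This post picks up the retriever built in Part 3 (indexing and retrievers) and uses it as is. The goal is to cover, in one pass, combining retrieval with an LLM, implementing RetrievalQA and ConversationalRetrievalChain, prompt engineering, context-aware and few-shot prompting, and multi-turn conversation memory. The examples are built around LCEL pipelines (|), with the classic chain APIs shown for comparison, and are written to run as is once a retriever and an OpenAI API key are in place.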

0. Prerequisites

pip install -U langchain langchain-openai langchain-community

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

We assume you already have the retriever created in the previous post (for example, faiss_store.as_retriever(search_kwargs={"k": 4}) or chroma_store.as_retriever(search_kwargs={"k": 4})), bound to the name retriever.

1. Combining Retrieval and the LLM: A Basic RAG Chain

def format_docs(docs):
    return "\n\n".join(
        f"[source: {d.metadata.get('source', '?')}]\n{d.page_content}" for d in docs
    )

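RAG_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a trustworthy Korean-language assistant. Do not guess beyond what the context supports, and say you do not know when you do not know."),
    ("human", "Answer the question based on the context.\nContext:\n{context}\n\nQuestion: {question}"),
])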

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | RAG_PROMPT
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is the no-show fee?"))

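How it works

1) The user's question is passed through unchanged, 2) the retriever searches for documents with that same query, 3) the search results are formatted, 4) the formatted text is inserted into the prompt, and 5) the LLM generates the final answer.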
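To make the data flow concrete, the five steps can be traced in plain Python without any API key. This is only a sketch: Doc, fake_retriever, build_prompt, fake_llm, and run_rag are hypothetical stand-ins for the LangChain Document, the retriever, the prompt template, and the chat model, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_retriever(question: str) -> list[Doc]:
    # Stand-in for the retriever: always returns one fixed chunk.
    return [Doc("No-show bookings are charged 100% of the room rate.", {"source": "fees.md"})]


def format_docs(docs: list[Doc]) -> str:
    # Same formatting rule as the chain: a source tag followed by the chunk text.
    return "\n\n".join(
        f"[source: {d.metadata.get('source', '?')}]\n{d.page_content}" for d in docs
    )


def build_prompt(context: str, question: str) -> str:
    return f"Answer using the context.\nContext:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt: str) -> str:
    # Stand-in for the chat model: reports the prompt size instead of calling an API.
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_rag(question: str) -> dict:
    docs = fake_retriever(question)           # 2) search with the same query
    context = format_docs(docs)               # 3) format the results
    prompt = build_prompt(context, question)  # 4) insert into the prompt
    answer = fake_llm(prompt)                 # 5) generate the answer
    return {"question": question, "context": context, "prompt": prompt, "answer": answer}


trace = run_rag("What is the no-show fee?")  # 1) the question is passed through unchanged
print(trace["prompt"])
```

Returning every intermediate value in one dict is handy while debugging, because you can see exactly what text reached the model.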

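2. Implementing RetrievalQA

2.1 LCEL-based (recommended)

The rag_chain above already does what RetrievalQA does: it retrieves, puts the results into a prompt, and generates an answer. If needed, it can be extended so that the source documents are returned alongside the answer.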

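from langchain_core.runnables import RunnableParallel

# Answer sub-chain: formats the retrieved docs, fills the prompt, and calls the model.
answer_chain = (
    {"context": lambda x: format_docs(x["docs"]), "question": lambda x: x["question"]}
    | RAG_PROMPT
    | llm
    | StrOutputParser()
)

# assign() adds "answer" while keeping "docs" and "question" in the output dict.
rag_with_docs = RunnableParallel(docs=retriever, question=RunnablePassthrough()).assign(
    answer=answer_chain
)
result = rag_with_docs.invoke("What is the SLA for processing refunds?")
print(result["answer"])   # final answer
# print(result["docs"])   # retrieved source documents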

2.2 Classic chain API (quick start)

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # can be swapped for map_reduce, refine, etc.
    retriever=retriever,
    return_source_documents=True,
)
res = qa.invoke({"query": "What is the check-in time policy?"})  # invoke() replaces the deprecated qa({...}) call
print(res["result"])             # answer
# print(res["source_documents"])  # sources

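The classic chain is convenient for a quick start, but LCEL is the better choice for flexible composition and observability. Recent LangChain releases also mark RetrievalQA as deprecated in favor of LCEL-style chains.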
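The chain_type option in the classic API decides how retrieved chunks reach the model. The sketch below contrasts the idea behind "stuff" and "map_reduce" in plain Python. It is a simplified illustration, not LangChain's implementation, and counting_llm is a hypothetical stand-in for a chat model.

```python
from typing import Callable


def stuff(chunks: list[str], question: str, llm: Callable[[str], str]) -> str:
    # "stuff": put every chunk into a single prompt and call the model once.
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(chunks: list[str], question: str, llm: Callable[[str], str]) -> str:
    # "map": ask the model about each chunk on its own.
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    # "reduce": combine the partial answers with one more call.
    return llm("Combine these partial answers:\n" + "\n".join(partials))


calls: list[str] = []


def counting_llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat model; records each prompt it receives.
    calls.append(prompt)
    return f"answer#{len(calls)}"


chunks = ["Check-in starts at 15:00.", "Early check-in is subject to availability."]

stuff(chunks, "When is check-in?", counting_llm)
stuff_calls = len(calls)

calls.clear()
map_reduce(chunks, "When is check-in?", counting_llm)
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)
```

With two chunks, "stuff" costs one model call while this map-reduce variant costs three, which is the trade-off to weigh: "stuff" is cheaper but limited by the context window, and map-reduce scales to more chunks at a higher call count.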

3. How the Retrieval Chain Works (Pipeline Anatomy)

from langchain_core.runnables import RunnableLambda

pipeline = (
    {"retrieved": retriever, "question": RunnablePassthrough()}               # 1) retrieve
    | {"context": RunnableLambda(lambda x: format_docs(x["retrieved"])),      # 2) format
       "question": lambda x: x["question"]}
    | RAG_PROMPT                                                               # 3) build the prompt
    | llm                                                                      # 4) generate the answer
    | StrOutputParser()                                                        # 5) parse to a string
)
print(pipeline.invoke("What are the criteria for issuing a cash receipt?"))

Each stage is a Runnable, so invoke, stream, and batch all work the same way on the whole pipeline.
In production, attach monitoring tags and metadata with with_config(tags=[...], metadata={...}).

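4. Prompt Engineering Strategies (Checklist)

A. Pin the role, tone, and prohibitions in the system message

SYSTEM_RULES = (
    "You are an accurate and concise Korean-language technical support bot.\n"
    "- If the answer is not in the context, say you do not know.\n"
    "- Show the source alongside any number or policy.\n"
    "- Prefer step-by-step lists or tables."
)

PROMPT_RULED = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_RULES),
    ("human", "Context:\n{context}\n\nQuestion: {question}\n\nCite sources in the form [source:...]."),
])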

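B. Guard against banned terms and sensitive topics: no guessing beyond internal documents, state the version and point in time, and so on.
C. Fix the output format: table, list, or JSON. (If you need structured output, use a Pydantic model with with_structured_output.)
D. Handle failure explicitly: when the context is insufficient, clearly tell the user that additional information is needed.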
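Prompt rules alone do not guarantee that the model follows them, so items C and D are often backed by a small check after generation. The following is a minimal sketch under assumptions: enforce_answer_rules, SOURCE_TAG, and FALLBACK are hypothetical names, and the only rule checked is that a non-empty context was retrieved and the answer carries at least one [source: ...] tag.

```python
import re

# Matches tags such as "[source: policy.md]"; an empty tag like "[source: ]" does not match.
SOURCE_TAG = re.compile(r"\[source:\s*[^\]\s][^\]]*\]")
FALLBACK = "Additional information is needed to answer this question."


def enforce_answer_rules(answer: str, context: str) -> str:
    """Return the answer only if it looks grounded; otherwise return the fallback."""
    if not context.strip():
        # Nothing was retrieved, so any answer would be a guess.
        return FALLBACK
    if not SOURCE_TAG.search(answer):
        # The model ignored the citation rule; do not show an uncited answer.
        return FALLBACK
    return answer.strip()


print(enforce_answer_rules("Standard check-in is 15:00. [source: policy.md]", "policy text"))
print(enforce_answer_rules("Probably around 3 pm.", "policy text"))
```

A check like this can run as the last step of the chain, so the failure message stays identical no matter which prompt variant produced the answer.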

5. Context-aware Prompting (Building the Prompt Dynamically)

Change the tone, policy, and language according to user, organization, or document metadata.

def build_contextual_prompt(user_role: str = "guest", locale: str = "ko"):
    style = "polite and concise" if user_role == "vip" else "friendly and concise"
    # Named system_msg rather than sys so the standard-library module name is not shadowed.
    system_msg = f"You are a {style} assistant. Provide answers in the language for locale '{locale}'."
    return ChatPromptTemplate.from_messages([
        ("system", system_msg),
        ("human", "Context:\n{context}\n\nQuestion: {question}")
    ])

prompt_ctx = build_contextual_prompt(user_role="vip", locale="ko")
rag_ctx = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_ctx | llm | StrOutputParser()
)
print(rag_ctx.invoke("Summarize the cancellation rules for preferred members"))

6. Few-shot & Zero-shot

6.1 Zero-shot (default)

Most of the examples above are zero-shot.

6.2 Few-shot: injecting example dialogue

EXAMPLES = [
    {"q": "What time is check-in?", "a": "Standard check-in is 15:00. [source: policy.md]"},
    {"q": "No-show fee?", "a": "100% of the room rate is charged. [source: fees.md]"},
]

# Each example becomes a human/ai message pair, so the model sees an actual example dialogue.
example_messages = [
    msg for ex in EXAMPLES for msg in (("human", ex["q"]), ("ai", ex["a"]))
]

fewshot = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_RULES + "\nAnswer in the same tone and format as the example turns."),
    *example_messages,
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

rag_fewshot = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | fewshot | llm | StrOutputParser()
)
print(rag_fewshot.invoke("What are the criteria for refunding points?"))

Note: the longer the few-shot examples, the higher the token cost on every call. Keep only the examples that carry the core pattern.
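One way to keep that cost bounded is to cap the examples by an estimated token budget before building the prompt. This is a rough sketch: select_examples and estimate_tokens are hypothetical helpers, and the estimate assumes about four characters per token, which is only a crude approximation. A real tokenizer would be more accurate.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def select_examples(examples: list[dict], budget: int) -> list[dict]:
    """Keep examples in their given order until the estimated budget is used up."""
    picked: list[dict] = []
    used = 0
    for ex in examples:
        cost = estimate_tokens(ex["q"]) + estimate_tokens(ex["a"])
        if used + cost > budget:
            break  # the next example would exceed the budget
        picked.append(ex)
        used += cost
    return picked


EXAMPLES = [
    {"q": "What time is check-in?", "a": "Standard check-in is 15:00. [source: policy.md]"},
    {"q": "No-show fee?", "a": "100% of the room rate is charged. [source: fees.md]"},
]

for budget in (0, 20, 1000):
    print(budget, len(select_examples(EXAMPLES, budget)))
```

Because selection stops at the first example that does not fit, the most important pattern should be listed first.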

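7. Designing the Multi-turn Flow (History + Re-retrieval)

In a multi-turn conversation, the question has to be adjusted in light of earlier turns, and the conversation history has to be passed to the model along with it.

7.1 RunnableWithMessageHistory (recommended)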

from operator import itemgetter

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Per-session history store (demo: an in-memory dict)
_session_store = {}

def get_history(session_id: str):
    if session_id not in _session_store:
        _session_store[session_id] = ChatMessageHistory()
    return _session_store[session_id]

# Placeholder slot for the history messages
CHAT_PROMPT = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_RULES),
    MessagesPlaceholder("chat_history"),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# The wrapped chain receives a dict ({"question": ..., "chat_history": [...]}),
# so send only "question" to the retriever and pass the other keys on to the prompt.
rag_chat = (
    RunnablePassthrough.assign(context=itemgetter("question") | retriever | format_docs)
    | CHAT_PROMPT | llm | StrOutputParser()
)

rag_with_history = RunnableWithMessageHistory(
    rag_chat,
    get_history,
    input_messages_key="question",      # key holding the user input
    history_messages_key="chat_history"  # key holding the history
)

cfg = {"configurable": {"session_id": "user-42"}}
print(rag_with_history.invoke({"question": "What time is check-in?"}, config=cfg))
print(rag_with_history.invoke({"question": "And what if I arrive late?"}, config=cfg))

7.2 Improving Retrieval Quality with Query Rewriting

REWRITE_PROMPT = ChatPromptTemplate.from_template(
    "Using the conversation so far and the current question, rewrite the question as a single-sentence query suited to search.\nConversation: {history}\nQuestion: {question}\nRewritten query:"
)

def join_history(msgs):
    return "\n".join(f"{m.type}: {m.content}" for m in msgs)

# Input is {"question": ..., "session_id": ...}, so the session is not hard-coded.
rewrite_chain = (
    {"history": lambda x: join_history(get_history(x["session_id"]).messages),
     "question": lambda x: x["question"]}
    | REWRITE_PROMPT | llm | StrOutputParser()
)

q = rewrite_chain.invoke({"question": "Does that include a no-show policy too?", "session_id": "user-42"})
retrieved = retriever.invoke(q)  # invoke() replaces the deprecated get_relevant_documents()
for d in retrieved[:2]:
    print(d.metadata.get("source"))

8. Using ConversationalRetrievalChain (CRC, the Quick Integration Route)

from langchain.chains import ConversationalRetrievalChain

crc = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

chat_history = []  # list of (human_message, ai_message) tuples, one per turn
q1 = "What time is check-out?"
res1 = crc.invoke({"question": q1, "chat_history": chat_history})
chat_history.append((q1, res1["answer"]))

res2 = crc.invoke({"question": "Then how much does an extension cost?", "chat_history": chat_history})
print(res2["answer"])             # answer
# print(res2["source_documents"])  # sources

CRC is convenient for getting started quickly, but LCEL with RunnableWithMessageHistory gives you finer control and more options.

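9. Memory Management Strategies

Conversation Buffer (windowed): keeps the most recent N turns (simple, low cost)
Token Buffer: manages length by token count (suited to long conversations)
Summary Memory: summarizes earlier turns to control length (long texts, long-running sessions)
Enterprise operations: store chat history in session storage (Redis/DB), together with an expiry policy, PII masking, and audit logging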
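The first three strategies differ only in how they shorten the history before it reaches the prompt. The plain-Python sketch below illustrates each idea. It is not the LangChain memory API: Message, keep_last_turns, keep_within_tokens, and summarize_older are hypothetical names, the token estimate assumes about four characters per token, and the summarizer is passed in as a function so that a real LLM call can be substituted.

```python
from typing import Callable

Message = tuple[str, str]  # (role, content); stand-in for chat message objects


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def keep_last_turns(history: list[Message], n_turns: int) -> list[Message]:
    """Windowed buffer: keep the last n_turns (human, ai) pairs."""
    cut = max(len(history) - 2 * n_turns, 0)
    return history[cut:]


def keep_within_tokens(history: list[Message], max_tokens: int) -> list[Message]:
    """Token buffer: walk back from the newest message until the budget is spent."""
    kept: list[Message] = []
    used = 0
    for role, content in reversed(history):
        cost = estimate_tokens(content)
        if used + cost > max_tokens:
            break
        kept.append((role, content))
        used += cost
    return list(reversed(kept))


def summarize_older(
    history: list[Message],
    n_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Summary memory: replace all but the last n_recent messages with one summary."""
    cut = max(len(history) - n_recent, 0)
    older, recent = history[:cut], history[cut:]
    if not older:
        return recent
    return [("system", summarize(older))] + recent


history = [
    ("human", "a" * 40), ("ai", "b" * 40),
    ("human", "c" * 40), ("ai", "d" * 40),
]
print(len(keep_last_turns(history, 1)))
print(len(keep_within_tokens(history, 25)))
print(summarize_older(history, 2, lambda msgs: f"{len(msgs)} earlier messages")[0])
```

The windowed and token buffers silently drop old turns, while the summary variant trades one extra model call for keeping a compressed trace of them.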

9.1 ConversationBufferMemory (legacy style)

from langchain.memory import ConversationBufferMemory
from langchain.chains import LLMChain

mem = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
prompt_legacy = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_RULES),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
legacy_chain = LLMChain(llm=llm, prompt=prompt_legacy, memory=mem)
print(legacy_chain.invoke({"input": "Hello?"})["text"])  # invoke() replaces the deprecated run()

For new projects, RunnableWithMessageHistory is recommended.

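9.2 Saving and Restoring Sessions (e.g., Redis)

Backing the chat message history with Redis and storing messages per session_id makes horizontal scaling straightforward, because any app instance can load any session.
Design the expiry period (for example, 30 days), personal-data masking, and audit logging together with it.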
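The expiry and masking parts of that design can be prototyped in memory before moving to Redis. The sketch below is an assumption-laden illustration: SessionStore and mask_pii are hypothetical names, the regular expressions cover only simple email addresses and hyphenated phone numbers, and with Redis the expiry would normally be handled by the key's own TTL rather than by this class.

```python
import re
import time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")


def mask_pii(text: str) -> str:
    # Replace simple email and phone patterns before the text is stored.
    text = EMAIL.sub("[email]", text)
    return PHONE.sub("[phone]", text)


class SessionStore:
    """In-memory stand-in for a Redis-backed history store with per-session expiry."""

    def __init__(self, ttl_seconds: float, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data: dict[str, tuple[float, list[tuple[str, str]]]] = {}

    def append(self, session_id: str, role: str, content: str) -> None:
        now = self.clock()
        expires_at, messages = self._data.get(session_id, (0.0, []))
        if expires_at <= now:
            messages = []  # new or expired session starts empty
        messages.append((role, mask_pii(content)))
        self._data[session_id] = (now + self.ttl, messages)  # each write refreshes the TTL

    def messages(self, session_id: str) -> list[tuple[str, str]]:
        expires_at, messages = self._data.get(session_id, (0.0, []))
        if expires_at <= self.clock():
            self._data.pop(session_id, None)  # drop expired data on read
            return []
        return list(messages)


store = SessionStore(ttl_seconds=30 * 24 * 3600)
store.append("user-42", "human", "Call 010-1234-5678 or mail kim@example.com")
print(store.messages("user-42"))
```

Masking at write time means raw personal data never lands in storage, which also keeps audit logs built from the stored messages clean.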

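10. Operations Checklist

Avoid context overload: too many chunks actually lowers answer quality. Tune with k, MMR, and reranking
Consistent source citation: pin the [source: ...] rule in the prompt and standardize the metadata (source, version, locale)
Observability and evaluation: periodically check recall, accuracy, and response quality with LangSmith dataset evaluations (run_on_dataset in older releases, evaluate in the current SDK)
Reproducibility: version your prompts, and tag embedding and index schema versions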
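For the first checklist item, it helps to see what MMR (maximal marginal relevance) actually does: it picks chunks one at a time, rewarding similarity to the query and penalizing similarity to chunks already chosen. The following is a minimal pure-Python sketch with toy 2-D vectors. The function names and the vectors are illustrative assumptions, and production vector stores ship their own implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float], doc_vecs: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k documents, balancing relevance and diversity.

    lambda_mult=1.0 ranks by relevance only; lower values favor diversity.
    """
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing is picked yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

print(mmr(query, docs, k=2, lambda_mult=1.0))
print(mmr(query, docs, k=2, lambda_mult=0.3))
```

With relevance only, the two near-duplicate chunks are both selected. With a lower lambda_mult the second slot goes to the more distinct chunk instead. In LangChain, many vector store retrievers expose this through as_retriever(search_type="mmr", ...), so you typically tune the parameters rather than write the loop yourself.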

11. A Multi-turn RAG Endpoint with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Reuses the rag_with_history chain built earlier
class ChatIn(BaseModel):
    session_id: str
    question: str

@app.post("/rag/chat")
def rag_chat_endpoint(req: ChatIn):
    config = {"configurable": {"session_id": req.session_id}}
    answer = rag_with_history.invoke({"question": req.question}, config=config)
    return {"answer": answer}

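Wrap-up

That completes the Generation part: combining retrieval with an LLM, RetrievalQA and CRC, prompt strategies, context-aware and few-shot prompting, and multi-turn memory. The next post covers quality improvements (reranking, query rewriting, and a knowledge update pipeline), guardrails, and automating A/B experiments and LangSmith evaluations.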
