The process described above works well when users can only ask a single question. But this application allows follow-up questions as well, and this introduces a few additional complications. For example, there is a need to store all previous questions and answers, so that they can be included as additional context when sending the new question to the LLM.
The chat history in this application is managed through the ElasticsearchChatMessageHistory class, another class that is part of the Elasticsearch integration with Langchain. Each group of related questions and answers is written to an Elasticsearch index with a reference to the session ID that was used.
def get_elasticsearch_chat_message_history(index, session_id):
    return ElasticsearchChatMessageHistory(
        es_connection=elasticsearch_client, index=index, session_id=session_id
    )

INDEX_CHAT_HISTORY = os.getenv(
    "ES_INDEX_CHAT_HISTORY", "workplace-app-docs-chat-history"
)

chat_history = get_elasticsearch_chat_message_history(
    INDEX_CHAT_HISTORY, session_id
)
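Once created, this object exposes the conversation stored for the session through its messages attribute. Each stored message is a Langchain message object whose type is "human" or "ai" and whose text is in content, which is what the prompt templates shown below rely on. As a quick illustration:

# List the conversation stored for this session in Elasticsearch.
# Each entry has a type ("human" or "ai") and the message text in .content.
for message in chat_history.messages:
    print(message.type, message.content)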
You may have noticed in the previous section that even though the response from the LLM is streamed out to the client in chunks, the full response is also accumulated in an answer variable. This is done so that the response, along with its question, can be added to the history after each interaction:
chat_history.add_user_message(question)
chat_history.add_ai_message(answer)
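In case it helps to visualize this, here is a rough sketch of that streaming loop, not the application's exact code. It assumes that get_llm() returns a Langchain chat model whose stream() method yields chunks carrying their text in content, and that qa_prompt holds the rendered prompt for the current question:

# Rough sketch only, not the application's exact code.
answer = ""
for chunk in get_llm().stream(qa_prompt):  # qa_prompt: the rendered prompt (assumed)
    answer += chunk.content   # accumulate the complete response
    yield chunk.content       # while still streaming each chunk to the client
# once the stream ends, question and answer are saved to the history
# with the two add_*_message() calls shown above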
If the client sends a session_id argument in the query string of the request URL, then the question is assumed to be made in the context of any previous questions under that same session.
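Assuming the endpoint is a Flask route (the render_template() function used below is Flask's), the session identifier could be obtained from the query string roughly as follows. This is only an illustration; in particular, the fallback to a freshly generated ID is an assumption and not necessarily what the application does:

from uuid import uuid4
from flask import request

# Illustration only: use the client-provided session_id if present,
# otherwise start a new session under a generated identifier.
session_id = request.args.get("session_id") or str(uuid4())
chat_history = get_elasticsearch_chat_message_history(
    INDEX_CHAT_HISTORY, session_id
)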
The approach taken by this application for follow-up questions is to use the LLM to create a condensed question that summarizes the entire conversation, to be used for the retrieval phase. The purpose of this is to avoid running a vector search on a potentially large history of questions and answers. Here is the logic that performs this task:
if len(chat_history.messages) > 0:
    # create a condensed question
    condense_question_prompt = render_template(
        'condense_question_prompt.txt', question=question,
        chat_history=chat_history.messages)
    condensed_question = get_llm().invoke(condense_question_prompt).content
else:
    condensed_question = question

docs = store.as_retriever().invoke(condensed_question)
This has a lot of similarities with how the main questions are handled, but in this case there is no need to use the streaming interface of the LLM, so the invoke() method is used instead.
To condense the question, a different prompt is used, stored in the file api/templates/condense_question_prompt.txt:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat history:
{% for dialogue_turn in chat_history -%}
{% if dialogue_turn.type == 'human' %}Question: {{ dialogue_turn.content }}{% elif dialogue_turn.type == 'ai' %}Response: {{ dialogue_turn.content }}{% endif %}
{% endfor -%}
Follow Up Question: {{ question }}
Standalone question:
This prompt renders all the questions and responses from the session, plus the new follow-up question at the end. The LLM is instructed to provide a simplified question that summarizes all the information.
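For illustration only, with a hypothetical two-turn session, the rendered prompt sent to the LLM would look something like this:

Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat history:
Question: How many vacation days do full-time employees get?
Response: Full-time employees receive 20 days of paid vacation per year.
Follow Up Question: Does that also apply to part-time employees?
Standalone question:

The LLM would then return a standalone question along the lines of "Do part-time employees also receive paid vacation days?", and that is what gets used for retrieval.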
To enable the LLM to have as much context as possible in the generation phase, the complete history of the conversation is added to the main prompt, along with the retrieved documents and the follow-up question. Here is the final version of the prompt as used in the example application:
Use the following passages and chat history to answer the user's question.
Each passage has a NAME which is the title of the document. After your answer, leave a blank line and then give the source name of the passages you answered from. Put them in a comma separated list, prefixed with SOURCES:.
Example:
Question: What is the meaning of life?
Response:
The meaning of life is 42.
SOURCES: Hitchhiker's Guide to the Galaxy
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----
{% for doc in docs -%}
---
NAME: {{ doc.metadata.name }}
PASSAGE:
{{ doc.page_content }}
---
{% endfor -%}
----
Chat history:
{% for dialogue_turn in chat_history -%}
{% if dialogue_turn.type == 'human' %}Question: {{ dialogue_turn.content }}{% elif dialogue_turn.type == 'ai' %}Response: {{ dialogue_turn.content }}{% endif %}
{% endfor -%}
Question: {{ question }}
Response:
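To make the data flow concrete, here is a rough sketch of how this prompt could be rendered before the generation call. The template filename is an assumption for illustration, while docs, question and chat_history come from the code shown earlier:

# Rough sketch only; the template filename "rag_prompt.txt" is hypothetical.
qa_prompt = render_template(
    'rag_prompt.txt',
    question=question,                   # the original follow-up question
    docs=docs,                           # passages retrieved with the condensed question
    chat_history=chat_history.messages,  # the complete conversation so far
)
# qa_prompt is then sent to the LLM through the streaming interface, as
# shown in the previous section, and the answer is streamed to the client.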
Note that the way the condensed question is used can be adapted to your needs. For some applications, sending the condensed question in the generation phase as well may work better, with the added benefit of reducing the token count. Or perhaps not using a condensed question at all and always sending the entire chat history gives you better results. Hopefully you now have a good understanding of how this application works and can experiment with different prompts to find what works best for your use case.