Exploring limitations of basic Text-To-Query RAG workflows
Combining Memgraph and ChatGPT with LangChain to query graph data with natural language
It has been over a year since people started discussing retrieval-augmented generation (RAG) techniques to combine LLMs with existing datasets. Many of them require or recommend using a vector database or performing vector-based search for efficient semantic retrieval.
However, when digging inside LangChain's library, I noticed that it is quite easy to set up a Text-To-Query workflow where the large language model converts natural language to a query, such as SQL. These workflows do not require any vector database.
This lets users ask for what they want without learning the query languages and the data models underneath. However, given how simple it is to set up, I wondered how production-ready it would be and what limitations it would have.
Choosing a test dataset
Of course, we need a queryable dataset to experiment with Text-To-Query. Let's use a sample dataset of Reddit submissions, comments, and users stored in a Memgraph graph database.
Why use Graph Databases For Text-to-Query?
Using a graph database was mainly out of personal curiosity. But let’s assume that I want to ask this question:
How many comment replies are associated with submission id 'pb7zaw'?
This question requires finding the comments associated with this submission and all the sub-comments related to those top-level comments. This traversal cannot be done within a single lookup.
A graph query language such as Cypher, used in Neo4j and other graph technologies, lets you express this recursive search quite easily:
// the * after :REPLY_TO makes the pattern
// match all downstream nodes recursively
MATCH (s:SUBMISSION {id:'pb7zaw'})<-[:REPLY_TO*]-(c:COMMENT)
RETURN c
However, had we opted for a Text-To-SQL RAG instead, the resulting query would need to be:
WITH RECURSIVE comment_tree AS (
    -- anchor: top-level comments on the submission
    SELECT c.*
    FROM comments c
    WHERE c.submission_id = 'pb7zaw'
    UNION
    -- recursive step: replies to comments already in the tree
    SELECT c.*
    FROM comments c
    JOIN replies r ON c.id = r.comment_id
    JOIN comment_tree ct ON ct.id = r.reply_to_id
)
SELECT *
FROM comment_tree;
SQL is considerably more verbose for expressing deeply relational queries. Consequently, I decided to opt for the database with the most straightforward queries for these cases, on the assumption that the easier the text-to-query translation is, the better the model will handle complex natural language questions. But this is only an assumption; confirming it would require more thorough testing and comparison.
Basic Graph Database RAG setup
LangChain already provides the GraphCypherQAChain as a plug-and-play solution for graph databases that support the Bolt protocol, such as Neo4j and Memgraph:
# Import paths may vary with your LangChain version
import os

from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import MemgraphGraph
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

cypher_prompt = '...'
qa_prompt = '...'
question = 'INSERT_QUESTION_HERE'

graph = MemgraphGraph(
    url=os.environ['NEO4J_URL'],
    username=os.environ['NEO4J_USERNAME'],
    password=os.environ['NEO4J_PASSWORD'],
    database=os.environ['NEO4J_DATABASE']
)

chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        openai_api_key=os.environ["OPENAI_API_KEY"]
    ),
    graph=graph,
    # Uncomment for custom prompts:
    # cypher_prompt=PromptTemplate(
    #     input_variables=["schema", "question"],
    #     template=cypher_prompt
    # ),
    # qa_prompt=PromptTemplate(
    #     input_variables=["context", "question"],
    #     template=qa_prompt
    # ),
    verbose=True,
)

response = chain.invoke(question)
print(f"Response: {response['result']}")
The steps would, therefore, consist of the following:
Set up a Memgraph or Neo4j database instance, in the cloud or locally
Set up an OpenAI account and create an API key
Execute the LangChain Python code mentioned above
How does GraphCypherQAChain work?
We can look at the source code of the chain to understand how it works.
The chain works as follows:
Executes a query to get the schema of the graph database
Builds a prompt asking the LLM to generate a Cypher query (AQL in the Arango variant), and sends this prompt to the LLM
Executes the query returned by the LLM on the graph database
(Optional) Prompts the LLM again to create a human-like answer based on the graph database result and the initial question
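To make those steps concrete, here is a minimal sketch of the flow. This is not the actual LangChain source; llm, graph, and the two prompt templates are placeholders standing in for the chain's internals:
# Minimal sketch of the GraphCypherQAChain flow (assumed placeholders,
# not the real implementation)
def text_to_query_qa(question: str) -> str:
    # 1. Fetch the graph schema (node labels, properties, relationships)
    schema = graph.get_schema
    # 2. Build the Cypher generation prompt and ask the LLM for a query
    cypher_query = llm.invoke(
        CYPHER_PROMPT.format(schema=schema, question=question)
    ).content
    # 3. Execute the generated query against the graph database
    context = graph.query(cypher_query)
    # 4. (Optional) ask the LLM to phrase a human-like answer
    return llm.invoke(
        QA_PROMPT.format(context=context, question=question)
    ).content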
Results
Let's ask the example question from above with GPT-4o:
Prompt: How many comment replies are associated with submission id 'pb7zaw'
Generated Cypher:
MATCH (:SUBMISSION {id: 'pb7zaw'})<-[:REPLY_TO*]-(c:COMMENT)
RETURN COUNT(c)
Full Context:
[{'COUNT(c)': 41}]
> Finished chain.
Response: There are 41 comment replies associated with submission id 'pb7zaw'.
The example above worked perfectly. However, this was mostly luck. Most queries I tried failed at first and needed some tweaks to work as expected, making it clear that this simple approach is insufficient to deliver real value to users.
Limitations of basic database RAG techniques
Fails to understand precisely how the Cypher syntax works
Prompt: Who is the redditor that commented the most?
// Snippet of the schema injected in the Cypher Query generation prompt
The relationships are the following:
(:SUBMISSION)-[:CREATED_BY]->(:REDDITOR),
(:COMMENT)-[:CREATED_BY]->(:REDDITOR),
(:COMMENT)-[:REPLY_TO]->(:SUBMISSION),
(:COMMENT)-[:REPLY_TO]->(:COMMENT)
//(...)
Generated Cypher:
MATCH (r:REDDITOR)-[:CREATED_BY]->(:COMMENT)
WITH r, COUNT(*) AS commentCount
RETURN r.name
ORDER BY commentCount DESC
LIMIT 1;
Full Context:
[]
> Finished chain.
Response: I don't know the answer.
The initial match between the two nodes is inverted. It should have been:
Generated Cypher:
MATCH (r:REDDITOR)<-[:CREATED_BY]-(:COMMENT)
(...)
It seems the LLM does not fully understand how the Cypher query language works despite being fed schema information that looks relatively straightforward. Maybe it does not grasp that the arrow direction must flip when the nodes swap sides. Who knows?
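For reference, these two Cypher patterns are equivalent; when the nodes swap sides, the relationship arrow must flip with them:
// A COMMENT created by a REDDITOR, written both ways
MATCH (c:COMMENT)-[:CREATED_BY]->(r:REDDITOR) RETURN r.name
MATCH (r:REDDITOR)<-[:CREATED_BY]-(c:COMMENT) RETURN r.name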
One solution would be not to use such a generic LLM: a fine-tuned model, or one specialized in translating natural language to the Cypher query language, would presumably not make these simple mistakes.
Cypher query generation frequently fails to produce deeply relational queries
Prompt: What is the most commented submission?
Generated Cypher:
MATCH (s:SUBMISSION)<-[:REPLY_TO]-(c:COMMENT)
WITH s, COUNT(c) AS numReplies
RETURN s
ORDER BY numReplies DESC
LIMIT 1;
However, this answer might be misleading, since it only counts top-level comments, not the replies nested under them. A better query might be:
MATCH (s:SUBMISSION)<-[:REPLY_TO*]-(c:COMMENT)
WITH s, COUNT(c) AS numReplies
RETURN s
ORDER BY numReplies DESC
LIMIT 1;
But the question is inherently ambiguous: should the most-commented submission be the one with the most sub-comments? It depends on what the user wants, and a basic Text-To-Query system like this one is not designed to clarify ambiguity.
If we are 100% sure that we want deeply relational queries, fine-tuning might help, but extra instructions inside the prompt are a cheaper way to get there. By adding an extra note to the Cypher generation prompt, I was able to get more consistent results:
Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note:
- Do not include any explanations or apologies in your responses.
- Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
- Do not include any text except the generated Cypher statement.
- Always try to search recursively for child nodes when possible.
The question is:
{question}
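Plugging this template into the chain is then just a matter of passing it as the cypher_prompt argument, reusing the graph object from the earlier setup (here the template text above is assumed to be stored in CYPHER_GENERATION_TEMPLATE):
from langchain_core.prompts import PromptTemplate

# CYPHER_GENERATION_TEMPLATE holds the prompt text shown above
chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(model="gpt-4o", temperature=0),
    graph=graph,
    cypher_prompt=PromptTemplate(
        input_variables=["schema", "question"],
        template=CYPHER_GENERATION_TEMPLATE,
    ),
    verbose=True,
)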
Fails to understand possible values of a property
A common mistake is failing to understand the possible values of a specific property.
Prompt: Return 10 very negative comments
Generated Cypher:
MATCH (c:COMMENT)
WHERE c.sentiment = -2
RETURN c
LIMIT 10;
However, the sentiment property values in this collection are -1, 0, or 1, representing negative, neutral, and positive sentiment. By adding the word 'very', the model suddenly assumes there are more possible values. This makes sense: if the LLM does not have proper context about the data, it can request something that does not exist.
This is why providing a richer schema of the data in the initial prompt could help. The basic implementation of the Memgraph chain is quite limited: it only provides the node properties, the relationship properties, and how nodes are related:
// Snippet of the schema injected in the Cypher Query generation prompt
Node properties are the following:
COMMENT {id: str, created_at: int, body: str, sentiment: int},
REDDITOR {name: str, id: str},
SUBMISSION {id: str, created_at: int, title: str, body: str, url: str, sentiment: int}
This whole set of problems can be fixed with better data documentation practices:
Use proper data types when necessary (for example, an explicit enumeration rather than a bare integer for sentiment)
Add database constraints, such as uniqueness and presence constraints (see the sketch after this list)
Add documentation and metadata on each node and relationship property
Add documentation and metadata on the whole collection
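As a sketch of the constraints point, here is what a uniqueness constraint looks like in Memgraph's Cypher dialect, using the submission id from the schema above:
// Enforce unique submission ids; constraints like this also
// document the data model for anyone (or any LLM) reading it
CREATE CONSTRAINT ON (s:SUBMISSION) ASSERT s.id IS UNIQUE;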
This new data needs to be injected into the prompt for context. Unfortunately, Memgraph and the chain used here do not transmit any schema or metadata beyond what is shown above, so we would have to adjust the chain's internal code for the best result. Let's cheat a little and put the knowledge directly inside the question:
Prompt: Return 10 very negative comments. Sentiment values are -1, 0 and 1
Generated Cypher:
MATCH (c:COMMENT)
WHERE c.sentiment = -1
RETURN c
LIMIT 10;
Of course, this is not good for the user experience; we cannot assume that the user has this knowledge about the dataset, and forcing them to be this explicit adds unnecessary friction.
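A less hacky option, sketched below, is to append the extra documentation to the schema string before the chain builds its prompt. This relies on the assumption that the graph object's schema attribute is a plain, writable string, which is an implementation detail rather than a documented extension point:
# Hypothetical workaround: enrich the schema string that the chain
# injects into its Cypher generation prompt
EXTRA_DOCS = """
Property value documentation:
- COMMENT.sentiment and SUBMISSION.sentiment are one of
  -1 (negative), 0 (neutral), or 1 (positive). No other values exist.
"""

graph.schema = graph.get_schema + EXTRA_DOCS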
What about other Text-To-Query chains?
The ArangoGraphQAChain uses a similar workflow to the one above and, unsurprisingly, exhibits the same problems. One slight difference is that instead of injecting a schema, it programmatically fetches a sample of one record from each node and edge collection. Although this gives a better outlook on the collections' content, it is unreliable in most cases, because a small subset of records cannot reliably portray the content of an entire collection, especially in a NoSQL database.
Here's the ArangoGraphQAChain's Python script:
# Import paths may vary with your LangChain version
import os

from arango import ArangoClient
from langchain.chains import ArangoGraphQAChain
from langchain_community.graphs import ArangoGraph
from langchain_openai import ChatOpenAI

question = 'INSERT_QUESTION_HERE'

db = ArangoClient(hosts=os.environ['ARANGO_ENDPOINT']).db(
    os.environ["ARANGO_DATABASE"],
    os.environ["ARANGO_USERNAME"],
    os.environ["ARANGO_PASSWORD"],
    verify=True
)

graph = ArangoGraph(db)
graph.set_schema()

chain = ArangoGraphQAChain.from_llm(
    ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        openai_api_key=os.environ["OPENAI_API_KEY"]
    ),
    graph=graph,
    verbose=True
)

print(chain.run(question))
Insights
With LangChain and the community-built chains, combining different kinds of databases with an LLM is relatively easy. But a basic workflow is far from production-ready; the real challenge comes afterward, when you try to get reliable answers and a high success rate.
AI models cannot magically know a dataset's content. A well-scoped schema, rich data documentation, and metadata are necessary to allow your dataset to be understood and interpreted.
AI models must understand the syntax and semantics of the target query language well enough. In these tests with Cypher and Arango's AQL query language, the large language models didn't seem to demonstrate that.
Additional Resources
Neo4j GraphRAG and NeoConverse - a more mature text-to-query solution example provided by Neo4j.