Authors: Simon Schindler, Ariadna Villanueva
RAGs
Large language models (LLMs) achieve strong performance across many tasks by encoding knowledge in their parameters, but they often fail when required information lies outside their training data or is outdated. Retrieval-augmented generation (RAG) was developed to address this limitation by combining LLMs with external knowledge sources. In a RAG framework, relevant documents are retrieved from a database and provided to the model at inference time, grounding the generated output in explicit evidence. This approach improves task performance, enables access to up-to-date information, and reduces hallucinations.
Limitations: RAG systems may still misinterpret retrieved content, for example by extracting statements with missing context. When sources provide conflicting or temporally inconsistent information, the model may struggle to assess reliability, potentially producing responses that blend outdated and current facts in a misleading way.
Technically, RAG systems consist of two main components: a retriever and a generator. Documents are first preprocessed by splitting them into chunks and converting each chunk into a vector embedding using an embedding model. These embeddings are stored in a vector database. At inference time, the user query is embedded in the same vector space and used to retrieve the most relevant document chunks via similarity search. The retrieved content is then appended to the prompt and passed to the large language model, which generates a response conditioned on both the query and the retrieved context.
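The embed-retrieve-generate pipeline described above can be sketched in a few lines of stdlib-only Python. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the plain list stands in for a vector database; an actual RAG system would call an embedding model (e.g. Gemini or OpenAI embeddings) and end with an LLM call, which is omitted here:

```python
import math
import re
from collections import Counter

# Toy stand-in for an embedding model: a bag-of-words count vector.
# A real system would call an embedding API instead.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in "vector database": document chunks with their embeddings.
chunks = [
    "Neo4j is a graph database that stores nodes and relationships.",
    "Retrieval-augmented generation grounds LLM answers in documents.",
    "The mitochondria is the powerhouse of the cell.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list:
    # Rank chunks by similarity to the embedded query.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The retrieved chunk is appended to the prompt for the LLM.
question = "What is a graph database?"
context = retrieve(question)[0]
prompt = f"Context: {context}\n\nQuestion: {question}"
```

Production systems replace the bag-of-words vectors with dense embeddings, but the retrieve-by-similarity-then-prompt structure stays the same.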
Image source: Wikimedia Commons, CC BY-SA 4.0
Related sources:
- https://research.ibm.com/blog/retrieval-augmented-generation-RAG
- Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates Inc.
We will first go through the tutorial, which we have divided into two notebooks:
- Knowledge graphs: 01_knowledge_graph_construction.ipynb
- RAGs: 02_graph_retrieval_augmented_generation.ipynb
Challenges:
You can select any of the following challenges, which are ordered by increasing difficulty:
- Create a knowledge graph for another disease of your choice
- Create your own knowledge database for papers relevant to your project
- Navigate the Human Reference Atlas Knowledge Graph: https://docs.humanatlas.io/dev/kg#accessing-the-hra-kg
- Create a network on genes based on common pathways
- Make up your own challenge
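As a starting point for the gene-network challenge, the sketch below links genes that share at least one pathway. The gene-to-pathway annotations are illustrative placeholders; in practice they could be pulled from a resource such as KEGG or Reactome:

```python
from itertools import combinations

# Illustrative gene-to-pathway annotations (not real curated data).
gene_pathways = {
    "TP53":  {"apoptosis", "cell_cycle"},
    "MDM2":  {"apoptosis", "cell_cycle"},
    "BRCA1": {"dna_repair", "cell_cycle"},
    "INS":   {"glucose_metabolism"},
}

# Connect two genes with a weighted edge when they share pathways;
# the weight is the number of shared pathways.
edges = {}
for g1, g2 in combinations(sorted(gene_pathways), 2):
    shared = gene_pathways[g1] & gene_pathways[g2]
    if shared:
        edges[(g1, g2)] = len(shared)
```

The resulting edge dictionary can be loaded into a graph library (or into Neo4j, as in the notebooks) for visualization and analysis.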
Tasks:
- Fork this repo if you want to show your solution (optional)
- Set up the environment
- Set up the Gemini API keys. If you have an OpenAI API key, you can use that instead.
- Go through the notebooks
- Do one of the challenges
- Share your solution (optional)
Outcomes:
- Learn about GraphRAG (graph-based retrieval-augmented generation)
Ensure you are in the root directory of the repository (gathering_graphrag) before running the following commands.
This project uses uv for dependency management, ensuring reproducible environments via uv.lock.
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Sync dependencies and create the virtual environment
uv sync
# Activate the environment
source .venv/bin/activate
If you prefer Conda, create an environment and install dependencies using requirements.txt.
# Create and activate a new environment
conda create -n gathering_graphrag python=3.10
conda activate gathering_graphrag
# Install dependencies
pip install -r requirements.txt
Standard Python virtual environment setup using requirements.txt.
# Create the virtual environment
python -m venv .venv
# Activate the environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
# .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
1. Create Account
- Navigate to the Neo4j Aura Console.
- Click Start for Free.
- Select Sign in with Google and authenticate with your existing Google credentials.
2. Create Database Instance
- Once logged in, click New Instance.
- Select the Free tier.
- Choose a region close to your location.
- Click Create Instance.
3. Retrieve Credentials
- A modal will appear displaying your generated password. Copy and save this password immediately, as it is shown only once.
- Wait for the instance status to change to Running.
- Copy the Connection URI displayed on the dashboard (e.g., neo4j+s://<db_id>.databases.neo4j.io).
4. Configure Environment
- Open the .env file in the root directory.
- Append the following variables, replacing the placeholders with your specific instance details:
NEO4J_URI=<YOUR_CONNECTION_URI>
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=<YOUR_GENERATED_PASSWORD>
You can explore the graph you created by using the AuraDB console.
Once you have created the graph, go to the Explore page in Tools (left panel). Here you can connect to the free instance you created.
There are two types of graphs created:
- The semantic graph where the relationships between entities are derived from the input documents
- A chunk–embedding graph used for RAG and similarity search
You may initially only see the chunk-embedding graph. You can hide those nodes by removing their __KGBuilder__ label with Cypher. Go to Query in the left panel and run the following:
MATCH (n:__KGBuilder__)
REMOVE n:__KGBuilder__;
1. Obtain API Key
- Navigate to Google AI Studio.
- Log in with a Google account.
- Select Get API key from the sidebar menu.
- Click Create API key.
- Select Create API key in new project (or select an existing project if preferred).
- Copy the generated key string.
2. Configure Environment
- Create a file named .env in the root of the project directory.
- Add the following line, replacing <YOUR_API_KEY> with the key copied in the previous step:
GOOGLE_API_KEY=<YOUR_API_KEY>
Note: Ensure .env is listed in your .gitignore file to prevent committing credentials to version control.
3. Test the setup
- Test your setup by running:
python scripts/test_api_keys.py
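The repository's actual test script may work differently, but a check like this can be sketched with the stdlib alone: parse the .env file and report which of the expected variables (names taken from the setup steps above) are missing:

```python
from pathlib import Path

# Variables the setup steps above place in .env.
REQUIRED = ["GOOGLE_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict) -> list:
    """Return the required variables that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]
```

Note that this only confirms the variables are set; verifying that the keys actually work requires a round-trip call to the Gemini API and the Neo4j instance.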
- Tutorial from Neo4j: https://neo4j.com/blog/news/graphrag-python-package/
- Hugging Face tutorial: https://huggingface.co/learn/cookbook/rag_with_knowledge_graphs_neo4j
- If you want to run Neo4j locally with Docker instead of AuraDB: https://blog.greenflux.us/building-a-knowledge-graph-locally-with-neo4j-and-ollama/