Skip to content

A comprehensive data migration project demonstrating the transformation of a relational movie database from MySQL to MongoDB's document-oriented structure, including data modeling, ETL processes, and performance optimization.

Notifications You must be signed in to change notification settings

Karthih2/data-migration

Repository files navigation

Relational to NoSQL Migration: ETL, Denormalization, and Query-Driven Design

Python MySQL MongoDB ETL Data Modeling Status

A comprehensive data migration project demonstrating the transformation of a relational movie database from MySQL to MongoDB's document-oriented structure, including data modeling, ETL processes, and performance optimization.

πŸ“‹ Project Overview

This project showcases a complete migration workflow from a normalized relational database (MySQL) to a denormalized NoSQL database (MongoDB). The migration transforms a MovieLens-style dataset while demonstrating key concepts in both database paradigms.

Key Features

  • Complete ETL Pipeline: Extract data from MySQL, transform to document model, load into MongoDB
  • Data Denormalization: Embedded documents and arrays for optimized query performance
  • Schema Design: Thoughtful document structure balancing read/write patterns
  • Index Optimization: Strategic indexing for common query patterns
  • Data Validation: Migration verification with count comparisons and data integrity checks

πŸ—„οΈ Database Schemas

MySQL Schema (Source)

The relational database consists of four normalized tables:

movies

  • movieId (INT, PK)
  • title (VARCHAR)
  • genres (VARCHAR)

links

  • movieId (INT, FK)
  • imdbId (VARCHAR)
  • tmdbId (INT)

ratings

  • userId (INT)
  • movieId (INT, FK)
  • rating (DOUBLE)
  • rating_time (DATETIME)

movie_tags

  • userId (INT)
  • movieId (INT, FK)
  • tag (VARCHAR)
  • tag_time (DATETIME)

MongoDB Schema (Target)

The document-oriented database uses three collections with embedded documents:

movies

{
  "_id": ObjectId,
  "movieId": 1,
  "title": "Toy Story",
  "genres": ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
  "release_year": "1995",
  "links": {
    "imdbId": "0114709",
    "tmdbId": 862
  },
  "averageRating": 3.92,
  "ratingCount": 215,
  "tags": [
    {"tag": "fun", "tagCount": 1},
    {"tag": "pixar", "tagCount": 3}
  ]
}

ratings

{
  "_id": ObjectId,
  "userId": 1,
  "movieId": 1,
  "rating": 4.0,
  "rating_time": ISODate("2000-07-31T00:15:03Z")
}

users

{
  "_id": ObjectId,
  "userId": 1,
  "stats": {
    "totalRatings": 232,
    "averageRating": 4.23,
    "totalTags": 13
  },
  "recentRatings": [
    {"movieId": 150, "rating": 5.0, "rating_time": ISODate}
  ],
  "recentTags": [
    {"movieId": 260, "tag": "classic", "timestamp": ISODate}
  ]
}

πŸ”„ Migration Process

1. Data Extraction

Connect to MySQL and extract data from all four tables:

conn = pymysql.connect(host=host, user=user, password=password, database=database)
movies_df = pd.read_sql("SELECT m.*, l.imdbId, l.tmdbId FROM movies m LEFT JOIN links l ON m.movieId = l.movieId", conn)

2. Data Transformation

Movie Documents

  • Extract release year from title using regex
  • Split pipe-delimited genres into arrays
  • Embed link information as nested objects
  • Calculate and embed rating statistics
  • Aggregate and embed tag information

User Documents

  • Aggregate all ratings and tags per user
  • Calculate user statistics (total ratings, average rating, total tags)
  • Maintain recent activity (top 10 ratings and tags by timestamp)

3. Data Loading

Bulk insert transformed documents into MongoDB:

db.movies.insert_many(movies_df.to_dict('records'))
db.ratings.insert_many(ratings_df.to_dict('records'))
db.users.insert_many(users_list)

4. Index Creation

Strategic indexes for query optimization:

# Movies collection
db.movies.create_index('movieId', unique=True)
db.movies.create_index('genres')
db.movies.create_index([('title', 'text')])

# Ratings collection
db.ratings.create_index([('userId', 1), ('rating_time', -1)])
db.ratings.create_index('movieId')

# Users collection
db.users.create_index('userId', unique=True)

πŸ“Š Migration Results

Entity MySQL Count MongoDB Count Status
Movies 9,742 9,742 βœ… Complete
Ratings 100,836 100,836 βœ… Complete
Users 610 610 βœ… Complete

πŸ› οΈ Technologies Used

  • Python 3.x: Core programming language
  • PyMySQL: MySQL database connector
  • PyMongo: MongoDB driver for Python
  • Pandas: Data manipulation and transformation
  • Matplotlib: Data visualization
  • python-dotenv: Environment variable management

πŸ“ Project Structure

β”œβ”€β”€ migration_script.ipynb              # Main migration script
β”œβ”€β”€ .env                              # Environment variables (not tracked)
β”œβ”€β”€ requirements.txt                  # Python dependencies
β”‚
β”œβ”€β”€ data_models/                      # Database schemas & datasets
β”‚   β”œβ”€β”€ MongoDB_MoviesDB_DataModel   # MongoDB data model
|   β”œβ”€β”€ MySql_MovieDB_DataModel      # MySQL data model
β”‚   β”‚
β”‚   β”œβ”€β”€ datasets/                     # Dataset references
β”‚   β”‚   β”œβ”€β”€ movies.csv                # Movies dataset
β”‚   β”‚   β”œβ”€β”€ ratings.csv               # Ratings dataset
β”‚   β”‚   β”œβ”€β”€ tags.csv                  # Tags dataset
β”‚   β”‚   └── links.csv                 # Links dataset
β”‚
β”œβ”€β”€ visualizations/                   # Migration result charts
β”‚   β”œβ”€β”€ movies_migration.png
β”‚   β”œβ”€β”€ ratings_migration.png
β”‚   └── users_migration.png
β”‚
└── README.md                          # Project documentation

πŸš€ Getting Started

Prerequisites

  • Python 3.7+
  • MySQL Server
  • MongoDB Server
  • Access credentials for both databases

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/mysql-mongodb-migration.git
cd mysql-mongodb-migration
  1. Install dependencies:
pip install -r requirements.txt
  1. Configure environment variables in .env:
# MySQL Configuration
host=localhost
user=your_mysql_user
password=your_mysql_password
database=moviesdb

# MongoDB Configuration
MONGO_URI=mongodb://localhost:27017/
  1. Run the migration:
python migration_script.py

πŸ“ˆ Key Learnings

Database Design Decisions

  1. Embedding vs. Referencing: Embedded link information and tags in movies collection for read optimization, while keeping ratings separate due to high volume and update frequency.

  2. Denormalization Trade-offs: Pre-calculated aggregates (averageRating, ratingCount) reduce query complexity at the cost of update complexity.

  3. Index Strategy: Compound indexes on (userId, rating_time) support user activity queries, while text index on title enables search functionality.

  4. Array Fields: Genres stored as arrays enable multi-value queries using MongoDB's array operators.

Performance Considerations

  • Read Optimization: Embedded documents eliminate JOIN operations for common queries
  • Write Trade-offs: Updates to ratings require updating movie aggregates
  • Index Overhead: Strategic indexing balances query performance with storage cost
  • Document Size: Maintained under MongoDB's 16MB limit through recent activity limits

πŸ” Sample Queries

MongoDB Queries

Find all Action movies with high ratings:

db.movies.find({
  genres: "Action",
  averageRating: { $gte: 4.0 }
})

Get user's recent activity:

db.users.findOne(
  { userId: 1 },
  { recentRatings: 1, recentTags: 1 }
)

Text search for movies:

db.movies.find({
  $text: { $search: "toy story" }
})

πŸ“ Future Enhancements

  • Implement incremental migration for ongoing data sync
  • Add data validation and error handling
  • Create MongoDB aggregation pipeline examples
  • Implement rollback mechanism
  • Add unit tests for transformation logic
  • Performance benchmarking between MySQL and MongoDB queries

🀝 Contributing

Contributions, issues, and feature requests are welcome!.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘€ Author

Karthick S

πŸ™ Acknowledgments

  • MovieLens dataset for sample data structure
  • MongoDB documentation for best practices

This project demonstrates practical experience with database migration, ETL processes, and understanding of both relational and document-oriented database paradigms.

About

A comprehensive data migration project demonstrating the transformation of a relational movie database from MySQL to MongoDB's document-oriented structure, including data modeling, ETL processes, and performance optimization.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published