This project consists of two main components:

- An AI-powered script (`discover_pattern.py`) to automatically identify the URL structure of product pages on a given e-commerce website.
- An asynchronous web crawler (`crawler.py`, run via `main.py`) that uses these identified patterns to find and save product URLs.
The system operates in two distinct phases:

- **Pattern Discovery:**
  - You run `discover_pattern.py`, providing the starting URL of the target e-commerce site (e.g., `https://www.example-store.com/`).
  - This script utilizes the `browser-use` library, which controls a headless browser (via Playwright) to navigate the site.
  - It interacts with a configured Large Language Model (LLM, currently set to use Gemini via `langchain-google-genai`) to analyze the site structure and determine the common URL path segment that indicates a product detail page (e.g., `/products/`, `/p/`, `/item/`).
  - The script parses the LLM's output (extracted from logs) to get the pattern string.
  - The discovered pattern is saved (or updated) in the `patterns.json` file, mapped to the website's domain name (e.g., `"www.example-store.com": "/products/"`).
- **Crawling:**
  - You run `main.py`, providing the starting URL of a site whose pattern has already been discovered and saved in `patterns.json`.
  - The `ProductCrawler` class in `crawler.py` is initialized. It reads `patterns.json` and loads the specific pattern associated with the target domain.
  - The crawler launches multiple headless browser instances (using Playwright) up to the specified concurrency limit.
  - Worker tasks asynchronously navigate the website, starting from the initial URL.
  - For each page visited, the crawler extracts all links (`<a>` tags).
  - It checks whether each link belongs to the same domain and has not been visited before.
  - If a link matches the loaded product URL pattern for that domain (using a simple substring check), its URL is appended to the `product_urls.txt` file in the format `domain,url` (e.g., `www.example-store.com,https://www.example-store.com/products/cool-item-123`).
  - Non-product links on the same domain are added to a queue for further crawling, respecting the `max_pages` limit.
  - The process continues until the queue is empty or the `max_pages` limit is reached.
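The per-link decision described above (same-domain check, visited check, substring pattern match) can be sketched in a few lines. The `classify_link` helper below is illustrative only; the actual `ProductCrawler` logic may differ:

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url: str, href: str, pattern: str,
                  visited: set[str]) -> str:
    """Decide what to do with one extracted link (illustrative sketch)."""
    url = urljoin(base_url, href)  # resolve relative links against the page URL
    if urlparse(url).netloc != urlparse(base_url).netloc:
        return "skip"      # different domain
    if url in visited:
        return "skip"      # already seen
    if pattern in url:     # the crawler's simple substring check
        return "product"   # would be appended to product_urls.txt
    return "crawl"         # same-domain non-product link: enqueue it
```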
- **Prerequisites:**
  - Python 3.10+ recommended.
  - `pip` (Python package installer).
  - Access to a Google Gemini API key.
- **Clone Repository:** (if applicable)

  ```bash
  git clone <repository_url>
  cd <repository_directory>
  ```

  (The remaining steps assume you are in the `ecom-crawler` directory.)
- **Create Virtual Environment:** (recommended)

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- **Install Dependencies:**
  - Ensure `requirements.txt` is up to date or install the packages manually. Key packages are `browser-use`, `playwright`, `langchain-google-genai`, `python-dotenv`, and `aiofiles`.

  ```bash
  pip install -r requirements.txt
  ```
- **Install Playwright Browsers:**
  - This downloads the necessary headless browser engine.

  ```bash
  playwright install chromium
  ```
- **Configure API Key:**
  - Create a file named `.env` in the project root directory (`ecom-crawler`).
  - Add your Gemini API key to the file like this:

    ```
    GEMINI_API_KEY='YOUR_ACTUAL_GEMINI_API_KEY'
    ```
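The project loads this file via `python-dotenv`. For illustration, here is a minimal stdlib-only stand-in for what `load_dotenv()` does (reading `KEY='value'` lines into the environment); the `load_env` helper is hypothetical, not part of the project:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal stand-in for python-dotenv's load_dotenv(): read
    KEY=value lines from a .env file into os.environ (sketch only)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip("'\"")
```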
- **Discover Pattern for a New Site:**
  - Run the discovery script from your terminal, providing the site's starting URL:

    ```bash
    python discover_pattern.py https://www.new-store.com/
    ```

  - This will analyze the site (it may take a few minutes) and update `patterns.json`. Check the script's output and `patterns.json` to confirm success.
- **Crawl a Supported Site:**
  - Ensure the pattern for the site exists in `patterns.json`.
  - Run the main crawler script, providing the site's starting URL:

    ```bash
    python main.py https://www.new-store.com/
    ```

  - Optional arguments:
    - `--max-pages N`: limit the crawl to N pages (default: 100).
    - `--max-concurrent N`: set the number of concurrent browser pages (default: 5).

    ```bash
    python main.py https://www.new-store.com/ --max-pages 500 --max-concurrent 3
    ```

  - Found product URLs will be appended to `product_urls.txt` in the format `domain,url`.
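The `domain,url` output format is easy to consume downstream. A small reader sketch (the `read_product_urls` helper is hypothetical, not part of the project):

```python
def read_product_urls(path: str = "product_urls.txt") -> dict[str, list[str]]:
    """Group saved product URLs by domain; each line is 'domain,url'."""
    by_domain: dict[str, list[str]] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first comma only, since URLs may contain commas.
            domain, _, url = line.partition(",")
            by_domain.setdefault(domain, []).append(url)
    return by_domain
```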
- **Pattern Accuracy:** The LLM-based pattern discovery is generally effective but may occasionally misidentify patterns or fail on complex sites; manual verification of `patterns.json` is sometimes needed. The crawler's current pattern matching (`pattern in url`) is a basic substring check and may need refinement for certain URL structures.
- **Resource Usage:** The Playwright crawler uses full browser instances and can be memory- and CPU-intensive, especially with higher concurrency. Adjust `--max-concurrent` based on your system resources.
- **Error Handling:** Basic error handling (timeouts, network errors) is included, but complex site structures or unexpected errors may still halt the crawl for specific URLs. Check the log output for details.
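One possible refinement of the substring check, sketched here under the assumption that matching should be restricted to the URL's path component so that a pattern appearing in a query string does not produce a false positive (the `matches_pattern` name is hypothetical):

```python
from urllib.parse import urlparse

def matches_pattern(url: str, pattern: str) -> bool:
    """Stricter alternative to `pattern in url`: only match the path,
    ignoring query strings and fragments (illustrative sketch)."""
    return pattern in urlparse(url).path
```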