This repo has three parts:
- Scraper: A bunch of scrapers which fetch concert listings from various venues in the Bay Area (Ruby / Selenium). Essentially a one-off process which you can schedule to run daily via Cron or Systemd or whatever. It writes results to a GCS bucket.
- Frontend: A website to view the listings (React). Compiled to a static site via `npm run build` and deployed to GitHub Pages via a GitHub Action. It reads the files off GCS and can send API requests to the LLM Server.
- LLM Server: An LLM backend to research additional concert details (Python / LangChain). It's a web server you need to keep running on some machine.
- To run the frontend locally: `cd frontend/react-app`, `npm i`, `npm run dev`
- To build & deploy, run `ShowScraper/bin/deploy`.
- Install a modern Ruby and `bundle install`.
- Install a driver:
  - Currently the app is set up to use Firefox or Chromedriver. I was previously using Chromedriver but switched to Firefox. Edit `Scraper#init_driver` to switch implementations (see the sketch after this list).
  - For Chromedriver:
    - On Linux you can `apt-get install chromium-chromedriver`. Our application should hopefully pick up the executable path automatically.
    - On Windows / OSX, download it from https://chromedriver.chromium.org/ and add the folder containing the executable to your `PATH` manually.
  - For Firefox:
    - Install Geckodriver and set `GECKODRIVER_PATH` in env to point directly at the executable. I found it at `/opt/homebrew/bin/geckodriver` on OSX and `/usr/local/bin/geckodriver` on Linux.
- Copy `.env.example` to `.env` and configure it.
- Make a new project on GCP and create a GCS bucket within it. Set `STORAGE_PROJECT` in `.env` to your project id. Download the `keyfile.json` and set `STORAGE_CREDENTIALS` to this file's path.
- Change the GCS bucket permissions so all files are publicly available by default.
- Configure `gsutil` to use your new project, then upload the CORS file which I've included in the repo: `gsutil cors set cors-json-file.json gs://<BUCKET_NAME>`
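The driver step above refers to `Scraper#init_driver`. As a rough sketch only (the structure and option handling here are assumptions, not the actual method in the repo), switching between the two Selenium backends looks something like this:

```ruby
require 'selenium-webdriver'

class Scraper
  # Sketch only: the real Scraper#init_driver may be organized differently.
  def init_driver(headless: true)
    browser = :firefox # change to :chrome to switch implementations

    case browser
    when :firefox
      options = Selenium::WebDriver::Firefox::Options.new
      options.add_argument('-headless') if headless
      # GECKODRIVER_PATH must point directly at the geckodriver executable
      service = Selenium::WebDriver::Service.firefox(path: ENV['GECKODRIVER_PATH'])
      @driver = Selenium::WebDriver.for(:firefox, options: options, service: service)
    when :chrome
      options = Selenium::WebDriver::Chrome::Options.new
      options.add_argument('--headless=new') if headless
      # chromedriver is expected to be discoverable on PATH
      @driver = Selenium::WebDriver.for(:chrome, options: options)
    end
  end
end
```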
There is a command line tool at `bin/run_scraper`.
By default it will run all scrapers (each will fetch a maximum of 200 events)
and then upload the results to GCS.
Options (note that most of these can also be set from `.env`):

- `--limit=10`: limit each scraper to N results
- `--skip-persist`: just print the results, don't upload them to GCS
- `--rescue=false`: don't rescue scraping errors, stop the script immediately
- `--rescue=true`: skip broken scrapers (default behavior)
- `--debugger`: trigger a `binding.pry` breakpoint upon error
- `--no-scrape`: just update the list of venues, don't actually scrape any events
- `--sources=GreyArea,Cornerstone`: limit the scrape to a comma-separated list of venues
- `--headless=true`: run headlessly
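For example, to scrape just two venues, cap the results, print them instead of uploading, and watch the browser while it runs:

```sh
bin/run_scraper --sources=GreyArea,Cornerstone --limit=10 --skip-persist --headless=false
```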
Note that every time you run a scraper, it will completely overwrite the list of events for that venue.
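If you want the daily schedule mentioned in the overview, a crontab entry along these lines should work (the checkout path is a placeholder):

```
# run all scrapers every morning at 6am
0 6 * * * cd /path/to/ShowScraper && bin/run_scraper
```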
The LLM server provides AI-powered concert research via streaming SSE endpoints.
It's currently set up to use OpenAI, but you can probably switch it to use another provider easily.
- Add API keys to `llm-server/.env`: `OPENAI_API_KEY=your_key_here` and `SERPAPI_API_KEY=your_key_here`
- `cd llm-server`
- `uv venv venv` (create a virtual environment)
- `source venv/bin/activate`
- `uv pip install -r requirements.txt`
- `python main.py` (runs on localhost:8000)
The server is used by the frontend's AI Research feature to provide two-phase streaming concert information.
- Add a new entry to `sources.json`. You can get the lat/lng coordinates from Google Maps (right-click the marker on the map and the coords will pop up). For `desc` you can just copy the blurb from Google Maps as well. (There's a rough example entry after this list.)
- Create a new file `scraper/lib/sources/venue_name.rb` (replacing `venue_name`, obviously). You can copy one of the existing scraper classes as a starting point. Note that there are a few different types of websites (calendar view, infinite scroll, all-on-one-page), so it's best to find another scraper that is similar in that regard.
- Make sure the class name is exactly the same as the `name` value in `sources.json`.
- Fill out the contents of the scraper, using `--debugger` and `--headless=false` as needed for debugging.
- Add a test case to `scraper_spec.rb` (you can just use `generic_run_test` like the other scrapers).
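For reference, a new `sources.json` entry might look roughly like the following. Match the shape of the existing entries in the file; the coordinate key names and all of the values here are placeholder assumptions, while `name` and `desc` are the fields described above.

```json
{
  "name": "VenueName",
  "lat": 37.77,
  "lng": -122.42,
  "desc": "Short blurb copied from Google Maps."
}
```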
Note: there is no need to explicitly require the scraper class anywhere in the codebase.
Autoloading is already set up based on `sources.json`.
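To make the class-name rule concrete, here is a hypothetical skeleton for a new scraper. The parent class, the method the runner calls, and the selectors are all assumptions for illustration; mirror one of the real files in `scraper/lib/sources/` rather than this sketch.

```ruby
# scraper/lib/sources/venue_name.rb (hypothetical skeleton)
class VenueName < Scraper # class name must exactly match "name" in sources.json
  def run
    # assumes the base class exposes the Selenium driver it set up in #init_driver
    driver.get('https://example.com/calendar')
    driver.find_elements(css: '.event').map do |el| # selectors are illustrative
      {
        title: el.find_element(css: '.title').text,
        date: el.find_element(css: '.date').text
      }
    end
  end
end
```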
These are both unused. I kept them here in case I want to have a dedicated backend at some point.
For now it suffices to go backend-less and just host the results on GCS.
- Map View
- Add more meta-scrapers (e.g. scrape other scrapers/aggregators), especially for electronic shows which aren't really captured by the current venue list or "The List"
- Add more venues (I've specifically received requests for South Bay, but there are probably new SF / East Bay venues as well).
- Add Venue Events List view (accessible from Venue List View)
- Find a way to handle events that don't have an explicit year in their date
- Add Submit Event / About pages