DataPipeline

a Real-Time Data Pipeline made with Apache Airflow, Kafka, Spark, and Cassandra.

based on the pipeline detailed here: https://medium.com/@jushijun/building-a-real-time-data-pipeline-with-apache-airflow-kafka-spark-and-cassandra-be4ee5be8843

To start docker and build all containers:

docker compose up -d --build

To check on a specific containers last logs:

docker logs --tail 100

To bring down all the containers(and excess containers that might not be used anymore or specified in docker-compose.yml)

docker compose down --remove-orphans

General Steps:

build all the containers, check that all have started and not having any errors
check the docker logs on datapipeline-webserver-1, datapipeline-control-center-1 to make sure container is booted and web services are live and ready to be connected to.
navigate to [VM external ip]:port# for various UIs, 8080 for Airflow, 9021 for Confluent, 4040 for Spark Stream
manually toggle the user_automation in Airflow, trigger the DAG to begin ingestion
navigate to confluent's UI to the cluster, Topics, users_created, and messages to view the incoming JSON stream
navigate to Spark Stream's UI to view the job
Run the Cassandra command to view the newly inputted data in the table

Command to create admin user within airflow ui

docker exec -it datapipeline-webserver-1 airflow users create
--username admin
--firstname Admin
--lastname Admin
--role Admin
--email admin@example.com
--password admin

Other things that had to be fixed:

Had to add firewall rules in Google Cloud to port 8080 and port 9021 and port 4040 for airflow ui and confluent and spark stream

had to remove some health checks on containers/make the start time delay 100+ seconds due to slow cold boot up

send a test topic from kafka(Project not configured for this anymore)

//produce from inside the broker container (best – guarantees we hit “broker:29092”) docker exec -it datapipeline-broker-1
kafka-console-producer --bootstrap-server broker:29092 --topic test-topic

{"id": 42, "name": "spark_live", "value": 999} //hit Enter again, then Ctrl‑C to close the producer

Get Table from Cassandra

docker exec -it datapipeline-cassandra-1
cqlsh -e "SELECT * FROM spark_demo.users_created;"

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
dags		dags
jars		jars
script		script
Dockerfile		Dockerfile
README.md		README.md
cqlsh		cqlsh
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt
spark_stream.py		spark_stream.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataPipeline

To start docker and build all containers:

To check on a specific containers last logs:

To bring down all the containers(and excess containers that might not be used anymore or specified in docker-compose.yml)

General Steps:

Command to create admin user within airflow ui

Other things that had to be fixed:

send a test topic from kafka(Project not configured for this anymore)

Get Table from Cassandra

About

Uh oh!

Releases

Packages

Uh oh!

Languages

mikerabs/DataPipeline

Folders and files

Latest commit

History

Repository files navigation

DataPipeline

To start docker and build all containers:

To check on a specific containers last logs:

To bring down all the containers(and excess containers that might not be used anymore or specified in docker-compose.yml)

General Steps:

Command to create admin user within airflow ui

Other things that had to be fixed:

send a test topic from kafka(Project not configured for this anymore)

Get Table from Cassandra

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages