Real-Time and Batch Processing of Splitwise Data with Big Data Technologies

Real-Time and Batch Processing of Splitwise Data with Big Data Technologies — featured screenshot

Technologies Used

  1. Apache Cassandra
  2. Apache Hadoop
  3. Apache Kafka
  4. Apache Spark
  5. Bash
  6. Docker
  7. Git
  8. Kotlin
  9. Spring Boot

Links


Introduction

Splitwise is a free tool friends and roommates use to track shared bills and expenses. This project pulls that financial data and processes it with a Big Data stack in two modes: a real-time streaming pipeline and a batch analytics job. It is built with Kafka, Spark, Spring Boot, Cassandra, and Docker Compose, and is meant as an end-to-end demonstration of wiring these tools together into a working pipeline.

Architecture

Architecture

Technologies Used

  • Kafka
  • Spark
  • Spring Boot
  • Cassandra
  • Docker
  • Docker Compose
  • Gradle
  • SBT
  • Scala
  • Kotlin
  • CQL

How to run the project

Prerequisites

  • Docker

Steps

Clone the project, change into its directory, and build the images:

docker build -t jobscheduler ./scheduler
docker build -t spark-analysis ./sparkanalysis
docker build -t kafka-streaming-app ./kafka-streaming-app

Start everything with Docker Compose:

docker-compose up -d

Docker Desktop should look like this:

Docker Desktop

Give Cassandra and Kafka about 60 seconds to start. The cassandra-init service is expected to stop after ~65 seconds (depending on machine speed) once it has initialized the database — if it stops before the database is ready, run it again and wait another 60 seconds. This is only needed the first time.

Once Cassandra is up and initialized, start the jobscheduler service, then register a Splitwise user:

curl --location --request POST 'localhost:8080/add_user_key?key=<splitwise-api-key>' \
--data ''

A Splitwise API key can be generated here.

Start the kafka-streaming-app service and trigger the scheduler:

curl --location --request GET 'localhost:8080/job/splitwise'

Trigger Scheduler manually

This triggers the scheduler to fetch data from Splitwise and push it to Kafka. The kafka-streaming-app service consumes from Kafka, processes the data, and writes it to Cassandra.

Add User Key

Finally, start the sparkanalysis service to generate reports into the output folder. It stops automatically once the reports are written. The generated CSVs can be opened directly in Excel or fed into a visualization tool such as Tableau or Power BI.