Capstone Report for Udacity’s Data Scientist Nanodegree


Project overview

This project uses Apache Spark to perform data science tasks on the ‘Sparkify’ dataset. The Sparkify dataset is a hypothetical user log from a music-listening service such as Spotify or Apple Music. The high-level goal of the project is to build a prediction model capable of identifying users who are considering dropping their subscription to the service. The service could then use the model’s output to send targeted offers in hopes of retaining those subscribers.

Problem Statement

The core technical problem of this project is to find behavioral patterns in the available data, particularly for users who are considering dropping the service. To find these patterns, data science techniques such as data cleaning and data normalization need to be applied, and a set of suitable features (numerical representations of user behavior) must be identified or computed from the available data. Machine learning algorithms such as LogisticRegression and RandomForestClassifier are then trained on these features. Finally, the trained models are used to make predictions from the stream of user activity logs and determine whether a user may be considering dropping their subscription.


The following metrics are used to measure the performance of the predictive model implemented for the capstone project.


Data Exploration

In this section, characteristics, anomalies, features, and statistics of the input data are discussed. For this project, the dataset provided by the Udacity staff in the ‘Sparkify Project Workspace’ (mini_sparkify_event_data.json) is used. The distinct values of the ‘page’ column, which records the type of each user event, are:

['Cancel', 'Submit Downgrade', 'Thumbs Down', 'Home', 'Downgrade', 'Roll Advert', 'Logout', 'Save Settings', 'Cancellation Confirmation', 'About', 'Submit Registration', 'Settings', 'Login', 'Register', 'Add to Playlist', 'Add Friend', 'NextSong', 'Thumbs Up', 'Help', 'Upgrade', 'Error', 'Submit Upgrade']
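As an illustration, the list of distinct event types above can be obtained with a query like df.select('page').distinct() in PySpark. The plain-Python sketch below shows the same logic on a small, made-up sample of log records (the sample values are illustrative, not taken from the dataset):

```python
# Illustrative plain-Python version of the "distinct pages" query that in
# the notebook is performed with PySpark (df.select('page').distinct()).
sample_events = [
    {"page": "NextSong"}, {"page": "Home"},
    {"page": "NextSong"}, {"page": "Thumbs Up"},
]

# Collect the unique 'page' values and sort them for readability.
distinct_pages = sorted({e["page"] for e in sample_events})
print(distinct_pages)  # ['Home', 'NextSong', 'Thumbs Up']
```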

Data Visualization

The following figure shows the distribution of event types.

Figure 1 Visualization of Events Before Downgrade

Data Preprocessing

The following block diagram (Figure 2) illustrates the ETL steps followed as part of the data pre-processing, along with processing steps followed to reach the point of training and testing the predictive model.
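To make the ETL steps concrete, here is a minimal plain-Python stand-in for the cleaning and labeling stages of the diagram, assuming the mini_sparkify_event_data.json field names (userId, page, ts); in the actual project these steps run as PySpark transformations, and the sample records below are made up:

```python
import json

# Minimal stand-in for the ETL steps in Figure 2: parse JSON-lines
# records, drop rows with a missing userId (cleaning), and flag churners
# via the 'Cancellation Confirmation' event (labeling).
raw = '\n'.join([
    '{"userId": "10", "page": "NextSong", "ts": 1538352117000}',
    '{"userId": "", "page": "Home", "ts": 1538352180000}',
    '{"userId": "10", "page": "Cancellation Confirmation", "ts": 1538353000000}',
])

events = [json.loads(line) for line in raw.splitlines()]
events = [e for e in events if e["userId"]]               # cleaning step
churned = {e["userId"] for e in events
           if e["page"] == "Cancellation Confirmation"}   # labeling step
print(len(events), churned)  # 2 {'10'}
```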

Figure 2 Sparkify Flow Diagram


The following table describes some of the most important PySpark functionality applied to meet the goals of the project. The table pairs API calls with key steps in the block diagram of Figure 2.
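The feature-building step can be sketched as follows. This is the plain-Python analogue of a PySpark groupBy('userId').pivot('page').count() aggregation; the particular pages chosen as features here are illustrative, not the project's final feature set:

```python
from collections import Counter

# Sketch of the feature-building step: count selected page events per
# user and arrange the counts into a fixed-order feature vector.
FEATURE_PAGES = ["NextSong", "Thumbs Up", "Thumbs Down", "Roll Advert"]

def build_features(events):
    """Map each userId to a vector of per-page event counts."""
    per_user = {}
    for e in events:
        per_user.setdefault(e["userId"], Counter())[e["page"]] += 1
    return {uid: [counts.get(p, 0) for p in FEATURE_PAGES]
            for uid, counts in per_user.items()}

logs = [{"userId": "1", "page": "NextSong"},
        {"userId": "1", "page": "Thumbs Up"},
        {"userId": "2", "page": "NextSong"}]
print(build_features(logs))  # {'1': [1, 1, 0, 0], '2': [1, 0, 0, 0]}
```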


One of the initial observations made on the input data was that visiting the ‘Downgrade’ page seemed like a strong indicator of a future service downgrade. Initial experiments used the accumulated count of such events over the 7 days preceding the churn event.
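The 7-day window count described above can be sketched like this, assuming epoch-millisecond timestamps in a 'ts' field as in the Sparkify log (the sample records and helper name are illustrative):

```python
# Sketch of the 7-day window feature: count 'Downgrade' page visits in
# the week before a user's churn timestamp (timestamps in epoch ms).
WEEK_MS = 7 * 24 * 60 * 60 * 1000

def downgrade_visits_before(events, user_id, churn_ts):
    """Count a user's 'Downgrade' events in the 7 days before churn_ts."""
    return sum(
        1 for e in events
        if e["userId"] == user_id
        and e["page"] == "Downgrade"
        and churn_ts - WEEK_MS <= e["ts"] < churn_ts
    )

churn_ts = 1_543_000_000_000
logs = [
    {"userId": "1", "page": "Downgrade", "ts": 1_542_500_000_000},  # in window
    {"userId": "1", "page": "Downgrade", "ts": 1_542_900_000_000},  # in window
    {"userId": "1", "page": "Downgrade", "ts": 1_540_000_000_000},  # too early
]
print(downgrade_visits_before(logs, "1", churn_ts))  # 2
```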


Model Evaluation and Validation

A variety of performance metrics were gathered for each model training and testing experiment. The experiments covered different combinations of feature vectors, machine learning algorithms, and feature scaling and normalization. For the individual metrics of these attempts, consult the Jupyter notebook provided with this report.
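For reference, two metrics commonly used for imbalanced churn labels, accuracy and F1 score, can be computed directly from label/prediction pairs; the labels below are made up for the example and are not results from the notebook:

```python
# Illustrative computation of accuracy and F1 score from binary
# predictions (1 = churner, 0 = non-churner).
def accuracy_and_f1(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return acc, f1

acc, f1 = accuracy_and_f1([1, 0, 1, 0, 1], [1, 0, 0, 0, 1])
print(round(acc, 2), round(f1, 2))  # 0.8 0.8
```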

Experiment variables:



Modestly optimistic results were obtained. However, for a model to be considered optimal, classification accuracies of over 90% are needed. Falling short of 90% could be attributed to the small amount of data: despite the large number of records, they represent the activity of fewer than 200 users over roughly 2 months. Moreover, the activity occurred between October and December, i.e., the holiday season. User spending habits vary significantly throughout the year, and during the holidays several external factors (such as competing expenses and competitor offers) could influence the churn rate more than factors extracted from the activity log.


Given the achieved accuracy, it is difficult to conclude that the Sparkify dataset is sufficient to construct predictive models capable of warning platform stakeholders about potential drops in paid subscriptions before they occur. The pre-processing steps applied to the data and the features selected for the analysis allowed the machine learning models to detect engagement patterns and label a user as a ‘churner’ or ‘non-churner’ with only a small margin over 50%, which is not much better than chance.


Implementing a solution to the Sparkify problem highlights key distinctions between working with Big Data and working with traditional datasets. With big data, the implementation of each algorithmic step needs to leverage the distributed processing framework as much as possible. This project highlights the importance of using PySpark’s API to ensure that processing steps scale to large volumes of data.


• Deployment on a compute cluster (AWS or IBM Watson)


