Capstone Report for Udacity’s Data Scientist Nanodegree


Project overview

This project uses Apache Spark to perform data science tasks on the ‘Sparkify’ dataset. The Sparkify dataset is a hypothetical user log from a music-listening service such as Spotify or Apple Music. The high-level goal of the project is to create a prediction model capable of identifying users who are considering dropping their subscription to the music-listening service. The service could use the results of the predictive model to send targeted offers in the hope of preserving the user's subscription.

Problem Statement

The core technical problem of this project is to find behavioral patterns in the available data, particularly for users who are considering dropping the service. To find these patterns, data science techniques such as data cleaning and data normalization need to be applied. A set of suitable features (numerical representations of user behavior) needs to be identified or calculated from the available data. Then, machine learning algorithms such as LogisticRegression and RandomForestClassifier are trained on these features. Finally, the trained models are used to make predictions from the stream of user activity logs and determine whether a user may be considering dropping their subscription.

The general strategy for solving the core problem is to apply sound data science techniques to clean the data, and then to use the processed data to discover which patterns are most indicative of an upcoming ‘churn’, or dropped subscription. The discovered patterns are turned into feature vectors that can be fed to a variety of machine learning algorithms, with the end goal of classifying a user as a potential churner or not.

An additional challenge of this project is the large amount of data that needs to be analyzed. This problem will be tackled by utilizing Apache Spark as the main underlying framework to process the data. Apache Spark is a distributed processing framework specifically designed to handle Big Data.


The following metrics will be utilized to measure performance of the predictive model implemented for the capstone project.

Because the classes may be imbalanced, the F1 score will be used alongside accuracy. Accuracy on its own can be misleading: it can hide poor performance on a class with low representation. For instance, if all users of a given class X are wrongly classified, the accuracy metric may barely change if that class makes up a small percentage of the overall training or validation data.

To compute the F1 score, two additional metrics, namely Precision and Recall, need to be computed. Explanations of what these metrics are and how they are computed are provided below.

· True Positive: The model is given an input of class X and correctly classifies it as class X.

· True Negative: The model is given an input of a class that is NOT X and correctly labels it as NOT X.

· False Negative: The model is given an input of class X and erroneously assigns it a NOT X label.

· False Positive: The model is given an input of class NOT X and incorrectly labels it as class X.
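The four outcomes above combine into Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall). The following is a minimal sketch in plain Python, not the code used in the project; the function names and the example counts are hypothetical and chosen only to show how accuracy can look good while F1 exposes a model that never detects the minority class.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical case: 10 churners among 100 users, and a model that
# labels every user as a non-churner.
tp, tn, fp, fn = 0, 90, 0, 10
print("accuracy:", accuracy(tp, tn, fp, fn))
print("precision, recall, f1:", precision_recall_f1(tp, fp, fn))
```

In this made-up case the accuracy is 90% even though no churner is ever found, while precision, recall and F1 are all zero.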


Data Exploration

In this section, the characteristics, anomalies, features and statistics of the input data are discussed. For this project, the dataset provided by the Udacity staff in the ‘Sparkify Project Workspace’ (mini_sparkify_event_data.json) is utilized.

The input data is a stream of user-triggered events. Through its various attributes, data scientists can track the activities of all users within the platform. Activities or states include the song a user is listening to, whether the user liked or disliked a song, and which page within the platform the user visited. The following table lists the attributes of an event, along with their high-level descriptions.

Of particular interest are the userId, level, page and ts. This set of attributes could be leveraged to create activity patterns for each user as a function of time. The patterns could be further associated (or correlated) with the event of dropping a paid subscription and should support the construction of predictive models.

A total of 22 different events (pages) are present in the input data.

['Cancel', 'Submit Downgrade', 'Thumbs Down', 'Home', 'Downgrade', 'Roll Advert', 'Logout', 'Save Settings', 'Cancellation Confirmation', 'About', 'Submit Registration', 'Settings', 'Login', 'Register', 'Add to Playlist', 'Add Friend', 'NextSong', 'Thumbs Up', 'Help', 'Upgrade', 'Error', 'Submit Upgrade']

The Data Visualization section provides additional insight into the distribution of these events.

The analysis of these events will focus on pages that are the strongest indicators of user engagement. This project is based on the hypothesis that the following activities are the strongest indicators of user engagement: Thumbs Up, Thumbs Down, Downgrade and NextSong.

The Data Visualization section provides a sequence that illustrates the behavioral pattern that could be observed before a user drops the service subscription.

The original dataset contains a total of 286500 records spanning October 1, 2018 through December 3, 2018. It contains the interactions of 226 users. However, initial exploration of the data revealed null values and/or empty usernames. These records were removed from the dataset, and the approach to do so is discussed in the Methodology section.

Data exploration also revealed users who never subscribed to the platform's paid service. Keeping these users risked misleading the machine learning algorithms, as their activities could not be correlated with dropping the paid service. In view of this, records for users who never subscribed to the service were also removed from the data frame.
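The two filtering rules described above can be sketched as follows. This is plain Python over a list of dictionaries, standing in for the DataFrame operations used in the project; the sample records are invented, and the assumption that the level attribute takes the values 'free' and 'paid' is mine rather than something stated in this report.

```python
def clean_events(events):
    """Drop records without a user id, then drop users who never paid."""
    # Rule 1: remove records whose userId is null or an empty string.
    valid = [e for e in events if e.get("userId")]

    # Rule 2: keep only users with at least one event at the paid level
    # (assumes `level` is either "free" or "paid").
    paid_users = {e["userId"] for e in valid if e["level"] == "paid"}
    return [e for e in valid if e["userId"] in paid_users]


# Hypothetical sample of the event log.
events = [
    {"userId": "", "level": "free", "page": "Home"},
    {"userId": None, "level": "paid", "page": "Login"},
    {"userId": "10", "level": "free", "page": "NextSong"},
    {"userId": "20", "level": "free", "page": "NextSong"},
    {"userId": "20", "level": "paid", "page": "NextSong"},
]

cleaned = clean_events(events)
print(len(cleaned), "records kept for users", {e["userId"] for e in cleaned})
```

Note that the free-tier events of user "20" are kept, since that user subscribed at some point.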

One of the key steps in the data exploration phase of this project was to identify which users within the dataset became ‘churners’, that is, users who dropped their paid service. To meet the goals of this project, the machine learning model will be trained to distinguish between the churner and non-churner classes.

The ‘Cancellation Confirmation’ page (or event) was identified as the one associated with dropping the service subscription. By analyzing the ‘Cancellation Confirmation’ events, a total of 76 ‘churners’ and 89 non-churners were found. To achieve class representation balance during model training, additional steps will be required. These steps are discussed in the Methodology section.
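As an illustration of this labeling step, the sketch below marks a user as a churner if any of their events is a ‘Cancellation Confirmation’. It is a plain-Python approximation, not the project's code; the function name and the sample events are hypothetical.

```python
def label_churners(events, churn_page="Cancellation Confirmation"):
    """Map each userId to 1 (churner) or 0 (non-churner)."""
    labels = {}
    for e in events:
        uid = e["userId"]
        is_churn = 1 if e["page"] == churn_page else 0
        # A single churn event is enough to label the user as a churner.
        labels[uid] = max(labels.get(uid, 0), is_churn)
    return labels


# Hypothetical events for two users.
events = [
    {"userId": "20", "page": "NextSong"},
    {"userId": "20", "page": "Cancel"},
    {"userId": "20", "page": "Cancellation Confirmation"},
    {"userId": "30", "page": "Downgrade"},
    {"userId": "30", "page": "NextSong"},
]

labels = label_churners(events)
print(labels)
print("churners:", sum(labels.values()), "non-churners:", len(labels) - sum(labels.values()))
```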

Data Visualization

The following table shows the distribution of event types.

The plot of events in Figure 1 shows the activity of a specific user over several days before downgrading the subscription. In the graphic, the various events are highlighted with different symbols and plotted against time. The Y axis represents the numeric label assigned to the event (or page), which simplifies visualization by spreading events along the Y axis.

Figure 1 Visualization of Events Before Downgrade

Figure 1 shows that this user visited the Downgrade page several times before ending their subscription. This suggests that it may be possible to predict churn by using visits to the Downgrade page as a feature.

Data Preprocessing

The following block diagram (Figure 2) illustrates the ETL steps followed as part of the data pre-processing, along with processing steps followed to reach the point of training and testing the predictive model.

Figure 2 Sparkify Flow Diagram

Step description

A. Load from JSON: loads the Sparkify data from a JSON file.

B. Exploratory steps: gather and print various statistics and visualizations that allow the identification of cleaning and processing steps.

C. Remove records with nulls and other invalid events: data cleaning step.

D. Remove unnecessary attributes (columns): data reduction step.

E. OneHotEncoder ‘churn’ events: creates a new column with binary values. The column holds a one wherever a ‘churn’ event (e.g. Cancellation Confirmation) is found.

F. OneHotEncoder key events: ‘key events’ are those events identified as metrics of user engagement. The ‘key events’ are ‘Thumbs Up’, ‘Thumbs Down’, ‘Downgrade’ and ‘NextSong’.

G. Capture key events in temporal windows (feature computation): this step applies a sliding window to the activity of a specific user over the past 7 and 14 days. The window keeps track of how many of each of the ‘key events’ occurred during those time windows.

H. Group records by ‘churners’ and ‘non-churners’: assigns each user the label of ‘churner’ or ‘non-churner’.

I. Create feature vectors: groups all features into a single row vector, a necessary step to use pyspark's machine learning classes.

J. Apply scaling to feature vectors: applies scaling and normalization to the vectors so that disparities in feature magnitude do not create undesired model bias.

K. Train, test, validation data split: of all valid records, 75% is used for training, 12.5% for testing, and the final 12.5% for validation.

L. Train models: trains the instantiated machine learning models with the training data.

M. Test models: feeds each trained model with test data to compute the accuracy metric.

N. Compute performance metrics: selects the best performing model, feeds it with validation data and computes performance metrics.
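Steps G and I can be sketched as follows. This is a plain-Python illustration for a single user, not the project's implementation; it assumes that ts is a Unix timestamp in milliseconds, and the helper names, reference timestamp and sample events are hypothetical.

```python
from collections import Counter

DAY_MS = 24 * 60 * 60 * 1000  # assumes `ts` is in milliseconds
KEY_EVENTS = ("Thumbs Up", "Thumbs Down", "Downgrade", "NextSong")


def window_counts(user_events, ref_ts, days):
    """Count each key event for one user in the `days` days up to ref_ts."""
    start = ref_ts - days * DAY_MS
    counts = Counter(
        e["page"]
        for e in user_events
        if e["page"] in KEY_EVENTS and start <= e["ts"] <= ref_ts
    )
    return [counts[page] for page in KEY_EVENTS]


def feature_vector(user_events, ref_ts):
    """Concatenate the 7-day and 14-day counts into one row vector."""
    return window_counts(user_events, ref_ts, 7) + window_counts(user_events, ref_ts, 14)


# Hypothetical events for one user; timestamps are written as whole days
# only to keep the example readable.
events = [
    {"page": "Thumbs Up", "ts": 2 * DAY_MS},
    {"page": "NextSong", "ts": 10 * DAY_MS},
    {"page": "Downgrade", "ts": 15 * DAY_MS},
    {"page": "Downgrade", "ts": 19 * DAY_MS},
    {"page": "Home", "ts": 19 * DAY_MS},
]

features = feature_vector(events, ref_ts=20 * DAY_MS)
print(features)  # [0, 0, 2, 0, 0, 0, 2, 1]
```

The first four entries are the 7-day counts and the last four are the 14-day counts, in the order listed in KEY_EVENTS. On a Spark DataFrame, the same idea is commonly expressed with a window partitioned by user and ordered by timestamp; whether the project did it exactly that way is not stated here.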


The following table describes some of the most important pyspark functionality applied to meet the goals of the project. The table pairs API calls with key steps in the block diagram of Figure 2.


One of the initial observations made on the input data was that visiting the ‘Downgrade’ page seemed like a strong indicator of a future service downgrade. Initial experiments used the accumulation of such events over the 7 days preceding the churn event.

The best performing model achieved an accuracy of about 63% with this feature.

In the refinement stage, a variety of machine learning algorithms were used, along with different combinations of feature sets and scalers. Despite this effort, no performance improvements were observed.


Model Evaluation and Validation

A variety of performance metrics were gathered for each model training and testing experiment. The experiments combined different feature vectors, machine learning algorithms, and feature scaling and normalization approaches. For individual metrics on these attempts, consult the Jupyter notebook provided with this report.

Experiment variables:


· Using the accumulated number of visits to the Downgrade page over 7-day windows

· Using the accumulated number of visits to the Downgrade page over 7-day and 14-day windows

· Using the accumulated number of visits to the following pages, over both 7-day and 14-day windows

o Thumbs Up

o Thumbs Down

o Downgrade

o NextSong


· Logistic Regression

o Parameters:

§ Default parameters as listed in the API Reference[1]

· Decision Tree

o Parameters:

§ Default parameters as listed in the API Reference[2]

· Random Forest

o Parameters:

§ Default parameters as listed in the API Reference[3]

The best performing model was Logistic Regression, trained on the accumulated number of visits to the Downgrade page over 7-day and 14-day windows. It achieved an accuracy of 63%. The following table summarizes other metrics for the best performing model on the test dataset.

[1] LogisticRegression, PySpark 3.1.1 documentation

[2] DecisionTreeClassifier, PySpark 3.1.1 documentation

[3] RandomForestClassifier, PySpark 3.1.1 documentation

This combination of features and model led to an 80% recall on detecting ‘churners’. A high recall, which implies a low false negative rate, indicates that the model is unlikely to miss a ‘churner’, perhaps at the expense of misclassifying some non-churners as churners. For the Sparkify use case, it is best not to miss a potential ‘churner’, so that business with the customer can be preserved.


Modestly encouraging results were obtained. However, for a model to be considered optimal, classification accuracies of over 90% are needed. Falling short of 90% could be attributed to the small amount of data. Despite the large number of records, they represent the activity of fewer than 200 users over about two months. Moreover, the activity occurred between October and December, during the holiday season. User spending habits vary significantly throughout the year, and during the holiday season several external factors (such as competing expenses and competitor offers) could influence the churn rate more than factors extracted from the activity data (user log).


Given the achieved accuracy, it is difficult to conclude that the Sparkify dataset is sufficient to construct predictive models capable of warning platform stakeholders about potential drops in paid subscriptions before they occur. The pre-processing steps applied to the data and the features selected for analysis allowed the machine learning models to detect engagement patterns and label a user as a ‘churner’ or ‘non-churner’ with an accuracy only modestly above 50%, which is not much better than chance.


Implementing a solution to the Sparkify problem highlights key distinctions between working with Big Data and working with traditional datasets. When working with big data, the implementation of the various algorithmic steps needs to leverage the distributed processing framework as much as possible. This project highlights the importance of using pyspark's API to ensure that processing steps scale to large volumes of data.

Analyzing a stream of data over time is also a key distinction in the field of machine learning and artificial intelligence. For successful predictions, data scientists should look at data patterns that occurred before an event of interest. Predictive models are effectively chasing a moving target, and more factors come into play.


• Deployment on a compute cluster (AWS or IBM Watson)

• Modeling and testing on the entire 12GB dataset.

• Apply pyspark's ParamGridBuilder to determine the best model hyperparameters.

• Test other models such as Linear Regression, Linear Support Vector Machine and Naïve Bayes.

• The effects of feature normalization need to be studied more carefully. By default, pyspark normalizes the row vectors. Applying normalization based on the entire population of a given feature (column) may lead to desirable statistical improvements.
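To make the last point concrete, the sketch below contrasts row-wise normalization with column-wise standardization on a tiny, made-up feature matrix. It is plain Python written for illustration only; it does not reproduce the exact defaults of any pyspark class, and the column standardization here uses the population standard deviation as a simplifying assumption.

```python
import math


def normalize_rows(rows):
    """Scale each row (one user's feature vector) to unit L2 norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm if norm else 0.0 for x in row])
    return out


def standardize_columns(rows):
    """Center and scale each column (one feature across all users)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Population standard deviation, chosen here for simplicity.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical counts for two users; the second is twice as active.
rows = [[3.0, 4.0], [6.0, 8.0]]

print("row-normalized:     ", normalize_rows(rows))
print("column-standardized:", standardize_columns(rows))
```

In this example the two users become indistinguishable after row normalization, because only the proportions within each row survive, whereas column standardization preserves the fact that one user is more active than the other.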


