📉 How-tos survival

While watching a “ReactJS Basics” course the other day, I noticed that the further I went, the fewer pageviews each lesson had. But is that “dropout rate” the same for all courses? Here is my little research.

Intro

The procedure we are going to follow is called survival analysis. It’s very well known in medicine but can be applied in many other industries as well.

“The fraction of patients living for a certain amount of time after treatment.” (Wikipedia)

Let’s think of a brighter example than cancer patients. It could be the percentage of laptops from the same batch that still work, or the fraction of people who continue using your product after the first week, the second week, etc.

In our research, we’ll look at how many of the people who started an online course continue with it and watch new lessons.

Main assumption

We are going to work with YouTube data since it’s open and there are plenty of interesting courses. But there is a very important assumption. To do a precise survival analysis, we’d need to know the timestamp and user id for every view of every video in the course. Since YouTube gives us only the number of pageviews per video, let’s assume that all pageviews are unique and that no users started the course late, i.e. somewhere other than the first lesson (we’d exclude them in a real analysis).

Getting the data

So I’ve picked a few courses on programming, chess, guitar, drawing, and fitness.

I don’t think there is much use in sharing Ruby basics like looping through an array of ids and dumping the downloaded data to CSV; you can check all the code yourself. Maybe pay attention to how it’s organized, though. It’s very much inspired by Jupyter Notebooks in Python: a progression of steps that should be run one by one.

Visualize survival curves

At this point, we have pageviews for each video in the selected courses. Before we start processing the data, let’s just plot it as it is. I think this step is very useful in any research you do.

Gnuplot is pretty much a standard, so let’s go with it.
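The plotting script itself isn’t reproduced in this post, so here is only a rough sketch of how a Ruby step could generate a gnuplot script. The `gnuplot_script` helper, the file names, and the two-column CSV layout (lesson number, pageviews) are assumptions made up for illustration, not the code from the repository.

```ruby
# Build a gnuplot script that draws one line per course.
# Assumption: every CSV has two comma-separated columns,
# lesson number and pageviews.
def gnuplot_script(csv_by_course, output = "pageviews.png")
  # One `"file" using 1:2 with lines title "name"` clause per course.
  series = csv_by_course.map do |course, path|
    %("#{path}" using 1:2 with lines title "#{course}")
  end

  <<~GNUPLOT
    set terminal png size 1000,600
    set output "#{output}"
    set datafile separator ","
    set xlabel "Lesson number"
    set ylabel "Pageviews"
    plot #{series.join(", ")}
  GNUPLOT
end

# Hypothetical file names, for illustration only.
puts gnuplot_script({ "Chess" => "chess.csv", "ReactJS" => "react.csv" })
```

Saving the printed text to a file such as `plot.gp` and running `gnuplot plot.gp` should then render the image.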

[Figure: Raw YouTube pageviews per video]

[Figure: YouTube pageviews with no spikes]

As you can see, not every curve has a downward slope: there are spikes in the middle with millions of pageviews. I guess it’s very specific to YouTube, where some videos go viral; just look at the title “How to Achieve Checkmate in 2 Moves”. So we’ll remove these points. We’d also remove pageviews from users who didn’t start from the beginning, but our assumption is that there are no such pageviews.
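The post doesn’t spell out the exact rule used to drop the spikes. One simple rule that follows from the assumption above (nobody joins mid-course, so views can never grow from one lesson to the next) is to keep a video only if its count does not exceed the last kept count. The `remove_spikes` helper and the numbers below are made up for illustration.

```ruby
# Keep only points that do not rise above the previously kept point.
# Returns [lesson_number, pageviews] pairs so that gaps left by removed
# videos stay visible on the plot.
# Assumption: the first video is a trustworthy starting point.
def remove_spikes(pageviews)
  kept = []
  pageviews.each_with_index do |views, index|
    kept << [index + 1, views] if kept.empty? || views <= kept.last[1]
  end
  kept
end

# Made-up numbers: lesson 3 went viral, lesson 5 is a small bump.
views = [1000, 800, 5_000_000, 600, 650, 500]
p remove_spikes(views)
# => [[1, 1000], [2, 800], [4, 600], [6, 500]]
```

This rule is deliberately strict: it also drops small bumps, which may or may not be what you want for a given course.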

For the final plot, we will calculate the fraction of “survived” users for every lesson.
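Under the assumption that every pageview is a unique user who started from lesson one, the surviving fraction for a lesson is simply its pageviews divided by the pageviews of the first lesson. A minimal sketch of that calculation (the `survival_curve` name and the sample numbers are hypothetical):

```ruby
# Fraction of first-lesson viewers still watching at each lesson.
# Assumption: pageviews are ordered by lesson and the first entry
# is the size of the starting cohort.
def survival_curve(pageviews)
  raise ArgumentError, "need at least one lesson" if pageviews.empty?
  raise ArgumentError, "first lesson has no views" if pageviews.first.zero?

  cohort = pageviews.first.to_f
  pageviews.map { |views| (views / cohort).round(3) }
end

# Made-up pageview counts for a four-lesson course.
p survival_curve([1000, 800, 600, 500])
# => [1.0, 0.8, 0.6, 0.5]
```

Normalizing by the first lesson is what makes courses of very different popularity comparable on one plot.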

[Figure: YouTube pageviews survival curves]

As you can see, it’s really hard to survive singing and drawing classes online 😃 At the top, we have chess and ReactJS: both are easy but bring you a lot of fun!

P.S. All code for this post is available on GitHub.
