In part one of this miniseries, I introduced you to the topic of machine learning. And in part two, I showed you how to get started on a simple Python machine learning project using CrateDB, Jupyter Notebook, and Pandas.
If you followed along with the tutorial in part two, you should have a local CrateDB instance running, some test data (imported tweets from Twitter), and a Jupyter document.
In this post, I am going to show you how to predict the number of Twitter followers a user has using regression analysis.
Why regression analysis?
Well, in the loan approval example we looked at in part one, classification is used to predict discrete values (e.g., "Loan Approved" or "Loan Denied").
Regression analysis, on the other hand, allows you to predict continuous values, i.e. numbers, such as someone's follower count.
(If you want to learn more about the difference between discrete values and continuous values, I recommend that you check out The Elements of Statistical Learning and the corresponding YouTube videos.)
The rest of this post covers the following process: designing the experiment, setting up the environment, exploring and transforming the data, building and evaluating a base model and a linear regression model, and writing the results back to CrateDB.
Let's dive in!
Here are some important questions to answer when designing an experiment: What do we want to find out? How will we measure success? What are our variables? What is our control environment?
Let’s go through the process of answering these questions for the experiment we're going to conduct.
We want to predict the number of followers a Twitter user has by looking at the number of people they follow.
Our hypothesis:
The linear regression model performs better than the base model.
A base model (also called baseline) is a result of a different, usually more naïve, model that we are trying to beat. This serves as a reference against which we can compare the results of the linear regression model.
For something as basic as the experiment we are conducting with our test Twitter data, an educated guess (e.g., predicting the mean follower count for all inputs) can function as a satisfactory base model. In a more advanced scenario, you might want to compare your model against an existing state-of-the-art algorithm.
We can use a performance metric to determine the success criteria. This will, in turn, allow us to validate the hypothesis (our model was better than the base model) or invalidate the hypothesis (our model was no better than the base model).
For regression analysis, typical performance metrics include the root mean squared error (RMSE) and the coefficient of determination, [latex]r^2[/latex] (also known as the variance score).
For our experiment, we are going to use the RMSE and [latex]r^2[/latex] values as performance metrics.
We will calculate the RMSE and [latex]r^2[/latex] values for both the base model and our linear regression model. Comparing these values will allow us to determine which model is better, and can, therefore, be used to test our hypothesis.
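For reference, RMSE is the square root of the average squared difference between the actual values [latex]y_i[/latex] and the predicted values [latex]\hat{y}_i[/latex], and [latex]r^2[/latex] is the proportion of variance in the actual values that the predictions explain:
[latex]\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}[/latex]
[latex]r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}[/latex]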
Note: normally, when evaluating different machine learning models, more robust techniques (like k-fold cross-validation or statistical hypothesis tests) would be used to help choose the final model. But that is beyond the scope of this tutorial.
We have two types of variable: independent variables (the inputs we change) and dependent variables (the outputs we measure).
So, for us: the independent variable is the number of followees (the people a user follows), and the dependent variable is the number of followers.
The control environment is basically the setup we're planning to use for our independent variables.
For us, we're interested in comparing the actual follower count vs. the predicted follower count for every Twitter user in our data set. And we want to do this once each for both models: our base model and our linear regression model. No other variables will be changed between comparisons.
Now we've defined the crucial aspects of our experiment, we're ready to get our hands dirty with some actual code.
If you followed along with the instructions in the last part of this miniseries, you should have: a local CrateDB instance running, some test data (tweets imported from Twitter), and a working Anaconda and Jupyter Notebook setup.
We're going to use this setup to conduct our experiment.
To save you a bit of time: if you previously set up this environment, you can get things up and running again in three steps:
1. Run bin/crate from the CrateDB directory to start CrateDB.
2. Run anaconda-navigator to start Anaconda Navigator.
3. Launch Jupyter Notebook from the Anaconda Navigator home screen.
I have not included all the code necessary for running this experiment in this post. Instead, I will show the most important parts.
The full code is available on GitHub. If you want to follow along, you can download the reference notebook and use that to interact with your Twitter test data, using this blog post as a guide.
Clone the repository:
$ git clone https://github.com/crate/cratedb-jupyter.git
Open Jupyter Notebook (see the previous post) and use the file system browser to navigate to the cratedb-jupyter directory you just created.
You should see something like this:
Select CrateDB and Linear Regression.ipynb, and the notebook should open in a new tab:
Woo!
First things first. Let's set up the environment.
This notebook is a mix of explanatory text, code cells, and program output. A code cell is just a chunk of code.
Here are the first two cells:
Select the Run this cell icon that appears when you hover your mouse over In [1] in the left-hand margin. When you do, the [1] will increment, but nothing else will change, because all we've done is import some modules, so there is no output to display.
Repeat the process for the next cell. This sets some configuration values for the graphs we'll be producing later. Again, no output.
Take heed of the comments in the first cell before moving on. This was covered in the previous part of this miniseries, but in case you don't have the crate Python module installed, you can install it like so:
$ /anaconda3/bin/pip install crate
Now we come to this cell:
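The cell itself isn't reproduced in this post, but a minimal sketch of what it does might look like this (the column names inside the account_user object are assumptions based on the Twitter test data imported via the CrateDB admin UI; your notebook may differ):

from crate import client
import pandas as pd

# Connect to the local CrateDB instance (default HTTP port)
connection = client.connect('localhost:4200')
cursor = connection.cursor()

# Hypothetical query: follower and followee counts for each imported tweet
cursor.execute("""
    SELECT account_user['followers_count'] AS followers,
           account_user['friends_count'] AS followees
    FROM tweets
""")

# Load the result set into a pandas DataFrame
df = pd.DataFrame(cursor.fetchall(), columns=['followers', 'followees'])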
This code queries CrateDB for the follower and followee counts of the imported tweets and loads the results into a pandas DataFrame.
You should see something like this:
The specific values you see will differ from this screenshot because the tweets you imported were different from mine.
As you progress through this tutorial, the rest of your results will also be different. This provides you with an opportunity to interpret your own results.
Before we go any further, let's plot the data we have using a scatter plot. Doing this allows us to explore the data.
Fortunately, with pandas, we can plot the data very easily:
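For example, something like this (a minimal sketch, assuming the followers/followees DataFrame from the previous cell):

# Scatter plot of followees (x-axis) against followers (y-axis)
df.plot(kind='scatter', x='followees', y='followers')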
When you execute, you should see a scatter plot:
Here, we see that some values are very large, but most of the data lies in the lower left-hand corner. (Yours may look a little different, but this observation probably still holds true.)
This isn't an ideal situation, because linear regression is very sensitive to outliers. All it takes is a couple of very large values to change our prediction model significantly.
Additionally, looking at this scatter plot, there doesn't appear to be a linear relationship between followees and followers.
Before we continue, let's address the issues we have: the data contains some extreme outliers, and there is no obvious linear relationship between followees and followers.
We can address both of these by transforming the input values. This just means that we apply a mathematical function to the data to transform it before we feed it to our machine learning model.
In particular, we can apply a nonlinear function to get a linear relationship:
[latex]f(x)[/latex]
Logarithms are a popular nonlinear function:
[latex]\log_b \, (x)[/latex]
Logarithms retain the original ordering of the data (e.g., large values remain large) while also "pulling in" outliers (i.e., making them less extreme). This behavior makes logarithms particularly suitable when the ratio of the smallest value to the largest value is very large—which is the case for our data!
So, let's apply a logarithmic function:
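The cell isn't reproduced here; a minimal sketch of the transformation might look like this (rows with zero counts are dropped first, because [latex]\log_{10} \, (0)[/latex] is undefined; the reference notebook may handle this differently):

import numpy as np

# Keep rows where both counts are positive, then take the base-10 logarithm
log_df = np.log10(df[(df['followers'] > 0) & (df['followees'] > 0)])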
This code applies a base-10 logarithm to the follower and followee counts and stores the transformed values in a new DataFrame.
We could use [latex]\log_e \, (x)[/latex], but in practice most people prefer to use [latex]\log_{10} \, (x)[/latex] because it is easier to interpret.
If the value of [latex]\log_{10} \, (x)[/latex] increases by one, the value of [latex]x[/latex] is multiplied by 10. In the case of [latex]\log_e \, (x)[/latex], [latex]x[/latex] would be multiplied by [latex]e \approx 2.718[/latex], which is less intuitive.
When you execute this cell, you should see a table like this:
Great! Let's plot it with this:
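The plotting call is essentially the same as before, just applied to the transformed DataFrame (log_df is the hypothetical name from the sketch above):

# Scatter plot of the log-transformed values
log_df.plot(kind='scatter', x='followees', y='followers')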
You should see a plot that looks like this:
Huzzah!
Immediately, here, we can see the outline of a linear relationship between followees and followers (i.e. one increases as the other increases). We still have outliers, but they're less extreme, so they won't influence the model as much as before.
Now we can move on to building our models.
Before we can start building our base model, we have to split the data into training and testing data:
With scikit-learn, we can randomly split the data into training (two thirds) and testing (one third) with a single line of code:
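Something along these lines, using scikit-learn's train_test_split helper (the variable names are just for illustration):

from sklearn.model_selection import train_test_split

# Followees are the input (X), followers are the output (y)
X = log_df[['followees']]
y = log_df['followers']

# Randomly split into two thirds training data and one third testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=42)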
Now we have our training data, we can set up our base model by calculating the average number of followers:
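A minimal sketch, using the hypothetical variable names from the split above:

# The base model simply predicts the mean (log) follower count of the training data
mean_followers = y_train.mean()
print('Average followers', mean_followers)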
Which, in my case, gave this:
'Average followers 2.429029322032724'
Here, remember that we've applied a logarithmic function to the data. So, to get the real-world value this corresponds to, we have to apply:
[latex]10^x[/latex]
In my case, that's 269 followers.
So, our base model predicts 269 followers for all Twitter users.
Let's evaluate this model using the RMSE and [latex]r^2[/latex] values mentioned earlier:
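A sketch using scikit-learn's metric helpers; the base model's "prediction" is just the training mean repeated for every example in the test set:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# The base model predicts the same value for every user in the test set
base_predictions = np.full(len(y_test), mean_followers)

print('Root mean squared error: %.2f' % np.sqrt(mean_squared_error(y_test, base_predictions)))
print('Variance score: %.2f' % r2_score(y_test, base_predictions))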
When I did this, I got the following result:
Root mean squared error: 0.86
Variance score: -0.00
To find out what the RMSE score means, we can apply [latex]10^x[/latex] like before, which gives us 7.24435960075. This means that by predicting 269 followers over and over again, we can expect to be off by a factor of 7x. Eep! Not great. But we expected that.
The variance score ([latex]r^2[/latex]) is zero. Which, again, is what we'd expect, because we guess the same value (~269 followers) no matter the input (followees), so our predictions explain none of the variance in the output.
With our base model in hand, we can move on to the interesting stuff: creating and training a linear regression model.
Let's see if we can beat the base model and validate our hypothesis.
Execute this cell:
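The cell isn't reproduced in this post; a minimal sketch of what it might contain (names carried over from the earlier sketches):

from sklearn.linear_model import LinearRegression

# Fit a linear regression model to the training data
model = LinearRegression()
model.fit(X_train, y_train)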
This code uses scikit-learn's LinearRegression class to fit a linear model to the training data. With this in hand, we can use the fitted model to predict the number of followers each user has:
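Again, a minimal sketch of the prediction and scoring step, mirroring what we did for the base model:

# Predict (log) follower counts for the test set and score the predictions
predictions = model.predict(X_test)

print('Root mean squared error: %.2f' % np.sqrt(mean_squared_error(y_test, predictions)))
print('Variance score: %.2f' % r2_score(y_test, predictions))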
With my data, I got the following result:
Root mean squared error: 0.62
Variance score: 0.48
Again, to find out what the RMSE score means, we can apply [latex]10^x[/latex], which gives us 4.1686938347. This means we can expect our predictions to be off by a factor of 4x. Which may not be perfect, but it is an improvement.
The variance score ([latex]r^2[/latex]) is a huge improvement over the base model. A variance score of zero means the model explains none of the variance in the data, and a score of one means it explains all of it.
In conclusion: the RMSE dropped from 0.86 to 0.62, and the variance score rose from roughly zero to 0.48.
The changes in the respective performance scores support our hypothesis that the linear regression model performs better than the base model.
Nice!
But we can do better than that.
Let's visualize it.
Now that we have prediction data, we can plot this alongside the actual data:
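A sketch of the plotting code, assuming matplotlib and the variable names used in the earlier sketches:

import matplotlib.pyplot as plt

# Actual test data in blue, model predictions in red
plt.scatter(X_test['followees'], y_test, color='blue', label='actual')
plt.plot(X_test['followees'], predictions, color='red', label='predicted')
plt.xlabel('log10(followees)')
plt.ylabel('log10(followers)')
plt.legend()
plt.show()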
Which gives us this:
Here, the test data is plotted in blue, and the predictions of our linear regression model are plotted in red.
As before, the values on this plot represent logarithm values. If we want to access the non-logarithmic predictions, we have to apply a reverse transformation:
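Since the forward transformation was [latex]\log_{10} \, (x)[/latex], reversing it just means raising 10 to the predicted values, for example:

# Undo the log10 transformation to get real-world follower counts
real_predictions = 10 ** predictions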
When I ran this, I saw the following:
Cool.
Let's write the results back to CrateDB:
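The exact cell depends on the table layout you want; here is a minimal sketch that reuses the crate client connection from earlier (the predictions table and its columns are hypothetical):

# Hypothetical table to hold the results
cursor.execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        actual_followers DOUBLE,
        predicted_followers DOUBLE
    )
""")

# One row per user in the test set, converted back to real-world counts
rows = [(float(10 ** actual), float(predicted))
        for actual, predicted in zip(y_test, real_predictions)]

cursor.executemany(
    "INSERT INTO predictions (actual_followers, predicted_followers) VALUES (?, ?)",
    rows
)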
This code writes the prediction results for each user back to a table in CrateDB.
And, if you switch over to the CrateDB admin UI, you should see something like this:
Et voila!
In this post we: designed a simple experiment, explored and transformed our Twitter test data, built a base model and a linear regression model, compared the two using RMSE and [latex]r^2[/latex] scores, and wrote the predictions back to CrateDB.
I hope you enjoyed this miniseries. Hopefully, you feel more confident getting your feet wet with machine learning or data science more generally.
If you haven't already done so, check out part one for an introduction to machine learning and part two for an introduction to data science and working with CrateDB, Jupyter Notebook, and Pandas.
If you'd like me to cover something else in a follow-up post, please don't hesitate to drop me a line.