Tuesday, October 13, 2015

Game of Life with Python

The game of life consists of a grid of cells where each cell can be dead or alive and the state of the cells can change at each time step. The state of a cell at the time step t depends on the state of the grid at time t-1 and it is determined with a very simple rule:

A cell is alive if it's already alive and has two living neighbours, or if it has three live neighbours.

We call the grid the universe and the living cells the population. At each time step the population evolves and we have a new generation. The evolution of the population is a fascinating process to observe because it can generate an incredible variety of patterns (and also puzzles!).
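To make the rule concrete, here is a minimal sketch that applies it to a single cell. The helper name next_state is hypothetical and is not used in the rest of the post; it only tabulates the outcome for each possible neighbour count.

```python
def next_state(alive, live_neighbours):
    """Return True if the cell is alive at the next time step."""
    # a live cell with exactly two live neighbours stays alive;
    # any cell (dead or alive) with exactly three live neighbours is alive
    return (alive and live_neighbours == 2) or live_neighbours == 3

# tabulate the outcome for every possible neighbour count (0 to 8)
for n in range(9):
    print(n, next_state(True, n), next_state(False, n))
```

Only the rows for two and three neighbours contain a True, which is why most random configurations thin out quickly.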

Implementing the game of life in Python is quite straightforward:
import numpy as np

def life(X, steps):
    """
     Conway's Game of Life.
     - X, matrix with the initial state of the game.
     - steps, number of generations.
    """
    def roll_it(x, y):
        # rolls the matrix X in a given direction
        # x=1, y=0 on the left;  x=-1, y=0 right;
        # x=0, y=1 top; x=0, y=-1 down; x=1, y=1 top left; ...
        return np.roll(np.roll(X, y, axis=0), x, axis=1)

    for _ in range(steps):
        # count the number of neighbours 
        # the universe is considered toroidal
        Y = roll_it(1, 0) + roll_it(0, 1) + roll_it(-1, 0) \
            + roll_it(0, -1) + roll_it(1, 1) + roll_it(-1, -1) \
            + roll_it(1, -1) + roll_it(-1, 1)
        # game of life rules
        X = np.logical_or(np.logical_and(X, Y ==2), Y==3)
        X = X.astype(int)
        yield X
The function life takes as input a matrix X that represents the universe of the game: a cell is alive if its corresponding element has value 1 and dead if it has value 0. The function is a generator that yields the next steps generations, one per iteration. At each time step the number of neighbours of each cell is counted and the rule of the game is applied. Now we can create a universe with an initial state:
X = np.zeros((40, 40)) # 40 by 40 dead cells

# R-pentomino
X[23, 22:24] = 1
X[24, 21:23] = 1
X[25, 22] = 1
This initial state is known as the R-pentomino. It consists of five living cells organized as shown here (image from Wikipedia)

It is by far the most active polyomino with fewer than six cells; all of the others stabilize in at most 10 generations. Let's create a video to visualize the evolution of the system:
from matplotlib import pyplot as plt
import matplotlib.animation as manimation

FFMpegWriter = manimation.writers['ffmpeg']
metadata = dict(title='Game of life', artist='JustGlowing')
writer = FFMpegWriter(fps=10, metadata=metadata)

fig = plt.figure()
fig.patch.set_facecolor('black')
with writer.saving(fig, "game_of_life.mp4", 200):
    plt.spy(X)
    plt.axis('off')
    writer.grab_frame()
    plt.clf()
    for x in life(X, 800):
        plt.spy(x)
        plt.axis('off')
        writer.grab_frame()
        plt.clf()
The result is as follows:


In the video we can spot a few well-known patterns, like gliders and blinkers. There is also an exploding star at 0:55!
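A quick way to check the rule without rendering a video is to run it on a blinker, which should flip between a horizontal and a vertical bar and return to its initial state after two generations. The sketch below is self-contained and uses a hypothetical helper called step that mirrors the rolling trick used in life; the 5 by 5 grid size is an arbitrary choice.

```python
import numpy as np

def step(X):
    """One generation on a toroidal grid (hypothetical helper)."""
    # sum the eight shifted copies of the grid to count live neighbours
    Y = sum(np.roll(np.roll(X, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # alive if three neighbours, or already alive with two neighbours
    return ((Y == 3) | ((X == 1) & (Y == 2))).astype(int)

# blinker: three live cells in a row, in the middle of a 5x5 grid
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1

after_one = step(blinker)    # the bar should now be vertical
after_two = step(after_one)  # and horizontal again

print(after_one)
print(np.array_equal(after_two, blinker))
```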

Thursday, April 9, 2015

Stacked area plots with matplotlib

In a stacked area plot, the values on the y axis are accumulated at each x position and the area between the resulting values is then filled. These plots are helpful when it comes to comparing quantities through time. For example, consider the monthly totals of the number of new cases of measles, mumps, and chicken pox for New York City during the years 1931-1971 (data that we already considered here). We can compare the number of cases of each disease month by month. First, we need to load and organize the data properly:
import numpy as np
from scipy.io import loadmat
NYCdiseases = loadmat('NYCDiseases.mat') # loading a matlab file
chickenpox = np.sum(NYCdiseases['chickenPox'],axis=0)
mumps = np.sum(NYCdiseases['mumps'],axis=0)
measles = np.sum(NYCdiseases['measles'],axis=0)
In the snippet above we read the data from a Matlab file and summed the number of cases for each month. We are now ready to visualize our values using the function stackplot:
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

plt.stackplot(np.arange(12)+1,
          [chickenpox, mumps, measles], 
          colors=['#377EB8','#55BA87','#7E1137'])
plt.xlim(1,12)

# creating the legend manually
plt.legend([mpatches.Patch(color='#377EB8'),  
            mpatches.Patch(color='#55BA87'), 
            mpatches.Patch(color='#7E1137')], 
           ['chickenpox','mumps','measles'])
plt.show()
The result is as follows:


We note that the highest number of cases occurs between January and July. We also see that measles cases are more common than mumps and chicken pox cases.
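The accumulation behind a stacked area plot can be reproduced by hand with a cumulative sum. The sketch below uses two short, made-up series (not the NYC data) and computes the upper edge of each band. With a zero baseline, these are the curves between which the areas are filled.

```python
import numpy as np

# hypothetical counts for two series at four x positions
a = np.array([3, 5, 2, 4])
b = np.array([1, 2, 6, 1])

# stack the series as rows and accumulate down the rows:
# row 0 is the top of the first band, row 1 the top of the second
layers = np.cumsum(np.vstack([a, b]), axis=0)

print(layers[0])  # upper edge of the first band (same as a)
print(layers[1])  # upper edge of the second band (a + b)
```

The height of the topmost curve is the total across all series, which is why this kind of plot shows both the individual contributions and their sum.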

Tuesday, January 27, 2015

Forecasting beer consumption with sklearn

In this post we will see how to implement a straightforward forecasting model based on the linear regression object of sklearn. The model that we are going to build is based on the idea that past observations are good predictors of a future value. Using some symbols, given x_{n-k}, ..., x_{n-2}, x_{n-1} we want to estimate x_{n+h}, where h is the forecast horizon, using only the given values. The estimate that we are going to apply is the following:


where x_{n-k} and x_{n-1} are respectively the oldest and the newest observation we consider for the forecast. The weights w_k, ..., w_1, w_0 are chosen in order to minimize


where m is the number of periods available to train our model. This model is often referred to as a regression model with lagged explanatory variables, and k is called the lag order.
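For reference, the two expressions described above can be written out as follows. This is a sketch under two assumptions: that w_0 plays the role of the intercept while w_1, ..., w_k multiply the lagged observations, and that the quantity being minimized is the usual sum of squared errors of ordinary least squares. The exact indexing of the weights is an assumption.

```latex
\hat{x}_{n+h} = w_k x_{n-k} + \dots + w_2 x_{n-2} + w_1 x_{n-1} + w_0
```

```latex
\sum_{i=1}^{m} \left( x_i - \hat{x}_i \right)^2
```

Here \hat{x}_i denotes the estimate produced by the first expression for the i-th training period.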

Before implementing the model let's load a time series to forecast:
import pandas as pd
df = pd.read_csv('NZAlcoholConsumption.csv')
to_forecast = df.TotalBeer.values
dates = df.DATE.values
The time series represents the total alcohol consumed per quarter, in millions of litres, from the 1st quarter of 2000 to the 3rd quarter of 2012. The data is from the New Zealand government and can be downloaded in csv format from here. We will focus on the forecast of beer consumption.
First, we need to organize our data into windows that contain the previous observations:
import numpy as np

def organize_data(to_forecast, window, horizon):
    """
     Input:
      to_forecast, univariate time series organized as numpy array
      window, number of items to use in the forecast window
      horizon, horizon of the forecast
     Output:
      X, a matrix where each row contains a forecast window
      y, the target values for each row of X
    """
    shape = to_forecast.shape[:-1] + \
            (to_forecast.shape[-1] - window + 1, window)
    strides = to_forecast.strides + (to_forecast.strides[-1],)
    X = np.lib.stride_tricks.as_strided(to_forecast, 
                                        shape=shape, 
                                        strides=strides)
    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])
    return X[:-horizon], y

k = 4   # number of previous observations to use
h = 1   # forecast horizon
X,y = organize_data(to_forecast, k, h)
Now, X is a matrix where the i-th row contains the lagged variables x_{n-k}, ..., x_{n-2}, x_{n-1} and y[i] contains the i-th target value. We are ready to train our forecasting model:
from sklearn.linear_model import LinearRegression
 
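m = 10 # number of samples to take into account
# the normalize argument was removed in scikit-learn 1.2;
# for ordinary least squares the predictions are the same without it
regressor = LinearRegression()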
regressor.fit(X[:m], y[:m])
We trained our model using the first 10 rows of X. Each row holds four lagged values and has one target, so the training data covers the first 14 quarters of the series, from the 1st quarter of 2000 to the 2nd quarter of 2003. We use a lag order of one year and a forecast horizon of 1 quarter. To estimate the error of the model we will use the mean absolute percentage error (MAPE). Computing this metric on the forecasts and the actual values for the remaining observations of the time series, we have:
def mape(ypred, ytrue):
    """ returns the mean absolute percentage error """
    idx = ytrue != 0.0
    return 100*np.mean(np.abs(ypred[idx]-ytrue[idx])/ytrue[idx])

print('The error is %0.2f%%' % mape(regressor.predict(X[m:]), y[m:]))
The error is 6.15%
This means that, on average, the forecast provided by our model differs from the target value by only 6.15%. Let's compare the forecast and the observed values visually:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
plt.plot(y, label='True demand', color='#377EB8', linewidth=2)
plt.plot(regressor.predict(X), 
         '--', color='#EB3737', linewidth=3, label='Prediction')
plt.plot(y[:m], label='Train data', color='#3700B8', linewidth=2)
plt.xticks(np.arange(len(dates))[1::4], dates[1::4], rotation=45)
plt.legend(loc='upper right')
plt.ylabel('beer consumed (millions of litres)')
plt.show()

We note that the forecast is very close to the target values and that the model was able to learn the trends and anticipate them in many cases.
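The windowing step performed by organize_data can be easier to follow on a toy series. The sketch below is an alternative formulation, not the code used above: it relies on sliding_window_view (available in NumPy 1.20 and later) instead of as_strided, and the helper name make_windows and the toy series are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, window, horizon):
    """Hypothetical helper: lagged windows X and their targets y."""
    # every length-`window` slice of the series, one per row
    W = sliding_window_view(series, window)
    # drop the last `horizon` windows, which have no target available
    X = W[:-horizon]
    # the target sits `horizon` steps after the last item of each window
    y = series[window - 1 + horizon:]
    return X, y

series = np.arange(10.0)  # toy series 0, 1, ..., 9
X_toy, y_toy = make_windows(series, window=4, horizon=1)

print(X_toy.shape, y_toy.shape)
print(X_toy[0], y_toy[0])  # first window and the value it should predict
```

With a window of 4 and a horizon of 1, the first row holds the values 0 to 3 and its target is 4, which matches the layout that organize_data produces for the beer series.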