Monday, October 9, 2017

Spotting outliers with Isolation Forest using sklearn

Isolation Forest is an algorithm to detect outliers. It partitions the data using a set of trees and provides an anomaly score that measures how isolated each point is in the resulting structure; the score is then used to tell outliers apart from normal observations. In this post we will see an example of how IsolationForest behaves in a simple case. First, we will generate 1-dimensional data from a bimodal distribution, then we will compare the anomaly score with the distribution of the data, highlighting the regions where the outliers fall.

To start, let's generate the data and plot the histogram:
import numpy as np
import matplotlib.pyplot as plt

x = np.concatenate((np.random.normal(loc=-2, scale=.5,size=500), 
                    np.random.normal(loc=2, scale=.5, size=500)))

plt.hist(x, density=True)
plt.xlim([-5, 5])
plt.show()

Here we note that there are three regions where the data has a low probability to appear: one on the right side of the distribution, another one on the left and a third around zero. Let's see if IsolationForest is able to identify these three regions:

from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(n_estimators=100)
isolation_forest.fit(x.reshape(-1, 1))

xx = np.linspace(-6, 6, 100).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)

plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score), 
                 where=outlier==-1, color='r', 
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('x')
plt.xlim([-5, 5])
plt.show()

In the snippet above we trained our IsolationForest using the generated data, computed the anomaly score for each observation and classified each observation as outlier or non outlier. The chart shows the anomaly score and the regions where the outliers fall. As expected, the anomaly score reflects the shape of the underlying distribution and the outlier regions correspond to the low probability areas.
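How many points fall in the outlier region depends on the threshold applied to the anomaly score. This can be tuned through the contamination parameter of IsolationForest, which sets the expected fraction of outliers in the data. A minimal variation of the training step above:
# flag roughly 5% of the observations as outliers
isolation_forest = IsolationForest(n_estimators=100, contamination=0.05)
isolation_forest.fit(x.reshape(-1, 1))
Lowering contamination shrinks the outlier regions in the chart above, while raising it widens them.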

Thursday, July 13, 2017

Dates in Pandas Cheatsheet

Lately I've been working a lot with dates in Pandas so I decided to make this little cheatsheet with the commands I use the most.

Importing a csv using a custom function to parse dates

import pandas as pd
import numpy as np  # used in the snippets below
from datetime import datetime

def parse_month(month):
    """
    Converts a string in the format YYYYMmm into a datetime.
    Example: parse_month("2007M02") returns datetime(2007, 2, 1)
    """
    return datetime(int(month[:4]), int(month[-2:]), 1)

temperature = pd.read_csv('TempUSA.csv', parse_dates=['Date'], 
                          date_parser=parse_month, 
                          index_col=['Date'], # will become an index
                          # use a subset of the columns
                          usecols=['Date', 
                                   'LosAngelesMax', 'LosAngelesMin'])
print(temperature)
            LosAngelesMax  LosAngelesMin
Date                                    
2000-01-01           19.6           10.0
2000-02-01           18.9           10.1
2000-03-01           18.6           10.1
2000-04-01           20.2           12.5
2000-05-01           21.9           14.2

Format the dates in a chart

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
plt.plot(temperature['LosAngelesMax'])
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.show()

Here's the reference of the date format directives. ISO compliant format: %Y-%m-%dT%H:%M:%S.
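For instance, to print a single timestamp using the ISO compliant format above:
print(pd.Timestamp('2017-07-13 16:00').strftime('%Y-%m-%dT%H:%M:%S'))
2017-07-13T16:00:00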

Group the DataFrame by month

print(temperature.groupby([temperature.index.month]).mean())
      LosAngelesMax  LosAngelesMin
Date                              
1         20.092308       8.992308
2         19.223077       9.276923
3         19.253846      10.492308
4         19.992308      11.461538
5         21.076923      13.761538
6         22.123077      15.800000
7         23.892308      17.315385
8         24.246154      17.530769
9         24.384615      16.846154
10        23.330769      14.630769
11        21.950000      11.241667
12        19.241667       8.683333
The resulting DataFrame is indexed by month.
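To aggregate along the timeline instead, for example computing the average temperature of each year, resample is the usual tool:
# yearly means, one row per year
print(temperature.resample('A').mean().head())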

Merging two DataFrames indexed with timestamps that don't match exactly

date_range_a = pd.date_range('2007-01-01 01:00', 
                            '2007-01-01 3:00', freq='1h')
date_range_b = date_range_a + pd.Timedelta(10, 'm')
df_a = pd.DataFrame(np.arange(len(date_range_a)), 
                    columns=['a'], index=date_range_a)
df_b = pd.DataFrame(['x', 'y', 'z'], 
                    columns=['b'], index=date_range_b)

print('left DataFrame')
print(df_a)
print('\nright DataFrame')
print(df_b)
print('\nmerge_asof result')
print(pd.merge_asof(df_a, df_b, direction='nearest', 
                    left_index=True, right_index=True))
left DataFrame
                     a
2007-01-01 01:00:00  0
2007-01-01 02:00:00  1
2007-01-01 03:00:00  2

right DataFrame
                     b
2007-01-01 01:10:00  x
2007-01-01 02:10:00  y
2007-01-01 03:10:00  z

merge_asof result
                     a  b
2007-01-01 01:00:00  0  x
2007-01-01 02:00:00  1  y
2007-01-01 03:00:00  2  z
The DataFrames have been aligned according to the index on the left.
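By default merge_asof matches the nearest key no matter how far away it is; the tolerance parameter bounds the allowed distance. For example, with a 5 minute tolerance none of the timestamps above (which are 10 minutes apart) can be matched and the column b is filled with NaNs:
print(pd.merge_asof(df_a, df_b, direction='nearest', 
                    left_index=True, right_index=True,
                    tolerance=pd.Timedelta(5, 'm')))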

Aligning two DataFrames

aligned = df_a.align(df_b)

print('left aligned')
print(aligned[0])
print('\nright aligned')
print(aligned[1])
print('\ncombination')
aligned[0]['b'] = aligned[1]['b']
print(aligned[0])
left aligned
                       a   b
2007-01-01 01:00:00  0.0 NaN
2007-01-01 01:10:00  NaN NaN
2007-01-01 02:00:00  1.0 NaN
2007-01-01 02:10:00  NaN NaN
2007-01-01 03:00:00  2.0 NaN
2007-01-01 03:10:00  NaN NaN

right aligned
                      a    b
2007-01-01 01:00:00 NaN  NaN
2007-01-01 01:10:00 NaN    x
2007-01-01 02:00:00 NaN  NaN
2007-01-01 02:10:00 NaN    y
2007-01-01 03:00:00 NaN  NaN
2007-01-01 03:10:00 NaN    z

combination
                       a    b
2007-01-01 01:00:00  0.0  NaN
2007-01-01 01:10:00  NaN    x
2007-01-01 02:00:00  1.0  NaN
2007-01-01 02:10:00  NaN    y
2007-01-01 03:00:00  2.0  NaN
2007-01-01 03:10:00  NaN    z
The timestamps are now aligned across the two DataFrames and the unknown values have been filled with NaNs. When working with numeric values, the missing entries can be filled with interpolation:
print(aligned[0].a.interpolate())
2007-01-01 01:00:00    0.0
2007-01-01 01:10:00    0.5
2007-01-01 02:00:00    1.0
2007-01-01 02:10:00    1.5
2007-01-01 03:00:00    2.0
2007-01-01 03:10:00    2.0
Name: a, dtype: float64
Categorical values can be filled using the fillna method:
print(aligned[1].b.fillna(method='bfill'))
2007-01-01 01:00:00    x
2007-01-01 01:10:00    x
2007-01-01 02:00:00    y
2007-01-01 02:10:00    y
2007-01-01 03:00:00    z
2007-01-01 03:10:00    z
Name: b, dtype: object
The method bfill propagates the next valid observation backward, while ffill propagates the last valid observation forward.
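To see the difference, here's the same column filled forward:
print(aligned[1].b.fillna(method='ffill'))
2007-01-01 01:00:00    NaN
2007-01-01 01:10:00      x
2007-01-01 02:00:00      x
2007-01-01 02:10:00      y
2007-01-01 03:00:00      y
2007-01-01 03:10:00      z
Name: b, dtype: object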

Convert a Timedelta to hours

td = pd.Timestamp('2017-07-05 16:00') - pd.Timestamp('2017-07-05 12:00')
print(td / pd.Timedelta(1, unit='h'))
4.0
To convert to days, minutes, seconds and so on, one just needs to change the unit. The accepted values are: D, h, m, s, ms, us, ns.
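The same conversion can also be done with the total_seconds method:
print(td.total_seconds() / 3600)
4.0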

Convert pandas timestamps to unix timestamps

unix_ts = pd.date_range('2017-01-01 1:00', 
                        '2017-01-01 2:00', 
                        freq='30min').astype(np.int64) // 10**9
print(unix_ts)
Int64Index([1483232400, 1483234200, 1483236000], dtype='int64')
To convert to milliseconds, divide by 10**6 instead of 10**9.

Convert unix timestamps to pandas timestamps

print(pd.to_datetime(unix_ts, unit='s'))
DatetimeIndex(['2017-01-01 01:00:00', '2017-01-01 01:30:00',
               '2017-01-01 02:00:00'],
              dtype='datetime64[ns]', freq=None)
To convert from timestamps in milliseconds, change the unit to 'ms'.
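For instance, reusing the timestamps computed above but expressed in milliseconds:
print(pd.to_datetime(unix_ts * 1000, unit='ms'))
DatetimeIndex(['2017-01-01 01:00:00', '2017-01-01 01:30:00',
               '2017-01-01 02:00:00'],
              dtype='datetime64[ns]', freq=None)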

Friday, June 16, 2017

A heatmap of male to female ratios with Seaborn

In this post we will see how to create a heatmap with seaborn. We'll use a dataset from the Wittgenstein Centre Data Explorer. The data extracted is also reported here in csv format. It contains the ratio of males to females in the population by age from 1970 to 2015 (the data reported after this period is projected). First, we import the data using Pandas:
import pandas as pd
import numpy as np

sex_ratios = pd.read_csv('m2f_ratios.csv', skiprows=8)

age_code = {a: i for i,a in enumerate(sex_ratios.Age.unique())}
age_label = {i: a for i,a in enumerate(sex_ratios.Age.unique())}
sex_ratios['AgeCode'] = sex_ratios.Age.apply(lambda x: age_code[x])

area_idx = sex_ratios.Area == \
           'United Kingdom of Great Britain and Northern Ireland'
years_idx = sex_ratios.Year <= 2015
sex_ratios_uk = sex_ratios[np.logical_and(years_idx, area_idx)]
Here we take care of the age coding and isolate the data for the United Kingdom and Northern Ireland. Now we can rearrange the data to see the ratio per year and age using a pivot table; we can then visualize the result using the heatmap function from seaborn:
import matplotlib.pyplot as plt
import seaborn as sns

pivot_uk = sex_ratios_uk.pivot_table(values='Ratio', 
                                     index='AgeCode', 
                                     columns='Year')
pivot_uk.index = [age_label[a] for a in pivot_uk.index.values]

plt.figure(figsize=(10, 8))
plt.title('Sex ratio per year and age groups')
sns.heatmap(pivot_uk, annot=True)
plt.show()

In each year we see that the ratio was above 1 (in favor of males) for young ages; it then becomes lower than 1 during adulthood and keeps decreasing with age. It also seems that over time the ratio decreases more slowly: for example, the age group 70-74 had a ratio of 0.63 in 1970, while in 2015 the ratio was 0.9.
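Since a ratio of 1 means that males and females are in equal number, it can help to center the colormap on that value so that the two directions of the imbalance get opposite colors. The heatmap function supports this through its center parameter:
plt.figure(figsize=(10, 8))
plt.title('Sex ratio per year and age groups')
sns.heatmap(pivot_uk, annot=True, center=1.0)
plt.show()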

Thursday, April 27, 2017

Solving the Two Spirals problem with Keras

In this post we will see how to create a Multi Layer Perceptron (MLP), one of the most common Neural Network architectures, with Keras. Then, we'll train the MLP to tell apart points from two different spirals in the same space.
To have a sense of the problem, let's first generate the data to train the network:
import numpy as np
import matplotlib.pyplot as plt

def twospirals(n_points, noise=.5):
    """
     Returns the two spirals dataset.
    """
    n = np.sqrt(np.random.rand(n_points,1)) * 780 * (2*np.pi)/360
    d1x = -np.cos(n)*n + np.random.rand(n_points,1) * noise
    d1y = np.sin(n)*n + np.random.rand(n_points,1) * noise
    return (np.vstack((np.hstack((d1x,d1y)),np.hstack((-d1x,-d1y)))), 
            np.hstack((np.zeros(n_points),np.ones(n_points))))

X, y = twospirals(1000)

plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.', label='class 1')
plt.plot(X[y==1,0], X[y==1,1], '.', label='class 2')
plt.legend()
plt.show()

As we can see, this dataset contains two different spirals. This kind of dataset has been named the "Worst Dataset Ever!"; indeed, telling apart the points of the two spirals is not an easy task if your MLP is not sophisticated enough. Let's build a simple MLP with Keras and see what we can achieve:
from keras.models import Sequential
from keras.layers import Dense

mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))

mymlp.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# trains the model
mymlp.fit(X, y, epochs=150, batch_size=10,  verbose=0)
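To double check the structure we just built, Keras can print a description of the model with the output shape and the number of parameters of each layer:
mymlp.summary()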
Here we created a Neural Network with the following structure: 2 inputs (the data is in a 2D space) fully connected to 12 hidden neurons and 1 output. Let's generate some test data and see if our model is able to classify them:
X_test, y_test = twospirals(1000)

yy = np.round(mymlp.predict(X_test).T[0])

plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()

We have the original training set on the left and the result of the Neural Network on the right. It's easy to see that the model misclassified most of the points in the test data. Let's add two hidden layers to our model and see what happens:
mymlp = Sequential()
mymlp.add(Dense(12, input_dim=2, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(12, activation='tanh'))
mymlp.add(Dense(1, activation='sigmoid'))

mymlp.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Fit the model
mymlp.fit(X, y, epochs=150, batch_size=10,  verbose=0)

yy = np.round(mymlp.predict(X_test).T[0])

plt.subplot(1,2,1)
plt.title('training set')
plt.plot(X[y==0,0], X[y==0,1], '.')
plt.plot(X[y==1,0], X[y==1,1], '.')
plt.subplot(1,2,2)
plt.title('Neural Network result')
plt.plot(X_test[yy==0,0], X_test[yy==0,1], '.')
plt.plot(X_test[yy==1,0], X_test[yy==1,1], '.')
plt.show()

The structure of our network is now more suited to the problem, and we see that most of the points used for testing were correctly classified.
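To put a number on this, we can also score the model on the test set (the exact figure changes from run to run since both the data and the initial weights are random):
loss, accuracy = mymlp.evaluate(X_test, y_test, verbose=0)
print('accuracy on the test set: %0.2f' % accuracy)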