Mucking around with ARIMA and Bitcoin price fluctuations

ARIMA and bitcoin Jupyter Notebook

Note that this notebook can also be found here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA, ARIMAResults
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from scipy import signal
import scipy.stats as scs
%matplotlib inline

The bitcoin dollar price data was downloaded from Quandl. I chose the set from the Bitstamp Exchange

df2 = pd.read_csv('data/bitstamp.txt', sep='\t')

df2.set_index(pd.DatetimeIndex(df2['Timestamp']), inplace=True, drop=True)

del df2['Timestamp']

df2.head(2)

	Open	High	Low	Close	Volume (BTC)	Volume (Currency)	Weighted Price
Timestamp
2014-02-16	655.83	670	590.01	621	26397.57	16510289.03	625.45
2014-02-17	621	664.99	610	633.66	19902.44	12681178.75	637.17

# Change to numeric data
df2['Open'] = pd.to_numeric(df2['Open'], errors='coerce')
df2['High'] = pd.to_numeric(df2['High'], errors='coerce')
df2['Low'] = pd.to_numeric(df2['Low'], errors='coerce')
df2['Close'] = pd.to_numeric(df2['Close'], errors='coerce')
df2['Volume (BTC)'] = pd.to_numeric(df2['Volume (BTC)'], errors='coerce')
df2['Volume (Currency)'] = pd.to_numeric(df2['Volume (Currency)'], errors='coerce')
df2['Weighted Price'] = pd.to_numeric(df2['Weighted Price'], errors='coerce')

df2.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1460 entries, 2014-02-16 to 2018-02-14
Data columns (total 7 columns):
Open                 1457 non-null float64
High                 1457 non-null float64
Low                  1457 non-null float64
Close                1457 non-null float64
Volume (BTC)         1457 non-null float64
Volume (Currency)    1457 non-null float64
Weighted Price       1457 non-null float64
dtypes: float64(7)
memory usage: 91.2 KB

Resampled Data for weekly average - used Weighted Price

# Resampled to get weekly mean of 'weighted price'
weekly_bit = df2['Weighted Price'].resample('W').mean()

### Plotting weekly price, differenced data, log-transformed data, and differenced log-transformed data

plt.style.use('ggplot')
fig, ax = plt.subplots(4, 1, figsize=(14, 8))
ax[0].plot(weekly_bit.index, weekly_bit)
ax[0].set_title('Bitcoin Exchange Rate (Bitstamp)')
ax[0].grid(False)
ax[1].plot(weekly_bit.index, weekly_bit.diff(), color='b')
ax[1].set_title('Differenced Data')
ax[1].grid(False)
ax[2].plot(weekly_bit.index, np.log(weekly_bit))
ax[2].set_title('Log Transformed')
ax[2].grid(False)
ax[3].plot(weekly_bit.index, np.log(weekly_bit).diff(), color='k')
ax[3].set_title('Log Transformed and Differenced')
ax[3].grid(False)
# ax.set_title('Philippine Peso Weekly Exchange Rate')
plt.tight_layout()
# plt.savefig('data/bitcoin_transformation.png')

png

Log transformation and differencing appear to have made the time series stationary.

ADFuller test for stationarity confirms this.

untransformed = sm.tsa.stattools.adfuller(weekly_bit)
diffed_week = sm.tsa.stattools.adfuller(weekly_bit.diff()[1:])
logged_week = sm.tsa.stattools.adfuller(np.log(weekly_bit))
log_diff_week = sm.tsa.stattools.adfuller(np.log(weekly_bit).diff()[1:])
log_diff2 = sm.tsa.stattools.adfuller(np.log(weekly_bit).diff(2)[2:])

print 'ADFuller Test for Stationarity'
print 'Untransformed Series {}'.format(untransformed[1])
print 'Differenced (1) {}'.format(diffed_week[1])
print 'Log Transformation {}'.format(logged_week[1])
print 'Log Transformation and Differencing(1) {}'.format(log_diff_week[1])
print 'Log Transformation and Differencing(2) {}'.format(log_diff2[1])

ADFuller Test for Stationarity
Untransformed Series 1.0
Differenced (1) 0.741516170524
Log Transformation 0.994334841317
Log Transformation and Differencing(1) 3.01872830745e-18
Log Transformation and Differencing(2) 2.07801212847e-05

# the differenced log-transformed data had the lowest ADFuller score
log_weekly_bit = np.log(weekly_bit)

plot_pacf(log_weekly_bit.diff(2)[2:50])
plt.grid(False)
# plt.savefig('data/bitcoin_pacf.png')

png

plot_acf(log_weekly_bit.diff(2)[2:50])
plt.grid(False)
plt.savefig('data/bitcoin_acf.png')

png

Autocorrelation and partial autocorrelation plots suggests an optimal AR(2) and MA(1) is sufficient with differencing of 1

Also performed a training and testing set split of the data. Trained on the first 85% and will validate on the remaining 15% of the series.

However, optimization runs showed that when the (p,d,q) values of (1,2,1) was used we obtained the lowest AIC, BIC scores and best p-values for coeffs

idx3 = int(len(log_weekly_bit) * 0.85)
train3 = log_weekly_bit[:idx3]
test3 = log_weekly_bit[idx3:]
model3 = ARIMA(train3, order=(1,2,1))
model_fit3 = model3.fit(disp=0)
print(model_fit3.summary())
# plot residual errors
residuals3 = pd.DataFrame(model_fit3.resid)
residuals3.plot()
plt.title('Residuals')
plt.show()
residuals3.plot(kind='kde')
plt.title('kde plot of Residuals')
# pyplot.show()
print(residuals3.describe())

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:      D2.Weighted Price   No. Observations:                  176
Model:                 ARIMA(1, 2, 1)   Log Likelihood                 218.904
Method:                       css-mle   S.D. of innovations              0.069
Date:                Thu, 01 Mar 2018   AIC                           -429.807
Time:                        23:56:20   BIC                           -417.125
Sample:                    03-02-2014   HQIC                          -424.664
                         - 07-09-2017                                         
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                       0.0004      0.000      2.930      0.004       0.000       0.001
ar.L1.D2.Weighted Price     0.1818      0.074      2.447      0.015       0.036       0.327
ma.L1.D2.Weighted Price    -1.0000      0.015    -65.765      0.000      -1.030      -0.970
                                    Roots                                    
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            5.5018           +0.0000j            5.5018            0.0000
MA.1            1.0000           +0.0000j            1.0000            0.0000
-----------------------------------------------------------------------------

png

                0
count  176.000000
mean    -0.000787
std      0.070537
min     -0.230235
25%     -0.040174
50%     -0.004960
75%      0.035439
max      0.216069

png

from sklearn.metrics import mean_squared_error

""" Setting arrays for true values in the test set and predicted/forecast values from the model """
truth = test3.head(32).values
forecast = (model_fit3.forecast(steps=32))
pred = forecast[0]

rmse = np.sqrt(mean_squared_error(truth, pred))
print 'Root MSE {}'.format(rmse)

Root MSE 0.353034310708

fig, ax = plt.subplots(figsize=(14,8))
fig = model_fit3.plot_predict(start='2014-03-02', end='2018-02-18', ax=ax)
legend = ax.legend(loc='upper left', fontsize='large')
ax.set_ylabel('Log Bitcoin Price',fontsize=14, fontweight='bold')
ax.set_title('Weekly Bitcoin Mean USD Price - ARIMA Forecasting', fontsize=14, fontweight='bold' )
ax.grid(False)
# plt.savefig('data/confidence.png')

png

Using the ARIMA forecast function - Looking at how the forecast price (untransformed) compares to the truth

plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(14,8))
x_val = range(len(truth))
ax.scatter(x_val, np.exp(truth), c='blue', label='True', alpha=0.6)
ax.scatter(x_val, np.exp(pred), c='red', alpha=0.4, label='Predicted')
ax.set_xlabel('Weeks after end of training dataset (week of 2017-07-09)', fontsize=14, fontweight='bold')
ax.set_ylabel('Bitcoin Price', fontsize=14, fontweight='bold')
ax.legend(fontsize='large', markerscale=1)
ax.grid(False)
ax.set_title('ARIMA Forecasting - Predicted versus True Bitcoin Price', fontsize=14, fontweight='bold' )
ax.text(28.5, 15600, 'RMSE = 3552.9', fontsize=12)
# plt.savefig('data/forecasting.png')

<matplotlib.text.Text at 0x112e2e810>

png

Stepwise Forecasting/Retraining ARIMA model

The idea is that most accurate predictions occur in the first few time points past the training set
This snippet will forecast only to the next time point
Afterwards, the true price for that time point is added back to the series and the ARIMA model is retrained.
The retrained model will then forecast the following time point and the retraining is performed again after every forecasting step.

from collections import deque
moving = list(train3)
mov_test = deque(test3)
pred_steps =[]
for n in range(len(test3)):
    model = ARIMA(moving, order=(1,2,1))
    model_fit = model.fit(disp=0)
    forecast_one = model_fit.forecast(steps=1) # forecast the next time point only
    pred_steps.append(forecast_one[0])
    z = mov_test.popleft()
    moving.append(z) #update the training data with next time point

#rmse of stepwise
np.sqrt(mean_squared_error(np.exp(test3), np.exp(pred_steps)))

1252.7893334584987

plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(14,8))
x_val = range(len(test3))
ax.scatter(x_val, np.exp(test3), c='blue', label='True', alpha=0.6)
ax.scatter(x_val, np.exp(pred_steps), c='red', alpha=0.4, label='Predicted')
ax.set_xlabel('Weeks after end of training dataset (week of 2017-07-09)', fontsize=12, fontweight='bold')
ax.set_ylabel('Bitcoin Price', fontsize=14, fontweight='bold')
ax.legend(fontsize='large', markerscale=1)
ax.grid(False)
ax.set_title('ARIMA Stepwise Forecasting - Predicted versus True Bitcoin Price', fontsize=14, fontweight='bold' )
ax.text(28.5, 17600, 'RMSE = 1252.8', fontsize=12)
# plt.savefig('data/forecasting_stepwise.png')

<matplotlib.text.Text at 0x111b67b10>

png

Conclusion - ARIMA time series modeling can provide a decent forecasting tool. If the forecasting algorithm is set up so that the model can be continually updated, then the forecasts for the immediate time points are acceptable. I will revisit this time series model with unseen new data. This particular time series ended in mid-February 2018 and I will wait a few weeks to do the two forecasting analysis again.