Trying again to dip my toes into advanced data science, I decided to experiment with the Gaussian processes module in sci-kit learn. I’ve been working with barometric data to study dust devils, and that work involves spotting short dips in otherwise slowly varying time series.
In principle, Gaussian processes provides a way to model the slowly varying portion of the time series. Basically, such an analysis assumes the noise infesting each data point depends a little bit on the value of other nearby data points. The technical way to say this is that the covariance matrix for the data stream is non-diagonal.
So I loaded one data file into an ipython notebook and applied the sci-kit learn Gaussian processes module to model out background oscillations. Here’s the notebook.
%matplotlib inline #2015 Feb 15 -- A lot of this code was adapted from # http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_regression.html. import numpy as np from sklearn.gaussian_process import GaussianProcess from matplotlib import pyplot as pl import seaborn as sns import pandas as pd sns.set(palette="Set2") #from numpy import genfromtxt my_data = np.genfromtxt('Location-A_P28_DATA-003.CSV', delimiter=',', skip_header=7, usecols=(0, 1), names=['time', 'pressure']) X = np.atleast_2d(np.array(my_data['time'])[0:1000]).T y = np.atleast_2d(np.array(my_data['pressure'])[0:1000]).T y -= np.median(y) # Instanciate a Gaussian Process model gp = GaussianProcess(theta0=1e-2, thetaL=abs(y-y), thetaU=np.std(y), nugget=1e-3) # Fit to data using Maximum Likelihood Estimation of the parameters gp.fit(X, y) # Make the prediction on the meshed x-axis (ask for MSE as well) y_pred, MSE = gp.predict(X, eval_MSE=True) sigma = np.sqrt(MSE) data = pd.DataFrame(dict(time=X[:,0], pres=y[:,0])) sns.lmplot("time", "pres", data=data, color='red', fit_reg=False, size=10) predicted_data = pd.DataFrame(dict(time=X[:,0], pres=y_pred[:,0])) pl.plot(X, y_pred, color='blue')
Unfortunately, the time series has some large jumps in it, and these are not well described by the slowly varying Gaussian process. What causes these jumps is a good question, but for the purposes of this little analysis, they are a source of trouble.
Probably need to pursue some other technique. Not to mention that the time required to perform a Gaussian process analysis scales with the third power of the number of data points, so it will get very slow very fast.