Trying again to dip my toes into advanced data science, I decided to experiment with the Gaussian processes module in sci-kit learn. I’ve been working with barometric data to study dust devils, and that work involves spotting short dips in otherwise slowly varying time series.

In principle, Gaussian processes provides a way to model the slowly varying portion of the time series. Basically, such an analysis assumes the noise infesting each data point depends a little bit on the value of other nearby data points. The technical way to say this is that the covariance matrix for the data stream is non-diagonal.

So I loaded one data file into an ipython notebook and applied the sci-kit learn Gaussian processes module to model out background oscillations. Here’s the notebook.

```
<span class="o">%</span><span class="k">matplotlib</span> <span class="n">inline</span>
<span class="c">#2015 Feb 15 -- A lot of this code was adapted from </span>
<span class="c"># http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_regression.html.</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.gaussian_process</span> <span class="kn">import</span> <span class="n">GaussianProcess</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="n">sns</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">"Set2"</span><span class="p">)</span>
<span class="c">#from numpy import genfromtxt</span>
<span class="n">my_data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">genfromtxt</span><span class="p">(</span><span class="s">'Location-A_P28_DATA-003.CSV'</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">','</span><span class="p">,</span> <span class="n">skip_header</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'time'</span><span class="p">,</span> <span class="s">'pressure'</span><span class="p">])</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">my_data</span><span class="p">[</span><span class="s">'time'</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1000</span><span class="p">])</span><span class="o">.</span><span class="n">T</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">my_data</span><span class="p">[</span><span class="s">'pressure'</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1000</span><span class="p">])</span><span class="o">.</span><span class="n">T</span>
<span class="n">y</span> <span class="o">-=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c"># Instanciate a Gaussian Process model</span>
<span class="n">gp</span> <span class="o">=</span> <span class="n">GaussianProcess</span><span class="p">(</span><span class="n">theta0</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">thetaL</span><span class="o">=</span><span class="nb">abs</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">-</span><span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">thetaU</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">nugget</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>
<span class="c"># Fit to data using Maximum Likelihood Estimation of the parameters</span>
<span class="n">gp</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="c"># Make the prediction on the meshed x-axis (ask for MSE as well)</span>
<span class="n">y_pred</span><span class="p">,</span> <span class="n">MSE</span> <span class="o">=</span> <span class="n">gp</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">eval_MSE</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">MSE</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">time</span><span class="o">=</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">pres</span><span class="o">=</span><span class="n">y</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">lmplot</span><span class="p">(</span><span class="s">"time"</span><span class="p">,</span> <span class="s">"pres"</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">fit_reg</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">predicted_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">time</span><span class="o">=</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">pres</span><span class="o">=</span><span class="n">y_pred</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">pl</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
```

Unfortunately, the time series has some large jumps in it, and these are not well described by the slowly varying Gaussian process. What causes these jumps is a good question, but for the purposes of this little analysis, they are a source of trouble.

Probably need to pursue some other technique. Not to mention that the time required to perform a Gaussian process analysis scales with the third power of the number of data points, so it will get very slow very fast.