Archives

All posts for the month February, 2015

Trying again to dip my toes into advanced data science, I decided to experiment with the Gaussian processes module in sci-kit learn. I’ve been working with barometric data to study dust devils, and that work involves spotting short dips in otherwise slowly varying time series.

In principle, Gaussian processes provides a way to model the slowly varying portion of the time series. Basically, such an analysis assumes the noise infesting each data point depends a little bit on the value of other nearby data points. The technical way to say this is that the covariance matrix for the data stream is non-diagonal.

So I loaded one data file into an ipython notebook and applied the sci-kit learn Gaussian processes module to model out background oscillations. Here’s the notebook.

In [32]:
<span class="o">%</span><span class="k">matplotlib</span> <span class="n">inline</span>
<span class="c">#2015 Feb 15 -- A lot of this code was adapted from </span>
<span class="c">#  http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_regression.html.</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.gaussian_process</span> <span class="kn">import</span> <span class="n">GaussianProcess</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="n">sns</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="n">palette</span><span class="o">=</span><span class="s">"Set2"</span><span class="p">)</span>

<span class="c">#from numpy import genfromtxt</span>

<span class="n">my_data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">genfromtxt</span><span class="p">(</span><span class="s">'Location-A_P28_DATA-003.CSV'</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">','</span><span class="p">,</span> <span class="n">skip_header</span><span class="o">=</span><span class="mi">7</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s">'time'</span><span class="p">,</span> <span class="s">'pressure'</span><span class="p">])</span>

<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">my_data</span><span class="p">[</span><span class="s">'time'</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1000</span><span class="p">])</span><span class="o">.</span><span class="n">T</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">my_data</span><span class="p">[</span><span class="s">'pressure'</span><span class="p">])[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1000</span><span class="p">])</span><span class="o">.</span><span class="n">T</span>
<span class="n">y</span> <span class="o">-=</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

<span class="c"># Instanciate a Gaussian Process model</span>
<span class="n">gp</span> <span class="o">=</span> <span class="n">GaussianProcess</span><span class="p">(</span><span class="n">theta0</span><span class="o">=</span><span class="mf">1e-2</span><span class="p">,</span> <span class="n">thetaL</span><span class="o">=</span><span class="nb">abs</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">-</span><span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">thetaU</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">nugget</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">)</span>

<span class="c"># Fit to data using Maximum Likelihood Estimation of the parameters</span>
<span class="n">gp</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>

<span class="c"># Make the prediction on the meshed x-axis (ask for MSE as well)</span>
<span class="n">y_pred</span><span class="p">,</span> <span class="n">MSE</span> <span class="o">=</span> <span class="n">gp</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">eval_MSE</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">MSE</span><span class="p">)</span>

<span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">time</span><span class="o">=</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">pres</span><span class="o">=</span><span class="n">y</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">sns</span><span class="o">.</span><span class="n">lmplot</span><span class="p">(</span><span class="s">"time"</span><span class="p">,</span> <span class="s">"pres"</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">fit_reg</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

<span class="n">predicted_data</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">dict</span><span class="p">(</span><span class="n">time</span><span class="o">=</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">pres</span><span class="o">=</span><span class="n">y_pred</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]))</span>
<span class="n">pl</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
Barometric time series. Pressure in hPa, and time in seconds.

Barometric time series. Pressure in hPa, and time in seconds. The red dots show the original data, and the blue line the fit from the Gaussian process.

Unfortunately, the time series has some large jumps in it, and these are not well described by the slowly varying Gaussian process. What causes these jumps is a good question, but for the purposes of this little analysis, they are a source of trouble.

Probably need to pursue some other technique. Not to mention that the time required to perform a Gaussian process analysis scales with the third power of the number of data points, so it will get very slow very fast.

We had an abbreviated meeting, as everyone (including myself) seems to have been unusually busy this week. No specific research updates, but we discussed two interesting articles of science news. Attendees were Liz Kandziolka, Jennifer Briggs, Emily Jensen, Brenton Peck, Nathan Grigsby, Trent Garrett, and Prof. Daryl Macomb.

The first article we discussed, “Cosmology from Quantum Potential“, proposed an origin for the universe that represents an alternative to the big bang. We didn’t really understand the theory, but it seems to involve the idea that universe doesn’t really have a beginning. We puzzled over whether the theory actually proposes any specific testables or observables.

The second article we discussed, “Titan Submarine : Vehicle Design and Operations Concept for the Exploration of the Hydrocarbon Seas of Saturn’s Giant Moon“, suggests a plan to send a submersible to plumb the depths of Titan’s seas, the only seas in our solar system besides those on Earth. Studying Titan’s seas may teach us about the pre-biotic chemistry in the Earth’s early oceans. And as with all exciting scientific work, this study comes with an animation and dramatic sound track.

The tightly packed system, named Kepler-444, is home to five small planets in very compact orbits. The planets were detected from the dimming that occurs when they transit the disc of their parent star, as shown in this artist's conception. From http://www.nasa.gov/ames/kepler/astronomers-discover-ancient-system-with-five-small-planets/.

The tightly packed system, named Kepler-444, is home to five small planets in very compact orbits. The planets were detected from the dimming that occurs when they transit the disc of their parent star, as shown in this artist’s conception. From http://www.nasa.gov/ames/kepler/astronomers-discover-ancient-system-with-five-small-planets/.

In journal club on Friday, we discussed a fascinating paper from Campante and colleagues announcing discovery of one of the oldest planetary systems ever discovered — Kepler-444. The system comprises five planets, ranging from roughly Mercury- to Venus-sized with orbital periods from about 3 to 9 days.

Studying the frequencies of oscillations within the K-dwarf host star (an approach known as asteroseismology), Campante et al. estimate the host star, and therefore probably the planets, is about 11 billion years old, almost as old as the Milky Way galaxy itself.

To put that age into perspective, by the time our solar system formed, about 5 billion years ago, the Kepler-444 was already a billion years older than our solar system is now.

The existence of such an old system tells us that rocky planets began forming almost as soon as the Milky Way itself formed, which allows for the exciting possibility of very ancient life in the galaxy.

Present at journal club were Jennifer Briggs, Trent Garrett, Nathan Grisgby, Emily Jensen, Liz Kandziolka, Brenton Peck, and Jacob Sabin.

Pressure variations (in hectoPascal, hPa) vs. local time for one dust devil pressure dip. The blue curve shows our model fit.

Pressure variations (in hectoPascal, hPa) vs. local time for one dust devil pressure dip. The blue curve shows our model fit.

Dust devils occur in arid climates on the Earth and ubiquitously on Mars. Martian dust devils have been studied with orbiting and landed spacecraft, while most studies of terrestrial dust devils have involved manned monitoring of field sites, which can be costly both in time and personnel. As an alternative approach, my colleague Ralph Lorenz and I performed a multi-year in-situ survey of terrestrial dust devils using pressure loggers deployed at El Dorado Playa in Nevada, USA, a site known for dust devil activity.

When a dust devil passed over our pressure sensors, it appeared as a pressure dip in the time series, as illustrated in the figure. By modeling these signals, we learned a lot of about dust devils. For instance, in spite of expectations, we found signals that looked a lot like dust devils that occurred at night and even in the winter. So do dust devils happen year-round, day and night? More work will help us figure it out.

Our paper on this study will appear soon in the Journal of Geophysical Research Planets.

From https://emps.exeter.ac.uk/physics-astronomy/research/astrophysics/phd-opportunities/modelling-shock-waves/.

From https://emps.exeter.ac.uk/physics-astronomy/research/astrophysics/phd-opportunities/modelling-shock-waves/.

On Friday, everyone in our research group gave a little update on what they’ve been up to.

Liz and Jennifer talked about Parmentier et al.’s (2013) paper on the meteorology of hot Jupiters and how condensates are transported throughout these dynamic atmospheres.

Emily talked about working through the first few chapters of Murray & Dermott’s classic Solar System Dynamics. She will eventually study the orbital dynamics of systems of exoplanets very close to their host stars.

Brenton discussed his reading of Balme & Greeley (2006) on dust devils in preparation for working with me on terrestrial and Martian dust devils. A very exciting possibility, Brenton and the rest of the group said dust devils are common just south of Boise. Good chance we can do some in-situ monitoring locally.

Nathan spoke briefly about looking for more very short-period planets using data from the Kepler and K2 missions.

In attendance were Liz Kandziolka, Jennifer Briggs, Emily Jensen, Brenton Peck, Nathan Grigsby, Trent Garrett, and Tiffany Watkins.