Pandas: TimeSeries and DataFrames in Python

I start my Pandas introduction by demonstrating how to create and use the core data structures: TimeSeries and DataFrames.

In brief, a Series is an indexed (labeled) vector, and a TimeSeries in pandas is a Series whose index is composed of dates. A DataFrame is more like an indexed matrix (or a collection of Series). Both have labeled rows, but a Series has only one dimension (column) while a DataFrame can have many columns. The data within these structures are stored in numpy ndarray objects, and each column of a DataFrame can hold a different type of data. For someone who knows R, this will seem familiar: an R data.frame is a list of vectors, where each column can be a different type.
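To make the distinction concrete, here is a minimal sketch of a labeled vector and a small table built from several of them. The names (prices, table) and the values are made up for illustration, and I am assuming a recent pandas release.

```python
import pandas

labels = ["mon", "tue", "wed"]

# A Series: one column of values with a label for every row.
prices = pandas.Series([10.5, 11.0, 10.8], index=labels)

# A DataFrame: several Series sharing one index; each column keeps its own type.
table = pandas.DataFrame(
    {
        "price": prices,
        "volume": pandas.Series([100, 150, 120], index=labels),
        "note": pandas.Series(["open", "mid", "close"], index=labels),
    }
)

print(prices["tue"])   # look up a value by its label
print(table.dtypes)    # one dtype per column
```

The float, integer, and text columns sit side by side under the same row labels, much like columns of an R data.frame.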

Wes McKinney has done a remarkable job improving the documentation in the last few weeks, so you should absolutely refer to the pandas documentation as the primary source. This post covers only part of the basics.

It is also convenient to use the pandas unit testing functions. Wes implemented functions in pandas.util.testing that make it trivial to create test data. In most cases, these are simply called "makeObject", such as "makeTimeSeries", and they automatically create the appropriate object for testing. I will mostly skip these below since I want to demonstrate how to create these objects from scratch.

[Note: by convention below, I import modules and call their objects using full names so that everything is explicit. I also use ">>>>" to denote code typed on the console (I realize this makes it harder to copy-paste, but it seems like the clearest way of distinguishing code from output).]

Dates and Times

Date and time objects are often overlooked, yet they are also one of the most important tools when working with time series data. R dates are based on the POSIX convention. Fortunately, Python has the datetime module (http://docs.python.org/library/datetime.html).


>>>> import datetime

>>>> datetime.date(2010, 1, 1)
>>>> datetime.time(9, 0, 0)
>>>> datetime.datetime(2010, 1, 1, 9, 0, 0)

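You can also supply time zones to these objects using the pytz module. With pytz, attach the zone with localize rather than passing it as tzinfo to the constructor, since the constructor route can pick up a historical local-mean-time offset instead of the expected one:


>>>> import pytz

>>>> pytz.timezone('US/Eastern').localize(datetime.datetime(2010, 1, 1, 9, 10))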
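When a fixed offset is enough, the standard library alone can build time-zone-aware objects. This is only a sketch: the UTC-5 offset is my assumption standing in for US/Eastern in January, and it does not follow daylight saving changes the way a pytz zone does.

```python
import datetime

# Assumption: a fixed UTC-5 offset stands in for US/Eastern in winter.
eastern_winter = datetime.timezone(datetime.timedelta(hours=-5), "EST")

local = datetime.datetime(2010, 1, 1, 9, 10, tzinfo=eastern_winter)
as_utc = local.astimezone(datetime.timezone.utc)   # same instant, shown in UTC

print(local.isoformat())
print(as_utc.isoformat())
```

Both objects describe the same moment, so comparing them with == gives True even though their clock readings differ.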

In addition to dates, datetime also includes the timedelta object for working with date arithmetic.


>>>> a = datetime.datetime(2010, 1, 1, 9)
>>>> b = datetime.datetime(2010, 1, 2, 10)
>>>> b - a  # -> datetime.timedelta(1, 3600)
>>>> c = b - a
>>>> a - c  # -> datetime.datetime(2009, 12, 31, 8, 0)

So we can now define different dates using this convenient format, including times and time zones, and find time differences.
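A few more pieces of date arithmetic that I find useful, using only the standard library. The variable names are my own; the first two dates repeat the ones used above.

```python
import datetime

start = datetime.datetime(2010, 1, 1, 9)
end = datetime.datetime(2010, 1, 2, 10)

gap = end - start                    # subtracting datetimes gives a timedelta
hours = gap.total_seconds() / 3600   # the same gap expressed in hours
print(gap.days, gap.seconds, hours)

# A timedelta can also be built directly and added to a date.
one_week_later = start + datetime.timedelta(weeks=1)
print(one_week_later)

# Combine a separate date and time into one datetime.
opening = datetime.datetime.combine(datetime.date(2010, 1, 4), datetime.time(9, 30))
print(opening.isoformat())
```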

TimeSeries

I start by importing numpy and pandas:


>>>> import numpy 
>>>> import pandas 

There are three different ways to create a Series: from numpy arrays, from a dict, or from a scalar value. Here I create a TimeSeries where the index is created using the pandas.DateRange function:


>>>> N = 10
>>>> ts = pandas.TimeSeries(numpy.random.randn(N), pandas.DateRange(datetime.datetime(2010, 1, 1), periods=N))
>>>> len(ts)
10
>>>> ts.cumsum().plot() 
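For reference, here is a sketch of all three construction routes mentioned above. I am assuming a recent pandas release, where pandas.date_range and pandas.Series take the place of the DateRange and TimeSeries names used in this post; the variable names are mine.

```python
import numpy
import pandas

# Assumption: a recent pandas, where date_range builds the date index.
dates = pandas.date_range("2010-01-01", periods=3)

from_array = pandas.Series(numpy.array([1.0, 2.0, 3.0]), index=dates)
from_dict = pandas.Series({"a": 1, "b": 2, "c": 3})   # keys become the index
from_scalar = pandas.Series(5.0, index=dates)         # value repeated per label

print(from_array)
print(from_dict)
print(from_scalar)
```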

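The TimeSeries has the same kind of slicing that you would expect from a numpy array (the output here appears to come from a 20-observation business-day series starting on 2000-01-03, like the one the testing helper produces, so its dates differ from the series created above):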


>>>> ts[:5]
2000-01-03 00:00:00    1.71634399744
2000-01-04 00:00:00    -1.06705409378
2000-01-05 00:00:00    0.380685702287
2000-01-06 00:00:00    0.459558156758
2000-01-07 00:00:00    0.781506778368
>>>> ts[-5:]
2000-01-24 00:00:00    1.04970709349
2000-01-25 00:00:00    -1.39060856079
2000-01-26 00:00:00    0.793533184014
2000-01-27 00:00:00    0.132606725832
2000-01-28 00:00:00    2.83905538586

Series and TimeSeries objects have two primary slots: index, which stores the row labels (or dates), and values, which stores the observations.


>>>> isinstance(ts.index, numpy.ndarray)
True
>>>> isinstance(ts.index, pandas.Index)
True
>>>> isinstance(ts.values, numpy.ndarray)
True

Notice that the index and values are both numpy ndarrays, but that the index is also of type Index. To get the data between two dates, use the truncate function. The dates you supply must be of exactly the same type as the dates in the index, including whether they carry a time zone:


>>>> ts.truncate(datetime.datetime(2000, 1, 4), datetime.datetime(2000, 1, 10))
2000-01-04 00:00:00    -1.06705409378
2000-01-05 00:00:00    0.380685702287
2000-01-06 00:00:00    0.459558156758
2000-01-07 00:00:00    0.781506778368
2000-01-10 00:00:00    -0.444235549297
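The same kind of date subsetting can be sketched with a recent pandas release, which is an assumption on my part since the API has changed since this post was written. The series here uses made-up integer values on a business-day index so the result is easy to check.

```python
import pandas

# Assumption: a recent pandas; freq="B" asks for business days only.
dates = pandas.date_range("2000-01-03", periods=10, freq="B")
ts = pandas.Series(range(10), index=dates)

by_position = ts.iloc[:3]                                    # first three rows
by_truncate = ts.truncate(before="2000-01-04", after="2000-01-10")
by_label = ts.loc["2000-01-04":"2000-01-10"]                 # both ends included

print(by_truncate)
```

Both truncate and the label slice keep the end date, which differs from positional slicing where the stop position is excluded.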

We could alternatively have used the unit testing function to create a TimeSeries of length 20:


>>>> pandas.util.testing.N = 20
>>>> ts = pandas.util.testing.makeTimeSeries()

DataFrame

A DataFrame in Pandas is essentially a dict of TimeSeries objects. It can be constructed from a dict of Series, a 2-D numpy ndarray, a structured/record ndarray, or another DataFrame. For example:


>>>> ts_a = pandas.Series(numpy.random.randn(10), range(10))
>>>> ts_b = pandas.Series(numpy.random.randn(10), range(10))
>>>> df = pandas.DataFrame({"A": ts_a, "B": ts_b})
>>>> df[:2]
     A              B
0    1.453          0.644
1    1.766          0.526
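The other construction routes listed above look roughly like this. It is a sketch with made-up values, assuming a recent pandas release; each route should produce the same 3x2 table.

```python
import numpy
import pandas

values = numpy.arange(6.0).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

# From a 2-D ndarray, naming the columns explicitly.
from_array = pandas.DataFrame(values, columns=["A", "B"])

# From a dict of Series, one Series per column.
from_dict = pandas.DataFrame(
    {"A": pandas.Series(values[:, 0]), "B": pandas.Series(values[:, 1])}
)

# From a structured ndarray: the field names become the column names.
records = numpy.array(
    [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)], dtype=[("A", "f8"), ("B", "f8")]
)
from_records = pandas.DataFrame(records)

# From another DataFrame.
from_frame = pandas.DataFrame(from_array)

print(from_records)
```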

DataFrames are very similar to Series in many operations, and they also have an index and values:


>>>> df.index
Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
>>>> df.values[:2]
array([[ 1.45253467,  0.64401451],
       [ 1.76571549,  0.52602833]])

We can use truncate to subset the DataFrame. Arithmetic operations align on matching index labels and column names. We can add additional columns by using the join function:


>>>> df.join(pandas.DataFrame({"A + B": ts_a + ts_b}))[:5]
     A              A + B          B
0    1.453          2.097          0.644
1    1.766          2.292          0.526
2    0.02976        0.491          0.4612
3    1.213          1.477          0.2642
4    -0.2468        0.4903         0.7372
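To see what label alignment means in practice, here is a small sketch where the two Series only partly overlap. The values are made up and I am assuming a recent pandas release; the column assignment at the end is an alternative to join that I am adding for comparison.

```python
import pandas

a = pandas.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pandas.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# Addition pairs values by label; labels missing on either side give NaN.
total = a + b

df = pandas.DataFrame({"A": a, "B": b})
joined = df.join(pandas.DataFrame({"A + B": total}))

# Assigning a new column directly gives the same values as the join.
df["A + B"] = df["A"] + df["B"]

print(joined)
```

Labels 0 and 3 appear in only one of the two Series, so their sums are missing rather than silently computed from mismatched rows.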

This just scratches the surface. I will walk through a complete example of getting a real-world dataset and doing some basic analysis in the next post.


