# Data Analysis
## NumPy
NumPy is a library that adds support for large, multi-dimensional arrays and matrices to the Python programming language. It also offers a large collection of high-level mathematical functions to operate on these arrays. NumPy is free software released under the three-clause BSD license. It is the foundation of many other Python libraries for scientific computing - including pandas. Travis Oliphant is NumPy's original author.
### ndarray
At the heart of NumPy lies its n-dimensional array object. Each ndarray has a shape, a tuple indicating the size of each dimension.

In [None]:
import numpy as np
a = np.arange(0, 24)
print(a)
print(a.ndim)
print(a.shape)

In [None]:
a = a.reshape(6,4)
print(a)
print(a.ndim)
print(a.shape)

Arrays can be initialized from Python sequences. Alternatively, they can be created via `arange`, `zeros`, `ones` and `empty`.

In [None]:
np.array((1, 4, 2))

In [None]:
np.array((1, 4, 2), dtype=np.float64)

In [None]:
np.zeros(12).reshape(3, 4)

In [None]:
np.ones(12).reshape(3, 4)

In [None]:
np.empty(12).reshape(3, 4)
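
As a small sketch of the creation functions named above: `zeros`, `ones` and `empty` also accept a shape tuple, so the `reshape` step can be skipped. The variable names are illustrative only.

```python
import numpy as np

# Passing a shape tuple creates the 2-D array directly, without reshape.
z = np.zeros((3, 4))
o = np.ones((3, 4), dtype=np.int64)

# np.empty only allocates memory; its contents are arbitrary until assigned.
e = np.empty((3, 4))
e[:] = 7.0  # fill the array before reading from it

print(z.shape, o.dtype, e[0, 0])
```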

Operations between equal-sized arrays are applied elementwise.

In [None]:
a * a

In [None]:
a - a

In [None]:
1 / a

In [None]:
a == 1
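
The same elementwise behaviour can be sketched on two tiny arrays whose results are easy to check by hand. Note that `a` above contains a zero, so `1 / a` yields `inf` at that position together with a `RuntimeWarning`; the `np.errstate` context manager used here is one way to silence it.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 2.0, 2.0])

product = x * y      # multiplies matching positions
quotient = x / y     # divides matching positions
is_two = x == 2.0    # comparisons also yield an array, here of booleans

# Dividing by an array that contains 0 yields inf plus a RuntimeWarning;
# np.errstate silences the warning inside this block only.
with np.errstate(divide='ignore'):
    reciprocal = 1 / np.array([0, 1, 2])

print(product, quotient, is_two, reciprocal)
```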

### Indexing and Slicing
One of the most important distinctions from Python lists is that array slices are views on the original array (instead of copies).

In [None]:
row = a[3]
row[-1] = 99
a
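
To make the contrast with Python lists concrete, here is a minimal sketch that modifies a list slice and an array slice side by side. It also shows `copy()`, which is the usual way to get an independent array when a view is not wanted.

```python
import numpy as np

# Slicing a list copies the elements.
lst = [0, 1, 2, 3]
lst_slice = lst[1:3]
lst_slice[0] = 99        # lst itself is unchanged

# Slicing an array returns a view on the same memory.
arr = np.arange(4)
view = arr[1:3]
view[0] = 99             # arr is modified as well

# An explicit copy is independent of the original.
independent = arr[1:3].copy()
independent[0] = -1      # arr is unchanged by this assignment

print(lst, arr, independent)
```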

Assignment of a scalar to a view is broadcast to the entire range.

In [None]:
a[1:4][:] = 0
a

In [None]:
a[4, 0]

A boolean array can be passed for indexing.

In [None]:
homeworks = np.array(('hw_1', 'hw_2', 'hw_3', 'hw_4', 'hw_5', 'hw_6'))
print(homeworks == 'hw_5')
a[homeworks == 'hw_5']

In [None]:
r = np.random.randn(6, 4)
r

In [None]:
print(r < 0)
r[r < 0] = 0
r
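
Boolean indexing can be sketched on a small array with known values. The `scores` and `names` arrays below are made up for illustration: a mask built from one array selects rows of another, and a mask on the left-hand side of an assignment modifies the array in place.

```python
import numpy as np

# Hypothetical data: one row of scores per student.
scores = np.array([[9, 7], [4, 8], [6, 10]])
names = np.array(['Pat', 'Chris', 'Sue'])

mask = names == 'Chris'      # one boolean per row
chris_rows = scores[mask]    # keeps only the rows where mask is True

low = scores < 7             # boolean array with the same shape as scores
scores[low] = 0              # assignment through a mask changes scores itself

print(mask, chris_rows, scores, sep='\n')
```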

### Fancy Indexing
Fancy indexing uses integer arrays (or lists) as indices. Unlike slicing, it always creates a copy of the data.

In [None]:
r[[-1, 4, 0]]

In [None]:
t = r[[-1, 4, 0]]
t[:] = 0
t

In [None]:
r
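
The copy semantics can be checked on a small array. This sketch also shows the one case that does change the original: fancy indexing on the left-hand side of an assignment.

```python
import numpy as np

base = np.arange(12).reshape(4, 3)

picked = base[[3, 0]]    # rows 3 and 0, in that order, as a new array
picked[:] = -1           # only the copy changes
untouched = base.copy()  # snapshot showing base still holds 0..11

base[[3, 0]] = 0         # assigning through the index does modify base

print(picked, untouched, base, sep='\n')
```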

### Universal Functions

In [None]:
np.sqrt(r)

In [None]:
r2 = np.random.randn(6, 4)
r2

In [None]:
np.maximum(r, r2)

Counting True values:

In [None]:
(r2 > 0).sum()
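
Counting works because `True` is treated as 1 and `False` as 0 when summed. A short sketch on fixed values, also showing `np.count_nonzero` and the fraction of matches via `mean`:

```python
import numpy as np

values = np.array([-1.5, 0.2, 3.0, -0.1])
positive = values > 0

count = positive.sum()                    # True counts as 1
same_count = np.count_nonzero(positive)   # equivalent, more explicit
fraction = positive.mean()                # share of True values

print(count, same_count, fraction)
```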

### Conditional Logic

In [None]:
xs = np.arange(0, 1, 0.1)
ys = np.arange(1, 2, 0.1)
cs = np.array((True, False, False, True, False, True, True, True, False, True))
print(xs, ys, cs, sep='\n')

In [None]:
[(x if c else y) for x, y, c in zip(xs, ys, cs)]

In [None]:
np.where(cs, xs, ys)

In [None]:
np.where(r2 > 0, 'Chris', 'Pat')
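
The equivalence between the list comprehension and `np.where` can be verified on a shortened version of the arrays above. This is only a sketch with three elements:

```python
import numpy as np

xs = np.array([0.0, 0.1, 0.2])
ys = np.array([1.0, 1.1, 1.2])
cs = np.array([True, False, True])

# Pure Python: pick x where the condition holds, otherwise y.
looped = [x if c else y for x, y, c in zip(xs, ys, cs)]

# Vectorized equivalent.
vectorized = np.where(cs, xs, ys)

# Scalars are allowed in place of the second and third argument.
flags = np.where(vectorized > 0.5, 1, 0)

print(looped, vectorized, flags)
```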

### Statistical Operations

In [None]:
r2.mean()

In [None]:
print(r2)
r2.cumsum()
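
Without arguments, methods such as `mean` and `cumsum` operate on the flattened array. Most of them also accept an `axis` argument; the following sketch uses a small integer matrix so the results can be checked by hand.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

overall = m.mean()             # mean of all six values
col_means = m.mean(axis=0)     # one value per column
row_sums = m.sum(axis=1)       # one value per row
running = m.cumsum()           # cumulative sum over the flattened array
running_cols = m.cumsum(axis=0)  # cumulative sum down each column

print(overall, col_means, row_sums, running, running_cols, sep='\n')
```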

### Sorting

In [None]:
r = np.random.randn(6, 4)
r

In [None]:
r.sort()
r

In [None]:
r.sort(0)
r
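
The `sort` method works in place, along the last axis by default (each row is sorted), or along the given axis. As a sketch of the alternatives, `np.sort` returns a sorted copy and `np.argsort` returns the ordering instead of the values:

```python
import numpy as np

m = np.array([[3, 1, 2],
              [9, 7, 8]])
sorted_copy = np.sort(m)     # sorts each row; m itself stays unchanged

m2 = np.array([[3, 1],
               [1, 2],
               [2, 0]])
m2.sort(axis=0)              # in place: each column is sorted independently

order = np.argsort(np.array([30, 10, 20]))  # indices that would sort the array

print(sorted_copy, m, m2, order, sep='\n')
```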

### Linear Algebra
Matrix operations such as inverse and determinant use the same industry-standard Fortran libraries (e.g. BLAS and LAPACK) that other matrix languages rely on, including some of NumPy's commercial counterparts.

In [None]:
X = np.arange(6).reshape(3, 2)
X

In [None]:
Y = np.arange(10, 4, -1).reshape(2, 3)
Y

In [None]:
Z = X.dot(Y)
Z

In [None]:
M = np.matrix(Z)
M.I

In [None]:
from numpy.linalg import inv
inv(Z)
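
One caveat about the example above: the product of a 3×2 and a 2×3 matrix has rank at most 2, so the 3×3 matrix `Z` is singular and any "inverse" computed for it is not meaningful. The sketch below therefore uses a small invertible matrix `A` (chosen here for illustration) to show the determinant, the inverse, and `np.linalg.solve`, which is usually preferable to forming the inverse explicitly.

```python
import numpy as np

# Z from the text is rank-deficient.
X = np.arange(6).reshape(3, 2)
Y = np.arange(10, 4, -1).reshape(2, 3)
rank_z = np.linalg.matrix_rank(X @ Y)

# An invertible 2x2 matrix: det = 4*6 - 7*2 = 10.
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
det = np.linalg.det(A)
A_inv = np.linalg.inv(A)
identity = A @ A_inv          # close to the identity matrix

# Solve A x = b without computing the inverse.
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)

print(rank_z, det, A_inv, x, sep='\n')
```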

### Random Numbers
NumPy provides generators for random numbers following different distributions.

In [None]:
np.random.normal(size=(6, 4))

In [None]:
np.random.chisquare(3, size=(6, 4))
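
For reproducible results, the generator can be seeded. The following sketch uses `np.random.default_rng`, the generator interface available in newer NumPy versions; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
normal = rng.normal(loc=0.0, scale=1.0, size=(6, 4))
chi2 = rng.chisquare(df=3, size=(6, 4))

# A second generator with the same seed reproduces the same first draw.
rng_again = np.random.default_rng(seed=42)
normal_again = rng_again.normal(size=(6, 4))

print(normal.shape, chi2.min() >= 0, np.array_equal(normal, normal_again))
```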

## pandas

This library offers facilities for data manipulation and analysis. It contains the `Series` and `DataFrame` data structures that allow efficient manipulation and analysis of one- and two-dimensional data. Pandas is free software released under the three-clause BSD license. Its name is derived from "panel data", an econometrics term for multidimensional structured data sets. Pandas was originally written by Wes McKinney.

In [None]:
import pandas as pd
from pandas import DataFrame, Series

### Series and DataFrame
The `Series` and `DataFrame` types are close relatives of the NumPy `ndarray` but allow for labelling the data. The labels are collectively called the index. Since a `Series` is one-dimensional, it can be compared to a Python `dict`.

In [None]:
s = Series((4, 2, 3, 5, 7))
s

In [None]:
s.values

In [None]:
s.index

In [None]:
s = Series((4, 2, 3, 5, 7), index=['Sue', 'Pat', 'Chris', 'John', 'Stu'])
s

In [None]:
s[['Chris', 'John', 'Sue']]
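
The comparison with a `dict` can be sketched as follows. A `Series` can be built from a `dict`, supports lookup by label and the `in` operator, and additionally offers vectorized operations that align on the labels. The names and values are illustrative.

```python
import pandas as pd

points = pd.Series({'Sue': 4, 'Pat': 2, 'Chris': 3})

pat = points['Pat']             # lookup by label, like a dict
has_chris = 'Chris' in points   # membership test on the index
high = points[points > 2]       # boolean indexing, like an ndarray

# Arithmetic aligns on labels; fill_value covers labels missing on one side.
bonus = pd.Series({'Pat': 1, 'Sue': 1})
total = points.add(bonus, fill_value=0)

print(pat, has_chris, high, total, sep='\n')
```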

A `DataFrame` represents tabular data, much like a sheet in a spreadsheet program or a single table in a database. Each column can have a different data type. It behaves similarly to a `dict` of `Series`.

In [None]:
d = DataFrame({'student': ['Pat', 'Pat', 'Chris', 'Chris'],
 'homework': [1, 1, 2, 2],
 'points': [9, 9, 7, 8]})
d

In [None]:
d['points'].describe()
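
A brief sketch of the "`dict` of `Series`" analogy, using a reduced version of the table above: indexing by column name returns a `Series`, iterating yields the column names, and assigning to a new key adds a column. The `bonus` column is hypothetical.

```python
import pandas as pd

d = pd.DataFrame({'student': ['Pat', 'Chris'],
                  'points': [9, 7]})

col = d['points']                 # a single column is a Series
d['bonus'] = d['points'] + 1      # assigning to a new key adds a column
columns = list(d)                 # iterating yields column names, like dict keys

print(type(col).__name__, columns, d.dtypes, sep='\n')
```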

Even though it is possible to create DataFrames from scratch, it is far more common to load the data from external files or databases.

In [None]:
[method for method in dir(pd) if method.startswith('read')]

Let's dive into a small real-life example.
### Analyzing a Photovoltaic Powerplant's Log File

As it happens, there's interesting data worth analysis everywhere. The data we're looking at was taken from a power inverter installed in my basement.
CSV files quickly get large. Fortunately, pandas can open compressed CSV files on the fly (the data is available at [pv_data.csv.bz2](http://www.senarclens.eu/~gerald/teaching/cms/notebooks/pv_data.csv.bz2)).

In [None]:
df = pd.read_csv('pv_data.csv.bz2', sep=';', skiprows=1)
df.tail()
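
For readers without the data file at hand, the effect of `sep=';'` and `skiprows=1` can be sketched with an in-memory file. The content below is made up and only mimics the layout (one leading line to skip, then a semicolon-separated header and rows); `parse_dates` is an optional extra that converts the column while reading.

```python
import io
import pandas as pd

# Hypothetical miniature log: first line is skipped, fields are ';'-separated.
raw = (
    "exported by a hypothetical logger\n"
    "Date;Energy [Ws]\n"
    "2016-01-01;3600000\n"
    "2016-01-02;7200000\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=';', skiprows=1, parse_dates=['Date'])

print(sample)
print(sample.dtypes)
```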

Let's try to find out how much energy was produced. We could start by transforming the data into a more common unit.

In [None]:
df.columns

That's more columns than we're interested in. Let's drop the rest.

In [None]:
df.drop(['Inverter No.', 'Device Type', 'Reactive Energy L[Vars]', 'Reactive Energy C[Vars]',
 'Uac L1 [V]', 'Uac L2 [V]', 'Uac L3 [V]', 'Iac L1 [A]', 'Iac L2 [A]',
 'Iac L3 [A]', 'Udc MPPT1[V]', 'Idc MPPT1[A]', 'Udc MPPT2[V]',
 'Idc MPPT2[A]', 'Description'], axis=1, inplace=True)
df.head()

In [None]:
df['Wh'] = df['Energy [Ws]'] / 3600
df['Wh'].tail()

Maybe the data would be even nicer and easier to understand in the more common kWh unit.

In [None]:
df['kWh'] = df['Wh'] / 1000
df['kWh'].tail()
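
The conversion factors follow from the definitions: one hour has 3600 seconds, so 1 Wh = 3600 Ws, and 1 kWh = 1000 Wh = 3.6 million Ws. A quick check on made-up values:

```python
import pandas as pd

energy_ws = pd.Series([3_600_000, 1_800_000])   # hypothetical readings in Ws

wh = energy_ws / 3600          # watt-seconds to watt-hours
kwh = wh / 1000                # watt-hours to kilowatt-hours
kwh_direct = energy_ws / 3.6e6 # both steps at once

print(wh.tolist(), kwh.tolist(), kwh_direct.tolist())
```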

A quick peek at some basic statistics.

In [None]:
df['kWh'].describe()

In [None]:
df['kWh'].sum()

Looks like about 5.2 MWh were produced during the entire time. How long was the log recorded?

In [None]:
df['Date'] = df['Date'].astype('datetime64[ns]')  # recent pandas versions require an explicit unit such as [ns]
delta = df['Date'].max() - df['Date'].min()
delta

So roughly how much does this power plant produce per year?

In [None]:
df['kWh'].sum() / delta.days * 365
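
The extrapolation is plain proportional scaling: total energy divided by the number of logged days, times 365. The sketch below plugs in the rounded figures mentioned in the text (about 5.2 MWh over 477 days) as assumptions. It implicitly treats production as uniform over the year, which is the weak point of this estimate.

```python
import pandas as pd

total_kwh = 5200.0                 # assumption: the rounded 5.2 MWh from the text
delta = pd.Timedelta(days=477)     # assumption: the logging period from the text

per_day = total_kwh / delta.days   # average daily production in kWh
per_year = per_day * 365           # naive scaling to one year

print(round(per_day, 2), round(per_year))
```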

Is this value correct? Maybe the 477 days included two winters and only one summer?

In [None]:
df['Date'].head()

In [None]:
df['Date'].tail()

A straightforward solution would be to limit the data to the (entire) year 2016. Pandas can do the job.

In [None]:
df['Year'] = pd.DatetimeIndex(df['Date']).year
df[df['Year'] == 2016].head()

In [None]:
df[df['Year'] == 2016].tail()

In [None]:
df_2016 = df[df['Year'] == 2016]
df_2016['kWh'].sum()
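
The same filter can be sketched on a tiny frame with made-up values. It uses the `.dt` accessor, an alternative to wrapping the column in `pd.DatetimeIndex`, and avoids the helper column.

```python
import pandas as pd

# Hypothetical log spanning three calendar years.
log = pd.DataFrame({
    'Date': pd.to_datetime(['2015-12-31', '2016-06-01', '2016-12-31', '2017-01-01']),
    'kWh': [1.0, 2.0, 3.0, 4.0],
})

in_2016 = log[log['Date'].dt.year == 2016]   # boolean mask on the year
total_2016 = in_2016['kWh'].sum()

print(in_2016)
print(total_2016)
```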

So it looks like the power plant produces roughly 5 MWh per year, almost enough for a four-person household.
How was the production distributed over the year?

In [None]:
grouped = df_2016.groupby('Date').sum()
grouped.tail()
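
What `groupby(...).sum()` does can be seen on a few made-up rows: all rows sharing the same key are collapsed into one, and the numeric values are added up. The second grouping, by month, is an illustrative variation and not part of the analysis above.

```python
import pandas as pd

# Hypothetical log with two entries on the first day.
log = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-02', '2016-02-01']),
    'kWh': [0.5, 1.5, 1.0, 2.0],
})

daily = log.groupby('Date')['kWh'].sum()                     # one row per date
monthly = log.groupby(log['Date'].dt.month)['kWh'].sum()     # one row per month

print(daily)
print(monthly)
```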

A chart would be much easier to grasp...

In [None]:
%matplotlib inline
grouped.plot(y='kWh', title='2016 PV Output', figsize = (12, 12))

There is so much more to pandas. Please dive in and enjoy all the cool things you can do. I highly recommend the excellent book [Python for Data Analysis](http://amzn.to/2rKFHby) by Wes McKinney, the creator of pandas. Starting in fall 2017, the [2nd edition](http://amzn.to/2sHWH7B) will be available.

I was recently asked whether pandas can combine DataFrames in an SQL manner. Yes, it can. Pandas rocks: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
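
As a minimal sketch of such a join, `DataFrame.merge` combines two frames on a shared key, with `how` selecting the join type as in SQL. The two tables and the `student_id` key are made up for illustration.

```python
import pandas as pd

students = pd.DataFrame({'student_id': [1, 2, 3],
                         'name': ['Pat', 'Chris', 'Sue']})
points = pd.DataFrame({'student_id': [1, 1, 2],
                       'points': [9, 8, 7]})

# INNER JOIN: only students that have at least one points row.
inner = students.merge(points, on='student_id', how='inner')

# LEFT JOIN: all students; missing points become NaN.
left = students.merge(points, on='student_id', how='left')

print(inner)
print(left)
```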