{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Analysis\n",
    "## NumPy\n",
    "NumPy is a library that adds support for large, multi-dimensional arrays and matrices to the Python programming language. It also offers a large collection of high-level mathematical functions to operate on these arrays. NumPy is free software released under the three-clause BSD license. It is the foundation of many other Python libraries for scientific computing - including pandas. Travis Oliphant is NumPy's original author.\n",
    "### ndarray\n",
    "At the heart of NumPy lies its n-dimensional array object. Each ndarray has a shape, a tuple indicating the size of each dimension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "a = np.arange(0, 24)\n",
    "print(a)\n",
    "print(a.ndim)\n",
    "print(a.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = a.reshape(6,4)\n",
    "print(a)\n",
    "print(a.ndim)\n",
    "print(a.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Arrays can be initialized from Python sequences. Alternatively, they can be created via `arange`, `zeros`, `ones` and `empty`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.array((1, 4, 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.array((1, 4, 2), dtype=np.float64)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.zeros(12).reshape(3, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.ones(12).reshape(3, 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.empty(12).reshape(3, 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operations between equal sized arrays applies the operation elementwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a * a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a - a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1 / a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a == 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Indexing and Slicing\n",
    "One of the most important distinctions from Python lists is that array slices are views on the original array (instead of copies)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "row = a[3]\n",
    "row[-1] = 99\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assignment of a scalar to a view is broadcast to the entire range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[1:4][:] = 0\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[4, 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A boolean array can be passed for indexing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "homeworks = np.array(('hw_1', 'hw_2', 'hw_3', 'hw_4', 'hw_5', 'hw_6'))\n",
    "print(homeworks == 'hw_5')\n",
    "a[homeworks == 'hw_5']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = np.random.randn(6, 4)\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(r < 0)\n",
    "r[r < 0] = 0\n",
    "r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fancy Indexing\n",
    "Uses integer arrays for indexing. Always creates copies of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r[[-1, 4, 0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t = r[[-1, 4, 0]]\n",
    "t[:] = 0\n",
    "t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Universal Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sqrt(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r2 = np.random.randn(6, 4)\n",
    "r2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.maximum(r, r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Counting True values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(r2 > 0).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conditional Logic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = np.arange(0, 1, 0.1)\n",
    "ys = np.arange(1, 2, 0.1)\n",
    "cs = np.array((True, False, False, True, False, True, True, True, False, True))\n",
    "print(xs, ys, cs, sep='\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[(x if c else y) for x, y, c in zip(xs, ys, cs)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.where(cs, xs, ys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.where(r2 > 0, 'Chris', 'Pat')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Statistical Operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r2.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(r2)\n",
    "r2.cumsum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sorting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = np.random.randn(6, 4)\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r.sort()\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r.sort(0)\n",
    "r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Algebra\n",
    "Matrix operations like inverse and determinant are using the same industry standard Fortran libraries (eg. BLAS, LAPACK) that are also used in other matrix languages - including some of NumPy's commercial counterparts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = np.arange(6).reshape(3, 2)\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = np.arange(10, 4, -1).reshape(2, 3)\n",
    "Y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Z = X.dot(Y)\n",
    "Z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "M = np.matrix(Z)\n",
    "M.I"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy.linalg import inv\n",
    "inv(Z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random Numbers\n",
    "NumPy provides generators for random numbers following different distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.normal(size=(6, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.chisquare(3, size=(6, 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This library offers facilities for data manipulation and analysis. It contains the `Series` and `DataFrame` data structures that allow efficient manipulation and analysis of one and two dimensional data. Pandas is free software released under the three-clause BSD license. Its name is derived from \"panel data\", an econometrics term for multidimensional structured data sets. Pandas was originally written by Wes McKinney."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas import DataFrame, Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Series and DataFrame\n",
    "The `Series` and `Dataframe` types are close relatives to the numpy `ndarray` but allow for labelling the data. The labels are called index. Since a `Series` is one-dimensional it can be compared with a Python `dict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = Series((4, 2, 3, 5, 7))\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = Series((4, 2, 3, 5, 7), index=['Sue', 'Pat', 'Chris', 'John', 'Stu'])\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s[['Chris', 'John', 'Sue']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `DataFrame` represents tabular data much like the common spreadsheet programs or a single table in a database. Each column can have a different datatype. It behaves similar to a `dict` of `Series`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = DataFrame({'student': ['Pat', 'Pat', 'Chris', 'Chris'],\n",
    "               'homework': [1, 1, 2, 2],\n",
    "               'points': [9, 9, 7, 8]})\n",
    "d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['points'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even though it is possible to create DataFrames from scratch it is most common to load the data from external files or databases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[method for method in dir(pd) if method.startswith('read')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's dive into a small real-life example.\n",
    "### Analyzing a Photovoltaic Powerplant's Log File"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As it happens, there's interesting data worth analysis everywhere. The data we're looking at was taken from a power inverter installed in my basement.\n",
    "CSV files quickly get large. Fortunately, pandas can open compressed csv files on-the-fly (the data is available at [pv_data.csv.bz2](http://www.senarclens.eu/~gerald/teaching/cms/notebooks/pv_data.csv.bz2))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('pv_data.csv.bz2', sep=';', skiprows=1)\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try to find out how much energy was produced. We could start by transforming the data into a more common unit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's more columns than we're interested in. Let's drop the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(['Inverter No.', 'Device Type', 'Reactive Energy L[Vars]', 'Reactive Energy C[Vars]',\n",
    "       'Uac L1 [V]', 'Uac L2 [V]', 'Uac L3 [V]', 'Iac L1 [A]', 'Iac L2 [A]',\n",
    "       'Iac L3 [A]', 'Udc MPPT1[V]', 'Idc MPPT1[A]', 'Udc MPPT2[V]',\n",
    "       'Idc MPPT2[A]', 'Description'], axis=1, inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Wh'] = df['Energy [Ws]'] / 3600\n",
    "df['Wh'].tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Maybe the data would be even nicer and easier to understand in the more common kWh unit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['kWh'] = df['Wh'] / 1000\n",
    "df['kWh'].tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick peak at some basic statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['kWh'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['kWh'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lookslike about 5.2 MWh were produced during the entire time. How long was the log recorded?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Date'] = df['Date'].astype('datetime64')\n",
    "delta = df['Date'].max() - df['Date'].min()\n",
    "delta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So how much produces this power plant roughly per year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['kWh'].sum() / delta.days * 365"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is this value correct? Maybe the 477 days included two winters and only one summer?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Date'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Date'].tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A straight-forward solution would be to limit the date to the (entire) year 2016. Pandas can do the job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Year'] = pd.DatetimeIndex(df['Date']).year\n",
    "df[df['Year'] == 2016].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df['Year'] == 2016].tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2016 = df[df['Year'] == 2016]\n",
    "df_2016['kWh'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally it looks like that the power plant produces roughly 5 MWh per year - almost enough for a household of 4 persons.\n",
    "How did the production distribute over the year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = df_2016.groupby('Date').sum()\n",
    "grouped.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A chart would be much easier to grasp..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "grouped.plot(y='kWh', title='2016 PV Output', figsize = (12, 12))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is so much more to pandas. Please dive in and enjoy all the cool things you can do. I highly recommend Wes McKinney's (creator of pandas) excellent book [Python for Data Analysis](http://amzn.to/2rKFHby). Starting in fall 2017, the [2nd edition](http://amzn.to/2sHWH7B) will be available.\n",
    "\n",
    "Since I was recently asked whether pandas can combine dataframse in an SQL manner - yes. Pandas rocks: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}