{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis\n", "## NumPy\n", "NumPy is a library that adds support for large, multi-dimensional arrays and matrices to the Python programming language. It also offers a large collection of high-level mathematical functions to operate on these arrays. NumPy is free software released under the three-clause BSD license. It is the foundation of many other Python libraries for scientific computing - including pandas. Travis Oliphant is NumPy's original author.\n", "### ndarray\n", "At the heart of NumPy lies its n-dimensional array object. Each ndarray has a shape, a tuple indicating the size of each dimension." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "a = np.arange(0, 24)\n", "print(a)\n", "print(a.ndim)\n", "print(a.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = a.reshape(6,4)\n", "print(a)\n", "print(a.ndim)\n", "print(a.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays can be initialized from Python sequences. Alternatively, they can be created via `arange`, `zeros`, `ones` and `empty`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array((1, 4, 2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array((1, 4, 2), dtype=np.float64)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.zeros(12).reshape(3, 4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.ones(12).reshape(3, 4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.empty(12).reshape(3, 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations between equal sized arrays applies the operation elementwise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a * a" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a - a" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "1 / a" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a == 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing and Slicing\n", "One of the most important distinctions from Python lists is that array slices are views on the original array (instead of copies)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row = a[3]\n", "row[-1] = 99\n", "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assignment of a scalar to a view is broadcast to the entire range." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[1:4][:] = 0\n", "a" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[4, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A boolean array can be passed for indexing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "homeworks = np.array(('hw_1', 'hw_2', 'hw_3', 'hw_4', 'hw_5', 'hw_6'))\n", "print(homeworks == 'hw_5')\n", "a[homeworks == 'hw_5']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = np.random.randn(6, 4)\n", "r" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(r < 0)\n", "r[r < 0] = 0\n", "r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fancy Indexing\n", "Uses integer arrays for indexing. Always creates copies of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r[[-1, 4, 0]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = r[[-1, 4, 0]]\n", "t[:] = 0\n", "t" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Universal Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(r)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2 = np.random.randn(6, 4)\n", "r2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.maximum(r, r2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Counting True values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(r2 > 0).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conditional Logic" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xs = np.arange(0, 1, 0.1)\n", "ys = np.arange(1, 2, 0.1)\n", "cs = np.array((True, False, False, True, False, True, True, True, False, True))\n", "print(xs, ys, cs, sep='\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[(x if c else y) for x, y, c in zip(xs, ys, cs)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.where(cs, xs, ys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.where(r2 > 0, 'Chris', 'Pat')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical Operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r2.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(r2)\n", "r2.cumsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r = np.random.randn(6, 4)\n", "r" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.sort()\n", "r" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "r.sort(0)\n", "r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Algebra\n", "Matrix operations like inverse and determinant are using the same industry standard Fortran libraries (eg. BLAS, LAPACK) that are also used in other matrix languages - including some of NumPy's commercial counterparts. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.arange(6).reshape(3, 2)\n", "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Y = np.arange(10, 4, -1).reshape(2, 3)\n", "Y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Z = X.dot(Y)\n", "Z" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "M = np.matrix(Z)\n", "M.I" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from numpy.linalg import inv\n", "inv(Z)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Numbers\n", "NumPy provides generators for random numbers following different distributions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.normal(size=(6, 4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.chisquare(3, size=(6, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This library offers facilities for data manipulation and analysis. It contains the `Series` and `DataFrame` data structures that allow efficient manipulation and analysis of one and two dimensional data. Pandas is free software released under the three-clause BSD license. Its name is derived from \"panel data\", an econometrics term for multidimensional structured data sets. Pandas was originally written by Wes McKinney." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandas import DataFrame, Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series and DataFrame\n", "The `Series` and `Dataframe` types are close relatives to the numpy `ndarray` but allow for labelling the data. The labels are called index. Since a `Series` is one-dimensional it can be compared with a Python `dict`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = Series((4, 2, 3, 5, 7))\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = Series((4, 2, 3, 5, 7), index=['Sue', 'Pat', 'Chris', 'John', 'Stu'])\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s[['Chris', 'John', 'Sue']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` represents tabular data much like the common spreadsheet programs or a single table in a database. Each column can have a different datatype. It behaves similar to a `dict` of `Series`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = DataFrame({'student': ['Pat', 'Pat', 'Chris', 'Chris'],\n", " 'homework': [1, 1, 2, 2],\n", " 'points': [9, 9, 7, 8]})\n", "d" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['points'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though it is possible to create DataFrames from scratch it is most common to load the data from external files or databases." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "[method for method in dir(pd) if method.startswith('read')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's dive into a small real-life example.\n", "### Analyzing a Photovoltaic Powerplant's Log File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it happens, there's interesting data worth analysis everywhere. The data we're looking at was taken from a power inverter installed in my basement.\n", "CSV files quickly get large. Fortunately, pandas can open compressed csv files on-the-fly (the data is available at [pv_data.csv.bz2](http://www.senarclens.eu/~gerald/teaching/cms/notebooks/pv_data.csv.bz2))." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('pv_data.csv.bz2', sep=';', skiprows=1)\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to find out how much energy was produced. We could start by transforming the data into a more common unit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's more columns than we're interested in. Let's drop the rest." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop(['Inverter No.', 'Device Type', 'Reactive Energy L[Vars]', 'Reactive Energy C[Vars]',\n", " 'Uac L1 [V]', 'Uac L2 [V]', 'Uac L3 [V]', 'Iac L1 [A]', 'Iac L2 [A]',\n", " 'Iac L3 [A]', 'Udc MPPT1[V]', 'Idc MPPT1[A]', 'Udc MPPT2[V]',\n", " 'Idc MPPT2[A]', 'Description'], axis=1, inplace=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Wh'] = df['Energy [Ws]'] / 3600\n", "df['Wh'].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe the data would be even nicer and easier to understand in the more common kWh unit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['kWh'] = df['Wh'] / 1000\n", "df['kWh'].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick peak at some basic statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['kWh'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['kWh'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lookslike about 5.2 MWh were produced during the entire time. How long was the log recorded?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date'] = df['Date'].astype('datetime64')\n", "delta = df['Date'].max() - df['Date'].min()\n", "delta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So how much produces this power plant roughly per year?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['kWh'].sum() / delta.days * 365" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is this value correct? Maybe the 477 days included two winters and only one summer?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date'].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Date'].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A straight-forward solution would be to limit the date to the (entire) year 2016. Pandas can do the job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Year'] = pd.DatetimeIndex(df['Date']).year\n", "df[df['Year'] == 2016].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df['Year'] == 2016].tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_2016 = df[df['Year'] == 2016]\n", "df_2016['kWh'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally it looks like that the power plant produces roughly 5 MWh per year - almost enough for a household of 4 persons.\n", "How did the production distribute over the year?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = df_2016.groupby('Date').sum()\n", "grouped.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A chart would be much easier to grasp..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "grouped.plot(y='kWh', title='2016 PV Output', figsize = (12, 12))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is so much more to pandas. Please dive in and enjoy all the cool things you can do. I highly recommend Wes McKinney's (creator of pandas) excellent book [Python for Data Analysis](http://amzn.to/2rKFHby). Starting in fall 2017, the [2nd edition](http://amzn.to/2sHWH7B) will be available.\n", "\n", "Since I was recently asked whether pandas can combine dataframse in an SQL manner - yes. Pandas rocks: https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }