Predictive Data Analytics with Python

In this course trainees learn how to read, clean, visualize and analyze data effectively using Python and its powerful free libraries Pandas, Seaborn, Scipy, Numpy, Matplotlib, and Statsmodels.

Course Overview

The course focuses on using Python libraries to solve practical applications rather than on the underlying math concepts. Trainees will learn enough statistical and analytical concepts and procedures in the tutorials to use these libraries effectively. This foundation is invaluable, whether trainees continue to use free Python libraries for analysis and visualization in their own work, or move on to a commercial analytics/visualization product specific to their industry.

Predictive Data Analytics with Python uses the Python programming language, and therefore basic knowledge of Python commands, structures and syntax is required. Trainees with little or no prior experience in computer programming will need to begin with Introduction to Computer Programming, which takes 80-120 hours. This visual course uses a simplified Python language to teach the algorithmic and problem-solving skills that are a crucial prerequisite for success in computer programming.

Predictive Data Analytics with Python starts by covering a necessary minimum of the Python programming language for applications in Data Science. Then it teaches trainees how to use Python and its powerful free libraries including Pandas, Numpy, Scipy, Matplotlib, Seaborn, and Statsmodels to read data from files, clean data, present data in visual form, perform qualitative and quantitative analysis of data, interpret data, and make predictions.

Prerequisites

Trainees will work with some high school math, which we will review as we go: the equation of the line, how to calculate the average of a set of values, and other basic computations.

Equipment Requirements

Internet access

Email

One of the following browsers:

  • Google Chrome
  • Mozilla Firefox
  • Microsoft Internet Explorer (9.0 or above)
  • Safari

Course Structure and Length

The course is self-paced, and trainees will practice each skill or concept as they go. Automatic feedback is built into the course for both practices and quizzes.

The course is divided into three Units and every Unit has five Sections. Each Section consists of seven instructional/practice levels, a quiz, and a master (proficiency) level. Trainees can return to any level or quiz for review.

This table illustrates the course structure as units, sections, and levels.

While learning the skills in Predictive Data Analytics with Python, trainees can practice skills and create portfolio artifacts with NCLab’s Python apps. They can use a project idea from NCLab or create their own. This independent practice will develop their fluency and confidence as data analysts and programmers.

The estimated time to complete this course is 80-120 hours, based on ability level. Since the course is self-paced, the amount of time required to complete the course will vary from person to person. Trainees are responsible for learning both the tutorial content and the skills acquired through practice. Upon successful completion of the course, trainees will be ready to perform a Capstone Project (40-60 hours) under the supervision of an NCLab instructor in order to graduate and obtain a career certificate.

Unit 1: Calculations, Libraries, Variables, Functions, Tuples, Lists, Indices, For-Loop

Section 1

  • Use Python as a powerful command-line scientific calculator.
  • Add, subtract, multiply and divide numbers.
  • Compute using the priorities of arithmetic operators and parentheses.
  • Understand how, in contrast to some other languages, dividing two integers with / in Python 3 yields the correct real value.
  • Learn modern and traditional ways of importing libraries.
  • Import the fractions library and work with fractions.
  • Use the built-in function help.
  • Import Numpy and use its functionality.
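
As a rough illustration of the calculator-style work in this section, the following minimal sketch uses only the standard library. The numbers and variable names are made up for illustration and do not come from the course materials.

```python
from fractions import Fraction

# In Python 3, dividing two integers with / returns a float.
quotient = 7 / 2                      # 3.5

# Multiplication has priority over addition unless parentheses say otherwise.
without_parens = 2 + 3 * 4            # 14
with_parens = (2 + 3) * 4             # 20

# The fractions library keeps results exact instead of rounding them.
half = Fraction(1, 3) + Fraction(1, 6)    # Fraction(1, 2)

print(quotient, without_parens, with_parens, half)
```

Calling help(Fraction) in an interactive session displays the built-in documentation for the class.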

Section 2

  • Use the floor division operator //, modulo operator %, and power operator **.
  • Understand that real numbers are not represented exactly in the computer, which can lead to problems with the floor division and modulo operators.
  • Understand why real numbers should never be compared using the standard comparison operator ==, and how to compare them correctly using a small tolerance.
  • Compute and convert data sizes, including b, KB, MB and GB.
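
The pitfalls listed above can be seen in a few lines. This is a minimal sketch; the tolerance 1e-9 is an arbitrary choice made for illustration, and math.isclose is shown as one standard-library alternative.

```python
import math

# Floor division, modulo and power on integers.
floor_q, remainder, power = 7 // 2, 7 % 2, 2 ** 10    # 3, 1, 1024

# 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly equal to 0.3.
total = 0.1 + 0.2
exact_match = (total == 0.3)                 # False
close_enough = abs(total - 0.3) < 1e-9       # True
also_close = math.isclose(total, 0.3)        # True

# The same rounding affects floor division with real numbers.
surprise = 1 // 0.1                          # 9.0 rather than 10.0
```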

Section 3

  • Create text strings and perform basic text string operations.
  • Use quotation marks correctly when defining text strings.
  • Define text string variables from raw text strings.
  • Clean text strings of trailing spaces and inspect hidden characters using the function repr.
  • Add text strings, multiply them with integers and update them with the operators += and *=.
  • Use the operators +=, -=, *= and /= for numerical variables.
  • Measure the length of text strings with the function len.
  • Use the special newline character \n.
  • Define functions and call them from the main program.
  • Write docstrings and comment code.
  • Understand that functions do not have to accept arguments and do not have to return values.
  • Use functions that accept multiple arguments and return multiple values.
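
A short sketch of several of these string and function skills follows. The function min_max and the sample text are hypothetical examples, not part of the course.

```python
raw = "  Ada Lovelace \n"
print(repr(raw))            # repr makes the extra spaces and the newline visible

name = raw.strip()          # remove leading and trailing whitespace
arrow = "-" * 5             # multiply a text string with an integer
arrow += ">"                # update it in place with +=


def min_max(values):
    """Return the smallest and the largest item of values."""
    return min(values), max(values)


smallest, largest = min_max([4, 1, 7])
print(name, len(name), arrow, smallest, largest)
```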

Section 4

  • Understand the difference between global and local scopes, and global and local variables.
  • Use the terms function parameters and function arguments correctly.
  • Understand that a function which returns multiple values actually returns a tuple.
  • Unpack a tuple, and access individual items using indices.
  • Understand how indices are numbered.
  • Use the for loop to parse tuples and text strings one item / character at a time.
  • Understand that in Python, loops do not create their own scope.
  • Use the command pass as a placeholder to do nothing.
  • Apply similar functions and methods to text strings and tuples, such as using indices and measuring their length with the function len.
  • Understand that Python is an object-oriented language where objects have methods.
  • Apply text string methods to text strings, such as upper and lower.
  • Use the for loop in combination with the range function; use the range function in three different ways.
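
The points about tuples, indices, loop scope and range can be sketched as follows. All names and values are chosen for illustration only.

```python
point = (3, 4, 5)
first, last = point[0], point[-1]    # indices start at 0; -1 is the last item
x, y, z = point                      # unpack the tuple into three variables

total = 0
for item in point:
    total += item

# Loops do not create their own scope, so item still exists after the loop.
leftover = item

# The range function used in three different ways.
counts = [list(range(4)), list(range(2, 5)), list(range(0, 10, 3))]

# Text strings are objects with methods.
shout = "data".upper()
```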

Section 5

  • Create empty and non-empty lists.
  • Add lists and multiply them with numbers.
  • Add items to lists using the list methods append and insert.
  • Parse lists using the for loop and use indices.
  • Remove and return list items with the method pop.
  • Remove and destroy list items using the command del.
  • Reverse lists with the function reversed and list method reverse.
  • Sort lists with the function sorted and list method sort.
  • Insert numbers and other values into text strings.
  • Understand the properties of mutable and immutable objects: that lists, sets and dictionaries are mutable objects while numbers, text strings, booleans and tuples are not.
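
A minimal sketch of the list operations above, including the difference between functions that return a new list and methods that change the list in place. The data are invented for illustration.

```python
nums = [5, 3, 8]
nums.append(1)              # [5, 3, 8, 1]
nums.insert(0, 9)           # [9, 5, 3, 8, 1]
last = nums.pop()           # removes and returns 1
del nums[0]                 # removes 9 without returning it

ordered = sorted(nums)              # new list; nums is unchanged
backwards = list(reversed(nums))    # new list; nums is unchanged

# Lists are mutable: a second name refers to the same object.
alias = nums
alias.append(0)             # nums is now [5, 3, 8, 0]

# Insert numbers into a text string.
message = f"The list has {len(nums)} items."
```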

Unit 2: Logic, Decision-Making, File Operations, and Dictionaries

Section 6

  • Use conditions and decision making in programs.
  • Work with Boolean values, operators, expressions and functions.
  • Write conditions using if, if-else and if-elif-else statements.
  • Search for items in tuples and lists, and for substrings in text strings.
  • Remove duplicates from tuples and lists.
  • Chain algebraic comparison operators.
  • Instantly terminate any loop with the break statement.
  • Calculate probabilities with Scipy using built-in Probability Density Functions (PDFs) and Cumulative Distribution Functions (CDFs).
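
The decision-making skills in this section can be sketched in plain Python as shown below. The function classify and its thresholds are hypothetical, and the Scipy probability functions are not covered in this sketch.

```python
def classify(temp_c):
    """Return a rough label for a temperature in degrees Celsius."""
    if temp_c < 0:
        return "freezing"
    elif 0 <= temp_c < 25:      # chained comparison operators
        return "mild"
    else:
        return "hot"


readings = [3, 7, 3, 9, 7]
has_nine = 9 in readings                    # search for an item
has_sub = "ab" in "cab"                     # search for a substring
unique = list(dict.fromkeys(readings))      # remove duplicates, keep order

first_big = None
for r in readings:
    if r > 5:
        first_big = r
        break                               # terminate the loop immediately
```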

Section 7

  • Write programs using the while loop; compare its properties to the for loop.
  • Understand the main applications for the while loop.
  • Understand that the keyword while can be used to create infinite loops, and when these are useful.
  • Use different ways to check if a list is non-empty.
  • Implement the interval bisection method in Python.
  • Define and use anonymous lambda functions.
  • Implement Newton’s method in Python.
  • Implement the Steepest Descent Method (SDM) in Python.
  • Solve a real-world optimization problem using the SDM method.
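
The interval bisection method and Newton's method named above can be sketched as follows. This is one possible minimal implementation, not necessarily the one used in the course; the function names, tolerances and the test equation x² − 2 = 0 are assumptions made for illustration.

```python
def bisect(f, a, b, tol=1e-10):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    while b - a > tol:
        mid = (a + b) / 2
        if f(a) * f(mid) <= 0:
            b = mid             # the root lies in the left half
        else:
            a = mid             # the root lies in the right half
    return (a + b) / 2


def newton(f, df, x, tol=1e-12, max_iter=50):
    """Refine the initial guess x using the derivative df."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x


f = lambda x: x ** 2 - 2                        # anonymous lambda function
root_bisect = bisect(f, 1.0, 2.0)
root_newton = newton(f, lambda x: 2 * x, 1.0)
print(root_bisect, root_newton)
```

The while loop suits bisection because the number of iterations is not known in advance, only the stopping condition.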

Section 8

  • Explore the concept of mutable and immutable data types in-depth.
  • Use functions with default parameters.
  • Check variable type at runtime.
  • Raise an exception when a program is used incorrectly.
  • Work with complex numbers.
  • Slice text strings, tuples and lists, and create their copies and reversed copies.
  • Time programs, and work with system date and time.
  • Use ternary expressions and list comprehension.
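
Several of the skills above fit into one short sketch. The function mean and its error message are hypothetical examples.

```python
def mean(values, digits=2):
    """Return the mean of values, rounded to digits decimal places."""
    # Check the argument type at runtime and raise an exception on misuse.
    if not isinstance(values, (list, tuple)):
        raise TypeError("values must be a list or a tuple")
    return round(sum(values) / len(values), digits)


word = "python"
head = word[:2]                  # slice: first two characters
flipped = word[::-1]             # reversed copy

even_squares = [n * n for n in range(5) if n % 2 == 0]     # list comprehension
label = "even" if len(word) % 2 == 0 else "odd"            # ternary expression
```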

Section 9

  • Open and close files for reading and writing.
  • Parse a text file line-by-line using the for loop.
  • Clean text strings with strip, lstrip and rstrip.
  • Split text strings into words.
  • Count lines, words and characters in a text file.
  • Rewind a file to manage multiple operations.
  • Read selected lines and use the file method readline.

Section 10

  • Create and work with Python dictionaries.
  • Understand how a Python dictionary is similar in concept to an SQL table.
  • Create empty and non-empty dictionaries.
  • Add and remove key:value pairs.
  • Access values using keys.
  • Parse dictionaries using the for loop.
  • Extract lists of keys, values, and items from dictionaries.
  • Zip lists of keys and values to create a dictionary.
  • Combine dictionaries.
  • Access the Google translate API from Python programs for real-time translations.
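
A minimal sketch of the dictionary operations listed above. The fruit names and prices are made up, and the translation API is not covered in this sketch.

```python
prices = {"apple": 0.5}
prices["pear"] = 0.75               # add a key:value pair
removed = prices.pop("apple")       # remove a pair and return its value

# Zip a list of keys and a list of values into a dictionary.
names = ["kiwi", "plum"]
costs = [0.4, 0.3]
zipped = dict(zip(names, costs))

# Combine two dictionaries into a new one.
combined = {**prices, **zipped}

total = 0
for fruit, cost in combined.items():    # parse with the for loop
    total += cost

fruit_list = list(combined.keys())
```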

Unit 3: Working With Data

Section 11

  • Import the Pandas, Seaborn and Matplotlib libraries.
  • Understand that Pandas DataFrames are basically Python dictionaries.
  • Enter data into DataFrames using dictionaries.
  • Visualize pairwise data using scatterplots.
  • Enter data into DataFrames by zipping lists of column values.
  • Enter data into DataFrames by using a list of pairs (or tuples) representing rows.
  • Understand that adding a new column to a DataFrame is the same as adding a new item to a dictionary.
  • Know the equation of the line and the meaning of the y-intercept and the slope.
  • Use Seaborn to display the regression plot.
  • Use the regression line to make predictions.
  • Understand confidence intervals.
  • Use Seaborn to display the residual plot.
  • Display the regression and residual plots either in the same figure or in separate figures.
  • Adjust the horizontal limits of the regression and residual plots.
  • Discuss the residual plot as part of every regression analysis.
  • Know that for the regression analysis to be acceptable, the residuals must add up to zero.
  • Understand that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
  • Understand time series, and generate sequences of integers to work with time series data.
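
The ideas of entering data into a DataFrame, fitting a line and inspecting residuals can be sketched without any plotting. This sketch assumes Pandas and Numpy are installed, and it uses numpy.polyfit as a stand-in for the Seaborn regression plot taught in the course. The data are invented for illustration.

```python
import numpy as np
import pandas as pd

hours = [1, 2, 3, 4, 5]
score = [2.1, 3.9, 6.2, 7.8, 10.1]

# Enter data using a dictionary, or by zipping lists of column values.
from_dict = pd.DataFrame({"hours": hours, "score": score})
from_zip = pd.DataFrame(list(zip(hours, score)), columns=["hours", "score"])

# Fit the line y = intercept + slope * x (least squares, degree 1).
slope, intercept = np.polyfit(from_dict["hours"], from_dict["score"], 1)

# Adding a column works like adding an item to a dictionary.
from_dict["predicted"] = intercept + slope * from_dict["hours"]
from_dict["residual"] = from_dict["score"] - from_dict["predicted"]

residual_sum = from_dict["residual"].sum()
print(slope, intercept, residual_sum)
```

For a least-squares line with an intercept, the residuals add up to zero apart from rounding error, which is what residual_sum shows here.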

Section 12

  • Know what types of data files are supported by Pandas.
  • Read files from the hard disk or from a URL.
  • Distinguish between continuous, discrete and categorical random variables.
  • Identify CSV files; understand their structure and how to read all or selected columns.
  • Check for delimiters, quotes and empty spaces when reading CSV files.
  • Specify which values in the CSV file should be treated as missing values.
  • Obtain the number of rows and number of columns of a DataFrame.
  • Obtain the list of all column names of a DataFrame.
  • Display the beginning and the end of a DataFrame.
  • Use the Titanic data set to practice analytics.
  • Add new data columns to a DataFrame.
  • Iterate over columns in a DataFrame.
  • Access columns using integer indices.
  • Append new rows.
  • Access rows using integer indices.
  • Understand the difference between global and local row indices.
  • Delete selected rows and/or columns from a DataFrame.
  • Identify missing values in DataFrames using Boolean masks.
  • Sum row and column values in DataFrames.
  • Elegantly count missing values in the rows and columns of a DataFrame.
  • Drop rows and/or columns with missing values.
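
Reading a CSV file and handling missing values might look like the following sketch, which assumes Pandas is installed. To stay self-contained it reads from an in-memory string rather than from the Titanic data set; the three rows and the "?" marker for missing values are invented for illustration.

```python
import io

import pandas as pd

csv_text = "name,age,fare\nAnn,29,7.25\nBob,?,71.28\nCy,41,?\n"

# Treat "?" as a missing value while reading.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

n_rows, n_cols = df.shape           # number of rows and columns
column_names = list(df.columns)     # list of all column names

mask = df.isna()                    # Boolean mask of missing values
missing_per_column = mask.sum()     # count missing values in each column
complete = df.dropna()              # drop rows with missing values

print(df.head())
```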

Section 13

  • Calculate the minimum, maximum, sum and mean of column values.
  • Calculate variance and standard deviation.
  • Understand why standard deviation is used more often in practice.
  • Plot histograms with Pandas.
  • Use the 68–95–99.7 rule in predictive analysis.
  • Filter rows (= select rows with a given property) in DataFrames.
  • Calculate the median and mode with Pandas.
  • Understand how the median differs from the mean.
  • Know the five values which define a boxplot.
  • Display boxplots with Seaborn, and interpret them.
  • Confirm information read from a boxplot by calling the method describe.
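
The summary statistics above can be sketched with Pandas alone, assuming it is installed. The ages are invented, with one deliberately extreme value to show how the median differs from the mean.

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 25, 25, 30, 98]})

mean_age = df["age"].mean()         # pulled upward by the extreme value 98
median_age = df["age"].median()     # not affected by the extreme value
mode_age = df["age"].mode()[0]      # most frequent value
std_age = df["age"].std()           # standard deviation, in the same units as age

# Filter rows: keep only those with a given property.
younger = df[df["age"] < 40]

# describe reports the minimum, quartiles and maximum in one call.
summary = df["age"].describe()
print(summary)
```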

Section 14

  • Understand how a pair of random variables can be correlated.
  • Calculate the Pearson coefficient of correlation R (R-value).
  • Use Pandas to quickly see which quantities in a DataFrame are correlated.
  • Visualize correlation graphically via heatmaps.
  • Annotate heatmaps, set limits for the values, and change color maps.
  • Understand the purpose of using the R-squared value, and its advantages over the R-value.
  • Calculate the R-squared value by squaring all values in the correlation matrix.
  • Search and replace in DataFrames using the method replace.
  • Modify values in entire DataFrames and in individual columns using functions.
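
A minimal sketch of the correlation workflow, assuming Pandas is installed. The column names and values are hypothetical, and the heatmap step is omitted so that the sketch runs without a display.

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
})

r = df.corr()               # matrix of Pearson R-values for all column pairs
r_squared = r ** 2          # square every value to obtain R-squared

# Search and replace values with the method replace.
codes = pd.Series(["m", "f", "m"]).replace({"m": "male", "f": "female"})

print(r.round(2))
```

Squaring removes the sign, so R-squared shows the strength of a linear relationship while the R-value also shows its direction.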

Section 15

In the last section of the course trainees will learn:

  • Why it is important to quantify the results of linear regression analysis.
  • What statistical hypothesis testing and the null hypothesis are.
  • What the P-value and the standard error of the estimate are.
  • How to use the function linregress of the Scipy Stats module.
  • How to use Matplotlib to plot Scipy Stats results.
  • How to use the calculated y-intercept and slope for predictions.
  • How to perform simple linear regression with Statsmodels.
  • About the difference between the training and testing datasets.
  • About two “gotchas” that one needs to pay attention to when using Statsmodels.
  • How to use simple linear regression results to make predictions.
  • What multiple linear regression is, and how to do it with Statsmodels.
  • How to use multiple linear regression results to make predictions.
  • How to plot the results of multiple linear regression (and understand the plots).
  • What multicollinearity of independent variables in multiple linear regression is.
  • What the main limitations of linear regression models are.
  • That linear regression does not work for categorical dependent variables.
  • What logistic regression is, and what types of models it is used for.
  • How to recognize when the results of logistic regression are wrong.
  • How to perform logistic regression with Seaborn and Statsmodels.
  • Finally, trainees will use the testing Titanic dataset to make a (simplified) prediction of the survival of Titanic passengers.
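
The use of linregress from the Scipy Stats module can be sketched as follows, assuming Scipy is installed. The data are invented for illustration and are not the Titanic data set; the Statsmodels and logistic regression steps are not covered in this sketch.

```python
from scipy import stats

hours = [1, 2, 3, 4, 5]
score = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = stats.linregress(hours, score)

# The result carries the slope, the y-intercept, the R-value, the P-value
# for the null hypothesis of zero slope, and the standard error of the slope.
print(fit.slope, fit.intercept, fit.rvalue, fit.pvalue, fit.stderr)

# Use the calculated y-intercept and slope for a prediction at hours = 6.
predicted_6 = fit.intercept + fit.slope * 6
```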