Unit 1

  • Importing the Pandas, Seaborn and Matplotlib libraries.
  • Understanding that Pandas DataFrames behave much like Python dictionaries of columns.
  • Entering data into DataFrames using dictionaries.
  • Visualizing pairwise data using scatterplots.
  • Entering data into DataFrames by zipping lists of column values.
  • Entering data into DataFrames by using a list of pairs (or tuples) representing rows.
  • Understanding that adding a new column to a DataFrame is the same as adding a new item to a dictionary.
  • Knowing the equation of the line and the meaning of the y-intercept and the slope.
  • Using Seaborn to display the regression plot.
  • Using the regression line to make predictions.
  • Understanding confidence intervals.
  • Using Seaborn to display the residual plot.
  • Displaying the regression and residual plots either in the same figure or in separate figures.
  • Adjusting the horizontal limits of the regression and residual plots.
  • Discussing the residual plot as part of every regression analysis.
  • Knowing that for the regression analysis to be acceptable, the residuals must add up to zero.
  • Understanding that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
  • Understanding time series and generating sequences of integers to work with time series data.
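
As an illustration of several Unit 1 topics, here is a minimal sketch of three ways to build a DataFrame, adding a column, and fitting a regression line. The data (hours studied vs. test score) and all variable names are hypothetical and not taken from the course materials. NumPy's polyfit is used here as a stand-in for the line that Seaborn draws.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 78]

# 1) From a dictionary: keys become column names.
df = pd.DataFrame({"hours": hours, "score": scores})

# 2) By zipping lists of column values into rows.
df_zip = pd.DataFrame(list(zip(hours, scores)), columns=["hours", "score"])

# 3) From a list of pairs (tuples), each representing one row.
rows = [(1, 52), (2, 58), (3, 65), (4, 70), (5, 78)]
df_rows = pd.DataFrame(rows, columns=["hours", "score"])

# All three approaches produce the same DataFrame.
same = df.equals(df_zip) and df.equals(df_rows)

# Adding a column uses the same syntax as adding an item to a dictionary.
df["passed"] = df["score"] >= 60

# A sequence of integers 0, 1, 2, ... can serve as a time axis.
df["t"] = range(len(df))

# Least-squares line y = intercept + slope * x (degree-1 polynomial fit).
slope, intercept = np.polyfit(df["hours"], df["score"], 1)

# Use the line to predict the score for 6 hours of study.
predicted = intercept + slope * 6

# Residuals are observed values minus values on the fitted line.
residuals = df["score"] - (intercept + slope * df["hours"])
residual_sum = residuals.sum()  # close to zero, up to rounding error
```

With Seaborn installed, sns.regplot(data=df, x="hours", y="score") and sns.residplot(data=df, x="hours", y="score") should display the corresponding regression and residual plots.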

Unit 2

  • Knowing what types of data files are supported by Pandas.
  • Reading files from the hard disk or from a URL.
  • Distinguishing between continuous, discrete and categorical random variables.
  • Identifying CSV files, understanding their structure, and reading all or selected columns.
  • Checking for delimiters, quotes and empty spaces when reading CSV files.
  • Specifying which values in the CSV file should be treated as missing values.
  • Obtaining the number of rows and number of columns of a DataFrame.
  • Obtaining the list of all column names of a DataFrame.
  • Displaying the beginning and the end of a DataFrame.
  • Using the Titanic data set to practice analytics.
  • Adding new data columns to a DataFrame.
  • Iterating over columns in a DataFrame.
  • Accessing columns using integer indices.
  • Appending new rows.
  • Accessing rows using integer indices.
  • Understanding the difference between global and local row indices.
  • Deleting selected rows and/or columns from a DataFrame.
  • Identifying missing values in DataFrames using Boolean masks.
  • Summing up row and column values in DataFrames.
  • Counting missing values in the rows and columns of a DataFrame.
  • Dropping rows and/or columns with missing values.
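
The Unit 2 topics of reading CSV data and handling missing values can be sketched as follows. The CSV text is a small made-up sample, not the Titanic data set, and it is read from an in-memory string so that the example is self-contained. A file path or URL could be passed to read_csv in the same position.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL.
csv_text = """name,age,fare
Alice,29,71.3
Bob,?,8.05
Carol,41,
"""

# na_values lists extra strings that should be treated as missing values.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# usecols reads only selected columns.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=["name"])

# Number of rows and columns, and the list of column names.
n_rows, n_cols = df.shape
column_names = list(df.columns)

# Beginning and end of the DataFrame.
first_rows = df.head(2)
last_rows = df.tail(2)

# Boolean mask: True wherever a value is missing.
mask = df.isna()
missing_per_column = mask.sum()      # sums down each column
missing_per_row = mask.sum(axis=1)   # sums across each row

# Access a row and a column by integer position.
second_row = df.iloc[1]
first_column = df.iloc[:, 0]

# Drop rows that contain any missing value.
clean = df.dropna()
```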

Unit 3

  • Calculating the minimum, maximum, sum and mean of column values.
  • Calculating variance and standard deviation.
  • Understanding why standard deviation is used more often in practice.
  • Plotting histograms with Pandas.
  • Using the 68–95–99.7 rule in predictive analysis.
  • Filtering rows (selecting rows with a given property) in DataFrames.
  • Calculating the median and mode with Pandas.
  • Understanding how the median differs from the mean.
  • Knowing the five values which define a boxplot.
  • Displaying boxplots with Seaborn and interpreting them.
  • Confirming information read from a boxplot by calling the method describe().
  • Understanding how a pair of random variables can be correlated.
  • Calculating the Pearson coefficient of correlation R (R-value).
  • Using Pandas to quickly see which quantities in a DataFrame are correlated.
  • Visualizing correlation graphically via heatmaps.
  • Annotating heatmaps, setting limits for the values, and changing color maps.
  • Understanding the purpose of using the R-squared value, and its advantages over the R-value.
  • Calculating the R-squared value by squaring all values in the correlation matrix.
  • Searching and replacing in DataFrames using the method replace().
  • Modifying values in entire DataFrames and in individual columns using functions.
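
A short sketch of the descriptive statistics, filtering, correlation and replacement topics in Unit 3 might look like this. The height and weight values and the column names are invented for illustration only.

```python
import pandas as pd

# Hypothetical data set: heights in cm, weights in kg, and a group label.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 90],
    "group": ["a", "b", "a", "b", "a"],
})

# Minimum, maximum, sum and mean of one column.
basic = df["height"].agg(["min", "max", "sum", "mean"])

# Sample variance and standard deviation (Pandas divides by n - 1 by default).
variance = df["height"].var()
std_dev = df["height"].std()

# Median and mode.
median = df["height"].median()
mode = df["group"].mode()

# Filtering: keep only the rows where the condition holds.
tall = df[df["height"] > 165]

# describe() reports the five values shown by a boxplot, among others.
summary = df["height"].describe()

# Correlation matrix of Pearson R-values, then R-squared by squaring each entry.
corr = df[["height", "weight"]].corr()
r_squared = corr ** 2

# Search and replace values, and modify a column with a function.
df["group"] = df["group"].replace({"a": "control", "b": "treated"})
df["height_m"] = df["height"].apply(lambda cm: cm / 100)
```

With Seaborn installed, sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm") should display an annotated heatmap of the same correlation matrix.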

Unit 4

  • Why it is important to quantify the results of linear regression analysis.
  • What statistical hypothesis testing and the null hypothesis are.
  • What the P-value and the standard error of the estimate are.
  • How to use the function linregress of the SciPy Stats module.
  • How to use Matplotlib to plot SciPy Stats results.
  • How to use the calculated y-intercept and slope for predictions.
  • How to perform simple linear regression with Statsmodels.
  • About the difference between the training and testing datasets.
  • About two “gotchas” that one needs to pay attention to when using Statsmodels.
  • How to use simple linear regression results to make predictions.
  • What multiple linear regression is, and how to do it with Statsmodels.
  • How to use multiple linear regression results to make predictions.
  • How to plot the results of multiple linear regression (and understand the plots).
  • What multicollinearity of independent variables means in multiple linear regression.
  • What the main limitations of linear regression models are.
  • That linear regression does not work for categorical dependent variables.
  • What logistic regression is, and what types of models it is used for.
  • How to recognize when the results of logistic regression are wrong.
  • How to perform logistic regression with Seaborn and Statsmodels.
  • Finally, trainees will use the testing Titanic dataset to make a (simplified) prediction of the survival of Titanic passengers.
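
As a minimal sketch of quantifying a simple linear regression with linregress, the example below fits a line to made-up data and uses the result for a prediction. The data and the helper function predict are hypothetical. Note that the stderr attribute returned by linregress is the standard error of the slope.

```python
from scipy import stats

# Hypothetical data: hours studied vs. test score.
x = [1, 2, 3, 4, 5]
y = [52, 58, 65, 70, 78]

# Fit the least-squares line and collect the summary statistics.
result = stats.linregress(x, y)

slope = result.slope          # change in y per unit change in x
intercept = result.intercept  # value of y where the line crosses x = 0
r_value = result.rvalue       # Pearson correlation coefficient R
p_value = result.pvalue       # P-value for the null hypothesis that the slope is zero
slope_stderr = result.stderr  # standard error of the estimated slope


def predict(x_new):
    """Hypothetical helper: evaluate the fitted line at x_new."""
    return intercept + slope * x_new


predicted_score = predict(6)
```

A small P-value suggests that the data are unlikely under the null hypothesis of zero slope, which is one way to judge whether the fitted line is meaningful.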

Capstone Project

Trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.