Section 1
- Importing the Pandas, Seaborn and Matplotlib libraries.
- Understanding that Pandas DataFrames behave much like Python dictionaries that map column names to columns of values.
- Entering data into DataFrames using dictionaries.
- Visualizing pairwise data using scatterplots.
- Entering data into DataFrames by zipping lists of column values.
- Entering data into DataFrames by using a list of pairs (or tuples) representing rows.
- Understanding that adding a new column to a DataFrame is the same as adding a new item to a dictionary.
- Knowing the equation of the line and the meaning of the y-intercept and the slope.
- Using Seaborn to display the regression plot.
- Using the regression line to make predictions.
- Understanding confidence intervals.
- Using Seaborn to display the residual plot.
- Displaying the regression and residual plots either in the same figure or in separate figures.
- Adjusting the horizontal limits of the regression and residual plots.
- Discussing the residual plot as part of every regression analysis.
- Knowing that for the regression analysis to be acceptable, the residuals must add up to zero.
- Understanding that the residuals must look like random numbers, without showing any non-random patterns such as lines or curves.
- Understanding time series, and generating sequences of integers to work with time series data.
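The DataFrame-building and regression topics in this section can be sketched as follows. This is a minimal illustration, not course material: the `hours` and `score` values are made-up sample data, and the line is fitted with `numpy.polyfit` as a stand-in, because Seaborn's regression plot draws the line but does not return its slope and y-intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (an assumption, not from the course).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

# 1) From a dictionary that maps column names to lists of values.
df_dict = pd.DataFrame({"hours": hours, "score": score})

# 2) From zipped lists of column values.
df_zip = pd.DataFrame(list(zip(hours, score)), columns=["hours", "score"])

# 3) From a list of tuples, each tuple representing one row.
rows = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 68)]
df_rows = pd.DataFrame(rows, columns=["hours", "score"])

# Adding a column uses the same syntax as adding an item to a dictionary.
df = df_dict.copy()
df["passed"] = df["score"] >= 60

# Least-squares line: score = slope * hours + intercept.
slope, intercept = np.polyfit(df["hours"], df["score"], 1)

# Residuals are the observed values minus the values predicted by the line.
df["predicted"] = slope * df["hours"] + intercept
df["residual"] = df["score"] - df["predicted"]

# Using the line to predict the score for an unseen value of hours.
prediction_at_6 = slope * 6 + intercept
```

With a DataFrame like this, `seaborn.regplot(data=df, x="hours", y="score")` and `seaborn.residplot(data=df, x="hours", y="score")` would display the corresponding regression and residual plots.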
Section 2
- Knowing what types of data files are supported by Pandas.
- Reading files from the hard disk or from a URL.
- Distinguishing between continuous, discrete and categorical random variables.
- Identifying CSV files, and understanding their structure and how to read all or selected columns.
- Checking for delimiters, quotes and empty spaces when reading CSV files.
- Specifying which values in the CSV file should be treated as missing values.
- Obtaining the number of rows and number of columns of a DataFrame.
- Obtaining the list of all column names of a DataFrame.
- Displaying the beginning and the end of a DataFrame.
- Using the Titanic dataset to practice analytics.
- Adding new data columns to a DataFrame.
- Iterating over columns in a DataFrame.
- Accessing columns using integer indices.
- Appending new rows.
- Accessing rows using integer indices.
- Understanding the difference between global and local row indices.
- Deleting selected rows and/or columns from a DataFrame.
- Identifying missing values in DataFrames using Boolean masks.
- Summing up row and column values in DataFrames.
- Counting missing values in the rows and columns of a DataFrame.
- Dropping rows and/or columns with missing values.
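Several of the topics above can be tried on a tiny in-memory CSV. The sketch below is illustrative only: the CSV text and its column names are invented, `io.StringIO` stands in for a file on disk or a URL, and the use of `loc` and `iloc` is one possible reading of "global" and "local" row indices.

```python
import io
import pandas as pd

# Hypothetical CSV content (an assumption, not the course's Titanic file).
csv_text = """name,age,fare
Ann,29,7.25
Bob,?,71.28
Cid,35,
Dee,54,8.05
"""

# Treat "?" as a missing value in addition to the default markers
# (empty fields are already read as missing).
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Number of rows and columns, and the list of column names.
n_rows, n_cols = df.shape
column_names = list(df.columns)

# Beginning and end of the DataFrame.
first_two = df.head(2)
last_two = df.tail(2)

# Boolean mask of missing values, then counts per column and per row.
missing_mask = df.isna()
missing_per_column = missing_mask.sum()
missing_per_row = missing_mask.sum(axis=1)

# Dropping rows with missing values keeps the original row labels.
complete = df.dropna()
by_position = complete.iloc[1]  # second remaining row, counted from 0
by_label = complete.loc[3]      # row whose original label is 3

# Appending a row and deleting a column.
new_row = pd.DataFrame([{"name": "Eve", "age": 41, "fare": 13.0}])
extended = pd.concat([df, new_row], ignore_index=True)
without_fare = extended.drop(columns=["fare"])
```

Here `by_position` and `by_label` refer to the same row, which shows how positions and labels diverge once rows have been removed.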
Section 3
- Calculating the minimum, maximum, sum and mean of column values.
- Calculating variance and standard deviation.
- Understanding why standard deviation is used more often in practice.
- Plotting histograms with Pandas.
- Using the 68–95–99.7 rule in predictive analysis.
- Filtering rows (selecting rows with a given property) in DataFrames.
- Calculating the median and mode with Pandas.
- Understanding how the median differs from the mean.
- Knowing the five values which define a boxplot.
- Displaying boxplots with Seaborn and interpreting them.
- Confirming information read from a boxplot by calling the method describe.
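The descriptive statistics listed in this section map onto a handful of Pandas calls. The following is a small sketch with invented numbers; the column name `value` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical sample values (an assumption, not course data).
df = pd.DataFrame({"value": [2, 4, 4, 4, 5, 5, 7, 9]})
col = df["value"]

# Minimum, maximum, sum and mean of the column.
smallest, largest, total, mean = col.min(), col.max(), col.sum(), col.mean()

# Sample variance and standard deviation (Pandas divides by n - 1 by default).
# The standard deviation is in the same units as the data, the variance is not.
variance = col.var()
std_dev = col.std()

# Median and mode; mode() returns a Series because ties are possible.
median = col.median()
mode = col.mode()[0]

# Filtering: keep only the rows whose value is greater than 5.
large = df[df["value"] > 5]

# describe() reports the five values a boxplot is built from:
# min, 25% (first quartile), 50% (median), 75% (third quartile), max.
summary = col.describe()
```

Calling `seaborn.boxplot(x=df["value"])` on the same column would draw the boxplot that `summary` describes numerically.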
Section 4
- Understanding how a pair of random variables can be correlated.
- Calculating the Pearson coefficient of correlation R (R-value).
- Using Pandas to quickly see which quantities in a DataFrame are correlated.
- Visualizing correlation graphically via heatmaps.
- Annotating heatmaps, setting limits for the values, and changing color maps.
- Understanding the purpose of using the R-squared value, and its advantages over the R-value.
- Calculating the R-squared value by squaring all values in the correlation matrix.
- Searching and replacing in DataFrames using the method replace().
- Modifying values in entire DataFrames and in individual columns using functions.
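As a rough illustration of the correlation and data-modification topics above, the sketch below builds a small DataFrame, computes the correlation matrix, and squares it to obtain R-squared values. All column names and numbers are invented for the example; the port names are only there to demonstrate `replace()`.

```python
import pandas as pd

# Hypothetical data (an assumption, not from the course).
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "z": [5, 4, 3, 2, 1],   # decreases exactly as x increases
    "w": [1, 3, 2, 5, 4],   # loosely increases with x
    "port": ["S", "C", "S", "Q", "C"],
})

# Pearson correlation matrix (R-values) of the numeric columns.
numeric = df[["x", "z", "w"]]
r = numeric.corr()

# R-squared: square every entry of the correlation matrix.
# The sign is lost, so strong negative and strong positive
# correlations both show up as values close to 1.
r_squared = r ** 2

# Search and replace across the DataFrame with replace().
named = df.replace({"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"})

# Applying a function to one column, and to every numeric column.
doubled_w = df["w"].apply(lambda v: v * 2)
scaled = numeric.apply(lambda column: column * 10)
```

A matrix such as `r` can be passed to `seaborn.heatmap(r, annot=True, vmin=-1, vmax=1)` to display it as an annotated heatmap with fixed value limits.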
Section 5
- Why it is important to quantify the results of linear regression analysis.
- What statistical hypothesis testing and the null hypothesis are.
- What the P-value and the standard error of the estimate are.
- How to use the function linregress of the Scipy Stats module.
- How to use Matplotlib to plot Scipy Stats results.
- How to use the calculated y-intercept and slope for predictions.
- How to perform simple linear regression with Statsmodels.
- About the difference between the training and testing datasets.
- About two “gotchas” that one needs to pay attention to when using Statsmodels.
- How to use simple linear regression results to make predictions.
- What multiple linear regression is, and how to perform it with Statsmodels.
- How to use multiple linear regression results to make predictions.
- How to plot the results of multiple linear regression (and understand the plots).
- What multicollinearity of independent variables in multiple linear regression is.
- What the main limitations of linear regression models are.
- That linear regression does not work for categorical dependent variables.
- What logistic regression is, and what types of models it is used for.
- How to recognize when the results of logistic regression are wrong.
- How to perform logistic regression with Seaborn and Statsmodels.
- Finally, trainees will use the testing Titanic dataset to make a (simplified) prediction of the survival of Titanic passengers.
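For the simple linear regression topics above, a minimal sketch with the function `linregress` of the Scipy Stats module might look like this. The `hours` and `score` values are made-up sample data, not the Titanic dataset, and the threshold of 0.05 is only a commonly used convention.

```python
from scipy import stats

# Hypothetical sample data (an assumption, not from the course).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

# Fit the line score = slope * hours + intercept.
result = stats.linregress(hours, score)

slope = result.slope
intercept = result.intercept
r_value = result.rvalue      # Pearson correlation coefficient R
p_value = result.pvalue      # tests the null hypothesis that the slope is zero
std_error = result.stderr    # standard error of the estimated slope

# A small P-value suggests rejecting the null hypothesis of zero slope.
slope_is_significant = p_value < 0.05

# Using the calculated y-intercept and slope for a prediction.
prediction_at_6 = slope * 6 + intercept
```

The fitted line can then be drawn with Matplotlib by plotting `intercept + slope * x` over the range of `hours` on top of a scatterplot of the data.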
Capstone Project
Trainees will complete a Capstone Project under the supervision of an NCLab senior Data Analytics instructor in order to graduate and obtain a career certificate.