Data cleaning

Precleaning

Precleaning data formats (e.g. float32 for numeric values)
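As a rough illustration of standardising numeric formats, the sketch below converts values that arrived as text into 32-bit floats. It uses only the standard library `array` module (typecode "f" is a C float, 4 bytes on common platforms); the `raw` values are hypothetical, and in practice a library such as NumPy or pandas would more likely be used.

```python
from array import array

# Hypothetical raw values, e.g. read from a CSV file as strings.
raw = ["1.5", "2", "3.25"]

# Parse each string and store it as a single-precision (32-bit) float.
values = array("f", [float(x) for x in raw])

print(values.itemsize)  # bytes per element
print(list(values))
```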

Standardising file types

Joining data sets

Consistent variable naming

Concatenating data

Joining data

Cleaning categorical data

One Hot Encoding
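A minimal sketch of one-hot encoding, assuming a single categorical column held as a list of strings. The function name `one_hot` and the colour values are illustrative only.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical categorical column.
categories, encoded = one_hot(["red", "green", "red", "blue"])

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, in the position of that observation's category.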

Checking for consistency

Cross-consistency

Data shaping

Wide and long data

Introduction

Collapsing data

Cleaning text data

Bag-of-words
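A minimal sketch of a bag-of-words representation, assuming whitespace tokenisation and two hypothetical documents. Word order is discarded; only per-document counts over a shared vocabulary are kept.

```python
from collections import Counter

# Hypothetical documents, already lowercased and without punctuation.
docs = ["the cat sat", "the cat saw the dog"]

# Count tokens per document.
counts = [Counter(d.split()) for d in docs]

# Shared vocabulary across all documents, in a fixed (sorted) order.
vocab = sorted(set().union(*counts))

# One count vector per document, aligned with the vocabulary.
vectors = [[c[w] for w in vocab] for c in counts]

print(vocab)
print(vectors)
```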

N-grams

Introduction

We can add start- and end-of-sentence markers: * and STOP.

Generally, remove punctuation.
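The two points above can be sketched as follows. This is an illustration, not a definitive implementation: it assumes whitespace tokenisation, lowercasing, n - 1 start markers and a single STOP marker per sentence, and the function name `ngrams` is hypothetical.

```python
import string

def ngrams(sentence, n=2, start="*", stop="STOP"):
    """Return the n-grams of a sentence, padded with start/stop markers."""
    # Remove punctuation and lowercase before splitting on whitespace.
    table = str.maketrans("", "", string.punctuation)
    words = sentence.translate(table).lower().split()
    # Pad with n - 1 start markers and one stop marker.
    tokens = [start] * (n - 1) + words + [stop]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("The cat sat.", n=2)
print(bigrams)
```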

Feature hashing
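A minimal sketch of feature hashing (the "hashing trick"), assuming tokens are mapped to a fixed number of buckets by a stable hash. MD5 is used here only because Python's built-in `hash` for strings varies between runs; the bucket count and function name are illustrative choices.

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map tokens to a fixed-length count vector via a stable hash."""
    vec = [0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer bucket index.
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        vec[idx] += 1
    return vec

hashed = hash_features(["the", "cat", "sat", "the"])
print(hashed)
```

The vector length is fixed in advance, so no vocabulary needs to be stored; the cost is that different tokens can collide in the same bucket.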

Dropping variables

Sensitive information

Dropping unnecessary information, like names and derived variables
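As an illustration, the sketch below drops a name field and a derived field from a list of records. The records, the field names and the choice of `bmi` as the derived variable (computable from height and weight) are all hypothetical.

```python
# Hypothetical records; "bmi" is derived from height and weight.
records = [
    {"name": "A. Example", "height_m": 1.80, "weight_kg": 81.0, "bmi": 25.0},
    {"name": "B. Example", "height_m": 1.60, "weight_kg": 64.0, "bmi": 25.0},
]

# Fields to drop: identifying information and derived variables.
DROP = {"name", "bmi"}

cleaned = [{k: v for k, v in row.items() if k not in DROP} for row in records]
print(cleaned)
```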