Python Simple One-hot Encoding

One-hot Encoding

One-hot encoding transforms categorical data into a fixed-size numeric vector, where each index maps to one of the unique categories present in the input data. While you can use sklearn.preprocessing.OneHotEncoder or pandas.get_dummies to perform this transformation for you, sometimes it’s nice to be able to do this just with the Python standard library for quick’n’dirty scripts. Using Python’s built-in defaultdict data structure and itertools package, we can make a “dictionary-like” data structure that maps any hashable data to a unique integer. If the key has not been seen before, its value will be the next unique identifier (starting with 0), otherwise its index value will be returned.

Code

>>> import itertools
>>> from collections import defaultdict
>>> onehot = defaultdict(itertools.count().__next__)
>>> onehot['a']
0
>>> onehot[('b', 'c')]
1
>>> onehot['d']
2
>>> onehot['d']
2

A normal dict can easily be retrieved with:

>>> dict(onehot)
{'a': 0, ('b', 'c'): 1, 'd': 2}

and since there’s a one-to-one mapping, you can quickly retrieve the reverse mapping—of indices to categories—with the following dict comprehension:

>>> {v: k for (k, v) in onehot.items()}
{0: 'a', 1: ('b', 'c'), 2: 'd'}