Create a complete dataframe from scratch using Python
After submitting a recent article to the editorial team at Towards Data Science, I received a message back with a simple question: “Is the dataset licensed for commercial use?” Great question. The dataset in my draft came from Seaborn, a popular Python library with 17 sample datasets [1]. The datasets appear to be open source, and as expected, licenses allowing commercial use were easy to find for many of them. Unfortunately, I had selected one of the few datasets for which no license could be found. Instead of switching to another Seaborn dataset, I decided to create my own synthetic data.
What is synthetic data?
IBM’s Kim Martineau defines synthetic data as “computer-generated information that augments or replaces real data to improve AI models, protect sensitive data, and reduce bias” [2].
Synthetic data looks like information from real-world events, but it is not. Because no real records are involved, it avoids licensing issues and keeps sensitive and personal information out of the dataset.
Synthetic data differs from anonymized or masked data, which takes real data from real events and alters certain fields so that records cannot be attributed to the people behind them. If you want to anonymize names in your data, you can read a how-to on anonymizing names here.
Synthetic data doesn’t have to be perfect. The use case in my previous article was a guide to the pandas groupby() method in Python. All I needed was a dataset with numerical data, categorical data, and a domain my readers could understand (in this case, student test scores and grades) to help convey my message. Based on the work in that article, I provide below a guide for building your own synthetic datasets.
Code
A Jupyter notebook containing the complete Python code used in this tutorial is available on the linked GitHub page. Download or clone the repository to follow along.
The code requires the following libraries:
# Data Handling
import pandas as pd
import numpy as np
# Data visualization
import plotly.express as px
# Anonymizer:
from faker import Faker
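As a rough preview of how these libraries fit together, here is a minimal sketch (not the notebook's actual code) that builds a tiny synthetic table of student test scores using only pandas and NumPy. The column names, class names, score distribution, and grade cutoffs are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" data is reproducible
rng = np.random.default_rng(seed=42)
n_students = 10  # hypothetical sample size

# Numerical column: scores drawn from a normal distribution,
# rounded and clipped to the 0-100 range (assumed parameters)
raw_scores = rng.normal(loc=75, scale=12, size=n_students).round()
scores = np.clip(raw_scores, 0, 100).astype(int)

# Categorical column: letter grades derived from the scores (assumed cutoffs)
grades = pd.cut(
    scores,
    bins=[-1, 59, 69, 79, 89, 100],
    labels=["F", "D", "C", "B", "A"],
)

df = pd.DataFrame({
    "student_id": np.arange(1, n_students + 1),
    "class_name": rng.choice(["Math", "Science", "History"], size=n_students),
    "score": scores,
    "grade": grades,
})

print(df.head())
```

Seeding the generator means the same table is produced on every run, which helps when readers want to reproduce the figures in a tutorial.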