Synthesizing Realistic Data
Useful for testing and toy datasets
In the process of writing about a pandas function, I realized I needed a realistic dataset to effectively demonstrate its use.
Here's how you can use Faker to do this.
Start by installing Faker with pip:
pip install faker
Created: 2021-04-30
Updated: 2022-01-04
Using Faker
from faker import Faker
fake = Faker()
print(f"Hello, my name is {fake.first_name()} {fake.last_name()}.\n"
      f"I'm a {fake.job()} at {fake.company()}.")
Hello, my name is Kevin Lane.
I'm a Call centre manager at Odonnell-Harrell.
Using Faker in a Fixture Factory
Another obvious use case for synthetic data is unit testing.
While it's simple to generate random, independent variables, the same cannot be said for complex, multidimensional models that incorporate business logic.
A pattern I've found useful is to construct a model using a data structure that mimics one that we would use in production. This concept is known as the
By leveraging
Let's create a ShopperFactory class that we can use to generate instances of our Shopper class:
from dataclasses import dataclass
from datetime import datetime

from faker import Faker


@dataclass
class Shopper:
    """The object we want to generate"""
    id: int
    username: str
    timestamp: datetime
    product_id: int
    action: str


class ShopperFactory:
    """
    Factory that produces Fake Shoppers

    Parameters
    ----------
    active_begin: str
        Leftmost window of time to include for timestamp
    active_end: str
        Rightmost window of time to include for timestamp
    recurring: float
        How often to generate unique shoppers
    """
    F = Faker()

    def __init__(self, active_begin='-30d', active_end='now', recurring: float = 0.15,
                 actions=('view', 'add_to_cart', 'save', 'share', 'purchase')):
        self.active_begin = active_begin
        self.active_end = active_end
        self.recurring = recurring
        self.actions = actions

    def create(self) -> Shopper:
        return Shopper(
            id=self.F.pyint(),
            username=self.F.user_name(),
            timestamp=self.F.date_time_between(self.active_begin, self.active_end),
            product_id=int(self.F.ean()),
            action=self.F.random.choice(self.actions)
        )
Making a dataset
Now we can create data!
# Create 100 unique shoppers
factory = ShopperFactory()
shoppers = [factory.create() for _ in range(100)]
import pandas as pd
df = pd.DataFrame(shoppers)
# Order by timestamp
df = df.sort_values('timestamp').reset_index(drop=True)
df.head(20)
| | id | username | timestamp | product_id | action |
|---|---|---|---|---|---|
0 | 9334 | starksheryl | 2021-12-07 00:20:27 | 2481094721503 | add_to_cart |
1 | 8961 | emilylewis | 2021-12-07 19:49:39 | 7680724944021 | share |
2 | 2890 | torresdenise | 2021-12-07 23:46:27 | 5757279973652 | save |
3 | 204 | teresasmith | 2021-12-08 02:41:39 | 300613065442 | add_to_cart |
4 | 4039 | hectorking | 2021-12-08 11:01:59 | 7504905191595 | purchase |
5 | 1572 | lauramorgan | 2021-12-09 02:02:36 | 6301670090402 | save |
6 | 486 | ralph65 | 2021-12-09 04:27:31 | 3343561140738 | share |
7 | 5512 | pamela43 | 2021-12-09 10:05:23 | 9653097856077 | save |
8 | 2946 | ryan45 | 2021-12-09 11:45:00 | 3632738352129 | purchase |
9 | 2136 | xnixon | 2021-12-09 12:42:39 | 5948317234532 | purchase |
10 | 4081 | hughesandre | 2021-12-09 17:45:03 | 5393312494731 | save |
11 | 5846 | spencerblanchard | 2021-12-10 06:20:17 | 9459623162946 | view |
12 | 7888 | kathleen43 | 2021-12-10 10:31:56 | 9871057415724 | purchase |
13 | 5070 | gfrench | 2021-12-10 11:06:15 | 4802957773084 | view |
14 | 7805 | opacheco | 2021-12-10 13:43:48 | 2497493920874 | share |
15 | 1458 | ilucas | 2021-12-11 08:31:41 | 1194856071259 | view |
16 | 9574 | charlesvasquez | 2021-12-13 15:02:27 | 6950640703752 | share |
17 | 6810 | allenbrenda | 2021-12-13 19:28:32 | 9628842019694 | view |
18 | 9191 | savannah52 | 2021-12-13 20:44:41 | 9861781795148 | save |
19 | 3041 | david41 | 2021-12-13 23:52:03 | 440843721630 | share |
This simple example generates valid Shopper objects on demand.
However, the dataset as a whole may not accurately reflect one in production.
- We have no logic in place that ensures id and username are always associated.
- There should be repeat users.
- A purchase should necessarily be preceded by a "view" and an "add_to_cart" for the same product id. These events should be in order (with regard to "timestamp").
- ...etc
Having the dataset make sense will require some additional work!
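One possible direction for that extra work is sketched below. It is not the factory from above: `Event`, `FUNNEL`, and `make_events` are hypothetical names, the "save" and "share" actions are left out for brevity, and `recurring` is assumed to mean the probability that a session comes from a shopper we have already seen. To stay self-contained it uses the standard library's `random` module and a hand-written user pool; in practice the pool could be built with Faker's `pyint()` and `user_name()`, as in ShopperFactory.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed order of the events that lead up to a purchase
FUNNEL = ("view", "add_to_cart", "purchase")


@dataclass
class Event:
    """Hypothetical stand-in with the same fields as Shopper."""
    id: int
    username: str
    timestamp: datetime
    product_id: int
    action: str


def make_events(users, product_ids, n_sessions, start, rng, recurring=0.15):
    """Generate events where id/username stay paired and funnel steps are ordered.

    users must be a non-empty list of (id, username) pairs.
    """
    unseen = list(users)
    rng.shuffle(unseen)
    seen = []
    events = []
    for _ in range(n_sessions):
        # Reuse a known shopper with probability `recurring`,
        # or always once the pool of new shoppers is exhausted
        if seen and (not unseen or rng.random() < recurring):
            user_id, username = rng.choice(seen)
        else:
            user_id, username = unseen.pop()
            seen.append((user_id, username))
        product_id = rng.choice(product_ids)
        # Each session stops somewhere along the funnel
        depth = rng.randint(1, len(FUNNEL))
        # Sessions start at a random minute within a 30-day window
        ts = start + timedelta(minutes=rng.randint(0, 30 * 24 * 60))
        for action in FUNNEL[:depth]:
            events.append(Event(user_id, username, ts, product_id, action))
            # Later steps always happen strictly after earlier ones
            ts += timedelta(seconds=rng.randint(5, 600))
    return sorted(events, key=lambda e: e.timestamp)


rng = random.Random(0)
users = [(1, "starksheryl"), (2, "emilylewis"), (3, "ralph65")]
events = make_events(users, [101, 102], n_sessions=10,
                     start=datetime(2021, 12, 1), rng=rng)
```

Since every session walks the funnel from the start, a "purchase" only appears after a "view" and an "add_to_cart" by the same shopper for the same product. The resulting list can be passed to `pd.DataFrame(events)` in the same way as before.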