
The Difficulties of Generating Fake Data

Published on Oct 22, 2020 by Ben Tyler

Background

The majority of the applications I work on focus on distilling the complexities of data into simple interfaces. This was the crux of my work with LRE Water and is even more so with my new company Stae. Many different types and structures of data are needed to adequately test dashboards and data visualizations. For ease of development, I often generate fake datasets.

In the past, I have relied on Google Sheets and a combination of array formulas, vlookups, and random value generators to create fake datasets. This approach worked for small, one-off datasets, but any time I needed to generate more than 1000 records or update a dataset, it proved tedious. I also explored Faker and Casual (both of which are fantastic tools) but found that the data they generate was too abstract and that the implementation wasn’t ideal for large datasets. I wanted a tool that:

  • makes it easy to generate realistic fake datasets of any size,
  • is highly configurable,
  • makes updating a dataset trivial,
  • outputs the data in a format that is easy to work with.

To solve this issue, I started developing Data Spring and the associated CLI.

Case Study

Let’s say we are developing a dashboard for visualizing a state government’s daily spending by department over the past 5 years. The dataset has the following requirements:

  1. Multiple spending values are needed for each day for each department over the last 5 years.
  2. We need to be able to specify a realistic list of departments for which data will be generated.
  3. The spending values need to be within a realistic range.
  4. The data needs to be output in JSON format.

The data structure should be as follows:

    [
      {
        "date": "2020-01-01T08:00:00.007Z",
        "department": "Transportation",
        "spending": 10000
      },
      ...many more records
    ]
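Whichever tool produces the data, it can help to check a few records against this shape before wiring them into a dashboard. Here is a minimal sketch of such a check. Note that isValidRecord is a hypothetical helper (it is not part of Data Spring), and the department list and spending bounds are assumptions chosen for illustration.

```javascript
// Hypothetical helper, not part of Data Spring: checks one record against
// the structure described above. The department list and the default
// spending bounds are illustrative assumptions.
const DEPARTMENTS = ["Transportation", "Environment", "Public Health", "Parks and Rec"];

function isValidRecord(record, { minSpending = 5000, maxSpending = 100000 } = {}) {
  if (record === null || typeof record !== "object") return false;

  // The date must be a string that JavaScript can parse into a timestamp.
  const dateIsValid =
    typeof record.date === "string" && !Number.isNaN(Date.parse(record.date));

  // The department must come from the known list.
  const departmentIsValid = DEPARTMENTS.includes(record.department);

  // The spending value must be a number inside the allowed range.
  const spendingIsValid =
    typeof record.spending === "number" &&
    record.spending >= minSpending &&
    record.spending <= maxSpending;

  return dateIsValid && departmentIsValid && spendingIsValid;
}

const sample = {
  date: "2020-01-01T08:00:00.007Z",
  department: "Transportation",
  spending: 10000,
};

console.log(isValidRecord(sample)); // true
```

Running every generated record through a check like this is a cheap way to catch a bad range or a misspelled department early.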

The Google Sheet Approach

The dataset is relatively simple but still tedious to generate. Using Google Sheets, we would need to implement some sort of date step interval logic for the date field, randomly select a department using a combination of a random number and vlookup, and then generate a random number between our thresholds. Then we would need to copy our formulas down thousands of rows or use an array formula and deal with the incredibly slow resulting spreadsheet. Lastly, we would need to download the spreadsheet as a CSV and then convert it to JSON.
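To make those three spreadsheet steps concrete, here is a rough hand-rolled sketch of the same logic in plain JavaScript: step the date, pick a random department, and pick a random value between two thresholds. This is not how Data Spring works internally. The function name generateSpending and all of its parameters are hypothetical and exist only for this illustration.

```javascript
// Hand-rolled sketch of the spreadsheet logic. Every name here is
// hypothetical; none of it comes from Data Spring.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function generateSpending({
  start,
  days,
  recordsPerDay,
  departments,
  min,
  max,
  random = Math.random, // injectable so the output can be made repeatable
}) {
  const startMs = Date.parse(start);
  const records = [];

  for (let day = 0; day < days; day++) {
    // Date step: advance one day per outer iteration.
    const date = new Date(startMs + day * MS_PER_DAY).toISOString();

    for (let i = 0; i < recordsPerDay; i++) {
      // Random department: the job of the random number + vlookup combination.
      const department = departments[Math.floor(random() * departments.length)];
      // Random integer between the two thresholds, inclusive.
      const spending = Math.floor(random() * (max - min + 1)) + min;
      records.push({ date, department, spending });
    }
  }

  return records;
}

const records = generateSpending({
  start: "2020-01-01T00:00:00.000Z",
  days: 3,
  recordsPerDay: 2,
  departments: ["Transportation", "Environment"],
  min: 5000,
  max: 100000,
});

// Requirement 4: the result serializes straight to JSON.
console.log(JSON.stringify(records, null, 2));
```

Even this small version needs its own loop bookkeeping and range math, and every new field type means more custom code.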

The same scenario with Data Spring

Data Spring is a config-based data generator, meaning you define the desired shape of your data using an object or a JSON config file. To create the dataset in the case study, you would do the following:

    import { DataSpring } from "data-spring"

    const config = [
      { id: "record_id", type: "id" },
      {
        id: "date",
        type: "date",
        interval: {
          // i.e. 'hour' | 'day' | 'month' | 'year'
          type: "hour",
          // # of records to generate before stepping to next interval
          recordsPerInterval: 2,
        },
        min: "2015-01-01 00:00:00",
        max: "2020-12-31 00:00:00",
      },
      {
        id: "department",
        type: "string",
        values: ["Transportation", "Environment", "Public Health", "Parks and Rec"],
      },
      {
        id: "spending",
        type: "number",
        min: 5000,
        max: 100000,
      },
    ]

    const dataset = DataSpring(config)

Pretty simple, huh? If you realize you forgot to add a department or that the spending range is wrong, fixing it is as easy as updating the config variable.
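As a small illustration of that kind of update, the sketch below patches two fields of a config array without touching the original. It only manipulates plain objects and does not call Data Spring. The updateField helper, the "Education" department, and the new spending range are all made up for this example.

```javascript
// Two fields from the config above, repeated so this snippet stands alone.
const config = [
  {
    id: "department",
    type: "string",
    values: ["Transportation", "Environment", "Public Health", "Parks and Rec"],
  },
  { id: "spending", type: "number", min: 5000, max: 100000 },
];

// Hypothetical helper: returns a new config with one field patched,
// leaving the original array and its objects untouched.
function updateField(fields, id, patch) {
  return fields.map((field) => (field.id === id ? { ...field, ...patch } : field));
}

// Add a forgotten department ("Education" is an illustrative value).
const departmentField = config.find((field) => field.id === "department");
let updatedConfig = updateField(config, "department", {
  values: [...departmentField.values, "Education"],
});

// Correct the spending range (the new bounds are illustrative values).
updatedConfig = updateField(updatedConfig, "spending", { min: 1000, max: 250000 });

console.log(JSON.stringify(updatedConfig, null, 2));
```

The updated array would then be handed to DataSpring in place of the original config to regenerate the dataset.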

Wrapping up

This was a pretty contrived example, but hopefully it demonstrated how Data Spring can be used to easily generate fake datasets. If you have feedback or comments, please reach out below.

About Me

Howdy, I am Ben Tyler. I specialize in developing aesthetic and functional websites and applications. I love collaborating on projects, so if you need to hire a developer or an adviser for a project, please get in touch!
