Overview

fodder is a command line tool that generates data according to a supplied schema. It is aimed primarily at rapid prototyping and testing, letting you generate data quickly and easily so you can focus on the important stuff! Its feature set is not comprehensive, but it is sufficient for most use cases.

It follows the Unix philosophy of doing just one thing, and doing it well. Output always goes to standard output so it can be piped to another command or redirected to a file.

Features

  • Supports CSV, JSON and NDJSON output formats
  • Field values can be a lookup in an external file, see Maps
  • Field values can be selected from a list defined in an external file, see Categories
  • Schema files can be dynamically linted when using Visual Studio Code with a little setup, see Linting and Autocompletion
  • Single binary distribution, easily installed on any platform, see Installation

Use cases

  • Prototyping (POCs)
  • Supplying data to development environments
  • End-to-end pipeline testing

Goals

The eventual goal is to make it so easy to generate fake data that it becomes a standard part of the development process and it is no longer necessary to use real data outside of production environments.

Why fodder?

There are a number of data generation tools and options out there but there wasn't anything that quite fit. So here's the user story:

As a data engineer, I'd like to quickly generate data for my tables or message stream.

Here are some key drivers for fodder...

  • A single binary. Easy to install and update.
  • Scales with hardware so it can utilise bigger and/or more machines.
  • Straight-forward tool focused on just data generation.
  • Compose your specific solution with other tools.
  • Fairly fast and efficient so useful for generating large data sets.
  • No exposing your schemas to third parties.
  • No external service so can run in secured, isolated environments.
  • Get going quickly. Minimal dependencies mean minimal impediments.

You can use it in your shell as part of your local development workflow. You can use it in your CI/CD pipelines for testing. Remember to bake in failure modes, not just the happy path ;) You can put it in a container and run it at scale for integration & performance testing.

Other tools

Faker (Javascript, Python, Ruby etc)

These libraries are great, but you need to write your custom generator as its own thing. They're powerful, flexible and full-featured, but there is a lot of boilerplate just to get up and running, and using and maintaining them effectively takes time and expertise that can be daunting. That doesn't scale well across projects, and the software development skills required are a barrier for many teams and organisations. We wanted something that is easier to get going with and that addresses the common cases.

Mockaroo, GenRocket etc

These require commercial arrangements. They use a service outside our environment which can be a complete showstopper for some organisations. Using them at scale, especially around performance testing can be problematic/expensive.

JSONPlaceholder, MockServer, Mockbin

These are designed for mocking out an API, not really flexible data generation at scale.

Installation

Pre-compiled binaries (macOS only)

  1. Download the binary for your system from the GitLab releases page.

  2. Place the binary in your PATH and make it executable.

    cp fodder /usr/local/bin
    chmod +x /usr/local/bin/fodder
    
  3. Run the fodder command to verify that it is working. The first time you run it you will get a warning about the binary being from an unidentified developer. You will need to go to System Preferences > Security & Privacy > General and click Open Anyway to allow the binary to run.

Install from source

This should work for any system that Rust supports. You will need to have Rust installed. See the Rust installation guide for more information.

  1. Clone the repository

  2. Build the binary

    cargo build --release
    
  3. Copy the binary to your PATH

    cp target/release/fodder /usr/local/bin
    
  4. Run the fodder command to verify that it is working.

Usage

This section outlines the basic features of fodder and how to use them.

CLI

fodder --help
A data generation tool

Usage: fodder [OPTIONS] --schema <SCHEMA>

Options:
  -s, --schema <SCHEMA>  Path to the schema file
  -n, --nrows <NROWS>    Number of rows to generate [default: 3]
  -f, --format <FORMAT>  Output format [default: json] [possible values: csv, json, ndjson]
  -d, --definition       Show the JSON schema definition for allowed inputs, useful for autocompletion
  -h, --help             Print help
  -V, --version          Print version

Quickstart

Create a schema file

Here's an example schema file displaying some of the features (there are many more).

cat > schema.fodder.yaml <<EOF
fields:
  # Generate a number between 0 and 10
  - name: A
    type: IntegerInRange
    args:
      min: 0
      max: 10
  # Generate a number between 0 and 20
  - name: B
    type: IntegerInRange
    args:
      max: 20
    # That is greater than A
    constraints:
      - type: GreaterThan
        name: A
  # Generate a datetime
  - name: C
    type: DateTime
    # With a 90% probability of being null
    null_probability: 0.9
  # Generate a sentence if C is null
  - name: D
    type: String
    args:
      subtype: Sentences
    constraints:
      - type: IfNull
        name: C
EOF

Generate some data

fodder --schema schema.fodder.yaml --format csv --nrows 3
A,B,C,D
9,18,,aut nostrum quod vero ratione in numquam qui temporibus.
7,14,2592-10-28T02:43:00+00:00,
4,14,,accusantium omnis aperiam velit est ea in et.

Schema

The schema is a YAML file that defines the structure of your data. It is used to control the generation of the data.

There are three major components to the schema: fields, categories and maps.

A complete example

This is a mostly complete example of a schema; it is missing some field-specific arguments. For more information on the arguments for each field, see the Fields documentation.

It is also missing external data. For more information on external data, see the External Data documentation.

As with all schemas, this can be used to output JSON or CSV data. However, in this case JSON is used as it is easier to show the complexity of the nested fields in the output.

Definition

fields:
  # Generate a number between 0 and 10.
  - name: A
    type: IntegerInRange
    args:
      min: 0
      max: 10
  # Generate a number between 0 and 20 that is greater than A.
  - name: B
    type: IntegerInRange
    args:
      max: 20
    constraints:
      - type: GreaterThan
        name: A
  # Generate a sentence if E is null.
  - name: C
    type: String
    args:
      subtype: Sentences
    constraints:
      - type: IfNull
        name: E
  # Generate a random boolean.
  - name: D
    type: Boolean
  # Generate a datetime with a 10% probability (90% chance of being null).
  - name: E
    type: DateTime
    null_probability: 0.9
  # Generate a string from a list of choices with unequal probability.
  - name: F
    type: WeightedCategory
    args:
      choices:
        - ["FOO", 2]
        - ["BAR", 1]
  # Generate a string from a list of choices with equal probability.
  - name: G
    type: WeightedCategory
    args:
      choices:
        - "FOO"
        - "BAR"
  # Generate a string with random substitutions.
  - name: H
    type: Bothify
    args:
      format: "RANDOM_ID: ??-##-??"
  # Generate a nested object.
  - name: I
    type: Nested
    fields:
      - name: J
        type: IntegerInRange
        args:
          min: 0
          max: 10
      - name: K
        type: IntegerInRange
        args:
          min: 0
          max: 11
        constraints:
          - type: GreaterThan
            name: A
      - name: L
        type: IntegerInRange
        args:
          min: 0
          max: 13
      - name: M
        type: Nested
        fields:
          - name: N
            type: IntegerInRange
            args:
              min: 0
              max: 12
          - name: O
            type: IntegerInRange
            args:
              min: 0
              max: 13
            # Reference a nested field.
            constraints:
              - type: GreaterThan
                name: I.M.N
  # Generate a list of nested objects.
  - name: P
    type: Nested
    args:
      subtype: List
    fields:
      - name: Q
        type: Nested
        fields:
          - name: R
            type: IntegerInRange
            args:
              min: 0
              max: 10
      - name: S
        type: Nested
        fields:
          - name: T
            type: IntegerInRange
            args:
              min: 0
              max: 10
  # Demonstrate templating and referencing other fields.
  - name: U
    type: FormattedString
    args:
      format: |
        This is the value of A: {{ refs['A'].raw }}
        This is the value of A + 10: {{ refs['A'].raw + 10 }}
  # Generate an array of integers of varying length.
  - name: V
    type: Array
    args:
      min: 0
      max: 5
      field:
        name: W
        type: IntegerInRange
        args:
          min: 0
          max: 10
  # Generate a "FreeEmail" by deferring to the `fake-rs` library.
  - name: X
    type: Fakers
    args:
      subtype: FreeEmail
  # Calculate a duration between two dates, generally used by referencing other fields.
  - name: Y
    type: Duration
    args:
      start: "2010-01-01T00:00:00Z"
      end: "2020-01-01T00:00:00Z"
      component: Years
  # Look up a value in a map. Used by referencing other fields. Maps are
  # defined in the `maps` section and are referenced by name.
  # - name: Z
  #   type: Map
  #   args:
  #     from_map: ID_TO_NAME
  #     key: A
  #     default: "No value found"

Output

fodder -s schema.yaml -n 1 -f json
[
  {
    "A": 3,
    "B": 4,
    "C": "dolorem praesentium rerum vel ipsum dolorum veritatis.",
    "D": false,
    "E": null,
    "F": "BAR",
    "G": "BAR",
    "H": "RANDOM_ID: zS-98-TO",
    "I": {
      "J": 3,
      "K": 4,
      "L": 1,
      "M": {
        "N": 3,
        "O": 9
      }
    },
    "P": [
      {
        "R": 3
      },
      {
        "T": 1
      }
    ],
    "U": "This is the value of A: 3\nThis is the value of A + 10: 13\n",
    "V": [],
    "X": "kyle_excepturi@gmail.com",
    "Y": "10"
  }
]

Fields

Fields are the core of the schema, they define the structure of the data that will be generated. They are defined in the fields section of the schema.

A simple example

Below is a simple example of a schema with a single field. This field contains most of the possible options that can be set for a field, excluding Constraints and some features that are only available for certain field types, e.g. Nested fields and fields that allow Reference constraints.

Some details about this example:

  • This field is named ID and this is how it will be named in the output.
  • It is of type IntegerInRange meaning it will generate a random integer within a range.
  • It has two (optional) arguments, min and max. These are used to define the range of the integers that will be generated. In this case it is showing the default values of 0 and 9223372036854775807 (i64::MAX).
  • It has a null_probability of 0 meaning that it will never generate a null value.
fields:
  - name: ID
    type: IntegerInRange
    args:
      min: 0
      max: 9223372036854775807
    null_probability: 0

Running the above schema through fodder will generate the following output (in JSON format):

fodder -s schema.yaml
[
  {
    "ID": 4350876185243800642
  },
  {
    "ID": 3998117975203216203
  },
  {
    "ID": 2709943470313341799
  }
]
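
Conceptually, a field like ID above boils down to a guarded uniform draw. The following Python sketch is a hypothetical model of that behaviour, not fodder's actual implementation:

```python
import random

I64_MAX = 9223372036854775807  # i64::MAX, the documented default for `max`

def integer_in_range(rng, lo=0, hi=I64_MAX, null_probability=0.0):
    """Return None with probability `null_probability`, otherwise a uniform integer in [lo, hi]."""
    if rng.random() < null_probability:
        return None
    return rng.randint(lo, hi)

rng = random.Random(42)
rows = [{"ID": integer_in_range(rng)} for _ in range(3)]
```

Constraints such as GreaterThan would then be layered on top of draws like this one.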

A more complex example (with constraints)

Below is a more complex example of a schema with multiple fields. This example shows how to use Constraints to ensure that the generated data is representative of the real world.

Some details about this example:

  • All fields will generate a random DateTime.
  • The CreatedAt field will generate a random DateTime between 3 days ago and today.
  • The ModifiedAt field will generate a random DateTime that is greater than CreatedAt and between 3 days ago and today.
  • The DeletedAt field will generate a random DateTime that is greater than ModifiedAt and between 3 days ago and today. It will also have a null_probability of 0.9 meaning that it will have a 90% chance of being null.
fields:
  - name: CreatedAt
    type: DateTime
    args:
      start: -3d
      end: today
      format: "%Y-%m-%d %H:%M:%S"
  - name: ModifiedAt
    type: DateTime
    args:
      start: -3d
      end: today
      format: "%Y-%m-%d %H:%M:%S"
    constraints:
      - type: GreaterThan
        name: CreatedAt
  - name: DeletedAt
    type: DateTime
    args:
      start: -3d
      end: today
      format: "%Y-%m-%d %H:%M:%S"
    null_probability: 0.9
    constraints:
      - type: GreaterThan
        name: ModifiedAt

Running the above schema through fodder will generate the following output (this time in CSV format):

fodder -f csv -s schema.yaml
CreatedAt,ModifiedAt,DeletedAt
2023-01-30 12:49:54,2023-01-31 19:45:54,2023-02-01 22:50:54
2023-01-30 11:57:54,2023-01-31 22:59:54,
2023-02-01 10:27:54,2023-02-01 23:22:54,

IntegerInRange

The IntegerInRange field generates a random integer between a minimum and maximum value.

Schema

fields:
  - name: Zero
    type: IntegerInRange
    args:
      min: -20
      max: 20
    null_probability: 0
  - name: One
    type: IntegerInRange
    null_probability: 0
    args:
      min: 0
      max: 500
    constraints:
      - type: GreaterThan
        name: Zero

Output

Zero,One
-20,56
1,66
-13,294

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| min | int | The minimum value | 0 |
| max | int | The maximum value | i64::MAX |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| GreaterThan | The value must be greater than the value of another field. |
| IfNull | The value must only be non-null if another field is null. |

String

The String field generates a random string of various types and lengths.

Schema

fields:
  - name: Words
    type: String
    args:
      subtype: Words
      range:
        start: 0
        end: 10
    null_probability: 0.5
  - name: Paragraph
    type: String
    args:
      subtype: Paragraphs
      range:
        start: 1
        end: 3
    constraints:
      - type: IfNull
        name: Words

Output

WordsParagraph
quo repellat qui voluptatem dolor.
perspiciatis sapiente aut voluptatibus molestias qui a placeat.
dicta animi distinctio est.
consequuntur fugit praesentium vero.
natus omnis reiciendis officia.
quia sequi esse qui.
est animi voluptas deleniti id.
sint quia cumque eum.
illo incidunt quo adipisci recusandae.
temporibus molestiae rerum culpa.
perspiciatis voluptas qui
sint illum itaque totam

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| subtype | string | The type of string to generate. One of Words, Sentences, Paragraphs. | Words |
| range.start | Range | The minimum number of 'things' to generate. | 1 |
| range.end | Range | The maximum number of 'things' to generate. | 2 |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Boolean

The Boolean field is used to generate a random boolean value.

Schema

fields:
  - name: BoolMaybeNull
    type: Boolean
    null_probability: 0.5
  - name: Bool
    type: Boolean
    constraints:
      - type: IfNull
        name: BoolMaybeNull

Output

BoolMaybeNullBool
true
true
true
true
false

Arguments

This field has no arguments.

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

DateTime

The DateTime field generates a random date and time between a start and end.
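
Sampling a random instant between two endpoints reduces to picking a uniform offset across the span. A minimal Python sketch of the idea (illustrative only; fodder's internals may differ):

```python
import random
from datetime import datetime, timedelta, timezone

def datetime_in_range(rng, start, end):
    """Sample a uniform instant between start and end."""
    span = (end - start).total_seconds()
    return start + timedelta(seconds=rng.uniform(0, span))

rng = random.Random(9)
start = datetime(2000, 1, 1, tzinfo=timezone.utc)
end = datetime(3000, 1, 1, tzinfo=timezone.utc)
dt = datetime_in_range(rng, start, end)
```

A GreaterThan constraint can then be honoured by re-sampling from a narrower range whose start is the referenced field's value.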

Schema

fields:
  - name: Created
    type: DateTime
    args:
      start: "-10d"
      end: "yesterday"
      timezone: Australia/Perth
      format: "%+"
  - name: Deleted
    type: DateTime
    args:
      start: "-10d"
      end: "today"
      timezone: Australia/Sydney
      format: "%+"
    constraints:
      - type: GreaterThan
        name: Created
  - name: Refs
    type: DateTime
    args:
      # This is how you reference another field.
      start: "{{ refs['Choice'].raw }}"
      end: "2010-01-01T00:00:00Z"
      format: "%A, %B %e, %Y"
    # You must define the fields that you want to reference.
  - name: Choice
    type: WeightedCategory
    args:
      choices:
        - "2000-01-01T00:00:00Z"
        - "1900-01-01T00:00:00Z"

Output

Created,Deleted,Refs,Choice
2023-02-01T13:29:35.622865+08:00,2023-02-02T08:37:35.622865+11:00,"Saturday, August 12, 1911",1900-01-01T00:00:00Z
2023-01-27T21:37:35.622865+08:00,2023-02-02T01:53:35.622865+11:00,"Saturday, July 26, 1986",1900-01-01T00:00:00Z
2023-01-26T03:38:35.622865+08:00,2023-01-30T09:13:35.622865+11:00,"Wednesday, September 26, 2007",2000-01-01T00:00:00Z

Arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| start | string | The start date. This will accept a variety of inputs, details can be found here. | 2000-01-01T00:00:00Z |
| end | string | The end date. This will accept a variety of inputs, details can be found here. | 3000-01-01T00:00:00Z |
| timezone | string | The timezone to use. | UTC |
| format | string | The format to use. The complete list of specifiers can be found here. | "%+" |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| GreaterThan | The value must be greater than the value of another field. |
| IfNull | The value must only be non-null if another field is null. |

Digit

The Digit field type generates a random digit.

Schema

fields:
  - name: Digit One
    type: Digit
    null_probability: 0.5
  - name: Digit Two
    type: Digit
    constraints:
      - type: IfNull
        name: Digit One

Output

Digit OneDigit Two
1
3
8
6
4

Arguments

This field has no arguments.

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

WeightedCategory

The WeightedCategory field type allows you to specify a list of choices, and an optional weight for each choice.

You can specify categories inline or externally in a CSV file; the external CSV file is the preferred approach.

Schema

# Load categories from CSV files.
categories:
  - name: LETTERS
    file: "data/LETTERS.csv"
  - name: LETTERS_WEIGHTED
    file: "data/LETTERS_WEIGHTED.csv"
fields:
  # Simple choices, with equal probability.
  - name: simple
    type: WeightedCategory
    null_probability: 0.5
    args:
      choices:
        - "FOO"
        - "bar"
  # Simple choices, with weighted probability.
  - name: simple_weighted
    type: WeightedCategory
    args:
      choices:
        - ["FOO", 1.0]
        - ["bar", 0.5]
  # Choices from a file, with equal probability.
  - name: file
    type: WeightedCategory
    args:
      from_category: LETTERS
  # Choices from a file, with weighted probability.
  - name: file_weighted
    type: WeightedCategory
    args:
      from_category: LETTERS_WEIGHTED
# data/LETTERS.csv
LETTER
H
E
L
L
O
# data/LETTERS_WEIGHTED.csv
LETTER,WEIGHT
H,4
E,1
L,1
L,1
O,1

Output

simplesimple_weightedfilefile_weighted
FOObarL
barFOOO
barHH
barFOOE
barHL

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| choices | list | A list of choices to select from. They can be specified both with and without weights as per the schema above. | [] |
| from_category | string | The name of a category to use. This is a reference to a name defined in the categories section. See Categories for more information. | "" |

Note: choices and from_category are mutually exclusive.

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Bothify

The Bothify field generates a string by replacing selected symbols with random characters.

Schema

fields:
  - name: Bothify One
    type: Bothify
    args:
      format: "^^ ## ??"
    null_probability: 0.5
  - name: Bothify Two
    type: Bothify
    args:
      format: "^^ ## ??"
    constraints:
      - type: IfNull
        name: Bothify One
  - name: Bothify Three
    type: Bothify
    args:
      format: |
        {%- set one = refs['Bothify One']['raw'] -%}
        {%- set two = refs['Bothify Two']['raw'] -%}
        {%- if one -%}
        Bothify One: {{ one }}
        {%- elif two -%}
        Bothify Two: {{ two }}
        {%- else -%}
        {%- endif -%}

Output

Bothify One,Bothify Two,Bothify Three
53 09 UR,,Bothify One: 53 09 UR
74 91 aR,,Bothify One: 74 91 aR
35 79 jq,,Bothify One: 35 79 jq
,85 89 bv,Bothify Two: 85 89 bv
92 55 tY,,Bothify One: 92 55 tY
,24 60 xq,Bothify Two: 24 60 xq

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| format | string | The format of the string to generate. | |

Format

The format string is a string that contains symbols that will be replaced with random characters. The following symbols are supported:

| Symbol | Description |
|--------|-------------|
| ^ | A random digit [1-9] |
| # | A random digit [0-9] |
| ? | A random letter [a-zA-Z] |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Nested

The Nested field allows you to generate data that is nested within other data. This is particularly useful when outputting data in JSON format.
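
One way to picture Nested fields is as a recursive generator: each Nested spec produces a dict of its children, and subtype: List emits the children's values as a list instead. A toy Python sketch (the field-spec shape and Digit handler here are simplifications, not fodder's real data model):

```python
import random

def generate_field(rng, spec):
    """Recursively generate a value: Nested specs become dicts (or lists for subtype List)."""
    if spec["type"] == "Nested":
        obj = {f["name"]: generate_field(rng, f) for f in spec.get("fields", [])}
        if spec.get("subtype") == "List":
            return list(obj.values())  # List subtype emits the children's values in order
        return obj
    if spec["type"] == "Digit":
        return rng.randint(0, 9)
    raise ValueError("unsupported type: " + spec["type"])

spec = {"name": "I", "type": "Nested", "fields": [
    {"name": "J", "type": "Digit"},
    {"name": "M", "type": "Nested", "fields": [{"name": "N", "type": "Digit"}]},
]}
row = generate_field(random.Random(2), spec)
```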

Schema

fields:
  - name: A Top
    type: Nested
    constraints:
      - type: IfNull
        name: B Top.B Nested
  - name: B Top
    type: Nested
    fields:
      - name: B Nested
        type: Nested
        null_probability: 0.5
  - name: C List
    type: Nested
    args:
      subtype: List
    fields:
      - name: C Nested 1
        type: Digit
      - name: C Nested 2
        type: String

Output

[
  {
    "A Top": null,
    "B Top": {
      "B Nested": {}
    },
    "C List": [1, "reiciendis"]
  }
]

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| subtype | string | The type of the nested field. One of Object, List. | Object |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| fields | list | A list of child fields of any type. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

FormattedString

The FormattedString field allows you to generate a string that is formatted according to a template, see Templating. Templates are quite powerful and allow you to generate a wide variety of data. For example, you could generate a string that is a combination of a first name, last name and a random number, e.g. John Smith 1234.
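
At its core the feature is string substitution against previously generated fields. This toy Python renderer handles only the {{ refs['X'].raw }} form (the real engine also supports arithmetic, filters such as date, and control flow); the Num field name is made up for the example:

```python
import re

def render(template, refs):
    """Replace occurrences of {{ refs['X'].raw }} with the referenced field's value."""
    pattern = re.compile(r"\{\{\s*refs\['([^']+)'\]\.raw\s*\}\}")
    return pattern.sub(lambda m: str(refs[m.group(1)]), template)

out = render("John Smith {{ refs['Num'].raw }}", {"Num": 1234})
```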

Schema

fields:
  - name: A
    type: Digit
  - name: B
    type: FormattedString
    args:
      format: |
        Doing math (A + 4): {{ refs['A'].raw + 4}}!
        Accessing values: {{ refs['C'].raw }}!
        Formatting dates: {{ refs['D'].raw | date(format="%Y-%m-%d") }}
    constraints:
      - type: IfNull
        name: C
  - name: C
    type: Digit
    null_probability: 0.5
  - name: D
    type: DateTime

Output

A,B,C,D
4,,4,2667-11-04T03:14:00+00:00
9,"Doing math (A + 4): 13!
Accessing values: !
Formatting dates: 2603-05-24
",,2603-05-24T15:36:00+00:00
0,,4,2312-12-06T02:40:00+00:00
0,"Doing math (A + 4): 4!
Accessing values: !
Formatting dates: 2910-06-24
",,2910-06-24T13:18:00+00:00
0,"Doing math (A + 4): 4!
Accessing values: !
Formatting dates: 2807-02-22
",,2807-02-22T18:16:00+00:00

Arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| format | string | The format of the string. See Templating for more info. | |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Array

The Array field type allows you to generate an array of a particular field. The array can be of varying length, but being an array, all values will be of the same type.
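
The behaviour can be modelled as: pick a length between min and max, then run the inner field that many times. A Python sketch under those assumptions (not fodder's code):

```python
import random

def array_of(rng, element, lo=0, hi=10):
    """Generate between lo and hi elements (inclusive), all produced by the same field."""
    return [element(rng) for _ in range(rng.randint(lo, hi))]

rng = random.Random(5)
arr = array_of(rng, lambda r: r.randint(0, 10), lo=0, hi=5)
```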

Schema

fields:
  - name: A
    type: Array
    args:
      min: 0
      max: 10
      field:
        name: B
        type: IntegerInRange
        args:
          min: 0
          max: 10
  - name: field_a
    type: IntegerInRange
    null_probability: 0.5
  - name: array
    type: Array
    args:
      min: 1
      max: 4
      field:
        name: num
        type: Digit
        constraints:
          - type: IfNull
            name: field_a

Output

[
  { "A": [4, 9, 5, 2], "field_a": 5185946464695284972, "array": [null] },
  {
    "A": [5, 0, 2, 7, 0, 4, 3],
    "field_a": 7503849539415973306,
    "array": [null, null, null]
  },
  { "A": [1, 9, 0], "field_a": null, "array": [5, 6] }
]

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| min | int | The minimum length of the array. | 0 |
| max | int | The maximum length of the array. | 10 |
| field | Field | The field to generate the array of. | |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Fakers

The Fakers field type generates fake data by passing directly through to the fake-rs crate.

This is mostly useful for generating your own categories if you don't want to hand craft the content.

Schema

fields:
  - name: Fakers IP
    type: Fakers
    args:
      subtype: IP
    null_probability: 0.5
  - name: Fakers Buzzword
    type: Fakers
    args:
      subtype: Buzzword
    constraints:
      - type: IfNull
        name: Fakers IP
  - name: Fakers CC Number
    type: Fakers
    args:
      subtype: CreditCardNumber

Output

Fakers IP,Fakers Buzzword,Fakers CC Number
,Distributed,5156398210874490
188.123.216.220,,372329793829177
155.111.76.88,,4374470709955

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| subtype | string | The type of fake data to generate. See Subtypes below. | |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that this field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to this field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Subtypes

This uses the fake-rs crate under the hood. The following values are supported.

Subtype
FirstName
LastName
Title
Suffix
Name
NameWithTitle
FreeEmailProvider
DomainSuffix
FreeEmail
SafeEmail
Username
IPv4
IPv6
IP
MACAddress
UserAgent
RfcStatusCode
ValidStatusCode
HexColor
RgbColor
RgbaColor
HslColor
HslaColor
Color
CompanySuffix
CompanyName
Buzzword
BuzzwordMiddle
BuzzwordTail
CatchPhase
BsVerb
BsAdj
BsNoun
Bs
Profession
Industry
CurrencyCode
CurrencyName
CurrencySymbol
CreditCardNumber
CityPrefix
CitySuffix
CityName
CountryName
CountryCode
StreetSuffix
StreetName
TimeZone
StateName
StateAbbr
SecondaryAddressType
SecondaryAddress
ZipCode
PostCode
BuildingNumber
Latitude
Longitude
Isbn
Isbn13
Isbn10
PhoneNumber
CellNumber
Time
Date
DateTime
FilePath
FileName
FileExtension
DirPath
Bic

Duration

The Duration field calculates a duration between a start and an end value, with the output expressed in the specified unit (component) of the duration. As this field does not generate a random value, it is most useful when referencing other fields.
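
The arithmetic itself is simple. This Python sketch assumes the result is whole completed units (fodder's exact rounding rules aren't documented here); Years and Months use calendar fields, while smaller components divide the elapsed seconds:

```python
from datetime import datetime, timezone

def duration_component(start, end, component="Seconds"):
    """Whole completed units between start and end for the requested component."""
    if component == "Years":
        years = end.year - start.year
        if (end.month, end.day) < (start.month, start.day):
            years -= 1  # the anniversary hasn't occurred yet
        return years
    if component == "Months":
        months = (end.year - start.year) * 12 + (end.month - start.month)
        if end.day < start.day:
            months -= 1
        return months
    seconds = int((end - start).total_seconds())
    divisor = {"Seconds": 1, "Minutes": 60, "Hours": 3600,
               "Days": 86400, "Weeks": 7 * 86400}[component]
    return seconds // divisor

start = datetime(2010, 1, 1, tzinfo=timezone.utc)
end = datetime(2020, 1, 1, tzinfo=timezone.utc)
```

With these endpoints, `duration_component(start, end, "Years")` matches the Y value of 10 shown in the complete schema example.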

Schema

fields:
  - name: Age
    type: Duration
    args:
      start: "{{ refs['Birthdate'].raw }}"
      end: now
      component: Years
    constraints:
      - type: IfNull
        name: Seconds
  - name: Birthdate
    type: DateTime
    args:
      format: "%Y-%m-%d"
      start: "1900-01-01T00:00:00Z"
      end: "2010-01-01T00:00:00Z"
  - name: Seconds
    type: Duration
    null_probability: .5
    args:
      start: now
      end: 120s

Output

Age,Birthdate,Seconds
36,1986-09-11,
21,2001-11-24,
,1995-12-06,120
,1906-07-29,120
,1907-05-14,120

Arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| start | string | The start of the duration. This will accept a variety of inputs, details can be found here. | |
| end | string | The end of the duration. This will accept a variety of inputs, details can be found here. | |
| component | string | The component to use. This can be one of: Years, Months, Weeks, Days, Hours, Minutes, Seconds. | Seconds |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Map

The Map field is used to select a value from a map of key/value pairs. The maps are defined in the maps section of the schema and referenced by name in the Map field. They always refer to externally defined data.
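
Semantically, a Map field is a keyed dictionary lookup with an optional fallback. A Python sketch of that behaviour, using inline text as a stand-in for the external CSV file:

```python
import csv
import io

# Inline stand-in for a map file such as data/POSTCODE_SUBURB.csv.
POSTCODE_SUBURB = """PCODE,SUBURB
6157,Palmyra
6000,Perth
"""

def load_map(text):
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return {key: value for key, value in reader}

def map_lookup(mapping, key, default=None):
    """Return the mapped value, or `default` when the key is not in the map."""
    return mapping.get(key, default)

suburbs = load_map(POSTCODE_SUBURB)
```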

Schema

categories:
  - name: POSTCODE
    file: "data/POSTCODE.csv"
maps:
  - name: POSTCODE_SUBURB_MAP
    file: "data/POSTCODE_SUBURB.csv"
fields:
  - name: Suburb
    type: Map
    args:
      key: Postcode
      from_map: POSTCODE_SUBURB_MAP
    null_probability: 0.5
  - name: Postcode
    type: WeightedCategory
    args:
      from_category: POSTCODE
  - name: Suburb2
    type: Map
    args:
      key: Postcode2
      from_map: POSTCODE_SUBURB_MAP
      default: "N/A"
    constraints:
      - type: IfNull
        name: Suburb
  - name: Postcode2
    type: Bothify
    args:
      format: "####"
# data/POSTCODE.csv
PCODE
6157
6000
6100
6101
6530
# data/POSTCODE_SUBURB.csv
PCODE,SUBURB
6157,Palmyra
6000,Perth
6100,Victoria Park
6101,East Victoria Park
6530,Geraldton

Output

Suburb,Postcode,Suburb2,Postcode2
East Victoria Park,6101,,4128
Perth,6000,,4642
,6000,N/A,4847
,6101,N/A,9223
Victoria Park,6100,,1988

Arguments

| Name | Type | Description | Default |
|------|------|-------------|---------|
| key | string | The name of the field to use as the key in the map. | |
| from_map | string | The name of the map to use. This is a reference to a name defined in the maps section. See Maps for more information. | |
| default | string | The value to use when the key is not found in the map. | |

Field arguments

| Name | Type | Description | Default Value |
|------|------|-------------|---------------|
| null_probability | float | The probability that the field will be null. | 0.0 |
| constraints | list | A list of constraints to apply to the field. | [] |

Supported constraints

| Name | Description |
|------|-------------|
| IfNull | The value must only be non-null if another field is null. |

Constraints

At some point, you will want to generate data that is representative of the real world. For example, you may want to ensure that the CreatedAt field is always less than the ModifiedAt field. This is where constraints come in.

There are a number of constraints that can be applied to fields. Note that not all field types support all constraints.

At present, the following constraints are available:

  • GreaterThan
  • IfNull

Usage

Constraints are applied directly to fields in the schema. The field that they are defined on is the field that the constraint will be applied to.

GreaterThan

The GreaterThan constraint ensures that the value of the field is greater than the value of another field.

The following will result in the B field being greater than the A field.

fields:
  - name: A
    type: Digit
  - name: B
    type: Digit
    constraints:
      - type: GreaterThan
        name: A
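
One simple way to satisfy such a constraint is rejection sampling: draw values until one exceeds the referenced field. (Whether fodder resamples or narrows the range internally isn't documented here.) A Python sketch:

```python
import random

def digit(rng):
    return rng.randint(0, 9)

def digit_greater_than(rng, other, max_tries=1000):
    """Resample until the value exceeds `other` (assumes other < 9, so a value exists)."""
    for _ in range(max_tries):
        v = digit(rng)
        if v > other:
            return v
    raise ValueError(f"no value greater than {other!r} found")

rng = random.Random(1)
a = rng.randint(0, 8)          # keep A below 9 so B can exceed it
b = digit_greater_than(rng, a)
```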

IfNull

The IfNull constraint ensures that the value of the field is only generated if the value of another field is null. This has the effect of making two fields mutually exclusive.

The following will result in the B field being generated only when the A field is null, which should result in each field being populated roughly 50% of the time.

fields:
  - name: A
    type: Digit
    null_probability: 0.5
  - name: B
    type: Digit
    constraints:
      - type: IfNull
        name: A
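
The mutual exclusivity can be sketched in Python (an illustrative simulation, not fodder's implementation): A is null half the time, and B is generated exactly when A is null.

```python
import random

# Illustrative simulation of the schema above: A is null with
# probability 0.5, and B is generated only when A is null.
def generate_row():
    a = None if random.random() < 0.5 else random.randint(0, 9)
    b = random.randint(0, 9) if a is None else None
    return a, b

rows = [generate_row() for _ in range(10_000)]
# Exactly one of A and B is set in every row.
assert all((a is None) != (b is None) for a, b in rows)
```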

External Data

There are multiple types of external data that can be used to assist in the generation of data: categories and maps.

You can also generate data with one schema and feed it into another. This is useful for creating multiple tables that are related to each other. For example, if several tables reference the same set of IDs, you can generate the list of IDs once and then use it in each of those tables.

Categories

Categories are a way to define a list of values that can be referenced by name in some fields. They are defined in the categories section of the schema.

The file is expected to be in CSV format. By default the first row is a header row. The first column holds the value to be used and the optional second column holds its weight; if the second column is absent, the weight defaults to 1.0.
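
For illustration, the parsing rules described above could be sketched in Python as follows (an interpretation inferred from this documentation, not fodder's code):

```python
import csv
import io

# Sketch of reading a category file: header row skipped by default,
# first column is the value, optional second column is the weight
# (defaulting to 1.0 when absent).
category_csv = """LETTER,WEIGHT
H,4
E,1
"""
reader = csv.reader(io.StringIO(category_csv))
next(reader)  # skip the header row
entries = [(row[0], float(row[1]) if len(row) > 1 else 1.0) for row in reader]
assert entries == [("H", 4.0), ("E", 1.0)]
```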

Schema

The below schema will make the contents of data/LETTERS.csv available as a category called LETTERS.

categories:
  - name: LETTERS
    file: "data/LETTERS.csv"
    header: true

Arguments

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | string | The name of the category. This is the name by which it can be referenced in other parts of the schema. | |
| file | string | The path to the file containing the category data. | |
| header | bool | Whether the first row of the file is a header row. | true |

Example field using a category

categories:
  - name: LETTERS
    file: "data/LETTERS.csv"
fields:
  - name: letter
    type: WeightedCategory
    args:
      from_category: LETTERS

Example category file

# data/LETTERS.csv
LETTER
H
E
L
L
O

Example category file with weights

# data/LETTERS_WEIGHTED.csv
LETTER,WEIGHT
H,4
E,1
L,1
L,1
O,1
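
With the weighted file above, H carries 4 of the 8 total weight, so it should be drawn roughly half the time. The following Python snippet simulates that weighting (illustrative only; it is not how fodder is implemented):

```python
import random
from collections import Counter

# Values and weights as in data/LETTERS_WEIGHTED.csv above.
rows = [("H", 4.0), ("E", 1.0), ("L", 1.0), ("L", 1.0), ("O", 1.0)]
values = [value for value, _ in rows]
weights = [weight for _, weight in rows]

# Draw many samples; H should account for about half of them (4/8).
draws = Counter(random.choices(values, weights=weights, k=80_000))
assert 0.45 < draws["H"] / 80_000 < 0.55
```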

Maps

Maps are a way to define a set of key-value pairs that can be referenced by name in some fields. They are defined in the maps section of the schema.

The file is expected to be in CSV format. By default the first row is a header row. The first column holds the key and the second column holds the value.
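
In other words, a map behaves like a key-to-value dictionary built from the file. A minimal Python sketch of that interpretation (an illustration of the documented behaviour, not fodder's code):

```python
import csv
import io

# Build a key -> value mapping from a map file: header skipped,
# first column is the key, second column is the value.
map_csv = """POSTCODE,SUBURB
7000,Hobart
6000,Perth
"""
reader = csv.reader(io.StringIO(map_csv))
next(reader)  # skip the header row
mapping = {row[0]: row[1] for row in reader}
assert mapping["6000"] == "Perth"
```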

This allows for interesting use cases such as mapping consistently between randomly selected categories and their corresponding attributes.

Note: There is a small amount of awkwardness here: you will probably want to select a category from the first column of the map and then map that category to the value in the second column. The current limitation is that, to do this, you need to copy the first column of the map file into a new category file.

Schema

The below schema will make the contents of data/POSTCODE_SUBURB.csv available as a map called POSTCODE_SUBURB_MAP.

maps:
  - name: POSTCODE_SUBURB_MAP
    file: "data/POSTCODE_SUBURB.csv"
    header: true

Arguments

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | string | The name of the map. This is the name by which it can be referenced in other parts of the schema. | |
| file | string | The path to the file containing the map data. | |
| header | bool | Whether the first row of the file is a header row. | true |

Example field using a map

categories:
  - name: POSTCODE
    file: "data/POSTCODE.csv"
maps:
  - name: POSTCODE_SUBURB_MAP
    file: "data/POSTCODE_SUBURB.csv"
fields:
  - name: Suburb
    type: Map
    args:
      key: Postcode
      from_map: POSTCODE_SUBURB_MAP
  - name: Postcode
    type: WeightedCategory
    args:
      from_category: POSTCODE

Example map file

# data/POSTCODE_SUBURB.csv
POSTCODE,SUBURB
7000,Hobart
6000,Perth
5000,Adelaide
4000,Brisbane
3000,Melbourne
2000,Sydney
1000,Canberra

Examples

Note: This section documents examples from the repository. The code is in-lined where relevant, but the repository is the source of truth, so the copies here may be out of date.

Basic schema examples

The field documentation is a good source of example schemas. These are located in the fields section.

There are also more examples in the repository. These are located in the schemas/ directory.

'Real' usage examples

A simple example

This is a simple example of how to use fodder to generate some data. It contains a schema.yaml file and some hand-crafted input data in data/.

Schema

maps:
  - name: BUYER_NAME
    file: data/BUYER_NAME.csv
  - name: SELLER_NAME
    file: data/SELLER_NAME.csv
categories:
  - name: BUYER
    file: data/BUYER.csv
  - name: SELLER
    file: data/SELLER.csv
fields:
  # SALESID	INTEGER	Primary key, a unique ID value for each row. Each row represents a sale of one or more tickets for a specific event, as offered in a specific listing.
  - name: SALESID
    type: Bothify
    args:
      format: "#########"
  # SELLERID	INTEGER	Foreign-key reference to the USERS table (the user who listed the tickets).
  - name: SELLERID
    type: WeightedCategory
    args:
      from_category: SELLER
  # SELLERNAME	VARCHAR(50)	The name of the user who listed the tickets.
  - name: SELLERNAME
    type: Map
    args:
      from_map: SELLER_NAME
      key: SELLERID
  # BUYERID	INTEGER	Foreign-key reference to the USERS table (the user who bought the tickets).
  - name: BUYERID
    type: WeightedCategory
    args:
      from_category: BUYER
  # BUYERNAME	VARCHAR(50)	The name of the user who bought the tickets.
  - name: BUYERNAME
    type: Map
    args:
      from_map: BUYER_NAME
      key: BUYERID
  # QTYSOLD	SMALLINT	The number of tickets that were sold, from 0 to 9. (A maximum of 8 tickets can be sold in a single transaction.)
  - name: QTYSOLD
    type: Digit
  # PRICEPAID	DECIMAL(8,2)	The total price paid for the tickets, such as 75.00 or 488.00. The individual price of a ticket is PRICEPAID/QTYSOLD.
  - name: PRICEPAID
    type: Bothify
    args:
      format: "^##.##"
  # SALETIME	TIMESTAMP	The full date and time when the sale was completed, such as 2008-05-24 06:21:47.
  - name: SALETIME
    type: DateTime
    args:
      start: 2023-01-01T00:00:00Z
      end: 2023-01-31T00:00:00Z
      format: "%Y-%m-%d %H:%M:%S"

Data

# data/BUYER.csv
ID
0000001
0000002
0000003
0000004
0000005
# data/BUYER_NAME.csv
ID,NAME
0000001,JOHN DOE
0000002,JOHN SMITH
0000003,ALICE DOE
0000004,SALLY SMITH
0000005,MARY JONES
# data/SELLER.csv
ID
0000001
0000002
0000003
0000004
0000005
# data/SELLER_NAME.csv
ID,NAME
0000001,COMPANY INC.
0000002,LOL INC.
0000003,123 INC.
0000004,ANOTHER INC.
0000005,ZZZ PTY LTD.

Output

# tables/SALES.csv
SALESID,SELLERID,SELLERNAME,BUYERID,BUYERNAME,QTYSOLD,PRICEPAID,SALETIME
541725862,0000002,JOHN SMITH,0000003,123 INC.,0,563.54,2023-01-09 21:50:00
751725001,0000004,SALLY SMITH,0000002,LOL INC.,3,889.20,2023-01-05 18:58:00
868507369,0000004,SALLY SMITH,0000004,ANOTHER INC.,7,764.86,2023-01-06 05:00:00
306917643,0000002,JOHN SMITH,0000005,ZZZ PTY LTD.,9,553.21,2023-01-21 17:47:00
731805330,0000005,MARY JONES,0000005,ZZZ PTY LTD.,6,183.24,2023-01-23 19:52:00

A more complex example

This is a more complex example of how to use fodder to generate some data. It contains multiple schema files in the schemas/ directory and uses the data/ directory as a temporary location for data that is generated by one schema and consumed by another. The final output is written to the tables/ directory.

All of the above is controlled by two short bash scripts, initialise and gen, which use the fodder CLI along with some standard Unix tools to generate the data.

initialise

This script is used to initialise the data by generating the categories that will remain static in future runs. In a way this can be thought of as generating our DIM tables in a data warehouse.

#!/usr/bin/env bash
#
# Generate our primary tables

set -eo pipefail

if [ -z "$1" ]; then
    ROWS=5;
else
    ROWS="$1";
fi

echo "GENERATING SELLER DATA"
fodder -s schemas/ID.fodder.yaml -n "$ROWS" -f csv > data/SELLER_ID.csv
fodder -s schemas/COMPANY.fodder.yaml -n "$ROWS" -f csv > data/SELLER_COMPANY.csv
paste -d "," data/SELLER_ID.csv data/SELLER_COMPANY.csv > tables/SELLER_ID_COMPANY.csv
cp data/SELLER_ID.csv tables/SELLER_ID.csv
rm data/SELLER_ID.csv data/SELLER_COMPANY.csv

echo "GENERATING BUYER DATA"
fodder -s schemas/ID.fodder.yaml -n "$ROWS" -f csv > data/BUYER_ID.csv
fodder -s schemas/COMPANY.fodder.yaml -n "$ROWS" -f csv > data/BUYER_COMPANY.csv
paste -d "," data/BUYER_ID.csv data/BUYER_COMPANY.csv > tables/BUYER_ID_COMPANY.csv
cp data/BUYER_ID.csv tables/BUYER_ID.csv
rm data/BUYER_ID.csv data/BUYER_COMPANY.csv

gen

The gen script is used to generate the main data. This is the data that will change each time the script is run. In a way this can be thought of as generating our FACT tables in a data warehouse.

In the scenario that we are generating data for a data warehouse, we would run the initialise script once and then run the gen script each time we want to generate new data.

This could potentially be automated by running the gen script as a cron job, or similar (more complicated) mechanism.

#!/usr/bin/env bash
# 
# Generate latest sales data

set -eo pipefail

if [ -z "$1" ]; then
    ROWS=20;
else
    ROWS="$1";
fi

echo "GENERATING SALES"
fodder -s schemas/SALES.fodder.yaml -f csv -n "$ROWS" > tables/SALES.csv

Templating

Templating language

Where templating has been made available in fodder, it uses Tera as the templating language. Tera is a full-featured template engine for Rust, based on Jinja2. For more information, please see the Tera documentation.

References

When we make reference to a field in a template, we are referring to the name of the field. For example, if we have a field called Name and we want to reference it in a template, we would use {{ refs['Name'].raw }}. This will return the raw value of the field. If we wanted to use the formatted value, we would use {{ refs['Name'].formatted }}. Depending on your particular use-case, you may want to use one or the other.
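
As a concrete illustration, the data a template sees might be shaped like the following (the exact structure shown here is an assumption inferred from this documentation, not taken from fodder's source):

```python
# Hypothetical shape of the refs data available to a template, inferred
# from the documentation: each field exposes a raw and a formatted value.
refs = {"Name": {"raw": "alice", "formatted": "Alice"}}

# A template such as "{{ refs['Name'].formatted }}" would then render as:
rendered = refs["Name"]["formatted"]
assert rendered == "Alice"
```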

Linting and Autocompletion

This is a way to validate your fodder schema files and to provide helpful hints to you as you write them, such as what fields are available and what arguments they take.

Setting up linting in your IDE

Output the JSON schema to the root of your project; this is what is used to provide linting and autocompletion.

fodder -d > lint.json

VSCode

Configure your IDE to use this JSON schema. For VSCode, ensure you have the correct extension installed (YAML) and add the following to your settings.json:

{
  "yaml.schemas": {
    "./lint.json": "*.fodder.yaml"
  }
}

Neovim

If you are using Neovim with nvim-lspconfig and lazy.nvim you can use the below snippet. If you aren't, you should still be able to use the JSON schema to configure your IDE - but you will have to go on that adventure by yourself!

return {
  {
    "neovim/nvim-lspconfig",
    opts = {
      servers = {
        yamlls = {
          settings = {
            yaml = {
              schemas = {
                ["./lint.json"] = "*.fodder.yaml",
              },
            },
          },
        },
      },
    },
  },
}

Provided you have set everything up as detailed above and named your fodder schema file with the .fodder.yaml suffix, you should get linting and autocompletion!