Features yml file

A feature is a reference to a column in a SQL table. Rasgo stores metadata about features to help users discover and consume them.

Feature metadata can be defined using a yml file or a dict.

Structure of a yml file

Features that reside in the same table belong to a DataSource. A yml file describes a DataSource (table) and the Features and Dimensions (columns) in it. Each yml file should describe a single DataSource.

Attributes

| Attribute Name | Description | Value constraints |
| --- | --- | --- |
| name | Name of the DataSource. | (Optional) Defaults to the sourceTable name if not supplied. |
| sourceTable | The Snowflake table these features are stored in. | Mandatory. |
| sourceType | The type of data used to import this DataSource. | Mandatory. Restricted to one of: table, dataframe, csv. |
| sourceCode | The SQL or Python code used to create this feature (assuming there is value in storing this). | (Optional) Free-form text field. |
| tags | Free-form text tags to apply to all features. | (Optional) List of strings. |
| attributes | Free-form k:v dicts to apply to all features. | (Optional) List of dicts. |
| dimensions: | -- | -- |
| columnName | SQL column name of the dimension. | Mandatory. Standard SQL column rules: no spaces or special characters. Best practice is to CAPITALIZE all letters. |
| dataType | SQL data type of the column. | Mandatory. Standard SQL data types allowed: string, int, float, date, bool. |
| granularity | String describing the grain of this column. This determines what other features can be joined with these features. | Mandatory. For datetime columns, allowed values: hour, day, week, month, quarter, year. |
| features: | -- | -- |
| columnName | SQL column name of the feature. | Mandatory. Standard SQL column rules: no spaces or special characters. Best practice is to CAPITALIZE all letters. |
| dataType | SQL data type of the feature. | Mandatory. Standard SQL data types allowed: string, int, float, date, bool. |
| displayName | The name that will display in the Rasgo UI. | (Optional) Any string value. Spaces and special characters allowed. Best practice is to avoid double quotes (") and semicolons (;). |
| description | A short description of the feature that will display in the Rasgo UI. | (Optional) Any string value. Spaces and special characters allowed. Best practice is to avoid double quotes (") and semicolons (;). |
| status | Status of the feature: Sandboxed or Productionized. | (Optional) Restricted to one of: Productionized, Sandboxed. |
| tags | Free-form text tags to apply to this feature only. | (Optional) List of strings. |
| attributes | Free-form k:v dicts to apply to this feature only. | (Optional) List of dicts. |
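
The constraints above can be checked before a file is submitted. The following is a minimal sketch in plain Python, not part of Rasgo itself: the function name validate_datasource and the error messages are hypothetical, and it assumes the metadata has already been loaded into a dict with the same keys as the yml file.

```python
# Hypothetical pre-flight check. The allowed values are copied from the
# attribute table; the function name and messages are illustrative only.
SOURCE_TYPES = {"table", "dataframe", "csv"}
DATA_TYPES = {"string", "int", "float", "date", "bool"}
STATUSES = {"Productionized", "Sandboxed"}


def validate_datasource(meta: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    if not meta.get("sourceTable"):
        errors.append("sourceTable is mandatory")
    if meta.get("sourceType") not in SOURCE_TYPES:
        errors.append("sourceType must be one of table, dataframe, csv")

    for section in ("dimensions", "features"):
        for col in meta.get(section, []):
            label = f"{section}.{col.get('columnName', '<missing>')}"
            if not col.get("columnName"):
                errors.append(f"{label}: columnName is mandatory")
            if col.get("dataType") not in DATA_TYPES:
                errors.append(f"{label}: dataType {col.get('dataType')!r} is not allowed")
            # granularity is mandatory for dimensions only
            if section == "dimensions" and not col.get("granularity"):
                errors.append(f"{label}: granularity is mandatory")
            # status is optional, but restricted when present
            if section == "features" and "status" in col and col["status"] not in STATUSES:
                errors.append(f"{label}: status must be Productionized or Sandboxed")
    return errors


meta = {
    "sourceTable": "CUSTOMER_TRANSACTIONS",
    "sourceType": "table",
    "dimensions": [{"columnName": "TRANS_DATE", "dataType": "date", "granularity": "day"}],
    "features": [{"columnName": "STORE_NAME", "dataType": "varchar"}],
}
print(validate_datasource(meta))
# ["features.STORE_NAME: dataType 'varchar' is not allowed"]
```

The check only covers the mandatory fields and restricted value lists; it does not verify column naming rules or that the table exists in Snowflake.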

Sample file:

name: "Customer Transactions"
sourceType: table
sourceTable: CUSTOMER_TRANSACTIONS
tags:
- apply_to_all_features
features:
- columnName: TRANS_AMT
  displayName: "Transaction Amount"
  dataType: float
  description: "Total of transaction in USD"
  status: Productionized
  tags:
  - USD
- columnName: ITEM_CT
  displayName: "Item Count"
  dataType: int
  description: "Number of items in cart"
  status: Productionized
- columnName: STORE_NAME
  displayName: "Store Name"
  dataType: string
  description: "Name of store"
  status: Productionized
- columnName: COUPONS_USED
  displayName: "Coupons Used"
  dataType: bool
  description: "Were any coupons used"
  status: Productionized
dimensions:
- columnName: TRANS_DATE
  dataType: date
  granularity: day
- columnName: CUSTOMER_ID
  dataType: int
  granularity: customer
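
Since feature metadata can also be defined using a dict, the same DataSource could be written in Python roughly as follows. This is a sketch that assumes the dict uses the same keys as the yml file; the sample is trimmed to two features, and the helpers datasource_name and effective_tags are hypothetical illustrations of the defaulting and tag rules described in the attribute table, not Rasgo functions.

```python
# Assumption: the dict form mirrors the yml keys one-to-one.
datasource = {
    "name": "Customer Transactions",
    "sourceType": "table",
    "sourceTable": "CUSTOMER_TRANSACTIONS",
    "tags": ["apply_to_all_features"],
    "features": [
        {
            "columnName": "TRANS_AMT",
            "displayName": "Transaction Amount",
            "dataType": "float",
            "description": "Total of transaction in USD",
            "status": "Productionized",
            "tags": ["USD"],
        },
        {
            "columnName": "COUPONS_USED",
            "displayName": "Coupons Used",
            "dataType": "bool",
            "description": "Were any coupons used",
            "status": "Productionized",
        },
    ],
    "dimensions": [
        {"columnName": "TRANS_DATE", "dataType": "date", "granularity": "day"},
        {"columnName": "CUSTOMER_ID", "dataType": "int", "granularity": "customer"},
    ],
}


def datasource_name(meta: dict) -> str:
    """name is optional and defaults to the sourceTable name."""
    return meta.get("name") or meta["sourceTable"]


def effective_tags(meta: dict, feature: dict) -> list:
    """DataSource-level tags apply to every feature; feature tags apply to one.

    The ordering (DataSource tags first) is an arbitrary choice for this sketch.
    """
    return list(meta.get("tags", [])) + list(feature.get("tags", []))


print(datasource_name(datasource))                            # Customer Transactions
print(effective_tags(datasource, datasource["features"][0]))  # ['apply_to_all_features', 'USD']
```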

"dimensions" are index fields used to join features to other features.

"granularity" can be any string that helps uniquely describe a dimension. Granularity is used to determine when dimensions across FeatureSets are of the same "grain" and can be joined to each other.

It is often helpful to think of granularity as a way to tag your features with taxonomy metadata. Consider:

Granularity for datetime fields may be logged as: year, quarter, month, day, second. These values define the grain of a date or datetime column.

Granularity for geolocation data may be logged as: Country, State, CBG, FIPS, zipcode, latlong

Granularity for healthcare data may be logged as: patient, payer, provider, encounter
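
To make the idea of matching grains concrete, the sketch below pairs up dimensions from two DataSources whose granularity strings are identical. It is illustrative only: the function shared_grains and the second DataSource (weather) are hypothetical, and it assumes an exact, case-sensitive string match, which may differ from how Rasgo actually compares grains.

```python
def shared_grains(left: dict, right: dict) -> list:
    """Return (left column, right column) pairs whose granularity matches exactly."""
    pairs = []
    for l_dim in left.get("dimensions", []):
        for r_dim in right.get("dimensions", []):
            if l_dim["granularity"] == r_dim["granularity"]:
                pairs.append((l_dim["columnName"], r_dim["columnName"]))
    return pairs


transactions = {
    "dimensions": [
        {"columnName": "TRANS_DATE", "dataType": "date", "granularity": "day"},
        {"columnName": "CUSTOMER_ID", "dataType": "int", "granularity": "customer"},
    ]
}
# Hypothetical second DataSource, invented for this example.
weather = {
    "dimensions": [
        {"columnName": "OBS_DATE", "dataType": "date", "granularity": "day"},
        {"columnName": "ZIP", "dataType": "string", "granularity": "zipcode"},
    ]
}

print(shared_grains(transactions, weather))  # [('TRANS_DATE', 'OBS_DATE')]
```

Here only the two day-grain date columns line up, so those are the columns on which features from the two tables could be joined.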

The "sourceTable" param accepts either a bare table name or a fully qualified table name (DB.SCHEMA.TABLE). If the database and schema are not supplied, Rasgo falls back to the defaults for your account. For most accounts these are: Database = RASGO and Schema = PUBLIC.
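
The defaulting rule for "sourceTable" can be sketched as a small helper. This is an assumption-laden illustration rather than Rasgo's implementation: the function qualify_table is hypothetical, it handles only the two forms described here (TABLE and DB.SCHEMA.TABLE), it rejects anything else, and it does not handle quoted identifiers that contain dots.

```python
def qualify_table(source_table: str, default_db: str = "RASGO", default_schema: str = "PUBLIC") -> tuple:
    """Split a sourceTable value into (database, schema, table).

    A bare table name gets the default database and schema; the defaults
    used here are the ones most accounts are said to have.
    """
    parts = source_table.split(".")
    if len(parts) == 1:
        return (default_db, default_schema, parts[0])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    # Two-part names (SCHEMA.TABLE) are not described in this document,
    # so this sketch refuses them rather than guessing.
    raise ValueError(f"expected TABLE or DB.SCHEMA.TABLE, got {source_table!r}")


print(qualify_table("CUSTOMER_TRANSACTIONS"))
# ('RASGO', 'PUBLIC', 'CUSTOMER_TRANSACTIONS')
print(qualify_table("SALES.RETAIL.CUSTOMER_TRANSACTIONS"))
# ('SALES', 'RETAIL', 'CUSTOMER_TRANSACTIONS')
```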
