Data Format Requirements
Overview
FinRL expects data in a specific pandas DataFrame format. Understanding these requirements is crucial for successful implementation of trading strategies.
Required DataFrame Structure
Core Columns
| Column | Type | Description | Example |
|---|---|---|---|
| date | str | Trading date in YYYY-MM-DD format | "2023-01-15" |
| tic | str | Asset ticker/symbol | "AAPL", "BTC-USD" |
| close | float | Closing price | 150.25 |
Optional Columns
| Column | Type | Description | Usage |
|---|---|---|---|
| open | float | Opening price | OHLCV data, additional features |
| high | float | Highest price | OHLCV data, volatility calculation |
| low | float | Lowest price | OHLCV data, support/resistance |
| volume | float | Trading volume | Liquidity analysis, volume indicators |
| vix | float | Volatility index | Market sentiment, risk assessment |
| turbulence | float | Market turbulence metric | Risk management |
Data Structure Examples
Single Asset Format
import pandas as pd
# Single asset data (e.g., Bitcoin)
df_single = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'tic': ['BTC-USD', 'BTC-USD', 'BTC-USD'],
'open': [16500.0, 16800.0, 17100.0],
'high': [16900.0, 17200.0, 17500.0],
'low': [16400.0, 16700.0, 17000.0],
'close': [16800.0, 17100.0, 17400.0],
'volume': [25000000, 28000000, 32000000]
})
print(df_single)
date tic open high low close volume
0 2023-01-01 BTC-USD 16500.0 16900.0 16400.0 16800.0 25000000
1 2023-01-02 BTC-USD 16800.0 17200.0 16700.0 17100.0 28000000
2 2023-01-03 BTC-USD 17100.0 17500.0 17000.0 17400.0 32000000
Multi-Asset Format
# Multi-asset data (stocks + crypto)
df_multi = pd.DataFrame({
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'tic': ['AAPL', 'MSFT', 'AAPL', 'MSFT'],
'close': [130.50, 240.80, 132.20, 243.10],
'volume': [89000000, 45000000, 92000000, 47000000]
})
print(df_multi)
date tic close volume
0 2023-01-01 AAPL 130.50 89000000
1 2023-01-01 MSFT 240.80 45000000
2 2023-01-02 AAPL 132.20 92000000
3 2023-01-02 MSFT 243.10 47000000
Technical Indicators
Built-in Indicators
FinRL includes these technical indicators by default:
from finrl.config import INDICATORS
print("Default indicators:", INDICATORS)
# Output: ['macd', 'boll_ub', 'boll_lb', 'rsi_30', 'cci_30', 'dx_30', 'close_30_sma', 'close_60_sma']
Indicator Descriptions
| Indicator | Full Name | Description |
|---|---|---|
| macd | Moving Average Convergence Divergence | Trend-following momentum indicator |
| boll_ub | Bollinger Bands Upper | Upper boundary of price volatility |
| boll_lb | Bollinger Bands Lower | Lower boundary of price volatility |
| rsi_30 | Relative Strength Index (30-day) | Overbought/oversold oscillator |
| cci_30 | Commodity Channel Index (30-day) | Momentum oscillator |
| dx_30 | Directional Movement Index (30-day) | Trend strength indicator |
| close_30_sma | Simple Moving Average (30-day) | Price trend smoothing |
| close_60_sma | Simple Moving Average (60-day) | Long-term price trend |
Adding Technical Indicators
from finrl.meta.preprocessor.preprocessors import FeatureEngineer
# Define custom indicators
CUSTOM_INDICATORS = [
"macd", "rsi_14", "rsi_30",
"cci_14", "cci_30",
"dx_14", "dx_30",
"boll_ub", "boll_lb",
"close_14_sma", "close_30_sma", "close_60_sma",
"volume_14_sma", "volume_30_sma"
]
# Initialize feature engineer
fe = FeatureEngineer(
use_technical_indicator=True,
tech_indicator_list=CUSTOM_INDICATORS,
use_turbulence=True,
user_defined_feature=False
)
# Process data with technical indicators
processed_df = fe.preprocess_data(raw_df)
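To see exactly what was added, you can diff the column sets (this sketch assumes raw_df is an OHLCV frame in the format shown above):
# Columns added by FeatureEngineer relative to the raw input
new_cols = sorted(set(processed_df.columns) - set(raw_df.columns))
print("Added feature columns:", new_cols)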
Data Validation and Preprocessing
Required Data Checks
def validate_data(df):
    """Validate DataFrame format for FinRL"""
    # Check required columns
    required_cols = ['date', 'tic', 'close']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")
    # Check data types
    if df['date'].dtype != 'object':
        raise ValueError("Date column must be string type")
    if not pd.api.types.is_numeric_dtype(df['close']):
        raise ValueError("Close column must be numeric")
    # Check for missing values in critical columns
    if df[required_cols].isnull().any().any():
        raise ValueError("Found missing values in required columns")
    # Check date format (strict YYYY-MM-DD)
    try:
        pd.to_datetime(df['date'], format='%Y-%m-%d')
    except (ValueError, TypeError):
        raise ValueError("Date column must be in YYYY-MM-DD format")
    print("✅ Data validation passed!")
    return True
# Validate your data
validate_data(df)
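A quick way to confirm the validator rejects malformed input, using a deliberately broken frame:
# A frame missing the required 'tic' column should be rejected
bad_df = pd.DataFrame({'date': ['2023-01-01'], 'close': [100.0]})
try:
    validate_data(bad_df)
except ValueError as e:
    print(e)  # Missing required columns: ['tic']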
Data Cleaning Pipeline
import numpy as np

def clean_financial_data(df):
    """Clean and prepare financial data for FinRL"""
    # 1. Sort by date, then ticker
    df = df.sort_values(['date', 'tic'], ignore_index=True)
    # 2. Remove duplicate (date, tic) rows
    df = df.drop_duplicates(['date', 'tic'])
    # 3. Handle missing values: forward fill, then backward fill, per asset
    value_cols = [c for c in df.columns if c not in ('date', 'tic')]
    df[value_cols] = df.groupby('tic')[value_cols].ffill()
    df[value_cols] = df.groupby('tic')[value_cols].bfill()
    # 4. Remove assets with insufficient data
    asset_counts = df.groupby('tic')['date'].count()
    valid_assets = asset_counts[asset_counts >= 252].index  # at least 1 trading year
    df = df[df['tic'].isin(valid_assets)]
    # 5. Remove outliers (optional)
    def remove_price_outliers(group, column='close', z_threshold=3):
        z_scores = np.abs((group[column] - group[column].mean()) / group[column].std())
        return group[z_scores <= z_threshold]
    df = df.groupby('tic', group_keys=False).apply(remove_price_outliers)
    df = df.reset_index(drop=True)
    # 6. Ensure a consistent date range across all assets
    date_counts = df.groupby('date')['tic'].count()
    complete_dates = date_counts[date_counts == date_counts.max()].index
    df = df[df['date'].isin(complete_dates)]
    return df
# Clean your data
clean_df = clean_financial_data(raw_df)
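It is worth checking how much data the pipeline removed before training:
# Compare row and asset counts before and after cleaning
print(f"Rows: {len(raw_df)} -> {len(clean_df)}")
print(f"Assets: {raw_df['tic'].nunique()} -> {clean_df['tic'].nunique()}")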
DateTime Handling
Date Format Requirements
FinRL expects dates in YYYY-MM-DD string format:
import datetime

# ✅ Correct format
correct_dates = ["2023-01-15", "2023-02-28", "2023-12-31"]
# ❌ Incorrect formats
incorrect_dates = [
    "01/15/2023",                # MM/DD/YYYY
    "15-01-2023",                # DD-MM-YYYY
    "2023.01.15",                # Dots as separators
    datetime.date(2023, 1, 15),  # date object, not a string
]

# Convert to correct format
def standardize_dates(date_series):
    """Convert various date formats to YYYY-MM-DD strings"""
    return pd.to_datetime(date_series).dt.strftime('%Y-%m-%d')
# Usage
df['date'] = standardize_dates(df['date'])
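For example, US-style MM/DD/YYYY strings come out as YYYY-MM-DD:
us_dates = pd.Series(["01/15/2023", "02/28/2023"])
print(standardize_dates(us_dates).tolist())
# ['2023-01-15', '2023-02-28']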
Timezone Handling
For international markets, consider timezone differences:
from finrl.config import TIME_ZONE_USEASTERN, TIME_ZONE_SHANGHAI
# Set appropriate timezone based on market
TIME_ZONES = {
'US': TIME_ZONE_USEASTERN, # "US/Eastern"
'CN': TIME_ZONE_SHANGHAI, # "Asia/Shanghai"
'EU': 'Europe/London'
}
def localize_dates(df, market='US'):
"""Localize dates to market timezone"""
df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].dt.tz_localize(TIME_ZONES[market])
df['date'] = df['date'].dt.strftime('%Y-%m-%d')
return df
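A minimal usage sketch, reusing the df_multi frame from above:
# Localize the multi-asset frame to the US/Eastern trading calendar
df_us = localize_dates(df_multi.copy(), market='US')
print(df_us['date'].head())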
Data Frequency Considerations
Daily Data (Recommended)
- Most stable and reliable
- Best for long-term strategies
- Lower computational requirements
import numpy as np

# Daily data example
daily_data = {
    'date': pd.date_range('2023-01-01', '2023-12-31', freq='D'),
    'close': np.random.normal(100, 10, 365)  # 365 days in 2023
}
Intraday Data (Advanced)
- Higher frequency (hourly, minute)
- More noise, requires careful preprocessing
- Higher computational requirements
# Hourly data example
hourly_data = {
    'date': pd.date_range('2023-01-01', periods=744, freq='h'),  # 24 * 31 hours
    'close': np.random.normal(100, 5, 744)
}
# Convert to daily for FinRL
def resample_to_daily(df, price_col='close'):
"""Resample intraday data to daily"""
df['date'] = pd.to_datetime(df['date'])
daily_df = df.groupby(['tic', df['date'].dt.date]).agg({
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last',
'volume': 'sum'
}).reset_index()
daily_df['date'] = daily_df['date'].astype(str)
return daily_df
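For instance, two days of synthetic hourly bars for one ticker collapse to two daily rows (the frame below is made up for illustration):
# Two days of synthetic hourly bars with the columns resample_to_daily expects
intraday = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=48, freq='h'),
    'tic': 'BTC-USD',
    'open': 100.0, 'high': 101.0, 'low': 99.0, 'close': 100.5,
    'volume': 10.0,
})
print(resample_to_daily(intraday))  # one row each for 2023-01-01 and 2023-01-02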
State Space Calculation
The environment's state space dimensions depend on your data structure:
def calculate_state_space(stock_dim, tech_indicators):
"""Calculate state space dimensions"""
# Components of state vector:
cash_dim = 1 # Cash balance
price_dim = stock_dim # Current prices
holdings_dim = stock_dim # Current holdings
tech_dim = len(tech_indicators) * stock_dim # Technical indicators
total_dim = cash_dim + price_dim + holdings_dim + tech_dim
print("State space breakdown:")
print(f" Cash: {cash_dim}")
print(f" Prices: {price_dim}")
print(f" Holdings: {holdings_dim}")
print(f" Technical indicators: {tech_dim}")
print(f" Total: {total_dim}")
return total_dim
# Example calculation
stock_dim = 5 # 5 assets
tech_indicators = ["macd", "rsi_30", "boll_ub", "boll_lb", "close_30_sma"]
state_space = calculate_state_space(stock_dim, tech_indicators)
# Output: Total: 36  (1 cash + 5 prices + 5 holdings + 5 * 5 indicators)
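The same number in closed form, useful as a sanity check whenever you change the asset universe or indicator list:
# Closed form: 1 + 2 * N + N * K for N assets and K indicators
assert state_space == 1 + 2 * stock_dim + stock_dim * len(tech_indicators)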
Common Data Issues and Solutions
Issue 1: Inconsistent Date Ranges
Problem: Different assets have different date ranges
Solution: Align date ranges
def align_date_ranges(df):
"""Ensure all assets have the same date range"""
# Find common date range
date_counts = df.groupby('date')['tic'].count()
max_assets = date_counts.max()
complete_dates = date_counts[date_counts == max_assets].index
return df[df['date'].isin(complete_dates)]
Issue 2: Missing Price Data
Problem: Some dates have missing price data
Solution: Fill missing values appropriately
def handle_missing_prices(df):
    """Handle missing price data"""
    # Measure missingness per asset *before* filling, so the cutoff is meaningful
    missing_pct = df['close'].isnull().groupby(df['tic']).mean()
    valid_assets = missing_pct[missing_pct < 0.05].index  # < 5% missing
    df = df[df['tic'].isin(valid_assets)].copy()
    # Forward fill within each ticker, then backward fill
    price_cols = [c for c in df.columns if c not in ('date', 'tic')]
    df[price_cols] = df.groupby('tic')[price_cols].ffill()
    df[price_cols] = df.groupby('tic')[price_cols].bfill()
    return df
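A small demonstration with made-up data, where one missing close out of 30 days (about 3%) is below the cutoff and gets filled:
import numpy as np

dates = pd.date_range('2023-01-01', periods=30, freq='D').strftime('%Y-%m-%d')
closes = np.linspace(130.0, 140.0, 30)
closes[10] = np.nan  # one-day gap
gappy = pd.DataFrame({'date': dates, 'tic': ['AAPL'] * 30, 'close': closes})
filled = handle_missing_prices(gappy)
print(filled['close'].isnull().sum())  # 0 -- the gap was forward-filled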
Issue 3: Price Scale Differences
Problem: Assets with very different price scales
Solution: Use percentage returns or normalization
def normalize_prices(df):
"""Normalize prices to handle scale differences"""
df = df.copy()
# Calculate percentage returns
df['pct_return'] = df.groupby('tic')['close'].pct_change()
# Or normalize to first price
df['normalized_close'] = df.groupby('tic')['close'].transform(
lambda x: x / x.iloc[0]
)
return df
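After normalization every asset starts at 1.0, so different price scales become directly comparable:
# Every asset's first normalized close is 1.0 by construction
norm = normalize_prices(clean_df)
print(norm.groupby('tic')['normalized_close'].first())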
Best Practices
Data Quality First
Always validate and clean your data before training. Poor data quality leads to poor model performance.
Avoid Lookahead Bias
Ensure technical indicators only use historical data. Never use future information.
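A minimal sketch of lagging a same-bar feature so the agent only ever sees past information (column names here are illustrative):
# Rolling mean over the current and previous 29 bars, then lagged one bar
df['sma_30'] = df.groupby('tic')['close'].transform(lambda x: x.rolling(30).mean())
df['sma_30_lagged'] = df.groupby('tic')['sma_30'].shift(1)  # strictly past data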
Memory Considerations
For large datasets (>1M rows), consider data sampling or chunking strategies.
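One chunked-loading sketch; the file name and ticker universe are placeholders:
universe = {'AAPL', 'MSFT', 'GOOG'}  # hypothetical asset universe
chunks = []
for chunk in pd.read_csv('large_prices.csv', chunksize=100_000):
    chunks.append(chunk[chunk['tic'].isin(universe)])  # filter before concatenating
df = pd.concat(chunks, ignore_index=True)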
Consistent Formatting
Use consistent date formats, column names, and data types across all datasets.