Model strategy
Our predictive modeling strategy for the Samurai Predictive Event Model involves multi-step data transformation to ensure high-quality inputs. We integrate both real-time and batch processing to capture immediate and aggregate user behavior patterns. The current model architecture is based on Long Short-Term Memory (LSTM) networks, which are well suited to modeling sequential data, and we remain committed to experimenting with other architectures. This document outlines the key model design strategies and sets the stage for detailed model architecture considerations.
Key components
Data Transformation Process
Outlier Handling and Data Normalization: We clean and normalize numerical data to ensure accurate model inputs. Event properties like user session index or number of pages viewed can be noisy and prone to outliers, necessitating normalization.
Categorical Variable Encoding: Categorical data from the unified predictive event schema, such as marketing sources, devices, or sales representative labels, is crucial for modeling predictive events. We transform these nominal variables into a format usable for predictive modeling.
Text Data Tokenization: Events often include properties carrying unstructured text, such as user agents, which are highly informative from a modeling perspective. We convert these into structured, tokenized formats for further use.
Sequence Generation for LSTM Models: User events are arranged into high-quality sequences, reflecting behavior over time. This is essential for training LSTM networks.
These steps form the backbone of our data transformation strategy, ensuring the Samurai Predictive Event Model delivers reliable and actionable predictions. The following sections detail each step, highlighting the methods and techniques used to prepare the data for our predictive algorithms.
Detailed Data Preparation
Numerical Variables
Outlier Removal:
- Rows with values below the 5th percentile or above the 95th percentile for any numerical column in the predictive event schema are removed.
Log-Linear Transformation and Min-Max Normalization:
- Values in each numerical column are transformed as log(x + 1); the +1 offset handles potential zeros.
- Transformed values are then normalized using the Min-Max method, scaling all values in a column to a range of 0 to 1, as sketched below.
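The following is a minimal sketch of these two steps, assuming a pandas DataFrame of events; the column names in the usage example (`user_session_index`, `pages_viewed`) are illustrative, not the actual schema fields.

```python
import numpy as np
import pandas as pd

def preprocess_numerical(df: pd.DataFrame, numerical_cols: list) -> pd.DataFrame:
    """Remove percentile outliers, apply log(x + 1), then Min-Max scale each column."""
    # Drop rows outside the 5th-95th percentile band for any numerical column.
    for col in numerical_cols:
        low, high = df[col].quantile([0.05, 0.95])
        df = df[(df[col] >= low) & (df[col] <= high)]
    df = df.copy()

    for col in numerical_cols:
        # Log-linear transform: natural log of (value + 1) to handle zeros.
        df[col] = np.log1p(df[col])
        # Min-Max normalization to the [0, 1] range.
        col_min, col_max = df[col].min(), df[col].max()
        span = col_max - col_min
        df[col] = (df[col] - col_min) / span if span else 0.0

    return df

# Illustrative usage with event-level properties:
# events = preprocess_numerical(events, ["user_session_index", "pages_viewed"])
```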
Nominal Variables
Ordinal Encoding:
- Applied to all categorical variables with many distinct (but countable) values, i.e. high-cardinality features.
- This method assigns a unique natural number to each distinct value and replaces the original value with that numerical representation, while keeping the ordered list of original values so the encoding can be reversed (see the sketch below).
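A minimal sketch of this ordinal encoding, assuming a pandas DataFrame; `marketing_source` is a hypothetical column name used only for the usage example.

```python
import pandas as pd

def ordinal_encode(df: pd.DataFrame, column: str):
    """Map each distinct value to a natural number; keep the value list for decoding."""
    # Ordered list of original values; a value's position in the list is its code.
    categories = list(pd.unique(df[column].astype(str)))
    mapping = {value: code for code, value in enumerate(categories)}
    df[column] = df[column].astype(str).map(mapping)
    return df, categories

# df, sources = ordinal_encode(df, "marketing_source")
# Reverse operation: sources[code] recovers the original label.
```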
One-Hot Encoding:
- Used for categorical variables with few unique values, such as the `etl_tstamp` variable, which captures potential engagement signals like the time of day or day of the week.
- Indicator variables are created, generating 23 new attributes for model training (a sketch follows below).
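As a rough illustration, such indicators can be derived from `etl_tstamp` and one-hot encoded with pandas. The hour and day-of-week buckets shown here are assumptions; the exact bucketing that yields the 23 training attributes is defined in the pipeline and not reproduced here.

```python
import pandas as pd

# Derive engagement signals from the event timestamp.
df["etl_tstamp"] = pd.to_datetime(df["etl_tstamp"])
df["hour_of_day"] = df["etl_tstamp"].dt.hour
df["day_of_week"] = df["etl_tstamp"].dt.dayofweek

# One-hot encode: each bucket becomes a 0/1 indicator column for model training.
df = pd.get_dummies(df, columns=["hour_of_day", "day_of_week"], prefix=["hour", "dow"])
```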
Text Variables
Tokenization:
- Applied to columns containing unstructured text data, including page titles, URL substrings, and marketing campaign descriptions.
- The Tokenizer class from Keras is used to convert data into vector representations. Each unique word is assigned a natural number, creating a token corpus for the entire dataset.
- A fixed number of tokens is set to balance computational complexity and representational accuracy. Data is padded to a uniform length.
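A minimal sketch of this step using the Keras `Tokenizer`; the vocabulary cap, padded length, and the `page_title` column are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TOKENS = 10_000  # fixed vocabulary size: complexity vs. representational accuracy
MAX_LENGTH = 30      # uniform padded length for each text field

texts = df["page_title"].fillna("").astype(str).tolist()

tokenizer = Tokenizer(num_words=MAX_TOKENS)
tokenizer.fit_on_texts(texts)                          # build the token corpus
sequences = tokenizer.texts_to_sequences(texts)        # each word -> natural number
padded = pad_sequences(sequences, maxlen=MAX_LENGTH)   # pad/truncate to uniform length
```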
Training Data Preparation
Sequence Generation:
- Events and their attributes are collected into lists grouped by a unified user identifier. When the number of events exceeds a set length (e.g. four), sliding windows are created, resulting in sequences like (A, B, C, D) and (B, C, D, E).
- Events are sorted by the timestamp attribute to maintain the correct sequence order, crucial for training the LSTM model.
- Transformed attribute values are added to corresponding lists, with text columns combined into a single long vector.
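The sliding-window construction can be sketched as follows, assuming a flat pandas DataFrame of transformed events; the `user_id` identifier column and the window length constant are assumptions for illustration.

```python
import pandas as pd

WINDOW = 4  # sequence length, matching the example threshold above

def build_sequences(events: pd.DataFrame, feature_cols, window: int = WINDOW):
    """Group events per user, sort by timestamp, and emit sliding-window sequences."""
    sequences = []
    for _, user_events in events.groupby("user_id"):
        user_events = user_events.sort_values("etl_tstamp")  # preserve temporal order
        features = user_events[feature_cols].to_numpy()
        if len(features) < window:
            continue
        # Sliding windows: (A, B, C, D), (B, C, D, E), ...
        for start in range(len(features) - window + 1):
            sequences.append(features[start:start + window])
    return sequences
```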
Final Data Preparation:
- The tokenized sequences are converted to NumPy arrays, essential for LSTM models because of the vectorized computation they enable.
- Data is reshaped from two dimensions to three (samples, timesteps, features), matching the input format expected by LSTM networks.
- Labels representing realized events are also vectorized and formatted accordingly.
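A sketch of this final conversion, assuming `sequences` comes from the windowing step above and `labels` holds the realized events (label construction is outside the scope of this section).

```python
import numpy as np

# Stack the per-user windows into a 3-D tensor of shape
# (samples, timesteps, features) -- the input format LSTM layers expect.
X = np.asarray(sequences, dtype="float32")

# Labels for the realized events, vectorized alongside the inputs.
y = np.asarray(labels, dtype="float32")

print(X.shape)  # e.g. (num_sequences, 4, num_features)
```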