72% to 89%: What Systematic Feature Engineering Actually Looks Like
At Omdena, I moved a classification model from 72% to 89% accuracy without changing the model architecture. Here is the methodology: ablation first, architecture second.

The number that matters here is 17 percentage points. At Omdena, the platform's primary classification model started at 72% accuracy. After three months of ML engineering work, it was at 89%. The 40% preprocessing latency reduction was a side effect of the same process.
Here is what actually happened.
What 72% Accuracy Meant in Context
The first thing to establish about any accuracy number is what it means for the product. 72% accuracy on a balanced 10-class problem is different from 72% on a binary classification task with severe class imbalance. In our case, the model was making health-related classification decisions on a dataset where errors had direct downstream consequences for the platform's output. 72% was functional. It was not production-ready.
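One way to make that difference concrete is to compute the accuracy a model gets for free by always predicting the most frequent class. The sketch below uses made-up label distributions, not the project's data, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label sets, for illustration only.
balanced_10_class = [c for c in range(10) for _ in range(100)]  # 10 classes x 100 rows
imbalanced_binary = [0] * 700 + [1] * 300                       # 70/30 split

print(majority_baseline(balanced_10_class))  # 0.1
print(majority_baseline(imbalanced_binary))  # 0.7
```

Against a 0.1 baseline, 72% is a large lift. Against a 0.7 baseline, it is barely better than guessing the majority class.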
The second thing: do not assume you know what is wrong before you measure. The team had hypotheses - the training data needed more examples in underrepresented classes, the model architecture was too simple, the feature set was missing signal. All of these might have been true. None of them were confirmed. The first step was building infrastructure to measure systematically.
Ablation First, Architecture Second
The instinct when a model underperforms is to find a better model. Swap in a more powerful architecture. Add a layer. Change the loss function. In my experience, this approach rarely helps as much as understanding what the current model is doing wrong - and it takes much longer, because each architecture change requires a full training run to evaluate.
I spent the first two weeks building an ablation framework. Every feature group in the dataset was tagged: demographic features, temporal features, behavioral features, derived aggregate features. The framework trained the model with different subsets of feature groups and reported accuracy on a fixed held-out evaluation set for each subset.
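A minimal sketch of what such a harness can look like, assuming a `train_and_eval` callable that trains on the given columns and scores against the same fixed held-out set every time. The group names follow the text; the column names, function names, and the placeholder scorer are hypothetical, and this is not the project's actual framework.

```python
from itertools import combinations

# Group names follow the text; the column names are made up for illustration.
FEATURE_GROUPS = {
    "demographic": ["age_bucket", "region"],
    "temporal": ["event_ts"],
    "behavioral": ["sessions", "clicks"],
    "derived_aggregate": ["agg_a", "agg_b"],
}

def run_ablation(groups, train_and_eval, min_groups=1):
    """Score every subset of feature groups with at least `min_groups` members.

    `train_and_eval(columns)` is assumed to train on those columns and return
    accuracy on one fixed held-out set.
    """
    names = sorted(groups)
    results = {}
    for k in range(min_groups, len(names) + 1):
        for subset in combinations(names, k):
            columns = [col for g in subset for col in groups[g]]
            results[subset] = train_and_eval(columns)
    return results

def leaderboard(results):
    """Return (subset, accuracy) pairs, best first."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder scorer so the sketch runs end to end; swap in a real
# train-and-evaluate call before reading anything into the numbers.
def placeholder_train_and_eval(columns):
    return 0.0

results = run_ablation(FEATURE_GROUPS, placeholder_train_and_eval)
print(len(results))  # 15 non-empty subsets of 4 groups
```

Exhaustive subsets are affordable with four groups. With many more groups, a leave-one-group-out pass is a common cheaper variant.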
The results were illuminating. Some feature groups added noise - removing them improved accuracy by 1-2 points. Some features that seemed semantically important turned out to have near-zero feature importance in practice. Two derived aggregate features that had been added without documentation were correlated with the label in a way that leaked future information into the training set. Removing them improved held-out accuracy by 6 points despite reducing training accuracy by 4 points.
The second week of ablation runs saved weeks of architecture experiments that would have been chasing an artifact in the data, not a model limitation.
The Features That Moved the Needle
After the ablation, the features that genuinely mattered were clear. The 17-point accuracy improvement came primarily from three changes:
Better temporal feature construction. The original features used raw timestamps. Constructing relative time features - time since the last event of each type, event frequency over rolling windows of different sizes - gave the model significantly better signal about patterns in behavior over time. The raw timestamps told the model when something happened. The relative features told it how that timing compared to typical patterns for that user.
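As a rough illustration of the two kinds of relative feature described here, the sketch below derives "time since the last event of this type" and "event count over rolling windows" from a time-ordered event list. The event tuple layout, window sizes, and function name are assumptions for the example, not the project's schema.

```python
from bisect import bisect_left
from collections import defaultdict

def relative_time_features(events, windows=(3600, 86400)):
    """events: list of (user_id, event_type, unix_ts), sorted by unix_ts.

    For each event, returns a dict with:
      secs_since_last_same_type: gap to the user's previous event of this type
                                 (None if there is none)
      count_<w>s: number of the user's earlier events with timestamp >= ts - w
    Only events earlier in the list are used, so nothing from the future leaks in.
    """
    last_seen = {}               # (user, type) -> timestamp of the previous event
    history = defaultdict(list)  # user -> timestamps of earlier events, ascending
    rows = []
    for user, etype, ts in events:
        prev = last_seen.get((user, etype))
        row = {
            "user": user,
            "type": etype,
            "ts": ts,
            "secs_since_last_same_type": None if prev is None else ts - prev,
        }
        past = history[user]
        for w in windows:
            # bisect_left finds the first earlier timestamp inside the window.
            row[f"count_{w}s"] = len(past) - bisect_left(past, ts - w)
        rows.append(row)
        last_seen[(user, etype)] = ts
        past.append(ts)
    return rows

# Hypothetical events, in time order.
events = [
    ("u1", "login", 0),
    ("u1", "click", 100),
    ("u1", "login", 4000),
    ("u2", "login", 4100),
]
for row in relative_time_features(events):
    print(row)
```

For the third event, the gap to the previous login is 4000 seconds, the one-hour window is empty, and the one-day window holds two earlier events.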
Fixing the leakage. The two aggregate features that leaked future information were removed entirely. This reduced training accuracy by 4 points and improved held-out accuracy by 6 points. Data leakage is one of the most common sources of optimistically inflated metrics in production ML pipelines. It is also among the hardest to find without systematic ablation, because the leaked features look informative and the model learns to rely on them.
Normalized aggregates. Several aggregate features were computed on raw counts. Users who are highly active on the platform accumulate higher raw counts than less active users, for reasons unrelated to the target behavior. Normalizing these aggregates by a relevant baseline - total events, session length, active days - made them comparable across users with different activity levels and improved generalization substantially.
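A small sketch of the idea, using two made-up users. The field names and the `rate` helper are hypothetical; the point is only that a raw count and a normalized rate can rank the same users in opposite order.

```python
def rate(count, baseline):
    """Count per unit of baseline; 0.0 when the baseline is zero."""
    return count / baseline if baseline else 0.0

# Hypothetical per-user aggregates, for illustration only.
users = [
    {"user": "heavy", "target_events": 200, "total_events": 2000, "active_days": 50},
    {"user": "light", "target_events": 20, "total_events": 100, "active_days": 5},
]

for u in users:
    u["target_share"] = rate(u["target_events"], u["total_events"])
    u["target_per_active_day"] = rate(u["target_events"], u["active_days"])

for u in users:
    print(u["user"], u["target_events"], u["target_share"], u["target_per_active_day"])
```

By raw count, the heavy user looks ten times more engaged with the target behavior. As a share of total events, the light user is twice as engaged (0.2 against 0.1), and per active day the two are identical.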
The model architecture did not change. The training procedure did not change. The hyperparameters did not change. Feature engineering moved the number 17 points.
Where the Latency Reduction Came From
The 40% preprocessing latency reduction was almost entirely the result of fixing redundant computation - not optimization in the algorithmic sense.
The original preprocessing pipeline applied transformations per sample inside the training loop. For every training batch, the same base transformations were applied to each sample from scratch. Moving transformations that are sample-independent - normalization statistics computed on the full dataset, categorical encodings, feature group flags - to a preprocessing step that runs once before training reduced per-epoch preprocessing time significantly.
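The restructuring can be sketched as a fit-once, transform-once step ahead of the training loop. The row layout and function names below are hypothetical, and the statistics here are fitted on the training rows only; this shows the pattern rather than the project's pipeline.

```python
from statistics import mean, pstdev

def fit_preprocessing(rows):
    """Compute sample-independent state once, from the training rows."""
    values = [r["value"] for r in rows]
    categories = sorted({r["category"] for r in rows})
    return {
        "mean": mean(values),
        "std": pstdev(values) or 1.0,  # guard against zero variance
        "cat_index": {c: i for i, c in enumerate(categories)},
    }

def transform(row, state):
    """Apply the fitted state to one row: (standardized value, category id)."""
    return (
        (row["value"] - state["mean"]) / state["std"],
        state["cat_index"][row["category"]],
    )

def batches(items, size):
    """Yield consecutive slices of `items` of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical training rows.
rows = [
    {"value": 1.0, "category": "a"},
    {"value": 3.0, "category": "b"},
    {"value": 5.0, "category": "a"},
    {"value": 7.0, "category": "c"},
]

state = fit_preprocessing(rows)                 # runs once
prepared = [transform(r, state) for r in rows]  # runs once, before training

for epoch in range(3):
    for batch in batches(prepared, size=2):
        pass  # a training step would consume `batch`; no preprocessing happens here
```

Transforms that must differ per epoch, such as random augmentation, would stay inside the loop. Only the sample-independent work moves out.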
The second improvement was vectorizing operations that were written as Python loops. This is not sophisticated optimization, but it is surprisingly common in ML pipelines that grew incrementally. When code is added feature by feature over months, the result is often a loop that processes each sample individually where a vectorized operation would be equivalent and orders of magnitude faster.
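To illustrate the kind of rewrite involved, here is the same standardization written as a per-element Python loop and as a single NumPy broadcast. The data is synthetic and the function names are made up; the actual speedup depends on array sizes and the operations involved.

```python
import numpy as np

def standardize_loop(X, mean, std):
    """Per-sample, per-feature Python loop: the shape that accumulates over time."""
    out = np.empty_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = (X[i, j] - mean[j]) / std[j]
    return out

def standardize_vectorized(X, mean, std):
    """Same arithmetic, applied to the whole array via broadcasting."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))             # synthetic data, for illustration
mean, std = X.mean(axis=0), X.std(axis=0)  # per-feature statistics

same = np.allclose(standardize_loop(X, mean, std), standardize_vectorized(X, mean, std))
print(same)  # True
```

Checking that the two versions agree before deleting the loop is the cheap part of the change, and it is what makes the rewrite safe.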
Both improvements required understanding what the pipeline was doing - which required the same systematic measurement discipline as the feature engineering work.
What Transfers to Other ML Problems
The methodology is not specific to this project:
Measure before you change anything. Build the evaluation pipeline first. A 3-point accuracy improvement means nothing if you do not have a stable, reproducible measurement that you can compare before and after. The evaluation set should be fixed throughout the project - changing it mid-project is how you end up with results that look good but do not generalize.
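One common way to keep an evaluation set fixed, sketched here as an assumption rather than as what the project did, is to assign each record to the evaluation set from a hash of a stable identifier. The `in_eval_set` name, the salt, and the 20% fraction are illustrative.

```python
import hashlib

def in_eval_set(record_id, eval_fraction=0.2, salt="eval-v1"):
    """Deterministically assign a record to the evaluation set.

    The decision depends only on the salt and the record's id, so it is the
    same on every run and does not shift when new records are added.
    """
    digest = hashlib.sha256(f"{salt}:{record_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform-ish value in [0, 1)
    return bucket < eval_fraction

# Hypothetical record ids.
ids = [f"rec-{i}" for i in range(10000)]
eval_ids = [i for i in ids if in_eval_set(i)]
print(len(eval_ids) / len(ids))  # roughly 0.2
```

Changing the salt produces a different split, which is exactly the thing to avoid mid-project.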
Ablation is cheaper than new architectures. A single ablation run costs about as much as one training run, the same as evaluating one architecture change, but it needs no new model code and its result tells you which inputs matter. Understanding which features matter costs far less than the experiments you would run chasing architecture changes that address the wrong problem.
Leakage is everywhere. Any feature derived from data that would not yet exist at prediction time is a potential leak. Temporal datasets are especially prone to this: aggregate statistics computed on the full dataset rather than on the data available at prediction time introduce leakage that can account for many percentage points of apparently good performance.
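The difference between a full-dataset aggregate and a point-in-time aggregate is easy to see on a tiny example. The values and function names below are made up for illustration.

```python
def full_dataset_mean(values):
    """Leaky: every row sees the mean over all rows, including later ones."""
    m = sum(values) / len(values)
    return [m] * len(values)

def point_in_time_mean(values):
    """Each row sees only the mean of the rows before it (None for the first)."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        out.append(total / i if i else None)
        total += v
    return out

values = [10.0, 20.0, 60.0]  # hypothetical per-event values, in time order

print(full_dataset_mean(values))   # [30.0, 30.0, 30.0]
print(point_in_time_mean(values))  # [None, 10.0, 15.0]
```

The first row's leaky feature already "knows" about the 60.0 that arrives two events later. At prediction time that value does not exist yet, so the model would be relying on a signal it can never receive in production.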
The process, not the number. The 17-point improvement is the output of a process: measure, ablate, form a hypothesis about what is wrong, make one change, measure again. That process applies regardless of domain or model type. It is faster than the alternative.
