Improving Predictive Accuracy

Practice Area

  • Data Science

Business Impact

  • 20-25% improvement in predictive accuracy


Challenges

  • Limited data source utilization
  • Inadequate predictive model

Technologies

  • Python


Allergic rhinitis (AR) affects almost one in every ten Americans. Many of them rely on mobile applications to make informed decisions about potential exposure to outdoor allergens. Such applications help users assess their exposure risk and thereby reduce their overall symptom burden. They also give manufacturers of anti-allergic preparations a way to learn more about their target customers, in particular about their experience with AR.

Our client, a Fortune 50 healthcare conglomerate, offers an application that predicts symptom severity and wanted to improve its predictive model.

The application relied on limited data and an unmaintained model based on a simple regression algorithm to make predictions. As a result, there was considerable room for improvement in its predictive accuracy.

There were three key challenges involved in enabling the application to deliver more value for users. First, the metric for measuring predictive accuracy was inadequate: it reflected neither relative class imbalances (the severity classes in this data are far from evenly distributed) nor the ordering of symptom severity. Second, improving predictive accuracy would require adding and preparing new input variables. Third, the client had no documentation or in-house expertise for the legacy model, which meant that Starschema would first have to reverse-engineer the model and then improve it with more effective, cutting-edge predictive algorithms.

The Starschema team identified the most appropriate metric to measure model performance and replaced the legacy metric with it. The new, custom-adjusted metric reflects both relative class imbalances and the ordering of symptom severity, and it also served as a baseline for evaluating the effectiveness of the developments that followed.
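The case study does not name the metric that was chosen. As an illustration only, the sketch below uses macro-averaged mean absolute error, one common metric that is sensitive to both class imbalance (each severity class counts equally) and ordering (a prediction that is three levels off costs more than one that is a single level off). The severity labels and sample data are hypothetical.

```python
from collections import defaultdict


def macro_averaged_mae(y_true, y_pred):
    """Mean absolute error computed per true class, then averaged over classes.

    Averaging per class keeps a rare class (e.g. severe symptoms) from being
    drowned out by a frequent one; the absolute difference respects the
    ordering of the severity levels.
    """
    errors_by_class = defaultdict(list)
    for actual, predicted in zip(y_true, y_pred):
        errors_by_class[actual].append(abs(actual - predicted))
    per_class_mae = [sum(errs) / len(errs) for errs in errors_by_class.values()]
    return sum(per_class_mae) / len(per_class_mae)


# Hypothetical labels: 0 = none, 1 = mild, 2 = moderate, 3 = severe.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 3, 3]
always_none = [0] * 10  # a useless model that always predicts "none"

# Plain accuracy looks good because the "none" class dominates.
plain_accuracy = sum(t == p for t, p in zip(y_true, always_none)) / len(y_true)

# The macro-averaged error exposes that every severe day is missed.
score = macro_averaged_mae(y_true, always_none)
```

On this toy data the always-"none" model reaches 80% plain accuracy, while its macro-averaged error is 1.5 severity levels; lower is better for this metric.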

Updating the data sources entailed two main tasks. The first involved making better use of existing sources. The team found that, by combining strongly correlated symptoms into joint variables, they could reduce statistical noise and decrease the model’s complexity to improve its overall robustness. The second involved expanding the range of inputs – which had previously comprised only key symptoms and pollen counts – with weather and patient treatment information to give the application a more comprehensive foundation for predictions.

In addition, identifying symptoms that typically co-occur made it possible to determine whether the symptoms a user is experiencing follow a typically allergic, an atypical or a mixed pattern. This way, the application can provide higher-quality feedback while requiring less manual input from the user.
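How the team combined correlated symptoms is not described in detail. The following is a minimal sketch of one plausible approach, assuming Pearson correlation and a simple average: columns whose correlation exceeds a threshold are merged into one composite column. The symptom names, the scores and the 0.9 threshold are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = sqrt(var_x * var_y)
    # A constant column has no defined correlation; treat it as uncorrelated.
    return cov / denominator if denominator else 0.0


def merge_correlated(columns, threshold=0.9):
    """Greedily group columns that correlate with a group's first member,
    then average each group row by row into one composite column."""
    groups = []
    for name in columns:
        for group in groups:
            if pearson(columns[group[0]], columns[name]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    merged = {}
    for group in groups:
        rows = zip(*(columns[name] for name in group))
        merged["+".join(group)] = [sum(row) / len(row) for row in rows]
    return merged


# Hypothetical daily symptom scores on a 0-3 scale.
symptoms = {
    "sneezing": [0, 1, 2, 3, 2, 1],
    "runny_nose": [0, 1, 2, 3, 3, 1],
    "itchy_eyes": [3, 0, 1, 0, 2, 3],
}
merged = merge_correlated(symptoms)
```

Here `sneezing` and `runny_nose` move together and end up in a single composite column, while `itchy_eyes` stays separate, so the model sees two inputs instead of three.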

The most important step involved feature engineering, that is, deriving secondary variables from the existing input variables. For the client’s application, this increased predictive accuracy because it let the model take temporal trends into account in addition to point-in-time values. Dimensionality reduction and feature selection further improved predictive accuracy by simplifying the model.
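The specific features the team derived are not listed. As a rough sketch of what trend features can look like, the function below turns a daily series into a previous-day value, a day-over-day change and a trailing mean. The function name, the window size and the pollen counts are hypothetical.

```python
def add_trend_features(values, window=3):
    """For each day, derive a lag, a day-over-day change and a trailing mean.

    The first day has no previous value, so its lag and change are None.
    The trailing mean uses up to `window` days ending on the current day.
    """
    rows = []
    for i, value in enumerate(values):
        previous = values[i - 1] if i > 0 else None
        recent = values[max(0, i - window + 1): i + 1]
        rows.append({
            "value": value,
            "lag_1": previous,
            "delta_1": value - previous if previous is not None else None,
            "rolling_mean": sum(recent) / len(recent),
        })
    return rows


# Hypothetical daily pollen counts.
pollen_counts = [10, 20, 60, 90]
features = add_trend_features(pollen_counts)
```

With these extra columns, a model can tell a count of 60 on a rising day apart from a count of 60 on a falling day, which a single point-in-time value cannot express.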

The Starschema team then rebuilt the underlying machine learning model from scratch, based on a Gradient Boosting Regressor algorithm, and migrated the implementation from Java to Python.
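The case study does not say which library or settings were used. To show the idea behind gradient boosting for regression, here is a dependency-free sketch with squared-error loss: it starts from the mean and repeatedly fits a one-split tree (a decision stump) to the current residuals. The feature values, targets and hyperparameters are hypothetical; in practice a library implementation such as scikit-learn's `GradientBoostingRegressor` would be the more typical choice.

```python
def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimizes squared error.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for feature in range(len(xs[0])):
        for threshold in sorted({row[feature] for row in xs}):
            left = [r for row, r in zip(xs, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(xs, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then add stumps fitted to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        feature, threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((feature, threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if row[feature] <= threshold else right_mean)
            for row, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(
        learning_rate * (left if row[feature] <= threshold else right)
        for feature, threshold, left, right in stumps
    )


# Hypothetical rows of [pollen count, relative humidity] and severity scores.
xs = [[10, 0.8], [20, 0.7], [60, 0.5], [90, 0.4], [120, 0.3], [150, 0.2]]
ys = [0.0, 0.5, 1.5, 2.0, 2.5, 3.0]

model = fit_gradient_boosting(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
model_mse = sum((y - predict(model, row)) ** 2 for row, y in zip(xs, ys)) / len(ys)
```

Each round corrects only a fraction (`learning_rate`) of the remaining error, which is what lets many weak stumps add up to a better fit than the constant baseline.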

Starschema delivered the solution in two months. The new data sources and ML model resulted in a consistent 20-25% uplift in the application’s predictive accuracy. The application now makes significantly more accurate predictions about symptom severity for the next three days based on pollen and weather data, as well as symptom and treatment data from the user.

The project also paved the way for future developments that will increase the application’s value. Users will benefit from further improvement in predictive accuracy thanks to the addition of air quality data, while the introduction of sales data will enable the indexing of the start of allergy season to help the client find out how it impacts the sales of allergy symptom relief products.
