New Modern LDL-C Formula with Symbolic Regression and Statistical Analysis, with Android and iOS Apps

Project Overview

This research project pioneers a novel approach to Low-Density Lipoprotein Cholesterol (LDL-C) calculation by leveraging symbolic regression to discover an optimal calculation formula. Using genetic programming techniques, we developed the Y-LDL-C formula, with data collected from the high-precision Abbott Architect c16000 Automatic Analyzer. The formula was then further validated through cross-comparison with the Roche Cobas system. The project combines machine learning innovation with rigorous clinical validation, culminating in the development of mobile applications that make this advanced calculation method accessible to healthcare professionals.

Research Innovation

While traditional LDL-C calculation methods rely on empirically derived formulas, our approach uses symbolic regression to systematically explore the mathematical relationship between lipid parameters. This data-driven approach allowed us to discover a more accurate and computationally efficient formula:

# Y-LDL-C formula discovered through symbolic regression
# KLS: total cholesterol, HDL: HDL-C, TGL: triglycerides
Yayla = KLS - HDL - (np.sqrt(TGL) * KLS / 100)
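
In mathematical notation, writing TC for total cholesterol (KLS in the code), HDL-C for HDL, and TG for triglycerides (TGL), the formula can be read as follows. The sample values in the worked line are illustrative and assume mg/dL; they are not taken from the study data.

```latex
\[
\text{Y-LDL-C} = \text{TC} - \text{HDL-C} - \frac{\sqrt{\text{TG}} \cdot \text{TC}}{100}
\]
\[
\text{TC}=200,\ \text{HDL-C}=50,\ \text{TG}=144:\quad
200 - 50 - \frac{\sqrt{144} \cdot 200}{100} = 150 - 24 = 126
\]
```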

The symbolic regression process utilized genetic programming with the following parameters:

  • Generations: 10
  • Population size: 8000
  • Function set: {‘add’, ‘sub’, ‘mul’, ‘div’, ‘sqrt’, ‘log’}
  • Parsimony coefficient: 0.2
  • Stopping criteria: 0.001

With these settings, the evolutionary search evaluated up to 80,000 candidate formulas (10 generations of 8,000 programs each), optimizing for both accuracy and simplicity. It converged on the Y-LDL-C formula, which outperforms traditional methods in several key scenarios, such as high triglyceride levels and extreme LDL-C values.
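
The trade-off between accuracy and simplicity can be illustrated with a minimal sketch of parsimony pressure. It assumes an error metric where lower is better and a penalty equal to the parsimony coefficient times the program length. This is a conceptual illustration, not gplearn's internal code; the `Candidate` type and both candidates, including their error and length values, are hypothetical.

```python
from dataclasses import dataclass

PARSIMONY_COEFFICIENT = 0.2  # value from the parameter list above


@dataclass
class Candidate:
    name: str         # hypothetical label
    raw_error: float  # e.g. mean absolute error vs direct LDL-C (lower is better)
    length: int       # number of nodes in the expression tree


def penalized_error(c: Candidate, coefficient: float = PARSIMONY_COEFFICIENT) -> float:
    """Error plus a size penalty, so longer formulas must earn their extra terms."""
    return c.raw_error + coefficient * c.length


# Hypothetical candidates: the longer one fits slightly better but is three times the size.
candidates = [
    Candidate("compact formula", raw_error=9.0, length=10),
    Candidate("bloated formula", raw_error=8.5, length=30),
]
best = min(candidates, key=penalized_error)
print(best.name, penalized_error(best))
```

Under this penalty, a slightly less accurate but much shorter formula wins, which is the behavior a high parsimony coefficient is meant to encourage.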

Methodology

Data Collection and Preprocessing

The study utilized a comprehensive dataset of lipid profiles, including:

  • Total Cholesterol (TC)
  • High-Density Lipoprotein Cholesterol (HDL-C)
  • Triglycerides (TGL)
  • Direct LDL-C measurements (used as reference)

Key preprocessing steps included:

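def clear_db(db):
    """Keep only complete groups of four consecutive integer results; edits db in place."""
    rows_to_delete = []
    i = 0
    while i < len(db['test']):
        checker = 0
        for j in range(4):
            try:
                if isinstance(db['result'].iloc[i + j], int):
                    checker += 1
            except IndexError:  # ran past the last row
                break
        if checker == 4:
            i += 4  # complete group of four results: keep it and skip ahead
        else:
            rows_to_delete.append(i)
            i += 1
    # Assumes the default RangeIndex, so row positions match index labels
    db.drop(rows_to_delete, inplace=True)
    db.dropna(inplace=True)  # drop any remaining rows with missing values
    return db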

Formula Discovery through Symbolic Regression

Our methodology centered on using genetic programming to evolve and discover the optimal LDL-C calculation formula:

from gplearn.genetic import SymbolicRegressor

est_gp = SymbolicRegressor(
    generations=10,
    population_size=8000,
    function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log'),
    parsimony_coefficient=0.2,
    stopping_criteria=0.001,
)

# Fit on the training features against the direct LDL-C targets
est_gp.fit(age_and_dependents_train, LDL_train)

The symbolic regression process considered various mathematical operations and their combinations, evaluating each candidate formula against direct LDL-C measurements. The final Y-LDL-C formula emerged as the optimal balance between accuracy and computational efficiency.

Validation Against Existing Methods

We compared the symbolically discovered Y-LDL-C formula against established methods:

  1. Friedewald Formula (Traditional Standard)
  2. Martin-Hopkins Method (Contemporary Reference)
  3. Sampson Method (Recent Innovation)

Cross-validation with Roche Cobas measurements demonstrated that our symbolically derived formula achieves:

  • Robust reliability across diverse patient populations
  • Superior accuracy in high triglyceride scenarios
  • Improved performance at extreme LDL-C values
  • Greater computational efficiency

    LDL-C Calculation Methods

    The study compared four different methods:

    1. Friedewald Formula:
    Friedewald = KLS - HDL - TGL / 5
    2. Martin-Hopkins Method:
    Martin = KLS - HDL - (TGL / martin_constant(TGL, KLS - HDL))
    3. Sampson Method:
    Sampson = (KLS / 0.948) - (HDL / 0.971) - (TGL / 8.56 + TGL * (KLS - HDL) / 2140 - (TGL ** 2) / 16100) - 9.44
    4. Yayla Method (Novel Approach):
    Yayla = KLS - HDL - (np.sqrt(TGL) * KLS / 100)
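
As a sketch of how the closed-form methods compare on a single lipid panel, the snippet below evaluates the Friedewald, Sampson, and Yayla expressions as written above. The function names and the sample panel are illustrative assumptions, with values taken to be in mg/dL. The Martin-Hopkins method is left out because `martin_constant` depends on a lookup table that is not reproduced in this document.

```python
import math


def friedewald(kls: float, hdl: float, tgl: float) -> float:
    """Friedewald estimate: triglycerides divided by a fixed factor of 5."""
    return kls - hdl - tgl / 5


def sampson(kls: float, hdl: float, tgl: float) -> float:
    """Sampson estimate, with non-HDL cholesterol written out as kls - hdl."""
    non_hdl = kls - hdl
    return (kls / 0.948) - (hdl / 0.971) - (
        tgl / 8.56 + tgl * non_hdl / 2140 - tgl ** 2 / 16100
    ) - 9.44


def yayla(kls: float, hdl: float, tgl: float) -> float:
    """Y-LDL-C estimate: square root of triglycerides scaled by total cholesterol."""
    return kls - hdl - (math.sqrt(tgl) * kls / 100)


# Hypothetical lipid panel, not taken from the study data
panel = {"kls": 200.0, "hdl": 50.0, "tgl": 150.0}
for method in (friedewald, sampson, yayla):
    print(f"{method.__name__}: {method(**panel):.1f}")
```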

    Statistical Analysis

    Performance Metrics

    The study utilized multiple statistical approaches to evaluate method performance:

    1. Basic Statistical Measures:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • Mean Absolute Error (MAE)
    • R-squared (R²)
    • Bias
    2. Advanced Statistical Analysis:
    • Passing-Bablok Regression
    • Bland-Altman Analysis
    • Cohen’s Kappa Statistics
    • Error Distribution Analysis
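
The basic measures listed above can be computed with a few lines of standard-library Python. This is a minimal sketch, not the project's own implementation: the function name is hypothetical, bias is assumed to mean the average signed difference (estimate minus reference), and the sample values are made up for illustration.

```python
import math


def performance_metrics(reference, estimate):
    """Return MSE, RMSE, MAE, R-squared and bias of estimates against reference values."""
    n = len(reference)
    if n == 0 or n != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [e - r for r, e in zip(reference, estimate)]
    mse = sum(d * d for d in errors) / n
    mean_ref = sum(reference) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)  # must be non-zero
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(d) for d in errors) / n,
        "r2": 1 - (mse * n) / ss_tot,
        "bias": sum(errors) / n,  # assumed definition: mean of (estimate - reference)
    }


direct = [100.0, 120.0, 140.0, 160.0]      # hypothetical direct LDL-C values
calculated = [102.0, 118.0, 143.0, 159.0]  # hypothetical formula outputs
metrics = performance_metrics(direct, calculated)
print(metrics)
```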

    Subgroup Analysis

    Performance was evaluated across different LDL-C and TGL ranges:

    LDL-C Categories:

    • < 70 mg/dL
    • 70-99 mg/dL
    • 100-129 mg/dL
    • 130-159 mg/dL
    • 160-189 mg/dL
    • ≥ 190 mg/dL

    TGL Categories:

    • < 100 mg/dL
    • 100-149 mg/dL
    • 150-199 mg/dL
    • 200-399 mg/dL
    • ≥ 400 mg/dL
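
Assigning a value to one of these ranges is a simple lookup against the category edges. The sketch below uses hypothetical names and assumes that non-integer values fall into the lower bin until they reach the next edge (so 99.5 counts as 70-99).

```python
from bisect import bisect_right

# Lower edges of every category except the first, in mg/dL
LDL_EDGES = [70, 100, 130, 160, 190]
LDL_LABELS = ["<70", "70-99", "100-129", "130-159", "160-189", ">=190"]
TGL_EDGES = [100, 150, 200, 400]
TGL_LABELS = ["<100", "100-149", "150-199", "200-399", ">=400"]


def categorize(value, edges, labels):
    """Return the label of the range containing value (edges are inclusive lower bounds)."""
    return labels[bisect_right(edges, value)]


print(categorize(128.0, LDL_EDGES, LDL_LABELS))
print(categorize(400.0, TGL_EDGES, TGL_LABELS))
```

The returned position in the label list also serves as an ordinal category index, which is the form needed for categorical agreement statistics.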

    Key Findings

    Method Performance

    1. Overall Accuracy:
      • Detailed comparison of MSE, RMSE, and R² values across methods
      • Analysis of bias and systematic errors
      • Performance in different LDL-C and TGL ranges
    2. Clinical Agreement:
      • Cohen’s Kappa analysis for categorical agreement
      • Classification accuracy in clinical decision ranges
      • Method-specific strengths and limitations
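
Categorical agreement between a calculated and a directly measured LDL-C category can be quantified with Cohen's kappa, optionally with linear or quadratic weights so that near misses count less than distant ones. The following is a minimal standard-library sketch under the usual textbook definition; the function name and the sample category indices are hypothetical, and it assumes at least two categories and that the ratings are not all in one category.

```python
def cohen_kappa(rater_a, rater_b, n_categories, weights=None):
    """Cohen's kappa for two lists of ordinal category indices (0 .. n_categories - 1).

    weights: None (unweighted), "linear", or "quadratic".
    """
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed joint proportions and their marginals
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1 / n
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (n_categories - 1)
        return d if weights == "linear" else d * d

    cells = [(i, j) for i in range(n_categories) for j in range(n_categories)]
    observed_dis = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    expected_dis = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1 - observed_dis / expected_dis


# Hypothetical category indices for six samples (direct vs calculated)
direct_cat = [0, 0, 1, 1, 2, 2]
calc_cat = [0, 1, 1, 1, 2, 1]
for w in (None, "linear", "quadratic"):
    print(w, round(cohen_kappa(direct_cat, calc_cat, 3, weights=w), 3))
```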

    Statistical Validation

    Passing-Bablok Regression Analysis

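    import numpy as np

    def passing_bablok(method1, method2):
        """Passing-Bablok slope and intercept of method2 (y) against method1 (x)."""
        n_points = len(method1)
        sv = []  # pairwise slopes
        k = 0    # number of slopes below -1, used to shift the median
        for i in range(n_points - 1):
            for j in range(i+1, n_points):
                dy = method2[j]-method2[i]
                dx = method1[j]-method1[i]
                if dx != 0:
                    gradient = dy / dx
                elif dy < 0:
                    gradient = -1.e+23  # stand-in for a vertical slope
                elif dy > 0:
                    gradient = 1.e+23
                else:
                    gradient = None  # identical points define no slope
                if gradient is not None:
                    sv.append(gradient)
                    k += (gradient < -1)
        # Completion of the excerpt, assuming the standard Passing-Bablok estimates:
        # the slope is the median of the sorted slopes, shifted by k positions
        sv.sort()
        m0 = (len(sv) - 1) / 2
        if m0 == int(m0):
            slope = sv[k + int(m0)]
        else:
            slope = 0.5 * (sv[k + int(m0 - 0.5)] + sv[k + int(m0 + 0.5)])
        # Intercept: median of the residuals y - slope * x
        intercept = np.median(np.asarray(method2) - slope * np.asarray(method1))
        return slope, intercept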

    Bland-Altman Analysis

    The study included comprehensive Bland-Altman plots showing:

    • Mean differences between methods
    • 95% Limits of Agreement
    • Distribution of differences across measurement ranges
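
The quantities behind such a plot are straightforward to compute. The sketch below assumes the conventional definition of the 95% limits of agreement (mean difference plus or minus 1.96 sample standard deviations of the differences); the function name and the sample values are hypothetical.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired measurements."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


calculated = [102.0, 118.0, 143.0, 159.0]  # hypothetical formula outputs
direct = [100.0, 120.0, 140.0, 160.0]      # hypothetical direct LDL-C values
mean_diff, lower, upper = bland_altman(calculated, direct)
print(f"bias {mean_diff:.2f}, limits of agreement {lower:.2f} to {upper:.2f}")
```

In the plot itself, each difference is drawn against the pairwise mean of the two methods, with horizontal lines at the three returned values.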

    Cross-Validation with Roche Cobas Analyzer

    While our Y-LDL-C formula showed promising results through mathematical and statistical validation, we sought to further evaluate its performance through cross-validation between different clinical analyzers. The validation study utilized data from a clinical laboratory setting where measurements were conducted using both the Roche Cobas analyzer and another high-precision clinical analyzer for comparison.

    The dataset incorporated parallel measurements of Total Cholesterol, HDL-C, Triglycerides, and direct LDL-C from these different analytical systems. Each sample was processed on both analyzers under standardized laboratory conditions, providing a robust basis for cross-validation of our calculation methods against different measurement platforms. This multi-analyzer approach allowed us to assess the formula’s reliability across different analytical systems commonly used in clinical settings.

    The cross-validation analysis revealed several interesting patterns in the relationship between our calculation method and direct measurements:

    • Moderate to strong agreement in standard clinical ranges
    • Variable performance in extreme value ranges
    • Comparable accuracy to other calculation methods when compared against direct measurement
    • Specific strengths in certain scenarios:
      • Reasonable performance with elevated triglycerides
      • Consistent results in common clinical ranges
      • Computational efficiency while maintaining acceptable accuracy

    While the results suggest that our formula provides a viable alternative for LDL-C estimation, we acknowledge that no calculation method can fully replace direct measurement in all scenarios. The Y-LDL-C formula offers a practical compromise between accuracy and accessibility, particularly in settings where direct measurement may not be feasible or cost-effective.

    Mobile Application Development

    Application Features

    1. Real-time LDL-C calculation using multiple methods
    2. Method comparison and recommendation system
    3. Result interpretation and clinical guidance
    4. Cross-platform compatibility (iOS and Android)

    Implementation Details

    • Native development for both platforms
    • Efficient algorithm implementation
    • User-friendly interface
    • Offline calculation capability

    Visualizations

    1. Error Distribution Plots
    2. Passing-Bablok Regression Plots
    3. Bland-Altman Plots
    4. Performance Comparison Histograms
    5. Classification Agreement Matrices
    6. Linear and Quadratic Cohen’s Kappa Plots

    Conclusions

    Our research project represents a significant advancement in LDL-C calculation methodology, successfully bridging the gap between accuracy and accessibility in lipid assessment. Through the application of symbolic regression, we discovered a novel formula that maintains high accuracy while reducing computational complexity. The extensive validation against the Roche Cobas system demonstrated the robustness of our approach across diverse patient populations and clinical scenarios. The implementation of these findings in mobile applications has made this advanced method readily available to healthcare professionals worldwide, potentially improving the accuracy and efficiency of cardiovascular risk assessment in clinical practice.

    Future Directions

    Looking ahead, our research opens several promising avenues for advancement in lipid assessment methodology. We are actively pursuing multi-center validation studies to further verify our findings across different populations and clinical settings. The integration of machine learning approaches beyond symbolic regression shows potential for further optimization of our calculation methods. We’re also exploring the development of enhanced mobile features, including automated method selection based on patient characteristics and seamless integration with laboratory information systems. Long-term goals include the establishment of a comprehensive validation framework for emerging calculation methods and the potential incorporation of additional lipid parameters to improve accuracy further. Through continued collaboration with healthcare providers and research institutions, we aim to expand the practical impact of our work in cardiovascular risk assessment and patient care.

    Technical Stack

    • Analysis: Python, NumPy, Pandas, SciPy, Scikit-learn, Gplearn
    • Visualization: Matplotlib, Seaborn
    • Mobile Development: Swift and Kotlin
    • Statistical Analysis: Custom implementations of medical statistics
    • Version Control: Git
