New Modern LDL-C Formula with Symbolic Regression and Statistical Analysis with Android – IOS App
Project Overview
This research project pioneers a novel approach to Low-Density Lipoprotein Cholesterol (LDL-C) calculation by leveraging symbolic regression to discover an optimal calculation formula. Using genetic programming techniques, we developed the Y-LDL-C formula, with data collected from the high-precision Abbott Architect c16000 Automatic Analyzer. The formula was then further validated through cross-comparison with the Roche Cobas system. The project combines machine learning innovation with rigorous clinical validation, culminating in the development of mobile applications that make this advanced calculation method accessible to healthcare professionals.
Research Innovation
While traditional LDL-C calculation methods rely on empirically derived formulas, our approach uses symbolic regression to systematically explore the mathematical relationship between lipid parameters. This data-driven approach allowed us to discover a more accurate and computationally efficient formula:
pythonCopy# Y-LDL-C Formula discovered through symbolic regression
Yayla = KLS - HDL - (np.sqrt(TGL) * KLS / 100)
The symbolic regression process utilized genetic programming with the following parameters:
- Generations: 10
- Population size: 8000
- Function set: {‘add’, ‘sub’, ‘mul’, ‘div’, ‘sqrt’, ‘log’}
- Parsimony coefficient: 0.2
- Stopping criteria: 0.001
This evolutionary approach evaluated millions of potential formula combinations, optimizing for both accuracy and simplicity, ultimately converging on the Y-LDL-C formula that outperforms traditional methods in several key scenarios.
Methodology
Data Collection and Preprocessing
The study utilized a comprehensive dataset of lipid profiles, including:
- Total Cholesterol (TC)
- High-Density Lipoprotein Cholesterol (HDL-C)
- Triglycerides (TGL)
- Direct LDL-C measurements (used as reference)
Key preprocessing steps included:
pythonCopydef clear_db(db):
DeletingRows = []
i = 0
while i < (len(db['test'])):
checker = 0
for j in range(4):
try:
if isinstance(db['result'].iloc[i + j], int):
checker += 1
except:
break
if checker == 4:
i += 4
else:
DeletingRows.append(i)
i += 1
db.drop(DeletingRows, inplace=True)
db.dropna()
Formula Discovery through Symbolic Regression
Our methodology centered on using genetic programming to evolve and discover the optimal LDL-C calculation formula:
pythonCopyest_gp = SymbolicRegressor(
generations=10,
population_size=8000,
function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log'),
parsimony_coefficient=0.2,
stopping_criteria=0.001
)
est_gp.fit(age_and_dependents_train, LDL_train)
The symbolic regression process considered various mathematical operations and their combinations, evaluating each candidate formula against direct LDL-C measurements. The final Y-LDL-C formula emerged as the optimal balance between accuracy and computational efficiency.
Validation Against Existing Methods
We compared the symbolically discovered Y-LDL-C formula against established methods:
- Friedewald Formula (Traditional Standard)
- Martin-Hopkins Method (Contemporary Reference)
- Sampson Method (Recent Innovation)
Cross-validation with Roche Cobas measurements demonstrated that our symbolically derived formula achieves:
Robust reliability across diverse patient populations
Superior accuracy in high triglyceride scenarios
Improved performance at extreme LDL-C values
Greater computational efficiency
LDL-C Calculation Methods
The study compared four different methods:
- Friedewald Formula:
pythonCopyFriedewald = KLS - HDL - TGL / 5
- Martin-Hopkins Method:
pythonCopyMartin = KLS - HDL - (TGL / martin_constant(TGL, KLS - HDL))
- Sampson Method:
pythonCopySampson = (KLS / 0.948) - (HDL / 0.971) - (TGL / 8.56 + TGL * (KLS - HDL) / 2140 - (TGL ** 2) / 16100) - 9.44
- Yayla Method (Novel Approach):
pythonCopyYayla = KLS - HDL - (np.sqrt(TGL) * KLS / 100)
Statistical Analysis
Performance Metrics
The study utilized multiple statistical approaches to evaluate method performance:
- Basic Statistical Measures:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R-squared (R²)
- Bias
- Advanced Statistical Analysis:
- Passing-Bablok Regression
- Bland-Altman Analysis
- Cohen’s Kappa Statistics
- Error Distribution Analysis
Subgroup Analysis
Performance was evaluated across different LDL-C and TGL ranges:
LDL-C Categories:
- < 70 mg/dL
- 70-99 mg/dL
- 100-129 mg/dL
- 130-159 mg/dL
- 160-189 mg/dL
- ≥ 190 mg/dL
TGL Categories:
- < 100 mg/dL
- 100-149 mg/dL
- 150-199 mg/dL
- 200-399 mg/dL
- ≥ 400 mg/dL
Key Findings
Method Performance
-
Overall Accuracy:
- Detailed comparison of MSE, RMSE, and R² values across methods
- Analysis of bias and systematic errors
- Performance in different LDL-C and TGL ranges
-
Clinical Agreement:
- Cohen’s Kappa analysis for categorical agreement
- Classification accuracy in clinical decision ranges
- Method-specific strengths and limitations
Statistical Validation
Passing-Bablok Regression Analysis
pythonCopydef passing_bablok(method1, method2):
n_points = len(method1)
sv = []
k = 0
for i in range(n_points - 1):
for j in range(i+1, n_points):
dy = method2[j]-method1[i]
dx = method1[j]-method1[i]
if dx != 0:
gradient = dy / dx
elif dy < 0:
gradient = -1.e+23
elif dy > 0:
gradient = 1.e+23
else:
gradient = None
if gradient is not None:
sv.append(gradient)
k += (gradient < -1)

Bland-Altman Analysis
The study included comprehensive Bland-Altman plots showing:
- Mean differences between methods
- 95% Limits of Agreement
- Distribution of differences across measurement ranges

Cross-Validation with Roche Cobas Analyzer
While our Y-LDL-C formula showed promising results through mathematical and statistical validation, we sought to further evaluate its performance through cross-validation between different clinical analyzers. The validation study utilized data from a clinical laboratory setting where measurements were conducted using both the Roche Cobas analyzer and another high-precision clinical analyzer for comparison.
The dataset incorporated parallel measurements of Total Cholesterol, HDL-C, Triglycerides, and direct LDL-C from these different analytical systems. Each sample was processed on both analyzers under standardized laboratory conditions, providing a robust basis for cross-validation of our calculation methods against different measurement platforms. This multi-analyzer approach allowed us to assess the formula’s reliability across different analytical systems commonly used in clinical settings.

The cross-validation analysis revealed several interesting patterns in the relationship between our calculation method and direct measurements:
- Moderate to strong agreement in standard clinical ranges
- Variable performance in extreme value ranges
- Comparable accuracy to other calculation methods when compared against direct measurement
- Specific strengths in certain scenarios:
- Reasonable performance with elevated triglycerides
- Consistent results in common clinical ranges
- Computational efficiency while maintaining acceptable accuracy
While the results suggest that our formula provides a viable alternative for LDL-C estimation, we acknowledge that no calculation method can fully replace direct measurement in all scenarios. The Y-LDL-C formula offers a practical compromise between accuracy and accessibility, particularly in settings where direct measurement may not be feasible or cost-effective.
Mobile Application Development
Application Features
- Real-time LDL-C calculation using multiple methods
- Method comparison and recommendation system
- Result interpretation and clinical guidance
- Cross-platform compatibility (iOS and Android)
Implementation Details
- Native development for both platforms
- Efficient algorithm implementation
- User-friendly interface
- Offline calculation capability
Visualizations


- Error Distribution Plots
- Passing-Bablok Regression Plots
- Bland-Altman Plots
- Performance Comparison Histograms
- Classification Agreement Matrices
- Linear and Quadratic Cohen’s Kappa Plots
Conclusions
Our research project represents a significant advancement in LDL-C calculation methodology, successfully bridging the gap between accuracy and accessibility in lipid assessment. Through the application of symbolic regression, we discovered a novel formula that maintains high accuracy while reducing computational complexity. The extensive validation against the Roche Cobas system demonstrated the robustness of our approach across diverse patient populations and clinical scenarios. The implementation of these findings in mobile applications has made this advanced method readily available to healthcare professionals worldwide, potentially improving the accuracy and efficiency of cardiovascular risk assessment in clinical practice.
Future Directions
Looking ahead, our research opens several promising avenues for advancement in lipid assessment methodology. We are actively pursuing multi-center validation studies to further verify our findings across different populations and clinical settings. The integration of machine learning approaches beyond symbolic regression shows potential for further optimization of our calculation methods. We’re also exploring the development of enhanced mobile features, including automated method selection based on patient characteristics and seamless integration with laboratory information systems. Long-term goals include the establishment of a comprehensive validation framework for emerging calculation methods and the potential incorporation of additional lipid parameters to improve accuracy further. Through continued collaboration with healthcare providers and research institutions, we aim to expand the practical impact of our work in cardiovascular risk assessment and patient care.
Technical Stack
- Analysis: Python, NumPy, Pandas, SciPy, Scikit-learn, Gplearn
- Visualization: Matplotlib, Seaborn
- Mobile Development: Swift and Kotlin
- Statistical Analysis: Custom implementations of medical statistics
- Version Control: Git