Rewired 2025: DataLine - Agentic AI for Clinical Data
Improving data reliability and workflow efficiency in healthcare research
At the Wake Forest School of Medicine Rewired 2025 Hackathon, our team presented DataLine, a modular AI system that automates the most tedious aspects of clinical data preparation.
An AI model is only as good as the data it learns from. Yet much of the preprocessing, cleaning, and validation of clinical datasets remains manual, fragmented, and prone to error. DataLine was designed to change that.
The Challenge
Clinical data analysis is inherently complex and fragmented. Across hospitals, research labs, and innovation centers, teams repeatedly face the same challenges when preparing datasets for machine learning and statistical modeling. Issues such as label inconsistencies, duplicate records, missing or incomplete data, potential PHI leakage, and bias in sampling or representation often recur across projects. These problems are frequently addressed in isolation, forcing researchers to spend valuable time re-solving the same issues without a shared or standardized way to capture, validate, and reuse solutions.
These inefficiencies slow progress, inflate research costs, and make it harder to build reliable, generalizable AI systems that can truly support clinical decision making.
The Solution: DataLine
DataLine integrates automated pipelines and agentic AI reasoning to deliver a unified workflow for data analysis, preprocessing, and visualization. Built with Python and LLM integration, it provides:
- Automated pipelines using state-of-the-art solutions.
- LLM-driven reasoning for detecting issues and proposing pipeline steps. The LLM is fed only non-sensitive metadata such as column names, types (e.g., integer, string), and inferred structures; no PHI or raw patient data is ever transmitted.
- Interactive dashboard via Gradio for guided exploration and human oversight.
Together, these modules provide a faster and more reliable way to move from raw datasets to actionable insights.
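To illustrate the metadata-only idea, here is a minimal sketch of how a schema summary could be built before anything is sent to an LLM. It is not DataLine's actual code: the function names `infer_type` and `summarize_schema`, the type labels, and the sample rows are all hypothetical, and the rows are synthetic. It uses only the Python standard library.

```python
import json


def infer_type(values):
    """Return a coarse type label for a column, ignoring missing values."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "empty"
    # Try the narrowest numeric type first, then fall back to string.
    for label, caster in (("integer", int), ("float", float)):
        try:
            for v in present:
                caster(str(v))
            return label
        except ValueError:
            continue
    return "string"


def summarize_schema(rows):
    """Map each column name to an inferred type; cell values are never returned."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}


# Synthetic rows for illustration only (not real patient data).
rows = [
    {"patient_id": "1001", "age": "64", "heart_rate": "72.5", "diagnosis": "AFib"},
    {"patient_id": "1002", "age": "", "heart_rate": "80", "diagnosis": "Normal"},
]

schema = summarize_schema(rows)
# Only this summary, which holds column names and type labels, would go into a prompt.
prompt_payload = json.dumps(schema)
print(prompt_payload)
```

The point of the design is that the payload is derived from the data but contains none of it: the raw rows stay local, and only the schema summary leaves the machine.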
Features
(Developed over a three-week period during the Rewired 2025 Hackathon)
- Near-Duplicate Detection and Label Issue Identification via Cleanlab libraries.
- ECG and Statistical Visualizations, including histograms, scatter plots, and rolling averages.
- Copilot-Style Interaction, where researchers can issue natural-language commands to run analyses or execute pipeline steps.
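Two of the features above can be sketched in a few lines. This is a simplified illustration, not DataLine's implementation: near-duplicate detection here uses the standard library's `difflib.SequenceMatcher` as a stand-in for the Cleanlab-based approach, and the function names, the 0.9 threshold, and the sample values are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(records, threshold=0.9):
    """Return index pairs whose normalized text similarity is >= threshold."""
    # Lowercase and collapse whitespace so trivial differences are ignored.
    normalized = [" ".join(str(r).lower().split()) for r in records]
    pairs = []
    for i, j in combinations(range(len(normalized)), 2):
        ratio = SequenceMatcher(None, normalized[i], normalized[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j))
    return pairs


def rolling_mean(values, window):
    """Moving average over full windows; yields len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Synthetic examples for illustration only.
notes = [
    "Patient reports chest pain",
    "patient reports  chest pain.",
    "Follow-up scheduled in two weeks",
]
duplicate_pairs = find_near_duplicates(notes)

heart_rate = [72, 75, 78, 74, 70]
smoothed = rolling_mean(heart_rate, window=3)

print(duplicate_pairs)
print(smoothed)
```

The pairwise comparison is quadratic in the number of records, so it only suits small samples; a production pipeline would rely on a dedicated library or an approximate nearest-neighbor index instead.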
Future Steps
- Introduce bias and data-drift detection.
- Add pipelines for state-of-the-art synthetic data generation and intelligent data imputation.
- Automate documentation generation to improve reproducibility and streamline onboarding for new researchers.
Team
I was delighted to collaborate with incredibly talented colleagues and to grow under the mentorship of Josh Cherian.
Members:
- Suraj Prasai
- Saroj Bhatta
- Alejandro Gonzalez Rubio
Hosted by the Wake Forest School of Medicine — Center for Remote Health Monitoring.