Protege
Applied Data Scientist
Published: 4 days ago
Job Description:
Company Overview:
We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data, starting in the healthcare industry.
Solving AI's data problem is a generational opportunity. The company that succeeds will be one of the largest in AI — and in tech.
Summary
The Applied Data Scientist bridges the gap between our data assets and our customers' needs in our healthcare vertical. They play a key role in ensuring our datasets are well-matched to the AI models our customers are building and well-understood by those customers. This role requires healthcare data expertise, extensive experience with statistical analysis, and some customer collaboration.
We are open to part-time, temp-to-hire, or full-time arrangements for this role. Part-time would require at least 20 hours per week.
Responsibilities
Data Analysis: Conduct feasibility analyses by querying healthcare datasets to assess patient cohort availability based on complex inclusion/exclusion criteria (e.g., procedures, diagnoses, diversity, longitudinal completeness, regulatory constraints).
Trade-off Assessments: Assess privacy-preservation techniques, balancing privacy protection against dataset utility.
Customer Collaboration: Work directly with prospective customers to understand their data requirements and help curate the best data assets for their use cases.
Data Strategy: Identify gaps in our data offerings and provide insights to our partnerships team on the highest-priority data acquisitions.
Data Quality Assurance: Evaluate potential data partnerships, ensuring the data is high-quality, well-documented, and commercially viable.
Technical Skill Set
Data Expertise: Experience working with healthcare/medical datasets (some combination of imaging, EHR, genomic, claims, and pathology data), as well as comfort with SQL, R, and/or Python for data analysis. The bigger the datasets you have worked with, the better!
Longitudinal & Cohort Analysis: Ability to evaluate datasets for completeness over time, ensuring sufficient patient follow-up and retention for model training.
Diversity & Bias Mitigation: Knowledge of techniques to assess and improve dataset diversity across demographics, geographies, and clinical subpopulations.
Privacy-Preserving Technologies: Familiarity with HIPAA de-identification methods such as Safe Harbor and Expert Determination.
Qualifications
2+ years of experience in a health data role (e.g., biomedical informatics, computational biology, AI/ML in healthcare) or equivalent experience (e.g., a Ph.D. or Master's in healthcare economics, statistics, or data science with a healthcare focus).
Excellent communication skills with the ability to translate complex data concepts.
Proficiency in Snowflake and a stats coding language (SQL, R, Python), including writing complex queries and working with large datasets.
Experience in a customer-facing role preferred.