Research Intern - Cheminformatics



South San Francisco, CA, USA
Posted on Wednesday, January 24, 2024

The Opportunity

Most of the chemistry knowledge in the world is stored in published papers and patents, often as PDFs. Everything from assay values to reaction conditions is stored in these documents, and extracting this information reliably and accurately would increase our ability to harness this knowledge.

As an intern, you will work directly on this problem, employing advanced image recognition, LLMs, and other machine learning tools to extract data from the literature relevant to ongoing drug discovery projects. You will be a part of the drug discovery department and report to the head of Computational Drug Design. You will work on site in South San Francisco with the team 2-3x/week (remote can be considered for strong candidates).


  • Use advanced image recognition and natural language processing, including large language models (LLMs), to extract chemical data from published papers and patents
  • Apply or fine-tune machine learning models, including LLMs, to identify and interpret assay values, reaction conditions, and other information relevant to drug discovery
  • Work closely with the drug discovery team to understand the data needs of drug discovery projects and ensure the extracted information aligns with project goals
  • Regularly validate the accuracy of extracted data and refine methodologies to improve data reliability and model precision

About You

  • A PhD student studying cheminformatics, natural language processing, or a related field.
  • Familiarity with machine learning techniques, especially DNN methods in image recognition and natural language processing.
  • Experience with Python and ML frameworks relevant to data extraction and analysis, including domain-specific packages and methods.
  • Genuine interest in drug discovery and a keenness to help advance cures for patients.

Compensation & Benefits at insitro

Our target starting salary for successful US-based applicants for this role is $55/hr - $65/hr. To determine starting pay, we consider multiple job-related factors including a candidate's skills, education and experience, market demand, business needs, and internal parity. We may also adjust this range in the future based on market data.

In addition, insitro provides our interns:

  • Excellent medical, dental, and vision coverage (insitro pays 100% of premiums for employees on our base plans)
  • Excellent mental health and well-being support
  • Access to free onsite baristas and cafe with daily lunch and breakfast
  • Access to free onsite fitness center
  • Commuter benefits

insitro is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

We believe diversity, equity, and inclusion need to be at the foundation of our culture. We bring together diverse teams—grounded in broad expertise and life experiences—and work even harder to ensure those teams excel in inclusive, growth-oriented environments supported by equitable company and team practices. All candidates can expect equitable treatment, respect, and fairness throughout the interview process.


About insitro
insitro is a drug discovery and development company using machine learning (ML) and data at scale to decode biology for transformative medicines. At the core of insitro’s approach is the convergence of in-house generated multi-modal cellular data and high-content phenotypic human cohort data. We rely on these data to develop ML-driven, predictive disease models that uncover underlying biologic state and elucidate critical drivers of disease. These powerful models rely on extensive biological and computational infrastructure and allow insitro to advance novel targets and patient biomarkers, design therapeutics and inform clinical strategy. insitro is advancing a wholly owned and partnered pipeline of insights and therapeutics in neuroscience, oncology and metabolism. Since launching in 2018, insitro has raised over $700 million from top tech, biotech and crossover investors, and from collaborations with pharmaceutical partners. For more information on insitro, please visit www.insitro.com.