I am a data scientist, researcher, and author based in Edmonds, WA. I have worked on a diverse set of problems, and I try to solve them in the simplest way possible. My specialty is stochastic modeling and machine learning (including deep learning).
Highlights
- Technical lead for multiple DS / research teams. I like turning cool ideas into POCs and products 😊
- Author 2x, with books translated into multiple languages
- Personally built the analytics stack for Semantic Scholar
- Stanford-educated physicist by training: research in stochastic analysis and mathematical biology
Books
The Data Science Handbook
A self-contained overview of data science, this book covers the math, programming, and business sides of the field. Currently in its 2nd Edition.
Also available in Chinese and Korean
Published by Wiley & Sons
Data Science: The Executive Summary
This is for people who don't personally want to do data science, but need to leverage it in their organization. It gives a broad overview of the tools and techniques of data science, including the technical depth needed to critique models, interpret analytical results yourself, and see through bullshit.
Also available in Chinese
Published by Wiley & Sons
What is Math?
What is Math? is an exploration of mathematics not just as a tool, but as a deeply human pursuit. Weaving together cognitive science, linguistics, and rich historical context, this book offers a novel perspective on how we understand and interact with quantitative concepts.
Self-published
DeepFakes and Consulting
Building on my work at True Media, I have been creating and licensing datasets of deepfakes to help with training and measuring deepfake detection models. I monitor social media to identify new ways that people are making deepfakes, so that models can capture them before they are widely used by scammers. I also have a patent-pending method for generating training datasets tailored to specific verticals.
Previously I have consulted on a range of other projects, such as:
- Building data science teams
- Behavior classification in oil wells
- Analysis of biomedical images
Feel free to reach out if you'd like to discuss a project!
Selected Articles
China May Overtake US in AI Research
In 2019, while at the Allen Institute for AI, I predicted that China was on track to surpass the US as the leader in cutting-edge AI research in the mid-2020s. Well, it just happened! My original article analyzed a large corpus of scholarly articles, showing that China's share of top-tier research was growing steadily. The article was widely covered in the press, including WSJ, Wired, MIT Tech Review, and GeekWire.
Open-system thermodynamic analysis of DNA polymerase fidelity
Probably my best purely-academic work! DNA replication in organisms has an insanely low mutation rate, which biologists often attribute to "kinetic checkpoints". I showed that this low mutation rate actually comes at an energetic cost. In particular, kinetic checkpoints only work in the presence of a non-equilibrium steady state (NESS) between - the polymerase reaction degrades the NESS, and the body must spend energy to maintain it. To do this analysis I developed a novel variant of Markov chains.
A Stochastic Analysis of Hard Disks
I wrote this paper at CMU, computing the average wait time for requests to an idealized hard disk. It turns out to be a very subtle problem; many previously published papers botched the math in subtle ways. I later expanded the technique to polling systems in general.
Hidden Markov Models Tutorial Series
A tutorial series on Hidden Markov Models, their applications, and their variants.
Fun Projects
ESDM Therapist Finder
ESDM is a play-based therapy for kids on the autism spectrum. Unfortunately their website is very hard to navigate. I vibe-coded a map that lets you easily find therapists all over the world.
Alpine Lakes Finder
This is an interactive map of all alpine fishing lakes in the Pacific Northwest. You can filter by fish species, elevation of the lake, etc. All of this data is technically already available, but I find a map makes it MUCH easier to explore and interact with.
Mandarin Anki Cards
I am learning Mandarin, and I made this app to help me make flashcards. You just put in a big blob of Chinese text (for me it tends to be song lyrics) and it will extract the words and turn them into an Anki-friendly CSV that you can just paste in to make flashcards. It uses Llama on Hugging Face for the actual translation.
Other
CtHMM
A Python library I developed that supports continuous-time Hidden Markov Models. Basically it's HMMs where observations arrive at irregular intervals rather than on a fixed schedule, which is super useful in settings like medicine or customer interactions.
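To give a flavor of the idea, here is a minimal sketch of a continuous-time HMM likelihood for a two-state chain. It is not the CtHMM API: the function names, parameters, and the two-state restriction are all assumptions made for illustration. The key point is that the transition matrix depends on the time gap between observations, so irregular spacing is handled naturally.

```python
import math

# Hypothetical helper names; this is an illustrative sketch, not the CtHMM library.

def transition_matrix(rate_01, rate_10, dt):
    """Closed-form P(dt) = exp(Q * dt) for a two-state continuous-time Markov chain.

    rate_01 is the jump rate from state 0 to state 1, rate_10 the reverse.
    """
    total = rate_01 + rate_10
    decay = math.exp(-total * dt)
    p01 = rate_01 / total * (1.0 - decay)
    p10 = rate_10 / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def forward_log_likelihood(times, observations, rate_01, rate_10, emission, initial):
    """Scaled forward pass over observations taken at irregular times.

    emission[s][o] is the probability of seeing symbol o in hidden state s.
    """
    alpha = [initial[s] * emission[s][observations[0]] for s in range(2)]
    scale = sum(alpha)
    alpha = [a / scale for a in alpha]
    log_lik = math.log(scale)
    for k in range(1, len(times)):
        # The gap between observations sets the transition probabilities.
        P = transition_matrix(rate_01, rate_10, times[k] - times[k - 1])
        alpha = [
            sum(alpha[i] * P[i][j] for i in range(2)) * emission[j][observations[k]]
            for j in range(2)
        ]
        scale = sum(alpha)
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik


if __name__ == "__main__":
    emission = [[0.9, 0.1], [0.2, 0.8]]   # made-up emission probabilities
    times = [0.0, 0.3, 2.5, 2.6, 7.0]     # irregularly spaced observation times
    obs = [0, 0, 1, 1, 0]
    print(forward_log_likelihood(times, obs, 0.5, 1.0, emission, [0.5, 0.5]))
```

For more than two states there is no simple closed form, and one would typically compute the matrix exponential numerically instead.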
Patent US10162881B2
For machine-assisted discovery of join keys between different datasets. I led the team at Maana that developed this patent and integrated it into our production code.