PyMCon Afterword

A couple of weeks ago I gave a talk for the 2023 PyMCon Web Series. The aim of the talk was to advocate for a first-principles approach to modeling, i.e. one that focuses on modeling the data generating process (DGP) rather than outcomes alone. This post covers two slides, Tips and Resources, that didn’t make the final cut. Below, each would-have-been bullet point is covered in two sentences or fewer.

Read More
Posted on 2023-02-22

Time Series With Pandas

While pandas has been a part of my daily workflow for the last 6 years, it wasn’t until recently that I began to appreciate some of its most powerful features: in particular its time series feature set, especially when combined with the unsung hero of indexing.
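As a small taste of what the post covers (a sketch of my own, not an example from the post), a `DatetimeIndex` turns date selection and resampling into one-liners:

```python
import pandas as pd

# A hypothetical daily series indexed by date
s = pd.Series(
    range(90),
    index=pd.date_range("2020-10-01", periods=90, freq="D"),
)

# Partial-string indexing: select all of November with a single label
november = s["2020-11"]

# Downsample to monthly means (bins labeled by month start)
monthly = s.resample("MS").mean()
```

The same index powers alignment, rolling windows, and timezone handling, which is where the "unsung hero" part comes in.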

Read More
Posted on 2021-01-03

Evaluating my 2020 MLB Predictions - Part 2, The Postseason

This was my third year in a row making World Series predictions (you can find previous years’ predictions here). This time around, however, I took it one step further and predicted the outcomes of the entire MLB postseason. Now that the season is over, let’s see how they did.

Read More
Posted on 2020-11-10

MLB 2020 Postseason Projections

Just over 6 months after the 2020 MLB season was postponed indefinitely, and just under 3 months after the 60-game schedule was announced, the 2020 postseason begins today. While the MLB postseason is often compared to a crapshoot, that doesn’t stop us from trying to predict the outcome.

Read More
Posted on 2020-09-29

A Beginner's Guide to Why You Should or Shouldn't Be Using Kubernetes for Machine Learning (With Illustrations)

Kubernetes is a powerful tool that receives a lot of hype which can sometimes make it seem intimidating. Perhaps you’ve wondered whether you should be using Kubernetes to deploy your machine learning applications but aren’t sure where to start or even if it’s the right move. If so, this post is for you. I’ll try to start simple and work from the ground up to motivate the rationale for deploying models to Kubernetes.

Read More
Posted on 2020-08-07

My Quarantine Playlist

When Philadelphia first went into self-quarantine last month, realizing that I was going to be working from home for the foreseeable future, I set to building a playlist that I could listen to for a “while” without getting bored.

Read More
Posted on 2020-04-20

Scaling Predictions

Ever notice that predictions almost always look better in aggregate? That is, summing a group of predictions seems to result in an error lower than that of the individual predictions themselves.

Take a look at the following plot, which shows this effect on the Boston housing dataset. On the left are predictions of the median home price of towns near Boston (127 in all, taken from a randomly assigned test set) vs. the actual median home price of each town. For the purpose of this post, each town was (randomly) assigned to one of 25 groups; predicted prices for each group were summed and compared to the sums of the actual prices, as shown on the right.
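The intuition is that independent errors partially cancel when summed. That can be sketched with simulated data (not the Boston housing data from the post — all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true values plus independent, zero-mean prediction errors
n_items, group_size = 1000, 25
truth = rng.uniform(20, 50, size=n_items)       # e.g. median prices in $1000s
preds = truth + rng.normal(0, 5, size=n_items)  # per-item noise, sd = 5

# Average relative error of the individual predictions
item_rel_err = np.mean(np.abs(preds - truth) / truth)

# Group the items, sum predictions and truths within each group, then compare
truth_sums = truth.reshape(-1, group_size).sum(axis=1)
pred_sums = preds.reshape(-1, group_size).sum(axis=1)
group_rel_err = np.mean(np.abs(pred_sums - truth_sums) / truth_sums)
```

Because the noise terms are independent, the standard deviation of a group's summed error grows like the square root of the group size while the sum itself grows linearly, so the relative error of the sums is smaller.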

Read More
Posted on 2020-02-08

Neural Networks Explained

Last night - at the first ever PyData Philly meetup! (thanks to all of the organizers for taking the initiative to get this started) - I gave a lightning talk titled “Neural Networks Explained”. This was an idea I had been thinking about for about a week, and the talk was a great opportunity to get my thoughts together before writing them down in a post. Since I received a lot of positive feedback on the talk, I’m going to try to keep that “lightning talk feel” in this post.

Read More
Posted on 2019-12-12

Tensorflow 2 Feature Columns and Keras

tensorflow 2.0 was recently introduced, and one of its most anticipated features, in my opinion, was the revamping of its feature columns. Last July I published a blog post that was enthusiastic about the idea of tensorflow's feature columns but disappointed by the actual implementation, mainly because they weren’t compatible with keras even though tensorflow had already adopted the keras API.

Read More
Posted on 2019-10-24

Deep learning for tabular data 2 - Debunking the myth of the black box

Not knowing how to apply deep neural networks (DNNs) to the sort of tabular data commonly found in industry is a common reason for not adopting DNNs in production, and it was the topic of my last post. Another common reason for hesitating to use DNNs is what I’ll call the “myth of the black box”: the idea that unlike simpler models, such as linear regression or decision trees, which come with “interpretable” linear coefficients or feature importances, DNNs are hard to understand.

Read More
Posted on 2019-06-14

Deep learning for tabular data

There’s no shortage of hype surrounding deep learning these days. And for good reason. Advances in AI are crossing frontiers that many didn’t expect to see crossed for decades with the advent of new algorithms and the hardware that makes training these models feasible.

Read More
Posted on 2019-01-30

Active learning and deep probabilistic ensembles

Active learning, loosely described, is an iterative process for getting the most out of your training data. This is especially useful for cases where you have a lot of unlabeled data that you would like to use for supervised training, but labeling the data is extremely time consuming and/or costly. In that case you want to be able to intelligently choose which data points to label next so that your total training set consists of a rich and diverse set of examples with minimal redundancy.
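To make "intelligently choose" concrete, here is a minimal sketch of one common strategy, uncertainty sampling (a simple baseline, not the deep probabilistic ensembles the post is about — the function name and probabilities are invented for illustration):

```python
import numpy as np

def pick_next_to_label(probs, k=3):
    """Uncertainty sampling: pick the k unlabeled points whose predicted
    positive-class probability is closest to 0.5, i.e. where the model
    is least sure."""
    uncertainty = 1 - np.abs(probs - 0.5) * 2  # 1 = totally unsure, 0 = sure
    return np.argsort(uncertainty)[::-1][:k]

# Hypothetical model scores for 6 unlabeled points
probs = np.array([0.95, 0.52, 0.10, 0.49, 0.70, 0.03])
chosen = pick_next_to_label(probs, k=2)
```

An ensemble refines this idea by using disagreement between members, rather than a single model's score, as the uncertainty signal.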

Read More
Posted on 2019-01-19

Model Evaluation For Humans

This post is the basis of a talk I gave at PyData Miami 2019. You can find the slides for that talk on my GitHub linked at the bottom of this page.

Read More
Posted on 2019-01-07

Hierarchical Bayesian Ranking

As the title suggests, this post will examine how to use Bayesian models for ranking. As I’ve been on a kick with the MLB Statcast data, we’ll use this technique to create a ranked list of professional baseball teams and project who will win the World Series.

Read More
Posted on 2018-09-20

An Overview of Attention Is All You Need

About a year ago now a paper called Attention Is All You Need (in this post sometimes referred to as simply “the paper”) introduced an architecture called the Transformer model for sequence to sequence problems that achieved state of the art results in machine translation. The name of the paper is a reference to the fact that the Transformer leverages the widely successful attention mechanism within a more traditional “feed forward” framework without using recurrence or convolutions.

Read More
Posted on 2018-07-06

Know Your Trees

Tree based algorithms are among the most widely used algorithms in machine learning. This popularity is due largely to their achieving good results “out of the box”, that is, with little tuning or data processing. Tree based methods are also versatile: they can be applied to both classification and regression problems and easily support mixing categorical and numeric values. What’s more, algorithms such as the Random Forest inherit all of these benefits and are robust against overfitting, making them ideal candidates for almost any problem. This post will take a look at 7 properties of Decision Trees and offer practical insights on how to use these properties to get the most out of trees.

Read More
Posted on 2018-06-15

Monitoring Machine Learning Models in Production

Every production machine learning system is susceptible to covariate shift: the distribution of the data your model predicts on at run time “drifts” from the distribution it was trained on. This phenomenon can severely degrade a model’s performance and can occur for a variety of reasons - for example, your company now sells new products or no longer offers old ones, the price of inventory has increased, or there is a bug in the software that sends data to your model. The effects of covariate drift can be subtle and hard to detect but are nevertheless important to catch. In Google’s machine learning guide they report that refreshing a single stale table resulted in a 2% increase in install rates for Google Play. Retraining and redeploying your model is an obvious fix, but how often should you do it: as soon as new data arrives, every day, every week, every month? What’s more, retraining would only address the first two examples above and can’t be used to fix integration bugs.
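One lightweight way to catch drift (a sketch of my own, assuming a single numeric feature; not necessarily the method the post uses) is the Population Stability Index, which compares a feature's live distribution against its training distribution:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: a drift score between two samples,
    computed over decile bins of the training ("expected") data."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) in empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Simulated feature values: training data, stable production data, shifted data
rng = np.random.default_rng(42)
train = rng.normal(0, 1, 10_000)
live_ok = rng.normal(0, 1, 10_000)
live_shift = rng.normal(0.75, 1, 10_000)

drift_none = psi(train, live_ok)
drift_real = psi(train, live_shift)
```

A common rule of thumb treats PSI above roughly 0.2 as meaningful drift worth investigating; a check like this can run on every feature as a scheduled job, alerting before model quality visibly degrades.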

Read More
Posted on 2018-06-12

From Docker to Kubernetes

Docker has been a huge win for deploying software across the board and, in my own experience, for deploying machine learning models in particular. Perhaps you’ve already adopted this practice but wonder how to take the next step and deploy your image at scale via Kubernetes. If so, this post is for you.

Read More
Posted on 2018-05-24

Bayesian Online Learning

This post explores the usefulness of conjugate priors in the context of online learning. Specifically, we’ll consider the following

Read More
Posted on 2018-05-11

A brief primer on conjugate priors

This post attempts to introduce conjugate priors and give some intuition as to why they work. While this post focuses largely on the technical details of conjugate priors, my next post will focus on conjugate priors in practice.
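The canonical example of the idea (my own minimal sketch, not necessarily the one the post uses) is the Beta-Bernoulli pair: with a Beta prior on a coin's heads probability and Bernoulli observations, the posterior is again a Beta, and updating reduces to counting:

```python
from fractions import Fraction

def update(a, b, flips):
    """Conjugate update for a Beta(a, b) prior on a coin's heads probability:
    the posterior is Beta(a + heads, b + tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return a + heads, b + tails

a, b = 1, 1                          # Beta(1, 1) = uniform prior
a, b = update(a, b, [1, 1, 0, 1])    # observe 3 heads, 1 tail -> Beta(4, 2)
posterior_mean = Fraction(a, a + b)  # mean of Beta(a, b) is a / (a + b)
```

Because the posterior has the same functional form as the prior, no numerical integration is needed - which is exactly what makes conjugate priors attractive for the online setting in the next post.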

Read More
Posted on 2018-05-11