PyMCon Afterword

A couple of weeks ago I gave a talk for the 2023 PyMCon Web Series. The aim of the talk was to advocate for a first-principles approach to modeling, i.e. one that focuses on modeling the data generating process (DGP) rather than outcomes alone. This post covers two slides, Tips and Resources, that didn’t make the final cut. Below, each would-have-been bullet point is covered in two sentences or fewer.

Read More
Posted on 2023-02-22

Time Series With Pandas

While pandas has been a part of my daily workflow for the last 6 years, it wasn’t until recently that I began to appreciate some of its most powerful features: in particular its time series feature set, especially when combined with the unsung hero of indexing.
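As a small taste of what the post covers (a sketch of my own, not an example from the post), a `DatetimeIndex` turns date selection and resampling into one-liners:

```python
import pandas as pd

# A hypothetical daily series indexed by date
s = pd.Series(
    range(90),
    index=pd.date_range("2020-10-01", periods=90, freq="D"),
)

# Partial-string indexing: select all of November with a single label
november = s["2020-11"]

# Downsample to monthly means (bins labeled by month start)
monthly = s.resample("MS").mean()
```

The same index powers alignment, rolling windows, and timezone handling, which is where the "unsung hero" part comes in.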

Read More
Posted on 2021-01-03

Evaluating my 2020 MLB Predictions - Part 2, The Postseason

This was my third year in a row making World Series predictions (you can find previous years’ predictions here). This time around, however, I took it one step further and predicted the outcomes of the entire MLB postseason. Now that the season is over, let’s see how they did.

Read More
Posted on 2020-11-10

MLB 2020 Postseason Projections

Just over 6 months after the 2020 MLB season was postponed indefinitely, and just under 3 months after the 60-game schedule was announced, the 2020 postseason begins today. While the MLB postseason is often compared to a crapshoot, that doesn’t stop us from trying to predict the outcome.

Read More
Posted on 2020-09-29

A Beginner's Guide to Why You Should or Shouldn't Be Using Kubernetes for Machine Learning (With Illustrations)

Kubernetes is a powerful tool that receives a lot of hype which can sometimes make it seem intimidating. Perhaps you’ve wondered whether you should be using Kubernetes to deploy your machine learning applications but aren’t sure where to start or even if it’s the right move. If so, this post is for you. I’ll try to start simple and work from the ground up to motivate the rationale for deploying models to Kubernetes.

Read More
Posted on 2020-08-07

My Quarantine Playlist

When Philadelphia first went into self-quarantine last month, realizing that I was going to be working from home for the foreseeable future, I set to building a playlist that I could listen to for a “while” without getting bored.

Read More
Posted on 2020-04-20

Scaling Predictions

Ever notice that predictions almost always look better in aggregate? That is, summing a group of predictions seems to result in an error lower than that of the individual predictions themselves.

Take a look at the following plot, which shows this effect on the Boston housing dataset. On the left are predictions of the median home price of towns near Boston (127 in all, taken from a randomly assigned test set) vs. the actual median home price of each town. For the purpose of this post, each town was (randomly) assigned to one of 25 groups; predicted prices for each group were summed and compared to the sums of the actual prices, as shown on the right.
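The intuition is that independent errors partially cancel when summed. That can be sketched with simulated data (not the Boston housing data from the post — all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true values plus independent, zero-mean prediction errors
n_items, group_size = 1000, 25
truth = rng.uniform(20, 50, size=n_items)       # e.g. median prices in $1000s
preds = truth + rng.normal(0, 5, size=n_items)  # per-item noise, sd = 5

# Average relative error of the individual predictions
item_rel_err = np.mean(np.abs(preds - truth) / truth)

# Group the items, sum predictions and truths within each group, then compare
truth_sums = truth.reshape(-1, group_size).sum(axis=1)
pred_sums = preds.reshape(-1, group_size).sum(axis=1)
group_rel_err = np.mean(np.abs(pred_sums - truth_sums) / truth_sums)
```

Because the noise terms are independent, the standard deviation of a group's summed error grows like the square root of the group size while the sum itself grows linearly, so the relative error of the sums is smaller.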

Read More
Posted on 2020-02-08

Neural Networks Explained

Last night - at the first ever PyData Philly meetup! (thanks to all of the organizers for taking the initiative to get this started) - I gave a lightning talk titled “Neural Networks Explained”. This was an idea I had been thinking about for about a week, and the talk was a great opportunity to get my thoughts together before writing them down in a post. Since I received a lot of positive feedback on the talk, I’m going to try to keep that “lightning talk feel” in this post.

Read More
Posted on 2019-12-12

Tensorflow 2 Feature Columns and Keras

tensorflow 2.0 was recently introduced, and one of its most anticipated features, in my opinion, was the revamping of its feature columns. Last July I published a blog post that was enthusiastic about the idea of tensorflow's feature columns but disappointed by the actual implementation, mainly because they weren’t compatible with keras even though tensorflow had already adopted the keras API.

Read More
Posted on 2019-10-24

Deep learning for tabular data 2 - Debunking the myth of the black box

Not knowing how to apply deep neural networks (DNNs) to the sort of tabular data commonly found in industry is a common reason for not adopting DNNs in production, and it was the topic of my last post. Another common reason for hesitating to use DNNs is what I’ll call the “myth of the black box”: the idea that unlike simpler models, such as linear regression or decision trees, which come with “interpretable” linear coefficients or feature importances, DNNs are hard to understand.

Read More
Posted on 2019-06-14

Deep learning for tabular data

There’s no shortage of hype surrounding deep learning these days. And for good reason. Advances in AI are crossing frontiers that many didn’t expect to see crossed for decades with the advent of new algorithms and the hardware that makes training these models feasible.

Read More
Posted on 2019-01-30

Active learning and deep probabilistic ensembles

Active learning, loosely described, is an iterative process for getting the most out of your training data. This is especially useful for cases where you have a lot of unlabeled data that you would like to use for supervised training, but labeling the data is extremely time consuming and/or costly. In that case you want to be able to intelligently choose which data points to label next so that your total training set consists of a rich and diverse set of examples with minimal redundancy.
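To make "intelligently choose" concrete, here is a minimal sketch of one common strategy, uncertainty sampling (a simple baseline, not the deep probabilistic ensembles the post is about — the function name and probabilities are invented for illustration):

```python
import numpy as np

def pick_next_to_label(probs, k=3):
    """Uncertainty sampling: pick the k unlabeled points whose predicted
    positive-class probability is closest to 0.5, i.e. where the model
    is least sure."""
    uncertainty = 1 - np.abs(probs - 0.5) * 2  # 1 = totally unsure, 0 = sure
    return np.argsort(uncertainty)[::-1][:k]

# Hypothetical model scores for 6 unlabeled points
probs = np.array([0.95, 0.52, 0.10, 0.49, 0.70, 0.03])
chosen = pick_next_to_label(probs, k=2)
```

An ensemble refines this idea by using disagreement between members, rather than a single model's score, as the uncertainty signal.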

Read More
Posted on 2019-01-19

Model Evaluation For Humans

This post is the basis of a talk I gave at PyData Miami 2019. You can find the slides for that talk on my GitHub linked at the bottom of this page.

Read More
Posted on 2019-01-07

Hierarchical Bayesian Ranking

As the title suggests, this post will examine how to use Bayesian models for ranking. As I’ve been on a kick with the MLB Statcast data, we’ll use this technique to create a ranked list of professional baseball teams and project who will win the World Series.

Read More
Posted on 2018-09-20

An Overview of Attention Is All You Need

About a year ago now a paper called Attention Is All You Need (in this post sometimes referred to as simply “the paper”) introduced an architecture called the Transformer model for sequence to sequence problems that achieved state of the art results in machine translation. The name of the paper is a reference to the fact that the Transformer leverages the widely successful attention mechanism within a more traditional “feed forward” framework without using recurrence or convolutions.

Read More
Posted on 2018-07-06

Know Your Trees

Tree based algorithms are among the most widely used algorithms in machine learning. This popularity is due largely to their achieving good results “out of the box”, that is, with little tuning or data processing. Tree based methods are also versatile: they can be applied to both classification and regression problems and easily support mixing categorical and numeric values. What’s more, algorithms such as the Random Forest inherit all of these benefits and are robust against overfitting, making them ideal candidates for almost any problem. This post will take a look at 7 properties of Decision Trees and offer practical insights on how to use these properties to get the most out of trees.

Read More
Posted on 2018-06-15

Monitoring Machine Learning Models in Production

Every production machine learning system is susceptible to covariate shift: the distribution of the data your model predicts on at run time “drifts” from the distribution it was trained on. This phenomenon can severely degrade a model’s performance and can occur for a variety of reasons - for example, your company now sells new products or no longer offers old ones, the price of inventory has increased, or there is a bug in the software that sends data to your model. The effects of covariate drift can be subtle and hard to detect but are nevertheless important to catch. In Google’s machine learning guide they report that refreshing a single stale table resulted in a 2% increase in install rates for Google Play. Retraining and redeploying your model is an obvious fix, but how often should you do it: as soon as new data arrives, every day, every week, every month? What’s more, retraining would only address the first two examples above and can’t be used to fix integration bugs.
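One lightweight way to catch drift (a sketch of my own, assuming a single numeric feature; not necessarily the method the post uses) is the Population Stability Index, which compares a feature's live distribution against its training distribution:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: a drift score between two samples,
    computed over decile bins of the training ("expected") data."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range production values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) in empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Simulated feature values: training data, stable production data, shifted data
rng = np.random.default_rng(42)
train = rng.normal(0, 1, 10_000)
live_ok = rng.normal(0, 1, 10_000)
live_shift = rng.normal(0.75, 1, 10_000)

drift_none = psi(train, live_ok)
drift_real = psi(train, live_shift)
```

A common rule of thumb treats PSI above roughly 0.2 as meaningful drift worth investigating; a check like this can run on every feature as a scheduled job, alerting before model quality visibly degrades.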

Read More
Posted on 2018-06-12

From Docker to Kubernetes

Docker has been a huge win for deploying software across the board and, in my own experience, for deploying machine learning models in particular. Perhaps you’ve already adopted this practice but wonder how to take the next step and deploy your image at scale via Kubernetes. If so, this post is for you.

Read More
Posted on 2018-05-24

Bayesian Online Learning

This post explores the usefulness of conjugate priors in the context of online learning. Specifically, we’ll consider the following

Read More
Posted on 2018-05-11

A brief primer on conjugate priors

This post attempts to introduce conjugate priors and give some intuition as to why they work. While this post focuses largely on the technical details of conjugate priors, my next post will focus on conjugate priors in practice.
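The canonical example of the idea (my own minimal sketch, not necessarily the one the post uses) is the Beta-Bernoulli pair: with a Beta prior on a coin's heads probability and Bernoulli observations, the posterior is again a Beta, and updating reduces to counting:

```python
from fractions import Fraction

def update(a, b, flips):
    """Conjugate update for a Beta(a, b) prior on a coin's heads probability:
    the posterior is Beta(a + heads, b + tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return a + heads, b + tails

a, b = 1, 1                          # Beta(1, 1) = uniform prior
a, b = update(a, b, [1, 1, 0, 1])    # observe 3 heads, 1 tail -> Beta(4, 2)
posterior_mean = Fraction(a, a + b)  # mean of Beta(a, b) is a / (a + b)
```

Because the posterior has the same functional form as the prior, no numerical integration is needed - which is exactly what makes conjugate priors attractive for the online setting in the next post.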

Read More
Posted on 2018-05-11