Building Machine Learning Recommendations: Lessons Learned
As students increasingly expect the kind of personalized recommendations and insights during their learning journey that they already get from their consumer online experiences, colleges and universities are using machine learning models to provide them. As our institution has traveled this path over the past four years, we have found that some things sounded difficult but were fairly easy, while others sounded easy but were difficult.
Things That Sound Difficult But Are Easy
Finding people to run machine learning models
Ten years ago, machine learning models had to be hand-coded by developers who understood the underlying math. Today, free Python libraries and R packages cover almost all of the commonly used modeling methods, meaning that people with just an applied statistics background (for example, from doing quantitative work in a social science discipline) rather than a math or statistics degree can use these packages effectively.
Software to run these models
In addition to the free Python libraries and R packages, some vendor products make the process even easier by cleaning up messy data and automatically trying dozens of different model types (such as logistic regression vs. decision trees vs. recurrent neural networks) with multiple sets of hyperparameters (the various settings you can modify as you run a model). Most packages are good at finding the best-performing models; what differentiates them is their ability to graphically display the results in a way that lets you compare models easily, including which features (aka variables) are most important across the various model types.
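The automated search these tools perform can be pictured as a loop over model types and hyperparameter grids. The sketch below is illustrative only, not any vendor's API: the model names, grids, and the stand-in evaluate() scorer are all hypothetical, and in practice the scorer would fit each model and return its AUC on a holdout set.

```python
from itertools import product

# Hypothetical search space: two model types, each with a small
# hyperparameter grid (real tools try dozens of types and settings).
SEARCH_SPACE = {
    "logistic_regression": {"C": [0.1, 1.0, 10.0]},
    "decision_tree": {"max_depth": [3, 5, 7]},
}

def evaluate(model_name, params):
    """Placeholder scorer: a real one would fit the model and return
    its holdout AUC. Deterministic here so the sketch runs end to end."""
    base = {"logistic_regression": 0.78, "decision_tree": 0.80}[model_name]
    return base + 0.001 * sum(params.values())

def grid_search(search_space, scorer):
    """Score every (model type, hyperparameter) combination."""
    results = []
    for model_name, grid in search_space.items():
        keys = list(grid)
        for values in product(*(grid[k] for k in keys)):
            params = dict(zip(keys, values))
            results.append({
                "model": model_name,
                "params": params,
                "auc": scorer(model_name, params),
            })
    # Best-performing configuration first, so models are easy to compare
    return sorted(results, key=lambda r: r["auc"], reverse=True)

leaderboard = grid_search(SEARCH_SPACE, evaluate)
```

The value of a good tool is less in this loop itself than in how it presents the resulting leaderboard, including feature importance, side by side.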
Identifying the best models
Spoiler alert: the answer is almost always XGBoost or Light Gradient Boosting. It’s possible that the best models for your data will be different, but we have used packages that try out many different models and hyperparameters, and the top-performing models are almost always one of those two. Just as importantly, you’ll find the model fit measures (such as AUC or R-squared) are pretty close across all modeling methods. So, what may be more important is how reasonable the selected features are and how the predictions are distributed. For example, if your model is trying to predict the probability of a student both failing an assessment and dropping out, a particular method may overemphasize one of those two measures rather than creating a balanced approach.
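One cheap way to compare how candidate models distribute their predictions is to summarize each model's predicted probabilities before selecting one. This is a minimal sketch under the assumption that each model outputs a probability per student; the sample outputs below are made-up numbers. Two models with near-identical AUC can still flag very different numbers of students.

```python
def prediction_profile(predicted_probs, threshold=0.5):
    """Summarize a model's predicted probabilities: the average
    probability and the share of students flagged above a threshold."""
    n = len(predicted_probs)
    return {
        "mean": round(sum(predicted_probs) / n, 3),
        "share_flagged": round(sum(p >= threshold for p in predicted_probs) / n, 3),
    }

# Hypothetical outputs from two models with similar fit measures:
# model_a spreads predictions out; model_b clusters near the threshold
# and flags more students, even though its AUC could be the same.
model_a = [0.2, 0.6, 0.9, 0.1]
model_b = [0.51, 0.52, 0.53, 0.49]
```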
Computing power and memory for model building
Occasionally, we’ll run into our cloud cluster’s limitations when a model is being built, but we much more often run into these cluster problems when we are running queries and joining tables, given the large size of our datasets. Plus, your IT department can create large spin-up clusters if you need temporary access to extra-large computing resources to run a particular model.
Needing to retrain models regularly
Apart from truly disruptive events like COVID-19, student behavior tends not to change dramatically month to month and certainly not day to day, so frequent and costly retrainings for your model are usually not necessary, and, in fact, can introduce risk if the retraining process fails overnight. Quarterly retrainings are usually fine.
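A quarterly cadence can be expressed as a simple policy check rather than a standing overnight job. This is a sketch under assumed inputs: a stored last-trained date and a drift flag from whatever model monitoring is in place (both names are illustrative).

```python
from datetime import date, timedelta

def needs_retraining(last_trained, today, max_age_days=90, drift_detected=False):
    """Retrain on a roughly quarterly cadence, or sooner if monitoring
    detects a disruptive shift in student behavior (e.g., COVID-19)."""
    return drift_detected or (today - last_trained) > timedelta(days=max_age_days)
```

Running the check on a schedule, and retraining only when it returns True, avoids the cost and overnight-failure risk of retraining every model every day.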
Things That Sound Easy But Are Difficult
Getting clean, reliable data from external vendors
Often, the vendors that serve higher education don’t have easy-to-use APIs for data access, partly because their customers haven’t demanded them. As a large institution, we are sometimes the first large customer to access data at such a granular level with such frequent refresh cycles. On an ongoing basis, an integration can break on the vendor’s side or on the institution’s side, so having ways to monitor daily data feed accuracy is important.
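Daily feed monitoring can start as simple sanity checks on each delivery. The sketch below assumes the feed arrives as a list of records; the field names and thresholds are illustrative, and real checks would also compare row counts against historical norms.

```python
def check_feed(rows, expected_min_rows, required_fields):
    """Return a list of problems found in today's data feed:
    unexpectedly low volume, or records with missing required fields."""
    problems = []
    if len(rows) < expected_min_rows:
        problems.append(
            f"row count {len(rows)} below expected minimum {expected_min_rows}"
        )
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i} missing fields: {missing}")
    return problems

# A feed that is both short and incomplete triggers two alerts
feed = [{"student_id": "S1", "last_login": ""}]
issues = check_feed(feed, expected_min_rows=2, required_fields=["student_id", "last_login"])
```

An empty problem list means the feed passed; anything else can page the team before downstream models consume bad data.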
Figuring out the right user interface
Even when creating user interfaces for internal WGU employees, there is a wide range of interest in and skill at consuming data, from data skeptics who don’t believe that data can tell them much to data enthusiasts who consider themselves best able to digest and interpret data (rather than having a model tell them what to do), and everything in between. As a result, you’ll want to let users find just the amount of data they want. Some helpful techniques for layering the data to meet this objective include: (1) putting color-coded summary statistics or recommendations at the top of the page and detail at the bottom, (2) color-coding the detail to highlight key values, (3) putting even more detailed data into the hover tooltip, (4) creating numerous filters for the data, and (5) putting the most detailed data in separate tabs. Also important, when possible, is building recommendations and predictions into the tools that are part of the users’ primary workflow rather than into separate dashboards off to the side.
Faculty change management
If you are highlighting new information for faculty to use, you must have a clear plan for how it affects what faculty do or what they can be held accountable for. For example, a new dashboard that highlights which of an instructor’s students are ready to attempt the assessment but haven’t done so yet must be balanced against other alerts or metrics already in use. The other key factor is addressing faculty skepticism all along the spectrum of data interest and literacy: convincing data skeptics that data can indeed complement their own judgment, and convincing data enthusiasts that your approach is sound.
Setting up experiments for efficacy testing
You can test the accuracy of your model predictions by seeing how they perform on a holdout set of data (students who were not included in the training data for the model), but seeing whether interventions based on those predictions are effective requires a real-world experiment. Even if the model is accurate, it’s possible that students will simply ignore our advice and we won’t see pass rates or retention rates improve, which is our ultimate goal. But if that happens, it doesn’t mean the model is wrong; it just means we haven’t found the right intervention yet. Setting up an experiment structured enough to provide a real test of the treatment to better serve tomorrow’s students, while giving today’s students the flexibility they need, can be a difficult balancing act.
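The holdout half of this testing is mechanical: score students the model never saw, then compare predicted probabilities against actual outcomes. Below is a minimal sketch of computing AUC directly, with made-up labels and scores; AUC is the probability that a randomly chosen positive case outranks a randomly chosen negative one, with ties counting half.

```python
def auc(labels, scores):
    """Compute AUC from binary labels (1 = event occurred) and
    predicted scores, by counting pairwise rankings."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical holdout set: 1 = student dropped out,
# scores are the model's predicted probabilities
holdout_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The harder half, the intervention experiment, has no such shortcut: it requires a treatment group, a comparison group, and patience.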
The relatively low cost of computing power these days and the availability of machine learning tools mean we are better able than ever to pull valuable insights from our data that benefit students and faculty. If we can get the right data, the right user interface, and the right operational rollout and training, with the proper proof of efficacy, we can truly use our data to improve student outcomes.