Using Spark as a cornerstone for an Analytic Initiative

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Set up 

The team at has an excellent API (application programming interface) that allows you to query different endpoints and extract data for analysis. 
To get started, head over to and get your own API key via email. We will be using this key in the subsequent code modules.  
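As a minimal sketch of how that key might be used, assuming a REST API with Bearer-token authentication (the base URL and endpoint names below are placeholders; substitute the values from the provider's documentation):

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical base URL -- replace with the documented one.
API_BASE = "https://api.example.com"
# Read the key from an environment variable so it never lands in notebook source.
API_KEY = os.environ.get("CFB_API_KEY", "<your-key-here>")

def build_request(endpoint: str, **params) -> urllib.request.Request:
    """Build an authenticated GET request for one endpoint."""
    url = f"{API_BASE}/{endpoint}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {API_KEY}"}
    )

def fetch(endpoint: str, **params) -> list:
    """Call the endpoint and return the parsed JSON payload."""
    with urllib.request.urlopen(build_request(endpoint, **params), timeout=30) as resp:
        return json.load(resp)
```

The same key is reused by every code module below, so wiring it up once as an environment variable keeps the notebooks clean.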

There are several endpoints that the team has published at 

As a part of this work, I want to take historical data on a ratio (Win Ratio – Total Wins to Total Games in a season) and predict this ratio for the upcoming season. We are going to be using multiple endpoints to get relevant data (feel free to add/remove data!) and see what sort of model we can build.   

Breakdown of code modules


Set up components 

  1. Note down the API key you generated from because we will be using it pretty extensively. 
  2. (Optional) Navigate to your Microsoft Fabric tenant and create a new workspace with Fabric capacity enabled. 
  3. If you already have a Fabric workspace and want to reuse it, you can create a new Lakehouse to keep all your artifacts separate from your other work, or you can use an existing Lakehouse. 
  4. Navigate to the Data Science section and start a new Notebook. 



First Code Module: Win-Loss Records 

Code can be found HERE 
The code loops through an endpoint for Win/Loss records for the previous 20 years and creates a Pandas/Spark data frame, which you can then insert into your Lakehouse as a file or as a table.  
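The year-by-year loop can be sketched like this. The `fetch` callable and the `"records"` endpoint name are assumptions (the real endpoint is in the linked code); the pattern of stacking one pandas frame per season is the point:

```python
import pandas as pd

def collect_records(fetch, start_year: int, end_year: int) -> pd.DataFrame:
    """Pull one season at a time and stack the results into a single frame.

    `fetch(endpoint, year=...)` is any callable returning a list of dicts,
    one per team per season (e.g. the API helper from the set-up section).
    """
    frames = []
    for year in range(start_year, end_year + 1):
        frames.append(pd.DataFrame(fetch("records", year=year)))
    return pd.concat(frames, ignore_index=True)

# In a Fabric notebook the result can then land in the Lakehouse, e.g.:
# records_df = collect_records(fetch, 2004, 2023)
# spark.createDataFrame(records_df).write.mode("overwrite") \
#     .saveAsTable("win_loss_records")
```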






Second Code Module: Talent Rankings 

Code can be found HERE 

An important aspect of any team is how much talent it has. Ranking agencies release a composite score for each team, and this could be a nice predictor of on-field performance. 



Third Code Module: Recruiting Rankings

Code can be found HERE 

Another important aspect of any team is how well the college is doing from a recruiting perspective. In theory, the higher the recruiting rank, the better the talent on the field, leading to superior performance. There is no objective way to score the coaching 🙂, but we will try to see if recruiting has a place in the final model.  




Fourth Code Module: Season Stats 

Code can be found HERE 

This is an interesting code block because it gives you the high-level, key statistics for a team at the end of the season.  




Fifth Code Module: Advanced Season Stats 

Code can be found HERE 

Some advanced statistics are also available, and it is worth bringing this data into the data engineering piece. 


Now that we have all the data as separate files or tables in our Lakehouse, we are ready to move on to the next phase, where we predict the Win Ratio. 


Machine Learning Model 
First Step: Feature Engineering 

We first want to create an integrated data set with properly defined columns so that we can build an ML (machine learning) model. 
Code can be found HERE 
There are some assumptions I have made, which can be easily modified: 

  • I am using 19 years' worth of data to predict the prior year, for model accuracy. This window can be changed on either side: shorten it to give more weight to recent years, or use the current year as your prediction year. College Football typically runs from September through December, so you could extend the code to run once a week, picking up fresh data and continually adjusting the Win Ratio prediction as each week's results are added to the model data. 
  • I am only keeping records where we have Talent scores. This is a big assumption and can be taken out of the model mix. 
  • There are a lot more features that can be extracted from the Season Stats and Advanced Stats data frames 
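The integration step described above can be sketched as a pair of pandas merges. All column names here (`team`, `season`, `total_wins`, `total_games`) are made up for illustration; the inner join on the talent frame is what encodes the "only keep records with Talent scores" assumption:

```python
import pandas as pd

def build_features(records: pd.DataFrame,
                   talent: pd.DataFrame,
                   recruiting: pd.DataFrame) -> pd.DataFrame:
    """Join the source frames on team/season and derive the target column."""
    df = (records
          .merge(talent, on=["team", "season"], how="inner")      # drops teams without talent scores
          .merge(recruiting, on=["team", "season"], how="left"))  # recruiting is optional
    # Target: Win Ratio = total wins / total games in the season.
    df["win_ratio"] = df["total_wins"] / df["total_games"]
    return df
```

Switching the first `how="inner"` to `"left"` is the one-line change that removes the Talent-score assumption.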

Second Step: ML Code 

I am predicting the Win Ratio with a set of algorithms (the list can be refined by adding or removing algorithms), using the features we built in the Feature Engineering code. The evaluation metric I am using is R², but again this can be changed based on your mix of algorithm(s).  
Entire code can be found HERE 
The code writes the output to your Lakehouse and also generates output in the notebook. 
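A minimal sketch of the "score several algorithms on R²" idea, using scikit-learn (the two candidate models here are placeholders for whatever mix you settle on):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score_models(X, y, models=None):
    """Fit each candidate model and report its held-out R² score."""
    if models is None:
        # Illustrative defaults -- add or remove algorithms freely.
        models = {
            "linear": LinearRegression(),
            "forest": RandomForestRegressor(n_estimators=100, random_state=0),
        }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    return {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
            for name, m in models.items()}
```

Swapping R² for another metric means changing only the `r2_score` call.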

A final, optional step would be to plot the feature importances.
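For tree-based models, that plot can come straight from the fitted model's `feature_importances_` attribute (a scikit-learn convention); a small helper like the one below, with hypothetical feature names, ranks the top features ready for plotting:

```python
import pandas as pd

def top_importances(model, feature_names, n=10) -> pd.Series:
    """Rank features by the fitted model's feature_importances_ attribute."""
    imp = pd.Series(model.feature_importances_, index=list(feature_names))
    return imp.sort_values(ascending=False).head(n)

# Usage in the notebook, assuming a fitted tree-based model:
# top_importances(fitted_forest, feature_cols).plot.barh()
```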



Power BI Dashboards 

Once you have all the data in, you can connect to the different Lakehouse tables via Power BI and build your own visualization layer.  
A sample starter workbook can be found HERE 

Use this template file to fill in your Fabric details. 




You can also adapt the code (minus the Lakehouse pieces) to run in native Azure Synapse as a Spark Notebook. 

