Data Science How-To: Using Apache Spark for Sports Analytics


Apache Spark has become a common tool in the data scientist’s toolbox, and in this post we show how to use the recently released Spark 2.1 for data analysis using data from the National Basketball Association (NBA). All code and examples from this blog post are available on GitHub.

Analytics have become a major tool in the sports world, and in the NBA in particular analytics have shaped how the sport is played. The league has skewed towards taking more 3-point shots due to their high efficiency as measured by points per field goal attempt. In this post we evaluate and analyze this trend in the NBA using season statistics data going back to 1979 along with geospatial shot chart data.

The concepts in this post — data cleansing, visualization, and modeling in Spark — are general data science concepts and are applicable to tasks beyond analyzing sports data. The post concludes with the author’s general impressions about using Spark and with tips and suggestions for new users.

For the analyses, we use Python 3 with the Spark Python API (PySpark) to create and analyze Spark DataFrames. In addition, we utilize both the Spark DataFrame’s domain-specific language (DSL) and Spark SQL to cleanse and visualize the season data, finally building a simple linear regression model using spark.ml — now Spark’s primary machine learning API.

Finally, we note that the analysis in this tutorial can be run with a distributed Spark setup running on a cloud service such as Amazon Web Services (AWS) or on a Spark instance running on a local machine. We have tested both and have included resources for getting started on either AWS or a local machine at the end of this post.

The Code

Using data from Basketball Reference, we read in the season total stats for every player since the 1979-80 season into a Spark DataFrame using PySpark. DataFrames are designed to ease processing large amounts of structured tabular data on the Spark infrastructure and are now in fact just a type alias for a Dataset of Row.

['_c0', 'player', 'pos', 'age', 'team_id', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 'fg2', 'fg2a', 'fg2_pct', 'efg_pct', 'ft', 'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'yr']

Using our DataFrame, we can view the top 10 players, sorted by number of points in an individual season. Notice we use the toPandas function to retrieve our results. The corresponding result looks cleaner for display than when using the take function.

Next, using the DataFrame domain specific language (DSL), we can analyze the average number of 3-point attempts for each season, scaled to the industry standard per 36 minutes (fg3a_p36m). The per 36 minutes metric provides an estimate of a given player’s stats projected to 36 minutes, an interval corresponding to an approximate full NBA game with adequate rest, while also allowing comparison across players that play different numbers of minutes.

Now that we have aggregated our data and computed the average attempts per 36 minutes for each season, we can query our results into a Pandas DataFrame and plot it using matplotlib.

We can see a steady rise in the number of 3-point attempts since the shot’s introduction in the 1979-80 season, along with a blip in the number of attempts during the period in the mid-1990s when the NBA moved the 3-point line in a few feet.

We can fit a linear regression model to this curve to model the number of shot attempts for the next 5 years. Of course, this assumes the number of attempts grows linearly, which is likely a naive assumption.

First, we must transform our data using the VectorAssembler function into a single column where each row of the DataFrame contains a feature vector. This is a requirement for the linear regression API in MLlib. We first build the transformer using our single variable `yr` and transform our season total data using the transformer function.

Next, we want to apply our trained model object to our original training set along with 5 years of future data. Covering this time period, we build a new DataFrame, transform it to include a feature vector, and then apply our model to make a prediction.

In addition to season total data, we process and analyze NBA shot charts to view the impact the 3-point revolution has had on shot selection. The shot chart data was originally sourced from

The shot chart data contains xy coordinates of field goal attempts on the court for individual players, game date, time of shot, shot distance, a shot made flag, and other fields. We have compiled all individual seasons where a player attempted at least 1,000 field goal attempts from the 2010-11 through the 2015-16 season.

We can query an individual player and season and visualize their shot locations. We built a plotting function plot_shot_chart (see the GitHub repo) that is based on Savvas Tjortjoglou’s example.

As an example, we query and visualize Steph Curry’s 2015-16 historic shooting season using a hexbin plot, which is a two-dimensional histogram with hexagonal-shaped bins.
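A hexbin chart can be produced directly with matplotlib; here synthetic coordinates stand in for Curry’s actual shot locations, and the court outline drawn by `plot_shot_chart` is omitted:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic shot coordinates (tenths of feet, hoop near the origin),
# standing in for a player's real chart pulled to pandas via toPandas().
x = rng.normal(0, 80, 2000)
y = rng.normal(100, 60, 2000)

fig, ax = plt.subplots(figsize=(6, 5.5))
# hexbin is a two-dimensional histogram with hexagonal bins.
hb = ax.hexbin(x, y, gridsize=30, cmap="Blues")
fig.colorbar(hb, label="shot attempts")
ax.set_title("Hexbin shot chart (synthetic data)")
fig.savefig("shot_chart.png")
```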

The shot chart data is rich in information, but it does not indicate whether a shot is a 3-point attempt, let alone a corner 3. We address this by building user-defined functions (UDFs) that identify the shot type given the xy coordinates of the attempt.

We can visualize the change in shot selection over the past 6 years using all of our data from the 2010-11 season up until the 2015-16 season. For visualization purposes, we exclude all shot attempts taken inside of 8 feet, as we would like to focus on the midrange and 3-point shots.

Among the top scorers in the league, corner 3-point attempts are among the most efficient shots in the league, on par with shots taken close to the basket. It’s no wonder that accurate 3-point shooting is among the most coveted talents in the NBA today!

Conclusion

Lastly, as a seasoned data scientist, SQL user, and Python junkie, here are my two cents on getting started with Spark. The Spark ecosystem and documentation are continually evolving, and it is important to use the newest Spark version. A first-time user will notice there are multiple ways to solve a problem using different languages (Scala, Java, Python, R), different APIs (Resilient Distributed Dataset (RDD), Dataset, DataFrame), and different data manipulation routines (DataFrame DSL, Spark SQL). Many choices are up to the user, and others are guided by the documentation. Since Spark 2.0, for example, the DataFrame is the primary Spark API for Python and R users (rather than the original and still useful RDD). In addition, the DataFrame-based spark.ml package is now the primary machine learning API in Spark, replacing the RDD-based API. Bottom line: the platform is evolving and it pays to stay up to date.

In this post, we’ve demonstrated how to use Apache Spark to accomplish key data science tasks including data exploration, visualization, and model building. These principles are applicable to other data science tasks and datasets, and we encourage you to check out the repository and try it on your own!

Additional Resources

Chris Rawles is a senior data scientist at Pivotal in New York, New York, where he works with customers across a variety of domains, building models to derive insight and business value from their data. He holds an MS and BA in geophysics from UW-Madison and UC Berkeley, respectively. During his time as a researcher, Chris focused his efforts on using machine learning to enable research in seismology.

More Content by Chris Rawles