Code Interpreter & Data Analysis
In this post I’ll go over some observations from using OpenAI’s Code Interpreter for the first time. It is not available as an API; rather, it is only accessible through the ChatGPT web interface for ChatGPT Plus subscribers ($20/month).
If you aren’t familiar with Code Interpreter, it is:
- ChatGPT hooked up to a Python interpreter
- You can upload files (limited to 100 MB, though you can zip them) and download files
Really the only visible change is a “+” button in the ChatGPT interface, but this small addition unlocks quite a few use cases. One of them is data analysis, since you can upload data and have the LLM analyze and reason over it.
Overall I am quite impressed with Code Interpreter’s capabilities. I would characterize Code Interpreter as a very capable intern whose output you need to validate. That said, it is a very capable agent for data analysis tasks. I would estimate these analyses took me 20 minutes. If I had done them myself, I’d estimate they would’ve taken me 2 hours, so 6x longer. But it’s not just a matter of time savings; it is a matter of cognitive-load savings. Using Code Interpreter was not very cognitively intense, whereas doing these analyses myself would’ve taken a lot of mental energy. Because of that, I’d held off on running this type of analysis for a long time (I’d been sitting on the idea for a year?), but with Code Interpreter I was able to do it in 20 minutes. And this opens up many other analyses that I would love to do but have not had the time or energy for.
Code Interpreter
OpenAI released Code Interpreter to the public on July 7, 2023. Code Interpreter is GPT-4 hooked up to a Python interpreter with a bunch of different libraries. You can upload any file: CSV files, zipped git repositories, you name it. Between the files you upload and GPT-4, this unlocks quite a few use cases. If you haven’t seen any before, Ethan Mollick has a bunch of examples here on his Substack.
Someone posted the requirements.txt file for Code Interpreter here. It has common libraries like:
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- tensorflow
- pytorch
- nltk
- spacy
- gensim
There’s been a lot of chatter on whether Code Interpreter and models like it will take over data science and data analyst jobs. I don’t think that will necessarily happen so soon. I think expert data analysts and data scientists with a tool like Code Interpreter can be much more productive. I haven’t used Code Interpreter for very long, but right now I see it as a very obedient and capable intern with a surprising amount of knowledge. You wouldn’t productionize code that an intern wrote without a thorough review, nor would you share the results of a data science intern’s analysis or model without one. The same applies to Code Interpreter.
But let me share the results of my first experiment with Code Interpreter and its ability to do data analysis.
NBA Data Analysis (Part 1) - Anything interesting?
I’m a fan of the NBA. So I started with a dataset of player statistics by year for every NBA player, where every row is a player-year. I downloaded this 5.5 MB file and uploaded it to Code Interpreter.
The code Code Interpreter runs in its own Python sandbox is hidden by default, but you can click “show work” and there’s runnable code! This could be a great training tool for junior data analysts and data scientists. One somewhat surprising thing: this is just a CSV file with no metadata, but Code Interpreter was able to infer not just the column data types (which pandas does already) but the meaning of each column (e.g. “3P%” means 3-point percentage). It’s probably based on the column names and the fact that I prompted it that it was an NBA-related dataset.
Based on “seeing” the first 5 and last 5 rows of the dataset, the agent actually generated 5 interesting questions:
- The distribution of players’ ages.
- The average performance statistics (like points, assists, rebounds) by player position.
- Trends in performance over time for specific players.
- Comparison of teams based on average player statistics.
- The relationship between different performance metrics (like points and assists).
So naturally, I asked it to run all 5 of them.
1. Distribution of players’ ages
The first use case was simple and the agent was able to do this correctly.
2. The average performance statistics (like points, assists, rebounds) by player position.
The second use case was more interesting in that it ran into an error producing this analysis but was able to debug itself. It realized the column names were wrong and found the correct ones!
After fixing the issue and finding that the column is P/36 and not PTS, it was able to produce the correct plot.
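Reconstructed from the appendix code below, the failing line and the fix looked roughly like this:

# First attempt: Basketball-Reference-style column names that don't exist in this dataset
position_stats = df[['pos', 'PTS', 'AST', 'TRB']]  # raises a KeyError

# After printing df.columns, it switched to the per-36 columns this dataset actually has
position_stats = df[['pos', 'P/36', 'A/36', 'R/36']]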
3. Trends in performance over time for specific players.
The third use case is more complicated but Code Interpreter is able to produce this plot without any issues. You can see the code is more complex and not that trivial to write.
# 3. Trends in performance over time for specific players
# For this plot, let's pick a few popular players to analyze
players_to_analyze = ['LeBron James', 'Kevin Durant', 'Stephen Curry']

# Filter the dataframe for the selected players
player_df = df[df['name_common'].isin(players_to_analyze)]

# Create a list of statistics to analyze
stats_to_analyze = ['P/36', 'A/36', 'R/36']

# Initialize the plot
fig, axes = plt.subplots(len(players_to_analyze), len(stats_to_analyze), figsize=(18, 12))

# Create a subplot for each player and each statistic
for i, player in enumerate(players_to_analyze):
    for j, stat in enumerate(stats_to_analyze):
        player_stat = player_df[player_df['name_common'] == player][['year_id', stat]].dropna()
        axes[i, j].plot(player_stat['year_id'], player_stat[stat])
        axes[i, j].set_title(f'{player} - {stat}')

# Add labels and title
fig.text(0.5, 0.04, 'Year', ha='center', va='center')
fig.text(0.08, 0.5, 'Statistic Value', ha='center', va='center', rotation='vertical')
plt.suptitle('Trends in Performance Over Time for Specific Players (per 36 minutes)')
plt.tight_layout(rect=[0.03, 0.03, 1, 0.95])
plt.show()
4. Comparison of teams based on average player statistics.
This plot is not very good: it’s a side-by-side bar chart, which makes it difficult to compare teams.
5. The relationship between different performance metrics (like points and assists).
This is also not a very good plot, but it’s a valiant attempt.
Explaining itself
I’m surprised that Code Interpreter is able to explain itself. I think it does this by “looking” at and examining the plots, because those are the only thing the interpreter is generating; the intermediate datasets are created but never printed out. I’m not completely sure how it’s doing this, but it’s pretty amazing.
Improving number 4 - team statistics
After a couple of rounds of modification, I was able to get Code Interpreter to produce a sorted bar chart of team statistics that was easier to interpret. I found it surprising that some teams, like the New York Knicks, have such low assists per 36 minutes. It seems like assists are more variable than points and rebounds (which I didn’t know before!).
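The full code is in the appendix below; the key change was simply sorting each team’s average before plotting. A minimal sketch for one statistic, assuming the team_stats_avg dataframe from the appendix code:

import seaborn as sns

# Sort teams by average points per 36 so the bars read high-to-low
sorted_pts = team_stats_avg[['team_id', 'P/36']].sort_values(by='P/36', ascending=False)
sns.barplot(x='team_id', y='P/36', data=sorted_pts, color='b')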
NBA Data Analysis (Part 2) - Joining Data
I found another dataset that has every player’s draft year, round, and number. I uploaded it to Code Interpreter. The LLM was smart enough to name the dataframe df_new so it didn’t overwrite df from the previous analysis.
With this data, I was interested in seeing if Code Interpreter could join the two datasets and do an analysis that I’ve always been interested in: are some teams better at drafting (and developing) players? One way to estimate this is to look at the average cumulative WAR (wins above replacement) for players drafted by each team.
Joining datasets
I was surprised the LLM was able to join the two datasets, because it was not a trivial join where the column names from both tables were the same. You can see it had to extract season from year_id and then use a compound key to join the two tables.
# Merge the two datasets on player name and season/year
# First, we need to create a 'season' column in the first dataset to match with the second dataset
# Extract the season year from 'year_id'
df['season'] = (df['year_id'] - 1).astype(str) + '-' + df['year_id'].astype(str).str[-2:]

# Merge the two datasets
df_merged = pd.merge(df_new, df[['name_common', 'season', 'Raptor WAR']],
                     left_on=['player_name', 'season'],
                     right_on=['name_common', 'season'],
                     how='left')

# Display the first few rows of the merged dataframe
df_merged.head()
Average Cumulative WAR by draft position
The LLM produces a fascinating plot of the average cumulative WAR by draft position. You can see the #1 draft pick has a very high average cumulative WAR. There is a strange outlier at the #57 draft pick. I probably could’ve asked Code Interpreter what was going on, but I didn’t.
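If I had dug in, a quick check would have been to pull the rows behind that bar. A sketch, assuming the first_second_round_picks dataframe from the appendix code:

# Which player(s) are inflating the #57 pick's average cumulative WAR?
pick_57 = first_second_round_picks[first_second_round_picks['draft_number'] == 57]
pick_57[['player_name', 'season', 'cumulative_war']].sort_values(
    'cumulative_war', ascending=False).head()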
Cumulative WAR
The LLM is able to produce the following plot, which is really interesting! It shows the cumulative WAR for each team’s draft picks. In hindsight, I should have dug a little deeper here: some teams may have more (or higher) draft picks than others, so I should’ve applied some normalization, but I didn’t.
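The normalization I had in mind is simple: divide each team’s total WAR by the number of players it drafted. A sketch, again assuming first_second_round_picks from the appendix code:

# Per-pick average WAR, so teams with more draft picks aren't favored
g = first_second_round_picks.groupby('team_abbreviation')
war_per_pick = g['Raptor WAR'].sum() / g['player_name'].nunique()
war_per_pick.sort_values(ascending=False).head()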
NBA Data Analysis (Part 3) - Embeddings
I had another idea: look at embeddings of players to see if they could surface similar players. So I asked Code Interpreter to generate embeddings for each player based on basketball statistics and physical attributes (weight, height).
Keep in mind these are very simple poor-man’s embeddings (PCA with 2 dimensions). If you really wanted to generate real embeddings, you could use OpenAI or sentence-transformers to generate larger, more meaningful embeddings, but Code Interpreter does not have network access to install new libraries.
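For illustration, here’s roughly what that would look like outside the sandbox with sentence-transformers (a hypothetical sketch; the model name and the text template are my own choices, not anything Code Interpreter produced):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed model choice

# Describe each player-season as text, then embed the descriptions
descriptions = [
    f"{row.player_name}: {row.pts} pts, {row.ast} ast, {row.reb} reb"
    for row in df_merged.itertuples()
]
embeddings = model.encode(descriptions)  # shape: (n_rows, embedding_dim)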
In terms of the most similar players to LeBron James, this passes my basketball smell test in that it returns Kobe Bryant, Scottie Pippen, Vince Carter, Draymond Green, and Jimmy Butler. This is a key point: it’s important that users have some expertise to be able to interpret and validate the results. I, having data literacy, Python skills, and basketball knowledge, am able to do this.
NBA Data Analysis (Part 4) - Variance of WAR
Another question came up as I was interacting with Code Interpreter: identifying players who had high-variance WAR. I imagine players with more consistent WAR are more valuable and should command higher salaries.
These are players that had dramatic fluctuations in WAR year over year. I imagine this is probably due to injuries.
I was also interested in a plot that visualized total WAR vs. variance in WAR for each player. This would answer questions like: which players had really good careers (high total WAR) and were also consistent (low variance in WAR)? It was able to do this, but the plot was not very good because there was too much noise.
I asked the LLM to trim the axes and label the points with names. This is common in data analysis tasks: there is an interactive back and forth of seeing the output and iterating to produce an analysis or visualization that makes sense.
Conclusion
Any data scientist who sees this for the first time will probably have a small existential crisis, wondering: will my job be automated away too?! I think there will still be a need for data scientists to validate the output of LLMs, because what product owner would blindly ship a model that was trained by an LLM agent? But I think this is a good example of how LLMs can be used to augment data scientists’ work. This takes GitHub Copilot’s AI assistance to another level. I see there being a few levels of AI assistance:
- Level 0: No AI assistance
- Level 1: AI assistance that helps you write code [GitHub Copilot]
- Level 2: An AI agent that helps you write code and can reason and debug itself, e.g. gpt-3.5-turbo + langchain from a previous post
- Level 3: An even more intelligent AI agent that can delight you [OpenAI’s Code Interpreter]
- Level 4: An AI agent that you would trust to run full data analyses, one that can query its own data and train its own models [doesn’t exist yet]
For experienced data analysts and data scientists, I can see Code Interpreter being a valuable tool to save time and reduce cognitive load. I’ll probably use it to explore new datasets, as it’s basically pandas-profiling on steroids.
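For comparison, the pandas-profiling workflow is only a couple of lines (a sketch; the library has since been renamed ydata-profiling):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('nba-data-historical.csv')
profile = ProfileReport(df, title='NBA Historical Data')
profile.to_file('nba_profile.html')  # one self-contained EDA report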
Code from Code Interpreter
Below is all of the code generated by Code Interpreter for this NBA data analysis.
NBA Data Analysis (Part 1)
Code
import pandas as pd

# Load the dataset
df = pd.read_csv('/mnt/data/nba-data-historical.csv')

# Display the first few rows of the dataframe
df.head()

# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set the style of all graphs to seaborn default
sns.set()

# 1. The distribution of players' ages
plt.figure(figsize=(8, 6))
sns.histplot(df['age'], bins=20, kde=True)
plt.title('Distribution of Players\' Ages')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

# 2. The average performance statistics (like points, assists, rebounds) by player position
# NOTE this errors
# Filter out the necessary columns
position_stats = df[['pos', 'PTS', 'AST', 'TRB']]

# Group by player position and calculate the mean
position_stats_avg = position_stats.groupby('pos').mean()

# Reset the index
position_stats_avg = position_stats_avg.reset_index()

# Melt the dataframe for easier plotting
position_stats_melted = pd.melt(position_stats_avg, id_vars='pos', value_vars=['PTS', 'AST', 'TRB'])

# Create the plot
plt.figure(figsize=(10, 6))
sns.barplot(x='pos', y='value', hue='variable', data=position_stats_melted)
plt.title('Average Performance Statistics by Player Position')
plt.xlabel('Position')
plt.ylabel('Average Statistic Value')
plt.show()

# Display the column names
df.columns
# RESULT
# Index(['player_id', 'name_common', 'year_id', 'type', 'age', 'team_id', 'pos',
#        'tmRtg', 'franch_id', 'G', 'Min', 'MP%', 'MPG', 'P/36', 'TS%', 'A/36',
#        'R/36', 'SB/36', 'TO/36', 'Raptor O', 'Raptor D', 'Raptor+/-',
#        'Raptor WAR', 'PIE%', 'AWS%', 'USG%', 'AST%', 'TOV%', 'ORB%', 'DRB%',
#        'TRB%', 'STL%', 'BLK%', 'ORtg', '%Pos', 'DRtg', '2P%', '3P%', 'FT%',
#        '3PAr', 'FTAr', 'Pace +/-'],
#       dtype='object')

# 2. The average performance statistics (like points, assists, rebounds) by player position
# Filter out the necessary columns
position_stats = df[['pos', 'P/36', 'A/36', 'R/36']]

# Group by player position and calculate the mean
position_stats_avg = position_stats.groupby('pos').mean()

# Reset the index
position_stats_avg = position_stats_avg.reset_index()

# Melt the dataframe for easier plotting
position_stats_melted = pd.melt(position_stats_avg, id_vars='pos', value_vars=['P/36', 'A/36', 'R/36'])

# Create the plot
plt.figure(figsize=(10, 6))
sns.barplot(x='pos', y='value', hue='variable', data=position_stats_melted)
plt.title('Average Performance Statistics by Player Position (per 36 minutes)')
plt.xlabel('Position')
plt.ylabel('Average Statistic Value')
plt.show()
# 3. Trends in performance over time for specific players
# For this plot, let's pick a few popular players to analyze
players_to_analyze = ['LeBron James', 'Kevin Durant', 'Stephen Curry']

# Filter the dataframe for the selected players
player_df = df[df['name_common'].isin(players_to_analyze)]

# Create a list of statistics to analyze
stats_to_analyze = ['P/36', 'A/36', 'R/36']

# Initialize the plot
fig, axes = plt.subplots(len(players_to_analyze), len(stats_to_analyze), figsize=(18, 12))

# Create a subplot for each player and each statistic
for i, player in enumerate(players_to_analyze):
    for j, stat in enumerate(stats_to_analyze):
        player_stat = player_df[player_df['name_common'] == player][['year_id', stat]].dropna()
        axes[i, j].plot(player_stat['year_id'], player_stat[stat])
        axes[i, j].set_title(f'{player} - {stat}')

# Add labels and title
fig.text(0.5, 0.04, 'Year', ha='center', va='center')
fig.text(0.08, 0.5, 'Statistic Value', ha='center', va='center', rotation='vertical')
plt.suptitle('Trends in Performance Over Time for Specific Players (per 36 minutes)')
plt.tight_layout(rect=[0.03, 0.03, 1, 0.95])
plt.show()
# 4. Comparison of teams based on average player statistics
# Filter out the necessary columns
team_stats = df[['team_id', 'P/36', 'A/36', 'R/36']]

# Group by team and calculate the mean
team_stats_avg = team_stats.groupby('team_id').mean()

# Reset the index
team_stats_avg = team_stats_avg.reset_index()

# Melt the dataframe for easier plotting
team_stats_melted = pd.melt(team_stats_avg, id_vars='team_id', value_vars=['P/36', 'A/36', 'R/36'])

# Create the plot
plt.figure(figsize=(20, 10))
sns.barplot(x='team_id', y='value', hue='variable', data=team_stats_melted)
plt.title('Comparison of Teams Based on Average Player Statistics (per 36 minutes)')
plt.xlabel('Team')
plt.ylabel('Average Statistic Value')
plt.xticks(rotation=90)
plt.show()

# 5. The relationship between different performance metrics (like points and assists)
# Filter out the necessary columns
performance_stats = df[['P/36', 'A/36', 'R/36']]

# Create a pairplot to visualize the relationships
sns.pairplot(performance_stats)
plt.title('Relationship Between Different Performance Metrics (per 36 minutes)')
plt.show()
# Create separate sorted plots for Points, Assists and Rebounds
# (stats and titles were not defined in the captured code; defined here so the block runs)
stats = ['P/36', 'A/36', 'R/36']
titles = ['Points', 'Assists', 'Rebounds']

# Initialize the figure
fig, axes = plt.subplots(3, 1, figsize=(20, 18))

# Create a bar plot for each statistic
for i, stat in enumerate(stats):
    # Filter the data for the current statistic
    team_stat = team_stats_avg[['team_id', stat]]

    # Sort the values
    team_stat = team_stat.sort_values(by=stat, ascending=False)

    # Create the bar plot
    sns.barplot(x='team_id', y=stat, data=team_stat, ax=axes[i], color='b')
    axes[i].set_title(f'Comparison of Teams Based on Average {titles[i]} per 36 minutes')
    axes[i].set_xlabel('Team')
    axes[i].set_ylabel('Average ' + titles[i])
    axes[i].tick_params(axis='x', rotation=90)

# Adjust the layout
plt.tight_layout()
plt.show()
NBA Data Analysis (Part 2)
Code
# Load the new dataset
df_new = pd.read_csv('/mnt/data/all_seasons.csv')

# Display the first few rows of the dataframe
df_new.head()
#    Unnamed: 0        player_name team_abbreviation   age  player_height  \
# 0           0      Dennis Rodman               CHI  36.0         198.12
# 1           1  Dwayne Schintzius               LAC  28.0         215.90
# 2           2       Earl Cureton               TOR  39.0         205.74
# 3           3        Ed O'Bannon               DAL  24.0         203.20
# 4           4        Ed Pinckney               MIA  34.0         205.74
#
#    player_weight                      college country draft_year draft_round  \
# 0      99.790240  Southeastern Oklahoma State     USA       1986           2
# 1     117.933920                      Florida     USA       1990           1
# 2      95.254320                Detroit Mercy     USA       1979           3
# 3     100.697424                         UCLA     USA       1995           1
# 4     108.862080                    Villanova     USA       1985           1
#
#    ...  pts   reb  ast  net_rating  oreb_pct  dreb_pct  usg_pct  ts_pct  \
# 0  ...  5.7  16.1  3.1        16.1     0.186     0.323    0.100   0.479
# 1  ...  2.3   1.5  0.3        12.3     0.078     0.151    0.175   0.430
# 2  ...  0.8   1.0  0.4        -2.1     0.105     0.102    0.103   0.376
# 3  ...  3.7   2.3  0.6        -8.7     0.060     0.149    0.167   0.399
# 4  ...  2.4   2.4  0.2       -11.2     0.109     0.179    0.127   0.611
#
#    ast_pct   season
# 0    0.113  1996-97
# 1    0.048  1996-97
# 2    0.148  1996-97
# 3    0.077  1996-97
# 4    0.040  1996-97
#
# [5 rows x 22 columns]

# Merge the two datasets on player name and season/year
# First, we need to create a 'season' column in the first dataset to match with the second dataset
# Extract the season year from 'year_id'
df['season'] = (df['year_id'] - 1).astype(str) + '-' + df['year_id'].astype(str).str[-2:]

# Merge the two datasets
df_merged = pd.merge(df_new, df[['name_common', 'season', 'Raptor WAR']],
                     left_on=['player_name', 'season'],
                     right_on=['name_common', 'season'],
                     how='left')

# Display the first few rows of the merged dataframe
df_merged.head()
# Calculate the cumulative WAR for each player by year
df_merged['cumulative_war'] = df_merged.groupby('player_name')['Raptor WAR'].cumsum()

# Display the first few rows of the updated dataframe
df_merged.head()

# Calculate the average expected WAR for each draft position for the first and second round picks
# Filter the dataframe for the first and second round picks
first_second_round_picks = df_merged[df_merged['draft_round'].isin(['1', '2'])]

# Convert the draft_number to numeric
first_second_round_picks['draft_number'] = pd.to_numeric(first_second_round_picks['draft_number'], errors='coerce')

# Group by draft_number and calculate the mean cumulative_war
average_war_by_draft_position = first_second_round_picks.groupby('draft_number')['cumulative_war'].mean().reset_index()

# Display the result
average_war_by_draft_position.head()

# Create a bar plot to visualize the average expected WAR by draft position
plt.figure(figsize=(15, 8))
sns.barplot(x='draft_number', y='cumulative_war', data=average_war_by_draft_position)
plt.title('Average Expected WAR by Draft Position for First and Second Round Picks')
plt.xlabel('Draft Position')
plt.ylabel('Average Cumulative WAR')
plt.show()

# Analyze whether certain teams are consistently better at drafting players
# Group by team and calculate the mean cumulative_war
average_war_by_team = first_second_round_picks.groupby('team_abbreviation')['cumulative_war'].mean().reset_index()

# Sort the result by average cumulative_war
average_war_by_team = average_war_by_team.sort_values(by='cumulative_war', ascending=False)

# Display the result
average_war_by_team.head()

# Create a bar plot to visualize the average WAR by team
plt.figure(figsize=(15, 8))
sns.barplot(x='team_abbreviation', y='cumulative_war', data=average_war_by_team, color='b')
plt.title('Average WAR by Team for First and Second Round Picks')
plt.xlabel('Team')
plt.ylabel('Average Cumulative WAR')
plt.xticks(rotation=90)
plt.show()
NBA Data Analysis (Part 3)
Code
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Specify the features to use
features = ['pts', 'ast', 'reb', 'usg_pct', 'ts_pct', 'ast_pct', 'player_height', 'player_weight']

# Prepare the data
data = df_merged[features]

# Create a pipeline for preprocessing
# The pipeline includes filling missing values with the mean and standardizing the data
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Preprocess the data
data_preprocessed = pipeline.fit_transform(data)

# Create a PCA object
pca = PCA(n_components=2)

# Fit the PCA model to the data and transform the data
embeddings = pca.fit_transform(data_preprocessed)

# Add the embeddings to the dataframe
df_merged['embedding1'] = embeddings[:, 0]
df_merged['embedding2'] = embeddings[:, 1]

# Display the first few rows of the updated dataframe
df_merged.head()
from sklearn.neighbors import NearestNeighbors

# Fit a NearestNeighbors model to the embeddings
nn = NearestNeighbors(n_neighbors=6)  # 5 nearest neighbors + the player himself
nn.fit(df_merged[['embedding1', 'embedding2']])

# Find the embeddings for Michael Jordan
mj_embeddings = df_merged[df_merged['player_name'] == 'Michael Jordan'][['embedding1', 'embedding2']]

# Use the NearestNeighbors model to find the players with the most similar embeddings
distances, indices = nn.kneighbors(mj_embeddings)

# Get the names of the most similar players
similar_players = df_merged.iloc[indices[0]]['player_name'].tolist()

# Remove 'Michael Jordan' from the list
similar_players.remove('Michael Jordan')

# Display the similar players
similar_players

# Update NearestNeighbors model to find the top 10 nearest neighbors
nn = NearestNeighbors(n_neighbors=11)  # 10 nearest neighbors + the player himself
nn.fit(df_merged[['embedding1', 'embedding2']])

# Find the embeddings for LeBron James
lbj_embeddings = df_merged[df_merged['player_name'] == 'LeBron James'][['embedding1', 'embedding2']]

# Use the NearestNeighbors model to find the players with the most similar embeddings
distances, indices = nn.kneighbors(lbj_embeddings)

# Get the names of the most similar players
similar_players = df_merged.iloc[indices[0]]['player_name'].tolist()

# Remove 'LeBron James' from the list
similar_players.remove('LeBron James')

# Display the similar players
similar_players

# Plot the embeddings for LeBron James and the top 10 similar players
# Create a dataframe for the similar players
similar_players_df = df_merged[df_merged['player_name'].isin(similar_players)]

# Plot the embeddings for all players in light grey
plt.figure(figsize=(10, 8))
plt.scatter(df_merged['embedding1'], df_merged['embedding2'], color='lightgrey', label=None)

# Plot the embeddings for the similar players in blue
plt.scatter(similar_players_df['embedding1'], similar_players_df['embedding2'], color='blue', label=None)

# Add labels for the similar players
for _, row in similar_players_df.iterrows():
    plt.text(row['embedding1'], row['embedding2'], row['player_name'], fontsize=8)

# Highlight LeBron James in the plot
lbj_row = df_merged[df_merged['player_name'] == 'LeBron James'].iloc[0]
plt.scatter(lbj_row['embedding1'], lbj_row['embedding2'], color='red')
plt.text(lbj_row['embedding1'], lbj_row['embedding2'], 'LeBron James', fontsize=8, weight='bold')

# Set the title and labels
plt.title('Embeddings for LeBron James and Top 10 Similar Players')
plt.xlabel('Embedding 1')
plt.ylabel('Embedding 2')
plt.show()
NBA Data Analysis (Part 4)
Code
# Calculate the year-over-year change in Raptor WAR for each player
df_merged['Raptor WAR Change'] = df_merged.groupby('player_name')['Raptor WAR'].diff()

# Get the absolute value of the change
df_merged['Raptor WAR Change Absolute'] = df_merged['Raptor WAR Change'].abs()

# Find the players with the greatest absolute year-over-year change in Raptor WAR
dramatic_change_players = df_merged.nlargest(10, 'Raptor WAR Change Absolute')

# Display the result
dramatic_change_players[['player_name', 'season', 'Raptor WAR', 'Raptor WAR Change']]

# Calculate the total and variance of Raptor WAR for each player
war_stats = df_merged.groupby('player_name')['Raptor WAR'].agg(['sum', 'var']).reset_index()

# Rename the columns
war_stats.columns = ['player_name', 'Total WAR', 'Variance of WAR']

# Display the players with the highest and lowest variance of WAR
highest_variance_player = war_stats.loc[war_stats['Variance of WAR'].idxmax()]
lowest_variance_player = war_stats.loc[war_stats['Variance of WAR'].idxmin()]

highest_variance_player, lowest_variance_player

# Create a scatter plot of Total WAR vs. Variance of WAR
plt.figure(figsize=(10, 8))
sns.scatterplot(x='Total WAR', y='Variance of WAR', data=war_stats)
plt.title('Total WAR vs. Variance of WAR')
plt.xlabel('Total WAR')
plt.ylabel('Variance of WAR')

# Highlight the players with the highest and lowest variance of WAR
plt.scatter(highest_variance_player['Total WAR'], highest_variance_player['Variance of WAR'], color='red')
plt.text(highest_variance_player['Total WAR'], highest_variance_player['Variance of WAR'], 'Chris Paul', fontsize=8, ha='right')
plt.scatter(lowest_variance_player['Total WAR'], lowest_variance_player['Variance of WAR'], color='red')
plt.text(lowest_variance_player['Total WAR'], lowest_variance_player['Variance of WAR'], 'Ike Anigbogu', fontsize=8, ha='right')
plt.show()

# Filter players with over 75 total WAR
war_stats_filtered = war_stats[war_stats['Total WAR'] > 75]

# Create a scatter plot of Total WAR vs. Variance of WAR for these players
plt.figure(figsize=(15, 10))
sns.scatterplot(x='Total WAR', y='Variance of WAR', data=war_stats_filtered)

# Label all points on the plot
for _, row in war_stats_filtered.iterrows():
    plt.text(row['Total WAR'], row['Variance of WAR'], row['player_name'], fontsize=8)

plt.title('Total WAR vs. Variance of WAR for Players with Over 75 Total WAR')
plt.xlabel('Total WAR')
plt.ylabel('Variance of WAR')
plt.show()