Visualization for Others

Shujonixi

• Joshua Allen
• Shuihui Tang
• Xingxing Zhang
• Nihali Jain

Data Vis IS590

The data set is Lahman’s Baseball Database, which can be found at seanlahman.com. The URL is http://seanlahman.com/baseball-archive/statistics/

The database has 28 CSV files. Most of them have more than 5,000 records, and the largest has more than a hundred thousand records.

Limited Use License

This database is copyright 1996-2018 by Sean Lahman.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/

Salary by position across the years

This plot shows the range of salaries at each position in a given year. Use the slider to choose the year you want to look at; hovering over a bar shows the exact salary. We were curious about which position makes the most money in baseball, so we combined the "salary" and "fielding" datasets into one dataframe called "player_position." Having already explored both datasets, we knew that they share "playerID" and "yearID," so we merged them on those fields and created a new dataset with only four columns.

Initially, we showed the data as a line graph for each position across the years. This plot instead shows all the positions as a bar graph for a given year. Sliding through the years gives a clearer visual contrast in how pay rates have changed.
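The merge described above can be sketched roughly as follows. This is a minimal illustration, not our exact notebook code: the tiny dataframes stand in for Salaries.csv and Fielding.csv, the values are made up, and the choice of "POS" and "salary" as the remaining two columns is an assumption.

```python
import pandas as pd

# Stand-ins for Salaries.csv and Fielding.csv (values are made up).
salaries = pd.DataFrame({
    "playerID": ["a01", "a01", "b02"],
    "yearID": [2000, 2001, 2000],
    "teamID": ["NYA", "NYA", "BOS"],
    "salary": [500000, 750000, 1200000],
})
fielding = pd.DataFrame({
    "playerID": ["a01", "a01", "b02", "c03"],
    "yearID": [2000, 2001, 2000, 2000],
    "POS": ["SS", "SS", "P", "C"],
    "G": [150, 148, 33, 90],
})

# Inner merge on the two shared keys, then keep only four columns
# (the column selection is an assumption).
player_position = pd.merge(
    salaries, fielding, on=["playerID", "yearID"], how="inner"
)[["playerID", "yearID", "POS", "salary"]]

# Salary range per position for one year, i.e. what the slider selects.
year = 2000
summary = (
    player_position[player_position["yearID"] == year]
    .groupby("POS")["salary"]
    .agg(["min", "max"])
)
print(summary)
```

An inner merge drops players who appear in only one of the two tables, which is why the hypothetical player "c03" (no salary record) does not show up in the result.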

Salaries by age for different positions

This graph demonstrates how salaries change by age at each position, aggregating data from 1985 until 2015. Ages 30 to 40 are players' golden years at every position. That makes sense, given that those tend to be the ages when players enter free agency, even though the data also show that player performance tends to decline after age 30. Most players' salaries drop after age 40, which is around when many players retire. The pitcher spike in the late 40s is mainly due to a single outlier: Jamie Moyer, who finally retired at age 50 (!!) and was an effective starting pitcher well into his 40s, when he helped the Phillies to several post-season appearances and a World Series title.

For this plot, we merged three tables (People, Fielding and Salary), using the playerID field to link them. To clean the dataset, we kept only the seasons played from 1985 to 2015.
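A rough sketch of this three-table merge is shown below. The dataframes are made-up stand-ins for the CSV files. Two details are assumptions rather than a record of what we ran: the season-level tables are also matched on "yearID" so that each salary lines up with the right season, and age is approximated as season year minus "birthYear".

```python
import pandas as pd

# Made-up stand-ins for People.csv, Fielding.csv and Salaries.csv.
people = pd.DataFrame({"playerID": ["a01", "b02"], "birthYear": [1970, 1962]})
fielding = pd.DataFrame({
    "playerID": ["a01", "a01", "b02", "b02"],
    "yearID": [1984, 2000, 2010, 2016],
    "POS": ["SS", "SS", "P", "P"],
})
salaries = pd.DataFrame({
    "playerID": ["a01", "a01", "b02", "b02"],
    "yearID": [1984, 2000, 2010, 2016],
    "salary": [100000, 900000, 6500000, 1000000],
})

# Link the season tables (assumption: on playerID and yearID),
# then attach birth data through playerID.
merged = (
    fielding.merge(salaries, on=["playerID", "yearID"])
    .merge(people, on="playerID")
)

# Keep only the 1985-2015 seasons.
merged = merged[merged["yearID"].between(1985, 2015)].copy()

# Approximate age as season year minus birth year (assumption).
merged["age"] = merged["yearID"] - merged["birthYear"]

# Mean salary for each position and age.
salary_by_age = merged.groupby(["POS", "age"])["salary"].mean()
print(salary_by_age)
```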

Salaries by age over the years

Here we show a graphic that reflects how players at different positions have been paid over the years. You can see trends similar to those shown above: players tend to be paid more in what are considered their prime production years.

Payrolls for the top teams

For this part, we show the relationship between teams' success in the regular season and their total payroll. As the graphic shows, the teams that had the most success were, in general, the teams that spent the most money. The "Teams" dataset has a Rank variable, which roughly corresponds to where teams finished the regular season (though this is muddied a little by post-season structure, as you can see from the lines that stop or disappear), so we looked at the relationship between teams' rank and salary. We merged Teams.csv and Salaries.csv into one dataframe and grouped by yearID and Rank. Then we renamed each column, because the column names were initially numbers and could not be used in a dataframe slice. We defined a dictionary to set each rank's color. Finally, we defined a function to plot rank against year and salary.
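The grouping and renaming steps can be sketched as below. The data are made up, and several choices are assumptions for illustration only: summing salaries into a team payroll before averaging by rank, the "rank_N" column names, and the colors in the dictionary.

```python
import pandas as pd

# Made-up stand-ins for Teams.csv and Salaries.csv.
teams = pd.DataFrame({
    "yearID": [2000, 2000, 2001, 2001],
    "teamID": ["NYA", "BOS", "NYA", "BOS"],
    "Rank": [1, 2, 2, 1],
})
salaries = pd.DataFrame({
    "yearID": [2000, 2000, 2000, 2001, 2001, 2001],
    "teamID": ["NYA", "NYA", "BOS", "NYA", "BOS", "BOS"],
    "salary": [5_000_000, 3_000_000, 4_000_000,
               6_000_000, 2_000_000, 5_000_000],
})

# Total payroll per team and season (assumption: sum of player salaries).
payroll = salaries.groupby(["yearID", "teamID"], as_index=False)["salary"].sum()

# Attach each team's final rank, then average payroll by year and rank.
merged = payroll.merge(teams, on=["yearID", "teamID"])
by_rank = merged.groupby(["yearID", "Rank"])["salary"].mean().unstack("Rank")

# The unstacked columns are the integers 1, 2, ...; rename them to strings.
by_rank.columns = [f"rank_{r}" for r in by_rank.columns]

# Hypothetical color for each rank, used when plotting one line per rank.
rank_colors = {"rank_1": "gold", "rank_2": "silver"}
print(by_rank)
```

With one line per renamed column, each rank can then be drawn in its dictionary color against the yearID index.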

One interesting aspect of these data is that you can see the changes the league has gone through reflected in this plot. In 1993, divisions were reorganized from 2 divisions to 3 with the addition of two new expansion teams. The league changed again in 1998 with the addition of 2 more new teams, and the leagues were reorganized in 2012 due to interleague play and the addition of the wildcard playoff game.

Heatmap of pitching attributes

Here, we present a heatmap of pitching attributes correlated against each other to see whether any patterns emerge. As this graphic shows, performance in terms of wins and losses has a very low correlation with salary, while batters faced and hits allowed correlate much more strongly with salary. That is what you would expect, since starting pitchers make more money, face more batters, and thus give up more hits.

Other unsurprising trends are a correlation between giving up hits and losing, and a fairly good correlation between number of wins and number of shutouts pitched. While nothing here was overly surprising, the heatmap gives a good way to quickly evaluate the raw data.
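The matrix behind a heatmap like this one can be computed with pandas. The sketch below uses made-up numbers, and the column selection (W, L, BFP for batters faced, H, SHO, salary) is an assumption about which attributes to include.

```python
import pandas as pd

# Made-up pitching lines joined with salary; column names follow the
# Lahman Pitching table (BFP = batters faced, SHO = shutouts).
pitching = pd.DataFrame({
    "W": [20, 15, 3, 8, 12],
    "L": [5, 10, 4, 12, 9],
    "BFP": [900, 850, 250, 700, 780],
    "H": [190, 200, 60, 180, 175],
    "SHO": [3, 1, 0, 0, 1],
    "salary": [9e6, 7e6, 1e6, 4e6, 5e6],
})

# Pairwise Pearson correlation between every pair of columns.
corr = pitching.corr()
print(corr.round(2))
```

The resulting square matrix is symmetric with ones on the diagonal, and can be handed to any heatmap routine for display.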

On-base-plus-slugging comparison

In baseball, on-base-plus-slugging (OPS) is a stat that reflects players' hitting performance. As it is two percentages added together, it has little meaning on its own, but it serves as a way to compare players across the league. On-base percentage (OBP) is a measure of the hits, walks, and hit-by-pitches a player received, divided by total plate appearances. Slugging percentage is basically a weighted batting average, where singles are weighted 1, doubles 2, triples 3 and home runs 4. The total bases are divided by the number of official at-bats (defined as a plate appearance that doesn't end in a walk or sacrifice). OPS is a good predictor of future performance, as OBP and power tend to be consistent across players' careers. There are further refinements, such as OPS+, which normalizes OPS by the league average. Because players with few plate appearances produce outliers, we set a reasonable lower limit on the number of at-bats. This should have the effect of focusing the data only on regular hitters (American League pitchers, for example, only bat in interleague play, and relief pitchers hardly bat at all).
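As a worked example of the definitions above, here is a small sketch of the OPS calculation. The OBP denominator uses the standard AB + BB + HBP + SF form as the plate-appearance count. The column names follow the Lahman Batting table, the data are made up, and MIN_AT_BATS is a hypothetical value, not the cutoff we actually used.

```python
import pandas as pd

MIN_AT_BATS = 50  # hypothetical cutoff, for illustration only

# Made-up batting lines: a regular hitter and a player with 4 at-bats.
batting = pd.DataFrame({
    "playerID": ["a01", "b02"],
    "AB": [100, 4],
    "H": [30, 2],
    "2B": [5, 0],
    "3B": [1, 0],
    "HR": [4, 1],
    "BB": [10, 0],
    "HBP": [0, 0],
    "SF": [0, 0],
})

def add_ops(df):
    """Return a copy of df with OBP, SLG and OPS columns added."""
    out = df.copy()
    singles = out["H"] - out["2B"] - out["3B"] - out["HR"]
    total_bases = singles + 2 * out["2B"] + 3 * out["3B"] + 4 * out["HR"]
    out["OBP"] = (out["H"] + out["BB"] + out["HBP"]) / (
        out["AB"] + out["BB"] + out["HBP"] + out["SF"]
    )
    out["SLG"] = total_bases / out["AB"]
    out["OPS"] = out["OBP"] + out["SLG"]
    return out

# Drop small samples first, then compute the rates.
regulars = add_ops(batting[batting["AB"] >= MIN_AT_BATS])
print(regulars[["playerID", "OBP", "SLG", "OPS"]])
```

The second player shows why a lower limit matters: one home run in four at-bats would give a slugging percentage of 1.250, far outside anything a regular hitter sustains.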

Specifically, we examine the performance of players who played in the post-season by comparing their post-season and regular-season OPS, to look for interesting features. Do players step it up in the post-season, or crack under the pressure?

For a first pass at the post-season OPS vs. regular-season OPS data, we made a hexbin plot with interactions that allow you to select an OPS range and a date range. At its lowest setting, the OPS range shows all values; beyond that, it shows ranges that coincide with Bill James's categories of hitter ability:

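| Category | Classification | OPS Range |
|----------|----------------|-----------------|
| A | Great | .9000 or higher |
| B | Very Good | .8334 to .8999 |
| C | Above Average | .7667 to .8333 |
| D | Average | .7000 to .7666 |
| E | Below Average | .6334 to .6999 |
| F | Poor | .5667 to .6333 |
| G | Very Poor | .5666 and lower |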
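The category boundaries in the table can be turned into a small lookup. This is only a sketch: the function name classify_ops is hypothetical, and it uses the lower bound of each category as a threshold.

```python
from bisect import bisect_right

# Lower bounds of categories F through A, in ascending order.
THRESHOLDS = [0.5667, 0.6334, 0.7000, 0.7667, 0.8334, 0.9000]
LABELS = ["G", "F", "E", "D", "C", "B", "A"]

def classify_ops(ops):
    """Return the category letter (A to G) for an OPS value."""
    # bisect_right counts how many lower bounds the value has reached.
    return LABELS[bisect_right(THRESHOLDS, ops)]

for value in (0.95, 0.80, 0.70, 0.50):
    print(value, classify_ops(value))
```

Filtering the plot to one category then amounts to keeping the rows whose letter matches the selected range.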

For the years, the lowest setting shows all years; beyond that, it shows the decade starting at the shown date, so 1990 shows the data from the '90s. Histograms show the distribution of the OPS values in the regular season and post-season. Regular-season data are shown only for players who appeared in the post-season. While there are some interesting features in these data, the main takeaway was that performance in both realms followed a normal distribution overall. Some good players played badly, some bad players played well, but overall they tended to perform at the same level. So Billy Beane's claim that his advanced analytics techniques don't work in the post-season was untrue. It turned out his teams were just unlucky outliers.

Gaussian KDE + interactivity

For this plot, we look at the same data in a different way. The base of this plot is a scatter plot of all the data, which is then overlaid with a Gaussian kernel density estimate, shown as color. This is an approximation of the probability density at each point, and it gives a visual representation of how concentrated the points are in the scatter. The interesting takeaway for me is that the density was never in the red. So the data do follow a normal distribution, but with a pretty wide standard deviation (which is also fairly evident in the histograms above).
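A density estimate of this kind can be computed with scipy.stats.gaussian_kde. The sketch below uses synthetic OPS values rather than the real data, and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic regular-season and post-season OPS values (not real data).
rng = np.random.default_rng(0)
regular_ops = rng.normal(0.75, 0.10, size=200)
post_ops = regular_ops + rng.normal(0.0, 0.15, size=200)

# gaussian_kde expects an array of shape (n_dimensions, n_points).
xy = np.vstack([regular_ops, post_ops])

# Fit the estimator, then evaluate the density at each data point.
density = gaussian_kde(xy)(xy)
print(density.min(), density.max())
```

The density array has one value per point, so it can be passed as the color argument of a scatter plot to shade each point by how crowded its neighborhood is.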