2025

Analysis of US avocado sales and prices (2015–2020)

R

Exploratory analysis and modelling of avocado sales data (organic and conventional) in multiple US markets using R. The project includes outlier detection, correlation analysis, calculation of price-sales elasticities and price forecasting using time series models.

Introduction

In this project, I carry out a data analysis focused on avocado prices in the US (2015–2020), as a case study applying Business Intelligence techniques. The objective is to generate relevant insights from real data, balancing a commercial approach with technical rigour. The methodology includes an exploratory data analysis (EDA), calculation of price-sales elasticities, and the implementation of a price prediction model.

Dataset

For the analysis, I used the Avocado Prices 2020 dataset from Kaggle, which contains weekly information on avocado sales in the US. The sample includes 33,045 complete records, with descriptive variables such as date, region, type of avocado, and year, as well as numerical variables such as average price, sales volume, product codes, and packaging formats.

The average price was $1.38, with a minimum of $0.44 and a maximum of $3.25. The third quartile is $1.62, so the maximum sits far above the bulk of the observations, which points to upper outliers. Weekly sales volume averaged 968,400 units with high dispersion (Q3 ≈ 505,828, maximum ≈ 63,716,144): a mean well above the third quartile indicates a strongly right-skewed distribution with significant outliers. These metrics served as the basis for a detailed analysis of price patterns and market behaviour.

Exploratory Analysis

During the exploratory phase, I used R functions such as summary(), boxplot(), and the dplyr package to examine key variables. Boxplots revealed a large number of outliers, especially for organic avocados.
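As an illustration of the outlier check, here is a minimal sketch of the 1.5 × IQR rule that boxplots use to draw points beyond the whiskers. It runs on simulated prices, not on the Kaggle data, and the column names (type, average_price) and the helper flag_outliers() are hypothetical. It uses base R only so that it runs without extra packages; dplyr's group_by() and summarise() could do the grouping instead.

```r
# Simulated prices for illustration only; NOT the project's dataset.
set.seed(42)
avocado <- data.frame(
  type = rep(c("conventional", "organic"), each = 200),
  average_price = c(rlnorm(200, log(1.1), 0.2), rlnorm(200, log(1.6), 0.3))
)

# Five-number summary plus mean, as in the exploratory phase.
summary(avocado$average_price)

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles.
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule within each avocado type, then count flagged rows per type.
# ave() returns 0/1 here because it writes into a numeric vector.
avocado$is_outlier <- ave(avocado$average_price, avocado$type,
                          FUN = flag_outliers) == 1
table(avocado$type, avocado$is_outlier)
```

Flagging within each type matters because organic and conventional avocados sit at different price levels; a single set of fences over the pooled data would mostly flag the more expensive type.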

When calculating correlations, I observed a weak inverse relationship between price and sales volume. For organic avocados, the covariance was −3.027 and the correlation −0.047; for conventional avocados, these values were −122.979 and −0.092, respectively. Both correlations are close to zero, but the somewhat stronger value for conventional avocados suggests that high prices weigh more on their sales, while for organic avocados other factors such as perceived value or consumer segmentation may play a role.
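The per-type covariance and correlation can be computed along these lines. This is a sketch on simulated data with hypothetical column names (type, average_price, total_volume); the numbers it prints are unrelated to the figures reported above.

```r
# Simulated data for illustration only; the two types get different volume scales.
set.seed(1)
n <- 300
avocado <- data.frame(
  type = rep(c("conventional", "organic"), each = n),
  average_price = runif(2 * n, 0.5, 3)
)
base_log_volume <- ifelse(avocado$type == "conventional", 13, 9)
avocado$total_volume <- exp(base_log_volume -
                            0.5 * log(avocado$average_price) +
                            rnorm(2 * n))

# One row per type with the covariance and Pearson correlation of price and volume.
stats_by_type <- do.call(rbind, lapply(split(avocado, avocado$type), function(d) {
  data.frame(
    type = d$type[1],
    covariance = cov(d$average_price, d$total_volume),
    correlation = cor(d$average_price, d$total_volume)
  )
}))
stats_by_type
```

Covariance carries the units of both variables, so its magnitude grows with the volume scale of each type. Correlation is scale-free, which makes it the better figure for comparing the two types.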

I also compared average prices across regions, highlighting Albany ($1.684) and Boston ($1.743); these figures are useful input for formulating local pricing strategies.

Price-Sales Elasticity

To measure sales sensitivity to price, I fitted linear regression models with R’s lm() function, one per avocado type, on a logarithmic scale. The models indicate a price-sales elasticity of −1.32 for conventional avocados (a 10% price increase is associated with a drop in sales of roughly 13.2%) and −0.767 for organic avocados (a drop of roughly 7.67% for the same increase). These results suggest that consumers of organic products are less sensitive to price increases, prioritising factors such as quality and sustainability.
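A minimal sketch of how an elasticity can be read off a regression, assuming a log-log specification (log of volume regressed on log of price), in which the slope is the elasticity. The data are simulated and the value of true_elasticity is arbitrary; it is not one of the estimates above.

```r
# Simulated data for illustration only.
set.seed(123)
n <- 500
true_elasticity <- -1.2  # arbitrary value used to generate the data
price <- runif(n, 0.5, 3)
volume <- exp(10 + true_elasticity * log(price) + rnorm(n, sd = 0.3))

# Log-log model: the coefficient on log(price) estimates the elasticity.
fit <- lm(log(volume) ~ log(price))
elasticity <- unname(coef(fit)["log(price)"])

# Effect of a 10% price increase on volume:
approx_change <- 0.10 * elasticity    # linear approximation (elasticity x 10%)
exact_change <- 1.10^elasticity - 1   # exact change implied by the log-log model

c(elasticity = elasticity, approx = approx_change, exact = exact_change)
```

Multiplying the elasticity by the percentage change is the usual shorthand; for larger price changes the exact expression gives a somewhat smaller drop, so the shorthand is best read as an approximation.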

Time Series Prediction

I analysed the time series of average prices for organic avocados in Albany, stored as an R ts object. After decomposing the series with the decompose() function, I produced a 12-week forecast using ARIMA and exponential smoothing models. The projections, obtained with the forecast package, suggest stable prices without abrupt fluctuations, providing valuable information for commercial planning.
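The workflow can be sketched as follows. To stay runnable without extra packages, this sketch swaps in base R's arima(), HoltWinters() and predict() from the stats package; the forecast package offers auto.arima(), ets() and forecast() for the same steps. The series is simulated, and the ARIMA order is picked by hand for the example, not taken from the project.

```r
# Simulated weekly prices (3 years); NOT the Albany series from the dataset.
set.seed(2020)
n_weeks <- 156
week <- seq_len(n_weeks)
sim_price <- 1.6 + 0.15 * sin(2 * pi * week / 52) + rnorm(n_weeks, sd = 0.05)

# Weekly series: 52 observations per year, starting in week 1 of 2015.
price_ts <- ts(sim_price, start = c(2015, 1), frequency = 52)

# Classical additive decomposition into trend, seasonal and random parts.
# decompose() needs at least two full periods (here, 104 weeks).
parts <- decompose(price_ts)

# ARIMA(1,0,0), an order chosen by hand for this sketch.
arima_fit <- arima(price_ts, order = c(1, 0, 0), method = "ML")
arima_fc <- predict(arima_fit, n.ahead = 12)$pred

# Simple exponential smoothing: Holt-Winters with trend and seasonality off.
ses_fit <- HoltWinters(price_ts, beta = FALSE, gamma = FALSE)
ses_fc <- predict(ses_fit, n.ahead = 12)

# 12-week-ahead point forecasts from both models, side by side.
forecasts <- data.frame(
  week_ahead = 1:12,
  arima = as.numeric(arima_fc),
  exp_smoothing = as.numeric(ses_fc)
)
round(forecasts, 3)
```

Simple exponential smoothing produces a flat forecast at the last smoothed level, so comparing it with the ARIMA path gives a quick check on whether the ARIMA model is projecting any movement at all.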

Conclusions

The analysis yields three main conclusions:

  1. The high market variability requires continuous monitoring to anticipate changes.
  2. The lower elasticity in organic avocados allows for higher margins, while conventional avocados require more rigorous optimisation strategies.
  3. The forecasted stability in Albany suggests a good opportunity for purchase and inventory management.

This project has allowed me to demonstrate how data analysis generates robust business recommendations oriented towards decision-making.

Tools Used

I performed the analysis using R within the RStudio environment, employing the following key packages and functions for data manipulation, visualisation, and modelling:

  • readr (tidyverse): for reading CSV data.
  • dplyr (tidyverse): for data manipulation and filtering.
  • ggplot2: for creating graphical visualisations.
  • Basic statistical functions: for calculating descriptive statistics and correlations.
  • Linear regression model (lm()): for price-sales elasticity analysis.
  • stats / decompose(): for time series decomposition.
  • forecast: for generating price predictions.