Shopify Technical Challenge Submission

Symone Hohensee
5 min readSep 20, 2021

--

This is my technical challenge for my application to become a Data Science Intern with Shopify. I hope you enjoy my analysis and submission!

Question 1

The problem: The Average Order Value of sneakers sold over a 30 day window seems skewed.

  1. What could be wrong with the calculations?

First, let’s ask ourselves what we know about the dataset we are given: There are 100 sneaker stores, each store only sells one kind of sneaker, and the Average Order Value (AOV) is $3,145.13.

Second, let’s explore the data.

Quick search for missing values

I saved the given CSV file to a variable to be turned into a workable dataset and quickly checked the entire set for any missing data. Since no data is missing, I will assume all data moving forward to be accurate and complete. Now that we know all of the data is there — let’s take a look at what it all means. For example, I want to see the mean of the order amounts (AOV), as well as the standard deviation, and the maximum order amount in the case of large/bulk sales.

Description of the values of the order-amount column

As we can see above, we have a very high standard deviation in comparison to the mean (AOV), and the rest of the description gives us a good guess as to why. 50% of the sneaker orders are $284 or less, and 75% of the order totals are $390 or less, but the maximum amount an order sold for was $704,000. That’s a lot! Just from this little bit of information, I can see my hypothesis may be correct that one or more of the stores has regular bulk sales of their sneakers.

In order to simplify my search for the specific shops causing this large gap in values, I will search the dataset for any shops that have orders over the amount of 3500 (this is the Average Order Value amount (3145.128) with a little wiggle room), and as you can see with the results below — shops 42 and 78 seem to be my main culprits.

Visualization of the data table representing shops 42 and 78 with abnormally high order amounts

After giving the distribution of sales some consideration:

It appears to me that Shop 42 periodically sells bulk orders of sneakers (as I hypothesized) with some orders containing 2000 sneakers with an order total of $704,000.00 whilst Shop 78 simply sells very expensive sneakers that clock in at around $25,000 a pair. If it’s hard to tell simply from viewing this graph, that’s because these specific values are seen by hovering over the dots on the graphs that can be found on my Github blog post.

Now maybe you wonder — well what percentage of the total orders/sales do these very large and expensive orders make up? The answer is most of it. Those orders from Shop 42 where 2000 total sneakers were in a given order made up 76.1% of the total order amount in our data. For this reason I believe that Shop 42, more so than Shop 78, singlehandedly skewed the distribution of order_amount, therefore rendering the Average Order Value inaccurate.

2. What Metric would you report for this dataset?

This is subjective based on preference and how willing you are to clean and explore data to try and reduce the standard deviation of the values, but as the situation is, Average Order Value is simply not accurate due to the giant variation in order amounts between Shop 42 and all of the other shops.

Given my lack of professional experience in calculating metrics specifically regarding consumers and orders, I sought a little help from a few sources. I’ve gathered that one could potentially find more value in Median Order Value when considered in conjunction with Average Order Value and the Standard Deviation, but no one metric alone will completely or efficiently encapsulate what exactly is going on with the orders for all of the shops combined.

3. What is its value?

The Median Order Value will help give additional information about order_amounts that we didn’t know before, allowing sellers or shop owners to make better educated decisions for their shops. For instance, the Median Order Value shows us that roughly half of the order amounts were over the median, and roughly half were below. In conjunction, all three metrics shown give us a clearer understanding of the skew in the data distribution, as well as just how much the orders vary.

Question 2

For this part of the technical challenge, I was to use SQL to find the answers to 3 questions.

  1. How many orders were shipped by Speedy Express in total?

54 were shipped by Speedy Express.

Image showing how I queried to find how many orders were shipped by Speedy Express

2. What is the last name of the employee with the most orders?

Their last name is Peacock

Image showing how I queried to find the last name of the employee with the most orders

3. What product was ordered the most in Germany?

Simply basing this on the sheer number of the product ordered in Germany, 100 Steeleye Stout were ordered in Germany.

Image showing how I queried to find the item ordered the most in Germany.

I am grateful for the opportunity offered by Shopify and hope to bring good news along with my technical submission!

--

--

Symone Hohensee
Symone Hohensee

No responses yet