Exploring eBay Car Sales Data

In this project, I am going to work with dataset of used cars from eBay Kleinanzeigen (classified section of German eBay webpage).

You can access the original dataset from here

The original dataset differs from the dataset that I am working with in this analysis due to two reasons:

50,000 data points were selected from the full dataset to ensure high performance.
Dataset was dirtied. (The original dataset is cleaner. Dataset was dirtied for a training of data cleaning skills.)

Dictionary:

dateCrawled - When this ad was first crawled.
name - Name of the car.
seller - Whether the seller is private or dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - Type of the vehicle.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
odometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - Type of fuel that car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

My goal of the project is to clean the data and analyze the car listings in this dataset.

Introduction - importing libraries and the dataset

import pandas as pd
import numpy as np

# autos = pd.read_csv('autos.csv') 
# We can't read the file with default UTF-8 encoding

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
# Another encoding type also works if you want to try: 
# autos = pd.read_csv('autos.csv', encoding = 'Windows-1252')

autos # The data is very messy and raw. I should start cleaning it. 
autos.info() # There are also missing values in few columns. 
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

We can see several issues in the dataset:

1. There are a lot of null values in some columns.
1. Column names are also not clear, as columns are written in camelcase.

Cleaning Column Names

autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

I will change the columns from camelcase to snakecase
I will change the wordings of the columns so they will be more accurate and understandable.

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']
autos.head() # Columns are cleaner now.

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Initial Exploration and Cleaning

Now let's do some basic data exploration:

Text columns where all or almost all values are the same should be removed, as they are not useful for the analysis.
Numeric data stored in text format which can be cleaned and converted.

autos.describe(include = 'all')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	num_photos	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-04-02 11:37:04	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Columns seller, offer_type and num_photos look odd. These columns should ve investigated further.

autos['seller'].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

autos['offer_type'].value_counts()

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

autos['num_photos'].value_counts()

0    50000
Name: num_photos, dtype: int64

1. seller and offer_type are columns where nearly all of the values are the same.
1. num_photos has 0 for every column.

We are going to drop these 3 columns as they are not useful for the data analysis.

autos = autos.drop(['seller', 'num_photos', 'offer_type'], axis = 1)

autos.columns # Columns are sucessfully dropped.

Index(['date_crawled', 'name', 'price', 'ab_test', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

Let's look which columns have numeric data written in the text format. I am going to clean data of such columns and turn it into numeric type.

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   price               50000 non-null  object
 3   ab_test             50000 non-null  object
 4   vehicle_type        44905 non-null  object
 5   registration_year   50000 non-null  int64 
 6   gearbox             47320 non-null  object
 7   power_ps            50000 non-null  int64 
 8   model               47242 non-null  object
 9   odometer            50000 non-null  object
 10  registration_month  50000 non-null  int64 
 11  fuel_type           45518 non-null  object
 12  brand               50000 non-null  object
 13  unrepaired_damage   40171 non-null  object
 14  ad_created          50000 non-null  object
 15  postal_code         50000 non-null  int64 
 16  last_seen           50000 non-null  object
dtypes: int64(4), object(13)
memory usage: 6.5+ MB

It can be observed that columns price and odometer columns have numeric values written in text form.

Let's start from price column.

autos['price'].value_counts() # We should remove $ and , signs 
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)

C:\Users\Beibarys Nyussupov\AppData\Local\Temp\ipykernel_46968\222734509.py:2: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)

autos['price'].head() # Completed!

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32

Now, let's change values of odometer column.

autos['odometer'].value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

autos['odometer'] = autos['odometer'].str.replace(',','').str.replace('km','').astype(int)

autos['odometer'].head() # Completed!

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int32

We cleaned values of odometer column and transformed these values into int-64 form. However, let's make a change to the name of the odometer column to make accurate description of these values.

autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)

autos['odometer_km'].head()

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32

Translating German words to English words

autos.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	date_crawled	name	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

autos['vehicle_type'].unique()

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

First column that has entries with German language is - vehicle_type. It is an important column and we need to change entries written in German to English language.

Kleinwagen - Compact/Subcompact

Kombi - Wagon

andere - other

We also have NA values which should be grouped into other type of cars too.

before = autos['vehicle_type'].value_counts()

before

limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

autos['vehicle_type'] = autos['vehicle_type'].replace('kleinwagen', 'compact/subcompact')
autos['vehicle_type'] = autos['vehicle_type'].replace('kombi', 'wagon')
autos['vehicle_type'] = autos['vehicle_type'].replace('andere', 'other')
autos['vehicle_type'] = autos['vehicle_type'].fillna('undefined')

after = autos['vehicle_type'].value_counts().sort_values(ascending = False)

after

limousine             12859
compact/subcompact    10822
wagon                  9127
undefined              5095
bus                    4093
cabrio                 3061
coupe                  2537
suv                    1986
other                   420
Name: vehicle_type, dtype: int64

Everything in vehichle_type column was cleaned. Now every vehicle type is clear and understandable in English language.

Limousine, compact/subcompact, wagon, bus and cabrio are vehicle types with the most number of car listings on the website.

The next two columns that we need to clean are gearbox and unrepaired_damage

automatik - automatic

manuell - manual

nein - no

ja - yes

autos['gearbox'].unique()

array(['manuell', 'automatik', nan], dtype=object)

autos['gearbox'].value_counts()

manuell      36993
automatik    10327
Name: gearbox, dtype: int64

autos['gearbox'] = autos['gearbox'].replace('manuell', 'manual')
autos['gearbox'] = autos['gearbox'].replace('automatik', 'automatic')
autos['gearbox'] = autos['gearbox'].fillna('undefined')

autos['gearbox'].value_counts()

manual       36993
automatic    10327
undefined     2680
Name: gearbox, dtype: int64

autos['unrepaired_damage'].unique()

array(['nein', nan, 'ja'], dtype=object)

autos['unrepaired_damage'] = autos['unrepaired_damage'].replace('nein', 'no')
autos['unrepaired_damage'] = autos['unrepaired_damage'].replace('ja', 'yes')
autos['unrepaired_damage'] = autos['unrepaired_damage'].fillna('undefined')

autos['unrepaired_damage'].value_counts()

no           35232
undefined     9829
yes           4939
Name: unrepaired_damage, dtype: int64

The rest two columns were cleaned and entries were transformed into understandable English language. Most of the cars on the market are on manual engine and do not have unrepaired damage.

Exploring the Odometer and Price Columns

autos['odometer_km'].value_counts().sort_index(ascending = True)

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Mileage values of cars are rounded, which means that sellers of vehicles had to choose from pre-existing options on the market to define the number of kilometres driven by each car. Furthermore, there are no listed cars with 0 mileage on the market. However it can be a drawback of prexisting options on eBay, as if there are cars that have a mileage below 5000 kilometres - sellers still had to choose 5000 due to limited options.

autos['price'].unique().shape # Number of unique values in price column

(2357,)

autos['price'].value_counts().head(20).sort_index(ascending = True)

0       1421
300      384
500      781
600      531
650      419
700      395
750      433
800      498
850      410
900      420
950      379
999      434
1000     639
1200     639
1500     734
2000     460
2200     382
2500     643
3500     498
4500     394
Name: price, dtype: int64

Again, values are very rounded. However, given that there are 2357 unique values in the price column, it means that sellers are just tended to round numbers and they are not limited to pre-existing options on eBay webpage.

autos['price'].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

autos['price'].value_counts().sort_index(ascending = True).head(50)

0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
115       2
117       1
120      39
122       1
125       8
129       1
130      15
135       1
139       1
140       9
Name: price, dtype: int64

autos['price'].value_counts().sort_index(ascending = False).head(20)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

(len(autos[autos['price'] > 350000]) / len(autos)) * 100 # Percentage of cars with a cell price higher than 350 000 USD

0.027999999999999997

There are plenty of car listings with a prices below 30 dollars and 1421 listings with 0 prices. There are also 14 listings with prices above 350 000 dollars.

Assuming that eBay is an auction website, there can be cars that where the opening equals to 1 dollar. However, I am going to remove all car listings with prices above 350 000 $ as prices increase regularly and then jummp up to very high and less realistic numbers.

autos = autos[autos['price'].between(1, 350000)]

autos['price'].describe() # We removed the outliers in prices

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

Exploring the date columns

Columns with date information:

date_crawled
last_seen
ad_created
registration_month
registration_year

There is a mix of information that was created by crawler and information that was produced by website itself. Let's explore these columns a bit more.

autos[['date_crawled', 'ad_created', 'last_seen']][0:5]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

(autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)) * 100

2016-03-05    2.532688
2016-03-06    1.404304
2016-03-07    3.601359
2016-03-08    3.329558
2016-03-09    3.308967
2016-03-10    3.218367
2016-03-11    3.257490
2016-03-12    3.691959
2016-03-13    1.566972
2016-03-14    3.654896
2016-03-15    3.428395
2016-03-16    2.960980
2016-03-17    3.162772
2016-03-18    1.291053
2016-03-19    3.477813
2016-03-20    3.788737
2016-03-21    3.737259
2016-03-22    3.298672
2016-03-23    3.222485
2016-03-24    2.934212
2016-03-25    3.160712
2016-03-26    3.220426
2016-03-27    3.109235
2016-03-28    3.486050
2016-03-29    3.409863
2016-03-30    3.368681
2016-03-31    3.183363
2016-04-01    3.368681
2016-04-02    3.547823
2016-04-03    3.860805
2016-04-04    3.648718
2016-04-05    1.309585
2016-04-06    0.317101
2016-04-07    0.140019
Name: date_crawled, dtype: float64

According to the data, the website was crawled daily in a period of one month from April to March 2016.

(autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True))* 100

2016-03-05     0.107073
2016-03-06     0.432410
2016-03-07     0.539483
2016-03-08     0.741275
2016-03-09     0.959539
2016-03-10     1.066612
2016-03-11     1.237517
2016-03-12     2.378256
2016-03-13     0.889529
2016-03-14     1.260167
2016-03-15     1.587563
2016-03-16     1.645218
2016-03-17     2.808607
2016-03-18     0.735097
2016-03-19     1.583445
2016-03-20     2.065273
2016-03-21     2.063214
2016-03-22     2.137342
2016-03-23     1.853186
2016-03-24     1.976732
2016-03-25     1.921137
2016-03-26     1.680222
2016-03-27     1.564913
2016-03-28     2.085864
2016-03-29     2.234119
2016-03-30     2.477093
2016-03-31     2.378256
2016-04-01     2.279419
2016-04-02     2.491506
2016-04-03     2.520334
2016-04-04     2.448265
2016-04-05    12.476063
2016-04-06    22.180583
2016-04-07    13.194688
Name: last_seen, dtype: float64

last_seen column has a data recorded by crawler which shows the last date on which any car listing was seen. Looking on these values can help us with determining on what day the listing was removed, possibly because the car was sold.

Last three days of March (05, 06, 07) have disproportinate last_seen percentages, which are around 10 times bigger from other days. It is very unlikely that these 3 days experienced a rapid hike in sales and it is more likely that these values are related to crawling period ending and don't show spike in in car sales.

(autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)) * 100

2015-06-11    0.002059
2015-08-10    0.002059
2015-09-09    0.002059
2015-11-10    0.002059
2015-12-05    0.002059
                ...   
2016-04-03    3.885514
2016-04-04    3.685782
2016-04-05    1.181921
2016-04-06    0.325337
2016-04-07    0.125605
Name: ad_created, Length: 76, dtype: float64

print('Number of dates crawled:', autos['date_crawled'].str[:10].unique().shape)
print('Number of dates ad_created:', autos['ad_created'].str[:10].unique().shape)
(autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)) * 100

Number of dates crawled: (34,)
Number of dates ad_created: (76,)





2015-06-11    0.002059
2015-08-10    0.002059
2015-09-09    0.002059
2015-11-10    0.002059
2015-12-05    0.002059
                ...   
2016-04-03    3.885514
2016-04-04    3.685782
2016-04-05    1.181921
2016-04-06    0.325337
2016-04-07    0.125605
Name: ad_created, Length: 76, dtype: float64

There is a variety of dates when car listings were created. Some of them are within 1-2 months of the listing date, but some other dates are old and few of them even fall within around 8-9 months.

We can also turn dates in date_crawled, ad_created and last_seen columns to uniform numeric data.

autos['date_crawled'] = autos['date_crawled'].str.replace('-','').str.split(' ').str[0].astype(int)
autos['ad_created'] = autos['ad_created'].str.replace('-','').str.split(' ').str[0].astype(int)
autos['last_seen'] = autos['last_seen'].str.replace('-','').str.split(' ').str[0].astype(int)

autos['date_crawled'].head()

0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int32

autos['ad_created'].head()

0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: ad_created, dtype: int32

autos['last_seen'].head()

0    20160406
1    20160406
2    20160406
3    20160315
4    20160401
Name: last_seen, dtype: int32

autos.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	date_crawled	name	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	20160326	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manual	158	andere	150000	3	lpg	peugeot	no	20160326	79588	20160406
1	20160404	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatic	286	7er	150000	6	benzin	bmw	no	20160404	71034	20160406
2	20160326	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manual	102	golf	70000	7	benzin	volkswagen	no	20160326	35394	20160406
3	20160312	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	control	compact/subcompact	2007	automatic	71	fortwo	70000	6	benzin	smart	no	20160312	33729	20160315
4	20160401	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	test	wagon	2003	manual	0	focus	150000	7	benzin	ford	no	20160401	39218	20160401

autos['registration_year'].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

It is more likely that the date when the car was first registered will indicate the age of the car. Some strange values can be seen in above statistics. Minimum registration year is 1000 when cars even weren't used at that time and maximum is 9999 - a year which we even didn't reach yet.

Dealing with Incorrect Registration Year Data

Because a car can't be registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. However determining the earliest valid year is more complex. It is more likely to be somewhere in 1900s.

Let's count the number of listings with cars that fall outside the 1900-2016 interval and see if it's safe to tremove those rows entirely, or if we need more custom logic.

((~autos["registration_year"].between(1900,2016)).sum() / autos.shape[0]) * 100

3.8793369710697

As the inaccurate dates are less than 4% of the data, we can safely remove such rows.

autos = autos[autos["registration_year"].between(1900,2016)]
(autos["registration_year"].value_counts(normalize = True, dropna = False).sort_index(ascending = False).head(25))*100

2016    2.613483
2015    0.839742
2014    1.420278
2013    1.720186
2012    2.806281
2011    3.476789
2010    3.403954
2009    4.466485
2008    4.744971
2007    4.877788
2006    5.719672
2005    6.289497
2004    5.790364
2003    5.781796
2002    5.325507
2001    5.646837
2000    6.760781
1999    6.205951
1998    5.062017
1997    4.179431
1996    2.941239
1995    2.628478
1994    1.347443
1993    0.910435
1992    0.792614
Name: registration_year, dtype: float64

(autos['registration_year'].value_counts(normalize = True).head(25))*100

2000    6.760781
2005    6.289497
1999    6.205951
2004    5.790364
2003    5.781796
2006    5.719672
2001    5.646837
2002    5.325507
1998    5.062017
2007    4.877788
2008    4.744971
2009    4.466485
1997    4.179431
2011    3.476789
2010    3.403954
1996    2.941239
2012    2.806281
1995    2.628478
2016    2.613483
2013    1.720186
2014    1.420278
1994    1.347443
1993    0.910435
2015    0.839742
1992    0.792614
Name: registration_year, dtype: float64

As inaccurate rows were removed, it can be seen that most of the cars were first registered in past 22 years.

Exploring Price by Brand

(autos['brand'].value_counts(normalize = True)) * 100

volkswagen        21.126368
bmw               11.004477
opel              10.758124
mercedes_benz      9.646323
audi               8.656627
ford               6.989996
renault            4.714980
peugeot            2.984083
fiat               2.564212
seat               1.827296
skoda              1.640925
nissan             1.527388
mazda              1.518819
smart              1.415994
citroen            1.400998
toyota             1.270324
hyundai            1.002549
sonstige_autos     0.981127
volvo              0.914719
mini               0.876159
mitsubishi         0.822604
honda              0.784045
kia                0.706926
alfa_romeo         0.664082
porsche            0.612669
suzuki             0.593389
chevrolet          0.569825
chrysler           0.351321
dacia              0.263490
daihatsu           0.250637
jeep               0.227073
subaru             0.214220
land_rover         0.209936
saab               0.164949
jaguar             0.156381
daewoo             0.149954
trabant            0.139243
rover              0.132816
lancia             0.107110
lada               0.057839
Name: brand, dtype: float64

(autos['brand'].value_counts(normalize = True).head(7))*100

volkswagen       21.126368
bmw              11.004477
opel             10.758124
mercedes_benz     9.646323
audi              8.656627
ford              6.989996
renault           4.714980
Name: brand, dtype: float64

There are a wide range of car brands that do not have significant percentages of listings, that's why I have filtered down the data to the manufacturers that have more than 4% of total listings.

German transport manufacturers represent top 5 brands with around 60% of overall listings. Volkswagen is the most popular car brand in the market having approximately 21% of overall listings.

brand_counts = (autos['brand'].value_counts(normalize = True)) * 100
top_brands = brand_counts[brand_counts > 4].index
print(top_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford',
       'renault'],
      dtype='object')

mean_prices = {}
for brand in top_brands: 
    names = autos[autos['brand'] == brand]
    price = names['price'].mean()
    mean_prices[brand] = int(price)
    
mean_prices

{'volkswagen': 5402,
 'bmw': 8332,
 'opel': 2975,
 'mercedes_benz': 8628,
 'audi': 9336,
 'ford': 3749,
 'renault': 2474}

According to the data:

    - Audi, Bmw and Mercedes Benz are more expensive car brands. 
    - Ford, Openal and Renault are less expensive car manufacturers. 
    - Folkswagen - It is a car brand that is "something in between", 
    which also explains the popularity of given brand.

Exploring mileage

bmp_series = pd.Series(mean_prices).sort_values(ascending = False)
price_df = pd.DataFrame(bmp_series, columns = ['mean_price($)'])
price_df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	mean_price($)
audi	9336
mercedes_benz	8628
bmw	8332
volkswagen	5402
ford	3749
opel	2975
renault	2474

mean_mileage = {}
for brand in top_brands: 
    names = autos[autos['brand'] == brand]
    mileages = names['odometer_km'].mean()
    mean_mileage[brand] = int(mileages)

bmm_series = pd.Series(mean_mileage).sort_values(ascending  = False)


mileage_df = pd.DataFrame(bmm_series, columns = ['mean_mileage(km)'])
mileage_df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	mean_mileage(km)
bmw	132572
mercedes_benz	130788
opel	129310
audi	129157
volkswagen	128707
renault	128071
ford	124266

#Common dataframe 
mileage_df['mean_price($)'] = price_df

mileage_df

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	mean_mileage(km)	mean_price($)
bmw	132572	8332
mercedes_benz	130788	8628
opel	129310	2975
audi	129157	9336
volkswagen	128707	5402
renault	128071	2474
ford	124266	3749

The range of kilometers passed doesn't really vary by car brand as prices do. Although, there is a weak trend that more expensive cars have more mileage than less expensive cars.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.ipynb_checkpoints		.ipynb_checkpoints
.gitattributes		.gitattributes
Basics.ipynb		Basics.ipynb
README.md		README.md
autos.csv		autos.csv
eBay Car Sales - Jupyter Notebook.pdf		eBay Car Sales - Jupyter Notebook.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Exploring eBay Car Sales Data

Introduction - importing libraries and the dataset

Cleaning Column Names

Initial Exploration and Cleaning

Translating German words to English words

Exploring the Odometer and Price Columns

Exploring the date columns

Dealing with Incorrect Registration Year Data

Exploring Price by Brand

Exploring mileage

About

Releases

Packages

Languages

NBeibarys/eBay_CarSales

Folders and files

Latest commit

History

Repository files navigation

Exploring eBay Car Sales Data

Introduction - importing libraries and the dataset

Cleaning Column Names

Initial Exploration and Cleaning

Translating German words to English words

Exploring the Odometer and Price Columns

Exploring the date columns

Dealing with Incorrect Registration Year Data

Exploring Price by Brand

Exploring mileage

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages