Data Analysis Case Study: Visualizing Laptop Price Data

🤵‍♂️ Homepage: @艾派森

✍🏻 About the author: a Python learner

1. Project background

2. Dataset overview

3. Tools

4. Importing the data

5. Data preprocessing

6. Data visualization

Source code

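1. Project background

Laptops have become an everyday tool in education, business, and entertainment. As the market has grown and competition has intensified, laptop pricing has become harder to follow, which is a challenge for buyers and sellers alike.

Data visualization helps here. It turns large, complex datasets into charts that make patterns and trends easy to see. For laptop prices, it helps buyers understand the market and make better purchasing decisions, and it gives sellers a basis for market analysis, forecasting, and strategy.

Price data can be shown as time series, bar charts, scatter plots, and other chart types. A time series shows how prices move over time, a bar chart compares prices across brands or configurations, and a scatter plot reveals how price relates to other factors such as performance or ratings.

Visualization can also be combined with other methods for a deeper study. Regression analysis quantifies the relationship between price and factors such as performance or brand recognition. Cluster analysis groups laptops of different brands and configurations by price level, which exposes market structure and the competitive landscape.

In short, visual analysis of laptop prices gives consumers clear market information and gives sellers a practical tool for analysis and forecasting. As data science and visualization tools mature, this kind of analysis should become more useful still.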

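2. Dataset overview

The dataset comes from Kaggle. The raw data has 1303 rows and 13 variables, described below:

0 laptop_ID - numeric - product ID

1 Company - string - laptop manufacturer

2 Product - string - brand and model

3 TypeName - string - type (Notebook, Ultrabook, Gaming, etc.)

4 Inches - numeric - screen size

5 ScreenResolution - string - screen resolution

6 Cpu - string - central processing unit (CPU)

7 Ram - string - laptop RAM

8 Memory - string - hard disk / SSD storage

9 Gpu - string - graphics processing unit (GPU)

10 OpSys - string - operating system

11 Weight - string - laptop weight

12 Price_euros - numeric - price in euros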

3. Tools

Python version: 3.9

Code editor: Jupyter Notebook

4. Importing the data

Import the third-party libraries and load the dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re  # standard-library regex module; only re.IGNORECASE is needed below
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("laptop_price.csv",encoding='latin-1') 
df.head()

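Check the size of the data

df.shape

Check the basic information

df.info()

Descriptive statistics for the numeric variables

df.describe()

Descriptive statistics for the non-numeric variables

df.describe(include='O')

5. Data preprocessing

Check for missing values and duplicate rows

df.isnull().sum()
df.duplicated().sum()

Neither is present.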
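As a minimal sketch of these two checks, the toy frame below uses made-up rows (not the Kaggle data) that contain one missing value and one repeated row, so both counts come out non-zero. On the real dataset both are reported as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only (not taken from laptop_price.csv).
toy = pd.DataFrame({
    "Company": ["Dell", "HP", "HP", "Acer"],
    "Ram": ["8GB", "4GB", "4GB", np.nan],
    "Price_euros": [900.0, 450.0, 450.0, 380.0],
})

missing_per_column = toy.isnull().sum()      # number of NaN values in each column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```

If either count were non-zero on the real data, `dropna()` or `drop_duplicates()` would be the usual next step.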

Variable processing

# Lowercase the column names so they are easier to type.
df = df.rename(columns=str.lower)
# Derive new variables from the raw columns
df['resolution'] = df['screenresolution'].str.extract(r'(\d+x\d+)')
df['touch_screen'] = df['screenresolution'].str.extract(r'(Touchscreen)', flags=re.IGNORECASE).fillna('No')
df['touch_screen']=df['touch_screen'].replace("Touchscreen","Yes")
df['IPS Panel_screen'] = df['screenresolution'].str.extract(r'(IPS Panel)',re.IGNORECASE).fillna('No')
df['IPS Panel_screen']=df['IPS Panel_screen'].replace("IPS Panel","Yes")
df = df.drop('screenresolution', axis=1)
df['ram'] = df['ram'].str.replace('GB', '')
df['weight'] = df['weight'].str.replace('kg', '')
df['ram']=df['ram'].astype(int)
df['weight']=df['weight'].astype(float)
# Capture only the number; the optional decimal part also matches values such as "3GHz".
df['cpu_frq(GHz)'] = df['cpu'].str.extract(r"(\d+(?:\.\d+)?)\s*GHz", expand=False).astype(float)
df['cpu_brand']=df['cpu'].str.extract(r"^([\w\-]+)")
# Convert terabyte values to gigabytes ("1.0TB" first, so that "1TB" covers both spellings).
df['memory'] = df['memory'].str.replace('1.0TB', '1TB', regex=False)
df['memory'] = df['memory'].str.replace('1TB', '1000GB', regex=False)
df['memory'] = df['memory'].str.replace('2TB', '2000GB', regex=False)
# Split "first drive + second drive" into two columns; single-drive rows get NaN in the second.
df[['memory1_type','memory2_type']] = df['memory'].str.split('+', n=1, expand=True)
# Extract each capacity in GB, then strip it from the type columns.
capacity_pattern = r"(\d+(?:\.\d+)?)\s*GB"
df['memory1_capacity'] = df['memory1_type'].str.extract(capacity_pattern, expand=False).astype(float)
# A missing second drive counts as 0 GB.
df['memory2_capacity'] = df['memory2_type'].str.extract(capacity_pattern, expand=False).astype(float).fillna(0)
df['memory1_type'] = df['memory1_type'].str.replace(capacity_pattern, '', regex=True).str.strip()
df['memory2_type'] = df['memory2_type'].str.replace(capacity_pattern, '', regex=True).str.strip()
df=df.drop(['memory'], axis=1)
df["gpu_brand"]=df['gpu'].str.extract(r"^([\w\-]+)")
df['opsys']=df['opsys'].replace({'Windows 10' : 'Windows', 'Windows 10 S' : 'Windows',
                                  'Windows 7' : 'Windows', 'Mac OS X' : 'macOS' }) 
df.head()
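The string parsing above can also be written as small plain-Python helpers, which are easier to test on individual values. This is only a sketch: `parse_memory` and `parse_cpu_ghz` are hypothetical names, and the sample strings imitate the format of the `Memory` and `Cpu` columns rather than being copied from the dataset.

```python
import re

# Capacity (optional decimals), unit, then the drive type.
_DRIVE = re.compile(r"(\d+(?:\.\d+)?)\s*(GB|TB)\s+(.+)")

def parse_memory(text):
    """Split a Memory string into a list of (capacity_gb, drive_type) tuples."""
    drives = []
    for part in text.split("+"):
        match = _DRIVE.search(part.strip())
        if match is None:
            continue  # unrecognised fragment: skip rather than guess
        size, unit, kind = match.groups()
        capacity_gb = float(size) * (1000 if unit == "TB" else 1)
        drives.append((capacity_gb, kind.strip()))
    return drives

def parse_cpu_ghz(text):
    """Return the clock speed in GHz, or None when no 'GHz' figure is present."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*GHz", text)
    return float(match.group(1)) if match else None

# Hypothetical sample values in the dataset's format.
samples = ["128GB SSD +  1TB HDD", "1.0TB Hybrid", "256GB Flash Storage"]
for s in samples:
    print(s, "->", parse_memory(s))
print(parse_cpu_ghz("Intel Core i5 7200U 2.5GHz"), parse_cpu_ghz("Intel Core i7 3GHz"))
```

Making the decimal part optional matters for values such as "3GHz", which a pattern that requires a decimal point would silently turn into a missing value.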

6. Data visualization

# Plot settings
sns.set(rc={"axes.facecolor":"#FAF3FC","figure.facecolor":"#FAF3FC",'figure.figsize':(14,5)})
pallet = ["#998289","#9981A0","#F5B7B1","#F9E79F"]

# Manufacturer and laptop type
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="company",ax=axes[0],palette=pallet,data=df,order=df["company"].value_counts().index)
sns.boxplot(x ='price_euros',y ="company" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Company",fontsize=15)
axes[1].set_title("Laptop Price by Company",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="typename",ax=axes[0],palette=pallet,data=df,order=df["typename"].value_counts().index)
sns.boxplot(x ='price_euros',y ="typename" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Type",fontsize=15)
axes[1].set_title("Laptop Price by Type ",fontsize=15)
plt.show()

fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="price_euros",ax=axes[0],color="grey",data=df)
sns.barplot(x="company",y='price_euros',estimator=np.mean,ax=axes[1],palette=pallet,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Price Distribution",fontsize=15)
axes[1].set_title("Average Price for Each Company ",fontsize=15)
plt.show()

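Dell, Lenovo, HP, Asus, and Acer are the most common brands in the dataset.

Samsung, Razer, Mediacom, Microsoft, Xiaomi, Vero, Chuwi, Google, Fujitsu, LG, and Huawei each have fewer than 10 laptops in the dataset. Razer is the most expensive brand, but there are only 7 Razer laptops.

Among the most common brands, MSI has the highest average price.

Dell, Lenovo, HP, and Asus laptops average around 1000 euros.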
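Statements like these can be checked numerically with a group-by summary. The sketch below uses a small made-up frame; with the real data, the preprocessed `df` would take the place of `toy`. The `MIN_LAPTOPS` cutoff for a "common" brand is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the real analysis would use the preprocessed `df`.
toy = pd.DataFrame({
    "company": ["Dell", "Dell", "Dell", "MSI", "MSI", "Razer"],
    "price_euros": [900.0, 1100.0, 1000.0, 1700.0, 1900.0, 3300.0],
})

# Count and mean price per brand, most expensive first.
summary = (
    toy.groupby("company")["price_euros"]
       .agg(n_laptops="count", mean_price="mean")
       .sort_values("mean_price", ascending=False)
)

MIN_LAPTOPS = 2  # assumed cutoff for "common" brands in this toy example
common = summary[summary["n_laptops"] >= MIN_LAPTOPS]

print(summary)
print("most expensive common brand:", common["mean_price"].idxmax())
```

Reporting the count next to the mean makes it obvious when an average rests on only a handful of laptops, as with Razer in the text.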

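Overall, Vero has the lowest average price in the dataset, and Acer is the cheapest of the common brands. Prices range from about 200 to 6000 euros, but most laptops cost less than 4000 euros.

The dataset contains 6 laptop types. Notebook is the most common and Netbook the least common.

Notebooks and Netbooks have the lowest average prices.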
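This walkthrough removes high prices with a fixed cutoff of 4000 euros. A common alternative, shown here only as a sketch and not as the author's method, is Tukey's rule: flag values above the third quartile plus 1.5 times the interquartile range. The prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical prices in euros, chosen only to illustrate the rule.
prices = pd.Series([400, 600, 800, 1000, 1200, 1400, 1600, 6000], dtype=float)

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)      # Tukey's upper fence

flagged = prices[prices > upper_fence]  # candidate outliers
kept = prices[prices <= upper_fence]

print(f"upper fence = {upper_fence:.0f}, flagged = {flagged.tolist()}")
```

A data-driven fence adapts to the distribution, while a fixed cutoff is easier to explain. Price distributions tend to be right-skewed, so the fence may flag more rows than a hand-picked threshold.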

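Only 4 laptops cost more than 4000 euros: two from Razer, one from HP, and one from Lenovo. These prices are treated as outliers and dropped.

df = df[df["price_euros"] < 4000]
# CPU, GPU, and operating system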
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="gpu_brand",ax=axes[0],palette=pallet,data=df,order=df["gpu_brand"].value_counts().index)
sns.boxplot(x ='gpu_brand',y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by GPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by GPU Brand",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="cpu_brand",ax=axes[0],palette=pallet,data=df,order=df["cpu_brand"].value_counts().index)
sns.boxplot(x ="cpu_brand",y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by CPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by CPU Brand",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)
sns.countplot(x="opsys",ax=axes[0],palette=pallet,data=df,order=df["opsys"].value_counts().index)
sns.boxplot(x ="opsys",y ='price_euros' ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by Operating System",fontsize=15)
axes[1].set_title("Laptop Price by Operating System",fontsize=15)
plt.show()

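Intel is the most common CPU brand and also the most expensive. AMD is a distant second.

The most common CPU models are all Intel: Core i5 7200U, Core i7 7700HQ, and Core i7 7500U.

Intel is also the most frequent GPU brand, followed by Nvidia and AMD.

The most common GPU models are Intel HD Graphics 620 and HD Graphics 520.

Nvidia is the most expensive GPU brand and AMD the cheapest. The number of laptops with AMD CPUs equals the number with AMD GPUs, because laptops with AMD CPUs also have AMD GPUs (this only holds if no laptop pairs a non-AMD CPU with an AMD GPU).

There are 5 operating systems: Windows, macOS, Chrome OS, Linux, and Android. Some laptops ship with no operating system. macOS laptops have the highest average price and Linux laptops the lowest.

One laptop has an ARM GPU and a Samsung CPU; it is dropped.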
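A cross-tabulation is a direct way to check how CPU and GPU brands pair up. The sketch below uses made-up rows that deliberately include one Intel CPU paired with an AMD GPU, to show how the two AMD totals could diverge. It says nothing about the real dataset, where `df["cpu_brand"]` and `df["gpu_brand"]` would be used instead.

```python
import pandas as pd

# Hypothetical brand pairs for illustration only.
toy = pd.DataFrame({
    "cpu_brand": ["Intel", "Intel", "Intel", "AMD", "AMD"],
    "gpu_brand": ["Intel", "Nvidia", "AMD", "AMD", "AMD"],
})

# Rows are CPU brands, columns are GPU brands, cells are laptop counts.
table = pd.crosstab(toy["cpu_brand"], toy["gpu_brand"])

amd_cpu_total = int(table.loc["AMD"].sum())  # laptops with an AMD CPU
amd_gpu_total = int(table["AMD"].sum())      # laptops with an AMD GPU

print(table)
print("AMD CPU laptops:", amd_cpu_total, "| AMD GPU laptops:", amd_gpu_total)
```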

df=df[df["gpu_brand"]!="ARM"]
# Weight, screen size, and screen resolution
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="weight",ax=axes[0],color="grey",data=df)
sns.countplot(x="inches",ax=axes[1],palette=pallet,order=df["inches"].value_counts().index,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Weight Distribution",fontsize=15)
axes[1].set_title("Screen Size Distribution",fontsize=15)
plt.show()

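More than half of the laptops have a 15.6-inch screen.

Most laptops have one of these screen sizes: [15.6, 17.3, 14, 13.3, 12.5, 11.6].

The most common weights are 2.2, 2.0, 2.4, and 2.5 kg, and most laptops weigh between 1.3 and 2.5 kg.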
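The list of common screen sizes can be derived from the data instead of being typed by hand. This is a minimal sketch on made-up values; `MIN_SHARE` is an assumed cutoff, and with the real data `df["inches"]` would replace `inches`.

```python
import pandas as pd

# Hypothetical screen sizes in inches (illustration only).
inches = pd.Series([15.6] * 6 + [14.0] * 2 + [13.3, 10.1])

share = inches.value_counts(normalize=True)  # fraction of laptops per size
MIN_SHARE = 0.15                             # assumed cutoff for a "common" size
common_sizes = share[share >= MIN_SHARE].index.tolist()

# Keep only the laptops whose screen size is common.
filtered = inches[inches.isin(common_sizes)]

print(share)
print("kept sizes:", common_sizes, "| rows kept:", len(filtered))
```

`value_counts(normalize=True)` also gives a direct check of the "more than half" statement: the share of the top size must exceed 0.5.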

screen_size = [15.6,17.3,14,13.3,12.5,11.6]
df=df[df["inches"].isin(screen_size)]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="resolution",ax=axes[0],palette=pallet,order=df["resolution"].value_counts().index,data=df)
sns.boxplot(x ='price_euros',y ="resolution",ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by Screen Resolution",fontsize=15)
axes[1].set_title("Laptop Price by Screen Resolution",fontsize=15)
plt.show()

There are 15 distinct resolutions. 1920x1080 is the most common, with an average price of 1224.799 euros.

1366x768 is the lowest resolution and has the lowest average price.
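Resolution strings sort alphabetically by default, which hides any trend. One option, sketched below with a hypothetical helper `pixel_count`, is to order them by total pixel count.

```python
def pixel_count(resolution):
    """Turn a 'WIDTHxHEIGHT' string into the total number of pixels."""
    width, height = (int(value) for value in resolution.split("x"))
    return width * height

# A few resolution strings in the format produced by the preprocessing step.
resolutions = ["1920x1080", "1366x768", "3840x2160", "1600x900"]

# Lowest resolution first.
ordered = sorted(resolutions, key=pixel_count)
print(ordered)
```

The resulting list can be passed as the `order` argument of `sns.countplot` or `sns.boxplot`, so that the categories run from lowest to highest resolution.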

# RAM and storage drives
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="ram",ax=axes[0],palette=pallet,order=df["ram"].value_counts().index,data=df)
sns.boxplot(x ="ram",y ="price_euros" ,ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by RAM (GB)",fontsize=15)
axes[1].set_title("Laptop Price by RAM (GB)",fontsize=15)
plt.show()

RAM sizes go up to 64 GB, and 8 GB is the most common.

RAM and price are clearly related: price rises as RAM increases.
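The relationship between RAM and price can be quantified as well as plotted. The sketch below uses made-up rows and computes the median price per RAM size plus a rank correlation (Pearson correlation of the ranks, which is Spearman's coefficient). With the real data, `df` would replace `toy`.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "ram": [4, 4, 8, 8, 8, 16, 16, 32],
    "price_euros": [350, 450, 800, 950, 1100, 1500, 1900, 2800],
})

# Median price for each RAM size, sorted by RAM.
median_by_ram = toy.groupby("ram")["price_euros"].median()

# Rank correlation: close to 1 when price rises consistently with RAM.
rank_corr = toy["ram"].rank().corr(toy["price_euros"].rank())

print(median_by_ram)
print(f"rank correlation = {rank_corr:.3f}")
```

A rank correlation is used here because RAM takes only a few discrete values and price is skewed, so a monotonic trend matters more than a strictly linear one.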

fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_type",ax=axes[0],order=df["memory1_type"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_type",ax=axes[1],order=df["memory2_type"].value_counts().index,palette=pallet,data=df)
axes[0].set_title("Number of Laptops by the Type of the 1st Hard Drive",fontsize=15)
axes[1].set_title("Number of Laptops by the Type of the 2nd Hard Drive",fontsize=15)
plt.show()

Only 3 laptops have 24 GB of RAM and 1 has 64 GB, so they are dropped.

df = df[~df["ram"].isin([24, 64])]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_capacity",ax=axes[0],order=df["memory1_capacity"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_capacity",ax=axes[1],order=df["memory2_capacity"].value_counts().index,palette=pallet,data=df[df["memory2_capacity"]!=0])
axes[0].set_title("Number of Laptops by the 1st Hard Drive Capacity",fontsize=15)
axes[1].set_title("Number of Laptops by the 2nd Hard Drive Capacity",fontsize=15)
plt.show()

df["total_memory"]=df["memory1_capacity"]+df["memory2_capacity"]
sns.boxplot(x ="total_memory",y ="price_euros" ,palette=pallet,data = df)
plt.xticks(rotation=70)
plt.title("Laptop Price by Hard Drive Capacity",fontsize=15)
plt.show()

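180 laptops have two storage drives.

The most common first drive is a 256 GB SSD.

The most common second drive is a 1000 GB HDD.

The second drive usually has a large capacity.

Storage capacity affects price, but the relationship does not appear to be strong.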

# Correlation matrix
numerical_columns = df.select_dtypes(include='number').columns
corr_matrix = df[numerical_columns].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='crest', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix', fontsize=10)
plt.xticks(rotation=45, fontsize=8)
plt.yticks(fontsize=8)
plt.show()
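A heatmap is easy to scan but hard to quote from. As a small sketch, the correlations with price can be pulled out of the matrix and ranked. The frame below is made up; on the real data, the `corr_matrix` computed above could be used directly.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "ram": [4, 8, 8, 16, 16, 32],
    "weight": [2.2, 2.0, 1.3, 2.5, 1.8, 2.9],
    "price_euros": [400.0, 800.0, 1000.0, 1500.0, 1700.0, 2800.0],
    "company": ["Acer", "HP", "Dell", "MSI", "Dell", "Razer"],
})

# Non-numeric columns such as `company` are excluded before computing correlations.
corr_matrix = toy.select_dtypes(include="number").corr()

# Correlation of every other numeric column with price.
price_corr = corr_matrix["price_euros"].drop("price_euros")
strongest = price_corr.abs().idxmax()

print(price_corr.sort_values(ascending=False))
print("strongest association with price:", strongest)
```

Correlation only measures linear association between numeric columns, so categorical variables such as brand or operating system need the box plots shown earlier.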

Source code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re  # standard-library regex module; only re.IGNORECASE is needed below
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("laptop_price.csv",encoding='latin-1') 
df.head()
df.shape
df.info()
df.describe()
df.describe(include='O')
df.isnull().sum()
df.duplicated().sum()
# Lowercase the column names so they are easier to type.
df = df.rename(columns=str.lower)
# Derive new variables from the raw columns
df['resolution'] = df['screenresolution'].str.extract(r'(\d+x\d+)')
df['touch_screen'] = df['screenresolution'].str.extract(r'(Touchscreen)', flags=re.IGNORECASE).fillna('No')
df['touch_screen']=df['touch_screen'].replace("Touchscreen","Yes")
df['IPS Panel_screen'] = df['screenresolution'].str.extract(r'(IPS Panel)',re.IGNORECASE).fillna('No')
df['IPS Panel_screen']=df['IPS Panel_screen'].replace("IPS Panel","Yes")
df = df.drop('screenresolution', axis=1)
df['ram'] = df['ram'].str.replace('GB', '')
df['weight'] = df['weight'].str.replace('kg', '')
df['ram']=df['ram'].astype(int)
df['weight']=df['weight'].astype(float)
# Capture only the number; the optional decimal part also matches values such as "3GHz".
df['cpu_frq(GHz)'] = df['cpu'].str.extract(r"(\d+(?:\.\d+)?)\s*GHz", expand=False).astype(float)
df['cpu_brand']=df['cpu'].str.extract(r"^([\w\-]+)")
# Convert terabyte values to gigabytes ("1.0TB" first, so that "1TB" covers both spellings).
df['memory'] = df['memory'].str.replace('1.0TB', '1TB', regex=False)
df['memory'] = df['memory'].str.replace('1TB', '1000GB', regex=False)
df['memory'] = df['memory'].str.replace('2TB', '2000GB', regex=False)
# Split "first drive + second drive" into two columns; single-drive rows get NaN in the second.
df[['memory1_type','memory2_type']] = df['memory'].str.split('+', n=1, expand=True)
# Extract each capacity in GB, then strip it from the type columns.
capacity_pattern = r"(\d+(?:\.\d+)?)\s*GB"
df['memory1_capacity'] = df['memory1_type'].str.extract(capacity_pattern, expand=False).astype(float)
# A missing second drive counts as 0 GB.
df['memory2_capacity'] = df['memory2_type'].str.extract(capacity_pattern, expand=False).astype(float).fillna(0)
df['memory1_type'] = df['memory1_type'].str.replace(capacity_pattern, '', regex=True).str.strip()
df['memory2_type'] = df['memory2_type'].str.replace(capacity_pattern, '', regex=True).str.strip()
df=df.drop(['memory'], axis=1)
df["gpu_brand"]=df['gpu'].str.extract(r"^([\w\-]+)")
df['opsys']=df['opsys'].replace({'Windows 10' : 'Windows', 'Windows 10 S' : 'Windows',
                                  'Windows 7' : 'Windows', 'Mac OS X' : 'macOS' }) 
df.head()
# Plot settings
sns.set(rc={"axes.facecolor":"#FAF3FC","figure.facecolor":"#FAF3FC",'figure.figsize':(14,5)})
pallet = ["#998289","#9981A0","#F5B7B1","#F9E79F"]
# Manufacturer and laptop type
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="company",ax=axes[0],palette=pallet,data=df,order=df["company"].value_counts().index)
sns.boxplot(x ='price_euros',y ="company" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Company",fontsize=15)
axes[1].set_title("Laptop Price by Company",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="typename",ax=axes[0],palette=pallet,data=df,order=df["typename"].value_counts().index)
sns.boxplot(x ='price_euros',y ="typename" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Type",fontsize=15)
axes[1].set_title("Laptop Price by Type ",fontsize=15)
plt.show()
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="price_euros",ax=axes[0],color="grey",data=df)
sns.barplot(x="company",y='price_euros',estimator=np.mean,ax=axes[1],palette=pallet,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Price Distribution",fontsize=15)
axes[1].set_title("Average Price for Each Company ",fontsize=15)
plt.show()
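# Dell, Lenovo, HP, Asus, and Acer are the most common brands in the dataset.
# Samsung, Razer, Mediacom, Microsoft, Xiaomi, Vero, Chuwi, Google, Fujitsu, LG, and Huawei
# each have fewer than 10 laptops. Razer is the most expensive brand (only 7 laptops).
# Among the most common brands, MSI has the highest average price.
# Dell, Lenovo, HP, and Asus laptops average around 1000 euros.
# Vero has the lowest average price overall, and Acer is the cheapest of the common brands.
# Prices range from about 200 to 6000 euros, but most laptops cost less than 4000 euros.
# The dataset contains 6 laptop types; Notebook is the most common and Netbook the least common.
# Notebooks and Netbooks have the lowest average prices.
df[df["price_euros"] > 4000]
# Only 4 laptops cost more than 4000 euros: two from Razer, one from HP, and one from Lenovo.
# Prices above 4000 euros are treated as outliers and dropped.
df = df[df["price_euros"] < 4000]
# CPU, GPU, and operating system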
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="gpu_brand",ax=axes[0],palette=pallet,data=df,order=df["gpu_brand"].value_counts().index)
sns.boxplot(x ='gpu_brand',y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by GPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by GPU Brand",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="cpu_brand",ax=axes[0],palette=pallet,data=df,order=df["cpu_brand"].value_counts().index)
sns.boxplot(x ="cpu_brand",y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by CPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by CPU Brand",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)
sns.countplot(x="opsys",ax=axes[0],palette=pallet,data=df,order=df["opsys"].value_counts().index)
sns.boxplot(x ="opsys",y ='price_euros' ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by Operating System",fontsize=15)
axes[1].set_title("Laptop Price by Operating System",fontsize=15)
plt.show()
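# Intel is the most common CPU brand and also the most expensive. AMD is a distant second.
# The most common CPU models are all Intel: Core i5 7200U, Core i7 7700HQ, and Core i7 7500U.
# Intel is also the most frequent GPU brand, followed by Nvidia and AMD.
# The most common GPU models are Intel HD Graphics 620 and HD Graphics 520.
# Nvidia is the most expensive GPU brand and AMD the cheapest. The number of laptops with AMD CPUs
# equals the number with AMD GPUs, because laptops with AMD CPUs also have AMD GPUs.
# There are 5 operating systems: Windows, macOS, Chrome OS, Linux, and Android. Some laptops ship
# with no operating system. macOS laptops have the highest average price and Linux laptops the lowest.
# One laptop has an ARM GPU and a Samsung CPU; it is dropped.
df = df[df["gpu_brand"] != "ARM"]
# Weight, screen size, and screen resolution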
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="weight",ax=axes[0],color="grey",data=df)
sns.countplot(x="inches",ax=axes[1],palette=pallet,order=df["inches"].value_counts().index,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Weight Distribution",fontsize=15)
axes[1].set_title("Screen Size Distribution",fontsize=15)
plt.show()
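# More than half of the laptops have a 15.6-inch screen.
# Most laptops have one of these screen sizes: [15.6, 17.3, 14, 13.3, 12.5, 11.6].
# The most common weights are 2.2, 2.0, 2.4, and 2.5 kg; most laptops weigh between 1.3 and 2.5 kg.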
screen_size = [15.6,17.3,14,13.3,12.5,11.6]
df=df[df["inches"].isin(screen_size)]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="resolution",ax=axes[0],palette=pallet,order=df["resolution"].value_counts().index,data=df)
sns.boxplot(x ='price_euros',y ="resolution",ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by Screen Resolution",fontsize=15)
axes[1].set_title("Laptop Price by Screen Resolution",fontsize=15)
plt.show()
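# There are 15 distinct resolutions. 1920x1080 is the most common, with an average price of 1224.799 euros.
# 1366x768 is the lowest resolution and has the lowest average price.
# RAM and storage drives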
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="ram",ax=axes[0],palette=pallet,order=df["ram"].value_counts().index,data=df)
sns.boxplot(x ="ram",y ="price_euros" ,ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by RAM (GB)",fontsize=15)
axes[1].set_title("Laptop Price by RAM (GB)",fontsize=15)
plt.show()
# RAM sizes go up to 64 GB, and 8 GB is the most common.
# RAM and price are clearly related: price rises as RAM increases.
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_type",ax=axes[0],order=df["memory1_type"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_type",ax=axes[1],order=df["memory2_type"].value_counts().index,palette=pallet,data=df)
axes[0].set_title("Number of Laptops by the Type of the 1st Hard Drive",fontsize=15)
axes[1].set_title("Number of Laptops by the Type of the 2nd Hard Drive",fontsize=15)
plt.show()
# Only 3 laptops have 24 GB of RAM and 1 has 64 GB, so they are dropped.
df = df[~df["ram"].isin([24, 64])]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_capacity",ax=axes[0],order=df["memory1_capacity"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_capacity",ax=axes[1],order=df["memory2_capacity"].value_counts().index,palette=pallet,data=df[df["memory2_capacity"]!=0])
axes[0].set_title("Number of Laptops by the 1st Hard Drive Capacity",fontsize=15)
axes[1].set_title("Number of Laptops by the 2nd Hard Drive Capacity",fontsize=15)
plt.show()
df["total_memory"]=df["memory1_capacity"]+df["memory2_capacity"]
sns.boxplot(x ="total_memory",y ="price_euros" ,palette=pallet,data = df)
plt.xticks(rotation=70)
plt.title("Laptop Price by Hard Drive Capacity",fontsize=15)
plt.show()
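# 180 laptops have two storage drives.
# The most common first drive is a 256 GB SSD.
# The most common second drive is a 1000 GB HDD.
# The second drive usually has a large capacity.
# Storage capacity affects price, but the relationship does not appear to be strong.
# Correlation matrix
numerical_columns = df.select_dtypes(include='number').columns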
corr_matrix = df[numerical_columns].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='crest', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix', fontsize=10)
plt.xticks(rotation=45, fontsize=8)
plt.yticks(fontsize=8)
plt.show()

资料获取，更多粉丝福利，关注下方公众号获取

🤵‍♂️ 个人主页：@艾派森的个人主页

✍🏻作者简介：Python学习者
🐋 希望大家多多支持，我们一起进步！😄
如果文章对你有帮助的话，
欢迎评论 💬点赞👍🏻 收藏 📂加关注+

1.项目背景

2.数据集介绍

3.技术工具

4.导入数据

5.数据预处理

5.数据可视化

源代码

1.项目背景

2.数据集介绍

本实验数据集来源于Kaggle，原始数据集共有1303条数据，13个变量，各变量含义如下：

0 laptop_ID-数字-产品ID

1 Company-字符串-笔记本电脑制造商

2 Product-字符串-品牌和型号

3 TypeName-字符串-类型（笔记本电脑、超极本、游戏机等）

4 Inches-数字-屏幕尺寸

5 ScreenResolution-字符串-屏幕分辨率

6 Cpu-字符串-中央处理器 (CPU)

7 Ram-字符串-笔记本电脑 RAM

8 Memory-字符串-硬盘/SSD 内存

9 GPU-字符串-图形处理单元 (GPU)

10 OpSys-字符串-操作系统

11 Weight-字符串-笔记本电脑重量

12 Price_euros-数字-价格（欧元）

3.技术工具

Python版本:3.9

代码编辑器：jupyter notebook

4.导入数据

导入第三方库并加载数据集

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("laptop_price.csv",encoding='latin-1') 
df.head()

查看数据大小

查看数据基本信息

查看数值型变量的描述性统计

查看非数值型变量的描述性统计

5.数据预处理

查看缺失值和重复值情况

发现并不存在

变量处理

# 将列名改为小写，以便于书写。
df = df.rename(columns=str.lower)
# 处理各大变量
df['resolution'] = df['screenresolution'].str.extract(r'(\d+x\d+)')
df['touch_screen'] = df['screenresolution'].str.extract(r'(Touchscreen)',re.IGNORECASE).fillna('NO')
df['touch_screen']=df['touch_screen'].replace("Touchscreen","Yes")
df['IPS Panel_screen'] = df['screenresolution'].str.extract(r'(IPS Panel)',re.IGNORECASE).fillna('No')
df['IPS Panel_screen']=df['IPS Panel_screen'].replace("IPS Panel","Yes")
df = df.drop('screenresolution', axis=1)
df['ram'] = df['ram'].str.replace('GB', '')
df['weight'] = df['weight'].str.replace('kg', '')
df['ram']=df['ram'].astype(int)
df['weight']=df['weight'].astype(float)
df['cpu_frq(GHz)']=df['cpu'].str.extract(r"(\d+(?:\.\d+)\s*GHz)")
df['cpu_frq(GHz)']=df['cpu_frq(GHz)'].str.replace("GHz","")
df['cpu_frq(GHz)']=df['cpu_frq(GHz)'].astype(float)
df['cpu_brand']=df['cpu'].str.extract(r"^([\w\-]+)")
df['memory']=df['memory'].str.replace('2TB','2000GB')
df['memory']=df['memory'].str.replace('1.0TB','1TB', regex=True)
df['memory']=df['memory'].str.replace('1TB','1000GB')
df[['memory1_type','memory2_type']]=df['memory'].str.split('+', expand=True)
df['memory1_capacity']=df['memory'].str.extract(r"([\d. +-/]+)\s*GB")
df['memory1_type']=df['memory1_type'].str.replace(r"([\d. +-/]+)\s*GB", '', regex=True)
df['memory2_capacity']=df['memory2_type'].str.extract(r"([\d. +-/]+)\s*GB")
df['memory2_type']=df['memory2_type'].str.replace(r"([\d. +-/]+)\s*GB", '', regex=True)
df['memory1_capacity']=df['memory1_capacity'].astype(float)
df['memory2_capacity']=df['memory2_capacity'].astype(float)
df["memory2_capacity"]= df["memory2_capacity"].replace({'NaN': np.nan})
df["memory2_capacity"]= df["memory2_capacity"].fillna(0)
df=df.drop(['memory'], axis=1)
df["gpu_brand"]=df['gpu'].str.extract(r"^([\w\-]+)")
df['opsys']=df['opsys'].replace({'Windows 10' : 'Windows', 'Windows 10 S' : 'Windows',
                                  'Windows 7' : 'Windows', 'Mac OS X' : 'macOS' }) 
df.head()

5.数据可视化

# 可视化设置
sns.set(rc={"axes.facecolor":"#FAF3FC","figure.facecolor":"#FAF3FC",'figure.figsize':(14,5)})
pallet = ["#998289","#9981A0","#F5B7B1","#F9E79F"]

# 制造及型号
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="company",ax=axes[0],palette=pallet,data=df,order=df["company"].value_counts().index)
sns.boxplot(x ='price_euros',y ="company" ,palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Company",fontsize=15)
axes[1].set_title("Laptop Price by Company",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="typename",ax=axes[0],palette=pallet,data=df,order=df["typename"].value_counts().index)
sns.boxplot(x ='price_euros',y ="typename" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Type",fontsize=15)
axes[1].set_title("Laptop Price by Type ",fontsize=15)
plt.show()

fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="price_euros",ax=axes[0],color="grey",data=df)
sns.barplot(x="company",y='price_euros',estimator=np.mean,ax=axes[1],palette=pallet,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Price Distribution",fontsize=15)
axes[1].set_title("Average Price for Each Company ",fontsize=15)
plt.show()

戴尔、联想、惠普、华硕和宏碁笔记本电脑在我们的数据集中是最常见的。

在数据集中最常见的公司中，MSI笔记本电脑的平均价格是最贵的

戴尔(Dell)、联想(Lenovo)、惠普(HP)和华硕(Asus)的笔记本电脑平均价格在1000欧元左右。

笔记本电脑是最常见的类型。

我们的数据库中有6种笔记本电脑，最流行的是笔记本电脑，最不流行的是上网本。

笔记本和上网本的平均价格最低。

df=df[df["price_euros"]<4000]
# CPU、GPU、操作系统
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="gpu_brand",ax=axes[0],palette=pallet,data=df,order=df["gpu_brand"].value_counts().index)
sns.boxplot(x ='gpu_brand',y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by GPU Brand",fontsize=15)
axes[1].set_title(" Laptop Price by CPU Brand",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="cpu_brand",ax=axes[0],palette=pallet,data=df,order=df["cpu_brand"].value_counts().index)
sns.boxplot(x ="cpu_brand",y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by  CPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by CPU Brand",fontsize=15)
plt.show()

fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)
sns.countplot(x="opsys",ax=axes[0],palette=pallet,data=df,order=df["opsys"].value_counts().index)
sns.boxplot(x ="opsys",y ='price_euros' ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by Operating System",fontsize=15)
axes[1].set_title("Laptop Price by Operating System",fontsize=15)
plt.show()

英特尔是最常见的CPU品牌，也是最贵的。AMD排名第二，差距很大。

Intel (Core i5 7200U, Core i7 7700HQ, Core i7 7500U)是最常见的cpu类型。

英特尔是出现频率最高的GPU品牌，其次是英伟达和AMD。

英特尔(HD Graphics 620和HD Graphics 520)是最常见的GPU类型。

Nvidia是最昂贵的GPU, AMD是最便宜的。使用AMD cpu的笔记本电脑的数量等于使用AMD gpu的笔记本电脑的数量，因为使用AMD cpu的笔记本电脑也有AMD gpu。

有一款笔记本电脑带有ARM GPU品牌和三星CPU品牌，我会放弃它。

df=df[df["gpu_brand"]!="ARM"]
# 重量，屏幕尺寸和屏幕分辨率
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="weight",ax=axes[0],color="grey",data=df)
sns.countplot(x="inches",ax=axes[1],palette=pallet,order=df["inches"].value_counts().index,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Weight Distribution",fontsize=15)
axes[1].set_title("Screen Size Distribution",fontsize=15)
plt.show()

这款笔记本电脑一半以上的屏幕都是15.6英寸。

[15.6,17.3,14,13.3,12.5,11.6]是大多数笔记本电脑的屏幕尺寸。

[2.20,2.20,2.00,2.4,2.5]是最常见的重量，大多数笔记本电脑的重量分布在1.3 - 2.5之间。

screen_size = [15.6,17.3,14,13.3,12.5,11.6]
df=df[df["inches"].isin(screen_size)]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="resolution",ax=axes[0],palette=pallet,order=df["resolution"].value_counts().index,data=df)
sns.boxplot(x ='price_euros',y ="resolution",ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by Screen Resolution",fontsize=15)
axes[1].set_title("Laptop Price by Screen Resolution",fontsize=15)
plt.show()

我们有15种不同的分辨率。1920x1080是数据集中最常见的屏幕分辨率，平均价格为1224.799欧元

1366x768分辨率最差，平均价格最低。

# RAM和硬盘
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="ram",ax=axes[0],palette=pallet,order=df["ram"].value_counts().index,data=df)
sns.boxplot(x ="ram",y ="price_euros" ,palette=pallet,data = df)
axes[0].set_title("Number of Laptops by RAM (GB)",fontsize=15)
axes[1].set_title("Laptop Price by RAM (GB)",fontsize=15)
plt.show()

内存范围:8gb ~ 64gb。最常见的内存是8gb

内存与价格之间存在明显的关系，随着内存的增加，价格也会随之增加

fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_type",ax=axes[0],order=df["memory1_type"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_type",ax=axes[1],order=df["memory2_type"].value_counts().index,palette=pallet,data=df)
axes[0].set_title("Number of Laptops by the Type of the 1st Hard Drive",fontsize=15)
axes[1].set_title("Number of Laptops by the Type of the 2nd Hard Drive",fontsize=15)
plt.show()

df=df[df["ram"].isin([24,64]) == False]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_capacity",ax=axes[0],order=df["memory1_capacity"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_capacity",ax=axes[1],order=df["memory2_capacity"].value_counts().index,palette=pallet,data=df[df["memory2_capacity"]!=0])
axes[0].set_title("Number of Laptops by th 1st Hard Drive Capacity",fontsize=15)
axes[1].set_title("Number of Laptops by th 2nd Hard Drive Capacity",fontsize=15)
plt.show()

df["total_memory"]=df["memory1_capacity"]+df["memory2_capacity"]
sns.boxplot(x ="total_memory",y ="price_euros" ,palette=pallet,data = df)
plt.xticks(rotation=70)
plt.title("Laptop Price by Hard Drive Capacity",fontsize=15)
plt.show()

我们有180台笔记本电脑和2个硬盘。

第一块硬盘最常见的类型是256 GB的SSD。

第二个硬盘驱动器最常见的类型是1000 GB的HDD。

第二块硬盘通常具有高容量。

硬盘容量影响价格，但这两个变量之间的关系似乎不是很强

# 相关矩阵
numerical_columns = df.select_dtypes(include=['int', 'float']).columns
corr_matrix = df[numerical_columns].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='crest', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix', fontsize=10)
plt.xticks(rotation=45, fontsize=8)
plt.yticks(fontsize=8)
plt.show()

源代码

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("laptop_price.csv",encoding='latin-1') 
df.head()
df.shape
df.info()
df.describe()
df.describe(include='O')
df.isnull().sum()
df.duplicated().sum()
# 将列名改为小写，以便于书写。
df = df.rename(columns=str.lower)
# 处理各大变量
df['resolution'] = df['screenresolution'].str.extract(r'(\d+x\d+)')
df['touch_screen'] = df['screenresolution'].str.extract(r'(Touchscreen)',re.IGNORECASE).fillna('NO')
df['touch_screen']=df['touch_screen'].replace("Touchscreen","Yes")
df['IPS Panel_screen'] = df['screenresolution'].str.extract(r'(IPS Panel)',re.IGNORECASE).fillna('No')
df['IPS Panel_screen']=df['IPS Panel_screen'].replace("IPS Panel","Yes")
df = df.drop('screenresolution', axis=1)
df['ram'] = df['ram'].str.replace('GB', '')
df['weight'] = df['weight'].str.replace('kg', '')
df['ram']=df['ram'].astype(int)
df['weight']=df['weight'].astype(float)
df['cpu_frq(GHz)']=df['cpu'].str.extract(r"(\d+(?:\.\d+)\s*GHz)")
df['cpu_frq(GHz)']=df['cpu_frq(GHz)'].str.replace("GHz","")
df['cpu_frq(GHz)']=df['cpu_frq(GHz)'].astype(float)
df['cpu_brand']=df['cpu'].str.extract(r"^([\w\-]+)")
df['memory']=df['memory'].str.replace('2TB','2000GB')
df['memory']=df['memory'].str.replace('1.0TB','1TB', regex=True)
df['memory']=df['memory'].str.replace('1TB','1000GB')
df[['memory1_type','memory2_type']]=df['memory'].str.split('+', expand=True)
df['memory1_capacity']=df['memory'].str.extract(r"([\d. +-/]+)\s*GB")
df['memory1_type']=df['memory1_type'].str.replace(r"([\d. +-/]+)\s*GB", '', regex=True)
df['memory2_capacity']=df['memory2_type'].str.extract(r"([\d. +-/]+)\s*GB")
df['memory2_type']=df['memory2_type'].str.replace(r"([\d. +-/]+)\s*GB", '', regex=True)
df['memory1_capacity']=df['memory1_capacity'].astype(float)
df['memory2_capacity']=df['memory2_capacity'].astype(float)
df["memory2_capacity"]= df["memory2_capacity"].replace({'NaN': np.nan})
df["memory2_capacity"]= df["memory2_capacity"].fillna(0)
df=df.drop(['memory'], axis=1)
df["gpu_brand"]=df['gpu'].str.extract(r"^([\w\-]+)")
df['opsys']=df['opsys'].replace({'Windows 10' : 'Windows', 'Windows 10 S' : 'Windows',
                                  'Windows 7' : 'Windows', 'Mac OS X' : 'macOS' }) 
df.head()
# 可视化设置
sns.set(rc={"axes.facecolor":"#FAF3FC","figure.facecolor":"#FAF3FC",'figure.figsize':(14,5)})
pallet = ["#998289","#9981A0","#F5B7B1","#F9E79F"]
# 制造及型号
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="company",ax=axes[0],palette=pallet,data=df,order=df["company"].value_counts().index)
sns.boxplot(x ='price_euros',y ="company" ,palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Company",fontsize=15)
axes[1].set_title("Laptop Price by Company",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="typename",ax=axes[0],palette=pallet,data=df,order=df["typename"].value_counts().index)
sns.boxplot(x ='price_euros',y ="typename" ,ax=axes[1],palette=pallet,data = df)

axes[0].set_title("Number of Laptops by Type",fontsize=15)
axes[1].set_title("Laptop Price by Type",fontsize=15)
plt.show()
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="price_euros",ax=axes[0],color="grey",data=df)
sns.barplot(x="company",y='price_euros',estimator=np.mean,ax=axes[1],palette=pallet,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Price Distribution",fontsize=15)
axes[1].set_title("Average Price for Each Company",fontsize=15)
plt.show()
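Dell, Lenovo, HP, Asus and Acer are the most common laptop brands in the dataset.
Samsung, Razer, Mediacom, Microsoft, Xiaomi, Vero, Chuwi, Google, Fujitsu, LG and Huawei each have fewer than 10 laptops in the dataset. Razer laptops are the most expensive, although there are only 7 of them.
Among the most common companies, MSI laptops have the highest average price.
Dell, Lenovo, HP and Asus laptops average around 1000 euros.
Overall, Vero laptops are the cheapest on average, and Acer is the cheapest among the most common brands. Prices range from about 200 to 6000 euros, but most laptops cost less than 4000 euros.
The dataset contains 6 laptop types. Notebook is the most common type and Netbook the least common.
Notebooks and Netbooks have the lowest average prices.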
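Statements such as "most common" and "highest average price among the common brands" can also be read from a small summary table instead of from the plots. A minimal sketch, assuming toy data with the same `company` and `price_euros` columns; the rows and the `min_count` threshold are made up for illustration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "company": ["Dell", "Dell", "Dell", "Acer", "Acer", "Razer"],
    "price_euros": [900.0, 1100.0, 1000.0, 500.0, 700.0, 3000.0],
})

# Count, mean and median price per company, most frequent first.
summary = (
    toy.groupby("company")["price_euros"]
    .agg(count="count", mean_price="mean", median_price="median")
    .sort_values("count", ascending=False)
)

# Compare averages only among brands with enough laptops (threshold is an assumption).
min_count = 2
common = summary[summary["count"] >= min_count]
most_expensive_common = common["mean_price"].idxmax()

print(summary)
print("Most expensive among common brands:", most_expensive_common)
```

Keeping the count next to the mean makes it clear when an average rests on only a handful of laptops, as with Razer in the text above.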
df[df["price_euros"]>4000]
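Only 4 laptops cost more than 4000 euros: two from Razer, one from HP and one from Lenovo.
Prices above 4000 euros are therefore treated as outliers, and these rows are dropped.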
df=df[df["price_euros"]<4000]
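The 4000 euro cutoff is chosen by eye from the histogram. As a cross-check, a common rule of thumb flags values above Q3 + 1.5 × IQR. The sketch below compares the two on invented prices; the IQR rule is an alternative technique swapped in for illustration, not what the notebook uses.

```python
import pandas as pd

# Invented prices; the real column is df["price_euros"].
prices = pd.Series(
    [300, 450, 600, 800, 1000, 1200, 1500, 2000, 4900, 6100],
    name="price_euros",
    dtype=float,
)

# Fixed cutoff, as in the notebook.
above_cutoff = prices[prices > 4000]

# Tukey's rule: anything above Q3 + 1.5 * IQR is a candidate outlier.
q1, q3 = prices.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
above_fence = prices[prices > upper_fence]

print("Upper fence:", upper_fence)
print("Above 4000:", above_cutoff.tolist())
print("Above fence:", above_fence.tolist())
```

On right-skewed price data the IQR fence can be much lower than a hand-picked cutoff, so it is worth checking how many rows each rule would remove before dropping anything.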
# CPU, GPU and operating system
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="gpu_brand",ax=axes[0],palette=pallet,data=df,order=df["gpu_brand"].value_counts().index)
sns.boxplot(x ='gpu_brand',y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by GPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by GPU Brand",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="cpu_brand",ax=axes[0],palette=pallet,data=df,order=df["cpu_brand"].value_counts().index)
sns.boxplot(x ="cpu_brand",y ="price_euros" ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by CPU Brand",fontsize=15)
axes[1].set_title("Laptop Price by CPU Brand",fontsize=15)
plt.show()
fig, axes = plt.subplots(1,2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)
sns.countplot(x="opsys",ax=axes[0],palette=pallet,data=df,order=df["opsys"].value_counts().index)
sns.boxplot(x ="opsys",y ='price_euros' ,palette=pallet,ax=axes[1],data = df)
axes[0].set_title("Number of Laptops by Operating System",fontsize=15)
axes[1].set_title("Laptop Price by Operating System",fontsize=15)
plt.show()
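Intel is by far the most common CPU brand and also the most expensive; AMD is a distant second.
The most common CPU models are Intel's Core i5 7200U, Core i7 7700HQ and Core i7 7500U.
Intel is also the most frequent GPU brand, followed by Nvidia and AMD.
Intel HD Graphics 620 and HD Graphics 520 are the most common GPU models.
Nvidia GPUs are the most expensive and AMD GPUs the cheapest. The number of laptops with AMD CPUs is said to equal the number with AMD GPUs, because laptops with AMD CPUs also use AMD GPUs.
There are 5 operating systems: Windows, macOS, Chrome OS, Linux and Android. Some laptops also ship without an operating system (No OS). Laptops running macOS have the highest average price, and those running Linux the lowest.
One laptop has an ARM GPU and a Samsung CPU; it is dropped.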
df=df[df["gpu_brand"]!="ARM"]
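How CPU and GPU brands pair up is easiest to check with a cross-tabulation. A minimal sketch on invented rows with the same `cpu_brand` and `gpu_brand` columns; on the real DataFrame the same call shows whether the AMD CPU and AMD GPU counts actually match.

```python
import pandas as pd

# Invented brand pairs; the real columns come from the preprocessing step.
toy = pd.DataFrame({
    "cpu_brand": ["Intel", "Intel", "Intel", "AMD", "Intel", "Samsung"],
    "gpu_brand": ["Intel", "Nvidia", "AMD", "AMD", "Nvidia", "ARM"],
})

# Rows are CPU brands, columns are GPU brands, cells are laptop counts.
table = pd.crosstab(toy["cpu_brand"], toy["gpu_brand"])

amd_cpu = int((toy["cpu_brand"] == "AMD").sum())
amd_gpu = int((toy["gpu_brand"] == "AMD").sum())

print(table)
print("AMD CPUs:", amd_cpu, "AMD GPUs:", amd_gpu)
```

In this toy example the two counts differ because one Intel laptop carries an AMD GPU. The two totals are equal only if every AMD GPU sits in an AMD-CPU laptop and vice versa, which the off-diagonal cells of the table make visible.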
# Weight, screen size and screen resolution
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.histplot(x="weight",ax=axes[0],color="grey",data=df)
sns.countplot(x="inches",ax=axes[1],palette=pallet,order=df["inches"].value_counts().index,data=df)
plt.xticks(rotation=70)
axes[0].set_title("Weight Distribution",fontsize=15)
axes[1].set_title("Screen Size Distribution",fontsize=15)
plt.show()
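More than half of the laptops have a 15.6-inch screen.
Most laptops use one of the following screen sizes (in inches): [15.6, 17.3, 14, 13.3, 12.5, 11.6]. Only these sizes are kept.
The most common weights lie between 2.0 and 2.5 kg (2.0, 2.2, 2.4 and 2.5 kg), and most laptops weigh between 1.3 and 2.5 kg.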
screen_size = [15.6,17.3,14,13.3,12.5,11.6]
df=df[df["inches"].isin(screen_size)]
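Before filtering on a whitelist of screen sizes, it helps to look at the share of each size and at how many rows the filter removes. A minimal sketch on invented values for the `inches` column:

```python
import pandas as pd

# Invented screen sizes; the real column is df["inches"].
toy = pd.DataFrame({"inches": [15.6, 15.6, 15.6, 14.0, 13.3, 17.3, 10.1, 15.6]})

# Share of laptops per screen size (fractions that sum to 1).
share = toy["inches"].value_counts(normalize=True)

# Keep only the common sizes listed in the article and count what is lost.
screen_size = [15.6, 17.3, 14, 13.3, 12.5, 11.6]
kept = toy[toy["inches"].isin(screen_size)]
dropped = len(toy) - len(kept)

print(share)
print("Rows dropped:", dropped)
```

`value_counts(normalize=True)` returns fractions, so a claim like "more than half are 15.6 inches" corresponds to a value above 0.5 for that size.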
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="resolution",ax=axes[0],palette=pallet,order=df["resolution"].value_counts().index,data=df)
sns.boxplot(x ='price_euros',y ="resolution",ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by Screen Resolution",fontsize=15)
axes[1].set_title("Laptop Price by Screen Resolution",fontsize=15)
plt.show()
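There are 15 different resolutions. 1920x1080 is the most common screen resolution in the dataset, with an average price of about 1224.80 euros.
1366x768 is the lowest resolution and has the lowest average price.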
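Resolutions are stored as strings such as "1920x1080", so they sort alphabetically rather than by size. One way to rank them is by total pixel count. A minimal sketch on invented rows; the `pixels` column is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd

# Invented rows with the same column names as the preprocessed data.
toy = pd.DataFrame({
    "resolution": ["1366x768", "1920x1080", "1920x1080", "3840x2160", "1366x768"],
    "price_euros": [400.0, 1000.0, 1400.0, 2500.0, 600.0],
})

# Width and height as integers, then total pixels per screen.
wh = toy["resolution"].str.split("x", expand=True).astype(int)
toy["pixels"] = wh[0] * wh[1]

# Count and mean price per resolution, ordered from lowest to highest resolution.
by_res = (
    toy.groupby("resolution")
    .agg(pixels=("pixels", "first"), n=("price_euros", "size"), mean_price=("price_euros", "mean"))
    .sort_values("pixels")
)

print(by_res)
```

The resulting index can also be passed as `order=` to the seaborn plots so that the categories appear from lowest to highest resolution.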
# RAM and storage
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="ram",ax=axes[0],palette=pallet,order=df["ram"].value_counts().index,data=df)
sns.boxplot(x ="ram",y ="price_euros" ,ax=axes[1],palette=pallet,data = df)
axes[0].set_title("Number of Laptops by RAM (GB)",fontsize=15)
axes[1].set_title("Laptop Price by RAM (GB)",fontsize=15)
plt.show()
RAM goes up to 64 GB, and 8 GB is the most common size.
There is a clear relationship between RAM and price: as RAM increases, the price increases as well.
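The relationship between RAM and price can be quantified in addition to being read from the box plot. The sketch below, on invented rows, computes the median price per RAM size and a Spearman rank correlation (obtained here as the Pearson correlation of the ranks, which avoids an extra dependency).

```python
import pandas as pd

# Invented rows; the real columns are df["ram"] and df["price_euros"].
toy = pd.DataFrame({
    "ram": [4, 4, 8, 8, 16, 16, 32],
    "price_euros": [400.0, 500.0, 800.0, 1000.0, 1500.0, 1400.0, 2600.0],
})

# Median price for each RAM size, ordered by RAM.
median_by_ram = toy.groupby("ram")["price_euros"].median().sort_index()
rises_with_ram = median_by_ram.is_monotonic_increasing

# Spearman correlation = Pearson correlation of the ranks (ties get average ranks).
spearman = toy["ram"].rank().corr(toy["price_euros"].rank())

print(median_by_ram)
print("Median price rises with RAM:", rises_with_ram)
print("Spearman correlation:", round(spearman, 3))
```

A rank correlation is a reasonable choice here because RAM takes only a few discrete values and the relationship need not be linear.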
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_type",ax=axes[0],order=df["memory1_type"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_type",ax=axes[1],order=df["memory2_type"].value_counts().index,palette=pallet,data=df)
axes[0].set_title("Number of Laptops by the Type of the 1st Hard Drive",fontsize=15)
axes[1].set_title("Number of Laptops by the Type of the 2nd Hard Drive",fontsize=15)
plt.show()
Only 3 laptops have 24 GB of RAM and 1 has 64 GB, so these rows are dropped.
df=df[~df["ram"].isin([24,64])]
fig, axes = plt.subplots(nrows=1, ncols=2)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=70)

sns.countplot(x="memory1_capacity",ax=axes[0],order=df["memory1_capacity"].value_counts().index,palette=pallet,data=df)
sns.countplot(x="memory2_capacity",ax=axes[1],order=df.loc[df["memory2_capacity"]!=0,"memory2_capacity"].value_counts().index,palette=pallet,data=df[df["memory2_capacity"]!=0])
axes[0].set_title("Number of Laptops by the 1st Hard Drive Capacity",fontsize=15)
axes[1].set_title("Number of Laptops by the 2nd Hard Drive Capacity",fontsize=15)
plt.show()
df["total_memory"]=df["memory1_capacity"]+df["memory2_capacity"]
sns.boxplot(x ="total_memory",y ="price_euros" ,palette=pallet,data = df)
plt.xticks(rotation=70)
plt.title("Laptop Price by Hard Drive Capacity",fontsize=15)
plt.show()
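180 laptops have two storage drives.
The most common first drive is a 256 GB SSD.
The most common second drive is a 1000 GB HDD.
The second drive usually has a large capacity.
Storage capacity affects the price, but the relationship between the two variables does not appear to be strong.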
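The counts behind these statements can be derived directly from the capacity and type columns. A minimal sketch on invented rows that use the same column names as the preprocessed data:

```python
import pandas as pd

# Invented rows; a second-drive capacity of 0 means the laptop has a single drive.
toy = pd.DataFrame({
    "memory1_type": ["SSD", "SSD", "HDD", "SSD"],
    "memory1_capacity": [256.0, 128.0, 1000.0, 256.0],
    "memory2_type": [None, "HDD", None, "HDD"],
    "memory2_capacity": [0.0, 1000.0, 0.0, 1000.0],
})

# Number of laptops with a second drive.
two_drives = int((toy["memory2_capacity"] > 0).sum())

# Most common first-drive configuration, e.g. "256 GB SSD".
first_drive = toy["memory1_capacity"].astype(int).astype(str) + " GB " + toy["memory1_type"]
most_common_first = first_drive.value_counts().idxmax()

# Total capacity across both drives, as used for the final box plot.
toy["total_memory"] = toy["memory1_capacity"] + toy["memory2_capacity"]

print("Laptops with two drives:", two_drives)
print("Most common first drive:", most_common_first)
print(toy["total_memory"].tolist())
```

Note that summing capacities mixes SSD and HDD gigabytes, which may be one reason the relationship between total capacity and price looks weak.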
# Correlation matrix
numerical_columns = df.select_dtypes(include='number').columns
corr_matrix = df[numerical_columns].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='crest', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix', fontsize=10)
plt.xticks(rotation=45, fontsize=8)
plt.yticks(fontsize=8)
plt.show()
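A heatmap shows every pair of variables, but for this analysis the column of interest is the price. The sketch below, on invented rows, extracts the correlations with `price_euros` and sorts them by absolute value; the numbers are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Invented rows with a mix of numeric and non-numeric columns.
toy = pd.DataFrame({
    "company": ["Acer", "Dell", "HP", "Dell", "MSI", "Razer"],
    "ram": [4, 8, 8, 16, 16, 32],
    "weight": [2.2, 1.3, 2.0, 1.8, 2.5, 2.4],
    "price_euros": [450.0, 900.0, 800.0, 1500.0, 1400.0, 2600.0],
})

# Pearson correlations between the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with price, strongest first.
price_corr = corr["price_euros"].drop("price_euros").sort_values(key=abs, ascending=False)

print(price_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.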
