DataFrame and read data

julia

Published

February 19, 2023

数据框的生成

using DataFrames
da0 = DataFrame(
    name=["张三", "李四", "王五", "赵六"],
    age=[33, 42, missing, 51],
    sex=["M", "F", "M", "M"])
da1 = copy(da0)

	name	age	sex
	String	Int64?	String
1	张三	33	M
2	李四	42	F
3	王五	missing	M
4	赵六	51	M

其中使用了copy，生成了da0的副本。数据框属于一个可变数据类型（mutable type），若直接接受da0赋值给一个变量da1，则这两个变量实际上指向一个数据框，改变其中一个会同时改变两个。

da0 = DataFrame(
    "name" => ["张三", "李四", "王五", "赵六"],
    "age" => [33, 42, missing, 51],
    "sex" => ["M", "F", "M", "M"])

	name	age	sex
	String	Int64?	String
1	张三	33	M
2	李四	42	F
3	王五	missing	M
4	赵六	51	M

di = Dict(
    "name" => ["张三", "李四", "王五", "赵六"],
    "age" => [33, 42, missing, 51],
    "sex" => ["M", "F", "M", "M"])
da0 = DataFrame(di)

	age	name	sex
	Int64?	String	String
1	33	张三	M
2	42	李四	F
3	missing	王五	M
4	51	赵六	M

(nrow(da1), ncol(da1))

(4, 3)

@show names(da1)

names(da1) = ["name", "age", "sex"]

3-element Vector{String}:
 "name"
 "age"
 "sex"

zip(names(da1),string.(eltype.(eachcol(da1))))|>
    DataFrame |>
    d -> rename!(d,["Variable", "Type"])

	Variable	Type
	String	String
1	name	String
2	age	Union{Missing, Int64}
3	sex	String

访问单个元素

da1[2,1]

"李四"

da1[2, :name] = "孙七"
da1

	name	age	sex
	String	Int64?	String
1	张三	33	M
2	孙七	42	F
3	王五	missing	M
4	赵六	51	M

访问一列

da1[!,2]

4-element Vector{Union{Missing, Int64}}:
 33
 42
   missing
 51

在julia中较为常用的方法是!，可以实现将多个的变量的操作。

da1[:,2]

4-element Vector{Union{Missing, Int64}}:
 33
 42
   missing
 51

冒号与感叹号之间仍然存在些许差别：:可生成一个副本，而!会对于原有数据中进行修改。

CSV数据访问

using CSV
using Distributions

import XLSX
using DataFrames
using StatsBase
using StatsPlots

name=DataFrame(year=[1,2,3],id=["jian","qi","huang"])

	year	id
	Int64	String
1	1	jian
2	2	qi
3	3	huang

访问url上的数据

using Downloads
urlf = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
dht = CSV.read(Downloads.download(urlf), DataFrame,
    header=0)
rename!(dht, ["age", "sex", "cp", "trestbps", "chol",
    "fbs", "restecg", "thalach", "exang", "oldpeak",
    "slope", "ca", "thal", "num"])

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	63.0	1.0	1.0	145.0	233.0	1.0	2.0	150.0	0.0
2	67.0	1.0	4.0	160.0	286.0	0.0	2.0	108.0	1.0
3	67.0	1.0	4.0	120.0	229.0	0.0	2.0	129.0	1.0
4	37.0	1.0	3.0	130.0	250.0	0.0	0.0	187.0	0.0
5	41.0	0.0	2.0	130.0	204.0	0.0	2.0	172.0	0.0
6	56.0	1.0	2.0	120.0	236.0	0.0	0.0	178.0	0.0
7	62.0	0.0	4.0	140.0	268.0	0.0	2.0	160.0	0.0
8	57.0	0.0	4.0	120.0	354.0	0.0	0.0	163.0	1.0
9	63.0	1.0	4.0	130.0	254.0	0.0	2.0	147.0	0.0
10	53.0	1.0	4.0	140.0	203.0	1.0	2.0	155.0	1.0
11	57.0	1.0	4.0	140.0	192.0	0.0	0.0	148.0	0.0
12	56.0	0.0	2.0	140.0	294.0	0.0	2.0	153.0	0.0
13	56.0	1.0	3.0	130.0	256.0	1.0	2.0	142.0	1.0
14	44.0	1.0	2.0	120.0	263.0	0.0	0.0	173.0	0.0
15	52.0	1.0	3.0	172.0	199.0	1.0	0.0	162.0	0.0
16	57.0	1.0	3.0	150.0	168.0	0.0	0.0	174.0	0.0
17	48.0	1.0	2.0	110.0	229.0	0.0	0.0	168.0	0.0
18	54.0	1.0	4.0	140.0	239.0	0.0	0.0	160.0	0.0
19	48.0	0.0	3.0	130.0	275.0	0.0	0.0	139.0	0.0
20	49.0	1.0	2.0	130.0	266.0	0.0	0.0	171.0	0.0
21	64.0	1.0	1.0	110.0	211.0	0.0	2.0	144.0	1.0
22	58.0	0.0	1.0	150.0	283.0	1.0	2.0	162.0	0.0
23	58.0	1.0	2.0	120.0	284.0	0.0	2.0	160.0	0.0
24	58.0	1.0	3.0	132.0	224.0	0.0	2.0	173.0	0.0
25	60.0	1.0	4.0	130.0	206.0	0.0	2.0	132.0	1.0
26	50.0	0.0	3.0	120.0	219.0	0.0	0.0	158.0	0.0
27	58.0	0.0	3.0	120.0	340.0	0.0	0.0	172.0	0.0
28	66.0	0.0	1.0	150.0	226.0	0.0	0.0	114.0	0.0
29	43.0	1.0	4.0	150.0	247.0	0.0	0.0	171.0	0.0
30	40.0	1.0	4.0	110.0	167.0	0.0	2.0	114.0	1.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

在行下标位置写单独的叹号表示所有行，在列下标指定一列后可以访问数据框的这一列，不制作副本。比如，df[!,2], df[!, "age"]或df[!, :age]都可以取出df的第二列作为一个一维数组：

df[!,2]

29-element Vector{Float64}:
 36102.6
 35445.1
 33106.0
 29883.0
 27041.2
 24779.1
 22926.0
 21134.6
 19024.7
 17188.8
 14964.0
 12900.9
 11813.1
     ⋮
  5267.2
  4525.7
  3861.5
  3277.8
  2759.8
  2439.1
  2118.1
  1819.4
  1516.2
  1149.8
   888.9
   710.2

describe(df)

	variable	mean	min	median	max	nmissing	eltype
	Symbol	Float64	Real	Float64	Real	Int64	DataType
1	Column1	2006.0	1992	2006.0	2020	0	Int64
2	Beijing	12719.2	710.2	8387.0	36102.6	0	Float64
3	Tianjin	5536.72	411.0	3538.2	14083.7	0	Float64
4	Hebei	14254.6	1278.5	10043.0	36206.9	0	Float64
5	Shanxi	6732.14	551.1	4713.6	17651.9	0	Float64
6	Inner Mongolia	6485.58	421.7	4161.8	17359.8	0	Float64
7	Liaoning	11209.5	1473.0	8390.3	25115.0	0	Float64
8	Jilin	5135.64	558.1	3226.5	12311.3	0	Float64
9	Heilongjiang	6601.08	857.4	5329.8	13698.5	0	Float64
10	Shanghai	14671.4	1114.3	10598.9	38700.6	0	Float64
11	Jiangsu	35370.5	2136.0	21240.8	102719.0	0	Float64
12	Zhejiang	22816.2	1375.7	15302.7	64613.3	0	Float64
13	Anhui	12282.8	827.0	6500.3	38680.6	0	Float64
14	Fujian	13892.8	784.7	7468.6	43903.9	0	Float64
15	Jiangxi	8414.53	572.6	4696.8	25691.5	0	Float64
16	Shandong	27864.4	2196.5	18967.8	73129.0	0	Float64
17	Henan	19156.4	1279.8	11977.9	54997.1	0	Float64
18	Hubei	14904.0	1088.4	7531.8	45429.0	0	Float64
19	Hunan	13893.0	987.0	7431.6	41781.5	0	Float64
20	Guangdong	38962.5	2447.5	25961.2	1.10761e5	0	Float64
21	Guangxi,	7576.56	646.6	4417.8	22156.7	0	Float64
22	Hainan	1872.47	184.9	1027.5	5532.4	0	Float64
23	Chongqing	7843.67	462.5	3900.3	25002.8	0	Float64
24	Sichuan	15689.6	1177.3	8494.7	48598.8	0	Float64
25	Guizhou	5036.27	339.9	2264.1	17826.6	0	Float64
26	Yunnan	7647.3	618.7	4090.7	24521.9	0	Float64
27	Tibet	529.959	33.3	285.9	1902.7	0	Float64
28	Shaanxi	8744.46	531.6	4595.6	26181.9	0	Float64
29	Gansu	3377.07	317.8	2203.0	9016.7	0	Float64
30	Qinghai	1026.16	87.5	585.2	3005.9	0	Float64
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

df[:1,:]

Row	地区	x1	x2	x3	x4	x5	x6	x7	x8	x9
	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any
1	北京	39138	58042	53062	49455	43187	143717	94956	65646	64250

mean(df[:,:x1])

30844.166666666668

first(df,2)

Row	地区	x1	x2	x3	x4	x5	x6	x7	x8	x9
	Any	Any	Any	Any	Any	Any	Any	Any	Any	Any
1	北京	39138	58042	53062	49455	43187	143717	94956	65646	64250
2	天津	36007	61667	47103	50372	43400	68436	58365	44999	51602

using Query

using Downloads
urlf = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
dht = CSV.read(Downloads.download(urlf), DataFrame,
    header=0)
rename!(dht, ["age", "sex", "cp", "trestbps", "chol", 
    "fbs", "restecg", "thalach", "exang", "oldpeak",    
    "slope", "ca", "thal", "num"])

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	63.0	1.0	1.0	145.0	233.0	1.0	2.0	150.0	0.0
2	67.0	1.0	4.0	160.0	286.0	0.0	2.0	108.0	1.0
3	67.0	1.0	4.0	120.0	229.0	0.0	2.0	129.0	1.0
4	37.0	1.0	3.0	130.0	250.0	0.0	0.0	187.0	0.0
5	41.0	0.0	2.0	130.0	204.0	0.0	2.0	172.0	0.0
6	56.0	1.0	2.0	120.0	236.0	0.0	0.0	178.0	0.0
7	62.0	0.0	4.0	140.0	268.0	0.0	2.0	160.0	0.0
8	57.0	0.0	4.0	120.0	354.0	0.0	0.0	163.0	1.0
9	63.0	1.0	4.0	130.0	254.0	0.0	2.0	147.0	0.0
10	53.0	1.0	4.0	140.0	203.0	1.0	2.0	155.0	1.0
11	57.0	1.0	4.0	140.0	192.0	0.0	0.0	148.0	0.0
12	56.0	0.0	2.0	140.0	294.0	0.0	2.0	153.0	0.0
13	56.0	1.0	3.0	130.0	256.0	1.0	2.0	142.0	1.0
14	44.0	1.0	2.0	120.0	263.0	0.0	0.0	173.0	0.0
15	52.0	1.0	3.0	172.0	199.0	1.0	0.0	162.0	0.0
16	57.0	1.0	3.0	150.0	168.0	0.0	0.0	174.0	0.0
17	48.0	1.0	2.0	110.0	229.0	0.0	0.0	168.0	0.0
18	54.0	1.0	4.0	140.0	239.0	0.0	0.0	160.0	0.0
19	48.0	0.0	3.0	130.0	275.0	0.0	0.0	139.0	0.0
20	49.0	1.0	2.0	130.0	266.0	0.0	0.0	171.0	0.0
21	64.0	1.0	1.0	110.0	211.0	0.0	2.0	144.0	1.0
22	58.0	0.0	1.0	150.0	283.0	1.0	2.0	162.0	0.0
23	58.0	1.0	2.0	120.0	284.0	0.0	2.0	160.0	0.0
24	58.0	1.0	3.0	132.0	224.0	0.0	2.0	173.0	0.0
25	60.0	1.0	4.0	130.0	206.0	0.0	2.0	132.0	1.0
26	50.0	0.0	3.0	120.0	219.0	0.0	0.0	158.0	0.0
27	58.0	0.0	3.0	120.0	340.0	0.0	0.0	172.0	0.0
28	66.0	0.0	1.0	150.0	226.0	0.0	0.0	114.0	0.0
29	43.0	1.0	4.0	150.0	247.0	0.0	0.0	171.0	0.0
30	40.0	1.0	4.0	110.0	167.0	0.0	2.0	114.0	1.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

first(dht.age,10)

10-element Vector{Float64}:
 63.0
 67.0
 67.0
 37.0
 41.0
 56.0
 62.0
 57.0
 63.0
 53.0

dline01 = DataFrame(
    x = 1:5,
    y = [11, 13, 18, 15, 14])

Row	x	y
	Int64	Int64
1	1	11
2	2	13
3	3	18
4	4	15
5	5	14

0:0.1:6

0.0:0.1:6.0

sum=0
for i in 0:0.1:6
    sum+=i
end
@show sum

LoadError: cannot assign a value to variable Base.sum from module Main

using DataFrames, DataFramesMeta
using CategoricalArrays

using Makie
using LinearAlgebra

读取数据库数据

using RDatasets: dataset

iris=dataset("datasets","iris")

	SepalLength	SepalWidth	PetalLength	PetalWidth	Species
	Float64	Float64	Float64	Float64	Cat…
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
7	4.6	3.4	1.4	0.3	setosa
8	5.0	3.4	1.5	0.2	setosa
9	4.4	2.9	1.4	0.2	setosa
10	4.9	3.1	1.5	0.1	setosa
11	5.4	3.7	1.5	0.2	setosa
12	4.8	3.4	1.6	0.2	setosa
13	4.8	3.0	1.4	0.1	setosa
14	4.3	3.0	1.1	0.1	setosa
15	5.8	4.0	1.2	0.2	setosa
16	5.7	4.4	1.5	0.4	setosa
17	5.4	3.9	1.3	0.4	setosa
18	5.1	3.5	1.4	0.3	setosa
19	5.7	3.8	1.7	0.3	setosa
20	5.1	3.8	1.5	0.3	setosa
21	5.4	3.4	1.7	0.2	setosa
22	5.1	3.7	1.5	0.4	setosa
23	4.6	3.6	1.0	0.2	setosa
24	5.1	3.3	1.7	0.5	setosa
25	4.8	3.4	1.9	0.2	setosa
26	5.0	3.0	1.6	0.2	setosa
27	5.0	3.4	1.6	0.4	setosa
28	5.2	3.5	1.5	0.2	setosa
29	5.2	3.4	1.4	0.2	setosa
30	4.7	3.2	1.6	0.2	setosa
⋮	⋮	⋮	⋮	⋮	⋮

sample=iris[:,1:4]
label=iris[:,end]
#训练数据
train=sample[1:2:end,:]
train_albel=label[1:2:end]
#测试数据
test=sample[2:2:end,:]
test_label=label[2:2:end]
#需要把Iris数据DataFrame类型转换为Array类型

75-element CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

Array(train)

LoadError: UndefVarError: train not defined