batch is straightforward: it is simply the batch size. Note that the last batch of an epoch may contain fewer samples than batch_size.
dataset.repeat corresponds to what is usually called epochs, but in TF the order in which it is combined with dataset.shuffle determines whether samples from different epochs get mixed together.
dataset.shuffle maintains a shuffle buffer of buffer_size elements; every sample the graph requests is drawn from this buffer, and each time a sample is taken, one new sample is pulled from the source dataset into the buffer.
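As an aside, if a model needs a fixed batch shape, newer tf.data releases let batch discard that smaller trailing batch. A minimal sketch, assuming a TF version that already supports the drop_remainder argument:

```python
import numpy as np
import tensorflow as tf

x = np.arange(10).reshape(10, 1)
ds = tf.data.Dataset.from_tensor_slices(x)

# default behaviour: 10 samples with batch_size=4 -> batches of 4, 4 and 2
ds_ragged = ds.batch(4)

# drop_remainder=True keeps only full batches; the trailing 2 samples are discarded
ds_full = ds.batch(4, drop_remainder=True)
```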
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf

np.random.seed(0)
x = np.random.sample((11, 2))
# make a dataset from a numpy array
print(x)
print()

dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(3)
dataset = dataset.batch(4)
dataset = dataset.repeat(2)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    # 11 samples, batch_size=4, 2 epochs -> 6 batches in total;
    # a 7th sess.run(el) would raise tf.errors.OutOfRangeError
    for _ in range(6):
        print(sess.run(el))
```
```
# source dataset
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]
 [0.43758721 0.891773  ]
 [0.96366276 0.38344152]
 [0.79172504 0.52889492]
 [0.56804456 0.92559664]
 [0.07103606 0.0871293 ]
 [0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.97861834 0.79915856]]

# batches produced after shuffle + batch
[[0.4236548  0.64589411]
 [0.60276338 0.54488318]
 [0.43758721 0.891773  ]
 [0.5488135  0.71518937]]

[[0.96366276 0.38344152]
 [0.56804456 0.92559664]
 [0.0202184  0.83261985]
 [0.79172504 0.52889492]]

[[0.07103606 0.0871293 ]
 [0.97861834 0.79915856]
 [0.77815675 0.87001215]]  # last batch of the first epoch has only 3 samples

[[0.60276338 0.54488318]
 [0.5488135  0.71518937]
 [0.43758721 0.891773  ]
 [0.79172504 0.52889492]]

[[0.4236548  0.64589411]
 [0.56804456 0.92559664]
 [0.0202184  0.83261985]
 [0.07103606 0.0871293 ]]

[[0.77815675 0.87001215]
 [0.96366276 0.38344152]
 [0.97861834 0.79915856]]  # last batch of the second epoch also has only 3 samples
```
1. Following the buffer size set in shuffle, three samples are first taken from the source dataset:
shuffle buffer:
[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
[ 0.4236548 0.64589411]
2. One sample is drawn from the buffer into the batch:
shuffle buffer:
[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
batch:
[ 0.4236548 0.64589411]
3. The shuffle buffer now holds fewer than three samples, so one sample is pulled from the source dataset:
shuffle buffer:
[ 0.5488135 0.71518937]
[ 0.60276338 0.54488318]
[ 0.43758721 0.891773 ]
4. Another sample is drawn from the buffer into the batch:
shuffle buffer:
[ 0.5488135 0.71518937]
[ 0.43758721 0.891773 ]
batch:
[ 0.4236548 0.64589411]
[ 0.60276338 0.54488318]
5. And so on. This means that with buffer_size=1 the dataset is not shuffled at all, while with buffer_size equal to the number of samples the whole dataset is shuffled uniformly at random; the sketch below simulates this behaviour.
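To make the mechanism above concrete, here is a minimal pure-Python simulation of a buffered shuffle (an illustration of the idea only, not TensorFlow's actual implementation):

```python
import random

def buffered_shuffle(source, buffer_size, seed=0):
    """Mimic tf.data's shuffle: emit a random element from a small buffer,
    then refill the buffer from the source. Illustration only."""
    rng = random.Random(seed)
    source = iter(source)
    buffer = []
    for sample in source:              # initially fill the buffer
        buffer.append(sample)
        if len(buffer) == buffer_size:
            break
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))  # emit one random sample
        nxt = next(source, None)                      # refill from the source if possible
        if nxt is not None:
            buffer.append(nxt)

print(list(buffered_shuffle(range(11), buffer_size=1)))   # order unchanged
print(list(buffered_shuffle(range(11), buffer_size=3)))   # only locally shuffled
print(list(buffered_shuffle(range(11), buffer_size=11)))  # fully shuffled
```

With buffer_size=1 every draw has exactly one candidate, so the original order is preserved; with buffer_size equal to the dataset size the entire dataset sits in the buffer and the permutation is uniform.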
The same example with buffer_size=1 confirms this:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf

np.random.seed(0)
x = np.random.sample((11, 2))
# make a dataset from a numpy array
print(x)
print()

dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(1)   # buffer_size=1: effectively no shuffling
dataset = dataset.batch(4)
dataset = dataset.repeat(2)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    # 6 batches in total; further sess.run(el) calls would raise tf.errors.OutOfRangeError
    for _ in range(6):
        print(sess.run(el))
```

```
# source dataset
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]
 [0.43758721 0.891773  ]
 [0.96366276 0.38344152]
 [0.79172504 0.52889492]
 [0.56804456 0.92559664]
 [0.07103606 0.0871293 ]
 [0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.97861834 0.79915856]]

# with buffer_size=1 both epochs come out in the original order
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]
 [0.43758721 0.891773  ]]

[[0.96366276 0.38344152]
 [0.79172504 0.52889492]
 [0.56804456 0.92559664]
 [0.07103606 0.0871293 ]]

[[0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.97861834 0.79915856]]

[[0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]
 [0.43758721 0.891773  ]]

[[0.96366276 0.38344152]
 [0.79172504 0.52889492]
 [0.56804456 0.92559664]
 [0.07103606 0.0871293 ]]

[[0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.97861834 0.79915856]]
```
Note what happens if repeat is called before shuffle:
The official guidance is that calling repeat before shuffle can improve performance, but it blurs the epoch boundaries between samples, as the following example shows.
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ""
import numpy as np
import tensorflow as tf

np.random.seed(0)
x = np.random.sample((11, 2))
# make a dataset from a numpy array
print(x)
print()

dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.repeat(2)     # repeat first ...
dataset = dataset.shuffle(11)   # ... then shuffle: the two epochs get blended together
dataset = dataset.batch(4)

# create the iterator
iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    # 22 samples, batch_size=4 -> 6 batches; further calls would raise tf.errors.OutOfRangeError
    for _ in range(6):
        print(sess.run(el))
```

```
# source dataset
[[0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]
 [0.43758721 0.891773  ]
 [0.96366276 0.38344152]
 [0.79172504 0.52889492]
 [0.56804456 0.92559664]
 [0.07103606 0.0871293 ]
 [0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.97861834 0.79915856]]

[[0.56804456 0.92559664]
 [0.5488135  0.71518937]
 [0.60276338 0.54488318]
 [0.07103606 0.0871293 ]]

[[0.96366276 0.38344152]
 [0.43758721 0.891773  ]
 [0.43758721 0.891773  ]   # the same sample appears twice within one batch
 [0.77815675 0.87001215]]

[[0.79172504 0.52889492]   # the same sample appears twice within one batch
 [0.79172504 0.52889492]
 [0.60276338 0.54488318]
 [0.4236548  0.64589411]]

[[0.07103606 0.0871293 ]
 [0.4236548  0.64589411]
 [0.96366276 0.38344152]
 [0.5488135  0.71518937]]

[[0.97861834 0.79915856]
 [0.0202184  0.83261985]
 [0.77815675 0.87001215]
 [0.56804456 0.92559664]]

[[0.0202184  0.83261985]
 [0.97861834 0.79915856]]  # the last batch has 2 samples while all previous ones have 4
```
A practical usage example:
```python
def input_fn(filenames, batch_size=32, num_epochs=1, perform_shuffle=False):
    print('Parsing', filenames)

    def decode_libsvm(line):
        # parse one libsvm-formatted line: "<label> <id>:<val> <id>:<val> ..."
        columns = tf.string_split([line], ' ')
        labels = tf.string_to_number(columns.values[0], out_type=tf.float32)
        splits = tf.string_split(columns.values[1:], ':')
        id_vals = tf.reshape(splits.values, splits.dense_shape)
        feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
        feat_ids = tf.string_to_number(feat_ids, out_type=tf.int32)
        feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
        return {"feat_ids": feat_ids, "feat_vals": feat_vals}, labels

    # Extract lines from the input files using the Dataset API;
    # `filenames` can be a single file name or a list of file names.
    dataset = (tf.data.TextLineDataset(filenames)
               .map(decode_libsvm, num_parallel_calls=10)  # multi-threaded pre-processing
               .prefetch(500000))                          # then prefetch

    # Randomize the input using a window of 256 elements (read into memory).
    if perform_shuffle:
        dataset = dataset.shuffle(buffer_size=256)

    # Call repeat after shuffle to keep separate epochs from blending together.
    dataset = dataset.repeat(num_epochs)
    dataset = dataset.batch(batch_size)  # batch size to use

    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
```
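For context, here is a hedged sketch of how an input_fn like this is typically wired into the tf.estimator API. The model_fn, directories and file names below are illustrative placeholders, not part of the original code:

```python
# Assumed: `my_model_fn` is a user-defined model_fn and the .libsvm files exist.
estimator = tf.estimator.Estimator(model_fn=my_model_fn, model_dir='./model_dir')

# Train with shuffling for several epochs ...
estimator.train(
    input_fn=lambda: input_fn(['tr.libsvm'], batch_size=256,
                              num_epochs=10, perform_shuffle=True))

# ... then evaluate on a validation file, in order and for a single pass.
estimator.evaluate(
    input_fn=lambda: input_fn(['va.libsvm'], batch_size=256,
                              num_epochs=1, perform_shuffle=False))
```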
原文鏈接:https://blog.csdn.net/qq_16234613/article/details/81703228