

Movie Review Analysis: A Hands-On Natural Language Processing Tutorial with TensorFlow and TensorBoard

In this article you will learn how to apply TensorFlow, a powerful tool, to a natural language processing project, and you will see some basic uses of TensorBoard along the way. I will explain the details of the code involved, but you may need to fill in some of the more basic background on your own; I won't repeat it here.


In this project we will use the data from Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial). The dataset includes 25,000 labeled movie reviews and 50,000 unlabeled training reviews.


Here are the Python packages we will use:

import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re, time
from nltk.corpus import stopwords
from collections import defaultdict
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import namedtuple


You may not be familiar with some of these packages, but that's fine; I will explain how they are used, and you can also find material about them online. The data comes as .tsv files, so we load it with the appropriate delimiter.


train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
test = pd.read_csv("testData.tsv", delimiter="\t")

Here is a sample of the data:


# Here's the first review as an example
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.

To improve performance we need to do some preprocessing of the raw data. For example, the <br /> tags in the text contribute nothing to training; they are noise that should be cleaned out of the data.



def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    text = re.sub(r"   ", " ", text) # Remove any extra spaces
    text = re.sub(r"  ", " ", text)
    
    # Return the cleaned text as a single string
    return text


The function breaks the cleaning into two parts: stop word removal and regular expressions.


# stop words
if remove_stopwords:
    stops = set(stopwords.words("english"))
    text = [w for w in text if not w in stops]

Stop words are words that carry little meaning on their own (a, the, just, and so on). They add a great deal of noise to the data, so we remove them before training. Here is the stop word list we use: https://gist.github.com/sebleier/554280


I have also found that the stop word list may need adjusting from project to project. For example, the list used here includes some pronouns that could be meaningful in another project and should not be treated as stop words there.
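Note that stopwords.words("english") requires the NLTK stopword corpus to be available locally; if it is not, it can be downloaded once with the standard NLTK downloader:

import nltk

# One-time download of the NLTK stopword corpus used by clean_text()
nltk.download("stopwords")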


# re
text = re.sub(r"<br />", " ", text)
text = re.sub(r"[^a-z]", " ", text)
text = re.sub(r"   ", " ", text) # Remove any extra spaces
text = re.sub(r"  ", " ", text)

Here re refers to regular expressions, a compact way of describing patterns in strings.

As you can see, the first substitution replaces <br /> with a space, removing the tag from the text. The next step is to tokenize the words in our training text.
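The tokenization code below uses train_clean and test_clean, the lists of cleaned reviews, which are not built in this excerpt. A minimal sketch of how they can be created with clean_text(), assuming the review column of the Kaggle DataFrames:

train_clean = []
for review in train.review:
    train_clean.append(clean_text(review))
print("Training reviews are cleaned.")

test_clean = []
for review in test.review:
    test_clean.append(clean_text(review))
print("Testing reviews are cleaned.")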


# Tokenize the reviews
all_reviews = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_reviews)
print("Fitting is complete.")

train_seq = tokenizer.texts_to_sequences(train_clean)
print("train_seq is complete.")

test_seq = tokenizer.texts_to_sequences(test_clean)
print("test_seq is complete")


Tokenization converts each word into a unique corresponding integer. For example, the sentence ["The", "cat", "went", "to", "the", "zoo", "."] would be tokenized as [1, 2, 3, 4, 1, 5, 6].


There are several different ways to tokenize text; I prefer the one that comes with Keras.
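To make this concrete, here is a tiny toy illustration with the Keras Tokenizer. The exact indices shown in the comments are indicative only, since Keras assigns them by word frequency and strips punctuation by default:

from keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["The cat went to the zoo."])
print(toy_tokenizer.word_index)
# Roughly {'the': 1, 'cat': 2, 'went': 3, 'to': 4, 'zoo': 5}
print(toy_tokenizer.texts_to_sequences(["The cat went to the zoo."]))
# Roughly [[1, 2, 3, 4, 1, 5]] -- note "." is filtered out by default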


The vocabulary in this project is not very large, 99,426 words in total.


word_index = tokenizer.word_index

You can also do some preliminary filtering of the vocabulary. For example, keeping only the more common words brings the vocabulary down to around 80,000, and going further you could keep only words that appear at least five times. Trimming the vocabulary like this helps the model. For instance, 'Goldfinger' (the classic 007 film) appears only once in the text, so it says little about a review's sentiment, whereas common words such as 'good' or 'bad' carry much more signal for prediction.
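Two ways to trim the vocabulary are sketched below (the 80,000 cap and the threshold of 5 mirror the numbers above and are otherwise arbitrary): the Keras Tokenizer can cap how many words it keeps, or you can inspect tokenizer.word_counts to see how many words clear a minimum-frequency threshold.

# Option 1: keep only the 80,000 most frequent words when tokenizing
capped_tokenizer = Tokenizer(num_words=80000)

# Option 2: count how many words appear at least 5 times
frequent_words = [w for w, c in tokenizer.word_counts.items() if c >= 5]
print("Words used at least 5 times:", len(frequent_words))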


Here is what a review looks like after tokenization.
[445, 86, 489, 10939, 8, 61, 583, 2603, 120, 68, 957, 560, 53, 212, 24485, 212, 17247, 219, 193, 97, 20, 695, 2565, 124, 109, 15, 520, 3954, 193, 27, 246, 654, 2352, 1261, 17247, 90, 4782, 90, 712, 3, 305, 86, 16, 358, 1846, 542, 1219, 3592, 10939, 1, 485, 871, 3538, 23, 526, 673, 1414, 19, 63, 5305, 2089, 1118, 185, 413, 1523, 817, 2583, 7, 10939, 477, 86, 665, 85, 272, 114, 578, 10939, 34480, 29662, 148, 2, 10939, 381, 13, 59, 26, 381, 210, 15, 252, 178, 10, 751, 712, 3, 142, 341, 464, 145, 16427, 4121, 1718, 635, 876, 10547, 1018, 12089, 890, 1067, 1652, 416, 10939, 265, 19, 596, 141, 10939, 18336, 2302, 15821, 876, 10547, 1, 34, 38190, 388, 21, 49, 17539, 1414, 434, 9821, 193, 4238, 10939, 1, 120, 669, 520, 96, 7, 10939, 1555, 444, 2271, 138, 2137, 2383, 635, 23, 72, 117, 4750, 5364, 307, 1326, 31136, 19, 635, 556, 888, 665, 697, 6, 452, 195, 547, 138, 689, 3386, 1234, 790, 56, 1239, 268, 2, 21, 7, 10939, 6, 580, 78, 476, 32, 21, 245, 706, 158, 276, 113, 7674, 673, 3526, 10939, 1, 37925, 1690, 2, 159, 413, 1523, 294, 6, 956, 21, 51, 1500, 1226, 2352, 17, 612, 8, 61, 442, 724, 7184, 17, 25, 4, 49, 21, 199, 443, 3912, 3484, 49, 110, 270, 495, 252, 289, 124, 6, 19622, 19910, 363, 1502]

Next, let's standardize the length of the reviews so that every example has the same length.

max_review_length = 200

train_pad = pad_sequences(train_seq, maxlen = max_review_length)
print("train_pad is complete.")

test_pad = pad_sequences(test_seq, maxlen = max_review_length)
print("test_pad is complete.")

Using more of each review could improve accuracy; I set the maximum review length to 200 here to keep training fast. Before settling on a value, check the length of your reviews so you can pick something sensible. numpy's percentile method is useful for this.

np.percentile(lengths.counts, 80)
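The lengths object used above is not constructed in this excerpt; one way to build it is to collect the length of each tokenized review into a DataFrame (a sketch, with the column name counts assumed):

lengths = pd.DataFrame([len(seq) for seq in train_seq + test_seq],
                       columns=['counts'])
print(np.percentile(lengths.counts, 80))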

At a maximum length of 200, about 80% of the reviews have all of their words included. Reviews longer than 200 tokens are truncated, and shorter ones are filled with padding tokens. It is worth thinking about smarter ways to pad and truncate; I won't go into detail here, but consider what you might do differently (one of the knobs pad_sequences exposes is sketched below).
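For example, Keras's pad_sequences lets you choose whether padding and truncation happen at the start or the end of each sequence (both default to 'pre'). A sketch, using a hypothetical variable name so it does not replace the arrays built above:

# Alternative: pad short reviews at the end and truncate long ones at the
# end, instead of the default 'pre' behaviour
train_pad_post = pad_sequences(train_seq, maxlen=max_review_length,
                               padding='post', truncating='post')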

At this point we can split the data into a training set and a validation set.


x_train, x_valid, y_train, y_valid = train_test_split(train_pad, train.sentiment, test_size = 0.15, random_state = 2)

Normally I would split the data into training, validation, and test sets. In this project, though, the data comes from a Kaggle competition, whose test data already serves as our test set. Before building the model we first create a couple of functions for batching the data.

def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

def get_test_batches(x, batch_size):
    '''Create the batches for the testing data'''
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size]

These functions split the data into batches of equal size. Note that if the final batch contains fewer than batch_size examples it is dropped, so the length of your dataset should be an exact multiple of batch_size; otherwise you may run into trouble later when uploading your predictions.
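A quick sanity check (using the sizes in this project, 25,000 test reviews and the batch size of 250 set below) confirms that no test reviews will be silently dropped:

# Every test review needs a prediction for the Kaggle submission
assert len(test_pad) % batch_size == 0, "batch_size must divide the test set evenly"

With batching in place, let's build the recurrent neural network step by step.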

def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, dropout, learning_rate, multiple_fc, fc_units):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, 
                                    embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                         output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)
    
    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(
                                        cell,         
                                        embed,
                                        initial_state=initial_state)    
    
    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        
        # Initialize the weights and biases
        weights = tf.truncated_normal_initializer(stddev=0.1)
        biases = tf.zeros_initializer()
        
        dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                    num_outputs = fc_units,
                    activation_fn = tf.sigmoid,
                    weights_initializer = weights,
                    biases_initializer = biases)
        
        dense = tf.contrib.layers.dropout(dense, keep_prob)
        
        # Depending on the iteration, use a second fully connected layer
        if multiple_fc == True:
            dense = tf.contrib.layers.fully_connected(dense,
                        num_outputs = fc_units,
                        activation_fn = tf.sigmoid,
                        weights_initializer = weights,
                        biases_initializer = biases)
            
            dense = tf.contrib.layers.dropout(dense, keep_prob)
    
    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense, 
                          num_outputs = 1, 
                          activation_fn=tf.sigmoid,
                          weights_initializer = weights,
                          biases_initializer = biases)
        
        tf.summary.histogram('predictions', predictions)
    
    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)
    
    # Train the model
    with tf.name_scope('train'):    
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), 
                                        tf.int32), 
                                        labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
    
    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',        
                    'final_state','accuracy', 'predictions', 'cost', 
                    'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph

If you have not used TensorBoard before, I recommend watching Siraj Raval's video: https://www.youtube.com/watch?v=fBVEXKp4DIc. He has many other related tutorials that are just as good.



tf.reset_default_graph()


Before building the graph we reset it, so that nothing left over from a previous run interferes with the new one.


# Declare placeholders we'll feed into the graph
with tf.name_scope('inputs'):
    inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

with tf.name_scope('labels'):
    labels = tf.placeholder(tf.int32, [None, None], name='labels')

keep_prob = tf.placeholder(tf.float32, name='keep_prob')


These are the placeholders for our data. tf.name_scope() labels specific parts of the graph so they are easy to pick out when we visualize it in TensorBoard.


# Create the embeddings
with tf.name_scope("embeddings"):
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)

The embedding maps each word in our vocabulary to a vector, and embed_size is the dimensionality of that vector. There is a more detailed discussion here: https://www.quora.com/What-does-the-word-embedding-mean-in-the-context-of-Machine-Learning

Although I initialize the embeddings with a random uniform distribution, there are many other options. A truncated normal distribution with a small standard deviation is also a good choice, which would look like this:

embedding = tf.Variable(tf.truncated_normal((n_words, embed_size), stddev=0.1))

Give it a try yourself!


# Build the RNN layers
with tf.name_scope("RNN_layers"):
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, 
                                         output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)

This is the core of the recurrent neural network. As you will see from the hyperparameters, we use a two-layer network with 50% dropout.


# Set the initial state
with tf.name_scope("RNN_init_state"):
    initial_state = cell.zero_state(batch_size, tf.float32)

This step creates the initial (zero) state of the RNN.


# Run the data through the RNN layers
with tf.name_scope("RNN_forward"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                        initial_state=initial_state)

This is the forward pass of the model. As I mentioned earlier, the reviews can vary in length, and tf.nn.dynamic_rnn handles that for us. See the documentation for details: https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/dynamic_rnn

# Create the fully connected layers
with tf.name_scope("fully_connected"):
        
    # Initialize the weights and biases
    weights = tf.truncated_normal_initializer(stddev=0.1)
    biases = tf.zeros_initializer()
        
    dense = tf.contrib.layers.fully_connected(outputs[:, -1],
                num_outputs = fc_units,
                activation_fn = tf.sigmoid,
                weights_initializer = weights,
                biases_initializer = biases)
        
    dense = tf.contrib.layers.dropout(dense, keep_prob)
        
    # Depending on the iteration, use a second fully connected layer
    if multiple_fc == True:
        dense = tf.contrib.layers.fully_connected(dense,
                    num_outputs = fc_units,
                    activation_fn = tf.sigmoid,
                    weights_initializer = weights,
                    biases_initializer = biases)
            
        dense = tf.contrib.layers.dropout(dense, keep_prob)

This step adds the first fully connected layer and, optionally, a second one. Their weights and biases are initialized just above the layers (a truncated normal for the weights, zeros for the biases). multiple_fc is a parameter of build_rnn that lets us test this part of the model's architecture. Using the same pattern you can experiment with other aspects as well, such as how the weights and biases are initialized, or whether to use LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) cells, and so on.
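For example, swapping the LSTM cells for GRU cells is a small change inside the RNN_layers scope. A sketch, reusing lstm_size as the number of units and keeping everything else the same:

with tf.name_scope("RNN_layers"):
    gru = tf.contrib.rnn.GRUCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(gru, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * num_layers)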

# Make the predictions
with tf.name_scope('predictions'):
    predictions = tf.contrib.layers.fully_connected(dense, 
                      num_outputs = 1, 
                      activation_fn = tf.sigmoid,
                      weights_initializer = weights,
                      biases_initializer = biases)
        
    tf.summary.histogram('predictions', predictions)

We have a single output here, the sentiment of the review, between 0 and 1. The sigmoid activation maps the output of the final fully connected layer into this range. tf.summary.histogram() records the predictions and renders them as histograms in TensorBoard. This lets us see how the distribution of predictions changes as the model trains, and how the training and validation sets compare.


# Calculate the cost
with tf.name_scope('cost'):
    cost = tf.losses.mean_squared_error(labels, predictions)
    tf.summary.scalar('cost', cost)


This computes the cost of the training step. We use tf.summary.scalar() here because the cost is a single scalar value.


# Train the model
with tf.name_scope('train'):    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

Adam is a common optimizer that makes training more efficient. You could use other algorithms as well; I won't discuss them here, but swapping one in is sketched below.
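Trying a different optimizer is a drop-in replacement on the same line. A sketch using RMSProp from the same tf.train module:

with tf.name_scope('train'):
    optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(cost)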


# Determine the accuracy
with tf.name_scope("accuracy"):
    correct_pred = tf.equal(tf.cast(tf.round(predictions), 
                                        tf.int32), 
                                        labels)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    tf.summary.scalar('accuracy', accuracy)

Our predictions are values between 0 and 1, so we round them before comparing them with the labels. tf.reduce_mean() then gives the fraction of predictions that are correct, which is the accuracy.

# Merge all of the summaries
merged = tf.summary.merge_all()

Here we merge all of the summaries, which simplifies saving them.


# Export the nodes 
export_nodes = ['inputs', 'labels', 'keep_prob','initial_state',        
                'final_state','accuracy', 'predictions', 'cost', 
                'optimizer', 'merged']
Graph = namedtuple('Graph', export_nodes)
local_dict = locals()
graph = Graph(*[local_dict[each] for each in export_nodes])


We export all of the nodes so the training function can use them. With that, the recurrent neural network is built and we can move on to training it.


def train(model, epochs, log_string):
    '''Train the RNN'''

    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        valid_loss_summary = []
        stop_early = 0
        
        # Keep track of which batch iteration is being trained
        iteration = 0

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
        valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

        for e in range(epochs):
            state = sess.run(model.initial_state)
            
            # Record progress with each epoch
            train_loss = []
            train_acc = []
            val_acc = []
            val_loss = []

            with tqdm(total=len(x_train)) as pbar:
                for _, (x, y) in enumerate(get_batches(x_train,       
                                               y_train, 
                                               batch_size), 1):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: dropout,
                            model.initial_state: state}
                    summary, loss, acc, state, _ = sess.run(
                        [model.merged,
                         model.cost,
                         model.accuracy,
                         model.final_state,
                         model.optimizer],
                        feed_dict=feed)
                    
                    # Record the loss and accuracy of each training batch
                    
                    train_loss.append(loss)
                    train_acc.append(acc)
                    
                    # Record the progress of training
                    train_writer.add_summary(summary, iteration)
                    
                    iteration += 1
                    pbar.update(batch_size)
            
            # Average the training loss and accuracy of each epoch
            avg_train_loss = np.mean(train_loss)
            avg_train_acc = np.mean(train_acc) 

            val_state = sess.run(model.initial_state)
            with tqdm(total=len(x_valid)) as pbar:
                for x, y in get_batches(x_valid,y_valid,batch_size):
                    feed = {model.inputs: x,
                            model.labels: y[:, None],
                            model.keep_prob: 1,
                            model.initial_state: val_state}
                    summary, batch_loss, batch_acc, val_state = sess.run(
                        [model.merged,
                         model.cost,
                         model.accuracy,
                         model.final_state],
                        feed_dict=feed)
                    
                    # Record the validation loss and accuracy of each epoch
                    
                    val_loss.append(batch_loss)
                    val_acc.append(batch_acc)
                    pbar.update(batch_size)
            
            # Average the validation loss and accuracy of each epoch
            avg_valid_loss = np.mean(val_loss)    
            avg_valid_acc = np.mean(val_acc)
            valid_loss_summary.append(avg_valid_loss)
            
            # Record the validation data's progress
            valid_writer.add_summary(summary, iteration)

            # Print the progress of each epoch
            print("Epoch: {}/{}".format(e, epochs),
                  "Train Loss: {:.3f}".format(avg_train_loss),
                  "Train Acc: {:.3f}".format(avg_train_acc),
                  "Valid Loss: {:.3f}".format(avg_valid_loss),
                  "Valid Acc: {:.3f}".format(avg_valid_acc))

            # Stop training if the validation loss does not decrease after 3 epochs
            
            if avg_valid_loss > min(valid_loss_summary):
                print("No Improvement.")
                stop_early += 1
                if stop_early == 3:
                    break   
            
            # Reset stop_early if the validation loss finds a new low
            # Save a checkpoint of the model
            else:
                print("New Record!")
                stop_early = 0
                checkpoint ="./sentiment_{}.ckpt".format(log_string)
                saver.save(sess, checkpoint)


Let's walk through the code above step by step.


saver = tf.train.Saver()

This creates the saver that stores the model's checkpoints.



train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

This code writes the training summaries to log files on disk. I recommend keeping them all in one logs folder, with the training and validation summaries in separate subfolders; that makes them easier to compare in TensorBoard.
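Once training has started, TensorBoard can be pointed at that folder from a terminal (assuming TensorBoard is installed alongside TensorFlow and the logs live under ./logs/3 as above), then viewed in a browser at localhost:6006:

# From a terminal, in the project directory:
tensorboard --logdir=./logs/3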


tqdm (https://pypi.python.org/pypi/tqdm) keeps track of how long each pass takes, so I have a rough idea of how much time is left.


# Record the validation data's progress
valid_writer.add_summary(summary, iteration)

The code above records a summary of the validation data at each iteration. It is not required, but it provides a lot of useful information.


# Reset stop_early if the validation loss finds a new low
# Save a checkpoint of the model
else:
    print("New Record!")
    stop_early = 0
    checkpoint = "./sentiment_{}.ckpt".format(log_string)
    saver.save(sess, checkpoint)

I strongly recommend saving checkpoints; restoring a trained model saves a great deal of time. Decide how many checkpoints make sense for your own situation; here I save only the best iteration of the model to conserve disk space, but you can adjust this per project.

下麵是我用的一些默認超參數:

n_words = len(word_index)
embed_size = 300
batch_size = 250
lstm_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.001
epochs = 100
multiple_fc = False
fc_units = 256

Tuning the hyperparameters is a good way to improve this model's performance. Here is my tuning setup:




# Train the model with the desired tuning parameters
for lstm_size in [64,128]:
    for multiple_fc in [True, False]:
        for fc_units in [128, 256]:
            log_string = 'ru={},fcl={},fcu={}'.format(lstm_size,
                                                      multiple_fc,
                                                      fc_units)
            model = build_rnn(n_words = n_words, 
                              embed_size = embed_size,
                              batch_size = batch_size,
                              lstm_size = lstm_size,
                              num_layers = num_layers,
                              dropout = dropout,
                              learning_rate = learning_rate,
                              multiple_fc = multiple_fc,
                              fc_units = fc_units)            
            train(model, epochs, log_string)

Here I vary lstm_size, multiple_fc, and fc_units. With this structure you can tune whatever values you like; just be sure to record them in the log string.



def make_predictions(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the testing data'''
    
    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words, 
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units) 
    
    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        for _, x in enumerate(get_test_batches(x_test, 
                                               batch_size), 1):
            feed = {model.inputs: x,
                    model.keep_prob: 1,
                    model.initial_state: test_state}
            predictions = sess.run(model.predictions,feed_dict=feed)
            for pred in predictions:
                all_preds.append(float(pred))
                
    return all_preds

This function produces the predictions for the test data. Note that the arguments must match the tuned model you are loading; otherwise the predictions may be made with the default parameter values and the results will not be what you expect.
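To turn these predictions into a Kaggle submission, pair them with the review ids and write a CSV. A sketch, where the checkpoint path and tuning values are placeholders for whichever run scored best, and x_test (used inside make_predictions) is assumed to be the padded test data:

# x_test, referenced inside make_predictions, is the padded test data
x_test = test_pad

checkpoint = "./sentiment_ru=128,fcl=False,fcu=256.ckpt"
predictions = make_predictions(lstm_size=128, multiple_fc=False,
                               fc_units=256, checkpoint=checkpoint)

# Write the predictions alongside the review ids from the test DataFrame
submission = pd.DataFrame({"id": test.id, "sentiment": predictions})
submission.to_csv("submission.csv", index=False)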


That is the project in broad strokes. The full source code is on GitHub (https://github.com/Currie32/Movie-Reviews-Sentiment). If you have any questions or suggestions about the code, I would be happy to discuss them further.



This article was recommended by @愛可可-愛生活 of BUPT and translated by the Alibaba Cloud Yunqi community.

Original article: Predicting Movie Review Sentiment with TensorFlow and TensorBoard, by Dave Currie.


About the author: Dave Currie is a software engineer working on machine learning (natural language processing) and data science.

LinkedIn: https://www.linkedin.com/in/davidcurrie32











