Pytest for Data Scientists

Author | Khuyen Tran
Translator | VK
Source | Towards Data Science

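Motivation

Applying different Python code to process data in a notebook is fun, but to make that code reproducible you need to put it into functions and classes. Once the code moves into a script, it may break because of some of those functions. So how do you check whether your functions work as you expect?

For example, we use TextBlob, a Python library for processing text data, to create a function that extracts the sentiment of a text. We want to make sure it works as expected: the function should return a value greater than 0 if the text is positive and a value less than 0 if the text is negative.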

from textblob import TextBlob

def extract_sentiment(text: str):
    '''Extract sentiment using textblob.
    Polarity is within the range [-1, 1].'''

    text = TextBlob(text)

    return text.sentiment.polarity

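The best way to know whether the function returns the right value every time is to apply it to different examples and see whether it produces the results we want. That is why testing matters.

In general, you should use testing in your data science projects because it allows you to:

Make sure the code works as expected
Detect edge cases
Feel confident swapping your existing code for improved code without worrying about breaking the whole pipeline

There are many Python tools available for testing, but the simplest one is Pytest.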

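Get Started with Pytest

Pytest is a framework that makes it easy to write small tests in Python. I like pytest because it helps me write tests with minimal code. If you are not familiar with testing, pytest is a great tool to get started with.

To install pytest, run

pip install -U pytest

To test the function shown above, we can simply create a function whose name starts with test_, followed by the name of the function we want to test, which is extract_sentiment.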

#sentiment.py
from textblob import TextBlob

def extract_sentiment(text: str):
    '''Extract sentiment using textblob.
    Polarity is within the range [-1, 1].'''

    text = TextBlob(text)

    return text.sentiment.polarity

def test_extract_sentiment():

    text = "I think today will be a great day"

    sentiment = extract_sentiment(text)

    assert sentiment > 0

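In the test function, we apply extract_sentiment to the example text "I think today will be a great day". We use assert sentiment > 0 to make sure the sentiment is positive.

That's it! Now we are ready to run the test.

If the name of our script is sentiment.py, we can run

pytest sentiment.py

Pytest will go through our script and run the functions whose names start with test. The output of the test above looks like this: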

========================================= test session starts ==========================================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1

collected 1 item
process.py .                                                                                     [100%]

========================================== 1 passed in 0.68s ===========================================

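Cool! We don't need to specify which function to test. As long as a function's name starts with test, pytest will detect and execute it. We don't even need to import pytest in the script to run the test.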
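To make the naming rule concrete, here is a simplified illustration of name-based discovery. It is not how pytest is implemented internally; collect_tests and the sample functions are hypothetical names used only for this sketch.

```python
def collect_tests(namespace):
    """Return the names of callables whose name starts with 'test', sorted."""
    return sorted(
        name
        for name, obj in namespace.items()
        if name.startswith("test") and callable(obj)
    )


def test_addition():
    assert 1 + 1 == 2


def helper():
    return "not a test"


def test_upper():
    assert "duck".upper() == "DUCK"


# A hand-built namespace standing in for the contents of a script.
namespace = {
    "test_addition": test_addition,
    "helper": helper,
    "test_upper": test_upper,
    "test_data": [1, 2, 3],  # starts with "test" but is not callable
}

for name in collect_tests(namespace):
    namespace[name]()  # run each collected test; a failing assert would raise
    print(f"{name} passed")
```

Only test_addition and test_upper are picked up: helper does not match the prefix, and test_data is not a function.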

What output does pytest produce if the test fails?

#sentiment.py

def test_extract_sentiment():

    text = "I think today will be a great day"

    sentiment = extract_sentiment(text)

    assert sentiment < 0

$ pytest sentiment.py

========================================= test session starts ==========================================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
collected 1 item

process.py F                                                                                     [100%]
=============================================== FAILURES ===============================================
________________________________________ test_extract_sentiment ________________________________________

def test_extract_sentiment():
    
        text = "I think today will be a great day"
    
        sentiment = extract_sentiment(text)
    
>       assert sentiment < 0
E       assert 0.8 < 0

process.py:17: AssertionError
======================================= short test summary info ========================================
FAILED process.py::test_extract_sentiment - assert 0.8 < 0
========================================== 1 failed in 0.84s ===========================================

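From the output we can see that the test failed because the function returned a sentiment of 0.8, which is not less than 0. We learn not only whether our function works as expected but also why it fails. With that insight, we know where to fix the function so it behaves the way we want.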
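When a bare comparison is not descriptive enough, Python's assert statement accepts an optional message, and pytest includes it in the failure report. The sketch below uses toy_sentiment, a hypothetical word-counting stand-in for extract_sentiment, so that it runs without TextBlob; its scores are not TextBlob's.

```python
# Hypothetical word lists for the stand-in scorer (not TextBlob's lexicon).
POSITIVE = {"great", "happy", "good"}
NEGATIVE = {"bad", "sad", "awful"}


def toy_sentiment(text: str) -> float:
    """Return (positive words - negative words) / total words, or 0.0 for empty text."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)


def test_toy_sentiment_positive():
    text = "I think today will be a great day"
    sentiment = toy_sentiment(text)
    # The message after the comma is shown when the assertion fails.
    assert sentiment > 0, f"expected positive polarity for {text!r}, got {sentiment}"
```

If the assertion fails, the report states which text was scored and what value came back, which saves a round of debugging.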

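Multiple Tests for the Same Function

We might want to test our function with other examples. What should the new test functions be called?

The second function could be named something like test_extract_sentiment_2, or test_extract_sentiment_negative if we want to test the function on a text with negative sentiment. Any function name works as long as it starts with test.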

#sentiment.py

def test_extract_sentiment_positive():

    text = "I think today will be a great day"

    sentiment = extract_sentiment(text)

    assert sentiment > 0

def test_extract_sentiment_negative():

    text = "I do not think this will turn out well"

    sentiment = extract_sentiment(text)

    assert sentiment < 0

$ pytest sentiment.py

========================================= test session starts ==========================================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
collected 2 items

process.py .F                                                                                    [100%]
=============================================== FAILURES ===============================================
___________________________________ test_extract_sentiment_negative ____________________________________

def test_extract_sentiment_negative():
    
        text = "I do not think this will turn out well"
    
        sentiment = extract_sentiment(text)
    
>       assert sentiment < 0
E       assert 0.0 < 0

process.py:25: AssertionError
======================================= short test summary info ========================================
FAILED process.py::test_extract_sentiment_negative - assert 0.0 < 0
===================================== 1 failed, 1 passed in 0.80s ======================================

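From the output, we know that one test passed and one test failed, and why the test failed. We expected the sentence "I do not think this will turn out well" to be negative, but its polarity turned out to be 0.

This helps us understand that the function might not be accurate 100% of the time; thus, we should be cautious when using it to extract the sentiment of a text.

Parametrization: Combining Tests

The 2 test functions above test the same function. Is there a way to combine the 2 examples into one test function? That is when parametrization comes in handy.

Parametrize with a List of Samples

With pytest.mark.parametrize(), we can run a test with different examples by providing a list of examples in the argument.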

# sentiment.py

from textblob import TextBlob
import pytest

def extract_sentiment(text: str):
    '''Extract sentiment using textblob.
    Polarity is within the range [-1, 1].'''

    text = TextBlob(text)

    return text.sentiment.polarity

testdata = ["I think today will be a great day", "I do not think this will turn out well"]

@pytest.mark.parametrize('sample', testdata)
def test_extract_sentiment(sample):

    sentiment = extract_sentiment(sample)

    assert sentiment > 0

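In the code above, we assign a list of examples to the variable testdata, pass it to parametrize() under the argument name sample, and add sample to the parameters of the test function. Each example is now tested one at a time.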

========================== test session starts ===========================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
collected 2 items

sentiment.py .F                                                    [100%]

================================ FAILURES ================================
_____ test_extract_sentiment[I do not think this will turn out well] _____

sample = 'I do not think this will turn out well'

@pytest.mark.parametrize('sample', testdata)
    def test_extract_sentiment(sample):
    
        sentiment = extract_sentiment(sample)
    
>       assert sentiment > 0
E       assert 0.0 > 0

sentiment.py:19: AssertionError
======================== short test summary info =========================
FAILED sentiment.py::test_extract_sentiment[I do not think this will turn out well]
====================== 1 failed, 1 passed in 0.80s ===================

Using parametrize(), we can test 2 different examples in one function!
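Conceptually, parametrize runs the same test body once per example and reports each run separately. The plain-Python sketch below mimics that behaviour; run_for_each_sample, check_sample and contains_great are hypothetical names for illustration, and this is not pytest's actual mechanism.

```python
def contains_great(text: str) -> bool:
    """Hypothetical function under test: does the text contain the word 'great'?"""
    return "great" in text.lower().split()


def check_sample(sample):
    """The shared test body, written once."""
    assert contains_great(sample)


def run_for_each_sample(test_body, samples):
    """Run test_body on every sample and record 'passed' or 'failed' per sample."""
    results = {}
    for sample in samples:
        try:
            test_body(sample)
            results[sample] = "passed"
        except AssertionError:
            results[sample] = "failed"
    return results


testdata = ["I think today will be a great day", "I do not think this will turn out well"]

for sample, outcome in run_for_each_sample(check_sample, testdata).items():
    print(f"[{sample}] {outcome}")
```

A failure on one example does not stop the others from running, which is the same property that makes the parametrized pytest report easy to read.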

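Parametrize with a List of Examples and Expected Outputs

What if we expect different examples to have different outputs? Pytest also allows us to add both examples and expected outputs to the arguments of our test function!

For example, the function below checks whether a text contains a particular word.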

def text_contain_word(word: str, text: str):
    '''Check whether the text contains a particular word'''
    
    return word in text

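It returns True if the text contains the word.

If the word is "duck" and the text is "There is a duck in this text", we expect it to return True.

If the word is "duck" and the text is "There is nothing here", we expect it to return False.

We will use parametrize() again, this time with a list of tuples.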

# process.py
import pytest

def text_contain_word(word: str, text: str):
    '''Check whether the text contains a particular word'''

    return word in text

testdata = [
    ('There is a duck in this text', True),
    ('There is nothing here', False)
]

@pytest.mark.parametrize('sample, expected_output', testdata)
def test_text_contain_word(sample, expected_output):

    word = 'duck'

    assert text_contain_word(word, sample) == expected_output

The arguments are structured as parametrize('sample, expected_output', testdata), with testdata = [(<sample1>, <output1>), (<sample2>, <output2>)].

$ pytest process.py

========================================= test session starts ==========================================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
plugins: hydra-core-1.0.0, Faker-4.1.1
collected 2 items

process.py ..                                                                                    [100%]

========================================== 2 passed in 0.04s ===========================================

Both of our tests passed!
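The same pattern extends to more than one input. As a sketch, the variant below also parametrizes the word, so each tuple carries (word, sample, expected_output). The test name and the third case are illustrative additions, not part of the original script.

```python
import pytest


def text_contain_word(word: str, text: str):
    '''Check whether the text contains a particular word'''
    return word in text


# Each tuple is (word, sample, expected_output).
testdata = [
    ('duck', 'There is a duck in this text', True),
    ('duck', 'There is nothing here', False),
    ('here', 'There is nothing here', True),
]


@pytest.mark.parametrize('word, sample, expected_output', testdata)
def test_text_contain_word_multiple_words(word, sample, expected_output):
    assert text_contain_word(word, sample) == expected_output
```

Note that word in text is a substring check, so 'here' would also match inside 'There'. If whole-word matching matters for your data, that is exactly the kind of edge case worth adding to testdata.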

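Test One Function at a Time

As the number of test functions in your script grows, you might want to test one function at a time instead of all of them. This is easy with pytest: run pytest file.py::function_name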

testdata = ["I think today will be a great day", "I do not think this will turn out well"]

@pytest.mark.parametrize('sample', testdata)
def test_extract_sentiment(sample):

    sentiment = extract_sentiment(sample)

    assert sentiment > 0


testdata = [
    ('There is a duck in this text', True),
    ('There is nothing here', False)
]

@pytest.mark.parametrize('sample, expected_output', testdata)
def test_text_contain_word(sample, expected_output):

    word = 'duck'

    assert text_contain_word(word, sample) == expected_output

For example, if you only want to run test_text_contain_word, run

pytest process.py::test_text_contain_word

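and pytest will execute only the one test we specified!

Fixture: Use the Same Data to Test Different Functions

What if we want to use the same data to test different functions? For example, we want to test whether the sentence "Today I found a duck and I am happy" contains the word "duck" and whether its sentiment is positive. This is when fixtures come in handy.

Pytest fixtures are a way of providing data to different test functions.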

@pytest.fixture
def example_data():
    return 'Today I found a duck and I am happy'


def test_extract_sentiment(example_data):

    sentiment = extract_sentiment(example_data)

    assert sentiment > 0

def test_text_contain_word(example_data):

    word = 'duck'

    assert text_contain_word(word, example_data) == True

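In the example above, we create the example data by placing the decorator @pytest.fixture above the function example_data. Pytest then calls example_data for each test that lists it as a parameter and passes in its return value, 'Today I found a duck and I am happy'.

Now we can use example_data as an argument in any test!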
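By default a fixture has function scope, so pytest calls it again for every test that requests it. The sketch below, with hypothetical fixture and test names, shows why a test that mutates the returned object does not affect the next test.

```python
import pytest


@pytest.fixture
def example_words():
    # Called once per requesting test by default, so each test gets a new list.
    return ["duck", "happy"]


def test_mutating_the_list(example_words):
    example_words.append("extra")
    assert len(example_words) == 3


def test_gets_a_fresh_list(example_words):
    # Passes under pytest regardless of test order, because the fixture
    # function runs again and builds a new list for this test.
    assert example_words == ["duck", "happy"]
```

This matters most for mutable data such as lists, dictionaries, or DataFrames; for an immutable string like the one in the article, sharing would be harmless anyway.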

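Structure Your Projects

Last but not least, as our code grows, we might want to put the data science functions and the test functions in 2 different folders. This makes it easier to find where each function lives.

Name our test files test_<name>.py or <name>_test.py. Pytest will search for files whose names start with "test_" or end with "_test" and execute the functions in those files whose names start with "test". How convenient!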
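By default pytest collects files matching test_*.py or *_test.py. The helper below (looks_like_test_file is a hypothetical name) only illustrates that naming rule with the standard library; it is not pytest's collection code.

```python
from fnmatch import fnmatch

# The two filename patterns pytest looks for by default.
DEFAULT_PATTERNS = ("test_*.py", "*_test.py")


def looks_like_test_file(filename: str) -> bool:
    """Return True if filename matches one of the default test-file patterns."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PATTERNS)


for name in ["test_process.py", "process_test.py", "process.py"]:
    print(name, looks_like_test_file(name))
```

A file named process.py is therefore skipped when pytest scans a directory, even if it contains test functions, unless you pass it explicitly on the command line as in the earlier sections.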

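There are different ways to organize your files. You can keep your data science files and test files in the same directory, or put them in 2 different directories, one for source code and one for tests.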

Method 1:

test_structure_example/
├── process.py
└── test_process.py

Method 2:

test_structure_example/
├── src
│   └── process.py
└── tests
    └── test_process.py

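Since you will most likely have multiple files for data science functions and multiple files for test functions, you might want to put them in different directories, as in method 2.

This is what the 2 files look like: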

# src/process.py
from textblob import TextBlob

def extract_sentiment(text: str):
    '''Extract sentiment using textblob.
    Polarity is within the range [-1, 1].'''

    text = TextBlob(text)

    return text.sentiment.polarity

# tests/test_process.py
import sys
import os.path
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
from src.process import extract_sentiment
import pytest


def test_extract_sentiment():

    text = 'Today I found a duck and I am happy'

    sentiment = extract_sentiment(text)

    assert sentiment > 0

Simply put, adding sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir))) puts the parent directory on the import path, so we can import functions from it.
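The same path manipulation can be written with pathlib, which some readers find easier to follow. This is an alternative sketch, not the article's original code; project_root is a hypothetical helper.

```python
from pathlib import Path


def project_root(test_file) -> Path:
    """Return the directory one level above the folder that contains test_file."""
    return Path(test_file).resolve().parent.parent


# Example with the layout from method 2 (the path is illustrative and need not exist).
root = project_root("test_structure_example/tests/test_process.py")
print(root.name)

# In tests/test_process.py one would pass __file__ instead:
# import sys
# sys.path.append(str(project_root(__file__)))
# from src.process import extract_sentiment
```

Both versions add the directory that contains src/ to the import path, which is what makes from src.process import extract_sentiment resolve.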

From the root directory (test_structure_example/), run pytest tests/test_process.py, or from the test_structure_example/tests directory, run pytest test_process.py.

========================== test session starts ===========================
platform linux -- Python 3.8.3, pytest-5.4.2, py-1.8.1, pluggy-0.13.1
collected 1 item

tests/test_process.py .                                            [100%]

=========================== 1 passed in 0.69s ============================

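Pretty cool!

Conclusion

You have just learned about pytest. I hope this article gives you a good overview of why testing is important and how to incorporate testing with pytest into your data science projects. With testing, you are able not only to know whether your functions work as expected, but also to feel confident swapping the existing code for different tools or a different code structure.

The source code of this article can be found here:

https://github.com/khuyentran1401/Data-science/tree/master/data_science_tools/pytest

I like to write about basic data science concepts and play with different algorithms and data science tools.

Original article: https://towardsdatascience.com/pytest-for-data-scientists-2990319e55e6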
