Is there a Python library that generates vectors based on a superset?

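I need to generate a vector for each sample in a dataset, based on the dataset's full set of features. Suppose the dataset has 6 features:
features = ['a', 'b', 'c', 'd', 'e', 'f']
A sample s1 has only 3 of them:
s1 = ['a', 'b', 'c']
I want to generate a vector for s1 that marks which features are present:
s1 = [1, 1, 1, 0, 0, 0]
Another example: for s2 = ['a', 'c', 'f'], the vector should be [1, 0, 1, 0, 0, 1].
Is there a Python library that can do this? If not, how should I go about it?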
Answer from a uj5u.com user:
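This may not be the most efficient approach, but if you want to build a vector for every sample in the dataset, you only need to create a binary array for each number from 0 to 2⁶ − 1, which covers every possible combination of the 6 features: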
features = ['a', 'b', 'c', 'd', 'e', 'f']
l = len(features)
# Format each x as a zero-padded binary string of width l, then split it into 0/1 ints.
vectors = [[int(y) for y in f'{x:0{l}b}'] for x in range(2 ** l)]

print(vectors)
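The bit patterns in this answer enumerate every possible subset of the feature list, so each vector still has to be matched to an actual sample. The following is a minimal sketch of that mapping using `itertools.compress`; the helper name `find_vector` is hypothetical and not part of the original answer.

```python
from itertools import compress

features = ['a', 'b', 'c', 'd', 'e', 'f']
n = len(features)

# Enumerate all 2**n indicator vectors as zero-padded binary strings.
vectors = [[int(bit) for bit in f'{x:0{n}b}'] for x in range(2 ** n)]

# Recover the feature subset that each indicator vector stands for.
subsets = [list(compress(features, vec)) for vec in vectors]


def find_vector(sample):
    """Return the enumerated vector whose subset equals `sample` (hypothetical helper)."""
    wanted = set(sample)
    return next(vec for vec, sub in zip(vectors, subsets) if set(sub) == wanted)


print(find_vector(['a', 'c', 'f']))  # [1, 0, 1, 0, 0, 1]
```

Scanning all 2**n vectors for one sample is wasteful, so this is only meant to show how the enumeration relates to the question.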

Answer from a uj5u.com user:
This is quite straightforward and doesn't really need a library.
Pure Python solution
features = ['a', 'b', 'c', 'd', 'e', 'f']
# Map each feature name to its index: {'a': 0, 'b': 1, ...}.
features_lookup = dict(map(reversed, enumerate(features)))


s1 = ['a', 'b', 'c']
s2 = ['a', 'c', 'f']


def create_feature_vector(sample, lookup):
    vec = [0] * len(lookup)
    for value in sample:
        # Set the position that belongs to this feature.
        vec[lookup[value]] = 1
    return vec

Output:
>>> create_feature_vector(s1, features_lookup)
[1, 1, 1, 0, 0, 0]

>>> create_feature_vector(s2, features_lookup)
[1, 0, 1, 0, 0, 1]
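A variant of the same idea, sketched here as an assumption rather than as part of the answer, skips the lookup dictionary and tests set membership instead. It also raises a clear error when a sample contains a feature that is not in the feature list; the name `encode` is hypothetical.

```python
features = ['a', 'b', 'c', 'd', 'e', 'f']


def encode(sample, features):
    """Return a 0/1 list marking which entries of `features` occur in `sample`."""
    present = set(sample)
    unknown = present.difference(features)
    if unknown:
        # Fail loudly instead of silently ignoring unexpected features.
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [int(f in present) for f in features]


s1 = ['a', 'b', 'c']
s2 = ['a', 'c', 'f']
print(encode(s1, features))  # [1, 1, 1, 0, 0, 0]
print(encode(s2, features))  # [1, 0, 1, 0, 0, 1]
```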

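Numpy alternative for a single feature vector
If you happen to be using numpy already, this will be the more efficient approach when your feature set is large: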
 
import numpy as np


features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
sample_size = 3


def feature_sample_and_vector(sample_size, features):
    n = features.size
    # Pick sample_size distinct feature indices at random.
    sample_indices = np.random.choice(range(n), sample_size, replace=False)
    sample = features[sample_indices]
    vector = np.zeros(n, dtype="uint8")
    vector[sample_indices] = 1
    return sample, vector
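The function above draws a random sample. For the case in the question, where the sample is already given, a minimal sketch could use `np.isin` to build the vector directly. The helper name `encode_sample` is hypothetical and not part of the answer.

```python
import numpy as np

features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])


def encode_sample(sample, features):
    # np.isin tests each entry of `features` for membership in `sample`.
    return np.isin(features, sample).astype(np.uint8)


vec = encode_sample(['a', 'c', 'f'], features)
print(vec)  # [1 0 1 0 0 1]
```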

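Numpy alternative for a large number of samples and their feature vectors
Using numpy lets us scale well to large feature sets and/or large sample sets. Note that this approach can produce duplicate samples: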
import random
import numpy as np


# Assumes features is already a numpy array.
def generate_samples(features, num_samples, sample_size):
    n = features.size
    vectors = np.zeros((num_samples, n), dtype="uint8")
    idxs = [random.sample(range(n), k=sample_size) for _ in range(num_samples)]
    cols = np.sort(np.array(idxs), axis=1)  # You can drop the sort if the order of the features doesn't matter.
    rows = np.repeat(np.arange(num_samples).reshape(-1, 1), sample_size, axis=1)
    vectors[rows, cols] = 1
    samples = features[cols]
    return samples, vectors

Demo:
>>> generate_samples(features, 10, 3)
(array([['d', 'e', 'f'],
        ['a', 'b', 'c'],
        ['c', 'd', 'e'],
        ['c', 'd', 'f'],
        ['a', 'b', 'f'],
        ['a', 'e', 'f'],
        ['c', 'd', 'f'],
        ['b', 'e', 'f'],
        ['b', 'd', 'f'],
        ['a', 'c', 'e']], dtype='<U1'),
 array([[0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 1, 1, 0, 1],
        [1, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 1],
        [1, 0, 1, 0, 1, 0]], dtype=uint8))

A very simple timing benchmark, drawing 100,000 samples of size 12 from a set of 26 features:
In [2]: features = np.array(list("abcdefghijklmnopqrstuvwxyz"))

In [3]: num_samples = 100000

In [4]: sample_size = 12

In [5]: %timeit generate_samples(features, num_samples, sample_size)
645 ms ± 9.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

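The only real bottleneck is the list comprehension needed to generate the indices. Unfortunately, np.random.choice() has no 2-D variant for drawing samples without replacement, so you still have to fall back on a relatively slow method to generate the random sample indices.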
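One possible way around the list comprehension, offered as a sketch and not as part of the original answer, is to rank a matrix of random keys with `np.argsort` and keep the first `sample_size` ranks of each row. Every row is then a set of distinct indices drawn without replacement. The function name `generate_samples_vectorized` and the `seed` parameter are assumptions, and whether this is actually faster for a given feature count would need to be measured, since it sorts all n keys per row.

```python
import numpy as np


def generate_samples_vectorized(features, num_samples, sample_size, seed=None):
    """Draw num_samples samples without replacement, with no Python-level loop."""
    rng = np.random.default_rng(seed)
    n = features.size
    # Ranking one row of random keys gives a random permutation of 0..n-1;
    # the first sample_size entries are distinct indices.
    perms = np.argsort(rng.random((num_samples, n)), axis=1)
    cols = np.sort(perms[:, :sample_size], axis=1)  # keep the feature order
    vectors = np.zeros((num_samples, n), dtype=np.uint8)
    rows = np.arange(num_samples)[:, None]  # broadcasts against cols
    vectors[rows, cols] = 1
    return features[cols], vectors


features = np.array(list("abcdef"))
samples, vectors = generate_samples_vectorized(features, 5, 3, seed=0)
```

As with the original function, different rows can still contain the same sample.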






        
Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/321107.html