Linux Namespace Primer Series: The Namespace API

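Linux namespaces are a kernel-level mechanism for isolating environments. In the words of the official documentation, a namespace wraps a global system resource in an abstraction, so that processes inside the namespace appear to have their own isolated instance of that resource. The technique attracted little attention at first; it was the rise of container technology that brought it back into the spotlight.

Linux has the following six kinds of namespaces: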

Type	System call flag	Kernel version
Mount namespaces	CLONE_NEWNS	Linux 2.4.19
UTS namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC namespaces	CLONE_NEWIPC	Linux 2.6.19
PID namespaces	CLONE_NEWPID	Linux 2.6.24
Network namespaces	CLONE_NEWNET	Started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces	CLONE_NEWUSER	Started in Linux 2.6.23, completed in Linux 3.8

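The namespace API consists of three system calls and a set of /proc files, all of which this article covers in detail. The type of namespace to operate on is selected by passing CLONE_NEW* constants in the system call's flags argument (CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS). Several constants can be combined with | (bitwise OR).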

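In brief, the three system calls do the following:

clone() : the system call that underlies thread creation. It creates a new process, and passing the flags above places that process in new namespaces, which isolates it.
unshare() : moves the calling process out of a namespace it currently shares.
setns() : moves the calling process into an existing namespace.

The details follow.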

1. clone()

The prototype of clone() is as follows:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

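child_func : the function the child process starts executing.
child_stack : the stack space the child process uses.
flags : the CLONE_* flags to apply.
arg : a user-supplied argument passed to child_func.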

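Like fork(), clone() essentially creates a copy of the calling process. Unlike fork(), it gives finer-grained control over which resources the child shares with the parent (through flags), including virtual memory, open file descriptors, and signal handlers. When a CLONE_NEW* flag is specified, a new namespace of the corresponding type is created and the new process becomes a member of it.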

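The prototype above is the wrapper function, not the raw system call. Inside older kernels, the function that actually implements the system call is do_fork(), which looks like this: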

long do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr)

The clone_flags argument can carry the flags mentioned above.

Here is an example:

/* demo_uts_namespaces.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Demonstrate the operation of UTS namespaces.
*/
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change the hostname in the new UTS namespace */

    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Retrieve and display the hostname */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */
     
    sleep(100);

    return 0;           /* Terminates child */
}

/* Stack for the clone()d child; 1 MiB */
#define STACK_SIZE (1024 * 1024) 

static char child_stack[STACK_SIZE];

int
main(int argc, char *argv[])
{
    pid_t child_pid;
    struct utsname uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

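    /* Create a child that runs in a new UTS namespace. clone() takes the
       child's start function and a stack; the child begins execution in
       the user-defined function childFunc() */

    child_pid = clone(childFunc, 
                    child_stack + STACK_SIZE,   /* The stack grows downward, 
                                                   so pass a pointer to its top */ 
                    CLONE_NEWUTS | SIGCHLD, argv[1]);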
    if (child_pid == -1)
        errExit("clone");
    printf("PID of child created by clone() is %ld\n", (long) child_pid);

    /* Parent falls through to here */

    sleep(1);           /* Give the child time to change its hostname */

    /* Display the hostname in the parent's UTS namespace. It differs
       from the hostname in the child's UTS namespace */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for the child to terminate */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

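The program calls clone() with the CLONE_NEWUTS flag to create a UTS namespace. A UTS namespace isolates two system identifiers, the hostname and the NIS domain name, which are set with the sethostname() and setdomainname() system calls and retrieved with uname().

The key parts of the program are explained below (error checking is omitted for brevity).

The program takes one command-line argument. It creates a child process that runs in a new UTS namespace, and the child changes the hostname in that namespace to the value supplied on the command line.

The first key step in the main program is the clone() call that creates the child: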

child_pid = clone(childFunc, 
                  child_stack + STACK_SIZE,   /* Points to start of 
                                                 downwardly growing stack */ 
                  CLONE_NEWUTS | SIGCHLD, argv[1]);

printf("PID of child created by clone() is %ld\n", (long) child_pid);

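The child starts executing in the user-defined function childFunc(), which receives the last clone() argument (argv[1]) as its own argument. Because flags includes CLONE_NEWUTS, the child runs in a newly created UTS namespace.

The parent then sleeps briefly to give the child time to change the hostname in its UTS namespace. Afterwards it calls uname() to retrieve the hostname in its own UTS namespace and displays it: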

sleep(1);           /* Give child time to change its hostname */

uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);

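Meanwhile, childFunc(), executed by the child that clone() created, first sets the hostname to the value supplied on the command line, then retrieves and displays the new hostname:

sethostname(arg, strlen(arg));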
    
uname(&uts);
printf("uts.nodename in child:  %s\n", uts.nodename);

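The child also sleeps for a while before exiting. This keeps the new UTS namespace from being torn down and leaves time for the experiments that follow.

Run the program and check whether the parent and the child are in different UTS namespaces:

$ su                   # Privilege is required to create a UTS namespace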
Password: 
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child:  bizarro
uts.nodename in parent: antero

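Creating any namespace other than a user namespace requires privilege, or more precisely the corresponding Linux capability, CAP_SYS_ADMIN. This prevents set-user-ID (SUID, Set User ID on execution) programs from being tricked into doing something foolish because the hostname is not what they expect. If you are not familiar with Linux capabilities, see my earlier article: Linux Capabilities Primer: Concepts.

2. /proc files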

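Every process has a /proc/PID/ns directory containing one file per namespace type; for example, user represents the user namespace. Since Linux 3.8, each file in this directory is a special symbolic link of the form $namespace:[$namespace-inode-number]. The first part is the namespace type, and the number in the second part is the inode number that identifies the namespace. The link serves as a handle for performing certain operations on the namespace associated with the process.

$ ls -l /proc/$$/ns         # $$ is the PID of the current shell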
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

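One use of these symbolic links is to find out whether two processes are in the same namespace. If the namespace inode numbers in the two processes' links are equal, the processes are in the same namespace; otherwise they are in different namespaces. The files the links point to are special and cannot be reached by an ordinary path: they live in a filesystem called nsfs, which is not visible to users. The inode number can be read from the st_ino field of the structure returned by the stat() system call. From a shell, the stat command (which calls stat() under the hood) shows the inode information of the target file: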

$ stat -L /proc/$$/ns/net
  File: /proc/3232/ns/net
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d	Inode: 4026531956  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-17 15:45:23.783304900 +0800
Modify: 2020-01-17 15:45:23.783304900 +0800
Change: 2020-01-17 15:45:23.783304900 +0800
 Birth: -

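The symbolic links have another use as well. If one of these files is opened, the namespace continues to exist for as long as the file descriptor stays open, even after every process in the namespace has terminated. Bind mounting one of the links elsewhere in the filesystem has the same effect: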

$ touch ~/uts
$ mount --bind /proc/27514/ns/uts ~/uts

3. setns()

An existing namespace can be joined with the setns() system call, whose prototype is:

int setns(int fd, int nstype);

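More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates it with another instance of the same type.

fd is a file descriptor referring to the namespace to join. It can be obtained by opening one of the symbolic links in /proc/PID/ns, or by opening a file onto which one of those links has been bind mounted.
nstype lets the caller check the type of namespace that fd refers to. It can be set to one of the CLONE_NEW* constants mentioned earlier, in which case the call fails if fd refers to a namespace of a different type. A value of 0 disables the check, which is useful when the caller already knows the namespace type or does not care about it.

Combining setns() with execve() gives a simple but very useful tool: a program that joins a specified namespace and then executes a command inside it. Here is the example: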

/* ns_exec.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Join a namespace and execute a command in the namespace
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd [arg...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);   /* Get a file descriptor for the namespace to join */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1)         /* Join that namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */
    errExit("execvp");
}

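The program takes two or more command-line arguments. The first is the path of a namespace symbolic link (or of a file bind mounted onto one of those links). The remaining arguments are the name of the program to execute in the corresponding namespace, followed by any command-line arguments that program needs. The key steps are:

fd = open(argv[1], O_RDONLY);   /* Get a file descriptor for the namespace to join */

setns(fd, 0);                   /* Join that namespace */

execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */

Recall that the UTS namespace created by demo_uts_namespaces was bind mounted at ~/uts earlier. The program from this example can be combined with that mount to run a shell in the same UTS namespace:

$ ./ns_exec ~/uts /bin/bash     # ~/uts is bind mounted to /proc/27514/ns/uts
My PID is: 28788

Verify that the new shell is in the same UTS namespace as the child created by demo_uts_namespaces: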

$ hostname
bizarro
$ readlink /proc/27514/ns/uts
uts:[4026532338]
$ readlink /proc/$$/ns/uts      # $$ is the PID of the current shell
uts:[4026532338]

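In early kernel versions, setns() could not be used to join mount, PID, or user namespaces. Since Linux 3.8, setns() supports joining every namespace type.

The util-linux package provides the nsenter command, which runs a new process inside specified namespaces. Its implementation is simple: the -t option names the target process whose namespaces are to be entered, nsenter opens that process's /proc/PID/ns links and calls setns() on each to move itself into those namespaces, and then it creates a child (shown as clone() in the trace) to run the specified executable. strace shows it at work: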

# strace nsenter -t 27242 -i -m -n -p -u /bin/bash
execve("/usr/bin/nsenter", ["nsenter", "-t", "27242", "-i", "-m", "-n", "-p", "-u", "/bin/bash"], [/* 21 vars */]) = 0
…………
…………
open("/proc/27242/ns/ipc", O_RDONLY)    = 3
open("/proc/27242/ns/uts", O_RDONLY)    = 4
open("/proc/27242/ns/net", O_RDONLY)    = 5
open("/proc/27242/ns/pid", O_RDONLY)    = 6
open("/proc/27242/ns/mnt", O_RDONLY)    = 7
setns(3, CLONE_NEWIPC)                  = 0
close(3)                                = 0
setns(4, CLONE_NEWUTS)                  = 0
close(4)                                = 0
setns(5, CLONE_NEWNET)                  = 0
close(5)                                = 0
setns(6, CLONE_NEWPID)                  = 0
close(6)                                = 0
setns(7, CLONE_NEWNS)                   = 0
close(7)                                = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4deb1faad0) = 4968

4. unshare()

The last system call to cover is unshare(), whose prototype is:

int unshare(int flags);

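unshare() is similar to clone(), but it operates on the calling process instead of creating a new one. It first creates new namespaces as specified by the CLONE_NEW* bits in flags, then makes the caller a member of them. The net effect is that the caller is detached from its current namespace and placed in a new one.

The unshare command that ships with Linux is built on the unshare() system call. It is used as follows: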

$ unshare [options] program [arguments]

options specifies the types of namespace to create.

The core of the unshare command is:

/* Initialize 'flags' from the command-line options provided */

unshare(flags);

/* Now execute 'program' with 'arguments'; 'optind' is the index
   of the next command-line argument after options */

execvp(argv[optind], &argv[optind]);

A complete, simple implementation of the unshare command follows:

/* unshare.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   A simple implementation of the unshare(1) command: unshare
   namespaces and execute a command.
*/

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    while ((opt = getopt(argc, argv, "imnpuU")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    execvp(argv[optind], &argv[optind]);  
    errExit("execvp");
}

Now use the unshare.c program to run a shell in a new mount namespace:

$ echo $$                             # Show the PID of the current shell
8490
$ cat /proc/8490/mounts | grep mq     # Show one of the mount points in the current namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
$ readlink /proc/8490/ns/mnt          # Show the ID of the current namespace
mnt:[4026531840]
$ ./unshare -m /bin/bash              # Run a new shell in a newly created mount namespace
$ readlink /proc/$$/ns/mnt            # Show the ID of the new namespace
mnt:[4026532325]

Comparing the output of the two readlink commands shows that the two shells are in different mount namespaces. Next, change a mount point in the new namespace and see whether the change shows up in both namespaces:

$ umount /dev/mqueue                  # Remove the mount point in the new namespace
$ cat /proc/$$/mounts | grep mq       # Check that it took effect
$ cat /proc/8490/mounts | grep mq     # Is the mount point still present in the original namespace?
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As the output shows, the /dev/mqueue mount point has disappeared from the new namespace but still exists in the original one.

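5. Summary

This article has examined each component of the namespace API and used them together. Later articles will look more closely at individual namespaces, especially PID namespaces and user namespaces.

References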

Namespaces in operation, part 2: the namespaces API
Docker Fundamentals: Linux Namespace (Part 1)
