Loop takes more cycles to execute than expected on an ARM Cortex-A72 CPU

2021-11-16 17:34:03 Backend development

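Consider the following code, running on an ARM Cortex-A72 processor (see the Cortex-A72 optimization guide). I have included what I expect to be the resource pressure on each execution port: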

    Instruction                     B    I0   I1   M    L    S    F0   F1
    .LBB0_1:
    ldr     q3, [x1], #16                0.5  0.5       1
    ldr     q4, [x2], #16                0.5  0.5       1
    add     x8, x8, #4                   0.5  0.5
    cmp     x8, #508                     0.5  0.5
    mul     v5.4s, v3.4s, v4.4s                                   2
    mul     v5.4s, v5.4s, v0.4s                                   2
    smull   v6.2d, v5.2s, v1.2s                                   1
    smull2  v5.2d, v5.4s, v2.4s                                   1
    smlal   v6.2d, v3.2s, v4.2s                                   1
    smlal2  v5.2d, v3.4s, v4.4s                                   1
    uzp2    v3.4s, v6.4s, v5.4s                                        1
    str     q3, [x0], #16                0.5  0.5            1
    b.lo    .LBB0_1                 1
    Total port pressure             1    2.5  2.5  0    2    1    8    1

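Although uzp2 can run on either the F0 or the F1 port, I chose to attribute it entirely to F1, given the high pressure on F0 and the fact that F1 carries no pressure other than this instruction.

There are no dependencies between loop iterations other than the loop counter and the array pointers; these should resolve very quickly compared to the time taken by the rest of the loop body.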

Thus, my intuition is that this code should be throughput limited and, since the worst pressure is on F0, should run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern and the fact that the arrays comfortably fit in the L1 cache. As for the former, considering the constraints listed in section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
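The 8-cycle estimate can be reproduced by summing the per-port pressure of each instruction and taking the busiest port. The following is only a sketch of that bookkeeping: the per-instruction numbers are transcribed from the table above, the port letters (B, I0, I1, M, L, S, F0, F1) are my reading of its columns, and the helper name port_totals is made up for illustration.

```python
from collections import defaultdict

# Port pressure per instruction, transcribed from the table in the question.
# Port letters are an assumption about the table columns:
# B (branch), I0/I1 (integer), M (multi-cycle), L (load), S (store), F0/F1 (FP/ASIMD).
LOOP_BODY = [
    ("ldr q3, [x1], #16",          {"I0": 0.5, "I1": 0.5, "L": 1}),
    ("ldr q4, [x2], #16",          {"I0": 0.5, "I1": 0.5, "L": 1}),
    ("add x8, x8, #4",             {"I0": 0.5, "I1": 0.5}),
    ("cmp x8, #508",               {"I0": 0.5, "I1": 0.5}),
    ("mul v5.4s, v3.4s, v4.4s",    {"F0": 2}),
    ("mul v5.4s, v5.4s, v0.4s",    {"F0": 2}),
    ("smull v6.2d, v5.2s, v1.2s",  {"F0": 1}),
    ("smull2 v5.2d, v5.4s, v2.4s", {"F0": 1}),
    ("smlal v6.2d, v3.2s, v4.2s",  {"F0": 1}),
    ("smlal2 v5.2d, v3.4s, v4.4s", {"F0": 1}),
    ("uzp2 v3.4s, v6.4s, v5.4s",   {"F1": 1}),
    ("str q3, [x0], #16",          {"I0": 0.5, "I1": 0.5, "S": 1}),
    ("b.lo .LBB0_1",               {"B": 1}),
]


def port_totals(body):
    """Sum the cycles each port is busy over one loop iteration."""
    totals = defaultdict(float)
    for _, ports in body:
        for port, cycles in ports.items():
            totals[port] += cycles
    return dict(totals)


totals = port_totals(LOOP_BODY)
bottleneck = max(totals, key=totals.get)
print(totals)
print("bottleneck:", bottleneck, "->", totals[bottleneck], "cycles per iteration")
```

Under these assumptions the busiest port is F0 with 8 cycles per iteration, which is the throughput bound used in the rest of the question.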

Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.

Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this also doubles the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.

In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?

EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for, say, n = 512, and then 10 times for n = 256. I take the minimum cycle count of the n = 512 runs and subtract the minimum of the n = 256 runs from it. The difference should tell me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D caches. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.
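As a rough illustration of this differencing scheme, the sketch below turns two sets of cycle counts into a per-iteration figure. The function name and the sample values are hypothetical: the numbers are synthetic, built from an assumed fixed overhead plus 12.5 cycles per iteration, and are not measurements. It also assumes 4 elements per loop iteration, as in the loop shown in the question.

```python
def cycles_per_iteration(samples_large, samples_small,
                         n_large=512, n_small=256, elems_per_iter=4):
    """Estimate cycles per loop iteration from two sets of cycle counts.

    Taking the minimum of each set filters out noise, and subtracting the
    two minima cancels any fixed setup cost common to both runs.
    """
    delta_cycles = min(samples_large) - min(samples_small)
    delta_iters = (n_large - n_small) // elems_per_iter
    return delta_cycles / delta_iters


# Synthetic cycle counts (NOT real measurements), for illustration only.
runs_512 = [1712, 1700, 1705]
runs_256 = [903, 900, 911]
print(cycles_per_iteration(runs_512, runs_256))
```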

A uj5u.com user replied:

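First off, you can further reduce the theoretical cycle count to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal. This also heavily reduces the register pressure, so that we can unroll even deeper (up to 32 elements per iteration).

And you don't need the v2 coefficients; you can pack them into the upper half of v1 instead.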
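To see why the reordering is safe, here is a scalar model of one 32-bit lane in both instruction orders. It is a sketch under my own naming: a and b stand for one lane of the two inputs, and c0 and c1 for the matching lanes of the v0 and v1 coefficients. It relies on the fact that the low 32 bits of the widening product (what uzp1 extracts) are the same value the first mul would have produced.

```python
import random


def s32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & (1 << 31) else x


def s64(x):
    """Wrap an integer to a signed 64-bit value."""
    x &= 0xFFFFFFFFFFFFFFFF
    return x - (1 << 64) if x & (1 << 63) else x


def lane_original(a, b, c0, c1):
    """mul, mul, smull, smlal, then uzp2 (one 32-bit lane)."""
    t = s32(a * b)              # mul: low 32 bits of a*b
    m = s32(t * c0)             # mul by the v0 coefficient
    acc = s64(m * c1)           # smull: widening signed multiply
    acc = s64(acc + a * b)      # smlal: widening multiply-accumulate
    return s32(acc >> 32)       # uzp2 keeps the high 32 bits


def lane_reordered(a, b, c0, c1):
    """smull, uzp1, mul, smlal, then uzp2 (one 32-bit lane)."""
    acc = s64(a * b)            # smull: full 64-bit product
    t = s32(acc)                # uzp1: low 32 bits of that product
    m = s32(t * c0)             # mul by the v0 coefficient
    acc = s64(acc + m * c1)     # smlal: accumulate m*c1 onto a*b
    return s32(acc >> 32)       # uzp2 keeps the high 32 bits


rng = random.Random(0)
for _ in range(10_000):
    a, b, c0, c1 = (s32(rng.getrandbits(32)) for _ in range(4))
    assert lane_original(a, b, c0, c1) == lane_reordered(a, b, c0, c1)
print("both orderings agree on all sampled lanes")
```

The reordered form needs one 2-cycle mul fewer per vector, which is where the drop from 8 to 6 F0 cycles comes from.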

Let's rule everything out by unrolling this deep and writing it in assembly:

    .arch armv8-a
    .global foo
    .text


.balign 64
.func

// void foo(int32_t *pDst, int32_t *pSrc1, int32_t *pSrc2, intptr_t count);
pDst    .req    x0
pSrc1   .req    x1
pSrc2   .req    x2
count   .req    x3

foo:

// initialize coefficients v0 ~ v1

    stp     d8, d9, [sp, #-16]!

.balign 64
1:
    ldp     q16, q18, [pSrc1], #32
    ldp     q17, q19, [pSrc2], #32
    ldp     q20, q22, [pSrc1], #32
    ldp     q21, q23, [pSrc2], #32
    ldp     q24, q26, [pSrc1], #32
    ldp     q25, q27, [pSrc2], #32
    ldp     q28, q30, [pSrc1], #32
    ldp     q29, q31, [pSrc2], #32

    smull   v2.2d, v17.2s, v16.2s
    smull2  v3.2d, v17.4s, v16.4s
    smull   v4.2d, v19.2s, v18.2s
    smull2  v5.2d, v19.4s, v18.4s
    smull   v6.2d, v21.2s, v20.2s
    smull2  v7.2d, v21.4s, v20.4s
    smull   v8.2d, v23.2s, v22.2s
    smull2  v9.2d, v23.4s, v22.4s
    smull   v16.2d, v25.2s, v24.2s
    smull2  v17.2d, v25.4s, v24.4s
    smull   v18.2d, v27.2s, v26.2s
    smull2  v19.2d, v27.4s, v26.4s
    smull   v20.2d, v29.2s, v28.2s
    smull2  v21.2d, v29.4s, v28.4s
    smull   v22.2d, v31.2s, v30.2s
    smull2  v23.2d, v31.4s, v30.4s

    uzp1    v24.4s, v2.4s, v3.4s
    uzp1    v25.4s, v4.4s, v5.4s
    uzp1    v26.4s, v6.4s, v7.4s
    uzp1    v27.4s, v8.4s, v9.4s
    uzp1    v28.4s, v16.4s, v17.4s
    uzp1    v29.4s, v18.4s, v19.4s
    uzp1    v30.4s, v20.4s, v21.4s
    uzp1    v31.4s, v22.4s, v23.4s

    mul     v24.4s, v24.4s, v0.4s
    mul     v25.4s, v25.4s, v0.4s
    mul     v26.4s, v26.4s, v0.4s
    mul     v27.4s, v27.4s, v0.4s
    mul     v28.4s, v28.4s, v0.4s
    mul     v29.4s, v29.4s, v0.4s
    mul     v30.4s, v30.4s, v0.4s
    mul     v31.4s, v31.4s, v0.4s

    smlal   v2.2d, v24.2s, v1.2s
    smlal2  v3.2d, v24.4s, v1.4s
    smlal   v4.2d, v25.2s, v1.2s
    smlal2  v5.2d, v25.4s, v1.4s
    smlal   v6.2d, v26.2s, v1.2s
    smlal2  v7.2d, v26.4s, v1.4s
    smlal   v8.2d, v27.2s, v1.2s
    smlal2  v9.2d, v27.4s, v1.4s
    smlal   v16.2d, v28.2s, v1.2s
    smlal2  v17.2d, v28.4s, v1.4s
    smlal   v18.2d, v29.2s, v1.2s
    smlal2  v19.2d, v29.4s, v1.4s
    smlal   v20.2d, v30.2s, v1.2s
    smlal2  v21.2d, v30.4s, v1.4s
    smlal   v22.2d, v31.2s, v1.2s
    smlal2  v23.2d, v31.4s, v1.4s

    uzp2    v24.4s, v2.4s, v3.4s
    uzp2    v25.4s, v4.4s, v5.4s
    uzp2    v26.4s, v6.4s, v7.4s
    uzp2    v27.4s, v8.4s, v9.4s
    uzp2    v28.4s, v16.4s, v17.4s
    uzp2    v29.4s, v18.4s, v19.4s
    uzp2    v30.4s, v20.4s, v21.4s
    uzp2    v31.4s, v22.4s, v23.4s

    subs    count, count, #32

    stp     q24, q25, [pDst], #32
    stp     q26, q27, [pDst], #32
    stp     q28, q29, [pDst], #32
    stp     q30, q31, [pDst], #32

    b.gt    1b
.balign 16
    ldp     d8, d9, [sp], #16
    ret

.endfunc
.end

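The code above has zero latency stalls even when executed in order. The only thing that could affect performance is the cache miss penalty.

You can measure the cycles, and if the count far exceeds 48 per iteration, there must be something wrong with the chip or the documentation.
Otherwise, the A72's OoO engine might be lackluster, as Peter pointed out.

PS: Or the load and store ports might not issue in parallel on the A72. That would make sense given your unrolling experiments.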

A uj5u.com user replied:

Starting from Jake's code, reducing the unrolling factor by half, changing some of the register allocation, and trying many different variations of load/store instructions (as well as different addressing modes) and instruction scheduling, I finally arrived at the following solution:

    ld1     {v16.4s, v17.4s, v18.4s, v19.4s}, [pSrc1], #64
    ld1     {v20.4s, v21.4s, v22.4s, v23.4s}, [pSrc2], #64

    add     count, pDst, count, lsl #2

    // initialize v0/v1

loop:
    smull   v24.2d, v20.2s, v16.2s
    smull2  v25.2d, v20.4s, v16.4s
    uzp1    v2.4s, v24.4s, v25.4s

    smull   v26.2d, v21.2s, v17.2s
    smull2  v27.2d, v21.4s, v17.4s
    uzp1    v3.4s, v26.4s, v27.4s

    smull   v28.2d, v22.2s, v18.2s
    smull2  v29.2d, v22.4s, v18.4s
    uzp1    v4.4s, v28.4s, v29.4s

    smull   v30.2d, v23.2s, v19.2s
    smull2  v31.2d, v23.4s, v19.4s
    uzp1    v5.4s, v30.4s, v31.4s

    mul     v2.4s, v2.4s, v0.4s
    ldp     q16, q17, [pSrc1]
    mul     v3.4s, v3.4s, v0.4s
    ldp     q18, q19, [pSrc1, #32]
    add     pSrc1, pSrc1, #64

    mul     v4.4s, v4.4s, v0.4s
    ldp     q20, q21, [pSrc2]
    mul     v5.4s, v5.4s, v0.4s
    ldp     q22, q23, [pSrc2, #32]
    add     pSrc2, pSrc2, #64

    smlal   v24.2d, v2.2s, v1.2s
    smlal2  v25.2d, v2.4s, v1.4s
    uzp2    v2.4s, v24.4s, v25.4s
    str     q2, [pDst], #16         // store the uzp2 result

    smlal   v26.2d, v3.2s, v1.2s
    smlal2  v27.2d, v3.4s, v1.4s
    uzp2    v3.4s, v26.4s, v27.4s
    str     q3, [pDst], #16

    smlal   v28.2d, v4.2s, v1.2s
    smlal2  v29.2d, v4.4s, v1.4s
    uzp2    v4.4s, v28.4s, v29.4s
    str     q4, [pDst], #16

    smlal   v30.2d, v5.2s, v1.2s
    smlal2  v31.2d, v5.4s, v1.4s
    uzp2    v5.4s, v30.4s, v31.4s
    str     q5, [pDst], #16

    cmp     count, pDst
    b.ne    loop

Note that, although I have carefully reviewed the code, I haven't tested whether it actually works, so there may be something missing that would impact performance. A final iteration of the loop, with the load instructions removed, is required to prevent an out-of-bounds memory access; I omitted this to save some space.

Performing an analysis similar to that of the original question, and assuming the code is fully throughput-limited, suggests that this loop should take 24 cycles. Normalizing to the same metric used elsewhere (i.e. cycles per 4-element iteration), this works out to 6 cycles/iteration. Benchmarking the code resulted in 26 cycles per loop execution or, in the normalized metric, 6.5 cycles/iteration. While not the bare minimum supposedly achievable, it comes very close.
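The 24-cycle figure, and the 48-cycle figure for the 32-element loop in the other answer, follow from counting F0 issue cycles alone. The sketch below redoes that arithmetic. It assumes the F0 costs used in the question's table (2 cycles for mul, 1 for each of the widening multiplies); the dictionary and function names are made up for illustration.

```python
# F0 cycles per instruction, as assumed in the question's pressure table.
F0_CYCLES = {"mul": 2, "smull": 1, "smull2": 1, "smlal": 1, "smlal2": 1}


def f0_bound(counts):
    """Cycles port F0 is busy for one execution of the loop body."""
    return sum(F0_CYCLES[op] * n for op, n in counts.items())


def per_four_elements(cycles, elements):
    """Normalize to cycles per 4-element iteration."""
    return cycles * 4 / elements


# Instruction counts read off the two unrolled loops.
loop_16 = {"smull": 4, "smull2": 4, "mul": 4, "smlal": 4, "smlal2": 4}
loop_32 = {"smull": 8, "smull2": 8, "mul": 8, "smlal": 8, "smlal2": 8}

print(f0_bound(loop_16), per_four_elements(f0_bound(loop_16), 16))
print(f0_bound(loop_32), per_four_elements(f0_bound(loop_32), 32))
print(per_four_elements(26, 16))  # the measured 26 cycles, normalized
```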

Some notes for anyone else who stumbles across this question, after scratching their heads about Cortex-A72 performance:

  1. The schedulers (reservation stations) are per-port rather than global (see this article and this block diagram). Unless your code has a very balanced instruction mix among loads, stores, scalar, Neon, branches, etc., the OoO window will be smaller than you would expect, sometimes very much so. This code in particular is a pathological case for per-port schedulers, since 70% of all instructions are Neon, and 50% of all instructions are multiplications (which only run on the F0 port). For these multiplications, the OoO window is a very anemic 8 instructions, so don't expect the CPU to be looking at the next loop iteration's instructions while executing the current iteration.

  2. Attempting to further reduce the unrolling factor by half results in a large (23%) slowdown. My guess for the cause is the shallow OoO window, due to the per-port schedulers and the high prevalence of instructions bound to port F0, as explained in point 1 above. Without being able to look at the next iteration, there is less parallelism to be extracted, so the code becomes latency- rather than throughput-bound. Thus, it appears that interleaving multiple iterations of a loop is an important optimization strategy to consider for this core.

  3. One must pay attention to the specific addressing mode used for loads. Replacing the immediate post-index addressing mode used in the original code with immediate offset, and then manually incrementing the pointers elsewhere, resulted in performance gains, but only for the loads (stores were unaffected). Section 4.5 ("Load/Store Throughput") of the optimization manual hints at this in the context of a memory copy routine, but gives no rationale. However, I believe this is explained by point 4 below.

  4. Apparently the main bottleneck of this code is writing to the register file: according to this answer to another SO question, the register file only supports writing 192 bits per cycle. This may explain why loads should avoid addressing modes with writeback (pre- and post-index), as these consume an extra 64 bits writing the result back to the register file. It's all too easy to exceed this limit while using Neon instructions and vector loads (even more so when using LDP and the 2/3/4-register versions of LD1), without the added pressure of writing back the incremented address. Knowing this, I also decided to replace the original subs in Jake's code with a comparison to pDst, since comparisons don't write to the register file -- and this actually improved performance by 1/4 of a cycle.

Interestingly, adding up the number of bits written to the register file during one execution of the loop results in 4992 bits (I have no idea whether writes to PC, specifically by the b.ne instruction, should be included in the tally or not; I arbitrarily chose not to). Given the 192-bit/cycle limit, this works out to a minimum of 26 cycles to write all these results to the register file across the loop. So it appears that the code above can't be made faster by rescheduling instructions alone.
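The 4992-bit tally can be rebuilt from the loop above. The sketch below is my own reconstruction of that count, not the author's worksheet: it assumes each Neon result is a 128-bit write, each ldp of two Q registers writes 256 bits, each pointer update (explicit add or post-index writeback) writes 64 bits, and cmp and b.ne write nothing, as the text states.

```python
import math

REGFILE_BITS_PER_CYCLE = 192  # write limit quoted in the answer

# (description, instruction count per loop, bits written per instruction)
WRITES = [
    ("smull/smull2",             8, 128),
    ("uzp1",                     4, 128),
    ("mul",                      4, 128),
    ("ldp q, q",                 4, 256),
    ("add (pointer increment)",  2, 64),
    ("smlal/smlal2",             8, 128),
    ("uzp2",                     4, 128),
    ("str post-index writeback", 4, 64),
    ("cmp",                      1, 0),   # assumed not to write the register file
    ("b.ne",                     1, 0),   # PC write excluded, as in the text
]

total_bits = sum(count * bits for _, count, bits in WRITES)
min_cycles = math.ceil(total_bits / REGFILE_BITS_PER_CYCLE)
print(total_bits, "bits ->", min_cycles, "cycles minimum")
```

With these per-instruction assumptions the sum comes out to the same 4992 bits and 26 cycles quoted above.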

Theoretically it might be possible to shed 1 cycle by switching the addressing mode of the stores to immediate offset, and then including an extra instruction to explicitly increment pDst. For the original code, each of the 4 stores would write 64 bits to pDst, for a total of 256 bits, compared to a single 64-bit write if pDst were explicitly incremented once. Thus, this change would result in saving 192 bits, i.e., 1 cycle's worth of register file writes. I attempted this change, trying to schedule the increments of pSrc1/pSrc2/pDst across many different points of the code, but unfortunately I was only able to slow down rather than speed up the code. Perhaps I am hitting a different bottleneck such as instruction decoding.
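The saving described here is simple arithmetic, sketched below with hypothetical variable names. It only models register-file write bandwidth and says nothing about the decode or issue cost of the extra add, which may be why the change did not pay off in practice.

```python
REGFILE_BITS_PER_CYCLE = 192  # write limit quoted in the answer
POINTER_BITS = 64             # one X-register write per pointer update

# Four post-index stores each write back pDst.
writeback_bits = 4 * POINTER_BITS
# Four immediate-offset stores plus a single explicit add to pDst.
explicit_bits = 1 * POINTER_BITS

saved_bits = writeback_bits - explicit_bits
saved_cycles = saved_bits / REGFILE_BITS_PER_CYCLE
print(saved_bits, "bits saved ->", saved_cycles, "cycle(s)")
```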

Please credit the source when reposting. Link to this article: https://www.uj5u.com/houduan/357370.html

Tags: performance assembly optimization arm neon
