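Before importing CSV files into Oracle, I use PowerShell to perform some data modifications on them. I watched Resource Monitor while the process was running, and it consumes all 20 GB of available memory on the server. One of my CSVs is about 90 MB, with nearly 200 columns and 100K rows. The resulting CSV is about 120 MB. Here is the code I'm currently using: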
# Process Configuration File
$path = $PSScriptRoot + "\"
#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
# Convert Date Time values that are always populated
% {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
% {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
% {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
# Convert DateTime values that may be blank
% {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
% {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
# Fix Multi-Select Picklist fields, replacing value divider with "|"
% {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
% {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
# Trim values that can be too long
% {$_.product__v = ($_.product__v)[0..254] -join "";$_} |
# Add ExtractDate Column
Select-Object *,@{Name='Extract_Date';Expression={$date}} |
#Export Results
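Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8
Is there a more efficient way to modify large CSV files with PowerShell than what I'm currently doing? The process takes about 10 minutes to complete. I'm by no means a PowerShell expert; I built my script from information on this site and the MS PowerShell documentation. Any suggestions would be greatly appreciated.
Here is data for creating a sample documents.csv with a single record: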
allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,
Answer:
PowerShell's Import-Csv cmdlet is a known memory hog, mainly because of the high memory requirements of the [pscustomobject] instances it constructs; see GitHub issue #7603.
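There are several mitigation strategies, in ascending order of complexity:
- In your ForEach-Object (%) script block (you should combine the separate % calls into one), force garbage collection every, say, 1000 objects in order to relieve memory pressure.
  As Santiago Squarzon points out, the inefficient implementation of ForEach-Object as of PowerShell 7.2.x (see GitHub issue #10982) exacerbates the problem, both in memory consumption and in run time. See the code below, which combines periodic garbage collection with . { process { ... } } in lieu of ForEach-Object. [1]
- Use a custom PowerShell class to represent your CSV rows, but note that this increases execution time. See this answer for an example.
  GitHub issue #8862 proposes building this functionality into Import-Csv, so that it constructs instances of a given type in lieu of [pscustomobject]s to begin with.
- If the above approaches are too slow, you will need to resort to a third-party .NET parser library, such as CSVHelper. See this blog post, which has a comparative benchmark with links to many libraries, and the answers to this SO question (focused on C#).
  Unfortunately, using general-purpose .NET NuGet packages in PowerShell is cumbersome as of v7.2.x. This answer shows what is currently required. GitHub issue #6724 asks for a future Add-Type improvement that directly supports NuGet packages.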
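As a rough illustration of the second strategy, here is a minimal sketch of representing rows with a custom class. The class name DocRow, its constructor, and the inline sample data are hypothetical; only three of the roughly 200 columns are shown, and this is not the code from the linked answer.

```powershell
# Hypothetical row class; only three of the ~200 columns are shown for brevity.
class DocRow {
    [string] $id
    [string] $product__v
    [string] $tags__v

    # Copy the wanted fields from a row object emitted by Import-Csv / ConvertFrom-Csv.
    DocRow([object] $row) {
        $this.id         = $row.id
        $this.product__v = $row.product__v
        $this.tags__v    = $row.tags__v
    }
}

# Inline sample data standing in for documents.csv (an assumption for this demo).
$sample = @'
id,product__v,tags__v
10000,00P000000001F36,"a,b"
'@

# Convert each parsed row into a DocRow instance as it streams through the pipeline.
$rows = @($sample | ConvertFrom-Csv | . { process { [DocRow]::new($_) } })

$rows
```

With a real file, Import-Csv would take the place of ConvertFrom-Csv, and the class would need one property per column you want to keep. Whether this helps in a purely streaming pipeline is worth measuring on your own data.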
Here is a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:
# Process Configuration File
$path = $PSScriptRoot + '\'
#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"
# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
  process {
    # Perform garbage collection every 1000 objects
    # in order to relieve memory pressure.
    # ++$i increments the counter for each input object.
    if (++$i % 1000 -eq 0) { [GC]::Collect() }
    # Convert Date Time values that are always populated
    $_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
    $_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
    $_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
    # Convert DateTime values that may be blank
    if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
    if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
    # Fix Multi-Select Picklist fields, replacing value divider with "|"
    $_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    $_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
    # Trim values that can be too long
    $_.product__v = ($_.product__v)[0..254] -join ""
    # Finally add an ExtractDate Column and output the modified object
    # (-PassThru) - this obviates the need for a separate Select-Object call.
    Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
  }
} |
  Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8
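[1] Note that the variant & { process { ... } }, i.e. execution in a child scope, can speed up execution (see this answer for an explanation), but it increases memory consumption again, which is why it isn't used here.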
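The repeated picklist statements in the script could also be driven by a list of column names. The following is only a sketch: the function name Convert-PicklistValue, the variable $picklistColumns, and the inline sample row are made up for illustration, and only three of the multi-select columns are listed.

```powershell
# Hypothetical helper: turn single commas into '|' while preserving escaped
# commas (',,'), which end up as a literal ','.
function Convert-PicklistValue {
    param([string] $Value)
    $Value.Replace(',,', '~comma~').Replace(',', '|').Replace('~comma~', ',')
}

# Subset of the multi-select picklist columns from the script above.
$picklistColumns = 'clinical_data__c', 'content_category__c', 'tags__v'

# Inline sample row standing in for one object emitted by Import-Csv.
$row = [pscustomobject] @{
    clinical_data__c    = 'Immunogenicity'
    content_category__c = 'Clinical Data,Special Population'
    tags__v             = 'a,,b,c'
}

# Apply the same conversion to every listed column of the row.
foreach ($name in $picklistColumns) {
    $row.$name = Convert-PicklistValue $row.$name
}

$row
```

Inside the process block, the foreach loop would operate on $_ instead of $row. Function calls carry some overhead in PowerShell, so for 100K rows it may be worth inlining the Replace chain in the loop body and comparing run times.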
