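I am trying to use the apache_beam.dataframe.io.read_csv function to read an online resource, without success. Everything works if the file is hosted on Google Cloud Storage at 'gs://bucket/source.csv', but fetching the file from a source like 'https://github.com/../source.csv' fails.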
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

url = 'https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv'
with beam.Pipeline() as pipeline:
    original_collection = pipeline | read_csv(path=url)
    original_collection = original_collection[:5]
    original_collection | beam.Map(print)
This gives me:
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: https://github.com/datablist/sample-csv-files/raw/main/files/people/people-100.csv
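Can anyone give me a hint?
Answer from a uj5u.com user:
Beam can only read files through a supported filesystem (GCS, HDFS, and so on), not from arbitrary URLs, which are hard to read in parallel. Local files also work with the direct runner.
Alternatively, you could do something like this: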
import csv
import io
import urllib.request
import apache_beam as beam

def parse_csv(contents):
    return list(csv.DictReader(io.StringIO(contents.decode('utf-8'))))  # one dict per row; UTF-8 is assumed

with beam.Pipeline() as pipeline:
    urls = pipeline | beam.Create(['https://github.com/datablist/sample-csv-files/...'])
    contents = urls | beam.Map(lambda url: urllib.request.urlopen(url).read())
    rows = contents | beam.FlatMap(parse_csv)
It is probably easier to save the file to a proper filesystem and read it from there...
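As a rough illustration of that last suggestion, the file could be downloaded to local disk first and then read with read_csv on the direct runner. This is only a sketch: download_to_local and print_rows are hypothetical helper names, not part of Beam, and the Beam imports are done lazily so the download helper works even where Beam is not installed.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_local(url, dest):
    """Stream the resource at `url` into the local file `dest` and return its path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, dest.open('wb') as out:
        shutil.copyfileobj(response, out)
    return dest


def print_rows(csv_path):
    """Read a local CSV with Beam's DataFrame API and print every row."""
    # Imported here (an assumption of this sketch) so that download_to_local
    # stays usable without apache-beam installed.
    import apache_beam as beam
    from apache_beam.dataframe.convert import to_pcollection
    from apache_beam.dataframe.io import read_csv

    with beam.Pipeline() as pipeline:
        df = pipeline | read_csv(str(csv_path))
        # Convert the deferred DataFrame back to a PCollection of rows.
        to_pcollection(df) | beam.Map(print)
```

With the URL from the question, usage would be something like print_rows(download_to_local(url, 'people-100.csv')). Downloading outside the pipeline keeps the read itself on a path Beam already knows how to handle.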
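Answer from a uj5u.com user:
I don't think this is possible in Beam.
You could use another process or service, outside of Beam, to copy your external file into a Cloud Storage bucket (for example with gsutil cp).
Then your Dataflow job can read the file from GCS without any problem.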
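If the copy step has to be scripted rather than run by hand with gsutil cp, one possible sketch in Python is shown here. It assumes the google-cloud-storage client library is installed and that default credentials are configured; split_gcs_path and stage_url_to_gcs are hypothetical helper names, and the whole file is held in memory, which is only reasonable for small files.

```python
import urllib.request


def split_gcs_path(gcs_path):
    """Split 'gs://bucket/dir/file.csv' into ('bucket', 'dir/file.csv')."""
    prefix = 'gs://'
    if not gcs_path.startswith(prefix):
        raise ValueError(f'not a gs:// path: {gcs_path!r}')
    bucket, _, blob = gcs_path[len(prefix):].partition('/')
    if not bucket or not blob:
        raise ValueError(f'expected gs://<bucket>/<object>, got {gcs_path!r}')
    return bucket, blob


def stage_url_to_gcs(url, gcs_path):
    """Download `url` and upload its bytes to `gcs_path`; return `gcs_path`."""
    # Assumption: google-cloud-storage is installed. It is imported lazily so
    # that split_gcs_path can be used without it.
    from google.cloud import storage

    bucket_name, blob_name = split_gcs_path(gcs_path)
    with urllib.request.urlopen(url) as response:
        data = response.read()
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(data, content_type='text/csv')
    return gcs_path
```

After something like stage_url_to_gcs(url, 'gs://bucket/source.csv') has run, the pipeline can pass that gs:// path to read_csv, which is the case the question already reports as working.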