I have a CSV file, and I know how to achieve this with pandas: read the CSV into a DataFrame, group the data by the fields 'aaa' and 'bbb', and then construct a new 'id' for each group. My question is how to achieve the same with Apache Beam. I've never used it before. I'm trying to use Beam to read this CSV file and group multiple records, but Beam doesn't directly support the pandas approach I was using. The following is my current code:
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
pipeline = beam.Pipeline()
csv_lines = (pipeline | 'ReadFile' >> beam.io.ReadFromText('xxx.csv')
| ???? )
My question is: after beam.io.ReadFromText(), what transform should follow to group the records by those fields? Hope this makes sense; I'm new to Beam, so any help would be appreciated. Thanks.
If you're able to do it with Pandas, you should be able to do it with the Beam Dataframes API in exactly the same way.
To do this with Beam directly, you can map your rows into 2-tuples, where the first element is the key ('postcode', 'property_type', 'old/new', 'PAON' in your case), and then use GroupByKey to group things together. For example, suppose each element of your input PCollection is a dictionary with those keys:
{'postcode': ..., 'property_type': ..., 'old/new': ..., 'PAON': ..., ...}
Then you could write
def get_key(property_dict):
    return (
        property_dict['postcode'],
        property_dict['property_type'],
        property_dict['old/new'],
        property_dict['PAON'])
grouped = (
    input_pcoll
    | beam.Map(lambda element: (get_key(element), element))
    | beam.GroupByKey())