Spark connector #25
This is a very cool suggestion. Any updates?
I'm waiting on that because Spark distributions usually use Scala 2.11. If you think that this is interesting and people will use it, I can work on it. What do you think?
I think it's good to have. If it could support both 2.11 & 2.13, that would be better. Currently, it's not convenient for us to ingest OSM data through Spark: databricks-xml may lose some data when reading big XML files, and spark-osm-datasource and osm-parquetizer have not been maintained for a long time. If we could make use of osm4scala and connect it to Spark, it would be very cool.
Any updates? Interested in the follow-ups.
Hi @ericsun95. I'm sorry, but I was on holiday. Tomorrow I will be able to spend time on this task, and I will ping you back.
Hi @ericsun95, this is the kind of schema I have in mind:

```scala
case class OsmSqlEntity(
  id: Long,
  `type`: OSMTypes.Value,
  latitude: Double,
  longitude: Double,
  nodes: Seq[Long],
  relations: Seq[RelationMemberEntity],
  tags: Map[String, String]
)
```
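For reference, a minimal sketch of how this entity could map onto a Spark SQL schema. The mapping below (e.g. encoding `type` as a byte, the shape of the relation member struct) is an assumption, not a final design:

```scala
import org.apache.spark.sql.types._

// Hypothetical Spark SQL schema for OsmSqlEntity. The enum `type` is
// encoded as a byte (0 = node, 1 = way, 2 = relation), which is an
// assumption; latitude/longitude apply only to nodes, `nodes` only to
// ways, and `relations` only to relations, hence those are nullable.
val osmSchema: StructType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("type", ByteType, nullable = false),
  StructField("latitude", DoubleType, nullable = true),
  StructField("longitude", DoubleType, nullable = true),
  StructField("nodes", ArrayType(LongType), nullable = true),
  StructField("relations", ArrayType(StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("relationType", ByteType, nullable = false),
    StructField("role", StringType, nullable = true)
  ))), nullable = true),
  StructField("tags", MapType(StringType, StringType), nullable = true)
))
```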
I think this is an interesting topic. If you check https://github.com/woltapp/spark-osm-datasource#schema, it uses a similar structure that also includes the common entity fields (though it's weird that it sets visible to false for way & relation). The Java-based https://github.com/adrianulbona/osm-parquetizer/tree/master/src/main/java/io/github/adrianulbona/osm/parquet/convertor follows a structure with a common abstract class and the three entity types under it (node, way, relation). I see that osm4scala currently doesn't include those common fields (user, uid, visible, changesetId), which may be needed in some cases. My preference is adding the common fields to your current OsmSqlEntity.

One thing I am truly interested in here: does a pbf block store some locality between entities (e.g. does a way/relation have most of its referenced nodes in one block)? If yes, could we get a partition-wise strategy in Spark after reading the data as a DataFrame, so that we can avoid too much shuffling later? And if super-relations exist, how could we still partition the data in a wise way? I think that's very important when using Spark to process OSM data. Thanks for your time.
For the naming of those fields, I personally prefer following https://github.com/openstreetmap/osmosis/tree/master/osmosis-core/src/main/java/org/openstreetmap/osmosis/core/domain/v0_6, since most OSM projects follow the same style. Still, it's up to you; clients can always rename.
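A hedged sketch of what the entity could look like with the common fields added, following that Osmosis naming; the exact types, optionality, and the osm4scala import paths are assumptions:

```scala
import java.time.Instant
import com.acervera.osm4scala.model.{OSMTypes, RelationMemberEntity}

// Hypothetical extension of OsmSqlEntity with the common Osmosis-style
// fields. They are Options because pbf metadata can be omitted.
case class OsmSqlEntity(
  id: Long,
  `type`: OSMTypes.Value,
  latitude: Double,
  longitude: Double,
  nodes: Seq[Long],
  relations: Seq[RelationMemberEntity],
  tags: Map[String, String],
  user: Option[String] = None,
  uid: Option[Long] = None,
  version: Option[Int] = None,
  timestamp: Option[Instant] = None,
  changesetId: Option[Long] = None,
  visible: Option[Boolean] = None
)
```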
There is no documentation about how to write Spark connectors, so I have been doing a little bit of reverse engineering. We are lucky because usually all this stuff is open source: the Spark built-in connectors, the Cassandra connector, etc. This task is a little bit complex, so I'm going to split it into different steps.
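As an illustration of what that reverse engineering uncovers, a minimal DataSource V1 skeleton; the class name OsmPbfRelation is hypothetical, and the `???` bodies mark the real work:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// Spark resolves spark.read.format("<package>") to a class named
// DefaultSource inside that package.
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new OsmPbfRelation(sqlContext, parameters("path"))
}

// Hypothetical relation exposing a pbf file as a table of rows.
class OsmPbfRelation(ctx: SQLContext, path: String)
    extends BaseRelation with TableScan {
  override def sqlContext: SQLContext = ctx
  override def schema: StructType = ???      // e.g. the schema sketched above
  override def buildScan(): RDD[Row] = ???   // parse pbf blocks into rows
}
```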
@ericsun95, in relation to the partitioning question: the osm.pbf format does not store this information, but maybe it will be possible to collect this type of metric at parsing time.
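As a rough idea of collecting that metric at parsing time, a sketch using osm4scala's block-level iterator; treat the exact API (BlobTupleIterator, EntityIterator.fromBlob) as an assumption:

```scala
import java.io.FileInputStream
import com.acervera.osm4scala.{BlobTupleIterator, EntityIterator}
import com.acervera.osm4scala.model.{NodeEntity, WayEntity}

// Per-blob locality metric: how many node references made by ways in a
// block resolve to nodes stored in that same block.
val pbfIS = new FileInputStream("planet.osm.pbf")
try {
  BlobTupleIterator.fromPbf(pbfIS).zipWithIndex.foreach { case ((_, blob), i) =>
    val entities = EntityIterator.fromBlob(blob).toSeq
    val nodeIds  = entities.collect { case n: NodeEntity => n.id }.toSet
    val wayRefs  = entities.collect { case w: WayEntity => w.nodes }.flatten
    val local    = wayRefs.count(nodeIds.contains)
    println(s"block $i: ${nodeIds.size} nodes, $local/${wayRefs.size} way refs local")
  }
} finally pbfIS.close()
```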
Let's comment on every task in the right issue so we can stay focused.
Hi @ericsun95. osm4scala and Spark both use the Google Protobuf library, but in two incompatible versions. The solution is to shade the library in a fat jar. To do this, there are three options:
Tbh, I don't like any of the options. What's your opinion? From the user's point of view, is it fine for you to build a fat jar yourself, or is it better to use a fat library?
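For context, the shading itself with sbt-assembly looks roughly like this; the relocation prefix shaded.com.google.protobuf is an arbitrary choice:

```scala
// build.sbt, with the sbt-assembly plugin enabled in project/plugins.sbt.
assemblyShadeRules in assembly := Seq(
  // Relocate protobuf classes so the fat jar cannot clash with the
  // protobuf version bundled inside the Spark distribution.
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)
```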
Just curious, were there any other ways to fix the conflict, like forcing Spark to pick the latest version? I prefer option 2. Option 1 seems more complicated for people not so familiar with osm4scala. Anyway, nice job!
Hi @ericsun95. Enjoy!
Cool. Nice job!!
It is possible to create a connector for easy access from Spark.
After that, it will not be necessary to put blocks into HDFS; it will be possible to work directly with the pbf file.
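For example, reading could look like this once the connector exists; the format alias "osm.pbf" and the path are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("osm-ingest").getOrCreate()

// Hypothetical usage: read the pbf file directly, with no intermediate
// step of splitting blocks into HDFS.
val osm = spark.read
  .format("osm.pbf")              // illustrative format alias
  .load("/data/planet.osm.pbf")   // illustrative path

osm.printSchema()
```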
Related articles:
Other connectors as examples: