Posts Tagged ‘dataset’

Spark: RDD vs DataFrame vs Dataset

In the context of Apache Spark, RDD, DataFrame, and Dataset are different abstractions for working with structured and semi-structured data. Here’s a brief definition of each: RDD (Resilient Distributed Dataset): RDD is the basic abstraction in Spark. It represents an immutable, distributed collection of objects that can be processed in parallel across a cluster. RDDs […]

Data Strategy Framework: Handle with CARE

As Gartner predicted, artificial intelligence is emerging as a core business and analytical competency. Although information is not yet recognized as a line item on a corporate balance sheet (data as an asset), it is a strategic asset that can drive business value. Therefore, it’s important to have a data strategy framework in place. On […]

Google Big Dataset: Wikilinks Corpus

A few days ago, while browsing material on data mining and machine learning, I heard that Google had released a large dataset called the Wikilinks Corpus, which contains 40 million mentions over 3 million entities. What do “mention” and “entity” mean here? Apple is also rumored to be working on a new, […]