doanduyhai/Cassandra-Spark-Demo

This is a Spark/Cassandra demo using the open-source Spark Cassandra Connector

There are 5 packages, each containing a distinct demo

  • us.unemployment.demo
  • Ingestion
    1. FromCSVToCassandra: read US employment data from a CSV file into Cassandra
    2. FromCSVCaseClassToCassandra: read US employment data from a CSV file, build case class instances and insert them into Cassandra (see the sketch after this list)
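A minimal sketch of the CSV-to-Cassandra path, assuming the Spark Cassandra Connector 1.x API of this demo's era; the `unemployment` table layout and file name below are illustrative assumptions, not the demo's exact schema:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

// Hypothetical mapping; the demo's real case class may differ
case class Unemployment(year: Int, month: Int, rate: Double)

object CsvToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("csv-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    sc.textFile("us_employment.csv")      // illustrative file name
      .map(_.split(","))                  // naive split; a real CSV parser is safer
      .map(c => Unemployment(c(0).toInt, c(1).toInt, c(2).toDouble))
      .saveToCassandra("spark_demo", "unemployment")  // columns matched by field name

    sc.stop()
  }
}
```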
  • Read
    1. FromCassandraToRow: read US employment data from Cassandra into CassandraRow low-level object
    2. FromCassandraToCaseClass: read US employment data from Cassandra into custom Scala case class, leveraging the built-in object mapper
    3. FromCassandraToSQL: read US employment data from Cassandra using the connector's SparkSQL integration (all three read paths are sketched below)
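The three read paths, sketched under the same assumptions and given a SparkContext `sc` (in the connector 1.x line, `CassandraSQLContext` was the SparkSQL entry point):

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra.CassandraSQLContext

// 1. Low-level CassandraRow: untyped access by column name
val rows = sc.cassandraTable("spark_demo", "unemployment")
val firstRate = rows.first().getDouble("rate")

// 2. Built-in object mapper: columns bound to case class fields by name
val typed = sc.cassandraTable[Unemployment]("spark_demo", "unemployment")

// 3. SparkSQL through the connector
val csc = new CassandraSQLContext(sc)
val byYear = csc.sql("SELECT year, AVG(rate) FROM spark_demo.unemployment GROUP BY year")
```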
  • twitter.stream
  • TwitterStreaming: demo of a Twitter stream saved back to Cassandra (stream IN). To make this demo work, you need to start the job with the following system properties (a minimal sketch follows):
    1. -Dtwitter4j.oauth.consumerKey="value"
    2. -Dtwitter4j.oauth.consumerSecret="value"
    3. -Dtwitter4j.oauth.accessToken="value"
    4. -Dtwitter4j.oauth.accessTokenSecret="value"

    If you don't have Twitter app credentials, create a new app at https://apps.twitter.com/
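A minimal sketch of the streaming flow, assuming Spark 1.x's `spark-streaming-twitter` module, an existing SparkContext `sc`, and a hypothetical `tweets` table; twitter4j picks the `-Dtwitter4j.oauth.*` values above out of the JVM system properties:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import com.datastax.spark.connector._              // SomeColumns
import com.datastax.spark.connector.streaming._    // saveToCassandra on DStreams

val ssc = new StreamingContext(sc, Seconds(5))
val tweets = TwitterUtils.createStream(ssc, None)  // None = credentials from system properties

tweets
  .map(s => (s.getId, s.getUser.getScreenName, s.getText))
  .saveToCassandra("spark_demo", "tweets", SomeColumns("id", "user", "body"))

ssc.start()
ssc.awaitTermination()
```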
    
  • analytics.music
  • Data preparation
    1. Go to the folder main/data
    2. Execute `$CASSANDRA_HOME/bin/cqlsh -f music.cql` from this folder. It should create the `spark_demo` keyspace and some tables
    3. The script will then load the content of performers.csv and albums.csv into Cassandra
  • Scenarios

    All examples extend the `BaseExample` class, which configures a SparkContext and truncates some tables automatically so that the examples can be executed several times and remain consistent
    1. Example1 : in this example, we read data from the `performers` table to extract performers and styles into the `performers_by_style` table (a hedged sketch of this pattern follows the list)
    2. Example2 : in this example, we read data from the `performers` table and group styles by performer for aggregation. The results are saved back into the `performers_distribution_by_style` table
    3. Example3 : similar to Example2, but we only extract the top 10 styles for artists and groups and save the results into the `top10_styles` table
    4. Example4 : in this example, we want to know, for each decade, the number of albums released by each artist, grouped by their origin country. For this we join the `performers` table with `albums`. The results are saved back into the `albums_by_decade_and_country` table
    5. Example5 : similar to Example4, but we perform the join using SparkSQL. We also filter out countries with low release counts. The results are saved back into the `albums_by_decade_and_country_sql` table
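As a flavour of what these examples do, here is a hedged sketch in the spirit of Example1, given a SparkContext `sc`; the column names (`name`, `styles`, `style`, `performer`) are assumptions drawn from the description, not the demo's exact schema:

```scala
import com.datastax.spark.connector._

// Explode each performer into (style, performer) pairs and
// write them to the performers_by_style table
sc.cassandraTable("spark_demo", "performers")
  .flatMap(row => row.getList[String]("styles")
                     .map(style => (style, row.getString("name"))))
  .saveToCassandra("spark_demo", "performers_by_style",
                   SomeColumns("style", "performer"))
```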
  • usecases


    These scenarios exemplify how Spark can be used to achieve various real-world use cases
  • Scenarios
    1. CrossClusterDataMigration : sample code showing how to perform effective cross-cluster operations. DO NOT EXECUTE IT
    2. CrossDCDataMigration : sample code showing how to perform effective cross-data-center operations. DO NOT EXECUTE IT
    3. DataCleaningForPerformers : in this scenario, we read data from the `performers` table to clean up empty _country_ fields and reformat the _born_ and _died_ dates, if present. The data is saved back into Cassandra, thus achieving perfect data locality (a minimal sketch follows the list)
    4. DisplayPerformersData : a utility class to show the data before and after the cleaning
    5. MigrateAlbumnsData : in this scenario, we read source data from `albums` and save it back into a new table `albums_by_country`, purposely built for fast queries by country and year
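A minimal sketch of the DataCleaningForPerformers idea, given a SparkContext `sc` and assuming a hypothetical `Performer` case class and an "Unknown" default for blank countries; reading and writing the same table is what keeps the work node-local:

```scala
import com.datastax.spark.connector._

// Hypothetical mapping; the demo's real class and columns may differ
case class Performer(name: String, country: String, born: String, died: String)

sc.cassandraTable[Performer]("spark_demo", "performers")
  .map { p =>
    // Replace null or blank countries with a placeholder value
    val country = Option(p.country).map(_.trim).filter(_.nonEmpty).getOrElse("Unknown")
    p.copy(country = country)
  }
  .saveToCassandra("spark_demo", "performers")
```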
  • weather.data.demo
  • Data preparation
    1. Go to the folder main/data
    2. Execute `$CASSANDRA_HOME/bin/cqlsh -f weather_data_schema.cql` from this folder. It should create the `spark_demo` keyspace and some tables
    3. Download Weather_Raw_Data_2014.csv.gz from here (>200 MB)
    4. Unzip it somewhere on your disk
  • Ingestion
    1. WeatherDataIntoCassandra: read the whole Weather_Raw_Data_2014.csv file (30×10⁶ lines) and insert the data into Cassandra. The ingestion takes a while, so go grab a long coffee (< 1 hour on my MacBook Pro 15"). Do not forget to set the path to this file by changing the WeatherDataIntoCassandra.WEATHER_2014_CSV value (a hedged sketch follows)
    This step should take a while since there are 30×10⁶ lines to insert into Cassandra
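A hedged sketch of the ingestion path, given a SparkContext `sc`; the raw-data columns (station id, timestamp, temperature, pressure) and the `raw_weather_data` table name are assumptions based on the description:

```scala
import com.datastax.spark.connector._

// Hypothetical row layout for the raw weather readings
case class RawWeather(stationId: String, time: Long,
                      temperature: Double, pressure: Double)

sc.textFile(WeatherDataIntoCassandra.WEATHER_2014_CSV)  // the path you configured above
  .map(_.split(","))
  .map(c => RawWeather(c(0), c(1).toLong, c(2).toDouble, c(3).toDouble))
  .saveToCassandra("spark_demo", "raw_weather_data")
```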
  • Read
    1. WeatherDataFromCassandra: read all the raw weather data plus all weather station details, keep only French stations and data between March and June 2014, then compute the average temperature and pressure (sketched below)
    This step should take a while since there are 30×10⁶ lines to read from Cassandra
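A hedged sketch of the read-and-aggregate flow, reusing the hypothetical `RawWeather` class from the ingestion sketch; the `weather_station` table, its column names, and the timestamp bounds are all illustrative assumptions:

```scala
import com.datastax.spark.connector._

// Station ids for France; assumed small enough to ship in the task closures
val frenchStations = sc.cassandraTable("spark_demo", "weather_station")
  .filter(_.getString("country_code") == "FR")
  .map(_.getString("id"))
  .collect().toSet

val march1 = java.sql.Timestamp.valueOf("2014-03-01 00:00:00").getTime
val july1  = java.sql.Timestamp.valueOf("2014-07-01 00:00:00").getTime

val frenchSpring = sc.cassandraTable[RawWeather]("spark_demo", "raw_weather_data")
  .filter(r => frenchStations.contains(r.stationId))
  .filter(r => r.time >= march1 && r.time < july1)
  .cache()                     // reused for both averages

val avgTemperature = frenchSpring.map(_.temperature).mean()
val avgPressure    = frenchSpring.map(_.pressure).mean()
```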
