Access Tweet Data
-- Tweet data will remain available in the following bucket for testing during the hackathon: https://console.cloud.google.com/storage/browser/bucket-covid/TweetData/COVID-19-TweetIDs-master
-- If the above location is not accessible, download the tweet IDs from https://github.com/echen102/COVID-19-TweetIDs and then hydrate them with this tool: https://github.com/DocNow/hydrator
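If you go the download-and-hydrate route, it may help to merge the ID files into a single list before hydrating. The following is a minimal sketch, assuming the downloaded repository stores one numeric tweet ID per line in .txt files; the function name collect_tweet_ids and both paths are hypothetical and not part of the original notebooks.

```python
from pathlib import Path


def collect_tweet_ids(repo_dir, out_file):
    """Merge every *.txt ID file under repo_dir into one de-duplicated file.

    Assumption: each file holds one numeric tweet ID per line.
    Returns the number of unique IDs written.
    """
    seen = set()
    ordered = []
    # Sort the files so the output order is reproducible between runs.
    for txt in sorted(Path(repo_dir).rglob("*.txt")):
        for line in txt.read_text().splitlines():
            tweet_id = line.strip()
            # Skip blank lines and anything that is not a plain number.
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                ordered.append(tweet_id)
    Path(out_file).write_text("\n".join(ordered) + ("\n" if ordered else ""))
    return len(ordered)
```

For example, collect_tweet_ids("COVID-19-TweetIDs", "all_ids.txt") would produce a single file of IDs to hydrate. Write the output outside the repository folder so that a later run does not read it back in as input.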
Run Notebook in Databricks
-- (1) For a small amount of data, upload the hydrated tweets directly into Databricks
-- (2) Remove the Google Cloud Hadoop storage class from the Spark config, since the files are then read directly from dbfs:// or /FileStore/
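The two Databricks steps above can be sketched roughly as follows. This is an illustration only: the exact config keys used in the notebooks are not listed here, so the prefixes below are an assumption based on the usual GCS connector settings, and the helper names and the file path are hypothetical.

```python
# Assumption: the notebooks enable the GCS connector through keys like these.
GCS_KEY_PREFIXES = (
    "spark.hadoop.fs.gs.",
    "spark.hadoop.fs.AbstractFileSystem.gs.",
    "spark.hadoop.google.cloud.",
)


def without_gcs_settings(conf):
    """Return a copy of a Spark config dict with the GCS connector keys dropped."""
    return {k: v for k, v in conf.items() if not k.startswith(GCS_KEY_PREFIXES)}


def read_hydrated_tweets(spark, path):
    """Read hydrated tweets (assumed to be line-delimited JSON) with an existing session."""
    return spark.read.json(path)


# In a Databricks notebook the `spark` session already exists, for example:
#   tweets = read_hydrated_tweets(spark, "dbfs:/FileStore/tables/hydrated_tweets.jsonl")
# The path above is hypothetical; use the location shown after your upload.
```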
Run Notebooks in a Jupyter Instance
-- (1) You can create a free Jupyter notebook instance on a Google VM with 60 GB RAM and a 400 GB HDD: https://cloud.google.com/ai-platform/notebooks/docs/create-new
-- (2) Set up local port forwarding: gcloud compute ssh --project --zone -b -- -L 8088:localhost:8080
-- (3) Open the notebook: http://localhost:8088/lab?
-- (4) Make sure the google-hadoop jar is present in the Spark home so that tweets in the bucket can be read: gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar /opt/conda/lib/python3.7/site-packages/pyspark/jars/
-- (5) The scripts for installing Java and PySpark are given in the corresponding notebooks
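Before running the notebooks on the Jupyter instance, you may want to confirm that step (4) worked. Here is a minimal sketch, assuming the jar was copied into the PySpark jars folder named in step (4); the helper names are hypothetical, and the two config entries are the standard GCS connector class names rather than values taken from the notebooks.

```python
from pathlib import Path

# Standard GCS connector classes; assumed to match what the notebooks configure.
GCS_SPARK_CONF = {
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
}


def find_gcs_connector(jars_dir):
    """Return the names of gcs-connector jars in jars_dir (empty list if none)."""
    return sorted(p.name for p in Path(jars_dir).glob("gcs-connector-*.jar"))


def build_session(app_name="tweets"):
    """Create a SparkSession that can resolve gs:// paths via the connector."""
    from pyspark.sql import SparkSession  # imported lazily; needs PySpark and Java

    builder = SparkSession.builder.appName(app_name)
    for key, value in GCS_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, find_gcs_connector("/opt/conda/lib/python3.7/site-packages/pyspark/jars/") should return a non-empty list once the gsutil copy has succeeded. If it returns an empty list, reading from the bucket is likely to fail with a missing file system class.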