
Access Tweet Data

-- Tweet data will remain available in the following bucket for testing throughout the hackathon: https://console.cloud.google.com/storage/browser/bucket-covid/TweetData/COVID-19-TweetIDs-master

-- If that location is not accessible, download the tweet IDs from https://github.com/echen102/COVID-19-TweetIDs and hydrate them with this tool: https://github.com/DocNow/hydrator
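If you go the GitHub route, it can help to merge the ID files into a single list before loading them into Hydrator. The following is a minimal sketch, assuming the cloned repository stores tweet IDs as .txt files with one numeric ID per line. The function names and paths are hypothetical and not part of either project.

```python
from pathlib import Path


def parse_tweet_ids(lines):
    """Yield unique numeric tweet IDs, skipping blank or non-numeric lines."""
    seen = set()
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def collect_tweet_ids(repo_root, out_file):
    """Merge every .txt file under repo_root into one ID-per-line file.

    Assumption: each .txt file holds one tweet ID per line.
    Returns the number of unique IDs written.
    """
    def all_lines():
        for txt in sorted(Path(repo_root).rglob("*.txt")):
            with txt.open() as fh:
                yield from fh

    ids = list(parse_tweet_ids(all_lines()))
    Path(out_file).write_text("".join(f"{i}\n" for i in ids))
    return len(ids)
```

For example, `collect_tweet_ids("COVID-19-TweetIDs", "all_ids.txt")` would write one merged file that can then be added to Hydrator as a dataset.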

Run Notebook in Databricks

-- (1) For a small amount of data, upload the hydrated tweets to Databricks for use in the notebook.

-- (2) Remove the Google Cloud Storage Hadoop connector class from the Spark config, since the files are read directly from dbfs:// or /FileStore/.
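The two Databricks steps above can be sketched as follows. This assumes the notebook enables gs:// access through the two config keys listed below, so check the notebook for the keys it actually sets. It also assumes the hydrated tweets were uploaded as line-delimited JSON. The upload path and helper names are hypothetical.

```python
# Config keys that commonly enable gs:// access in Spark.
# Assumption: the notebook sets these; verify against the notebook itself.
GCS_KEYS = (
    "spark.hadoop.fs.gs.impl",
    "spark.hadoop.fs.AbstractFileSystem.gs.impl",
)


def without_gcs(conf):
    """Return a copy of a Spark config dict with the GCS connector keys removed."""
    return {key: value for key, value in conf.items() if key not in GCS_KEYS}


def load_tweets(spark, path="dbfs:/FileStore/tables/hydrated_tweets.jsonl"):
    """Read line-delimited JSON tweets from DBFS into a DataFrame.

    `spark` is the SparkSession that Databricks provides in every notebook.
    The default path is a placeholder for wherever the upload landed.
    """
    return spark.read.json(path)
```

In a Databricks cell, `load_tweets(spark)` would then return a DataFrame without any GCS-specific settings.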

Run Notebooks in a Jupyter Instance

-- (1) You can create a free Jupyter notebook instance on a Google Cloud VM with 60 GB RAM and a 400 GB HDD: https://cloud.google.com/ai-platform/notebooks/docs/create-new

-- (2) Set up local port forwarding: gcloud compute ssh --project --zone -b -- -L 8088:localhost:8080

-- (3) Open the notebook at http://localhost:8088/lab?

-- (4) Make sure the GCS connector jar for Hadoop is present in the Spark home jars directory so that tweets in buckets can be read: gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop2-latest.jar /opt/conda/lib/python3.7/site-packages/pyspark/jars/

-- (5) Scripts for installing Java and PySpark are given in the corresponding notebooks.
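Once the connector jar is in place, a Spark session on the Jupyter instance could be pointed at the bucket roughly as follows. This is a sketch rather than the notebooks' actual setup: the helper names and app name are hypothetical, and the gs:// URI is simply derived from the console URL given earlier.

```python
from urllib.parse import urlparse

_CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url):
    """Convert a Cloud Console storage browser URL to a gs:// URI."""
    path = urlparse(url).path
    if not path.startswith(_CONSOLE_PREFIX):
        raise ValueError(f"not a storage browser URL: {url}")
    return "gs://" + path[len(_CONSOLE_PREFIX):].strip("/")


def gcs_spark_conf(app_name="covid-tweets"):
    """Spark settings that route gs:// paths through the GCS connector jar."""
    return {
        "spark.app.name": app_name,  # hypothetical app name
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    }


def build_session(conf):
    """Create (or reuse) a SparkSession with the given settings."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With Java and PySpark installed, `build_session(gcs_spark_conf())` returns a session whose readers accept the URI produced by `console_url_to_gs_uri`.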
