NaturalViz

Inspiration

Our inspiration for NaturalViz stems from the recognition of challenges in traditional data visualization tools. We observed barriers in accessibility and the reliance on specialized knowledge, hindering the broader audience from exploring and understanding datasets effectively. Inspired by the potential of language models, particularly OpenAI LLMs, we envisioned a solution that could make data exploration more inclusive, user-friendly, and dynamic. The goal was to create a tool that goes beyond conventional data visualization methods, leveraging natural language processing to enable users to interact with data in a conversational manner.

We believe there is a fundamental flaw in the current practice of having an analyst or middleman present data and findings to an audience. As Professor Catherine D'Ignazio has mentioned in their paper "Data Visualization and the Politics of Illusion", data visualizations are not neutral. They are created by people who make choices about what data to include, how to represent it, and what stories to tell. Instead, the audience should be able to interact directly with the data. The current barrier to this interaction is the audience's potential lack of technical knowledge. However, if we can provide a natural language interface using Large Language Models (LLMs) that can understand human input, generate dynamic code, and create visualizations on the fly, all of this becomes possible.

What it does

NaturalViz combines the power of OpenAI LLMs with data visualization capabilities to redefine how users explore and gain insights from datasets. The chatbot functionality allows users to pose natural language queries about the dataset, and, in response, the system generates Python code for analysis and visualization. Because each chart is generated on demand, the range of visualizations is open-ended, and users can explore the data from many perspectives. NaturalViz promotes transparency by including the generated code in the analysis, making it accessible for users and judges to assess the technical proficiency and creativity involved.

How we built it

The development of NaturalViz involved a multi-faceted approach. We integrated OpenAI's language models with LangChain to process user queries and generate code snippets. The frontend uses Streamlit to provide an interactive and user-friendly interface. Matplotlib and Altair were employed for dynamic visualization, offering a diverse range of plots and charts. The project embraces an experimental mindset, refraining from extensive data cleaning to showcase the capabilities of language models in handling raw data.

The chatbot works like this:

It contains an agent wrapper with a list of tools. The main agent follows this format:

User Input: The input question you must answer
Thought: You should always think about what to do
Action: The action to take, should be one of [{tool_names}]
Action Input: The input to the action
Observation: The result of the action
... (This Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: The final answer to the original input question
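
The format above can be read back from a model reply with a small parser. The sketch below is only an illustration: LangChain ships its own output parsers, and the names AgentStep and parse_agent_output are hypothetical, not part of NaturalViz or LangChain.

```python
import re
from typing import NamedTuple, Optional


class AgentStep(NamedTuple):
    action: Optional[str]        # tool name, or None once the agent is done
    action_input: Optional[str]  # raw string handed to the tool
    final_answer: Optional[str]  # set only when the agent has finished


def parse_agent_output(text: str) -> AgentStep:
    """Extract either a tool call or the final answer from one LLM reply."""
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return AgentStep(None, None, final.group(1).strip())
    match = re.search(r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError(f"Could not parse agent output: {text!r}")
    return AgentStep(match.group(1).strip(), match.group(2).strip(), None)
```

A reply containing "Action: answer_qa" followed by an "Action Input:" line yields a tool call, while a reply containing "Final Answer:" ends the loop.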

We give the agent several actions, or tools, such as these:

Tool(
    name="answer_qa",
    func=csv_agent,
    description="Use this tool to query the dataset. Input to this tool should be a standalone question. Include the correct row titles that are needed. Example: How many rows are there in the dataset, which Facebook page has the highest Facebook Follower #",
),
Tool(
    name="smalltalk",
    func=get_chatresponse,
    description="Use this tool to create a response to smalltalk user inputs. Input to this tool is the User Input you need to respond to. Example: Hello, Thank you",
    return_direct=True,
),
Tool(
    name="create_simple_plot",
    func=parsing_input,
    description="""Use this tool to create x vs y plots. Input is a comma-separated list of selected_option, x_column, y_column
    Example Inputs: 
    bar,Name (English),X (Twitter) Follower #
    line,Parent entity (English),Facebook Follower #
    line,Language,Facebook Follower #
    scatter_matrix,X (Twitter) Follower #,Instagram Follower #
    box,Region of Focus,TikTok Subscriber #
    heatmap,Language,Region of Focus
    These examples are only to show the input format. You can decide plot type based on the user input.
    """,
),
Tool(
    name="Calculator",
    func=math_tool,
    description="Useful when you need to do calculations. Example input: 21^0.43"
)

Both the CSV agent and the plotting tool execute dynamically generated Python code to produce their output, while we rely on the main agent to provide a meaningful final answer. We could also add further tools, such as the Google Search API, to reinforce answers with real-time information.
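
The comma-separated input described for create_simple_plot could be parsed along the following lines. This is a minimal sketch, not the project's actual parsing_input function: the names PlotRequest, parse_plot_input, and ALLOWED_PLOT_TYPES are hypothetical, the allowed plot types are taken only from the example inputs above, and the code assumes column names contain no commas.

```python
from typing import NamedTuple

# Plot types that appear in the tool description's example inputs.
ALLOWED_PLOT_TYPES = {"bar", "line", "scatter_matrix", "box", "heatmap"}


class PlotRequest(NamedTuple):
    plot_type: str
    x_column: str
    y_column: str


def parse_plot_input(raw: str) -> PlotRequest:
    """Split 'selected_option,x_column,y_column' into its three fields."""
    parts = [part.strip() for part in raw.strip().split(",")]
    if len(parts) != 3:
        raise ValueError(f"Expected 3 comma-separated values, got {len(parts)}: {raw!r}")
    plot_type, x_column, y_column = parts
    if plot_type not in ALLOWED_PLOT_TYPES:
        raise ValueError(f"Unsupported plot type: {plot_type!r}")
    return PlotRequest(plot_type, x_column, y_column)
```

Raising a ValueError with a descriptive message matters here because the message can be returned to the agent as the Observation, giving it a chance to correct its input.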

For the direct NLP-to-visualization option, we follow a template like this:

Direct NLP to Visualization Option Template

template = """You are an expert in data visualization who can create suitable visualizations to find the required information. You have access to a dataset (dataset.csv) and you are given a question. Generate a Python code with st.altair_chart to find the answer.

First 3 rows of the dataset:
Name (English),Name (Chinese),Region of Focus,Language,Entity owner (English),Entity owner (Chinese),Parent entity (English),Parent entity (Chinese),X (Twitter) handle,X (Twitter) URL,X (Twitter) Follower #,Facebook page,Facebook URL,Facebook Follower #,Instragram page,Instagram URL,Instagram Follower #,Threads account,Threads URL,Threads Follower #,YouTube account,YouTube URL,YouTube Subscriber #,TikTok account,TikTok URL,TikTok Subscriber #
Yang Xinmeng (Abby Yang),杨欣萌,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_bubblyabby_,https://twitter.com/_bubblyabby_,1678.00,itsAbby-103043374799622,https://www.facebook.com/itsAbby-103043374799622,1387432.00,_bubblyabby_,https://www.instagram.com/_bubblyabby_/,9507.00,_bubblyabby_,https://www.threads.net/@_bubblyabby_,197.00,itsAbby,https://www.youtube.com/itsAbby,4680.00,_bubblyabby_,https://www.tiktok.com/@_bubblyabby_,660.00
CGTN Culture Express,,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_cultureexpress,https://twitter.com/_cultureexpress,2488.00,,,,_cultureexpress/,https://www.instagram.com/_cultureexpress/,635.00,,,,,,,,,

=============
Question:
{question}

Example:
import altair as alt
import pandas as pd
import streamlit as st
df = pd.read_csv('dataset.csv')
# your code here
st.altair_chart(chart, use_container_width=True)

Generated Python Code:"""
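
A template like this has to be filled with the user's question, and the model's reply often arrives wrapped in a Markdown code fence that must be removed before execution. The sketch below shows one way to do both. TEMPLATE is a shortened stand-in for the prompt above, and build_prompt and extract_code are hypothetical helper names rather than functions from the project.

```python
import re

# Shortened stand-in for the full prompt template.
TEMPLATE = """Generate a Python code with st.altair_chart to find the answer.

Question:
{question}

Generated Python Code:"""

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this block readable
FENCED_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def build_prompt(question: str) -> str:
    """Substitute the user's question into the template."""
    return TEMPLATE.format(question=question)


def extract_code(reply: str) -> str:
    """Return the first fenced block if the reply has one, else the whole reply."""
    fenced = FENCED_BLOCK.search(reply)
    return (fenced.group(1) if fenced else reply).strip()
```

Note that str.format only interprets braces in the template itself, so a question that happens to contain braces is inserted unchanged.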

If the generated code fails, a second template tries to fix it using the error message received.

retry_template = """Current code attempts to create a visualization of dataset.csv to meet the objective. but it has encountered the given error. provide a corrected code.

First 3 rows of the dataset:
Name (English),Name (Chinese),Region of Focus,Language,Entity owner (English),Entity owner (Chinese),Parent entity (English),Parent entity (Chinese),X (Twitter) handle,X (Twitter) URL,X (Twitter) Follower #,Facebook page,Facebook URL,Facebook Follower #,Instragram page,Instagram URL,Instagram Follower #,Threads account,Threads URL,Threads Follower #,YouTube account,YouTube URL,YouTube Subscriber #,TikTok account,TikTok URL,TikTok Subscriber #
Yang Xinmeng (Abby Yang),杨欣萌,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_bubblyabby_,https://twitter.com/_bubblyabby_,1678.00,itsAbby-103043374799622,https://www.facebook.com/itsAbby-103043374799622,1387432.00,_bubblyabby_,https://www.instagram.com/_bubblyabby_/,9507.00,_bubblyabby_,https://www.threads.net/@_bubblyabby_,197.00,itsAbby,https://www.youtube.com/itsAbby,4680.00,_bubblyabby_,https://www.tiktok.com/@_bubblyabby_,660.00
CGTN Culture Express,,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_cultureexpress,https://twitter.com/_cultureexpress,2488.00,,,,_cultureexpress/,https://www.instagram.com/_cultureexpress/,635.00,,,,,,,,,

=============
Objective: {question}

Current Code:
{error_code}

Error Message:
{error_message}

Corrected Code:"""
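
The retry step can be sketched as a small loop: run the generated code, and on failure feed the code and the error message back to the model. This is an illustration under assumptions, not the project's implementation. RETRY_TEMPLATE is a shortened stand-in for the template above, run_with_retries is a hypothetical name, and generate stands for any callable that sends a prompt to the LLM and returns code.

```python
from typing import Callable

# Shortened stand-in for the retry template above.
RETRY_TEMPLATE = """Objective: {question}

Current Code:
{error_code}

Error Message:
{error_message}

Corrected Code:"""


def run_with_retries(
    question: str,
    code: str,
    generate: Callable[[str], str],
    max_retries: int = 2,
) -> str:
    """Execute code; on failure, ask `generate` for a fix. Returns the code that ran."""
    for attempt in range(max_retries + 1):
        try:
            # Caution: exec runs arbitrary code. LLM output should be sandboxed.
            exec(code, {"__name__": "__generated__"})
            return code
        except Exception as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            prompt = RETRY_TEMPLATE.format(
                question=question,
                error_code=code,
                error_message=f"{type(exc).__name__}: {exc}",
            )
            code = generate(prompt)
    raise AssertionError("unreachable")
```

Capping the number of retries keeps a persistently failing query from looping indefinitely and bounds the number of LLM calls per question.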

The first few rows are included in the templates so that the LLM can get a footing on the dataset structure.
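
Rather than pasting the rows into each template by hand, the preview could be read from the file when the prompt is built. A minimal sketch, assuming a hypothetical helper named dataset_preview and a CSV in which no quoted field spans multiple lines:

```python
from itertools import islice


def dataset_preview(path: str, n_lines: int = 3, encoding: str = "utf-8") -> str:
    """Return the header plus the first data rows as raw CSV text."""
    with open(path, encoding=encoding, newline="") as handle:
        # islice stops after n_lines, so the rest of the file is never read.
        return "".join(islice(handle, n_lines)).rstrip("\r\n")
```

The returned text can then be substituted into a template placeholder in the same way as the question.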

An Example Conversation

> Entering new AgentExecutor chain...

Thought: Greet the customer
Action: smalltalk
Action Input: Hello AI

> Entering new LLMChain chain...
Prompt after formatting:
You are an AI Assitant that can assist explore datasets. offer brief and poliet smalltalk.   

First 3 rows of the dataset:
Name (English),Name (Chinese),Region of Focus,Language,Entity owner (English),Entity owner (Chinese),Parent entity (English),Parent entity (Chinese),X (Twitter) handle,X (Twitter) URL,X (Twitter) Follower #,Facebook page,Facebook URL,Facebook Follower #,Instragram page,Instagram URL,Instagram Follower #,Threads account,Threads URL,Threads Follower #,YouTube account,YouTube URL,YouTube Subscriber #,TikTok account,TikTok URL,TikTok Subscriber #
Yang Xinmeng (Abby Yang),杨欣萌,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_bubblyabby_,https://twitter.com/_bubblyabby_,1678.00,itsAbby-103043374799622,https://www.facebook.com/itsAbby-103043374799622,1387432.00,_bubblyabby_,https://www.instagram.com/_bubblyabby_/,9507.00,_bubblyabby_,https://www.threads.net/@_bubblyabby_,197.00,itsAbby,https://www.youtube.com/itsAbby,4680.00,_bubblyabby_,https://www.tiktok.com/@_bubblyabby_,660.00
CGTN Culture Express,,Anglosphere,English,China Media Group (CMG),中央广播电视总台,Central Publicity Department,中共中央宣传部,_cultureexpress,https://twitter.com/_cultureexpress,2488.00,,,,_cultureexpress/,https://www.instagram.com/_cultureexpress/,635.00,,,,,,,,,

====
if the question is not related to the dataset, polietly inform you can only answer questions related to the dataset.
====
conversation history:

====
User's New Input: Hello AI
====

AI:

> Finished chain.

Observation: Hello! I am an AI Assistant created for the CANIS Data Visualization and Foreign Interference competition. I can help you with any questions or information related to the dataset. How can I assist you today?


> Finished chain.


> Entering new AgentExecutor chain...
Thought: I need to query the dataset
Action: answer_qa
Action Input: How many rows are there in the dataset

> Entering new AgentExecutor chain...

Invoking: `python_repl_ast` with `{'query': 'df.shape[0]'}`


758There are 758 rows in the dataset.

> Finished chain.

Observation: There are 758 rows in the dataset.
Thought: I now know the final answer
Final Answer: There are 758 rows in the dataset.

> Finished chain.

Accomplishments that we're proud of

NaturalViz has achieved significant milestones. The chatbot's ability to understand nuanced queries and generate relevant code for dynamic visualizations showcases its potential to revolutionize data exploration. The emphasis on inclusivity, transparency through code sharing, and the experimental approach to handling raw data highlight the project's commitment to innovation and pushing the boundaries of traditional data analysis methods.

What we learned

The development of NaturalViz has been a learning journey. The project underscored the potential of language models in enhancing user interaction with data. The dynamic nature of the chatbot and its adaptability to various queries emphasized the importance of user-centric design. The project also highlighted the possibilities and challenges of minimal human interference in the data exploration process, relying on the capabilities of language models for insights generation.

What's next for NaturalViz

Looking ahead, the NaturalViz team envisions further refinement and expansion of the chatbot's capabilities. This includes exploring additional language models and incorporating user feedback to enhance the chatbot's understanding and response accuracy. The project aims to establish partnerships for real-world applications, potentially integrating NaturalViz into existing data analysis workflows. Continuous experimentation and improvement remain central to the project's future, with a commitment to shaping the next generation of data exploration tools.