Tackling the Bull: Data Council Assembles a Braintrust to Take On the Biggest Problems in Data and AI
A recap of this year's Data Council conference and the Lightning Track that Vectara hosted
In a sea of vendor-driven conference agendas, Data Council is a practitioner-focused conference covering a wide breadth of trends in data and AI, including data governance, RAG, disaggregated storage querying, and the shifting ecosystems around popular data science tooling like Python and R. It's not just a bunch of PowerPoint presentations, either: there are just as many workshops and speaker office hours where attendees can dive even deeper into these topics directly with the experts. You can just tell that the conversations in the hallways are the potential beginnings of some storied solutions and products. The key difference between Data Council and your average data conference is the level of focus on community building and problem-solving; even just as an attendee, I felt like I knew many of the participants by the end. There is also a strong investor presence that helps connect these innovative startups with the teams that will help them take their projects to market.
I have a personal affinity for the spirit of the conference: I was given the opportunity not only to attend last year but also to host a stable of brilliant lightning talk speakers. I was thrilled when the Data Council team reached out and offered me the chance not only to host again this year but also to collect and curate this year's submissions. I wanted to include speakers and industry friends who I knew could deliver content that would keep the audience locked in. On top of that, we received some amazing submissions from all kinds of builders. In the end, the selection and schedule curation was no easy task.
One might think that a 15-minute talk would be easy for any speaker to put together. I would argue it's considerably more difficult: you often have to distill very complex topics into a short, concise, and engaging story while making sure you drive the key points home in the time allotted. We knew we were asking a lot of the speakers, but luckily, they still had until Thursday to figure it out.
Day 1: Data Engineering, Data Science, and Data Communities
Nestled in the bustling University of Texas campus, the conference perimeter was perfectly peaceful, with just about the best weather you can experience in Texas. Walking in on Day 1, I was excited to do a bit of drifting between the data science, data engineering, and data community tracks, as I am a hobbyist in a couple of areas across the full data lifecycle. The Data Engineering track captured the center stage; it was thoughtfully curated by the team at Brooklyn Data and featured leaders from Okta, LinkedIn, NextData, and eBay. I first landed in the Data Science track, where it was standing room only for Wes McKinney's talk "The Future Roadmap for the Composable Data Stack," which gave me a ton of insight into the thinking around the future and rebranding of his current company Posit (formerly RStudio), directly from the leader of projects like pandas and Apache Arrow. Next, I headed over and spent some time in the Data Community track, hosted and orchestrated by Maggie Hays, head of community at Acryl Data. I caught a great session by CJ Jameson from Monte Carlo on "Empowering Data Teams: A Step-by-Step Playbook for Leads and Managers," where I learned about the concept of rock and pebble people and projects, and how data teams can benefit from both types of contributors.
I stuck around for the next panel on modern data management, cheekily titled "WTF are we doing?", which included data leaders Chad Sanderson, Benn Stancil, and Caitlin Hudon and was moderated expertly by track host Maggie Hays. Some great points were made about the reality of AI and the implementation of data governance and data contracts. I also learned that, apparently, we are now in the "post" modern data stack, which to me is like the original modern data stack, now with nicer furniture. I had to end the day with the saucily titled "Is Kubernetes a Database?", and as with most of these brain-rattling questions, the answer is, of course, maybe.
Day 2: MLOps and Platforms, Data Streaming, and Building Data Products
On day 2, I was saddled up for more drinking from the firehose. DJ Patil returned to Data Council following his keynote last year to participate in a Q&A with an associate professor at UC Berkeley. He gave his perspective on the fast-paced changes coming with the introduction of generative AI into the mainstream, arguing that its role is all about unlocking the core value of data. He also spoke about how he evaluates data solutions as an investor.
Then, I ducked into a great talk by Daniel Olmedilla from LinkedIn, where he discussed the concept of fairness and how his company measures and decides upon acceptable levels of fairness when implementing AI models. Fairness and bias are emerging concerns in the AI community, and I am excited to be part of a company whose mission is to better surface these issues when they occur. The afternoon keynote was a roundtable with leaders from Pinecone, Anthropic, and Together.ai, as well as our previous podcast guest Ram Sriharsha. The panel touched on the pivotal importance of data, both its quality and reliability, when it comes to LLMs. Also surfaced was the role of vector databases and their significance as context windows become longer and RAG pipelines become more prevalent. I was able to jump into a couple of workshop sessions and also try to catch up on some of my tasks for day 3.
Day 3: Lightning Talks
As I mentioned earlier, I was thrilled to be invited to curate and MC the lightning track for the second year in a row. This year, we had 14 speakers, most of whom were new to the Data Council speaking circuit. There was a mix of topics around data management, object storage, LLM observability, infrastructure, and even Rust and blockchain. Many of the talks included interactive demos, which are often tricky in short-format speaking engagements. I was blown away by the caliber of the content and wanted to give you my quick breakdown. Please click through and check out the videos of the full sessions.
My podcast partner and Head of Developer Relations at Vectara, Ofer Mendelevitch, showed us how to build an end-to-end RAG application even if you aren't an ML engineer. Ofer went over the fundamental requirements for a RAG solution and then showed a short video of dragging in documents and creating a quick and easy UI to serve up generative answers. Check out Ofer's video, and sign up to try it for yourself for free.
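To make the "fundamental requirements" concrete, here is a toy sketch of the two steps at the heart of any RAG solution: retrieve relevant passages, then ground the generative prompt in them. This is not Ofer's demo or the Vectara API; a managed platform handles both steps for you (with dense vector embeddings rather than the hypothetical word-overlap scoring used here, and an actual LLM call instead of a prompt string).

```python
# Toy RAG sketch: retrieval + prompt grounding, no external services.
# Real systems use vector embeddings and an LLM; word overlap stands in here.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().replace(".", "").split())

    def overlap(doc: str) -> int:
        return len(query_words & set(doc.lower().replace(".", "").split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the generative step by prepending the retrieved passages."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Vectara provides RAG as a managed service.",
    "Kafka streams events between producers and consumers.",
    "RAG grounds LLM answers in retrieved documents.",
]
context = retrieve("how does rag ground answers in documents", docs)
prompt = build_prompt("how does rag ground answers in documents", context)
```

The design point is that the application code stays this simple; the hard parts (chunking, embedding, ranking, generation) are exactly what an end-to-end platform abstracts away.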
Jordan Volz introduced us to his new open-source project, Lolpop, aimed at bridging the gaps between data science and software development principles. The project seeks to bring the same level of repeatable, disciplined code delivery found in software engineering to the world of data science and ML development. Jordan also showed a quick code example of the Lolpop framework at work. Check out his project.
Next, Erick Enriquez detailed the migration to a data lakehouse that he led at DoorDash, which he is hardening into a solution that includes benchmarking and workload parity. Erick gave some great tips on avoiding the pitfalls he encountered.
Amber Roberts taught us about the pillars of LLM observability, including evaluations, traces and spans, search and retrieval, and fine-tuning. Amber discussed solutions like OpenAI's Evals framework and the Arize Phoenix platform for evaluating LLMs. She also talked about RAG as a way to give LLMs better context.
Andrew Dworschak explained how to keep brands safe with data indexing on the blockchain. Andrew talked about how his team chose between three models for an indexer and created a brand protection platform.
Gilberto Hernandez covered Snowpark Container Services and LLM integration with Cortex. Gilberto even showed a demo using LLMs to answer questions about structured data in Snowflake via Snowpark Container Services.
Slater Stich provided a very powerful reason to consider Rust. The short of it? You can write imperfect code, and generally, it will still work. Rust is worth considering for advanced ML and data science workloads, especially for teams looking to get product pilots out the door quickly. Find out why.
Sai Krishna Srirampur talked about the novel challenges of moving data into Postgres. Sai has been steeped in the Postgres community for years, even hosting the Postgres user meetup at Data Council. He talked about why his company focused on solving this problem specifically for Postgres users.
Brenna Buuck gave two solutions for moving database workloads to disaggregated storage. A hot topic at the conference was the notion of disaggregating storage from the query engine, and even running queries directly on unstructured object storage as a primary database. Brenna went through two examples of when this approach should be considered.
Kirit Basu showed the audience what they actually got when they thought they were getting a data catalog. Many organizations view the data catalog as a chore, but data teams know the impact a dynamic data catalog can provide. Kirit took aim at some of the common myths and missteps in implementing an enterprise data catalog.
Ashwin Ramesh talked about LLM-powered copilots and how they can impact your data team. Ashwin grouped the solutions into categories such as reasoning engines, knowledge bases, and even agents, and went into depth on the current constraints of building an LLM copilot.
This one was especially enlightening to me. Tobias Lunt talked to us about how his team is using LLMs on text data to combat social divisions and break down caste systems. Tobias explained how cultural norms are measured and used to build a response evaluation model focused on the internal consistency between the excerpts and the encoded norm variables.
Aaron Funk Taylor gave a live demonstration of a data streaming and schema management solution he developed for a large DevOps platform. It was true modern data management at the edge, complete with a hardware example: a rotary phone that created events in Apache Kafka. Check out the video of the live demo.
Taking us home was Hope Wang, who explained why you have to tackle the IO challenges before you can master a data lake. Hope pressed on a common obstacle most companies overlook when planning a data lake migration and outlined the roles each layer takes in an MDS architecture.
Wrapping Up
This year's Data Council brought forth exceptional talent, a collaborative atmosphere, and endless potential to make data projects a reality. Huge kudos to the team at Zero Prime Investments for their hard work in putting the conference together. The team at Vectara was honored to be part of the show. My experience as track host left me with a deep sense that the design patterns we wax poetic about have real potential to change the world. I know that my noble track speakers will be at the forefront of some of those exciting innovations, and I can't wait to see what they create.