Big Data Ldn

Big Data Ldn was the first conference I ever attended. I was expecting a lot, and it didn’t quite live up to the hype, but I had fun anyway! I thought it would be interesting to document my experience to do two things:

  • (a) Give a breakdown of the talks I attended, and hopefully provide some insight for anyone who missed them.
  • (b) Share some thoughts and feelings I had whilst there. I struggle with anxious thoughts, so coming to an event like this on my own was tough. I’m hoping there are other data practitioners who share this experience, and that this post might help them feel at ease.

Day 1 (20th September 2023)

Initial thoughts

  • Conference is very cool, lots of interesting companies.
  • It’s definitely more aligned towards data engineering in terms of the businesses present - lots of ETL & Kafka wrapper companies.
  • Talks are more suited towards data science - which is good!

Chatting with companies

  • I spoke to a few companies - they were all friendly and happy to chat, so don’t be afraid to approach them!
  • Ask questions - even if you think you’re not suited to the product or service they’re offering.
    • I didn’t ask enough questions - I mainly just listened to the pitch. No question is too dumb!
  • Two interesting companies I found: Hopsworks and ArangoDB.
    • Hopsworks is a cool platform similar to Weights & Biases, but with more features. It also handles the data engineering side of things, such as version-controlled feature stores.
    • ArangoDB is a graph database. Whether it improves on Neo4j I’m not sure; however, knowledge graphs (KGs) are an alternative to vector stores for GenAI RAG.

Talks

Mmmm, I love RAGu! A lot of the talks today were on RAG (Retrieval-Augmented Generation) & GenAI. It’s definitely on everyone’s mind - and RAG is an easy way of showing it off.

Scaling Data Quality with Unsupervised ML Methods

Data quality refers to the content actually inside your data, as opposed to its structure or schema. Ensuring high data quality usually requires an SME (subject-matter expert) to say what the data should look like and when it should look like this.

There are 3 methods to monitor data quality:

  • Validation Rules - build rules and confirm that the data conforms to them. These are things such as “This column must contain less than 3% null values” (see the sketch after this list).
  • Metric Anomalies - check over time that metrics stay within some threshold. This creates a lot of false positives.
  • “Unsupervised Data Monitoring” - more on this below.
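
As a trivial example of the first method, a validation rule like the null-value check can be a few lines of pandas (the column name and threshold here are made up):

```python
import pandas as pd

def check_null_rule(df: pd.DataFrame, column: str, max_null_frac: float = 0.03) -> bool:
    """Validation rule: the column must contain less than `max_null_frac` null values."""
    return df[column].isna().mean() < max_null_frac

# Hypothetical usage: fail the pipeline if the rule is violated.
df = pd.DataFrame({"user_id": [1, 2, None, 4]})
if not check_null_rule(df, "user_id"):
    raise ValueError("Data quality check failed: too many nulls in user_id")
```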

So, this “Unsupervised Data Monitoring” - it wasn’t really unsupervised learning. It was unsupervised in the sense that the platform does it for you, but under the hood it still uses a supervised ML algorithm. The algorithm works as follows:

  • Every time new data comes in, multiple comparison datasets are built. One example would be a dataset for today and one for yesterday.
  • Yesterday’s data is labelled 0, and today’s data is labelled 1.
  • XGBoost is then trained to predict whether each row comes from today or yesterday.
  • If the model can predict this accurately (i.e. the loss decreases), we know the data has changed.
  • SHAP outputs are then used to determine the exact features that caused this change.

If the data is different enough (i.e. the model accurately predicts when each row came from), automated root-cause analysis is performed to determine other factors that may have led to this decrease in data quality.
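
A minimal sketch of the algorithm as I understood it, assuming xgboost and shap as the libraries (the platform’s real implementation will differ):

```python
import numpy as np
import pandas as pd
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def detect_drift(yesterday: pd.DataFrame, today: pd.DataFrame, auc_threshold: float = 0.7):
    """Train a classifier to separate yesterday's rows (0) from today's (1).
    If it can, the distributions differ; SHAP then points at the culprit features."""
    X = pd.concat([yesterday, today], ignore_index=True)
    y = np.array([0] * len(yesterday) + [1] * len(today))
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    if auc <= auc_threshold:
        return None  # the two days look alike; no drift detected

    # Attribute the drift to individual features via mean |SHAP| value.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    importance = np.abs(shap_values).mean(axis=0)
    return pd.Series(importance, index=X.columns).sort_values(ascending=False)
```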

Some challenges

  • Time-associated columns (such as ID & Date) need to be removed from the dataset.
  • Seasonality. Data may always be different between a “Monday” and a “Sunday”.
    • This is why multiple datasets and models are required.

Interesting stuff and definitely useful for data that is changing a lot!

Bridging the Gap: Integrating Knowledge Graphs and Large Language Models

There’s a paper on this topic, which can be found here; I’ll summarise the talk below.

He described two areas where LLMs are useful when working with KGs. The first is the creation of the KG itself. The second is the querying of the KG.

Creating the KG

Use the LLM to extract entities and relations from the text; a KG can then be built from this data. My opinion on this: I’d have to see it to believe it. Sure, it’ll work with cherry-picked examples, but with complex, long, unstructured text I think it would struggle. This would definitely be an augmentation tool - rather than “LLMs build KG from scratch???” clickbait.
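
For illustration, a minimal sketch of the extraction step, assuming the openai chat API - the prompt, model choice, and output format are my own, not the speaker’s:

```python
import json
from openai import OpenAI  # assumes the openai package; any chat LLM would do

client = OpenAI()

EXTRACTION_PROMPT = """Extract (subject, relation, object) triples from the text.
Respond with only a JSON list, e.g. [["ArangoDB", "is_a", "graph database"]].

Text: {text}"""

def extract_triples(text: str) -> list:
    """Ask the LLM for triples; each triple becomes an edge in the KG."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(text=text)}],
    )
    # Assumes the model actually returns valid JSON - a big "if" on messy text.
    return json.loads(response.choices[0].message.content)
```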

Querying the KG

This was the first talk to mention RAG - but don’t be afraid, it was the buzzword on everyone’s lips today. I think because it’s such a low-risk, high-impact way of using this fantastic new technology, everyone wants to talk about it.

Anyway, at least this was a slightly different kind of RAG. Rather than using vector databases and word embeddings, this talk used the KG as the knowledge base, pulling out the relevant entities and relations. This is “better” because you maintain the relationships between different entities, which in theory beats a plain similarity search…
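
As a toy sketch of that retrieval step, using networkx in place of a real graph database (the entities and relations are invented):

```python
import networkx as nx

# Toy KG; in a real system this would live in Neo4j or ArangoDB.
kg = nx.MultiDiGraph()
kg.add_edge("Aspirin", "Headache", relation="treats")
kg.add_edge("Aspirin", "Stomach irritation", relation="may_cause")

def kg_context(entity: str, hops: int = 1) -> str:
    """Serialise the entity's local neighbourhood into text to prepend to the LLM prompt."""
    nodes = nx.ego_graph(kg, entity, radius=hops, undirected=True).nodes
    facts = [
        f"{u} --{data['relation']}--> {v}"
        for u, v, data in kg.edges(data=True)
        if u in nodes and v in nodes
    ]
    return "\n".join(facts)

# kg_context("Aspirin") keeps the relations intact, unlike a plain similarity search.
```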

LLMOps: Everything you need to know to manage LLMs

This was again part of the RAGu - this time using Databricks to handle your LLMOps. I’m not 100% sure she actually said anything about LLMOps, because if you’re just using an OpenAI wrapper, do you even need LLMOps? Model drift and model responses (RLHF) are all handled by them - and if they’re not, Databricks doesn’t handle that either.

If this talk had been named “Using Databricks for a Documentation RAG Solution”, it would have been very good. Essentially, there is now built-in tech inside Databricks for using LangChain as the chainer in a RAG solution. The most interesting point was that they used an embedding model, e5-large-v2, separate from the LLM. This means they had to store the original chunks and make sure these were indexed within the vector database.
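
A rough sketch of that pattern, using sentence-transformers to load e5-large-v2 and a plain in-memory index (the chunks here are made up; Databricks’ actual setup will differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# The embedding model is separate from the generating LLM, so the original
# chunk text must be stored alongside its vector.
encoder = SentenceTransformer("intfloat/e5-large-v2")

chunks = ["Databricks supports LangChain for RAG.", "MLflow tracks the ML lifecycle."]
# e5 models expect "passage: " / "query: " prefixes.
vectors = encoder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Embed the query with the same model and return the closest original chunks."""
    q = encoder.encode(f"query: {query}", normalize_embeddings=True)
    scores = vectors @ q  # cosine similarity, since everything is normalised
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```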

The closest thing to LLMOps mentioned was MLflow, a platform useful across the whole ML lifecycle. Databricks offer an enterprise version of it. You can find it here.
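
For reference, a minimal MLflow tracking sketch (the experiment name, params, and metric are hypothetical):

```python
import mlflow

# Track a run so experiments are reproducible and comparable.
mlflow.set_experiment("rag-eval")

with mlflow.start_run(run_name="e5-large-v2-baseline"):
    mlflow.log_param("embedding_model", "intfloat/e5-large-v2")
    mlflow.log_param("chunk_size", 512)
    mlflow.log_metric("retrieval_hit_rate", 0.87)  # made-up number
```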

Using Generative AI to summarize Unstructured Data

The pot of RAGu is overflowing. This talk was from Elsevier, a company that has petabytes of scientific-paper data. They perform their RAG using the following steps:

  • Use a human to structure the data - yep, again the data has to be made somewhat structured before it’s at all useful.
  • Embed as vectors using the structured chunks.
  • LLM querying and summarization.

Interestingly, they also performed complexity reduction on the papers - reducing them from PhD level down to “5-year-old”, “Teenager”, etc… This is cool, and very useful for teaching.
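
One hypothetical way to do that complexity reduction is plain prompting (this is my guess, not necessarily Elsevier’s approach):

```python
from openai import OpenAI  # same assumption as the earlier extraction sketch

client = OpenAI()

def summarise_at_level(abstract: str, level: str) -> str:
    """Rewrite an abstract for a given reading level via a single prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite this abstract so {level} could understand it:\n\n{abstract}",
        }],
    )
    return response.choices[0].message.content

# e.g. summarise_at_level(abstract, "a 5-year-old") or summarise_at_level(abstract, "a teenager")
```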

Overall, the guy was informative and the talk was cool - however it is just RAG.

The IQ of AI: Measuring Intelligence In AI Models

This talk was delivered by a psychologist. Super interesting - essentially counteracting the points made in this paper here. That paper made the claim that GPT-4 showed signs of AGI by passing a set of tests. However, she argued that GPT-4 doesn’t really possess these traits, and that the “tests” GPT-4 solves are actually included in its training data.

She argued that one aspect of intelligence is understanding that other entities have their own minds. The prompt used was: “I place my key on the 3rd shelf. I leave for the weekend. During the weekend, my partner takes the key and places it on the 4th shelf. When I come back on Monday, where do I expect the key to be?”. GPT-4 correctly answers the 3rd shelf.

However, if you replace the third sentence with “During the weekend, my cat plays with the keys and they end up on the floor.”, GPT-4 incorrectly answers that “you would expect the key to be on the floor”. From this, we can infer that the original sentence was in the training data, yet the ‘odd’ cat sentence wasn’t.

GPT-4 also can’t solve any of François Chollet’s ARC tests here. We could infer that GPT-4 is actually not that intelligent, and just stitches pieces of its training data together.

Applications and Misapplications of A.I and NLP in Text Analytics: Some Lessons Learned

This talk was about this guy’s use of NLP over the last 30 years, plus some tips and tricks. I was hoping for more NLP content; however, the main takeaway from this talk was “don’t reinvent the wheel”.

He looked at the recent advances in word embeddings - word2vec, BERT, etc. - and compared them to his company’s way of performing sentiment classification and text classification. He found that PCA worked much better than anything BERT could produce!

I do think this might be comparable to the gzip nearest-neighbour algorithm as a text classifier - which outperformed certain deep learning models on a text classification task.
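
For reference, the core of that gzip method - normalised compression distance plus a k-nearest-neighbour vote - fits in a few lines:

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    """Normalised compression distance, with gzip as the compressor."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """k-nearest-neighbour vote over compression distance; no training, no parameters."""
    neighbours = sorted(train, key=lambda item: ncd(text, item[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# classify("the film was wonderful", [("great movie", "pos"), ("terrible plot", "neg")], k=1)
```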

Two points of note:

  • “Every science has its limits”
  • “Scientists often neglect to examine past solutions”

Post-Day Networking Event

I was exhausted by this point, and since I was on my own I struggled to go up and talk to anyone. I ended up going home, as I didn’t know how to start a conversation. However, I think the key here is to a) ask people about what they’re doing, and b) have something you’re really passionate about, so you can talk about it if the conversation runs dry.

Day 2 (Thursday 21st September 2023)

Chatting with Companies

  • Microsoft - What is Microsoft Fabric? - The “Unified data platform for the era of AI”. Find it here
  • Dataiku - I spoke to them about the lack of a good code editor inside Dataiku - he showed me some plugins to get VSCode working with it.
  • Giskard - Link - This is a super cool ML testing framework. It’s open source too, so I’ll definitely have a look at using it (see the sketch after this list).
  • I also spoke to a sustainability consulting company - but it was kind of bs. I think the only way to be sustainable is more efficiency. We don’t need regulation or anything like that. Just make your code go brrr.
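
From memory, Giskard’s scan API looks roughly like the sketch below - treat the names and signatures as assumptions and check their docs before relying on this:

```python
import giskard
import pandas as pd

df = pd.DataFrame({"text": ["great talk", "awful queue"], "label": [1, 0]})

def predict(batch: pd.DataFrame):
    # Stand-in model: returns class probabilities for each row.
    return [[0.1, 0.9]] * len(batch)

model = giskard.Model(
    predict,
    model_type="classification",
    classification_labels=[0, 1],
    feature_names=["text"],
)
dataset = giskard.Dataset(df, target="label")

report = giskard.scan(model, dataset)  # probes for robustness, bias, and other issues
print(report)
```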

Talks

Accelerating Data-Driven Workflows at ARM

First talk of day 2, and it was a pretty pointless one.

There wasn’t much on how ARM actually uses data to improve the chip design process - it was more about how ARM cloud instances are better than everyone else’s. It felt like a sales pitch from Databricks & ARM. Not much to say on this one.

Architecting a Cloud Data Mesh With DataBase and Event Change Streams For Real Time Gen AI Applications

The RAGu is tasty today!

A data mesh, according to Alok, is “an ecosystem of products that play nicely”. I like this definition - it essentially means something that can pull in data from anywhere in order to act on that data in some way. Striim is a data mesh for streaming data.

The demo was a RAG system allowing natural-language querying of an e-commerce website. Two interesting points of note:

  • Embeddings were performed using Google Vertex AI (as the LLM used was PaLM).
  • The vector database was actually PostgreSQL, using the pgvector extension (see the sketch after this list).
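
A rough sketch of that setup via psycopg2 (hypothetical table and data; assumes a Postgres instance with pgvector installed, and toy 3-dimensional vectors instead of real embeddings):

```python
import psycopg2  # assumes PostgreSQL with the pgvector extension available

conn = psycopg2.connect("dbname=shop")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id bigserial PRIMARY KEY,
        description text,
        embedding vector(3)  -- real embeddings would be hundreds of dimensions
    );
""")
cur.execute(
    "INSERT INTO products (description, embedding) VALUES (%s, %s);",
    ("red running shoes", "[0.1, 0.9, 0.2]"),
)

# Nearest-neighbour search: `<->` is pgvector's L2 distance operator.
cur.execute(
    "SELECT description FROM products ORDER BY embedding <-> %s LIMIT 5;",
    ("[0.1, 0.8, 0.3]",),
)
print(cur.fetchall())
conn.commit()
```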

Pretty interesting talk - but nothing really “new” to add to the RAGu.

Building Scalable Data Products To Fuel Data Science Outcomes

  • Keith Lawrence, Red Hat
  • Andy Mott, Starburst

This talk was about Red Hat Data Science - “an ML platform for the hybrid cloud”. Starburst is a ‘data mesh’, as it essentially allows you to query data from anywhere without even thinking about it. This talk again felt like a sales pitch; however, I think Starburst is a pretty cool product - very useful when you want to start using data quickly and scale fast. Other than that, nothing really to add from this talk. I thought Dataiku looks much better than Red Hat Data Science, so I’d continue using that.

A History of Language Models: Explained By Generating Session Submissions for Big Data Ldn

Whilst Advancing Analytics may have good, informative YouTube videos (found here), this talk was pretty poor. He essentially admitted it was written by ChatGPT, and then spent 20 minutes talking through a super basic overview of the history of LMs - which anyone with any interest in LMs would know already. It would have been nicer to spend more time on the details of some of those architectures and really give his opinions on them.

He mentioned Retentive Networks - here. I had only heard of these in passing, as it’s a relatively new paper, but I’ll give it a read as it seems interesting.

At least this LSTM generated film was cool: https://www.youtube.com/watch?v=LY7x2Ihqjmc.

How to choose the right infrastructure and solutions for your ML projects

I think this talk summed up my view of the day - the titles of the talks were much more interesting than the talks themselves. I was hoping for more technical detail on the aspects to take into consideration when choosing a product, but I guess that’s specific to each use case. His points essentially boiled down to: do you want to build your own product suited completely to you, or do you want a flexible product you can buy or use (open source) that might not fit as well but works?

Nothing really to say on this one.

Overall Thoughts

Day 1 was better than Day 2 - more interesting technical talks, in my opinion. However, on both days I had fun chatting to the vendors and getting some free t-shirts! I’m also excited to try out Giskard - it seems like the most useful product of the ones I saw and spoke to. Perhaps Hopsworks too.

I often found myself getting anxious in the crowd and when trying to chat to vendors. However, if you really looked around and studied the other people there, you could see some anxious looks on their faces too. I think you have to remember that no one is going to look at you twice - people are stuck inside their own heads. Have fun, be curious, and you’ll get the best out of these conferences - and make sure to go and ask for the swag, they don’t just offer it out!