How to Communicate with Data Engineers and Other Data People
Learning Dataspeak - Like a Tourist
Every professional field has its own language. When we started working with fundraising data, I had to figure out what caging was. (In case you aren’t a fundraiser, it doesn’t involve chickens.) I bet you didn’t know that a diplopodologist is someone who studies millipedes. And why would you? That’s not information you need to do your job. (Although we do love learning about other people’s jobs from Alie Ward at Ologies. Shout out to Jenn Taylor for sharing that podcast with us.) A psychologist might diagnose you with anatidaephobia (but only if you confessed you were worried about waterfowl spying on you). If a gardener complains about weeds, it’s probably not the same weed being sold down by the bus station. The word “hot” is a good example of language diversity; context and our unique lingual experience determine if we mean stolen, electrified, attractive, spicy, a color in the red quadrant of the color wheel, or running a temperature. No matter what field you work in, or what family and culture you are part of, you’ve probably absorbed its unique language and speak it like a pro without thinking about it much.
When it comes to data - talking about data, thinking about data, working with data - non-data folks sometimes get intimidated. Yet if you want to work with data engineers, data analysts, programmers, or other specialists from the data world, you need to find some shared language so everyone is on the same page when collaborating on projects. Data engineers don’t talk like R2-D2, but they do have their own language. Here are some terms that aren’t so hard to learn, and that will make working with data engineers much easier.
In this blog post, we will talk about some of the terms that are being used a lot lately (sometimes incorrectly) so you can feel more comfortable hearing them or using them.
Artificial Intelligence and Machine Learning
Machines, including computers, are not sentient beings except in some science fiction. (See here, and here for some entertaining examples.) Artificial intelligence is the broad term for systems that enable a machine to imitate human-like behavior such as problem-solving, learning, and planning. It gives the appearance of being smart, but only the appearance. Machine learning is a type of AI that allows a machine to learn automatically from past data without being explicitly programmed with the rules. AI and ML can be useful when dealing with huge amounts of data, but these systems can also carry biases unintentionally introduced through the data they learn from or the way they are programmed. Not being human, the computer can’t recognize those results as irrational or biased.
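To make "learning from past data without explicit programming" a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the donor numbers are hypothetical, and `learn_cutoff` is a toy stand-in for real machine learning methods, which are far more sophisticated. The point is only that nobody types in the rule; the program finds it by looking at past examples.

```python
# Hypothetical past data: (gifts made last year, whether the donor gave again).
past_donors = [(0, False), (1, False), (2, False), (3, True), (4, True), (6, True)]

def learn_cutoff(examples):
    """Try each candidate cutoff and keep the one that gets the most past examples right."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({gifts for gifts, _ in examples}):
        # Count how many past donors this cutoff would have classified correctly.
        correct = sum((gifts >= cutoff) == gave_again for gifts, gave_again in examples)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def predict(cutoff, gifts):
    """Guess whether a donor will give again, using the learned cutoff."""
    return gifts >= cutoff

cutoff = learn_cutoff(past_donors)   # the "rule" comes from the data, not from us
print(cutoff, predict(cutoff, 5))
```

Notice that if the past data were skewed, the learned cutoff would be skewed too, which is one way bias sneaks into these systems.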
Data Lakes, Warehouses, and Pipelines
We all collect a lot of data. “Less is more” said no fundraiser or data analyst ever. All that data has to get stored somewhere to ensure it will be within reach when we want to use it. A Data Lake is a big pool of raw data, usually fed by multiple data sources. The name is sometimes explained by analogy to the lakes of pre-refinery crude oil where derricks dump the oil they extract. Data Lakes usually hold data in its original form, much of it unstructured, and are often hosted in the cloud.
Data Warehouses are repositories for structured, filtered data that has already been processed for a specific purpose. The Data Warehouse takes data from the Data Lake, cleans it up, filters it, and stores it as structured data that can be shared and used. This data is sometimes called “refined” data, keeping to the theme of the oil industry.
A Data Pipeline is the pathway along which data flows from one spot to another, similar to how a water pipe brings water into your kitchen or a downspout directs rainwater from your roof into a storm drain. Most data systems have multiple Data Pipelines to move data in and out of the Lakes, Warehouses, and applications that make up the system.
Another useful analogy for understanding the relationship between Lakes, Warehouses, and Pipelines is your supermarket: the back storage room, where all inventory is first delivered and stored until needed, is like a Data Lake. Nothing in there needs to be sorted or made presentable to customers. Eventually, though, stock is moved onto the supermarket shelves in an orderly fashion, just as raw data is moved to the Data Warehouse. Here shoppers can find chicken in the cold case, spices and flour in the baking aisle, and flowers and candy near the tabloids. In the Warehouse, as in the public areas of the market, ideally everything you see is grouped logically, cleaned and available, and ready for use. And just as you transport your groceries to your vehicle with a shopping cart, you’ll stream your data to its next destination with a Pipeline.
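For readers who like to see things in action, here is a small Python sketch of the same idea. It is an illustration under made-up assumptions, not how any particular product works: the records, the field names, and the `pipeline` function are all hypothetical. The list plays the role of the Lake (messy, stored as it arrived), the function plays the Pipeline, and the result plays the Warehouse (clean and consistently structured).

```python
# The "lake": raw records exactly as they arrived, in mixed shapes (all made up).
data_lake = [
    {"name": " Ada Lopez ", "gift": "25.00"},
    {"name": "SAM OKORO", "gift": "$100"},
    {"name": "", "gift": "n/a"},  # unusable record
]

def pipeline(raw_records):
    """Move records from the lake toward the warehouse, cleaning along the way."""
    cleaned = []
    for record in raw_records:
        name = record["name"].strip().title()          # tidy spacing and capitalization
        amount_text = record["gift"].replace("$", "")  # drop the currency symbol
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # filter out records without a usable amount
        if name:
            cleaned.append({"donor": name, "gift_usd": amount})
    return cleaned

# The "warehouse": structured, filtered data that is ready to use.
data_warehouse = pipeline(data_lake)
print(data_warehouse)
```

Real pipelines usually run on a schedule and handle far larger volumes, but the shape is the same: raw data in, cleaned and structured data out.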
APIs - Application Programming Interfaces
An API is a software interface that lets another program, often one run by a remote third party, access an application’s data or functions. An API defines the rules a programmer follows to request data from an application, and the format in which data is sent to and from that application. APIs are useful for data sharing between collaborating organizations or funding partners like philanthropic foundations and their beneficiaries.
Imagine a warehouse, where deliveries are received, and items are loaded and unloaded, and there is also a storefront. The API is the worker who receives deliveries, makes sure they are the right things, and makes sure customers get shipped the items that they ordered. APIs essentially help applications communicate more smoothly.
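Here is a minimal Python sketch of that worker at the loading dock. It is a simplified illustration, not a real API: an actual one would typically answer requests over the internet, and the grant record, the `get_grant` action, and the `handle_request` function are all hypothetical names invented for this example. What it does show is the core idea: requests must follow agreed rules, and replies come back in an agreed format.

```python
import json

# Hypothetical data held by the application behind the API.
_GRANTS = {"G-1": {"grantee": "River Food Bank", "amount_usd": 5000}}

def handle_request(request_json):
    """Check that a request follows the agreed rules, then answer in the agreed format."""
    request = json.loads(request_json)
    # Only one kind of request is supported in this sketch.
    if request.get("action") != "get_grant" or "grant_id" not in request:
        return json.dumps({"status": "error", "message": "unsupported request"})
    grant = _GRANTS.get(request["grant_id"])
    if grant is None:
        return json.dumps({"status": "error", "message": "grant not found"})
    return json.dumps({"status": "ok", "grant": grant})

# A partner organization asks for one grant record and reads the reply.
reply = json.loads(handle_request('{"action": "get_grant", "grant_id": "G-1"}'))
print(reply["status"], reply["grant"]["amount_usd"])
```

Because both sides agree on the request and reply formats ahead of time, neither needs to know how the other's software works on the inside.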
Now that you’ve learned some dataspeak, how about speaking with Inciter? Click here to book a meeting and start a conversation with us - we’d love to hear about your data, and help you clear up any confusion you might have about these concepts.