As a platform company powering businesses all over the world, Stripe processes payments, runs marketplaces, detects fraud, and helps entrepreneurs start an internet business from anywhere in the world. Stripe’s Data Infrastructure Engineers build the platform, tooling, and pipelines that manage all of that data.
At Stripe, decisions are driven by data. Because every record in our data warehouse can be vitally important for the businesses that use Stripe, we’re looking for people with a strong background in big data systems to help us build tools that scale while keeping our data correct and complete. You’ll be creating best-in-class libraries to help our users fully leverage open source frameworks like Spark and Scalding. You’ll be working with a variety of teams, some engineering and some business, to provide tooling and guidance that solve their data needs. Your work will allow teams to move faster, and ultimately help Stripe serve our customers more effectively.
You will:

* Create libraries and tooling that make distributed batch computation easy to build and test for all users across Stripe
* Become an expert in and contribute to open source frameworks such as Scalding and Spark to address issues our users at Stripe encounter
* Create APIs that help teams materialize data models from production services into readily consumable formats for all downstream data consumption
* Create libraries that enable engineers at Stripe to easily interact with various serialization frameworks (e.g. Thrift, BSON, Protobuf)
* Create observability tooling to help our users easily debug, understand, and tune their Spark and Scalding jobs
* Leverage batch computation frameworks and our workflow management platform (Airflow) to help other teams build out their data pipelines
* Own and evolve the most critical upstream datasets
We’re looking for someone who has:

* A strong engineering background and an interest in data. You’ll be writing production Scala and Python code.
* Experience developing and maintaining distributed systems built with open source tools.
* Experience building libraries and tooling that provide beautiful abstractions to users.
* Experience optimizing the end-to-end performance of distributed systems.
* Experience writing and debugging ETL jobs using a distributed data framework (e.g. Spark, Hadoop MapReduce).
Nice to haves:

* Experience with Scala
* Experience with Spark or Scalding
* Experience with Airflow or other similar scheduling tools

It’s not expected that you’ll have deep expertise in every dimension above, but you should be interested in learning any of the areas that are less familiar.
Some things you might work on:

* Write a unified user data model that gives a complete view of our users across a varied set of products like Stripe Connect and Stripe Atlas
* Continue to lower the latency and bridge the gap between our production systems and our data warehouse by rethinking and optimizing our core data pipeline jobs
* Pair with user teams to optimize and rewrite business-critical batch processing jobs in Spark
* Create robust and easy-to-use unit testing infrastructure for batch processing pipelines
* Build a framework and tools to re-architect data pipelines to run more incrementally