Vishva Mahadevan

Software Developer

Hey, I'm Vishva Mahadevan, a passionate Software Engineer. Welcome to my personal space on the web!

Software Developer 2 at Gupshup

- Present · Bangalore, India

Role Overview

As a Software Developer 2 at Gupshup, I transitioned to the analytics team, where I focus on building and optimizing large-scale data processing pipelines. The role has deepened my understanding of distributed systems and data analytics, and it presents challenges quite different from traditional application development.

Key Achievements

  • Architected and implemented analytics pipelines capable of processing 100 million events per day using Apache Flink (a minimal pipeline sketch follows this list)
  • Developed both streaming and batch processing solutions to handle diverse analytical workloads
  • Optimized data storage and processing patterns through careful consideration of serialization and compression techniques
  • Implemented efficient data storage solutions using Apache Parquet and query capabilities with Amazon Athena
  • Gained deep insights into distributed systems and their specific challenges in analytics contexts
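
To give a flavour of what such a pipeline looks like, here is a deliberately tiny, hypothetical sketch of a Flink streaming job that counts events per channel in one-minute windows. The class name, channel values, and windowing choices are illustrative only, not Gupshup's actual logic.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical sketch: count events per channel in one-minute windows.
public class EventCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A production job would read from a real source such as Kafka;
        // a handful of in-memory elements keeps this sketch runnable.
        DataStream<String> channels = env.fromElements("whatsapp", "sms", "whatsapp");

        channels
                .map(channel -> Tuple2.of(channel, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG)) // lambdas need an explicit result type
                .keyBy(pair -> pair.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

        env.execute("event-count-sketch");
    }
}
```

The real pipelines are, of course, far more involved, but the shape is the same: source, keyed transformation, windowed aggregation, sink.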

Technical Challenges Overcome

  • Adapted Java development practices to meet the unique requirements of distributed processing
  • Optimized object creation and management for better performance in Flink pipelines
  • Mastered the complexities of stream processing and state management
  • Implemented efficient serialization and compression strategies for large-scale data handling

Technologies Used

  • Processing Framework: Apache Flink
  • Programming: Java for Distributed Systems
  • Storage: Apache Parquet
  • Analytics: Amazon Athena
  • Cloud Infrastructure: AWS Services
  • Data Processing: Batch and Stream Processing Pipelines

Impact

My work has enabled efficient processing of massive data volumes, providing valuable insights for business decisions while maintaining system performance and reliability in a distributed environment.

Diving Deep into Analytics: My Journey from Services to Streams

When I got promoted to Software Developer 2 at Gupshup and moved to the analytics team, I quickly realized that this wasn't just another project switch – it was a step into a completely different realm of distributed computing. The transition from traditional service-based architecture to stream processing opened my eyes to new ways of thinking about data and systems at scale.

The Paradigm Shift

The first eye-opening moment came when I realized that my usual Java coding patterns weren't going to work in analytics. In the world of Apache Flink and stream processing, every line of code needs to be thought through differently. The same structured code that worked perfectly in REST APIs could become a performance bottleneck or even a show-stopper in a streaming environment.
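
A concrete example of that rethinking is something as mundane as how a record type is declared. The sketch below is hypothetical (MessageEvent and its fields are illustrative, not our real schema), but it shows the kind of flat, public POJO that Flink's built-in POJO serializer handles efficiently; a richer domain object with nested types and no default constructor quietly falls back to generic Kryo serialization, which you pay for on every one of those millions of records.

```java
// Illustrative only: a flat event type shaped so Flink recognises it as a POJO
// (public class, public no-arg constructor, fields that are public or exposed
// via getters/setters). Types that break these rules fall back to slower
// generic serialization.
public class MessageEvent {
    public String appId;
    public String channel;
    public long timestampMillis;
    public int messageCount;

    public MessageEvent() {
        // required for Flink's POJO type extraction
    }
}
```

In a REST service I would never have thought twice about this; in a Flink job, the difference between the POJO serializer and a Kryo fallback shows up directly in throughput.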

Scaling to 100 Million Events

Our primary challenge was handling 100 million events per day efficiently. This required a deep understanding of:

Data Processing Fundamentals

  • The importance of proper serialization and compression
  • Memory management in distributed systems
  • Stateful vs. stateless processing (a small keyed-state sketch follows this list)
  • The critical difference between streaming and batch pipelines
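
As promised above, here is a minimal, hypothetical sketch of the stateful side of that distinction: a keyed counter that keeps managed state so Flink can checkpoint and restore it, unlike a stateless map or filter that remembers nothing between events. It reuses the illustrative MessageEvent type from the earlier sketch.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: a per-key running count. The count lives in Flink-managed
// state, so it survives failures and restarts via checkpoints, unlike a plain field.
public class PerKeyCounter extends KeyedProcessFunction<String, MessageEvent, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("event-count", Long.class));
    }

    @Override
    public void processElement(MessageEvent event, Context ctx, Collector<Long> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(updated);
    }
}
```

It would sit after a keyBy, for example events.keyBy(e -> e.appId).process(new PerKeyCounter()).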

Technical Implementation

We implemented both streaming and batch pipelines using Apache Flink with Java. Some key learning points were:

  1. Object Creation: Every Java object needs careful consideration. In a high-throughput environment, even small inefficiencies get magnified millions of times.

  2. Simplicity is Key: Complex objects and processing patterns that work fine in traditional applications can become bottlenecks in stream processing. We learned to keep things as simple as possible.

  3. Storage Optimization: Working with Apache Parquet and Amazon Athena taught us the importance of proper data storage formats and query optimization.
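
To make the storage point concrete, here is a hedged sketch of writing a stream out as Parquet files that Athena can then query from S3. It assumes Flink's flink-parquet module (in recent releases the writer factory is AvroParquetWriters; older ones call it ParquetAvroWriters), reuses the illustrative MessageEvent type from earlier, and the bucket path is made up.

```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical sketch: sink a stream to columnar Parquet files for Athena to query.
public class ParquetSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats like Parquet roll files on checkpoints, so checkpointing is mandatory.
        env.enableCheckpointing(60_000);

        // Placeholder input; a real job would read from Kafka or another source.
        DataStream<MessageEvent> events = env.fromElements(new MessageEvent());

        FileSink<MessageEvent> parquetSink = FileSink
                .forBulkFormat(
                        new Path("s3://analytics-bucket/events/"), // hypothetical bucket
                        AvroParquetWriters.forReflectRecord(MessageEvent.class))
                .build();

        events.sinkTo(parquetSink);
        env.execute("parquet-sink-sketch");
    }
}
```

On the Athena side, most of the win comes from partitioning the S3 layout (for example by date) and letting the columnar format prune what each query actually reads.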

The Infrastructure Challenge

One of the most challenging aspects was setting up the environment with Kubernetes and the Flink Kubernetes operator. This required:

  • Understanding how Flink jobs deploy and scale on Kubernetes
  • Managing state backup and recovery (a checkpointing sketch follows this list)
  • Handling job upgrades without data loss
  • Ensuring proper resource allocation and utilization
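
Much of that list comes down to getting checkpointing and state right before worrying about the operator itself. Here is a minimal sketch of the job-side settings, assuming the RocksDB state backend and an S3 checkpoint bucket; the paths and intervals are illustrative, not our production values.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical sketch: checkpoint and state-backend settings that make
// recovery and job upgrades workable on Kubernetes.
public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps large keyed state off-heap; incremental checkpoints
        // upload only what changed since the last one.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Periodic, exactly-once checkpoints written to durable storage.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointStorage("s3://analytics-checkpoints/flink/"); // hypothetical bucket
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // ... build the actual pipeline here, then call env.execute(...)
    }
}
```

With durable checkpoints and savepoints in place, the Flink Kubernetes operator's savepoint-based upgrade mode can stop a job, roll out a new image, and restore from the savepoint, which is what makes upgrades without data loss possible.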

Cost Optimization Success

A significant achievement was participating in a cost optimization initiative that resulted in a 40% reduction in infrastructure costs. This involved:

  • Analyzing resource usage patterns
  • Right-sizing our Kubernetes clusters
  • Optimizing data storage and processing patterns
  • Implementing efficient scaling policies

Documentation: The Unsung Hero

One learning that stands out is the importance of documentation. In the world of analytics, where problems can be complex and solutions non-obvious, maintaining detailed documentation of issues and solutions became crucial. This helped:

  • Speed up problem resolution
  • Share knowledge across the team
  • Maintain system reliability
  • Reduce operational overhead

Key Learnings

  1. Think Distributed: Every piece of code needs to be thought of in terms of how it will behave when distributed across multiple nodes.

  2. Performance is Key: In analytics, performance isn't just about response time – it's about processing massive amounts of data efficiently.

  3. Resource Awareness: Understanding resource utilization is crucial when dealing with big data processing.

  4. Simplicity Wins: The simpler your code and architecture, the easier it is to maintain and scale.

Looking Forward

This transition to analytics has been a transformative experience. It has shown me how many aspects of software engineering – from code organization to infrastructure management – need to be approached differently when dealing with big data and stream processing.

The challenges of handling 100 million events daily have taught me the importance of:

  • Thinking at scale from day one
  • Understanding the entire data pipeline
  • Keeping performance in mind at every step
  • Maintaining robust documentation

For any developer looking to move into analytics, my advice is simple: be prepared to challenge your existing assumptions about software development. The rules are different here, but the opportunities to learn and grow are immense.