About the role

Aplyr's Quick Take

This role is focused on developing and optimizing real-time video generation models for interactive AI characters. It's a hands-on position that involves training models, improving data pipelines, and ensuring low-latency performance for user-facing applications. You'll spend most of your time on model training and optimization, with some work on product integration.

Good fit

Ideal candidates will have at least 2 years of experience building and shipping ML systems, ideally including hands-on training of video generation models. A strong background in both research and engineering, along with a collaborative mindset, will help you thrive in this role.

Worth noting

The base salary range (€190,000 to €225,000, plus bonus) is high for this type of position, reflecting the specialized skills required. The role combines applied research with practical engineering, which may appeal to those who enjoy bridging the gap between theory and product.

About Cantina:

Cantina Labs is a social AI company, developing a suite of advanced real-time models that push the boundaries of expression, personality, and realism. We bring characters to life, transforming how people tell stories, connect, and create. We build and power ecosystems. Cantina, our flagship social AI platform, is just the beginning.

If you're excited about the potential AI has to shape human creativity and social interactions, join us in building the future!

About the Role:
We’re looking for a Member of Technical Staff with hands‑on experience building large‑scale video generation models—from data and training to distillation and acceleration into a fast, production‑ready model. Our models are human‑centric and product‑oriented: think interactive characters that can respond to text/audio/image inputs and generate video with very low latency.

This is an applied research + engineering role: you’ll work on training runs, data, model optimization, and the “make it fast” path that turns a capable research model into a real‑time experience.

Typical time split (roughly):

60–75% training / fine‑tuning / distillation of large video models
15–25% inference optimization (latency/memory/cost), model runtime work
10–15% prototyping + product integration (demos → shipped features)

What You’ll Do:

Train and scale video generation models: run large‑scale training/fine‑tuning on multi‑GPU (and when needed multi‑node) setups; own the training loop, stability, checkpoints, and iteration speed.
Own data for video modeling: build and improve video datasets/pipelines (decode/sampling, filtering/quality, conditioning alignment, storage formats), and keep the pipeline fast and reliable at scale.
Distill and compress big models into fast ones: teacher→student distillation, step reduction, architectural simplifications, and quality/speed trade‑offs to hit real‑time constraints.
Make models run in real time: profiling, memory optimizations, quantization-aware tactics where appropriate, kernel/runtime improvements, and practical throughput/latency wins.
Build the bridge to product: package models into simple inference APIs and prototypes; collaborate with product to turn research progress into user-facing experiences (interactive characters, conversational video).
Evaluate what matters: set up evaluation harnesses that track perceptual quality + temporal consistency + identity/character fidelity + latency/cost.

What You’ll Bring:

2+ years building and shipping ML systems (or equivalent), with clear ownership and delivery.
Strong PyTorch + Python, comfortable touching both training and inference code.
Hands‑on experience training or scaling generative models, ideally video generation (diffusion/transformers/VAEs or similar), not just using pre‑trained checkpoints.
Experience with distributed training and large runs (e.g., DDP/FSDP/DeepSpeed‑style workflows), and the practical debugging that comes with them.
Proven ability to improve performance in practice: latency/memory/cost optimizations, profiling, and shipping measurable wins.
Product mindset: can move from research ideas → robust implementation → iterating against real constraints.

Bonus Points For:

Experience with multimodal conditioning: audio‑to‑video, text+audio+image control, lip‑sync / gesture / character animation constraints.
End‑to‑end distillation experience (teacher/student design, eval strategy, failure analysis).
Familiarity with acceleration toolchains (torch.compile, Triton, TensorRT, ONNX, custom kernels) or model compression (quantization, pruning) where applicable.
Experience with real‑time streaming / WebRTC prototypes or low‑latency media delivery (helpful, but not the core of the role).

Technical Stack You’ll Work With:

ML: PyTorch (training + inference)
Models: large video generation (diffusion/transformers/VAEs), multimodal conditioning
Optimization: distillation, inference acceleration, multi‑GPU strategies
Product: rapid prototyping, lightweight inference APIs
Infra (supporting, not primary): Docker; cloud basics (AWS‑like services)

Location:

This role can be performed remotely in Europe, in time zones within GMT ±2 hours.

Compensation:

The anticipated annual base salary range for this role is €190,000 to €225,000, plus bonus. Compensation will be determined by a number of factors, including skills, experience, job scope, location, and competitive compensation market data.

Skills & Tags

node python aws docker ai data product design

Aplyr's read

Cantina is a social AI company building real-time models for interactive characters, attracting talent in machine learning and video generation.
Synthesized from recent postings & public sources

What's promising

•Cantina offers diverse roles in cutting-edge fields like machine learning and video technology.
•The company emphasizes innovation in digital experiences, appealing to tech-savvy creatives.
•Cantina's focus on expression, personality, and realism in AI characters provides opportunities for engineers drawn to creative applications.

What to watch

•Limited public information about Cantina's financial stability and growth prospects.
•The niche focus may limit opportunities for professionals outside creative and tech roles.
•Potentially high competition for roles due to the specialized nature of the work.

Why Cantina

•Cantina uniquely integrates machine learning with creative digital solutions.
•The company's emphasis on real-time video generation sets it apart in the social AI space.
•Cantina's blend of design and technology attracts a diverse range of tech and creative talent.

Aplyr’s read is generated by AI from public sources.