TecAlliance


Principal Software Engineer

I was the Tech Lead for the TecDoc Catalog Backend team for ~9 years. I built and operated web services and websites handling thousands of requests per second, deployed across multiple AWS regions around the world.

TecAlliance?


TecAlliance is a ~1,000-person worldwide company headquartered in Germany that provides automotive data and digital solutions for the global automotive aftermarket industry. Their shareholders are aftermarket parts manufacturers (e.g., Bosch, Bosal, Continental, Delphi, and Mahle).

TecDoc Logo

TecAlliance owns the TecDoc Data Standard, which is the primary standard used outside of North America for describing automotive aftermarket part data. They are also THE data repository for TecDoc data. If you are dealing with aftermarket automotive parts in Europe (or many other places in the world) then you are probably a customer of TecAlliance (either directly or indirectly).

Re-building the TecDoc Webservice

My first major project at TecAlliance was to re-architect and re-implement the TecDoc Pegasus V3 Webservice. This was an existing webservice used both internally to power the Web Catalog 2.5 and by external customers. It had performance problems. It had stability problems. It had downtime problems. Downtime was measured in days, and customers were not happy.

Old Architecture

The existing architecture consisted of a Java SOAP service application sitting in front of a DB2 database that held all of the product information, plus another DB2 database that contained shared customer configuration information. There were 6 production “lines”, each consisting of a set of load-balanced independent servers, plus the shared DB2 database that all of the Java application servers talked to directly.

Original TecDoc Webservice Architecture

Each “line” consisted of:

  1. A Proxy Server – I never got clarity on what this was. Maybe a server running Apache HTTPD + mod_proxy_ajp? I have no idea why that needed to be on a separate server.
  2. The App Server – Ran the Java SOAP Service
  3. DB2 Product Database – Contained all of the Part and Fitment data. This was read-only data updated once a month.

That’s a total of 18 production servers, all with decent CPU and RAM specs! These were all managed dedicated servers at an Atos datacenter in Frankfurt.

Problems with the Old Architecture

The performance problems came from using very complex SQL queries to implement all of the required TecDoc data model business logic. We are talking SQL queries that were hundreds of lines long, if not more, with joins and filters across dozens of tables and lots of copy-and-pasted SQL. I found examples in their source control history where a bug that was fixed in one place also had to be fixed in 6 other places. This reminded me a lot of the original SecondSpace setup, which had the exact same performance and maintenance problems.

DB2 and Oracle!

The other problem with the old architecture was cost! IBM DB2 databases were used to hold the product data and shared customer data. A separate Oracle Database was used to load and transform the raw data before dumping it and loading it into the final DB2 Databases.

The process seemed overly complex to me and required too many expensive database solutions.

New Architecture

The TecDoc Data Model consists of two major categories of data:

  1. Reference Data – Data supplied by TecAlliance that included things like Part Categories, Translations, Vehicles, Vehicle Configurations, etc.
  2. Supplier Data - Part and Fitment data supplied by the part manufacturers

Back in 2015 both sets of data were only updated monthly. Later the update frequency was increased to weekly. In both cases it was mostly static data.

Reference Data in Hash Tables

The reference data was mostly looked up by a primary key so this fit nicely into in-memory hash tables via fastutil. There were a few larger datasets (e.g. license plate to vehicle mappings for some countries) that I didn’t want to keep in the JVM heap so those were kept in on-disk LMDB tables.
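As a rough sketch of that layering (the names here are hypothetical, and Scala's built-in `mutable.LongMap` stands in for fastutil's `Long2ObjectOpenHashMap`): small, hot datasets get a primitive-keyed heap hash table, while the larger datasets sit behind an on-disk lookup interface.

```scala
import scala.collection.mutable

final case class Vehicle(id: Long, name: String)

// Small reference datasets: primary-key lookups served straight from the
// JVM heap. In the real service this was a fastutil Long2ObjectOpenHashMap;
// scala.collection.mutable.LongMap is a stdlib stand-in with the same shape.
final class HeapReferenceStore(vehicles: Iterable[Vehicle]) {
  private val byId = mutable.LongMap.empty[Vehicle]
  vehicles.foreach(v => byId.update(v.id, v))

  def vehicle(id: Long): Option[Vehicle] = byId.get(id)
}

// Larger datasets (e.g. license-plate-to-vehicle mappings) lived off-heap
// in on-disk LMDB tables; only the lookup contract is sketched here.
trait PlateLookup {
  def vehicleIdsForPlate(plate: String): Seq[Long]
}
```

The point of the split is simply that heap hash tables give the fastest lookups for data that fits comfortably in memory, while the bulky mappings stay out of the JVM heap entirely.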

Supplier Data in Apache Solr

The supplier data consisted of part data and fitment data. I needed to be able to perform complex filtering and sorting on this data so it was indexed into Apache Solr.
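To give a feel for what that looks like, here is a hypothetical sketch of how a part search might translate into Solr query parameters. The field names (`genericArticleId`, `brandNo`, `articleNumber`) are invented for illustration; the real schema differed.

```scala
// Hypothetical search criteria for a part lookup.
final case class PartSearch(
    genericArticleId: Option[Int] = None,
    brandNo: Option[Int] = None,
    sortByArticleNumber: Boolean = false
)

// Translate the criteria into Solr request parameters. Each criterion
// becomes its own filter query ("fq"), which Solr caches independently.
def toSolrParams(s: PartSearch): Map[String, Seq[String]] = {
  val filters = Seq(
    s.genericArticleId.map(id => s"genericArticleId:$id"),
    s.brandNo.map(b => s"brandNo:$b")
  ).flatten
  Map(
    "q"    -> Seq("*:*"),
    "fq"   -> filters,
    "sort" -> (if (s.sortByArticleNumber) Seq("articleNumber asc") else Seq.empty)
  )
}
```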

Business Logic in Scala

The final piece (and probably the largest) was implementing the business logic within the application. The old version of the webservice had almost all of the business logic within the SQL queries. This meant that the same business logic (e.g. country restrictions) was duplicated in many places. When looking through their old Git history I came across instances where the same bug had to be fixed in six different places!

I modeled the business logic within Scala classes and traits. Everything was de-duplicated and had clean APIs to work with. If you needed to apply country restriction logic on Parts or Vehicles you would just ask the Part or Vehicle classes via a common API that used the same underlying logic for both cases. If there was a business logic bug then you only needed to fix it in one place.

Keeping things DRY seemed obvious to me, especially given how complex the business logic was.
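A minimal sketch of the trait-based approach (the names here are hypothetical, not the actual production API): the country-restriction rule is written once, and both `Part` and `Vehicle` pick it up by mixing in the trait.

```scala
// Country-restriction logic implemented exactly once.
trait CountryRestricted {
  // ISO country codes this item may be shown in; empty means "no restriction".
  def allowedCountries: Set[String]

  def visibleIn(country: String): Boolean =
    allowedCountries.isEmpty || allowedCountries.contains(country)
}

final case class Part(id: Long, allowedCountries: Set[String]) extends CountryRestricted
final case class Vehicle(id: Long, allowedCountries: Set[String]) extends CountryRestricted

// One filter works for any country-restricted type, so a bug fix in
// visibleIn fixes the behavior for Parts and Vehicles alike.
def restrictToCountry[A <: CountryRestricted](items: Seq[A], country: String): Seq[A] =
  items.filter(_.visibleIn(country))
```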

Architecture

The original architecture of the new TecDoc Web Service was very simple. In our primary AWS region we had:

  1. MySQL RDS Master – Writes to MySQL happened via an admin website running in this region.
  2. Apache Solr Master – Indexing happened on this server and was then replicated out to each region.
  3. S3 Bucket For Images – All of our product images were stored in S3 and served via CloudFront.

The MySQL database held customer configuration info, which was very low traffic and easily cacheable. Apache Solr contained all of our Part and Fitment data. Each region had read replicas of both MySQL and Apache Solr that were used to serve production traffic.

The individual regions had:

  1. MySQL Read Replica
  2. EC2 Servers running:
    • HAProxy – SSL Termination (originally using Let’s Encrypt certificates).
    • Scala APP – The actual SOAP/JSON Web Service.
    • Apache Solr – Read replica of data from the global master.

The setup looked something like:

Initial New TecDoc Web Service Architecture

Initially we deployed into three AWS regions and each region only needed two EC2 servers to handle all of the load. But we could easily scale up by adding more EC2 servers and/or more regions.

Launch and Customer Migration

We had our first customers using the new version of the web service within a few months. These were mostly new customers outside of Europe who found the old service too slow to access. Over time, more and more customers switched to the newer version of the service to enjoy the higher performance and availability.

It took ~2 years to fully migrate everyone over to the new version of the service. The stragglers to migrate usually fell into two categories:

  1. There was some edge case that wasn’t yet implemented in the new version of the service. Sometimes this was buggy behavior of the old service that they relied upon.
  2. If it ain’t broke, don’t fix it. These customers either ignored our requests to switch to the new service or just didn’t want to change anything, since the old service worked well enough for them.

It took a while to implement everything for customers that fell into the first bucket but we eventually did. For the customers in the second bucket we eventually forced them to migrate to the new service using a combination of DNS and HAProxy.

Forced Migration

The old and new versions of the webservice used different hostnames which let customers opt-in to the new version of the service. At a high level the setup looked like this:

Pegasus New and Old Routing

Eventually we needed to switch all customers over to the new version of the service. Ideally we wanted to “force migrate” select customers to the new version of the service while letting others continue to use the old version.

I came up with a plan to update DNS to route all traffic to the new version of the service and then use HAProxy to perform additional logic for customers using the old hostname. The customer information was embedded in the body of the request, so I had to use HAProxy content inspection to look at the request body. Once I knew who the customer was, HAProxy would either:

  1. Send the traffic to the new service if the customer was on the “forced migrate” list
  2. Proxy the traffic back to the old version of the service.

It looked something like this:

Pegasus Forced Migration

The content inspection to parse out the customer information was based on Regular Expressions. And you know what they say…

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

And of course it took some work to craft regular expressions that worked with both our SOAP and JSON requests. But eventually it did work, and we were able to slowly force migrate all customers to the new version of the service over time.
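The flavor of the problem can be sketched in Scala (the patterns and element names here are invented for illustration; the real SOAP/JSON payloads, and the actual HAProxy ACL syntax, were more involved): extract a customer ID from either body format, then route based on a forced-migration list.

```scala
// Regexes for pulling a hypothetical "provider" customer ID out of a
// SOAP body or a JSON body. (?s) lets '.' match across newlines.
val SoapCustomer = """(?s).*<provider>\s*(\d+)\s*</provider>.*""".r
val JsonCustomer = """(?s).*"provider"\s*:\s*(\d+).*""".r

def customerId(body: String): Option[String] = body match {
  case SoapCustomer(id) => Some(id)
  case JsonCustomer(id) => Some(id)
  case _                => None
}

// Routing decision mirroring the HAProxy logic: customers on the
// forced-migration list go to the new service; everyone else is
// proxied back to the old one.
def backendFor(body: String, forced: Set[String]): String =
  customerId(body) match {
    case Some(id) if forced(id) => "new-service"
    case _                      => "old-service"
  }
```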

Further Work

There were many other services and products I worked on during my time at TecAlliance. Over time, I will document them and add stories, but for now I will just outline some of them:

  • Enhanced TecDoc Web Service Methods – I added a few web service methods (e.g. getArticles and getLinkageTargets) that replaced the functionality of several dozen other methods with a simpler API.
  • OptiCat Web Service – Similar to the TecDoc Web Service but providing North American data which uses the Auto Care standard.
  • OptiCat OnLine 2.0 – An upgrade to the OptiCat OnLine Catalog Website.
  • Truckissimo – Custom e-commerce catalog solution for A.D. France. Another multi-year project.
  • “Global” Web Service – Web Service unifying the TecDoc and AutoCare data into a single web service. For markets like Mexico where part manufacturers use both the TecDoc and AutoCare standards.
  • Web Catalog 3.0 Web Service – Backend Web Service supporting the TecAlliance Web Catalog.
  • Web Catalog 3.0 Admin Web Service – Backend Web Service providing administration APIs for managing TecAlliance Web Catalog configuration.
  • Endless performance improvements
    • Eliminating CPU hot spots
    • Reducing JVM Heap memory allocations
    • Using FlatBuffers, LMDB and JVM Direct Byte Buffers to eliminate JVM Heap allocations when dealing with fitments for a part.
  • Infrastructure as Code – I led the effort to switch as much as possible over to using IaC via the CDK.
  • Cloud Native – I also led the effort to switch over to a Cloud Native architecture including:
    • Using ECS for the Web Service and Apache Solr to provide auto scaling and easier deployments.
    • Observability via X-Ray, CloudWatch, CloudWatch Metrics, OpenSearch, etc.
  • DynamoDB – DynamoDB Global Tables were used for customer data that needed to be updated quickly within a single region and eventually replicated across the rest of the regions.
  • Global Accelerator – Using Global Accelerator to avoid problems with Route 53 Latency Based Routing.
  • Instant Data Processing for TecDoc – Near real-time load and updating of supplier data for TecDoc.
  • Instant Data Processing for AutoCare – A demo built for AAPEX/SEMA.
  • Scala 2.12 to 2.13/3.X Upgrade – A massive undertaking to upgrade the entire code base.
  • …and many more…

Some Misc Images

Truckissimo Home Page
Home Page of Truckissimo
TecDoc Webservice
Opticat Webservice
TecAlliance booth at AAPEX 2017
TecAlliance Booth at AAPEX 2017 with a demo of the original TecDoc Web Catalog 3.0

Leaving TecAlliance

After almost 10 years doing work for TecAlliance, it was time to move on. I had outgrown the position. There wasn’t really anywhere else to go within the company and it didn’t help that I was located in the Seattle, WA area and nearly everybody else was in Europe.

So I made the decision to leave and take some time off of work. Of course, I got bored after about a week and started learning the Rust programming language. More to come…